zlib classifier

This is a toy classifier based on compression; it is not expected to perform well. The nearness of a data instance to a class is estimated by how well the instance compresses together with the training instances of that class.
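
As a rough illustration of that idea, the sketch below measures how many extra compressed bytes an instance costs when it is appended to a class's data. The helper name `extra_bytes` and the two small corpora are made up for this example and are not taken from zclf.py.

```python
import zlib


def extra_bytes(corpus: bytes, instance: bytes) -> int:
    """Growth of the compressed corpus when the instance is appended to it."""
    return len(zlib.compress(corpus + instance)) - len(zlib.compress(corpus))


# Two hypothetical class corpora with very different content.
letters = b"abcabcabcabc" * 20
digits = b"0123456789" * 20

sample = b"abcabcabc"

# A smaller number means the sample "fits" that corpus better.
print("letters:", extra_bytes(letters, sample))
print("digits: ", extra_bytes(digits, sample))
```

The sample repeats patterns already present in the first corpus, so it should add fewer bytes there than to the second one.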

Let's begin with a few basic imports:

from zclf import ZlibClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_digits

To work with any dataset, we need a way to encode each instance (the X values) as a string of bytes that the compressor can consume. To keep the classifier as black-box as possible, let's just encode each x using its string representation.

def encoder(x):
    return str(x).encode("utf-8")

Classification goes like this:

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

clf = ZlibClassifier(encoder)
accuracy_score(y_test, clf.fit(X_train, y_train).predict(X_test))

0.436026936027

Internally, each class is represented by a single byte string: the concatenated encodings of all the training instances belonging to that class. A single input instance is a vector like:

X_train[0]

array([  0.,   0.,   0.,   0.,   5.,  15.,   8.,   0.,   0.,   0.,   0.,
         2.,  15.,  16.,   9.,   0.,   0.,   0.,   3.,  15.,  16.,  16.,
        10.,   0.,   0.,   7.,  16.,  10.,   8.,  16.,   7.,   0.,   0.,
         0.,   1.,   0.,   8.,  16.,   4.,   0.,   0.,   0.,   0.,   0.,
        11.,  16.,   1.,   0.,   0.,   0.,   0.,   0.,   9.,  16.,   1.,
         0.,   0.,   0.,   0.,   0.,   8.,  14.,   0.,   0.])

Its encoding (first 100 bytes) is:

encoder(X_train[0])[:100]

b'[  0.   0.   0.   0.   5.  15.   8.   0.   0.   0.   0.   2.  15.  16.   9.\n   0.   0.   0.   3.  15'

The compressed representation (first 20 bytes) looks something like this:

import zlib
zlib.compress(encoder(X_train[0]))[:20]

b'x\x9c\x8bVP0\xd0S\xc0 L\x81\x84!\x88P\xb0\xc0"\x0b'
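
To make the per-class concatenation described above concrete, here is a minimal sketch of how such a classifier could be put together. `TinyCompressionClassifier` is a hypothetical stand-in written for this example; it is not the code in zclf.py, and the scoring rule (pick the class whose compressed size grows the least) is an assumption.

```python
import zlib
from collections import defaultdict


class TinyCompressionClassifier:
    """Hypothetical stand-in for ZlibClassifier, for illustration only."""

    def __init__(self, encoder):
        self.encoder = encoder
        self.corpora = {}    # label -> concatenated encoded training instances
        self.baselines = {}  # label -> compressed size of that concatenation

    def fit(self, X, y):
        chunks = defaultdict(list)
        for x, label in zip(X, y):
            chunks[label].append(self.encoder(x))
        self.corpora = {label: b"".join(parts) for label, parts in chunks.items()}
        self.baselines = {
            label: len(zlib.compress(corpus)) for label, corpus in self.corpora.items()
        }
        return self

    def _cost(self, label, encoded):
        # Extra compressed bytes needed once the instance joins this class.
        return len(zlib.compress(self.corpora[label] + encoded)) - self.baselines[label]

    def predict(self, X):
        predictions = []
        for x in X:
            encoded = self.encoder(x)
            best = min(self.corpora, key=lambda label: self._cost(label, encoded))
            predictions.append(best)
        return predictions


def encoder(x):
    return str(x).encode("utf-8")


# Tiny made-up dataset, just to exercise fit and predict.
clf = TinyCompressionClassifier(encoder).fit(
    ["aaaa bbbb aaaa", "bbbb aaaa bbbb", "1234 5678 1234", "5678 1234 5678"],
    ["letters", "letters", "numbers", "numbers"],
)
print(clf.predict(["aaaa bbbb", "1234 5678"]))
```

Caching the baseline sizes in `fit` avoids recompressing every class corpus on its own for each prediction, though each prediction still compresses one concatenation per class.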

Let's try changing the encoding to one that groups the numbers into buckets:

clf = ZlibClassifier(lambda x: str(x // 10).encode("utf-8"))
accuracy_score(y_test, clf.fit(X_train, y_train).predict(X_test))

0.638047138047
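
It may help to see what this grouping does to the values. Pixel intensities in the digits data range from 0 to 16, so floor division by 10 leaves only two distinct symbols. The sketch below uses plain Python lists rather than NumPy arrays, and the sample row is made up.

```python
# Every possible pixel intensity in the digits data (0 to 16 inclusive).
levels = list(range(17))

# Floor division by 10 collapses the 17 levels into two buckets.
buckets = sorted({v // 10 for v in levels})
print(buckets)  # [0, 1]

# A made-up fragment of a row, before and after bucketing.
row = [0.0, 0.0, 5.0, 15.0, 8.0, 0.0, 2.0, 16.0]
bucketed = [v // 10 for v in row]
print(bucketed)  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
```

With so few distinct symbols, instances that share the same rough shape produce much more similar byte strings.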

Neat. Let's now make the numbers look like strings of repetitions, so that 4 is closer to 5:

clf = ZlibClassifier(lambda x: ".".join(["a" * int(i // 10) for i in x]).encode("utf-8"))
accuracy_score(y_test, clf.fit(X_train, y_train).predict(X_test))

0.338383838384

That doesn't work so well.
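
Looking at the bytes this encoder produces is instructive. In the sketch below, `repeat_encoder` mirrors the lambda above, with the divisor exposed as a hypothetical `step` parameter that is not part of the original code; the sample row is made up.

```python
def repeat_encoder(x, step=10):
    """Encode each value as a run of 'a's, with runs separated by dots."""
    return ".".join("a" * int(v // step) for v in x).encode("utf-8")


row = [0.0, 4.0, 5.0, 15.0, 16.0, 9.0]

# With step=10, values 0 to 9 become an empty run and 10 to 16 a single "a".
print(repeat_encoder(row))          # b'...a.a.'

# A smaller step (an untested variation) keeps more of the magnitude.
print(repeat_encoder(row, step=4))  # b'.a.a.aaa.aaaa.aa'
```

With the original divisor, 4 and 5 are not merely close but identical, and most of each encoded row is just a run of dots.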
