This is a toy classifier (not expected to perform well) based on compression. Specifically, the nearness of a data instance to a class is calculated by seeing how well the instance compresses together with the training instances of that class.
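To make the idea concrete, here is a minimal sketch of one way to score nearness with zlib. The helper name `extra_bytes` and the toy corpora are illustrative assumptions, not part of `zclf`, whose actual scoring may differ.

```python
import zlib


def extra_bytes(corpus: bytes, instance: bytes) -> int:
    """Growth in compressed size when `instance` is appended to `corpus`."""
    return len(zlib.compress(corpus + instance)) - len(zlib.compress(corpus))


# Toy class corpora, made up for illustration.
corpus_a = b"the quick brown fox jumps over the lazy dog. " * 20
corpus_b = b"0123456789 9876543210 1357924680 2468013579. " * 20

instance = b"the quick brown fox jumps over the lazy dog. "
cost_a = extra_bytes(corpus_a, instance)
cost_b = extra_bytes(corpus_b, instance)

# The instance repeats material already in corpus_a, so it is cheaper to add there.
print(cost_a, cost_b)
```

A smaller cost means the instance shares more structure with the corpus, which is what "nearness" refers to here.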
Let's begin with a few basic imports:
from zclf import ZlibClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_digits

For working with any dataset, we need a mechanism for encoding each
instance (the X values) as a string of bytes so that the compressor can
consume it. To keep the classifier as black-box as possible, let's just encode
each x using its string representation.
def encoder(x):
    return str(x).encode("utf-8")

Classification goes like this:
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
clf = ZlibClassifier(encoder)
accuracy_score(y_test, clf.fit(X_train, y_train).predict(X_test))
0.436026936027
The internal representation of each class is a concatenated string (bytes) of all the training instances belonging to that class. For a single instance, the input is a vector like:
X_train[0]
array([ 0., 0., 0., 0., 5., 15., 8., 0., 0., 0., 0.,
2., 15., 16., 9., 0., 0., 0., 3., 15., 16., 16.,
10., 0., 0., 7., 16., 10., 8., 16., 7., 0., 0.,
0., 1., 0., 8., 16., 4., 0., 0., 0., 0., 0.,
11., 16., 1., 0., 0., 0., 0., 0., 9., 16., 1.,
0., 0., 0., 0., 0., 8., 14., 0., 0.])
Its encoding (first 100 bytes) is:
encoder(X_train[0])[:100]
b'[ 0. 0. 0. 0. 5. 15. 8. 0. 0. 0. 0. 2. 15. 16. 9.\n 0. 0. 0. 3. 15'
The compressed representation (first 20 bytes) looks something like
import zlib
zlib.compress(encoder(X_train[0]))[:20]
b'x\x9c\x8bVP0\xd0S\xc0 L\x81\x84!\x88P\xb0\xc0"\x0b'
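The per-class representation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `TinyCompressionClassifier`, the scoring rule (growth in compressed size when an instance is appended to a class string), and the toy training strings are all hypothetical, and `zclf` may work differently.

```python
import zlib
from collections import defaultdict


class TinyCompressionClassifier:
    """Hypothetical stand-in for illustration; not the zclf implementation."""

    def __init__(self, encoder):
        self.encoder = encoder
        self.corpora = {}     # label -> concatenated bytes of its training instances
        self.base_sizes = {}  # label -> compressed size of that concatenation

    def fit(self, X, y):
        chunks = defaultdict(list)
        for x, label in zip(X, y):
            chunks[label].append(self.encoder(x))
        self.corpora = {label: b"".join(parts) for label, parts in chunks.items()}
        self.base_sizes = {
            label: len(zlib.compress(corpus)) for label, corpus in self.corpora.items()
        }
        return self

    def _cost(self, label, encoded):
        # Assumed scoring rule: extra compressed bytes caused by the new instance.
        joint = len(zlib.compress(self.corpora[label] + encoded))
        return joint - self.base_sizes[label]

    def predict(self, X):
        predictions = []
        for x in X:
            encoded = self.encoder(x)
            best = min(self.corpora, key=lambda label: self._cost(label, encoded))
            predictions.append(best)
        return predictions


toy_X = [
    "the cat sat on the mat and the cat sat on the hat. ",
    "a dog ran in the fog and a frog sat on the log. ",
    "31415926535897932384626433832795028841971693993751 ",
    "27182818284590452353602874713526624977572470936999 ",
]
toy_y = ["letters", "letters", "digits", "digits"]

toy_clf = TinyCompressionClassifier(lambda s: s.encode("utf-8")).fit(toy_X, toy_y)
predicted = toy_clf.predict(["the cat sat on the mat", "2653589793238462643"])
print(predicted)
```

Recompressing a whole class string for every prediction is slow, which is acceptable for a toy but would not scale to large training sets.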
Let's try changing the encoding to one that groups numbers into buckets of ten:
clf = ZlibClassifier(lambda x: str(x // 10).encode("utf-8"))
accuracy_score(y_test, clf.fit(X_train, y_train).predict(X_test))
0.638047138047
Neat. Let's now make the numbers look like a string of repetitions so that 4 is closer to 5.
clf = ZlibClassifier(lambda x: ".".join(["a" * int(i // 10) for i in x]).encode("utf-8"))
accuracy_score(y_test, clf.fit(X_train, y_train).predict(X_test))
0.338383838384
Doesn’t work that well.
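It can help to look at what the repetition encoding actually produces. Pixel values in the digits dataset range from 0 to 16, so after `// 10` every value becomes either zero or one `a`, which makes 4 and 5 identical rather than merely close. The sketch below compares that encoding with a hypothetical variant, `unary_runs`, that skips the bucketing; both function names are made up for illustration, and whether the variant improves accuracy is not tested here.

```python
def bucketed_runs(x):
    # Same encoding as the lambda above: values are bucketed by 10 first.
    return ".".join(["a" * int(i // 10) for i in x]).encode("utf-8")


def unary_runs(x):
    # Hypothetical variant: one "a" per unit of intensity, no bucketing.
    return ".".join(["a" * int(i) for i in x]).encode("utf-8")


row = [0.0, 4.0, 5.0, 12.0]
print(bucketed_runs(row))  # b'...a'
print(unary_runs(row))     # b'.aaaa.aaaaa.aaaaaaaaaaaa'
```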