This repository was archived by the owner on Aug 26, 2018. It is now read-only.

lepisma/zclf

zlib classifier

This is a toy classifier (not expected to perform well) based on compression. Specifically, the nearness of a data instance to a class is calculated by seeing how well the instance compresses together with the training instances of that class.
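To make the idea concrete, here is a minimal sketch of one way such a compression-based nearness score could work. It is an assumption for illustration, not necessarily how `zclf` computes its score: the helper names `compressed_size`, `extra_bytes` and `nearest_class` are hypothetical, and the score used here is the number of extra compressed bytes an instance costs when appended to a class's data.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of `data`."""
    return len(zlib.compress(data))


def extra_bytes(class_blob: bytes, instance: bytes) -> int:
    """Extra compressed bytes needed once `instance` is appended to `class_blob`.

    A small value means the instance shares patterns with the class data.
    """
    return compressed_size(class_blob + instance) - compressed_size(class_blob)


def nearest_class(class_blobs: dict, instance: bytes):
    """Return the label whose data absorbs `instance` most cheaply (hypothetical helper)."""
    return min(class_blobs, key=lambda label: extra_bytes(class_blobs[label], instance))


# Two made-up "classes" with very different byte patterns.
blobs = {
    "letters": b"abc" * 60,
    "digits": b"0123456789" * 20,
}
print(nearest_class(blobs, b"abc" * 5))
```

The instance made of repeated `abc` costs almost nothing extra when appended to the `letters` data, so that class is chosen.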

Let's begin with a few basic imports:

from zclf import ZlibClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_digits

To work with any dataset, we need a mechanism for encoding each instance (the X values) as a string of bytes that the compressor can consume. To keep the classifier as black-box as possible, let's just encode each x using its string representation.

def encoder(x):
    return str(x).encode("utf-8")

Classification goes like this:

digits = load_digits()  # load the dataset once instead of twice
X, y = digits["data"], digits["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

clf = ZlibClassifier(encoder)
accuracy_score(clf.fit(X_train, y_train).predict(X_test), y_test)
0.436026936027

The internal representation of each class is a concatenated string (bytes) of all the training instances belonging to that class. For a single instance, the input is a vector like:

X_train[0]
array([  0.,   0.,   0.,   0.,   5.,  15.,   8.,   0.,   0.,   0.,   0.,
         2.,  15.,  16.,   9.,   0.,   0.,   0.,   3.,  15.,  16.,  16.,
        10.,   0.,   0.,   7.,  16.,  10.,   8.,  16.,   7.,   0.,   0.,
         0.,   1.,   0.,   8.,  16.,   4.,   0.,   0.,   0.,   0.,   0.,
        11.,  16.,   1.,   0.,   0.,   0.,   0.,   0.,   9.,  16.,   1.,
         0.,   0.,   0.,   0.,   0.,   8.,  14.,   0.,   0.])

Its encoding (first 100 bytes) is:

encoder(X_train[0])[:100]
b'[  0.   0.   0.   0.   5.  15.   8.   0.   0.   0.   0.   2.  15.  16.   9.\n   0.   0.   0.   3.  15'

The compressed representation (first 20 bytes) looks something like:

import zlib
zlib.compress(encoder(X_train[0]))[:20]
b'x\x9c\x8bVP0\xd0S\xc0 L\x81\x84!\x88P\xb0\xc0"\x0b'
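The per-class representation described above (all encoded training instances of a class concatenated into one byte string) can be sketched as follows. The function name `build_class_blobs` is hypothetical and the classifier's real internals may differ; plain Python lists stand in for the digit vectors so the example runs without NumPy.

```python
from collections import defaultdict


def encoder(x):
    # Same black-box encoding as in the README: the string form of the instance.
    return str(x).encode("utf-8")


def build_class_blobs(X, y, encode=encoder):
    """Concatenate the encoded instances of each class into one bytes object."""
    chunks = defaultdict(list)
    for x, label in zip(X, y):
        chunks[label].append(encode(x))
    return {label: b"".join(parts) for label, parts in chunks.items()}


# Tiny made-up dataset: two instances of class 0, one of class 1.
X_demo = [[1, 2], [3, 4], [5, 6]]
y_demo = [0, 1, 0]
print(build_class_blobs(X_demo, y_demo))
# {0: b'[1, 2][5, 6]', 1: b'[3, 4]'}
```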

Let's try changing the encoding to one that groups the numbers. Floor division by 10 maps each pixel intensity (0 to 16) to either 0 or 1:

clf = ZlibClassifier(lambda x: str(x // 10).encode("utf-8"))
accuracy_score(clf.fit(X_train, y_train).predict(X_test), y_test)
0.638047138047

Neat. Let's now make the numbers look like a string of repetitions so that 4 is closer to 5:

clf = ZlibClassifier(lambda x: ".".join(["a" * int(i // 10) for i in x]).encode("utf-8"))
accuracy_score(clf.fit(X_train, y_train).predict(X_test), y_test)
0.338383838384

That doesn't work as well: accuracy drops below even the plain string encoding.
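To see what the repetition encoding actually feeds the compressor, the sketch below applies it to a short made-up row of pixel values. The `divisor` parameter is a hypothetical addition for illustration (the README's lambda hard-codes 10), and no accuracy is claimed for the `divisor=1` variant.

```python
def repeat_encoder(x, divisor=10):
    """Encode each value as a run of 'a' characters, with runs separated by '.'.

    `divisor` is a hypothetical knob; divisor=10 matches the lambda above.
    """
    return ".".join("a" * int(v // divisor) for v in x).encode("utf-8")


row = [0.0, 4.0, 5.0, 15.0, 16.0]  # made-up pixel intensities in the 0-16 range

# With divisor=10 every value becomes either an empty run or a single 'a',
# so 0, 4 and 5 are indistinguishable.
print(repeat_encoder(row))             # b'...a.a'

# With divisor=1 the run length equals the value, so 4 and 5 differ by one 'a'.
print(repeat_encoder(row, divisor=1))
```

With a divisor of 10 the encoding carries the same two-bucket information as the grouped encoding, just in a different textual form.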
