An experiment in featurizing images.
Python
README.md
models.py
requirements.txt
train_digits.py
train_digits_rows.py
visualize.py


Image Classification with Regex-based Features

I wrote a blog post on this project.

Experiment 1

Source: train_digits.py.

I used regular expressions to featurize images of handwritten digits. To find good regexes, I used an evolutionary algorithm implemented using DEAP.
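As a rough illustration of the idea (not the code in train_digits.py or models.py), the sketch below flattens a binarized image into a string of 0s and 1s, uses regex match counts as features, and searches for a good regex with a tiny hand-rolled evolutionary loop. It uses plain Python instead of DEAP, and every name in it (`image_to_string`, `featurize`, `fitness`, `evolve`, the three-token vocabulary, the two-class fitness) is a hypothetical stand-in chosen for brevity.

```python
import random
import re

# Assumption: a tiny regex vocabulary; the real search space may be richer.
TOKENS = ["0", "1", "."]


def image_to_string(image, threshold=0.5):
    """Flatten a 2-D list of pixel intensities into a string of '0'/'1'."""
    return "".join("1" if px > threshold else "0" for row in image for px in row)


def featurize(image, patterns):
    """One feature per pattern: the number of non-overlapping matches."""
    text = image_to_string(image)
    return [len(re.findall(p, text)) for p in patterns]


def _mean(values):
    return sum(values) / len(values)


def fitness(pattern, images, labels):
    """Hypothetical fitness: gap between mean match counts of classes 0 and 1.

    Assumes a two-class toy problem with at least one image per class.
    """
    counts = {0: [], 1: []}
    for image, label in zip(images, labels):
        counts[label].append(featurize(image, [pattern])[0])
    return abs(_mean(counts[0]) - _mean(counts[1]))


def evolve(images, labels, size=3, population=20, generations=15, seed=0):
    """Keep the fitter half each generation and refill it with mutated copies."""
    rng = random.Random(seed)
    pop = ["".join(rng.choice(TOKENS) for _ in range(size)) for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=lambda p: fitness(p, images, labels), reverse=True)
        parents = pop[: population // 2]
        children = []
        for parent in parents:
            i = rng.randrange(size)  # mutate one randomly chosen token
            children.append(parent[:i] + rng.choice(TOKENS) + parent[i + 1:])
        pop = parents + children
    return max(pop, key=lambda p: fitness(p, images, labels))


# Toy data: all-dark images are class 0, all-bright images are class 1.
toy_images = [[[0] * 4] * 4, [[1] * 4] * 4]
toy_labels = [0, 1]
best = evolve(toy_images, toy_labels)
```

For example, `featurize([[0, 1], [1, 0]], ["1", "11", "0.1"])` returns `[2, 1, 1]`, since the image flattens to the string 0110.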

To evaluate my features, I trained an SVM on the featurized dataset. I was ultimately able to reach ~64% classification accuracy using regex-based features, which is well above the roughly 10% expected from random guessing over ten digit classes. That said, it's not great: training an SVM on the raw image pixels gives ~95% accuracy.

Experiment 2

Source: train_digits_rows.py.

I divided each image into rows and repeated Experiment 1, training one model per row. I then featurized each image by featurizing its rows and concatenating the row features. This gave me ~88.7% classification accuracy.
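A minimal sketch of the row-wise featurization described above, assuming each row already has its own list of regexes. The function name `featurize_rows` and the `row_patterns` layout are hypothetical and are not taken from train_digits_rows.py.

```python
import re


def featurize_rows(image, row_patterns):
    """Concatenate per-row regex match counts into one feature vector.

    `image` is a 2-D list of 0/1 pixels; `row_patterns[i]` holds the
    regexes assigned to row i (an assumed layout for this sketch).
    """
    features = []
    for row, patterns in zip(image, row_patterns):
        text = "".join(str(px) for px in row)
        features.extend(len(re.findall(p, text)) for p in patterns)
    return features


image = [[1, 1, 1, 1],
         [0, 0, 0, 1]]
row_patterns = [["101", "11"],   # regexes for row 0
                ["000", "1"]]    # regexes for row 1
row_features = featurize_rows(image, row_patterns)  # [0, 2, 1, 1]
```

The resulting vector has one entry per (row, regex) pair, so its length grows with both the number of rows and the number of regexes per row.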

What irks me is that I achieved the best performance with a regex size of only 3. Most of the top-fitness regexes were simple literal strings like 101, 111, 000, etc. This actually makes a lot of sense, but it kind of defeats the purpose of using regexes. Or maybe it shows that regexes are simply unnecessary for this particular task.
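One way to read this observation: a regex made only of literal characters carries no more information than a plain substring count. The sketch below checks that on a made-up row string for all eight 3-bit literals. It assumes features are non-overlapping match counts, which may differ from how the project actually scores matches.

```python
import re
from itertools import product


def regex_count(pattern, text):
    """Number of non-overlapping regex matches."""
    return len(re.findall(pattern, text))


def substring_count(pattern, text):
    """Number of non-overlapping substring occurrences, with no regex engine."""
    return text.count(pattern)


row = "0011101010"  # made-up binarized row
literals = ["".join(bits) for bits in product("01", repeat=3)]
agree = all(regex_count(p, row) == substring_count(p, row) for p in literals)
```

Here `agree` is `True`, so for literal patterns like 101 the regex machinery could be replaced by `str.count`.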

Instructions

First, install the required libraries: pip install -r requirements.txt.

models.py contains the main model class, EvolutionaryRegexFeaturizer. See train_digits.py and train_digits_rows.py for example usage.