In this notebook I aim to try out some different machine learning techniques on the MNIST data set. I will not spend much time on feature engineering, but rather on exploring scikit-learn.

# Starting point
I have already tried out some basic feature engineering in another notebook. I noticed that binarizing all features, mapping the range [0, 255] to the two values 0 and 1, had a positive effect when classifying with a Random Forest. I have also read about some tricks that are useful when working on the MNIST data set, which I will try to make use of:
* The data set can be extended by shifting digits in different directions while preserving their shape
* Nearest Neighbours is supposed to be more effective than Random Forest. As my knowledge of Random Forest is very limited I cannot tell why, but I know Nearest Neighbours can handle non-linear problems, which I strongly suspect is necessary when classifying the MNIST set. I would like to try out Nearest Neighbours and fine-tune hyperparameters using scikit-learn's Grid Search.
* Similar to the first point, I am considering extending the data by rotating digits by small angles.
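
The rotation idea in the last bullet could be sketched roughly as follows, by analogy with shifting. This is an illustration only: the helper name `rotate_digit`, the toy "digit" and the angles of -10 and 10 degrees are my own assumptions, and nothing like this is actually run later in the notebook.

```python
import numpy as np
from scipy.ndimage import rotate

def rotate_digit(digit, angle):
    """Rotate one flattened 28x28 digit by `angle` degrees (hypothetical helper)."""
    image = digit.reshape(28, 28)
    # reshape=False keeps the 28x28 frame, order=1 is bilinear interpolation,
    # and pixels rotated in from outside the frame are filled with 0
    rotated = rotate(image, angle, reshape=False, order=1, mode='constant', cval=0)
    return rotated.flatten()

# Toy "digit": a vertical bar in the middle of the image (made-up data)
digit = np.zeros((28, 28))
digit[6:22, 13:15] = 255
digit = digit.flatten()

# One rotated copy per angle, stacked as rows like the original data
augmented = np.stack([rotate_digit(digit, angle) for angle in (-10, 10)])
print(augmented.shape)
```

As with shifting, each rotated copy keeps the label of the digit it was made from.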

In [1]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
train_df = pd.read_csv('train.csv')
digits = train_df.iloc[:, 1:].values
labels = train_df['label'].values

# KNN on Binarized features
Let's see what results we get with the `KNeighborsClassifier` straight out of the box.

In [3]:
from sklearn.base import BaseEstimator, TransformerMixin

class DigitBinarizer(BaseEstimator, TransformerMixin):
    """Map pixel intensities to 0/1: pixels >= cutoff become 1, the rest 0."""
    def __init__(self, cutoff=1):
        self.cutoff = cutoff

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn from the data
        return self

    def transform(self, X, y=None):
        # Threshold the whole array at once instead of row by row
        return (np.asarray(X) >= self.cutoff).astype(int)

In [7]:
X = DigitBinarizer().transform(digits)
X.shape

In [9]:
from sklearn.neighbors import KNeighborsClassifier

In [57]:
28*28

784

My hopes of using GridSearch to find the optimal number of neighbours were thwarted by the impossibly slow performance of my laptop on the large data set. The complexity of KNN classification using brute-force distance measurements is `O(N*D*Q)`, where `N` (40 000) is the number of samples in the training data, `D` the number of features (784), and `Q` the number of queries (28 000). By the competitive programming rule of thumb of about 10^8 operations per second, the worst case would run in about 2 hours and 30 minutes... I'm glad I seem to land far from the worst case, but classification is just way too slow for me to want to tinker with hyperparameters, even with the help of GridSearch.
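
The back-of-the-envelope estimate can be reproduced in a few lines. The sizes are the ones quoted in the paragraph above; the throughput of 10^8 operations per second is only the rule of thumb, not a measurement of my laptop.

```python
N = 40_000            # training samples, as quoted above
D = 784               # features (28 * 28 pixels)
Q = 28_000            # test queries
OPS_PER_SECOND = 1e8  # rule-of-thumb throughput (an assumption)

operations = N * D * Q
seconds = operations / OPS_PER_SECOND
hours, remainder = divmod(seconds, 3600)
print(f"{operations:.2e} operations, about {int(hours)} h {int(remainder // 60)} min")
```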

In [10]:
knn = KNeighborsClassifier(algorithm="brute", n_neighbors=3)
knn.fit(X, labels)
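
For reference, one way the grid search could be made affordable is to run it on a small subsample only. The sketch below is not something I ran on the Kaggle data: it uses scikit-learn's bundled 8x8 digits as stand-in data, and the subsample size, the candidate values for `n_neighbors` and the number of folds are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: scikit-learn's bundled 8x8 digits, NOT the Kaggle MNIST files
X_small, y_small = load_digits(return_X_y=True)
X_small, y_small = X_small[:500], y_small[:500]

# Try a handful of neighbour counts with 3-fold cross-validation
search = GridSearchCV(
    KNeighborsClassifier(algorithm="brute"),
    param_grid={"n_neighbors": [1, 3, 5, 7]},
    cv=3,
)
search.fit(X_small, y_small)
print(search.best_params_, round(search.best_score_, 3))
```

Whether the best `n_neighbors` found on a subsample carries over to the full training set is an open question, since the density of neighbours changes with the amount of data.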

In [12]:
test = pd.read_csv('test.csv').values
test = DigitBinarizer().transform(test)

While experimenting I noticed that my laptop grinds to a halt when making predictions on the whole test set at once, so I predict in batches of 1000.

In [32]:
%%time
predictions = [knn.predict(test[i*1000:(i+1)*1000]) for i in range(len(test) // 1000)]
%time

Wall time: 0 ns
Wall time: 1min 27s
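
The list comprehension above relies on the test set size being a multiple of 1000, because `len(test) // 1000` would silently drop a trailing partial batch otherwise. A minimal sketch of a more general helper is shown here; `predict_in_batches` is a hypothetical name, and the toy model and data are made up for the illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def predict_in_batches(model, X, batch_size=1000):
    """Predict in chunks of at most batch_size rows, including a final partial chunk."""
    parts = [model.predict(X[start:start + batch_size])
             for start in range(0, len(X), batch_size)]
    return np.concatenate(parts)

# Toy data: two well-separated clusters labelled 0 and 1
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.1, (50, 4)), rng.normal(5, 0.1, (50, 4))])
y_toy = np.array([0] * 50 + [1] * 50)
toy_knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# 13 query rows, deliberately not a multiple of the batch size
queries = np.vstack([np.zeros((7, 4)), np.full((6, 4), 5.0)])
batched = predict_in_batches(toy_knn, queries, batch_size=5)
print(batched.shape)
```

Because the helper returns a single flat array, no separate flattening step is needed afterwards.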


In [33]:
predictions_flat = np.array(predictions).flatten()

In [34]:
submission_df = pd.DataFrame(list(zip(np.arange(1, 28001), predictions_flat)), columns = ['ImageID', 'Label'])

In [35]:
submission_df.set_index('ImageID').to_csv('submission2.csv')

This submission to Kaggle got an accuracy of 96.3%.

# Extended training set
Let's see what accuracy we get if we extend the training set with images shifted 1 px in each of four directions (up, down, left and right). Together with the originals, this makes the set 5x larger.

In [4]:
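from scipy.ndimage import shift  # the scipy.ndimage.interpolation namespace is deprecated

class DigitShifter(BaseEstimator, TransformerMixin):
    """Append one shifted copy of the data set per (delta_x, delta_y) in `directions`."""
    def __init__(self, directions=((1, 0), (-1, 0), (0, 1), (0, -1))):
        # A tuple default avoids sharing one mutable list between instances
        self.directions = directions

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Output order: the originals first, then one full shifted copy per direction
        X_shifts = [np.apply_along_axis(self.shift_digit, 1, X, *direction)
                    for direction in self.directions]
        return np.concatenate([X] + X_shifts)

    def shift_digit(self, digit, delta_x=0, delta_y=0):
        # delta_x moves along axis 0 (rows), delta_y along axis 1 (columns)
        digit = digit.reshape(28, 28)
        return shift(digit, (delta_x, delta_y)).flatten()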
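
To make the effect of `shift` concrete, here is a small sketch on a made-up 4x4 array instead of a real digit. Content pushed past the edge is dropped and the vacated cells are filled with zeros, so the shape of the array is unchanged. Passing `order=0` is my own choice for the illustration, to rule out any interpolation; the class above uses the default.

```python
import numpy as np
from scipy.ndimage import shift

# Toy image: a 2x2 block of ones in the centre (made-up data)
image = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
])

# (1, 0) moves the content one row down; (0, -1) moves it one column left
down = shift(image, (1, 0), order=0)
left = shift(image, (0, -1), order=0)
print(down)
print(left)
```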

In [37]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('binarize', DigitBinarizer()),
    ('shifter', DigitShifter())
])

In [38]:
# Both steps are stateless, but fitting first avoids calling transform on an unfitted pipeline
X = pipeline.fit_transform(digits)

In [39]:
# Repeat the labels once per copy of the data (originals plus each shifted copy)
y = np.tile(labels, len(X) // len(labels))

In [40]:
knn.fit(X, y)

KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [43]:
%%time
predictions = knn.predict(test[:1000])
%time

Wall time: 0 ns
Wall time: 19.1 s


In [45]:
%%time
predictions = [knn.predict(test[i*1000:(i+1)*1000]) for i in range(len(test) // 1000)]
%time

Wall time: 1 ms
Wall time: 7min 25s


In [46]:
predictions_flat = np.array(predictions).flatten()

In [47]:
submission_df = pd.DataFrame(list(zip(np.arange(1, 28001), predictions_flat)), columns = ['ImageID', 'Label'])

In [48]:
submission_df.set_index('ImageID').to_csv('submission3.csv')

This submission got an accuracy of 97.0%. Improvement!

# Conclusion
Using KNN on the extended training set I improved my prediction accuracy to 97%, compared to my previous RandomForest baseline of 94.5%.

I had high hopes for KNN, and wanted to experiment with scikit-learn to tune hyperparameters and try out some more extensions of the training set. However, the long prediction times make this very time-consuming. I would rather focus on feature extraction.