
# MNIST Dataset

Classifying the MNIST digits with scikit-learn's KNeighborsClassifier.

In [0]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

In [0]:
from google.colab import drive
from joblib import dump, load

drive.mount('/content/gdrive')

In [0]:
def getPath(name):
  # Build a path inside the mounted Google Drive folder
  return f"/content/gdrive/My Drive/{name}"

## Getting the Dataset

In [27]:
from sklearn.datasets import fetch_openml
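# as_frame=False returns NumPy arrays (newer scikit-learn versions may default to a DataFrame)
mnist = fetch_openml('mnist_784', version=1, as_frame=False)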
mnist.keys()

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'DESCR', 'details', 'categories', 'url'])

In [0]:
X, y = mnist['data'], mnist['target']

Convert the labels from strings to integers:

In [0]:
y = y.astype(np.uint8)
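
To see what this conversion does, here is a minimal sketch on a handful of hand-written labels. `labels_str` and `labels_int` are hypothetical names standing in for `mnist['target']` and `y`; the values are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for mnist['target']: the digits arrive as strings.
labels_str = np.array(['5', '0', '4', '1', '9'])

# astype parses each string as an integer; uint8 is enough for the values 0-9.
labels_int = labels_str.astype(np.uint8)

print(labels_int.dtype, labels_int.tolist())
```

With integer labels, comparisons such as `y == 5` behave as expected.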

In [0]:
def plot_digit(digit):
  # Each sample is a flat vector of 784 pixel values; reshape it to 28x28 for display
  digit_img = digit.reshape(28, 28)
  plt.imshow(digit_img, cmap="binary")
  plt.axis("off")
  plt.show()

## Splitting the Data

The MNIST dataset is already split into a training set (the first 60,000 images) and a test set (the remaining 10,000).

In [0]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
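
A quick sanity check of the slicing, sketched on a stand-in array so it runs without downloading MNIST. The `*_demo` names and the placeholder labels are assumptions for this example only.

```python
import numpy as np

# Stand-in for the 70,000-row MNIST matrix: a single column holding the row index.
X_demo = np.arange(70000).reshape(-1, 1)
y_demo = np.arange(70000) % 10  # placeholder labels, not real digits

X_train_demo, X_test_demo = X_demo[:60000], X_demo[60000:]
y_train_demo, y_test_demo = y_demo[:60000], y_demo[60000:]

print(X_train_demo.shape, X_test_demo.shape)   # (60000, 1) (10000, 1)
# Every training row comes before every test row, so the two sets cannot overlap.
print(X_train_demo.max() < X_test_demo.min())  # True
```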

## Hyperparameter search

In [36]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
param_grid = [{'weights': ['uniform', 'distance'], 'n_neighbors': [3, 4]}]
knn_grid_search = GridSearchCV(knn_clf, param_grid, verbose=3, scoring='accuracy')
knn_grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] n_neighbors=3, weights=uniform ..................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ...... n_neighbors=3, weights=uniform, score=0.972, total=16.4min
[CV] n_neighbors=3, weights=uniform ..................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 16.4min remaining:    0.0s


[CV] ...... n_neighbors=3, weights=uniform, score=0.971, total=16.4min
[CV] n_neighbors=3, weights=uniform ..................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 32.8min remaining:    0.0s


[CV] ...... n_neighbors=3, weights=uniform, score=0.969, total=16.4min
[CV] n_neighbors=3, weights=uniform ..................................
[CV] ...... n_neighbors=3, weights=uniform, score=0.969, total=16.5min
[CV] n_neighbors=3, weights=uniform ..................................
[CV] ...... n_neighbors=3, weights=uniform, score=0.970, total=16.5min
[CV] n_neighbors=3, weights=distance .................................
[CV] ..... n_neighbors=3, weights=distance, score=0.972, total=16.4min
[CV] n_neighbors=3, weights=distance .................................
[CV] ..... n_neighbors=3, weights=distance, score=0.972, total=16.4min
[CV] n_neighbors=3, weights=distance .................................
[CV] ..... n_neighbors=3, weights=distance, score=0.970, total=16.4min
[CV] n_neighbors=3, weights=distance .................................
[CV] ..... n_neighbors=3, weights=distance, score=0.970, total=16.4min
[CV] n_neighbors=3, weights=distance .................................
[CV] .

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed: 328.9min finished


GridSearchCV(cv=None, error_score=nan,
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='deprecated', n_jobs=None,
             param_grid=[{'n_neighbors': [3, 4],
                          'weights': ['uniform', 'distance']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=3)

In [37]:
knn_grid_search.best_params_

{'n_neighbors': 4, 'weights': 'distance'}

In [38]:
knn_grid_search.best_score_

0.9716166666666666
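
The verbose log above reports one score per fold. `GridSearchCV` also stores per-candidate averages in `cv_results_`. Here is a minimal sketch of reading them, assuming scikit-learn's bundled 8x8 digits (`load_digits`) as a small stand-in for MNIST so that it runs in seconds. `X_small`, `y_small`, `search` and `results` are names made up for this example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: 1,797 8x8 digit images bundled with scikit-learn (not MNIST).
X_small, y_small = load_digits(return_X_y=True)

param_grid = [{'weights': ['uniform', 'distance'], 'n_neighbors': [3, 4]}]
search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring='accuracy', cv=5)
search.fit(X_small, y_small)

# One entry per candidate: its parameters and its mean accuracy over the 5 folds.
results = list(zip(search.cv_results_['params'],
                   search.cv_results_['mean_test_score']))
for params, mean_score in results:
    print(params, round(mean_score, 4))
```

On the full 60,000-image training set, passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel across CPU cores, which may shorten the wall-clock time compared with the sequential run logged above.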

## Model Persistence

Since the hyperparameter search takes about 5.5 hours (328.9 minutes in the run above), let's save the GridSearchCV object so we can load it later without having to run the search again.

In [0]:
gs_save_name = 'knn_grid_search.joblib'

In [41]:
dump(knn_grid_search, getPath(gs_save_name))  

['/content/gdrive/My Drive/knn_grid_search.joblib']

In [0]:
knn_grid_search = load(getPath(gs_save_name))
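
The same dump/load round trip can be tried outside Colab. This is a sketch under a few assumptions: a tiny synthetic dataset instead of MNIST, a temporary directory instead of Google Drive, and a hypothetical file name `knn_demo.joblib`.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.neighbors import KNeighborsClassifier

# Tiny synthetic problem standing in for MNIST.
rng = np.random.RandomState(0)
X_demo = rng.rand(40, 5)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'knn_demo.joblib')  # hypothetical file name
    dump(clf, path)        # write the fitted estimator to disk
    restored = load(path)  # read it back as a new object

# The restored model should predict exactly like the original one.
same_predictions = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print(same_predictions)  # True
```

Keep in mind that a joblib file is only safe to load from a trusted source, and it is best loaded with the same scikit-learn version that wrote it.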

## Predicting


In [0]:
y_test_pred = knn_grid_search.predict(X_test)

In [44]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_test_pred)
print(accuracy)

0.9714


## Data Augmentation


Let's try to get an even better score by augmenting the data: shift every digit by 1px in each direction (left, right, up, down) and add the shifted copies to the training set.
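
To make a 1px shift concrete, here is a NumPy-only sketch on a 3x3 toy image. `shift_int` and `toy` are hypothetical names, and the function is a simplified stand-in that handles whole-pixel shifts only; it is not SciPy's `shift`.

```python
import numpy as np

def shift_int(image, dx, dy, fill=0):
    """Shift a 2-D array dy rows down and dx columns right; vacated cells get `fill`."""
    h, w = image.shape
    out = np.full_like(image, fill)
    # Source and destination windows that still overlap after the shift.
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    out[dst_rows, dst_cols] = image[src_rows, src_cols]
    return out

toy = np.array([[0, 0, 0],
                [0, 9, 0],
                [0, 0, 0]])

right = shift_int(toy, dx=1, dy=0)  # the 9 moves one column to the right
up = shift_int(toy, dx=0, dy=-1)    # the 9 moves one row up
print(right)
print(up)
```

Pixels pushed past the border are dropped and the vacated edge is filled with 0.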

In [0]:
from scipy.ndimage import shift

def shift_img(image, dx, dy):
  # shift expects offsets in (row, column) order, i.e. (vertical, horizontal)
  image = image.reshape(28, 28)
  shifted_image = shift(image, [dy, dx], cval=0)
  return shifted_image.reshape(784,)

In [46]:
X_train_augmented = [image for image in X_train]
y_train_augmented = [label for label in y_train]

for image, label in zip(X_train, y_train):
  # Named dx, dy rather than x, y so the loop does not overwrite the label array y
  for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    X_train_augmented.append(shift_img(image, dx, dy))
    y_train_augmented.append(label)

X_train_augmented = np.array(X_train_augmented)
y_train_augmented = np.array(y_train_augmented)
print(X_train_augmented.shape, y_train_augmented.shape)

(300000, 784) (300000,)


In [0]:
shuffle_index = np.random.permutation(len(X_train_augmented))
X_train_augmented = X_train_augmented[shuffle_index]
y_train_augmented = y_train_augmented[shuffle_index]
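
Indexing both arrays with the same permutation keeps every image paired with its label. Here is a small sketch of that property, with a seeded generator so the result is repeatable. The seed and the `*_demo` arrays are assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(42)  # assumed seed, chosen only for repeatability

X_demo = np.arange(10).reshape(5, 2)  # row i is [2*i, 2*i + 1]
y_demo = np.arange(5)                 # label i belongs to row i

idx = rng.permutation(len(X_demo))
X_shuffled, y_shuffled = X_demo[idx], y_demo[idx]

# Every shuffled row still sits next to its own label.
still_aligned = np.array_equal(X_shuffled[:, 0], 2 * y_shuffled)
print(still_aligned)  # True
```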

In [48]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(**knn_grid_search.best_params_)
knn_clf.fit(X_train_augmented, y_train_augmented)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=4, p=2,
                     weights='distance')

In [49]:
from sklearn.metrics import accuracy_score, f1_score

y_pred = knn_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
f1 = f1_score(y_test, y_pred, average='macro')
print(f1)

0.9763
0.9762649231181537
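
To put the two accuracy figures side by side, this short calculation converts them into misclassified test images. It assumes the 10,000-image test set from the split above; the variable names are made up for this example.

```python
n_test = 10000       # size of the MNIST test set
acc_before = 0.9714  # grid-search model on the original training set
acc_after = 0.9763   # same hyperparameters, augmented training set

errors_before = round((1 - acc_before) * n_test)
errors_after = round((1 - acc_after) * n_test)
relative_reduction = (errors_before - errors_after) / errors_before

print(errors_before, errors_after)   # 286 237
print(f"{relative_reduction:.1%}")   # 17.1%
```

So the augmented model misclassifies 49 fewer test images, about 17% fewer errors than before.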
