# MNIST Digit Recognizer Classification

This notebook applies a K-nearest neighbors (KNN) classifier to the well-known MNIST Digit Recognizer dataset, a multiclass classification problem.

Steps:
- Data Collection
- Digit Visualization
- Preprocessing: Data Augmentation
- Modelling with KNN



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
train = pd.read_csv('../input/digit-recognizer/train.csv')
test = pd.read_csv('../input/digit-recognizer/test.csv')

In [None]:
df_train = train.copy()

In [None]:
train.head()

In [None]:
train.info()

In [None]:
y_train = train.pop('label')
X = train

## Visualize pixels

In [None]:
import matplotlib.pyplot as plt

# Each row holds 784 pixel values; reshape the first row into a 28x28 image
some_digit = X.loc[0].values
some_digit_image = some_digit.reshape(28, 28)

plt.imshow(some_digit_image, cmap='binary')
plt.axis('off')
plt.show()

In [None]:
y_train.loc[0]

The image looks like the number 1 and the label confirms it. 

# Preprocessing

## Shift MNIST images

Now we create a function that shifts every image in a given direction by a given number of pixels.

We then use it for data augmentation: the four shifted copies of the training set (one per direction) are concatenated with the original, giving a dataset five times the original size.
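
As a small illustration of the shifting idea, the sketch below applies `np.roll` to a toy 3x3 grid instead of a 28x28 digit (the grid and the variable names are made up for this example). It shows that rolling along `axis=0` moves whole rows, rolling along `axis=1` moves columns, and that `np.roll` wraps around, so values pushed off one edge reappear on the opposite edge.

```python
import numpy as np

# Toy 3x3 "image", standing in for a 28x28 digit (illustrative only)
image = np.arange(9).reshape(3, 3)
# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]

# Rows move up; the top row wraps around to the bottom
shifted_up = np.roll(image, -1, axis=0)

# Columns move left; the first column wraps around to the right
shifted_left = np.roll(image, -1, axis=1)

# Rolling the flattened vector moves every pixel one position in reading order,
# which is not the same as moving the rows up
rolled_flat = np.roll(image.ravel(), -1).reshape(3, 3)

print(shifted_up)
print(shifted_left)
print(rolled_flat)
```

For one-pixel shifts the wrap-around is typically harmless on MNIST, because the digits are centered and the border pixels are mostly blank. For larger shifts, a zero-filled shift may be a better fit.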

In [None]:
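labels = y_train
data = X.values

def shift_images(direction='up', nr_pixels=1):
    # Roll each image on its 28x28 grid (axis 0 = rows, axis 1 = columns), then flatten it again.
    # Note that np.roll wraps around: pixels pushed off one edge reappear on the opposite edge.
    if direction == 'up':
        data_shifted = [np.roll(x.reshape(28, 28), -nr_pixels, axis=0).ravel() for x in data]
    elif direction == 'down':
        data_shifted = [np.roll(x.reshape(28, 28), nr_pixels, axis=0).ravel() for x in data]
    elif direction == 'left':
        data_shifted = [np.roll(x.reshape(28, 28), -nr_pixels, axis=1).ravel() for x in data]
    elif direction == 'right':
        data_shifted = [np.roll(x.reshape(28, 28), nr_pixels, axis=1).ravel() for x in data]
    else:
        raise ValueError(f"Unknown direction: {direction!r}")

    # Rebuild a DataFrame with the original index and columns, and attach the unchanged labels
    df_shifted = pd.DataFrame(data_shifted, index=X.index, columns=X.columns)
    df_shifted['label'] = labels

    return df_shifted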

In [None]:
df_up = shift_images(direction='up', nr_pixels=1)
df_down = shift_images(direction='down', nr_pixels=1)
df_right = shift_images(direction='right', nr_pixels=1)
df_left = shift_images(direction='left', nr_pixels=1)

df_train = pd.concat([df_train, df_up, df_down, df_right, df_left])

After this, we need to shuffle the new training set.
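
A minimal sketch of the shuffle on a toy DataFrame (the column values and the `random_state` are assumptions made for this example): `sample(frac=1)` returns all rows in random order while keeping each label attached to its pixels, and `reset_index(drop=True)` replaces the shuffled index with a fresh 0..n-1 range. Passing a `random_state` makes the order reproducible.

```python
import pandas as pd

# Toy frame: each label is tied to its pixel value so we can check rows stay intact
toy = pd.DataFrame({'pixel0': [10, 20, 30, 40], 'label': [1, 2, 3, 4]})

# Shuffle all rows, then discard the old index
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
```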

In [None]:
# Shuffle training set
df_train = df_train.sample(frac=1).reset_index(drop=True)

# Modelling

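As the model we choose scikit-learn's KNeighborsClassifier.

For hyperparameter tuning, we use GridSearchCV to search exhaustively over the (`weights`, `n_neighbors`) combinations and pick the one with the best cross-validated score. Because the augmented set contains shifted copies of the same digits, near-duplicates can land in both the training and validation folds, so the reported score is likely optimistic.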
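
To show the mechanics on something that runs in seconds, the sketch below runs the same kind of grid search on scikit-learn's small bundled 8x8 digits dataset (`load_digits`). That dataset is a stand-in chosen for this example, not the Kaggle data, and `cv=3` is likewise an assumption made to keep it quick.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small stand-in dataset: 1797 images of 8x8 pixels (not the Kaggle MNIST data)
X_small, y_small = load_digits(return_X_y=True)

param_grid = {'weights': ['uniform', 'distance'],
              'n_neighbors': [3, 6, 9]}

# Every combination in the grid is scored with 3-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
search.fit(X_small, y_small)

print(search.best_params_)  # best (weights, n_neighbors) pair found
print(search.best_score_)   # mean cross-validated accuracy of that pair
```

With the default `refit=True`, the best combination is refit on the full data after the search, so calling `predict` on the fitted search object uses that final model.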

In [None]:
y_train = df_train.pop('label')
X = df_train

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn = KNeighborsClassifier()


parameters = {'weights':['uniform', 'distance'], 
              'n_neighbors':[3, 6, 9]}

clf = GridSearchCV(knn, parameters)
clf.fit(X, y_train)

print(clf.best_params_)
print(clf.best_score_)

In [None]:
predictions = clf.predict(test)

# The competition's ImageId column starts at 1, while the DataFrame index starts at 0
output = pd.DataFrame({'ImageId': test.index.values + 1, 'Label': predictions})
output.to_csv('my_submission.csv', index=False)
print("Your pipeline submission was successfully saved!")