# MNIST Dataset Classification

In this notebook, we look at the well-known MNIST digit classification problem. <br>

First, we compare two classifiers, Random Forest and KNN, to see which one performs better. <br>
Second, we enlarge the training set by shifting each image by a few pixels to see whether the extra data improves accuracy.

In [1]:
from sklearn.datasets import fetch_openml
import numpy as np
mnist = fetch_openml("mnist_784", version=1, as_frame=False) #as_frame=False returns NumPy arrays, which the positional indexing below relies on

In [2]:
X, y = mnist["data"], mnist["target"]


# Splitting the Test Set and Training Set

In [3]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

# Trying Different Algorithms: Random Forest, KNN

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

forest = RandomForestClassifier(n_estimators=100)

forest.fit(X_train, y_train)
prediction = forest.predict(X_test)
print("Accuracy of Random Forest is:", accuracy_score(y_test,prediction))

Accuracy of Random Forest is: 0.9705


In [5]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
prediction = knn.predict(X_test)
print("Accuracy of KNN is:", accuracy_score(y_test,prediction))

Accuracy of KNN is: 0.9688


# Data Augmentation - Increasing the Training Set

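Here, we generate additional images so that the model has more examples to learn from. <br>
I've created a function called movePixel. <br>
It takes the input images (each flattened to 784 values), the number of pixels to move, and the axis as arguments. <br>
The default axis is 0, which shifts the image vertically (a positive pixel count moves it up, a negative one moves it down); axis 1 shifts it horizontally (positive moves right, negative moves left). <br>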
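To make the vertical shift concrete, here is a minimal sketch on a toy 3x3 image. The helper `shift_rows` and its `side` parameter are hypothetical names introduced only for this illustration; the sketch assumes, as movePixel does, that images are stored as flattened rows and that vacated pixels are filled with zeros.

```python
import numpy as np

def shift_rows(flat_images, n_rows, side):
    """Shift flattened side x side images up (n_rows > 0) or down (n_rows < 0).

    Hypothetical helper for illustration; vacated rows are filled with zeros.
    """
    n_images = flat_images.shape[0]
    offset = abs(n_rows) * side  # one image row = `side` consecutive flat entries
    padding = np.zeros((n_images, offset))
    if n_rows > 0:
        # Drop the top rows and pad at the bottom: the content moves up.
        return np.column_stack((flat_images[:, offset:], padding))
    if n_rows < 0:
        # Drop the bottom rows and pad at the top: the content moves down.
        return np.column_stack((padding, flat_images[:, :-offset]))
    return flat_images.copy()

# A single 3x3 "image" whose only lit pixel is in the centre.
toy = np.array([[0, 0, 0,
                 0, 1, 0,
                 0, 0, 0]], dtype=float)

moved_up = shift_rows(toy, 1, side=3).reshape(3, 3)
moved_down = shift_rows(toy, -1, side=3).reshape(3, 3)
print(moved_up)    # lit pixel is now in the top row
print(moved_down)  # lit pixel is now in the bottom row
```

Because a flattened image stores its rows one after another, moving by one row is the same as moving by `side` entries in the flat array, which is why no reshape is needed for the vertical case.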

In [6]:
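def movePixel(images, pixelNum, axis=0):
    #images: array of shape (n, 784), each row a flattened 28x28 image
    #pixelNum: shift size in pixels; the sign sets the direction
    #axis: 0 shifts vertically, 1 shifts horizontally
    pixels = pixelNum * 28 #One image row is 28 consecutive values in the flattened array
    
    if pixelNum > 0 and axis==0: #Move up: drop the top rows, pad the bottom with zeros
        temp = images[:, pixels:]
        whitePixels = np.zeros((images.shape[0], pixels))

        return np.column_stack((temp, whitePixels))
    elif pixelNum < 0 and axis==0: #Move down: drop the bottom rows, pad the top with zeros
        temp = images[:,:pixels]
        whitePixels = np.zeros((images.shape[0], -pixels))
        
        return np.column_stack((whitePixels,temp))
    elif pixelNum > 0 and axis==1: #Move right: drop the rightmost columns, pad the left with zeros
        temp = images.reshape(images.shape[0], 28, 28)
        temp = temp[:, : ,:-pixelNum]
        whitePixels = np.zeros((images.shape[0], 28, pixelNum))
        temp = np.dstack((whitePixels, temp))
        temp = temp.reshape(images.shape[0], 28*28)
        return temp
    else: #Move left (pixelNum < 0 and axis==1); a pixelNum of 0 returns the image unchanged
        temp = images.reshape(images.shape[0], 28, 28)
        temp = temp[:, : ,-pixelNum:]
        whitePixels = np.zeros((images.shape[0], 28, -pixelNum))
        temp = np.dstack((temp, whitePixels))
        temp = temp.reshape(images.shape[0], 28*28)
        return temp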
    
    
#The parameters for movePixel are drawn at random so that the shifted images are more varied.
#The number of pixels is an integer in [-2, 2], excluding zero.
#The axis is also random: either 0 or 1.
#The resulting array has 120000 new training examples (two shifted copies of each training image)

numPixel1 = np.random.uniform(1, 2, (60000, 1))  #range of [1,2]
numPixel2 = np.random.uniform(-2, -1, (60000, 1)) #range of [-2, -1]
axis = np.random.uniform(0,1, (120000,1)) 
temp = np.hstack((np.vstack((numPixel1,numPixel2)), axis)) #Stack numPixel1 and numPixel2 vertically. Then, attach axis horizontally
temp[:,0], temp[:,1] = np.round(temp[:,0]), np.round(temp[:,1])
temp = temp.astype(int) 
newImageArray = np.empty([120000, 784])

for i in range(60000): #Go through X_train first time
    newImage = movePixel(X_train[i,:].reshape([1,784]), temp[i,0], temp[i,1])
    newImageArray[i,:] = newImage

for i in range(60000,120000): #Go through X_train second time
    newImage = movePixel(X_train[i-60000,:].reshape([1,784]), temp[i,0], temp[i,1])
    newImageArray[i,:] = newImage
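The shift parameters above are built by drawing uniform floats and rounding them. As a possible alternative, the same kind of parameter table can be drawn as integers directly. This is only a sketch: it uses NumPy's `default_rng` generator, the names `rng`, `n_images`, and `params` are assumptions made for the example, and a tiny `n_images` stands in for the 60000 training images.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable
n_images = 6                    # small stand-in for the 60000 training images

# `high` is exclusive in rng.integers, so (1, 3) draws 1 or 2.
positive = rng.integers(1, 3, size=n_images)    # shifts for the first pass
negative = -rng.integers(1, 3, size=n_images)   # shifts for the second pass
shifts = np.concatenate((positive, negative))

# One axis per augmented image: 0 (vertical) or 1 (horizontal).
axes = rng.integers(0, 2, size=2 * n_images)

# Column 0 holds the shift, column 1 the axis, matching the layout of `temp`.
params = np.column_stack((shifts, axes))
print(params)
```

As in the cell above, each image gets one positive and one negative shift, so the two augmented copies of an image move in opposite directions whenever they share an axis.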
    


In [7]:
new_X_train = np.vstack((X_train, newImageArray)) #Stack the new images that we've created with the existing training data
new_y_train = np.concatenate((y_train, y_train, y_train)) #Stack labels: one copy for the originals and one for each of the two shifted passes

shuffle = np.hstack((new_X_train, new_y_train.reshape(180000,1))) #Stack images features with labels horizontally

In [8]:
np.random.shuffle(shuffle) #Shuffle!
new_X_train, new_y_train = shuffle[:,:784], shuffle[:,784] #Separate shuffled training X and y
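Stacking the features and labels into one array forces them to share a single dtype, which can be costly or lossy depending on how the labels are stored. A common alternative is to shuffle both arrays with one shared permutation of row indices. The sketch below uses small toy arrays (`features`, `labels`) as hypothetical stand-ins for new_X_train and new_y_train.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

# Toy stand-ins: row i of `features` starts with 2*i and carries label str(i).
features = np.arange(12, dtype=float).reshape(6, 2)
labels = np.array(["0", "1", "2", "3", "4", "5"])

# One permutation of the row indices, applied to both arrays,
# keeps every image paired with its own label.
order = rng.permutation(len(features))
shuffled_features = features[order]
shuffled_labels = labels[order]

print(shuffled_features[:, 0])
print(shuffled_labels)
```

With this approach the features stay numeric and the labels keep their original dtype, so no separation step is needed afterwards.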

# Model Trained with More Training Data

In [9]:
newForest = RandomForestClassifier(n_estimators=100)

    
newForest.fit(new_X_train, new_y_train)
prediction = newForest.predict(X_test)
print("Accuracy of New Random Forest is:", accuracy_score(y_test,prediction))


Accuracy of New Random Forest is: 0.9784


In [10]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(new_X_train, new_y_train)
prediction = knn.predict(X_test)
print("Accuracy of New KNN is:", accuracy_score(y_test,prediction))

Accuracy of New KNN is: 0.9708


So, data augmentation improved the accuracy of both classifiers after the training set was tripled: Random Forest went from 97.05% to 97.84% (about 0.8 percentage points) and KNN from 96.88% to 97.08% (0.2 percentage points). I think the Random Forest accuracy of 97.8% is very good for the MNIST dataset. We can probably increase the accuracy further by enlarging the training set even more.
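As a quick check on the size of the improvement, the gains can be recomputed from the accuracies printed in the cell outputs. The dictionary names here are chosen only for this sketch.

```python
# Accuracies copied from the cell outputs in this notebook.
baseline = {"Random Forest": 0.9705, "KNN": 0.9688}
augmented = {"Random Forest": 0.9784, "KNN": 0.9708}

# Difference in accuracy, expressed in percentage points.
gains = {name: (augmented[name] - baseline[name]) * 100 for name in baseline}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```

These figures come from a single training run of each model, so they describe this run rather than an average over several.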