# MNIST - Classification

<center><img src="https://www.dropbox.com/s/i37mgynkrf1d3vb/supervised_flow_chart.png?raw=1" height=300px width=1000px></img></center>

# 01 : Frame the Problem

We need to build a classifier using the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents. This set has been studied so much that it is often called the “Hello World” of Machine Learning. 

Whenever people come up with a new classification algorithm, they are curious to see how it will perform on MNIST. Whenever someone learns Machine Learning, sooner or later they tackle MNIST. 

Each image is 28 x 28 pixels, and each cell holds the intensity of one pixel.
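As a minimal sketch of that layout, assuming each image is stored as one flat row of 784 values in row-major order (the array below is synthetic, not MNIST data), a row can be reshaped into a 28 x 28 grid with NumPy:

```python
import numpy as np

# Synthetic stand-in for one flattened image: 784 values, not real MNIST data.
flat_pixels = np.arange(28 * 28)

# Reshape the flat row into a 28 x 28 grid (row-major order is an assumption).
image = flat_pixels.reshape(28, 28)

# The pixel at grid position (row, col) comes from flat index row * 28 + col.
row, col = 3, 5
pixel = image[row, col]
print(image.shape, pixel)
```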

# 02 : Obtaining the Data

### Import the Libraries

In [0]:
import numpy as np
import os
import pandas as pd
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

### Reading the data from CSV File

In [0]:
!wget https://www.dropbox.com/s/lskvhzb0gy8npv9/mnistdata.csv
mnist = pd.read_csv("mnistdata.csv")
mnist.info()

In [0]:
y = mnist['Label']  # target: the digit label of each row
X = mnist.drop('Label', axis=1)  # features: every remaining column (the pixel values)

In [0]:
X.shape

In [0]:
type(y)

In [0]:
y.shape

In [0]:
28*28 

# 03 : Analyze Data

With the features (`X`) and target (`y`) prepared, we inspect the shape of the feature set.

We then visualize one digit (row 35000) using Matplotlib.

In [0]:
some_digit = X.iloc[35000]
# A pandas Series has no reshape method, so convert to a NumPy array first.
some_digit_image = some_digit.to_numpy().reshape(28, 28)  # 2D array of pixels
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary,
           interpolation="nearest")
plt.axis("off")
plt.show();

In [0]:
some_digit_image.shape

In [0]:
some_digit_image.max()

In [0]:
y[35000]

# 04 : Feature Engineering

MNIST data is divided as follows:  
- Train Data - First 60000 rows  
- Test Data - Last 10000 rows  

In [0]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Because similar digits may sit next to each other in the file, we shuffle the training rows with NumPy's `np.random.permutation` so that every cross-validation fold sees a mix of digits.

In [0]:
import numpy as np

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train.iloc[shuffle_index], y_train.iloc[shuffle_index]

In [0]:
type(shuffle_index)
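The key point of the shuffle is that the same index array is applied to both the features and the labels, so each row keeps its label. A small sketch with toy arrays (plain NumPy arrays rather than the DataFrame used above, and a seed chosen only for reproducibility):

```python
import numpy as np

np.random.seed(0)  # fixed seed, an assumption made only for reproducibility

# Toy features and labels; row i of X belongs to label y[i].
X = np.array([[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]])
y = np.array([0, 1, 2, 3, 4])

# One permutation of the row positions, applied to both arrays.
shuffle_index = np.random.permutation(len(X))
X_shuffled, y_shuffled = X[shuffle_index], y[shuffle_index]

# Each shuffled row still matches its label.
print(np.array_equal(X_shuffled[:, 0], y_shuffled))
```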

# 05-A : Model Selection





### Binary classifier

Instead of predicting all the classes, we first predict whether a digit is a 5 or not.  
We create train and test target variables that are True for rows representing a 5 and False for every other digit.

In [0]:
y_train_5 = (y_train == 5) 
y_test_5 = (y_test == 5)

In [0]:
y[12607]

In [0]:
y_train_5.head(40)

In [0]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=5, random_state=42)
sgd_clf.fit(X_train, y_train_5)

# 06-A : Tune the Model

Predict whether the digit visualized earlier (`some_digit`) is a 5.

In [0]:
sgd_clf.predict([some_digit])

`cross_val_score` trains and evaluates the model once per fold and returns an array with the accuracy obtained on each held-out fold.

In [0]:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
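To make the per-fold evaluation concrete, here is a rough sketch of the idea in plain NumPy. The helper `k_fold_accuracy`, the threshold "model", and the toy data are all hypothetical stand-ins; scikit-learn uses stratified folds for classifiers, whereas this sketch uses plain contiguous folds for simplicity.

```python
import numpy as np

def k_fold_accuracy(X, y, fit, predict, k=3):
    """Return one accuracy per fold, training on the other k - 1 folds."""
    folds = np.array_split(np.arange(len(X)), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        y_pred = predict(model, X[test_idx])
        scores.append(np.mean(y_pred == y[test_idx]))
    return np.array(scores)

# Hypothetical stand-in model: predict True when the single feature exceeds
# the midpoint between the two class means seen during training.
def fit_threshold(X, y):
    return (X[y].mean() + X[~y].mean()) / 2

def predict_threshold(threshold, X):
    return X[:, 0] > threshold

# Toy, cleanly separable data: the label is True when the feature is >= 6.
X = np.array([[1.0], [9.0], [2.0], [8.0], [3.0], [7.0], [0.0], [10.0], [4.0]])
y = X[:, 0] >= 6

scores = k_fold_accuracy(X, y, fit_threshold, predict_threshold, k=3)
print(scores)  # one accuracy value per fold
```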

`cross_val_predict` returns one prediction per training row, where each row is predicted by a model trained on the folds that do not contain it.

### Cross Validation

In [0]:
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

In [0]:
y_train_pred.shape

In [0]:
len(y_train_pred)

In [0]:
from sklearn.metrics import accuracy_score

In [0]:
accuracy_score(y_train_5, y_train_pred)

In [0]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_train_5, y_train_pred)

In [0]:
from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred) # When we predict it to be 5, how often are we right?

In [0]:
recall_score(y_train_5, y_train_pred) # When it is 5, how often do we predict it to be 5?

In [0]:
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)
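The three scores above all derive from the four counts in the confusion matrix. A minimal sketch with made-up labels (not the notebook's predictions) shows the arithmetic:

```python
# Toy labels: True means "is a 5". These values are made up for illustration.
y_true = [True, True, True, True, False, False, False, False, False, False]
y_pred = [True, True, True, False, True, False, False, False, False, False]

pairs = list(zip(y_true, y_pred))
tp = sum(t and p for t, p in pairs)              # 5s predicted as 5
fp = sum((not t) and p for t, p in pairs)        # non-5s predicted as 5
fn = sum(t and (not p) for t, p in pairs)        # 5s predicted as non-5
tn = sum((not t) and (not p) for t, p in pairs)  # non-5s predicted as non-5

precision = tp / (tp + fp)  # of the predicted 5s, how many are real 5s
recall = tp / (tp + fn)     # of the real 5s, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(tp, fp, fn, tn, precision, recall, f1)
```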

# 05-B : Model Selection

### Dummy Classifier

In the training set, only about 10% of the rows are labelled 5; the rest are not. A dummy classifier that predicts False ("not 5") for every row therefore reaches roughly 90% accuracy without learning anything.
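The arithmetic behind this baseline can be checked with made-up counts that mirror a 10% share of 5s (these are illustrative numbers, not the actual MNIST class counts). Accuracy looks strong while recall is zero:

```python
# Made-up label counts: 100 fives among 1000 rows (illustrative only).
n_total = 1000
n_fives = 100
y_true = [True] * n_fives + [False] * (n_total - n_fives)
y_pred = [False] * n_total  # the dummy classifier never predicts a 5

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total
recall = sum(t and p for t, p in zip(y_true, y_pred)) / n_fives
print(accuracy, recall)
```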

In [0]:
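from sklearn.base import BaseEstimator

class Never5Classifier(BaseEstimator):
    """Baseline that predicts "not 5" for every input."""
    def fit(self, X, y=None):
        return self  # nothing to learn; returning self follows scikit-learn convention
    def predict(self, X):
        return np.zeros(len(X), dtype=bool)  # 1D array of False, one per row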

In [0]:
never_5_clf = Never5Classifier()

# 06-B : Tune the Model

In [0]:
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")

In [0]:
predictions = never_5_clf.predict(X_test)

In [0]:
predictions

In [0]:
confusion_matrix(y_test_5,predictions)

In [0]:
accuracy_score(y_test_5,predictions)

In [0]:
precision_score(y_test_5,predictions)

In [0]:
recall_score(y_test_5,predictions)

# Precision and Recall Threshold

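We study the precision/recall trade-off using the SGD classifier's decision scores: raising the decision threshold generally increases precision and lowers recall.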

<center><img src="https://www.dropbox.com/s/anfhedig7uz35fw/tradeoff.png?raw=1" height=300px width=1000px></img></center>
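A minimal sketch of the trade-off, using made-up scores and labels rather than the SGD classifier's output. The helper `precision_recall_at` is hypothetical and treats a score at or above the threshold as a positive prediction:

```python
import numpy as np

# Made-up decision scores and labels (True means "is a 5").
y_scores = np.array([-3.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0, 4.0])
y_true = np.array([False, False, True, False, True, True, False, True])

def precision_recall_at(threshold):
    """Precision and recall when scores >= threshold count as positive."""
    y_pred = y_scores >= threshold
    tp = np.sum(y_pred & y_true)
    precision = tp / np.sum(y_pred)
    recall = tp / np.sum(y_true)
    return precision, recall

low = precision_recall_at(-1.0)
mid = precision_recall_at(1.5)
high = precision_recall_at(3.5)
print(low, mid, high)
```

On this toy data, precision rises and recall falls as the threshold increases. On real data precision can dip locally, which is why the curve is worth plotting.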

In [0]:
from sklearn.model_selection import cross_val_predict
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,method="decision_function")

In [0]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

In [0]:
len(precisions)

In [0]:
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.xlabel("Threshold", fontsize=16)
    plt.legend(loc="upper left", fontsize=16)
    plt.ylim([0, 1])

plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.xlim([-700000, 700000])
plt.show()

# 05-C : Model Selection

### Multiclass classification

Using the SGD classifier to predict all 10 classes (digits 0-9).

In [0]:
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])

# 06-C : Tune the Model

In [0]:
y_hat = sgd_clf.predict(X_test)

In [0]:
confusion_matrix(y_test,y_hat)

In [0]:
some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores

In [0]:
np.argmax(some_digit_scores)

In [0]:
sgd_clf.classes_

In [0]:
sgd_clf.classes_[5]
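The last three cells show how a multiclass prediction is read off the decision scores: take the index of the highest score, then look that index up in the class array. A sketch with hypothetical scores (not the values the notebook produces):

```python
import numpy as np

# Hypothetical per-class scores for one image, one score per class.
classes = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
scores = np.array([[-9.1, -7.4, -3.2, -1.0, -8.8, 2.3, -6.5, -5.0, -2.7, -4.9]])

best_index = np.argmax(scores)           # position of the highest score
predicted_class = classes[best_index]    # map the position back to a label
print(best_index, predicted_class)
```

Here the index and the label coincide only because the classes are the digits 0 to 9 in order; in general the lookup through the class array is what gives the label.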

# 05-E : Model Selection

### Random Forest Classifier

We train a Random Forest classifier on all 10 classes.

In [0]:
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier()
forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])

# 06-E : Tune the Model

In [0]:
forest_clf.predict_proba([some_digit])

In [0]:
forest_clf_predictions = forest_clf.predict(X_test)

In [0]:
accuracy_score(y_test,forest_clf_predictions)

In [0]:
precision_score(y_test,forest_clf_predictions,average="macro")

In [0]:
recall_score(y_test,forest_clf_predictions,average='macro')
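With `average="macro"`, precision and recall are computed for each class separately and then averaged with equal weight per class. A sketch of that calculation on made-up three-class labels (not the forest's predictions):

```python
import numpy as np

# Made-up three-class labels and predictions.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 0, 0])

per_class_precision = []
per_class_recall = []
for c in np.unique(y_true):
    tp = np.sum((y_pred == c) & (y_true == c))
    per_class_precision.append(tp / np.sum(y_pred == c))
    per_class_recall.append(tp / np.sum(y_true == c))

# Macro average: unweighted mean over classes, regardless of class size.
macro_precision = np.mean(per_class_precision)
macro_recall = np.mean(per_class_recall)
print(macro_precision, macro_recall)
```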

# 05-F : Model Selection

### Multilabel classification using K Neighbors Classifier

Multilabel classification outputs several binary labels for each instance.  
For example, here the labels are:  
- Whether the digit is large (7 or greater).
- Whether the digit is odd.


In [0]:
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

In [0]:
y_train_large.shape
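To see what the multilabel target looks like, here is the same construction applied to a handful of made-up digits. Each row holds two booleans: the first is "7 or greater", the second is "odd".

```python
import numpy as np

# A few made-up digit labels.
y = np.array([3, 7, 8, 9, 0])

y_large = y >= 7        # first label: is the digit 7 or greater?
y_odd = y % 2 == 1      # second label: is the digit odd?

# np.c_ stacks the two 1D arrays as columns: one row per digit.
y_multilabel = np.c_[y_large, y_odd]
print(y_multilabel)
```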