# Introduction

Using *Yassine Ghouzam's* kernel as a reference, a **five-layered Sequential CNN** has been used for digit recognition on the **MNIST dataset**. It is built with the *Keras API* (Tensorflow backend). I started by importing the relevant libraries and then moved on to getting a quick description of the data.

**After a lot of trial and error, the number of steps (epochs) has been set to 48, for better accuracy.** Initially, for testing purposes, epochs was set to 2, which gave an accuracy of ~85%; the accuracy crossed the 99% mark at approximately 25 epochs. On increasing the number of epochs beyond 50, the accuracy went down, probably due to overfitting of the model.

**Kaggle's GPU was used for getting more computational power.**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
%matplotlib inline

np.random.seed(5)

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools

from keras.utils import to_categorical # for one-hot encoding (this import path also works in newer Keras releases)
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau

sns.set(style='white', context='notebook', palette='deep')

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
# Loading the data

train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")


# Breaking up train.csv

The **label** column of the training dataset consists of the actual number denoted by the handwritten digit image. In the next step, the training data is separated into two dataframes *X_train* and *Y_train*, wherein *Y_train* represents the label column and *X_train* has a structure similar to the test dataset.

In [None]:
# preparation for a descriptive plot of the training data

Y_train = train['label']

# dropping the label column
X_train = train.drop(labels = ['label'], axis = 1)

del train

g = sns.countplot(x=Y_train)  # keyword form; newer seaborn releases reject a positional Series

Y_train.value_counts()

# Data check and Normalisation

Both the training and test data are checked for null values so that appropriate measures can be taken to deal with them. In this case, luckily, there were no missing values.
After that, **Grayscale Normalisation** was applied to both datasets to reduce the effect of illumination differences. Moreover, CNNs converge faster on [0..1] data than on [0..255].
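
As a minimal sketch of what the normalisation does, the snippet below scales a small made-up patch of 8-bit pixel values (the `pixels` array is hypothetical, not taken from the dataset) from [0..255] to [0..1]:

```python
import numpy as np

# Hypothetical 2x3 patch of 8-bit grayscale pixels (0 = black, 255 = white)
pixels = np.array([[0, 51, 102],
                   [153, 204, 255]], dtype=np.uint8)

# Dividing by 255.0 promotes the integers to floats and maps them onto [0, 1]
scaled = pixels / 255.0

print(scaled.min(), scaled.max())  # smallest and largest value after scaling
print(scaled.dtype)                # floating-point result
```

The same division is applied to whole dataframes in the cell that follows the null checks.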

In [None]:
# checking the data
X_train.isnull().any().describe()

In [None]:
# checking for missing values in the test data
test.isnull().any().describe()

In [None]:
# no missing values in either dataset, so scale the pixel values to [0, 1]

X_train = X_train/255.0
test = test/255.0

# Reshaping and One-hot encoding

The train and test images (28px x 28px) are stored in pandas dataframes as 1D vectors of 784 values. All data is reshaped to 28x28x1 3D matrices.
Keras requires an extra dimension at the end which corresponds to channels. MNIST images are grayscale, so they use only one channel. RGB images have 3 channels; for those, the flat vectors (784 values per channel) would be reshaped to 28x28x3 3D matrices.

The labels are **one-hot encoded** (e.g. 2 -> [0,0,1,0,0,0,0,0,0,0]) because the "categorical_crossentropy" loss used later expects one target vector of length 10 per image.
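
The two steps can be sketched with plain NumPy on made-up data. The arrays `flat` and `labels` are hypothetical stand-ins for the dataframe values, and `np.eye` is used here only to illustrate what one-hot encoding produces; the notebook itself uses `to_categorical`.

```python
import numpy as np

# Two fake "images", each a flat vector of 784 values (stand-in for dataframe rows)
flat = np.arange(2 * 784, dtype=np.float64).reshape(2, 784)

# -1 lets NumPy infer the number of images; the trailing 1 is the channel axis
images = flat.reshape(-1, 28, 28, 1)
print(images.shape)  # (2, 28, 28, 1)

# One-hot encoding: row k of the 10x10 identity matrix is the vector for digit k
labels = np.array([3, 7])
one_hot = np.eye(10)[labels]
print(one_hot[0])  # 1.0 at index 3, zeros elsewhere
```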

In [None]:
# reshaping each image to 3D (height = width = 28px; channels = 1)
# a grayscale image has 1 channel; an RGB image would have 3
X_train = X_train.values.reshape(-1,28,28,1)
test = test.values.reshape(-1,28,28,1)

In [None]:
# one hot encoding of the labels
Y_train = to_categorical(Y_train, num_classes = 10)

In [None]:
# splitting training and validation set
random_seed = 2

# Internal splitting of data for Cross Validation

10% of the training data is held out as a validation set (a single hold-out split rather than k-fold cross-validation). A random split is done using the train_test_split function; this is adequate here because the dataset contains all the digits in almost equal proportions, as can be seen in the plot generated above.
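
A random hold-out split can be sketched as shuffling the row indices and slicing off the first 10%. The helper `holdout_split` below is hypothetical and is not how `train_test_split` is implemented internally; it only illustrates the idea.

```python
import numpy as np

def holdout_split(n_samples, val_fraction, seed):
    """Return (train_indices, val_indices) from a seeded random shuffle."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_val = round(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]

# 1000 is an arbitrary illustrative size, not the size of train.csv
train_idx, val_idx = holdout_split(n_samples=1000, val_fraction=0.1, seed=2)
print(len(train_idx), len(val_idx))  # 900 100
```

Fixing the seed, as `random_state` does in the notebook, makes the split reproducible between runs.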


In [None]:
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size = 0.1, random_state = random_seed)

In [None]:
# 10% of the training data has been used as a validation set and the rest is used to train the model
# some examples
g = plt.imshow(X_train[1][:,:,0])

# CNN Architecture

As mentioned in the introduction, the Keras Sequential API has been used to build the model, which allows us to add one layer at a time.
The first layer is a convolutional *(Conv2D)* layer. The first two Conv2D layers have 32 filters each and the last two have 64. Each filter transforms a part of the image (defined by the kernel size) using the kernel filter, so filters can be seen as transformations of the image.
The second important layer in a CNN is the pooling *(MaxPool2D)* layer. Pooling layers reduce computational cost and, to some extent, overfitting.
By combining convolutional and pooling layers, CNNs are able to combine local features and learn more global features of the image.

For regularization, the **Dropout** method was used. Dropout keeps a neuron active with some probability p (a hyperparameter) and sets its output to zero otherwise; note that the argument of Keras's `Dropout` layer is the fraction dropped, i.e. 1 - p. This forces the network to be accurate even in the absence of certain information. It prevents overfitting by providing a way of approximately combining exponentially many different neural network architectures efficiently.
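
The mechanism can be sketched in a few lines of NumPy. The helper `inverted_dropout` is hypothetical and only mimics the training-time behaviour (kept activations are rescaled by 1/p so the expected value is unchanged); it is not the Keras implementation.

```python
import numpy as np

def inverted_dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob, rescale the rest."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones((4, 5))

# keep_prob = 0.75 corresponds to a drop rate of 0.25
y = inverted_dropout(x, keep_prob=0.75, rng=rng)
print(y)  # entries are either 0.0 or 1/0.75
```

At prediction time no units are dropped, which is why the rescaling during training matters.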

**'relu'** is the rectifier (activation function max(0,x)), which is used to add non-linearity to the network.

After that, the Flatten layer is used to convert the final feature maps into a single 1D vector. This flattening step is needed so that fully connected layers can be used after the convolutional/maxpool layers. It combines all the local features found by the previous convolutional layers.


In [None]:
# USING THE KERAS SEQUENTIAL API
# SETTING THE CNN MODEL
# CNN architecture -> [[Conv2D->relu]*2 -> MaxPool2D -> Dropout]*2 -> Flatten -> Dense -> Dropout -> Out

model = Sequential()

model.add(Conv2D(filters = 32, kernel_size = (5,5), padding = 'Same',
                activation = 'relu', input_shape =(28,28,1)))
model.add(Conv2D(filters = 32, kernel_size = (5,5), padding = 'Same',
                activation = 'relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'Same', activation = 'relu'))
model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'Same', activation = 'relu'))

model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(256, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation = 'softmax'))
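
To see how the shapes and parameter counts evolve through the layers defined above, here is a small arithmetic sketch in plain Python. The helpers `conv_params` and `dense_params` are hypothetical; the sketch assumes 'same' padding (spatial size unchanged by the convolutions) and 2x2 pooling that halves each side. The result can be cross-checked against `model.summary()`.

```python
def conv_params(kernel, in_channels, filters):
    """Weights of a square Conv2D kernel plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights of a fully connected layer plus one bias per unit."""
    return inputs * units + units

side, channels, total = 28, 1, 0

# Block 1: two 5x5 convolutions with 32 filters, then 2x2 max pooling
for kernel, filters in [(5, 32), (5, 32)]:
    total += conv_params(kernel, channels, filters)
    channels = filters
side //= 2  # 28 -> 14

# Block 2: two 3x3 convolutions with 64 filters, then 2x2 max pooling
for kernel, filters in [(3, 64), (3, 64)]:
    total += conv_params(kernel, channels, filters)
    channels = filters
side //= 2  # 14 -> 7

flat = side * side * channels  # length of the vector after Flatten
total += dense_params(flat, 256) + dense_params(256, 10)

print(side, channels, flat, total)  # 7 64 3136 887530
```

Most of the parameters sit in the first Dense layer, which connects the 3136 flattened features to 256 units.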

# Optimisation Algorithm and Loss Function

A loss function measures how poorly the model performs on images with known labels by comparing the observed labels with the predicted ones. For multi-class classification (more than two classes), the "categorical_crossentropy" loss has been used.
The optimisation algorithm iteratively updates the parameters (filter kernel values, weights and biases of neurons, etc.) in order to minimise the loss.
RMSprop was chosen as the optimizer over plain Stochastic Gradient Descent because SGD generally converges more slowly.
The Learning Rate (LR) is the step size with which the optimizer walks through the 'loss landscape'. A higher LR gives quicker convergence, but the larger steps sample the landscape poorly.
Following Yassine's notebook, it is better to decrease the learning rate during training in order to reach the global minimum of the loss function efficiently.
To keep the advantage of the fast computation time with a high LR, the ReduceLROnPlateau callback from keras.callbacks was used: the LR is halved if the validation accuracy has not improved after three steps (epochs).
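
The halving rule can be illustrated with a simplified simulation in plain Python. The function `simulate_plateau_schedule` and the validation accuracies fed to it are made up for illustration; the real callback has extra options (such as a minimum improvement threshold and a cooldown) that this sketch leaves out.

```python
def simulate_plateau_schedule(val_accuracies, initial_lr=0.001,
                              factor=0.5, patience=3, min_lr=1e-5):
    """Return the learning rate in effect after each epoch."""
    lr, best, wait = initial_lr, float("-inf"), 0
    history = []
    for acc in val_accuracies:
        if acc > best:
            best, wait = acc, 0          # improvement: reset the counter
        else:
            wait += 1                    # no improvement this epoch
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Hypothetical validation accuracies, one per epoch
accs = [0.980, 0.985, 0.984, 0.985, 0.983, 0.990, 0.989, 0.989, 0.990]
print(simulate_plateau_schedule(accs))
# the rate is halved after epochs 5 and 9, each following three epochs without improvement
```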

In [None]:
# setting the optimizer and annealer
# setting up a score function, loss function, optimisation algorithm
# defining the optimizer

optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)

In [None]:
# compiling the model
model.compile(optimizer = optimizer, loss = 'categorical_crossentropy', metrics = ['accuracy'])
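
As a worked example of the `categorical_crossentropy` loss passed to `model.compile`, the sketch below computes it by hand for a single sample with three classes. The function name and the probability vectors are hypothetical, and the numerical clipping that libraries normally apply is omitted.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, predicted_probs) if t > 0)

target = [0, 0, 1]               # the true class is index 2
confident = [0.05, 0.05, 0.90]   # prediction that favours the true class
wrong = [0.70, 0.20, 0.10]       # prediction that favours another class

print(round(categorical_crossentropy(target, confident), 4))  # 0.1054
print(round(categorical_crossentropy(target, wrong), 4))      # 2.3026
```

The loss grows quickly as the probability of the true class approaches zero, which is what pushes the optimizer to correct confident mistakes.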


In [None]:
# setting a learning rate annealer
learning_rate_reduction = ReduceLROnPlateau(monitor='val_acc', 
                                            patience=3, 
                                            verbose=1, 
                                            factor=0.5, 
                                            min_lr=0.00001)

In [None]:
epochs = 200
batch_size = 86

# Data Augmentation

To reduce overfitting, small variations are introduced in the training dataset to create new images. Methods of augmenting data used here:
1. Randomly rotating images by up to 10 degrees
2. Randomly zooming images by up to 10%
3. Randomly shifting images horizontally or vertically by up to 10%
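
To give a feel for one of these transformations, here is a NumPy sketch of a random shift. The helper `random_shift` is hypothetical and fills the uncovered border with zeros; it illustrates the idea only and is not how `ImageDataGenerator` implements shifting.

```python
import numpy as np

def random_shift(image, max_fraction, rng):
    """Shift a 2D image by a random whole number of pixels, zero-filling the border."""
    height, width = image.shape
    max_dy = int(height * max_fraction)
    max_dx = int(width * max_fraction)
    dy = int(rng.integers(-max_dy, max_dy + 1))  # positive = down
    dx = int(rng.integers(-max_dx, max_dx + 1))  # positive = right
    shifted = np.zeros_like(image)
    src_rows = slice(max(0, -dy), height - max(0, dy))
    dst_rows = slice(max(0, dy), height - max(0, -dy))
    src_cols = slice(max(0, -dx), width - max(0, dx))
    dst_cols = slice(max(0, dx), width - max(0, -dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted

rng = np.random.default_rng(5)
digit = np.zeros((28, 28))
digit[14, 14] = 1.0  # a single bright pixel standing in for a digit

augmented = random_shift(digit, max_fraction=0.1, rng=rng)
print(np.argwhere(augmented == 1.0))  # new position, at most 2 pixels from (14, 14)
```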

In [None]:
# data augmentation to reduce overfitting

datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=10,  # randomly rotate images in the range (degrees, 0 to 180)
        zoom_range = 0.1, # Randomly zoom image 
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=False,  # no random horizontal flips
        vertical_flip=False)  # no random vertical flips


datagen.fit(X_train)

In [None]:
# fitting the model
history = model.fit_generator(datagen.flow(X_train,Y_train, batch_size=batch_size),
                              epochs = epochs, validation_data = (X_val,Y_val),
                              verbose = 2, steps_per_epoch=X_train.shape[0] // batch_size
                              , callbacks=[learning_rate_reduction])
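
The `steps_per_epoch` value can be worked out by hand. The sketch below assumes that train.csv has 42,000 rows, a figure that is not stated anywhere in this notebook, so treat the resulting numbers as illustrative.

```python
n_rows = 42_000       # assumed size of train.csv (not stated in the notebook)
val_fraction = 0.1
batch_size = 86

n_val = round(n_rows * val_fraction)  # rows held out for validation
n_train = n_rows - n_val              # rows left for training

steps_per_epoch = n_train // batch_size
leftover = n_train - steps_per_epoch * batch_size

print(n_train, steps_per_epoch, leftover)  # 37800 439 46
```

Because of the integer division, each epoch draws slightly fewer augmented images than there are training rows.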

In [None]:
# a confusion matrix helps reveal which digits the model confuses with each other

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # normalise first so that the image and the cell text show the same values
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Predict the values from the validation dataset
Y_pred = model.predict(X_val)
# Convert predicted probabilities to class indices
Y_pred_classes = np.argmax(Y_pred,axis = 1) 
# Convert one-hot validation labels back to class indices
Y_true = np.argmax(Y_val,axis = 1) 
# compute the confusion matrix
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes) 
# plot the confusion matrix
plot_confusion_matrix(confusion_mtx, classes = range(10)) 
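
For intuition about what the matrix contains, here is a from-scratch sketch on a tiny made-up example with three classes. The helper `confusion_counts` and the label arrays are hypothetical; the notebook itself relies on `confusion_matrix` from scikit-learn.

```python
import numpy as np

def confusion_counts(true_labels, predicted_labels, n_classes):
    """Entry [i, j] counts samples of true class i predicted as class j."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (true_labels, predicted_labels), 1)
    return matrix

y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_counts(y_true, y_pred, n_classes=3)
print(cm)

# Dividing each row by its total gives per-class rates; the diagonal is the recall
row_normalised = cm / cm.sum(axis=1, keepdims=True)
print(row_normalised.diagonal())
```

Off-diagonal entries show which pairs of classes are confused most often.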

In [None]:
# investigating for the errors

# Displaying some error results 

# Errors are difference between predicted labels and true labels
errors = (Y_pred_classes - Y_true != 0)

Y_pred_classes_errors = Y_pred_classes[errors]
Y_pred_errors = Y_pred[errors]
Y_true_errors = Y_true[errors]
X_val_errors = X_val[errors]

def display_errors(errors_index,img_errors,pred_errors, obs_errors):
    """ This function shows 6 images with their predicted and real labels"""
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows,ncols,sharex=True,sharey=True)
    for row in range(nrows):
        for col in range(ncols):
            error = errors_index[n]
            ax[row,col].imshow((img_errors[error]).reshape((28,28)))
            ax[row,col].set_title("Predicted label :{}\nTrue label :{}".format(pred_errors[error],obs_errors[error]))
            n += 1

# Probabilities of the wrong predicted numbers
Y_pred_errors_prob = np.max(Y_pred_errors,axis = 1)

# Predicted probabilities of the true values in the error set
true_prob_errors = np.diagonal(np.take(Y_pred_errors, Y_true_errors, axis=1))

# Difference between the probability of the predicted label and the true label
delta_pred_true_errors = Y_pred_errors_prob - true_prob_errors

# Indices that sort the delta prob errors in ascending order
sorted_delta_errors = np.argsort(delta_pred_true_errors)

# Top 6 errors (largest gap between predicted and true-label probability)
most_important_errors = sorted_delta_errors[-6:]

# Show the top 6 errors
display_errors(most_important_errors, X_val_errors, Y_pred_classes_errors, Y_true_errors)
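
The `np.diagonal(np.take(...))` expression used above picks, for each misclassified image, the probability the model assigned to its true label. The sketch below, on a made-up 3x3 probability array, shows that direct integer indexing gives the same values without building the intermediate square matrix.

```python
import numpy as np

# Hypothetical predicted probabilities for 3 samples over 3 classes
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.2, 0.6]])
true_labels = np.array([2, 1, 0])

# Approach used in the notebook: reorder columns, then read the diagonal
via_take = np.diagonal(np.take(probs, true_labels, axis=1))

# Equivalent: pick element [i, true_labels[i]] for every row i
via_index = probs[np.arange(len(true_labels)), true_labels]

print(via_take)
print(via_index)
```

With many errors the indexing form avoids allocating an n x n array, although for a handful of misclassified digits the difference is negligible.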

In [None]:
# we can see that the errors made by the model make sense
# some of them could even be made by humans

# prediction of results
results = model.predict(test)

# selecting the index with max prob.
results = np.argmax(results, axis=1)

results = pd.Series(results, name ='Label')

In [None]:
submission = pd.concat([pd.Series(range(1, len(results) + 1), name = 'ImageId'), results], axis =1)

submission.to_csv('submission_five.csv', index=False)