### **CNN model using Keras. Test set accuracy achieved: 99.685%.**
Hi all, I was fascinated by Michael Nielsen's book on neural networks some time ago, and since then I have been an ML enthusiast. :) Happy to share with you my first kernel on digit recognition.
* In this kernel I have used various ideas shared before on this platform. I am just returning the benefits I have earned here.

### Here we go !!
I have used a Convolutional Neural Network with pooling layers, batch normalisation and dropout, and the Adam optimizer to minimize the cost. I have also used data augmentation, which improved accuracy by at least 0.371%.
I have not used a learning rate decay function, nor any validation set to measure the accuracy. I trained my model on 42000 images and achieved an accuracy of 99.685% with data augmentation ON and 75 epochs. Without data augmentation I achieved a test set accuracy of 99.314% [here](https://www.kaggle.com/hengulkakaty/digit-recogniotion-in-cnn)

### Importing the required libraries.
* Keras is a library that gives us the freedom to call different functions to use ML algorithms. It is easy to learn, and the Keras documentation is easy to follow.
* matplotlib to plot some images and lines.
* Pandas to load the CSV files and do simple pre-processing.
* NumPy is an awesome mathematical library.

In [None]:
import numpy as np
from keras import layers
from keras.layers import Input, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D
from keras.layers import AveragePooling2D, MaxPooling2D, Dropout, GlobalMaxPooling2D, GlobalAveragePooling2D
from keras.models import Model
from keras.preprocessing import image
from keras.utils import layer_utils
from keras.utils.data_utils import get_file
from keras.applications.imagenet_utils import preprocess_input
import pydot
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model
from keras.utils import to_categorical
import matplotlib.pyplot as plt 
import pandas as pd
np.set_printoptions(suppress=True)


### Loading the train and test sets. Paths are set by default if you use a Kaggle kernel

In [None]:
df1 = pd.read_csv('../input/train.csv')
df2 = pd.read_csv('../input/test.csv')

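Here we are just converting the data into arrays. In the training data the first column holds the labels and the remaining columns hold the pixel values; the test data has pixel values only.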

In [None]:
train_data = np.array(df1)
test_data  = np.array(df2)

X_train_orig    = np.asarray( train_data[:, 1::], dtype=np.float32 )
Y_train_orig    = np.asarray( train_data[:, 0], dtype=np.float32 )

X_test_orig     = np.asarray( test_data[:, 0::], dtype=np.float32 )

### Normalizing the input.
The pixel values in our training set range from 0 to 255. So we divide each input by 255 so that the values range from 0 to 1. This helps improve accuracy and speeds up convergence.

In [None]:
X_train = X_train_orig / 255.
X_test = X_test_orig / 255.
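
To make the scaling concrete, here is a minimal sketch on a tiny made-up array (the values are hypothetical and not taken from the dataset). It only shows that dividing by 255 maps the 8-bit range onto 0 to 1 and keeps the `float32` type.

```python
import numpy as np

# Hypothetical 2x3 "image" of raw 8-bit pixel intensities (not from the dataset).
raw_pixels = np.array([[0, 64, 128],
                       [191, 254, 255]], dtype=np.float32)

# Dividing by 255 maps the range [0, 255] onto [0, 1].
scaled_pixels = raw_pixels / 255.0

print(scaled_pixels.min(), scaled_pixels.max())
print(scaled_pixels.dtype)
```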

In [None]:
X_train.shape

### Reshaping the arrays into the desired format.
In Keras we need to reshape the arrays into [number of images, image height, image width, number of channels].
* We have 42000 examples in our training set.
* Each image is 28x28 pixels.
* Each image has only 1 channel because we only have greyscale images. This value would have been 3 for a colour image, which has three channels, best known as the RGB channels.

So our training set becomes [42000, 28, 28, 1].

In [None]:
X_train = X_train.reshape(-1,28,28,1)

X_test = X_test.reshape(-1,28,28,1)

Y_train = Y_train_orig.reshape(-1, 1)

Y_train = to_categorical(Y_train, num_classes=10)
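
As a small illustration of what the reshape does, the sketch below uses two hypothetical flat rows of 784 values instead of the real data. It assumes NumPy's default row-major order, so the first 28 values of a row become the first image row.

```python
import numpy as np

# Two hypothetical flattened "images" of 784 values each (not the real data).
flat = np.arange(2 * 784, dtype=np.float32).reshape(2, 784)

# -1 lets NumPy infer the number of images; the last axis is the single grey channel.
images = flat.reshape(-1, 28, 28, 1)

print(images.shape)
# Flat position 28 of the first image lands at row 1, column 0.
print(images[0, 1, 0, 0])
```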

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size = 0.2)

### Let's see the dimensions of each array
So far we have the following arrays:
* X_train contains the pixel values of the training images, i.e. the 80% of the 42000 labelled images that remain after the split above. Our model will train on this set and learn to recognise digits.
* Y_train contains the label of each corresponding image.
* X_val and Y_val contain the held-out 20% of the labelled images and their labels. We use them for validation.
* X_test contains the pixel values of each image from the 28000-image test set. We will test our model on this set. It does not have labels; we will predict them.

Let's look:

In [None]:
print ("number of training examples = " + str(X_train.shape[0]))
print ("number of test examples = " + str(X_test.shape[0]))
print ("X_train shape: " + str(X_train.shape))
print ("Y_train shape: " + str(Y_train.shape))
print ("X_test shape: " + str(X_test.shape))

### Let's look into our data.
Now that our data has been pre-processed, let's look at it once before proceeding to train our model.
We have already turned Y_train into categorical values. This means that each label has been turned into a vector of 10 values. If our label is 1 then our vector will be [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]: all zeros except position 1 (in Python, counting always starts from 0). This process is known as one-hot encoding. We had to do this because we have 10 different classes to predict: the digits 0, 1, 2, ..., 9.
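
Here is a minimal NumPy sketch of the same idea, using a few hypothetical labels. It is not how `to_categorical` is implemented; it just indexes the rows of an identity matrix, which gives the same kind of vectors.

```python
import numpy as np

# Hypothetical digit labels, standing in for a few entries of the label column.
labels = np.array([1, 0, 9, 4])

# Row k of the 10x10 identity matrix is the one-hot vector for class k.
one_hot = np.eye(10, dtype=np.float32)[labels]

print(one_hot[0])                  # one-hot vector for the label 1
print(np.argmax(one_hot, axis=1))  # argmax recovers the integer labels
```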

In [None]:
plt.figure(figsize=(10,10))   # to fix a shape for each image print
for ax in range(10):          # using a for loop to display a number of images
    plt.subplot(5, 5, ax+1) # we need to use this function to print an array of pictures 
    plt.imshow(X_train[ax,:,:,0],cmap=plt.cm.binary) # this will call the images from train set one by one
    print(" Ground Truth Label = ", Y_train[ax,:])   # Lets also look into the labels 
    plt.axis('off')                                  # axis has been turned off to have clear view

## Data Augmentation:
#### Let us do some data augmentation and see what happens.
In data augmentation we rotate the images by up to 15 degrees, zoom by up to 10%, and shift the width and height by up to 10% of the total size.
These transformations are applied randomly to the images in the data set.
* We achieved 99.314 % accuracy without Data Augmentation.
* And with Data Augmentation we achieved 99.685%. 

In [None]:
from keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(
        rotation_range=15,  # randomly rotate images in the range (degrees, 0 to 180)
        zoom_range = 0.1, # Randomly zoom image 
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1)  # randomly shift images vertically (fraction of total height)

datagen.fit(X_train)
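
To get a feeling for what one of these transformations does, here is a simplified NumPy sketch of a random shift of up to 10%. It is only an illustration and not what `ImageDataGenerator` does internally: the helper `shift_image` is hypothetical, it fills the uncovered border with zeros, and it leaves out rotation, zoom and interpolation.

```python
import numpy as np

def shift_image(image, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, filling the uncovered area with zeros."""
    height, width = image.shape
    shifted = np.zeros_like(image)
    # Source and destination windows that stay inside the image bounds.
    src_rows = slice(max(0, -dy), min(height, height - dy))
    dst_rows = slice(max(0, dy), min(height, height + dy))
    src_cols = slice(max(0, -dx), min(width, width - dx))
    dst_cols = slice(max(0, dx), min(width, width + dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted

rng = np.random.default_rng(0)
image = np.zeros((28, 28), dtype=np.float32)
image[10:18, 12:16] = 1.0    # hypothetical blob standing in for a digit

max_shift = int(0.1 * 28)    # 10% of 28 pixels, rounded down
dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
augmented = shift_image(image, int(dy), int(dx))

print("shift:", int(dy), int(dx), "pixel sum kept:", augmented.sum() == image.sum())
```

Because the blob sits well inside the frame, a shift of at most 2 pixels moves it without cutting anything off, so the label stays valid for the shifted image.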

## Build the CNN model:
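### [Conv2D(64 filters, 3x3 kernel, stride 1, padding same) ---> BatchNorm ---> ReLU] * 2 ---> MaxPool ---> Dropout(0.35) ---> [Conv2D(128 filters, 3x3 kernel, stride 1, padding same) ---> BatchNorm ---> ReLU] * 2 ---> MaxPool ---> Dropout(0.35) ---> Conv2D(256 filters, 3x3 kernel, stride 1, padding same) ---> BatchNorm ---> ReLU ---> Dropout(0.35) ---> Flatten ---> Dense(1000) ---> Dropout(0.5) ---> Dense(256) ---> Softmax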

Let's try to define each term above in simple language:
* Convolution is a process of looking at the image part by part, where a part is a small window. In our case the window is of size 3x3. A single convolutional layer uses many such 3x3 windows (for example, 64 in the first layer of the model below); each one is called a filter. Filters are good at detecting edges, whether vertical or horizontal. More filters mean the model can pick up more kinds of edges and patterns.

* Pooling slides a window over the image just like a convolutional layer, but here we simply take the maximum or the average of all the numbers inside the window. A pooling layer has no parameters to learn, so gradient descent has nothing to update in it.
Why do we use pooling? There is no complete theory, but it shrinks the feature maps, which reduces computation, and in many experiments it has been found to work really well. The dimension formulas are the same as for a convolutional layer. Padding is rarely used with pooling.

* Padding means adding an extra border of pixels (usually zeros) around our image. This helps in keeping our dimensions correct. When we do convolution or pooling the dimensions of the picture shrink, but by padding the border of the picture we can retain its dimensions.

* How dimensions are affected: if the image size is n x n, the window size is f x f (3 x 3 in our case) and we use padding p, then our output image size will be:
    (n + 2p - f + 1)  x  (n + 2p - f + 1)
We also use padding when we need to retain the input size of the image. To do this (with a stride of 1) we choose the padding as p = (f - 1) / 2; f is almost always odd.

* What is stride? With a stride of 2 we move the window 2 steps at a time while doing convolution. With a stride of s the output image size is: (floor((n + 2p - f) / s) + 1)  x  (floor((n + 2p - f) / s) + 1)

* Batch normalization is used to normalize the activations. Normally we normalize the input values; with batch normalization we do the same thing to the activations. Simply put, we normalize in the hidden layers too. This helps improve accuracy and speeds up the training process.

* Dropout is a cool technique where we switch off some neurons at random during training so that the other neurons cannot rely on them. Basically we set the outputs of randomly chosen neurons to zero, which helps against overfitting and improves generalization. An overfitting model is a model whose training accuracy is high but whose test accuracy is poor. Generalization means how well our model predicts on unseen data.

* At last we flatten the output of the convolutional layers to get a vector. Think of a vector as a single column in a spreadsheet. We do this so that we can feed the data to fully connected (FC) layers. Here we use two FC layers of 1000 neurons and 256 neurons.

* To get the output we use a softmax function. It turns the 10 outputs of the last layer into probabilities, one for each of the 10 classes, that sum to 1.
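
The dimension formulas above can be checked with a few lines of plain Python. This is only a sketch; the helper names `conv_output_size` and `same_padding` are made up for this example and are not part of Keras.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output width (or height) for a convolution or pooling window.

    n: input size, f: window size, p: padding on each side, s: stride.
    """
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Padding that keeps the size unchanged for stride 1 and odd f."""
    return (f - 1) // 2

print(conv_output_size(28, 3))                     # 3x3 window, no padding
print(conv_output_size(28, 3, p=same_padding(3)))  # 3x3 window, 'same' padding
print(conv_output_size(28, 2, s=2))                # 2x2 pooling with stride 2
```

With no padding a 28x28 image shrinks to 26x26, with p = 1 it stays 28x28, and a 2x2 pooling window with stride 2 halves it to 14x14.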

## So that's it. Now let's build our model in Keras.
In Keras there are two ways to build a model: the Sequential model and the functional API. I think the Sequential model is a bit more popular, but here we will use the functional API.

In [None]:
def Keras_Model(input_shape):    
    
    X_input = Input(input_shape)
    
    # First Convolutional Layer
    X = Conv2D(64, (3, 3), strides = (1, 1), padding = 'same', name = 'conv0')(X_input) 
    X = BatchNormalization(axis = 3, name = 'bn0')(X)
    X = Activation('relu')(X)
    
    # Second Convolutional Layer
    X = Conv2D(64, (3, 3), strides = (1, 1), padding = 'same', name = 'conv1')(X) 
    X = BatchNormalization(axis = 3, name = 'bn1')(X)
    X = Activation('relu')(X)
    
    # First Pooling Layer
    X = MaxPooling2D((2, 2), name='max_pool_1')(X)
                           
    X = Dropout(0.35)(X)
    
    # Third Convolutional Layer
    X = Conv2D(128, (3, 3), strides = (1, 1), padding = 'same', name = 'conv2')(X) 
    X = BatchNormalization(axis = 3, name = 'bn2')(X)
    X = Activation('relu')(X)
    
    # Fourth Convolutional Layer
    X = Conv2D(128, (3, 3), strides = (1, 1), padding = 'same', name = 'conv3')(X) 
    X = BatchNormalization(axis = 3, name = 'bn3')(X)
    X = Activation('relu')(X)
    
    # Second Pooling Layers
    X = MaxPooling2D((2, 2), name='max_pool_2')(X)                       
                           
    X = Dropout(0.35)(X)     
        
    
    # Fifth Convolutional Layer
    X = Conv2D(256, (3, 3), strides = (1, 1), padding = 'same', name = 'conv4')(X) 
    X = BatchNormalization(axis = 3, name = 'bn4')(X)
    X = Activation('relu')(X)

    X = Dropout(0.35)(X) 
 
    # Flatten the data.
    X = Flatten()(X)
    # Dense Layer
    X = Dense(1000, activation='relu', name='fc0')(X)
    X = Dropout(0.5)(X) 
    X = Dense(256, activation='relu', name='fc2')(X)
    
    # Using softmax function to get the output
    X = Dense(10, activation='softmax', name='fc3')(X)
    
    model = Model(inputs = X_input, outputs = X, name='model')
    
    return model
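
To see how the image size changes on its way through this model, here is a small pure-Python trace. It is a sketch that assumes what the layer arguments above suggest: 'same' padding with stride 1 keeps the spatial size, and each 2x2 max pooling layer halves it. It does not call Keras; `Keras_Model.summary()` would report the real shapes.

```python
# Spatial size and channel count after each stage, starting from the 28x28x1 input.
size = 28
channels = 1
stages = [("conv0", 64), ("conv1", 64), ("max_pool_1", None),
          ("conv2", 128), ("conv3", 128), ("max_pool_2", None),
          ("conv4", 256)]

for name, filters in stages:
    if filters is None:
        size = size // 2      # 2x2 max pooling with stride 2 halves the size
    else:
        channels = filters    # 'same' padding with stride 1 keeps the size
    print(f"{name:10s} -> ({size}, {size}, {channels})")

flattened = size * size * channels
fc0_params = flattened * 1000 + 1000   # weights plus biases of Dense(1000)
print("flattened length:", flattened)
print("parameters in fc0:", fc0_params)
```

Most of the parameters of the network sit in the first dense layer, because it connects every flattened value to each of its 1000 neurons.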

### Let's call the above function with our input shape, which we take from X_train.

In [None]:
Keras_Model = Keras_Model(X_train.shape[1:4])

## Adam optimizer:
There are several optimizers to choose from; this time I preferred the Adam optimizer. It is widely used and converges fast. We set the learning rate to 0.0001.

In [None]:
from keras.optimizers import Adam
epochs = 100
batch_size = 64
lrate = 0.0001
decay = lrate/epochs
optimizer = Adam(lr=lrate, epsilon=1e-08, decay = 0.00)

We need to compile the model before training. This step is specific to Keras. We use the categorical cross-entropy loss because we have one-hot encoded our labels.

In [None]:
Keras_Model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

In [None]:
from keras.callbacks import ReduceLROnPlateau
learning_rate_reduction = ReduceLROnPlateau(monitor='val_acc', 
                                            patience=2, 
                                            verbose=1, 
                                            factor=0.5, 
                                            min_lr=0.0000001)

In [None]:
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping
early_stopping = EarlyStopping(monitor='val_loss', patience=3, verbose=1, mode='auto')

## Train the Model
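Let's train the model. The batch size (64) and the maximum number of epochs (100) were set above, and early stopping ends training once the validation loss stops improving. To train on augmented images, the batches have to come from the data generator (`datagen.flow`, passed to `fit_generator` in older Keras versions); the call below fits on the arrays directly. I have trained the model for 75 epochs, beyond which I believe the model will overfit the data, though you can try different numbers of epochs. I have tried different batch sizes, but this barely affects the accuracy; batch size mainly affects the speed of training.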

In [None]:
history = Keras_Model.fit(x=X_train, y=Y_train,
                          batch_size=batch_size,
                          epochs=epochs, verbose=1,
                          validation_data=(X_val, Y_val),
                          steps_per_epoch=None, validation_steps=None,
                          callbacks=[learning_rate_reduction, early_stopping])

In [None]:
preds = Keras_Model.evaluate(X_train, Y_train)
print ("Loss = " + str(preds[0]))
print ("Train set Accuracy = " + str(preds[1]))

## Train Accuracy:

I have achieved almost the same training accuracy with or without data augmentation, but the model with data augmentation generalizes better. Please see the kernel without data augmentation [here](http://www.kaggle.com/hengulkakaty/digit-recogniotion-in-cnn?scriptVersionId=6090971)

In [None]:
preds = Keras_Model.evaluate(X_val, Y_val)
print ("Loss on Val set= " + str(preds[0]))
print ("Val set Accuracy = " + str(preds[1]))

#### History is used to find out the accuracy and loss on the train as well as the validation set.

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
val_loss = history_dict['val_loss']
val_acc = history_dict['val_acc']
loss = history_dict['loss']
acc = history_dict['acc']
epochs = range(1,len(history_dict['val_loss'])+1)

In [None]:
plt.plot(epochs, acc, 'b-', label='Training accuracy')        # blue line
plt.plot(epochs, val_acc, 'r-', label='Validation accuracy')  # red line
plt.title('Accuracy of Model')
plt.xlabel('epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

In the above graph we can see that our validation accuracy (red line) is slightly higher than the training accuracy, so we can expect good generalisation.

In [None]:
plt.plot(epochs, loss, 'b-', label='Training loss')        # blue line
plt.plot(epochs, val_loss, 'r-', label='Validation loss')  # red line
plt.title('loss function')
plt.xlabel('epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

The graph above shows the loss on the train and validation sets. The red line shows that the loss on the validation set is lower than the training loss.

### Let's see where our model is going wrong.
* STEP 1:  We will predict on the validation set, convert the probabilities to integer labels using argmax, and store them in a variable Y_val_pred_label.
* STEP 2:  We will recover the integer labels from the ground truth labels of the validation set as well and store them in a variable Y_val_True_label.
* STEP 3:  In the ideal case the predicted label and the true label are equal. If they are different, that is our point of concern.
* STEP 4:  When there is a difference we will check the probability. If the model is wrong in predicting the label and the probability of the prediction is quite high, we will inspect those instances.
* STEP 5:  We will print the corresponding images using X_val, together with the predicted label and true label, to get an idea of how our model is going wrong.

In [None]:
# STEP 1:
predicted_val_probability = Keras_Model.predict(X_val, batch_size=32)
Y_val_pred_label = np.argmax(predicted_val_probability, axis = 1)

In [None]:
# STEP 2:
Y_val_True_label = np.argmax(Y_val, axis = 1)

In [None]:
# STEP 3: collect the indices where the predicted label differs from the true label
j = []
for i in range(len(Y_val_pred_label)):
    if Y_val_pred_label[i] != Y_val_True_label[i]:
        j.append(i)

In [None]:
# STEP 4: 
l = []
for ele in j:
    if np.max(predicted_val_probability[ele]) > 0.95:
        l.append(ele)
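
Steps 1 to 4 can also be written without loops using NumPy boolean masks. The sketch below is an alternative formulation, not the code used in this kernel, and it runs on small hypothetical probabilities so that it is self-contained.

```python
import numpy as np

# Hypothetical predicted probabilities for 4 samples and 3 classes.
probs = np.array([[0.97, 0.02, 0.01],
                  [0.10, 0.85, 0.05],
                  [0.01, 0.01, 0.98],
                  [0.60, 0.30, 0.10]])
true_labels = np.array([1, 1, 2, 0])    # hypothetical ground truth

pred_labels = np.argmax(probs, axis=1)  # most likely class per row
confidence = np.max(probs, axis=1)      # probability of that class

wrong = pred_labels != true_labels
# Indices that are both misclassified and predicted with high confidence.
confident_wrong = np.flatnonzero(wrong & (confidence > 0.95))

print(confident_wrong)
```

Only the first sample is both wrong and confident here, so it is the only index left for inspection.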

In [None]:
# STEP 5: 
c = 1
plt.figure(figsize=(10,10))   # to fix a shape for each image print
for ax in l[:30]:             # the 6x5 grid below holds at most 30 images
    plt.subplot(6,5, c) # we need to use this function to print an array of pictures 
    plt.imshow(X_val[ax,:,:,0],cmap=plt.cm.binary) # this will show the images from the validation set one by one
    plt.title('True label:{}\nPredicted label:{}'.format(Y_val_True_label[ax],Y_val_pred_label[ax]))  # Lets also look into the labels 
    plt.axis('off') 
    c = c+1

From the above picture it is very clear that some of the data is ambiguous. Even a human can go wrong in these kinds of cases. So we can conclude that our model is doing pretty well at predicting handwritten digits.

## Prediction on Test set:
Let's look at our predictions. Remember that we have one-hot encoded our labels, so here we get 10 different values (one probability per class) for each test set example. In the next step we will find the highest probability in each row.

In [None]:
classes = Keras_Model.predict(X_test, batch_size=32)

We use the argmax function to find the position of the highest probability in each row; axis=1 does the trick here.

In [None]:
class_test_set = np.argmax(classes, axis = 1)
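
Here is a small NumPy sketch of what softmax and argmax do together. The `softmax` helper and the raw scores are hypothetical and only for illustration; in the model the softmax is applied inside the last `Dense` layer.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical raw scores for 2 examples and 3 classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 3.0]])

probabilities = softmax(logits)
predicted = np.argmax(probabilities, axis=1)  # index of the largest value in each row

print(probabilities.sum(axis=1))  # each row sums to 1
print(predicted)
```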

## Let's prepare the submission CSV file as required and commit this kernel to see how we have done.

In [None]:
prediction = pd.DataFrame()
prediction['ImageId'] = np.asarray(range(1,28001))
prediction['Label'] = class_test_set

prediction.to_csv('submission.csv', index = False)

## Thank You
Thanks to Kaggle for giving us this platform to connect with some wonderful people around the globe. Also, thanks a lot to Mr. Andrew Ng, who has motivated millions of people, including me, to learn and practice deep learning.
* Please do vote and comment on this kernel. Namaste !!