# COMPARING SGD, RMSprop AND Adam FOR DIGIT RECOGNITION USING A CNN:

* Stochastic Gradient Descent 
* RMSprop
* Adam

Let's see how well these optimisers help in predicting digits.

# Content
1. Understanding dataset
2. Checking null values
3. Intuition about dimensions used for CNN model
4. Normalization
5. Looking at some image examples
6. Imbalanced Class
7. One-Hot-Encoding
8. Using Keras for CNN
9. Defining different optimizers
10. Training CNN
11. Visualisation and insights from them
12. Conclusion


## Import libraries
Let's import some basic libraries first

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
%matplotlib inline

np.random.seed(2)

from sklearn.model_selection import train_test_split
import itertools

from keras.utils import to_categorical # convert to one-hot-encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop, Adam, SGD
from keras.callbacks import History 




**Let's read train and test values from the dataset.**

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

**OK! Now it's time to see what our dataset looks like. We'll randomly sample 10 rows from both train and test.**

In [None]:
train.sample(10)

# Understanding the dataset

* Each data point in our train has pixel values ranging from pixel0 to pixel783. 

* <img src="https://vignette.wikia.nocookie.net/vampirediaries/images/c/ca/But-why-meme-generator-but-why-84103d.jpg/revision/latest?cb=20130811194815" width="400px"/>


* We have 28 by 28 pixels for each data point or image in our dataset.
So 28 * 28 = 784. We'll see later in the notebook how to convert this into our desired shape so that it becomes an input for our ConvNet.

* And obviously the label column is the true label for that particular data point.

Let's look at our test dataset too!

In [None]:
test.sample(10)

So, as expected, the test dataset has only the grayscale pixel values and no label. Obviously, because the label is what we want to predict!

Let's see the shape (dimensions) of both the train and test datasets.

In [None]:
train.shape

In [None]:
test.shape

## Checking null values

* We have 42000 data points in our train data set and 28000 in our test data set!

* Next, let's look for any null values. 
* We will check null values in both train and test data set   
<img src="https://pics.me.me/obviously-31892135.png" width="400px"/>
 


In [None]:
train.isnull().any().describe()

In [None]:
test.isnull().any().describe()

## Woah! What a perfect day! No null values! ;)

<img src="https://memegenerator.net/img/instances/48506203/wow.jpg" width="200px"/>


## X_train and Y_train

Let's separate our train dataset into X_train and Y_train

What are X_train and Y_train??

* X_train -> all our training examples without the true label, with shape (num_examples, num_pixels)
* Y_train -> the true label for each corresponding row of X_train, with shape (num_examples,)

In total, we have 42000 images in train.

In [None]:
Y_train = train["label"]
X_train = train.drop(labels="label",axis=1)

## Dimensions of X_train and Y_train

Let's check the shape of our X_train and Y_train!

### Any guess?
<img src="http://4.bp.blogspot.com/-aE-06rRzYZE/UWByCA1jelI/AAAAAAAA228/wDrgQIdFZHw/s1600/Any%2Bguesses%2Bas%2Bto%2Bwho%2Btrumps%2BPalestinians%2Bin%2Blibeling%2BIsrael.jpg" width="200px"/>

In [None]:
print (X_train.shape)
print (Y_train.shape)

Cool!! Our X_train is now 42000 by 784, i.e. 42000 data points with 784 pixels each (the label column is dropped).

And of course Y_train is just an array of the true labels of the 42000 images.

# Intuition for Reshaping dataset for our CNN model

We've said so much about the data, but have we seen what these grayscale images look like??

<img src="https://image.spreadshirtmedia.com/image-server/v1/mp/designs/1014185688,width=178,height=178/who-cares.png" width="200px"/>

Well, all of us care! 

But first let's convert our X_train into the shape (m, pixel, pixel, channels)
* m : number of data points
* pixel : our image is 28 by 28 pixels, so pixel is 28 in our case
* channels : as these are grayscale images, we have a single channel. channels would be 3 for RGB images!

Our X_train has 42000 data points and test has 28000. As you may have predicted, the final shapes to be fed into the model are as follows:
* X_train = (42000,28,28,1)
* test = (28000,28,28,1)
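
The target shape can be tried out on a tiny stand-in array before touching the real data. This is only a sketch: `flat` is a made-up array of two flattened images, and it assumes the 784 pixels are stored row by row (which is what a plain `reshape` expects).

```python
import numpy as np

# Hypothetical stand-in for two flattened images: each row holds 784 pixel values.
flat = np.arange(2 * 784).reshape(2, 784)

# -1 lets NumPy infer the number of images; the trailing 1 is the grayscale channel.
images = flat.reshape(-1, 28, 28, 1)
print(images.shape)  # (2, 28, 28, 1)

# Row-major layout: flat pixel k lands at row k // 28, column k % 28.
k = 60
row, col = divmod(k, 28)
assert images[0, row, col, 0] == flat[0, k]
```

So `pixel60` of an image ends up at row 2, column 4 of the 28 by 28 grid.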

# Normalization

Just one last step before we reshape our data! We have to normalize our X_train and test

<img src="https://vignette.wikia.nocookie.net/vampirediaries/images/c/ca/But-why-meme-generator-but-why-84103d.jpg/revision/latest?cb=20130811194815" width="400px"/>

## Because

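As we will use a CNN, and these models usually train better when the inputs are in [0,1], we divide our pixel values by 255 (the maximum grayscale value).

But why normalise the test data too??

Why not? The model must see test inputs on the same scale as the training inputs. We can't compare oranges with apples, right?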
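
A minimal sketch of the scaling step, using a made-up `pixels` array of 8-bit values rather than the real DataFrame:

```python
import numpy as np

# Hypothetical 8-bit grayscale values, like those stored in the CSV columns.
pixels = np.array([0, 64, 128, 255], dtype=np.uint8)

# Dividing by the float 255.0 promotes to floating point and maps 0..255 onto 0..1.
scaled = pixels / 255.0
print(scaled.min(), scaled.max())  # 0.0 1.0
```

Dividing by `255.0` rather than the integer `255` makes the intent (a float result) explicit.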

In [None]:
X_train = X_train/255.0
test = test/255.0

## Reshape
Awesome! Let's reshape X_train and test now!

In [None]:
X_train = X_train.values.reshape(-1,28,28,1)
test = test.values.reshape(-1,28,28,1)

In [None]:
#Confirming the X_train shape we earlier predicted
print (X_train.shape)

#confirming the test shape we earlier predicted
print(test.shape)

# Looking at our digit images

Too much reshaping, normalizing and everything. Where is the image???????

OK! OK! Now let's look at the first six images as subplots! We will also set each title to the true label, which is stored in the corresponding Y_train entry.

In [None]:
nrows = 2
ncols = 3
i = 0
fig, ax = plt.subplots(nrows,ncols,sharex=True,sharey=True)
for row in range(nrows):
    for col in range(ncols):
        ax[row,col].imshow(X_train[i][:,:,0])
        ax[row,col].set_title("True label :{}".format(Y_train[i]))
        i += 1


# Imbalanced Class
Good going!! There are the digits. So what next?

Before going further and training our model, we need to check one of the biggest factors in any classification problem: class imbalance. Let's draw a countplot and see whether we have imbalanced classes here.

In [None]:
sns.countplot(x=Y_train)

<img src="https://memegenerator.net/img/instances/48506203/wow.jpg" width="200px"/>

What a day!! It looks like we have pretty balanced classes.

No need to apply any resampling techniques. We are free to go forward! Let's see the counts for each digit anyway!

In [None]:
Y_train.value_counts()
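
If you prefer a single number over eyeballing the plot, the ratio between the rarest and the most common class is a quick check. The sketch below uses a tiny made-up `labels` array, not the real Y_train, and `balance_ratio` is just an illustrative name.

```python
import numpy as np

# Hypothetical integer labels standing in for Y_train (digits 0-9).
labels = np.array([0, 1, 1, 2, 2, 2, 3, 3, 9, 9])

# np.bincount counts the occurrences of each non-negative integer label.
counts = np.bincount(labels, minlength=10)

# Ratio of the rarest to the most common class among the classes that occur;
# values near 1.0 suggest balanced classes.
present = counts[counts > 0]
balance_ratio = present.min() / present.max()
print(counts, balance_ratio)
```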

# one-hot-encoding
Applying one-hot encoding to Y_train (this is a multi-class problem, so the output layer will use a softmax transformation)

In [None]:
##### num_classes = 10 because we have 10 classes from 0 to 9
Y_train = to_categorical(Y_train, num_classes=10)
#also let's look at our modified Y_train for the  1st 6 images displayed above. Remember: index starts from 0
Y_train[0:6]
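
To see what one-hot encoding does without Keras, here is a NumPy-only sketch. It is meant to mirror the idea behind `to_categorical`, not its implementation, and the `labels` array is made up.

```python
import numpy as np

labels = np.array([1, 0, 4])   # hypothetical digit labels
one_hot = np.eye(10)[labels]   # row i of the identity matrix encodes class i
print(one_hot[0])              # 1.0 at index 1, zeros elsewhere

# argmax along the class axis recovers the original labels.
recovered = one_hot.argmax(axis=1)
```

The same `argmax` trick is what turns softmax outputs back into digit predictions later on.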

# Splitting train and validation dataset

<img src="https://medicaldialogues.in/wp-content/uploads/2017/12/phew.jpg" width="200px"/>

Phew!! We are done with all the preprocessing and now we have data ready to be fed into our models!! Are you ready??

As we have a balanced dataset, it's OK to split it randomly. We use 90% of the data for training and 10% for validation. We'll judge our model on this validation data and keep our test data hidden away! Let's set the random seed to 2. Feel free to change it if you need.

In [None]:
random_seed = 2

In [None]:
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size = 0.1, random_state=random_seed)
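
Conceptually, a random split is just a shuffle followed by a cut. The helper below, `random_split`, is a hypothetical illustration of that idea; it is not how `train_test_split` is implemented and it will not reproduce the same indices.

```python
import numpy as np

def random_split(n_samples, val_fraction=0.1, seed=2):
    """Return shuffled (train_idx, val_idx) index arrays."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)           # shuffled indices 0..n_samples-1
    n_val = int(round(n_samples * val_fraction))
    return order[n_val:], order[:n_val]

train_idx, val_idx = random_split(42000)
print(len(train_idx), len(val_idx))  # 37800 4200
```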

# Convolutional Neural Networks

Let's look at LeNet-5 architecture

<img src="https://indoml.files.wordpress.com/2018/03/lenet-52.png?w=736" width="600px"/>


Time to use all our deep learning knowledge and ask Keras to help us implement it faster!!

I highly recommend reading the [keras](http://keras.io/) documentation if you're new to it.

P.S. - The model below is inspired by LeNet-5; it is not an exact copy of it.

In [None]:
model = Sequential()

model.add(Conv2D(filters = 16, kernel_size = (5,5),padding = 'Same', 
                 activation ='relu', input_shape = (28,28,1)))
model.add(MaxPool2D(pool_size=(2,2)))


model.add(Conv2D(filters = 32, kernel_size = (3,3),padding = 'Same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))

model.add(Flatten())
model.add(Dense(128, activation = "relu"))

model.add(Dense(128, activation = "relu"))

model.add(Dense(10, activation = "softmax"))

#Note: I haven't used any regularisation yet! Let's see how well our model does without regularisation such as dropout. We can always iterate later :)
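
It helps to trace the tensor shapes and parameter counts through the layers by hand. The sketch below does the arithmetic in plain Python; the helper names (`conv_same`, `pool2`, `dense`) are made up for illustration, and it assumes stride-1 'same' convolutions and 2x2 pooling with stride 2, as in the model above.

```python
def conv_same(h, w, c_in, filters, k):
    # 'same' padding with stride 1 keeps height and width unchanged.
    params = k * k * c_in * filters + filters  # weights plus one bias per filter
    return (h, w, filters), params

def pool2(h, w, c):
    # 2x2 max pooling with stride 2 halves height and width; it has no parameters.
    return (h // 2, w // 2, c)

def dense(n_in, n_out):
    # Fully connected layer: one weight per input-output pair plus one bias per output.
    return n_out, n_in * n_out + n_out

shape, p1 = conv_same(28, 28, 1, 16, 5)
shape = pool2(*shape)
shape, p2 = conv_same(*shape, 32, 3)
shape = pool2(*shape)
flat = shape[0] * shape[1] * shape[2]
n, p3 = dense(flat, 128)
n, p4 = dense(n, 128)
n, p5 = dense(n, 10)
total = p1 + p2 + p3 + p4 + p5
print(shape, flat, total)  # (7, 7, 32) 1568 223690
```

Most of the parameters sit in the first dense layer, which is usually where regularisation such as dropout is tried first. `model.summary()` should report the same numbers.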

In [None]:
#To keep training time short, I've used 10 epochs. 20 epochs seems to work a bit better! Try changing it to 20 or even 30 for better accuracy
epochs = 10
batch_size = 100

# Defining optimizers
Here come the optimisers: we'll define 3 of them

* optimizerSGD - Stochastic Gradient Descent optimiser
* optimizerRMSprop - RMSprop optimiser
* optimizerAdam - Adaptive Moment Estimation (Adam) optimiser

Click [here](http://ruder.io/optimizing-gradient-descent/) to learn more about these optimisers

In [None]:
optimizerSGD = SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False)
optimizerAdam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0, amsgrad=False)
optimizerRMSprop = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
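
To get a feel for how the three update rules differ, here is a sketch of one step of each on a single scalar parameter. These are the textbook forms with the same hyperparameters as above; the function names are made up, and the actual Keras implementations may differ in small details such as where epsilon is added.

```python
import numpy as np

def sgd_step(w, g, lr=0.01):
    # Plain SGD: move against the gradient.
    return w - lr * g

def rmsprop_step(w, g, v, lr=0.001, rho=0.9, eps=1e-8):
    # Running average of squared gradients scales the step; no bias correction.
    v = rho * v + (1 - rho) * g ** 2
    return w - lr * g / (np.sqrt(v) + eps), v

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Running averages of the gradient (m) and squared gradient (v).
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# One step on f(w) = w**2 starting from w = 1.0, where the gradient is 2 * w.
w0 = 1.0
g0 = 2 * w0
w_sgd = sgd_step(w0, g0)
w_rms, _ = rmsprop_step(w0, g0, v=0.0)
w_adam, _, _ = adam_step(w0, g0, m=0.0, v=0.0, t=1)
print(w_sgd, w_rms, w_adam)
```

On the very first step Adam moves by roughly `lr` regardless of the gradient's magnitude, while RMSprop, lacking bias correction, moves by roughly `lr / sqrt(1 - rho)`.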


In [None]:
history = History() #optional: model.fit() already returns a History object holding the per-epoch metrics, which is what we use below

# Training Model
Let's compile and train our model separately for each optimizer. The metrics are stored in the History object returned by each fit() call, which we can use later for comparison.

In [None]:
#training using SGD
model.compile(optimizer = optimizerSGD , loss = "categorical_crossentropy", metrics=["accuracy"])
historySGD = model.fit(X_train, Y_train, batch_size = batch_size, epochs = epochs, 
         validation_data = (X_val, Y_val), verbose = 2)

In [None]:
resultsSGD = model.predict(test)

In [None]:
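#training using RMSprop
#re-create the model with freshly initialised weights; otherwise training would continue from the SGD-trained weights
from keras.models import clone_model
model = clone_model(model)
model.compile(optimizer = optimizerRMSprop , loss = "categorical_crossentropy", metrics=["accuracy"])
historyRMSprop = model.fit(X_train, Y_train, batch_size = batch_size, epochs = epochs, 
         validation_data = (X_val, Y_val), verbose = 2)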

In [None]:
resultsRMSProp = model.predict(test)

In [None]:
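#training using Adam
#re-create the model with freshly initialised weights; otherwise training would continue from the previously trained weights
from keras.models import clone_model
model = clone_model(model)
model.compile(optimizer = optimizerAdam , loss = "categorical_crossentropy", metrics=["accuracy"])
historyAdam = model.fit(X_train, Y_train, batch_size = batch_size, epochs = epochs, 
         validation_data = (X_val, Y_val), verbose = 2)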

In [None]:
resultsAdam = model.predict(test)

# Storing accuracy metrics
Time to compare training accuracy and validation accuracy for the model trained with each optimiser.

But where are the accuracy values stored??

We have the per-epoch history of all 3 runs in history<optimizer_name>. Let's retrieve those values and compare!

The predictions on the test set were saved in results<optimizer_name> right after each run, so that we can build a submission from the best optimiser at the end.

In [None]:
SGD_acc = historySGD.history['acc']
SGD_val_acc = historySGD.history['val_acc']
RMSprop_acc = historyRMSprop.history['acc']
RMSprop_val_acc = historyRMSprop.history['val_acc']
Adam_acc = historyAdam.history['acc']
Adam_val_acc = historyAdam.history['val_acc']
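
One caveat: the key names depend on the Keras version. Older releases store the metric under `'acc'` / `'val_acc'`, while newer ones use `'accuracy'` / `'val_accuracy'`. A small helper can handle both; `get_accuracy` below is a hypothetical name, and the two dictionaries are made-up examples shaped like `History.history`, not results from this notebook.

```python
def get_accuracy(history_dict, prefix=""):
    """Return the accuracy list stored under 'acc' or 'accuracy'."""
    for key in (prefix + "acc", prefix + "accuracy"):
        if key in history_dict:
            return history_dict[key]
    raise KeyError("no accuracy metric found in history")

# Hypothetical history dictionaries, for illustration only.
old_style = {"acc": [0.90, 0.95], "val_acc": [0.91, 0.94]}
new_style = {"accuracy": [0.90, 0.95], "val_accuracy": [0.91, 0.94]}

print(get_accuracy(old_style), get_accuracy(new_style, prefix="val_"))
```

With this helper, `get_accuracy(historySGD.history)` would work whichever key is present.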

# Visualization

In [None]:
plt.plot(SGD_acc)
plt.plot(RMSprop_acc)
plt.plot(Adam_acc)
plt.legend(['SGD', 'RMSprop', 'Adam'], loc='lower right')
plt.title('Training accuracy: SGD vs RMSprop vs Adam')
plt.show()

# INSIGHTS FROM TRAINING ACCURACY

### Things to note:

*  In this run, Adam clearly performs best among the three.

*  We also observe that RMSprop started with a lower accuracy for the first two epochs. Bias correction of the running averages (which Adam applies and plain RMSprop does not) may help in these early steps, but RMSprop then learns fast and competes closely with Adam in the later epochs. So we can get away without bias correction here (it is still recommended).

* Stochastic Gradient Descent keeps learning and improves with every epoch. However, it's still far from competing with Adam or RMSprop.



In [None]:
plt.plot(SGD_val_acc)
plt.plot(RMSprop_val_acc)
plt.plot(Adam_val_acc)
plt.legend(['SGD', 'RMSprop', 'Adam'], loc='lower right')
plt.title('Validation accuracy: SGD vs RMSprop vs Adam')
plt.show()

# INSIGHTS FROM VALIDATION ACCURACY

### Validation accuracy is the most important part because it measures how well our model works on unseen data points.
I am always excited to analyse validation accuracy or validation error (1 - validation accuracy). It tells us how much room there is to improve our model further!!

### Things to note:

* Stochastic Gradient Descent works well but is still far behind Adam and RMSprop.

* It's interesting to compare Adam with RMSprop: RMSprop seems to perform better from around the 6th epoch. However, Adam wins at the end of the 10th epoch.

* As RMSprop performs better for a considerable stretch of training, it's worth increasing the number of epochs and analysing the performance again. You can go up and change epochs to 20 or 30, and feel free to experiment further.






## Overfitting
It's crucial to check whether our model is overfitting. One way to think about it: if your model works very well on the training data but noticeably worse on the validation data, it is probably overfitting.

How to **overcome** overfitting?
* Try adding **regularisation**, for example dropout (note that the Keras `Dropout` layer takes `rate`, the fraction of units to drop, i.e. 1 - keep_prob)
* Look for the factors that may be causing errors (sometimes by manually **observing** the misclassified data points)

**Note**: Different techniques against overfitting can be used depending on the problem; feel free to google and learn about them
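
To make the `rate` versus keep_prob distinction concrete, here is a NumPy sketch of inverted dropout. The `dropout` function below is a hypothetical helper for illustration, not the Keras layer, although the Keras layer follows the same idea of rescaling the kept units during training.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if rate == 0.0:
        return x
    keep_prob = 1.0 - rate
    mask = rng.random_sample(x.shape) < keep_prob  # True where a unit is kept
    return x * mask / keep_prob                    # rescale so the expected value is unchanged

rng = np.random.RandomState(2)
activations = np.ones((4, 5))
dropped = dropout(activations, rate=0.25, rng=rng)
print(dropped)
```

Every entry of `dropped` is either 0 (dropped) or 1 / 0.75 (kept and rescaled). At prediction time dropout is switched off, which corresponds to `rate=0.0` here.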

### Let's check whether the overfitting symptom (training performance much better than validation performance) holds in our case

In [None]:
fig, ax = plt.subplots(1,3,sharex=True,sharey=True,figsize=(15, 5))
ax[0].plot(SGD_acc)
ax[0].plot(SGD_val_acc)
ax[0].legend(['SGD_train','SGD_val'], loc='lower right')
ax[0].set_title("SGD")

ax[1].plot(RMSprop_acc)
ax[1].plot(RMSprop_val_acc)
ax[1].legend(['RMSprop_train','RMSprop_val'], loc='lower right')
ax[1].set_title("RMSprop")

ax[2].plot(Adam_acc)
ax[2].plot(Adam_val_acc)
ax[2].legend(['Adam_train','Adam_val'], loc='lower right')
ax[2].set_title("Adam")

# INSIGHTS

* It's pretty clear that performance on the training data is better than on the validation data. This is normal, but the gap between the two could be reduced either with regularisation techniques or by adding more training data
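
The gap can also be put into numbers instead of being read off the plots. The sketch below uses made-up accuracy lists purely for illustration (they are not results from this notebook), and `generalization_gap` is a hypothetical helper name.

```python
def generalization_gap(train_acc, val_acc):
    """Per-epoch difference between training and validation accuracy."""
    return [t - v for t, v in zip(train_acc, val_acc)]

# Hypothetical per-epoch accuracies, for illustration only.
train_acc = [0.90, 0.96, 0.99]
val_acc = [0.91, 0.95, 0.96]

gaps = generalization_gap(train_acc, val_acc)
widening = gaps[-1] > gaps[0]  # a gap that grows over the epochs hints at overfitting
print(gaps, widening)
```

In the notebook you could call it as `generalization_gap(Adam_acc, Adam_val_acc)` and likewise for the other two optimisers.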

# CONCLUSION
* With the hyperparameter values used here, Adam seems the better option as of now. But feel free to increase the number of epochs, and maybe you will see RMSprop working better
* Tuning hyperparameters like the number of epochs, batch_size, etc. may result in finding better models.
* For this kernel, let's submit the results from the model trained with the Adam optimizer.

In [None]:
results = np.argmax(resultsAdam,axis=1)
results = pd.Series(results,name="Label")
submission = pd.concat([pd.Series(range(1,28001),name = "ImageId"),results],axis = 1)

In [None]:
submission.to_csv("submission.csv",index=False)
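
What the submission cells do can be seen on a tiny example: take the most probable class per image and pair it with an ImageId that starts at 1. The `probs` array below is a made-up stand-in for the softmax outputs in resultsAdam.

```python
import numpy as np

# Hypothetical softmax outputs for three images over ten classes.
probs = np.full((3, 10), 0.05)
probs[0, 7] = probs[1, 2] = probs[2, 0] = 0.55

labels = probs.argmax(axis=1)              # most probable digit per image
image_ids = np.arange(1, len(labels) + 1)  # ImageId starts at 1, not 0
rows = list(zip(image_ids.tolist(), labels.tolist()))
print(rows)  # [(1, 7), (2, 2), (3, 0)]
```

Deriving the ids from `len(labels)` avoids hard-coding the number of test images.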

I hope this kernel **helped** you gain some **insights** about **optimizers**. Feel free to fork it and experiment further. Also, vote if you like the kernel.

Thank You!!

<img src="https://albertonrecord.co.za/wp-content/uploads/sites/35/2018/04/thank-you-185078737_76252.jpg" width="300px"/>

