# Discover the power of Deep Learning

In this tutorial, you'll discover how to use "deep learning" (DL) to classify handwritten digits, ranging from 0 to 9. The dataset is quite famous: it's called 'MNIST' (http://yann.lecun.com/exdb/mnist/). It was put up by a French guy who is very famous in the DL community: Yann LeCun, now both head of the Facebook AI Research program and a professor at New York University (you may want to search for his exact title there :p ).

I invite you to explore what MNIST really looks like (class distribution, pixel distribution...).  
Luckily for you, I managed to be organised this time, and you may find [this notebook](https://nbviewer.jupyter.org/github/marc-moreaux/Deep-Learning-classes/blob/master/notebooks/dataset_MNIST.ipynb) useful.

Remember logistic regression? I also happen to have a notebook about it [here](https://nbviewer.jupyter.org/github/marc-moreaux/Deep-Learning-classes/blob/master/notebooks/Logistic_regression.ipynb). It's all done with Keras and might help you.

### Let's load the data

In [1]:
import keras
from keras import models
from keras import layers
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# One-hot encode the labels (Ys) and flatten each 28 x 28 image (Xs) into a 784-vector
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)
x_train = x_train.reshape(-1, 28 * 28)
x_test = x_test.reshape(-1, 28 * 28)

Using TensorFlow backend.
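
To make the two preprocessing steps above concrete, here is a minimal pure-Python sketch of what they do to a single example. The helper names `one_hot` and `flatten` are made up for this illustration and are not part of Keras; the real cell relies on `keras.utils.to_categorical` and NumPy's `reshape`.

```python
def one_hot(label, num_classes=10):
    """Return a list with a 1.0 at position `label` and 0.0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def flatten(image):
    """Turn a 2-D image (a list of rows) into one flat list of pixels."""
    return [pixel for row in image for pixel in row]


# A fake 28 x 28 "image" filled with zeros, standing in for one MNIST digit.
fake_image = [[0] * 28 for _ in range(28)]

encoded = one_hot(3)
flat = flatten(fake_image)

print(encoded)    # the digit 3 becomes a vector of length 10
print(len(flat))  # 28 * 28 pixels in a single row
```

The one-hot form is what lets the 10 outputs of the network be compared, class by class, with the label.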


### WTF are we going to do in this notebook?
OK, if you've made it to this line, I expect you to know what MNIST is, and its associated classification task. If you didn't get the task, you can google something like: "what is the classification task of MNIST".

Perfect. So, we want to classify MNIST. To do so, we'll use a neural network!
The neural net will be as follows:
- It takes as input a batch of shape (32, 28 * 28)
- Then has 3 fully connected layers (also called 'Dense' layers) of 64 units each, with ReLU activations.
- And finishes with a 10-dimensional dense layer (whose outputs should be interpreted as probabilities, i.e. they sum to one)
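
The last bullet says the 10 outputs should behave like probabilities. The `softmax` activation used in the model definition below is one common way to get this. Here is a small pure-Python sketch of it; the scores are made-up values, and this is only an illustration of the formula, not Keras' actual implementation.

```python
import math


def softmax(scores):
    """Map raw scores to positive values that sum to one."""
    # Subtracting the max does not change the result but avoids overflow.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Ten made-up scores, one per digit class (illustrative values only).
scores = [0.5, -1.0, 2.0, 0.0, 0.3, -0.7, 1.2, 0.1, -2.0, 0.9]
probs = softmax(scores)

predicted_digit = probs.index(max(probs))
print(sum(probs))       # very close to 1.0
print(predicted_digit)  # index of the largest score
```

The predicted class is simply the position of the largest probability.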


In [2]:
# Model definition
model = models.Sequential()
model.add(layers.Dense(64, input_shape=(28 * 28,), activation='relu'))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

# Optimizer and loss definition (the loss is computed from softmax probabilities, not logits)
sgd = keras.optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd,
      loss='categorical_crossentropy',
      metrics=['accuracy'])
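
To get a feel for the size of this network, the number of trainable parameters can be counted by hand. The sketch below assumes the usual dense-layer layout (one weight per input/unit pair plus one bias per unit); `dense_params` is a helper invented for this example. In Keras itself, `model.summary()` should report the same information.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


# Layer sizes copied from the model definition: 784 -> 64 -> 64 -> 64 -> 10.
sizes = [28 * 28, 64, 64, 64, 10]
per_layer = [dense_params(a, b) for a, b in zip(sizes, sizes[1:])]
total_params = sum(per_layer)

print(per_layer)
print(total_params)
```

Most of the parameters sit in the first layer, because it is the one connected to the 784 input pixels.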

# TRAIN !!
Fit the model you've created on your data...

In [3]:
model.fit(x_train, y_train, batch_size=32, epochs=5, verbose=1, validation_split=0.3)

Train on 42000 samples, validate on 18000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f96e8454a90>

### Ok... That's bad!
We've reached 10% accuracy, which is what pure random guessing gives: it's very bad! Worse, I can tell you that your classifier isn't converging, not even to random guessing! For a random guess, your loss should be $-\ln(1/10) = \ln(10) \approx 2.3$. From experience, I can tell you that your gradient step is too large...
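
The reference value of about 2.3 is the categorical cross-entropy of a classifier that gives every digit the same probability. A quick check in plain Python (the variable names are chosen for this example only):

```python
import math

num_classes = 10

# A classifier that "knows nothing" gives every digit the same probability.
uniform_prediction = [1.0 / num_classes] * num_classes
true_class = 7  # arbitrary; the result is the same for any class

# Categorical cross-entropy for one example: minus the log of the
# probability assigned to the correct class.
loss = -math.log(uniform_prediction[true_class])

print(round(loss, 4))
```

A training loss that stays well above this value means the network is doing worse than a uniform guess.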

You can try changing the `lr` parameter in `keras.optimizers.SGD` (yes, try this!). What happens?
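
As background on what a "too large gradient step" means in general, here is a toy example that has nothing to do with MNIST: plain gradient descent on $f(x) = x^2$. The function `gradient_descent` and the two learning rates are picked purely for illustration.

```python
def gradient_descent(lr, steps=20, start=1.0):
    """Minimise f(x) = x**2 (gradient 2 * x) and return the final x."""
    x = start
    for _ in range(steps):
        x = x - lr * 2 * x
    return x


small_step = gradient_descent(lr=0.1)  # each step multiplies x by 0.8
large_step = gradient_descent(lr=1.1)  # each step multiplies x by -1.2

print(abs(small_step))  # shrinks toward the minimum at 0
print(abs(large_step))  # grows at every step: the iterates diverge
```

With the small step the iterate approaches the minimum; with the large one it overshoots further at every update.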

### LR wasn't the true culprit...
Gradient descent algorithms work much better on normalized datasets... Yet our dataset is poorly scaled:

In [4]:
x_mean, x_std = x_train.mean(), x_train.std()
print(x_mean, x_std)

(33.318421449829934, 78.567489983397977)


In [5]:
# Normalize the dataset such that mean=0 and std=1
x_train = (x_train - x_mean) / x_std
x_test = (x_test - x_mean) / x_std
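
The same standardization can be checked on a tiny example. This is a pure-Python sketch on made-up pixel values, using the `statistics` module instead of NumPy; like the cell above, it uses one global mean and one global standard deviation for all pixels.

```python
import statistics

# A handful of made-up pixel intensities in the 0-255 range.
pixels = [0, 0, 0, 12, 128, 255, 255, 0, 64, 30]

mean = statistics.fmean(pixels)
std = statistics.pstdev(pixels)  # population std, like NumPy's default

normalized = [(p - mean) / std for p in pixels]

print(statistics.fmean(normalized))   # approximately 0
print(statistics.pstdev(normalized))  # approximately 1
```

Note that the notebook reuses the training-set statistics for `x_test`, so both sets go through exactly the same transformation.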

Let's re-train...

In [6]:
sgd = keras.optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd,
      loss='categorical_crossentropy',
      metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=5, verbose=1, validation_split=0.3)

Train on 42000 samples, validate on 18000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f96e8454bd0>

# YEAH !!!
Try to beat this!!
- you may use an optimizer other than SGD
- you may add regularization to the layers
- you may try dropout
- you may use batch normalization
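
If dropout is new to you, here is a rough pure-Python sketch of the idea, in its "inverted" form: during training, each activation is zeroed with probability `rate` and the survivors are scaled up so the expected value stays the same. The `dropout` function and its inputs are invented for this illustration; in Keras you would add a `layers.Dropout` layer to the model instead.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate`,
    and scale the survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)        # fixed seed so the run is repeatable
activations = [1.0] * 10_000  # made-up layer output
dropped = dropout(activations, rate=0.5, rng=rng)

kept_fraction = sum(1 for d in dropped if d != 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)

print(kept_fraction)  # close to 0.5
print(mean_after)     # close to the original mean of 1.0
```

Dropout is only applied while training; at prediction time all units are kept.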