# Convolutional Neural Networks: In-Class Codealong with MNIST Classification

This code is based on the example given by Keras creator François Chollet [here](https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py).

In this run-through, we're going to work with the MNIST dataset, which contains images of 70,000 handwritten digits. Like Iris or Boston Housing, it is one of the core datasets for learning about artificial neural networks. Convolutional neural networks have proven to be among the most effective methods for image processing tasks, so we're going to work through classifying these handwritten digits with a CNN in Keras.

**A bit on Keras:** Keras is an API that runs on top of the machine learning libraries Theano and TensorFlow. For context, it's a bit like sklearn for neural networks.

In [4]:
from keras.datasets import mnist  # for loading the dataset
from keras.utils.np_utils import to_categorical 
from keras.models import Sequential
from keras.layers import Activation, Dropout, Flatten, Dense
from keras.layers.convolutional import Convolution2D, MaxPooling2D

from keras import backend as K
K.set_image_dim_ordering('th') #tells Keras to expect the depth axis at index 1 of the input_dimension tuple.

First, we can download the data with Keras. 

In [24]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
n_train, height, width = X_train.shape
n_test, _, _ = X_test.shape

The dataset contains 60,000 28x28 grayscale training images and 10,000 28x28 grayscale test images.

### Preprocessing

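We need just a few preprocessing steps to get the data into the format the network expects: add a channel axis to the images, scale the pixel values to the range [0, 1], and one-hot encode the labels.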

In [25]:
X_train = X_train.reshape(n_train, 1, height, width).astype('float32')
X_test = X_test.reshape(n_test, 1, height, width).astype('float32')

In [26]:
X_test.shape
y_test.shape

(10000,)

In [27]:
X_train /= 255
X_test /= 255
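
The reshape and scaling steps above can be checked on a small stand-in array. This is a minimal sketch using NumPy only; `fake_images` is a hypothetical substitute for `X_train`, so the snippet runs without downloading MNIST.

```python
import numpy as np

# Hypothetical stand-in for X_train: 4 grayscale images, 28x28, uint8 pixels in [0, 255]
rng = np.random.RandomState(0)
fake_images = rng.randint(0, 256, size=(4, 28, 28)).astype('uint8')

n, height, width = fake_images.shape

# Insert a channel axis at index 1 (channels-first, matching the 'th' ordering)
# and convert to float32 so the division below yields fractional values
x = fake_images.reshape(n, 1, height, width).astype('float32')

# Scale pixel intensities from [0, 255] to [0, 1]
x /= 255

print(x.shape)  # (4, 1, 28, 28)
```

Scaling the inputs to a small, consistent range generally makes gradient-based training better behaved than feeding in raw 0 to 255 intensities.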

In [28]:
n_classes = 10

In [29]:
y_train = to_categorical(y_train, n_classes)
y_test = to_categorical(y_test, n_classes)
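
To see what one-hot encoding does to the labels, here is a minimal NumPy sketch of the same idea. It is not how `to_categorical` is implemented, and the `labels` array is a made-up example.

```python
import numpy as np

n_classes = 10
labels = np.array([5, 0, 4, 1, 9])  # hypothetical example labels

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix by the labels encodes them all at once
one_hot = np.eye(n_classes, dtype='float32')[labels]

print(one_hot.shape)  # (5, 10)
print(one_hot[0])     # 1.0 at index 5, zeros elsewhere
```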

### Architecture of CNNs 

Recall, the general architecture of a convolutional neural network is: 
- convolution layers, followed by pooling layers
- fully-connected layers
- a final fully-connected softmax layer

Keras gives us one of the easiest ways to define an artificial neural network. To get started, we first instantiate a **sequential model in Keras**, meaning that layers are stacked one after another.

### The Convolution Layers

General thoughts for constructing the convolution layer:
- The more complex the task, the more convolution layers we want in our network
- We don't want our window to be too large, or the resulting feature map will be too small and coarse to be useful.
- How large do we want our pooling to be? Approximately proportional to the size of the image.

In [11]:
# number of convolutional windows
n_filters = 16

# convolution window size
# i.e. we will use a n_conv x n_conv window
n_conv = 5

# pooling window size
# i.e. we will use a n_pool x n_pool pooling window
n_pool = 2
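
To make the window size concrete, here is a minimal NumPy sketch of a single filter applied as a narrow convolution. It is an illustration only, not what Keras runs internally: the `image` and the averaging `kernel` are made up, and the window is slid without flipping it (strictly a cross-correlation), which does not matter for the idea because real filter weights are learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image`, keeping only positions where it fits entirely."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # elementwise product of the window and the kernel, summed to one number
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

n_conv = 5
image = np.arange(28 * 28, dtype='float32').reshape(28, 28)        # hypothetical input
kernel = np.ones((n_conv, n_conv), dtype='float32') / n_conv ** 2  # simple averaging filter

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (24, 24)
```

A 5x5 window on a 28x28 image fits in 24 positions along each axis, which is why the output shrinks to 24x24.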

Now that we've set up these hyperparameters, we can begin adding layers to our network. We’re using only two convolutional layers because this is a relatively simple task. Generally for more complex tasks you may want more convolution layers to extract higher and higher level features.

We're going to be using ReLU (the rectified linear unit, `max(0, x)`) as our activation function.

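The particular pooling layer we’re using is a max pooling layer, which keeps only the largest activation in each pooling window. It can be thought of as a “feature detector”: it reports whether a feature showed up anywhere in the window.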
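
As a small illustration of max pooling, here is a NumPy sketch with non-overlapping windows. The helper name `max_pool_2d` and the example values are hypothetical, and the sketch assumes the feature map dimensions divide evenly by the pool size.

```python
import numpy as np

def max_pool_2d(feature_map, n_pool):
    """Non-overlapping n_pool x n_pool max pooling; assumes dimensions divide evenly."""
    h, w = feature_map.shape
    # split each axis into (number of windows, window size),
    # then take the maximum within each window
    windows = feature_map.reshape(h // n_pool, n_pool, w // n_pool, n_pool)
    return windows.max(axis=(1, 3))

feature_map = np.array([[1, 2, 0, 1],
                        [3, 4, 1, 0],
                        [0, 1, 9, 2],
                        [1, 0, 3, 8]])

pooled = max_pool_2d(feature_map, 2)
print(pooled)
# [[4 1]
#  [1 9]]
```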

In [12]:
model = Sequential()
model.add(Convolution2D(
        n_filters, n_conv, n_conv,

        # apply the window to only full parts of the image
        # (i.e. do not "spill over" the border)
        # this is called a narrow convolution
        border_mode='valid',

        # we have a 28x28 single channel (grayscale) image
        # so the input shape should be (1, 28, 28)
        input_shape=(1, height, width),
        
        activation="relu"))

# then we apply pooling to summarize the features
# extracted thus far
model.add(MaxPooling2D(pool_size=(n_pool, n_pool)))

# the second convolution layer infers its input shape from the
# previous layer, so input_shape is only needed on the first layer
model.add(Convolution2D(n_filters, n_conv, n_conv,
                        border_mode='valid',
                        activation="relu"))

model.add(MaxPooling2D(pool_size=(n_pool, n_pool)))
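
It can help to trace how the image size changes through these four layers. The following is a back-of-the-envelope sketch in plain Python, assuming narrow convolutions and non-overlapping pooling as configured above; the helper names are made up for illustration.

```python
def conv_valid_size(size, n_conv):
    # a narrow ('valid') convolution loses n_conv - 1 pixels along each axis
    return size - n_conv + 1

def pool_size_after(size, n_pool):
    # non-overlapping pooling divides each axis by the pool size (rounding down)
    return size // n_pool

n_filters, n_conv, n_pool = 16, 5, 2
size = 28

size = conv_valid_size(size, n_conv)   # 28 -> 24
size = pool_size_after(size, n_pool)   # 24 -> 12
size = conv_valid_size(size, n_conv)   # 12 -> 8
size = pool_size_after(size, n_pool)   # 8 -> 4

# number of values per image once the 16 feature maps are flattened
flattened = n_filters * size * size
print(flattened)  # 256
```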

### Dropout + the Softmax Output Layer

**Recall Dropout**:
- Dropout is a form of regularization for a neural network
- It essentially forces an artificial neural network to learn multiple independent representations of the same data by randomly disabling a different subset of neurons at each training step.
- The effect of this is that neurons are prevented from co-adapting too much, which makes overfitting less likely.
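
The points above can be sketched in a few lines of NumPy. This is the common "inverted dropout" formulation, written as an illustration under that assumption rather than as a copy of the Keras internals; `dropout_train` is a hypothetical helper name.

```python
import numpy as np

def dropout_train(x, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.binomial(1, keep_prob, size=x.shape)
    # dividing by keep_prob keeps the expected activation unchanged,
    # so nothing needs to be rescaled at test time
    return x * mask / keep_prob

rng = np.random.RandomState(0)
activations = np.ones(10000)
dropped = dropout_train(activations, rate=0.2, rng=rng)

print((dropped == 0).mean())  # fraction disabled, close to 0.2
print(dropped.mean())         # close to 1.0
```

At test time no units are dropped, so the layer simply passes its input through.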

In Keras terminology, the dense layer is simply a **regular fully connected layer** of a neural network.
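
As a rough picture of what "fully connected" means, here is a NumPy sketch of one dense layer with a ReLU activation. The weights are random placeholders, and the input size of 256 assumes the 16 feature maps of size 4x4 that the two convolution and pooling stages above produce for 28x28 images.

```python
import numpy as np

rng = np.random.RandomState(0)

n_inputs, n_units = 256, 128   # 256 = 16 filters x 4 x 4 after flattening
W = rng.normal(scale=0.05, size=(n_inputs, n_units))  # hypothetical weights
b = np.zeros(n_units)

x = rng.normal(size=(3, n_inputs))  # a batch of 3 flattened examples

# every output unit sees every input: a matrix product plus a bias, then ReLU
hidden = np.maximum(0, x.dot(W) + b)

print(hidden.shape)     # (3, 128)
print(W.size + b.size)  # 32896 trainable parameters
```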

In [13]:
model.add(Dropout(0.2))

# flatten the data for the 1D layers (These are the output layers)
model.add(Flatten())

# Dense Layer(n_outputs)
model.add(Dense(128, activation="relu"))
model.add(Dropout(0.2))

# the softmax output layer gives us a probability for each class
model.add(Dense(n_classes))
model.add(Activation('softmax'))
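
For intuition about the output layer, here is a minimal NumPy sketch of the softmax function. The `scores` are made-up raw outputs for three classes rather than values from our model.

```python
import numpy as np

def softmax(scores):
    # subtracting the row maximum avoids overflow and does not change the result
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1]])  # hypothetical raw outputs for 3 classes
probs = softmax(scores)

print(probs.sum(axis=-1))     # each row sums to 1
print(probs.argmax(axis=-1))  # [0], the class with the largest score
```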

### Compiling, Loss, and Optimizer

We tell Keras to compile the model using whatever backend we have configured. At this stage we specify the loss function we want to optimize. Here we’re using categorical cross-entropy, which is the standard loss function for multiclass classification.

We also specify the particular **optimization method** we want to use. An optimizer is one of the two arguments required for compiling a Keras model. You can either instantiate an optimizer before passing it to `model.compile()`, or you can refer to it by its name, as we do below. We've talked about plain vanilla stochastic gradient descent (available in Keras as `SGD`), but several variants developed in the past few years adapt the learning rate during training. One of these is Adam, introduced in 2014. We're going to use it here because it is one of the more recent and widely used of these variants. You can read more about it [here](http://sebastianruder.com/optimizing-gradient-descent/index.html#adam). Adam adapts the step size for each parameter using running averages of the gradients and of their squares, which often speeds up training.
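
To give a feel for what Adam does, here is a minimal sketch of its update rule applied to a single number rather than a network. The function name `adam_minimize` and the toy objective are assumptions for illustration; the default values of `beta1`, `beta2` and `eps` follow the commonly quoted settings.

```python
import math

def adam_minimize(grad_fn, x, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    m, v = 0.0, 0.0
    for t in range(1, n_steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g      # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g  # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# toy objective f(x) = x^2, whose gradient is 2x and whose minimum is at 0
x_final = adam_minimize(lambda x: 2 * x, x=5.0, lr=0.1, n_steps=500)
print(x_final)  # should end up close to 0
```

Dividing by the root of the squared-gradient average means parameters with consistently large gradients take smaller steps, and vice versa.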

The second required argument is the **loss function**, which here is the categorical cross-entropy described above.
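
As a sketch of what this loss measures, the NumPy function below computes categorical cross-entropy for one-hot labels. The example predictions are made up, and the clipping constant `eps` is an assumption added to avoid taking the log of zero.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # clip to avoid log(0); eps is an assumed small constant
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    # for one-hot targets only the probability given to the true class contributes
    return -np.sum(y_true * np.log(y_pred), axis=-1).mean()

y_true = np.array([[0, 0, 1], [1, 0, 0]], dtype='float32')  # one-hot labels
confident = np.array([[0.05, 0.05, 0.90], [0.80, 0.10, 0.10]])
unsure = np.array([[0.30, 0.30, 0.40], [0.40, 0.30, 0.30]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)

print(loss_confident)  # smaller: high probability on the correct classes
print(loss_unsure)     # larger: probability spread across classes
```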

In [14]:
model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy'] 
)

Now that we've set up and compiled our network, we can begin training it!

### Training the Network

In [32]:
# how many examples to look at during each training iteration
batch_size = 200

# how many times to run through the full set of examples
n_epochs = 8

# the training may be slow depending on your computer
model.fit(X_train,
          y_train,
          batch_size=batch_size,
          nb_epoch=n_epochs,
          validation_data=(X_test, y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x10e95ee50>
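
As a quick sanity check on the settings above, we can work out how many weight updates this training run performs. This is simple arithmetic in plain Python, assuming one update per mini-batch.

```python
import math

n_train = 60000
batch_size = 200
n_epochs = 8

# one weight update per mini-batch
updates_per_epoch = int(math.ceil(n_train / float(batch_size)))
total_updates = updates_per_epoch * n_epochs

print(updates_per_epoch)  # 300
print(total_updates)      # 2400
```

A smaller batch size would mean more, noisier updates per epoch; a larger one means fewer, smoother updates.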

### Evaluating the Network

We can then evaluate the model much like we would in sklearn: 

In [23]:
loss, accuracy = model.evaluate(X_test, y_test)
print('loss:', loss)
print('accuracy:', accuracy)

('loss:', 0.025685556081288995)
('accuracy:', 0.99160000000000004)


Look at that. That's a **99.2% classification accuracy** on unstructured data.
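
For reference, the accuracy metric is the fraction of images whose highest-probability class matches the true label. Here is a small NumPy sketch with made-up predictions for four images and three classes.

```python
import numpy as np

# hypothetical softmax outputs for 4 images over 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.2, 0.6]])
# one-hot labels
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

predicted = probs.argmax(axis=1)  # most probable class per image
actual = y_true.argmax(axis=1)    # index of the 1 in each one-hot row

accuracy = (predicted == actual).mean()
print(accuracy)  # 0.75
```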