# MNIST with a Convolutional Neural Network

In [1]:
# Larger CNN for the MNIST Dataset
import numpy
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.utils import np_utils
from keras import backend as K

Using TensorFlow backend.


## The next line tells Keras that the images will be presented as Tensors with the channels first.

Huh? Channels are the colour channels of an image (red, green and blue for a colour picture). That already hints that CNNs are mostly used on images. There are two natural ways of laying out a Tensor for an image. One is to put the channels first (e.g. 3 channels x 28 pixels tall x 28 pixels wide). The other is to put the channels last (28 x 28 x 3). TensorFlow defaults to channels last; Theano defaults to channels first. (It's sort of like the big-endian/little-endian wars.)

Keras just needs to know which way the raw data is stored so that it can reshape the input Tensors to whichever backend it is using.

Note: Our images are greyscale. So the Tensor is really 1 x 28 x 28 (one grey channel with values from 0 (black) to 255 (white)). Also, note that the Tensor is really 4D. The first dimension is the number of images in our dataset. So: # images in data x 1 x 28 x 28. 
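The two layouts can be illustrated with a minimal NumPy sketch. The array `batch` and its size of 4 images are made up for illustration; only the resulting shapes matter.

```python
import numpy as np

# Hypothetical batch of 4 greyscale 28 x 28 images (values are arbitrary).
batch = np.zeros((4, 28, 28), dtype="float32")

# Channels first: (images, channels, height, width).
channels_first = batch.reshape(batch.shape[0], 1, 28, 28)

# Channels last: (images, height, width, channels).
channels_last = np.moveaxis(channels_first, 1, -1)

print(channels_first.shape)  # (4, 1, 28, 28)
print(channels_last.shape)   # (4, 28, 28, 1)
```

Both arrays hold the same pixels; only the position of the channel axis differs.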

In [2]:
K.set_image_data_format('channels_first')

In [3]:
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

In [4]:
# load data
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# reshape to be [samples][channels][height][width]
X_train = X_train.reshape(X_train.shape[0], 1, 28, 28).astype('float32')
X_test = X_test.reshape(X_test.shape[0], 1, 28, 28).astype('float32')

## We normalize the input between 0 and 1

So the inputs are not zero-mean, but they are all scaled to the same 0 to 1 range.

In [5]:
# normalize inputs from 0-255 to 0-1
X_train = X_train / 255
X_test = X_test / 255
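As a quick check of what this scaling does, here is a small sketch on a made-up array of pixel values (the name `pixels` and its contents are assumptions for illustration, not MNIST data).

```python
import numpy as np

# Hypothetical pixel values covering the full 0-255 range.
pixels = np.array([0, 64, 128, 255], dtype="float32")

scaled = pixels / 255  # values now lie in [0, 1]

print(scaled.min(), scaled.max())  # 0.0 1.0
print(scaled.mean() > 0)           # True: rescaled, but not centred on zero
```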

## One-hot encoding

We just want to "one-hot" encode the output labels. So instead of the integers 0 through 9, we have label vectors [1,0,0,0,0,0,0,0,0,0] through [0,0,0,0,0,0,0,0,0,1]. Label 5, for example, becomes [0,0,0,0,0,1,0,0,0,0]: the 1 sits at index 5, counting from 0.

In [6]:
# one hot encode outputs
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
num_classes = y_test.shape[1]
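The idea behind one-hot encoding can be sketched in plain NumPy. This is an illustration of the concept, not the Keras implementation of `np_utils.to_categorical`; the helper `one_hot` and the sample `labels` are hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    # Row k of the identity matrix is the one-hot vector for label k.
    return np.eye(num_classes, dtype="float32")[labels]

labels = np.array([0, 5, 9])  # hypothetical digit labels
encoded = one_hot(labels, 10)

print(encoded[1])  # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```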

## This is our actual CNN

We first have a convolutional layer with 30 filters, each 5 x 5, and a ReLU activation. Then we do a max pooling on a 2 x 2 window. Then we have a second convolutional layer with 15 filters, each 3 x 3, again with a ReLU activation, followed by another max pooling on a 2 x 2 window. We then apply a 20% dropout to the pooled output. Then we flatten the feature maps to a single vector (now that the spatial convolutional filters have done their work). We run that into a 128-neuron layer (ReLU) and a 50-neuron layer (ReLU). The output is the 10-class softmax.

Here's a sample of another CNN to give you an idea of how the convolution works.

![CNN](http://7xo0y8.com1.z0.glb.clouddn.com/ml_concept%2Fconv.gif)

In [7]:
# define the larger model
def larger_model():
    # create model
    model = Sequential()
    
    # input_shape tells Keras/Tensorflow to expect a tensor input of 1 x 28 x 28
    model.add(Conv2D(30, (5, 5), input_shape=(1, 28, 28), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(15, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(num_classes, activation='softmax'))
    
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [8]:
# build the model
model = larger_model()

## Here are the model layers again.

Remember that "None" is a placeholder meaning "not known yet". It stands for the batch dimension: in the first layer (conv2d_1) the model doesn't know a priori how many images it will be given at once. It will just get a stream of image batches at runtime.

In the conv2d_1 layer, we've gone from a 28 x 28 image down to a 24 x 24 image thanks to the 5 x 5 convolutional filter (28 - 5 + 1 = 24). Then, the 2 x 2 max pooling halves each side of the image (12 x 12). The conv2d_2 layer then reduces it to 10 x 10 due to its 3 x 3 filter (12 - 3 + 1 = 10). And another 2 x 2 max pooling gets the image down to 5 x 5.

So we've gone from a 28 x 28 image of the number to a 5 x 5 representation of the same number. This is dimensionality reduction, loosely in the spirit of PCA. The 5 x 5 and 3 x 3 convolutional filters pick out edges and other basic shapes, and the 2 x 2 pooling windows shrink the result. So the network is in effect compressing the image by exploiting the spatial correlation between neighbouring pixels.

The final "flattened" image is actually 15 feature maps, each a 5 x 5 representation of the original 28 x 28 image. So 15 x 5 x 5 = 375 elements rather than the original 784 elements.
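The size arithmetic above can be reproduced with a few lines of Python. This sketch assumes the Keras defaults used by the model: no padding and a stride of 1 for the convolutions, and a pooling stride equal to the pool size. The helper names are hypothetical.

```python
def conv_output_size(size, kernel):
    """Output width/height of an unpadded, stride-1 convolution."""
    return size - kernel + 1

def pool_output_size(size, pool):
    """Output width/height of pooling whose stride equals the pool size."""
    return size // pool

size = 28
size = conv_output_size(size, 5)   # conv2d_1:        28 -> 24
size = pool_output_size(size, 2)   # max_pooling2d_1: 24 -> 12
size = conv_output_size(size, 3)   # conv2d_2:        12 -> 10
size = pool_output_size(size, 2)   # max_pooling2d_2: 10 -> 5

flattened = 15 * size * size       # 15 feature maps of 5 x 5
print(size, flattened)             # 5 375
```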

In [9]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 30, 24, 24)        780       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 30, 12, 12)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 15, 10, 10)        4065      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 15, 5, 5)          0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 15, 5, 5)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 375)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               48128     
__________

## How do they calculate the "Param #"?

### For dense layers

$$\text{Param #} = (\text{# input features} + 1) \times (\text{# output features})$$

So for dense_1 it is $(375 + 1) \times 128 = 48{,}128$ trainable parameters (i.e. weights + biases).<br>
For dense_2 it is $(128+1) \times 50 = 6{,}450$<br>
For dense_3 it is $(50+1) \times 10 = 510$

The +1 is the bias term (i.e. the y-intercept or offset): each output unit gets one bias in addition to its input weights.


### For convolutional layers

$$\text{Param #} = (\text{# input features} \times \text{filter width} \times \text{filter height} + 1) \times (\text{# output features})$$

So for conv2d_1 = $(1 \times 5 \times 5 + 1) \times 30 = 780$<br>
*In the first convolution layer, the input has only 1 depth channel (greyscale), so the number of input features is 1.*

For conv2d_2 = $(30 \times 3 \times 3 + 1) \times 15 = 4{,}065$ <br>
*Now the input is the 30 feature maps produced by the previous convolutional layer, so the number of input features is 30.*
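Both formulas are easy to check in code. The sketch below uses two hypothetical helper functions and the layer sizes from this model; it is a sanity check of the arithmetic, not how Keras computes the counts internally.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output unit of a dense layer."""
    return (n_in + 1) * n_out

def conv_params(channels_in, kernel_h, kernel_w, filters):
    """Weights plus one bias per filter of a 2D convolutional layer."""
    return (channels_in * kernel_h * kernel_w + 1) * filters

counts = {
    "conv2d_1": conv_params(1, 5, 5, 30),
    "conv2d_2": conv_params(30, 3, 3, 15),
    "dense_1": dense_params(375, 128),
    "dense_2": dense_params(128, 50),
    "dense_3": dense_params(50, 10),
}

for name, count in counts.items():
    print(name, count)

print(sum(counts.values()))  # 59933
```

The pooling, dropout and flatten layers have no trainable parameters, so they contribute nothing to the total.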

In [10]:
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=200)

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff38dff7550>

In [11]:
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Large CNN Error: %.2f%%" % (100-scores[1]*100))

Large CNN Error: 0.71%
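To put that percentage in concrete terms, here is a short sketch of the same arithmetic. The `accuracy` value is an assumption, back-calculated from the reported 0.71% error as a stand-in for `scores[1]`; the 10,000 test images come from the training log.

```python
accuracy = 0.9929   # assumed stand-in for scores[1], implied by the 0.71% error
n_test = 10000      # number of validation images reported during training

error_percent = 100 - accuracy * 100
misclassified = round(n_test * (1 - accuracy))

print("Large CNN Error: %.2f%%" % error_percent)  # Large CNN Error: 0.71%
print(misclassified)                              # 71
```

So under this assumption the network gets roughly 71 of the 10,000 test digits wrong.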


# So we've gone from about a 2% error to an error of less than 1%
