# Working through MNIST end to end

MNIST is a great dataset for experimenting with and refining CNNs because

- training is very fast, since the images are only 28x28 greyscale,
- there are extensive benchmarks showing which approaches work best on MNIST.

In [1]:
from theano.sandbox import cuda
cuda.use('gpu2')

ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.


In [2]:
%matplotlib inline
import utils; reload(utils)
from utils import *
from __future__ import division, print_function

Using Theano backend.


## Setup

In [3]:
#batch_size=64
batch_size=4

### Load MNIST data

In [4]:
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

((60000, 28, 28), (60000,), (10000, 28, 28), (10000,))

### Pre-processing: 2 Steps

(1) Add an extra (singleton) channel dimension, because Keras expects each image to have a channel axis.

In [5]:
X_test = np.expand_dims(X_test,1)
X_train = np.expand_dims(X_train,1)

In [6]:
X_train.shape

(60000, 1, 28, 28)
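
As a small illustration of step (1), the sketch below applies the same `np.expand_dims` call to a toy array. The array `images` is a made-up stand-in for `X_train`, not the real data.

```python
import numpy as np

# Toy stand-in for X_train: 3 greyscale images of 28x28 pixels.
images = np.zeros((3, 28, 28), dtype=np.uint8)

# Insert a length-1 channel axis at position 1 (channels-first layout).
with_channel = np.expand_dims(images, 1)
print(with_channel.shape)  # (3, 1, 28, 28)
```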

In [7]:
y_train[:5]

array([5, 0, 4, 1, 9], dtype=uint8)

(2) One-hot encode the labels

In [8]:
y_train = onehot(y_train)
y_test = onehot(y_test)

In [9]:
y_train[:5]

array([[ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.]])
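
The `onehot` helper comes from the course's `utils` module, whose source is not shown here. The sketch below is only a guess at an equivalent, written in plain NumPy under the hypothetical name `onehot_sketch`; it reproduces the encoding shown in the output above.

```python
import numpy as np

def onehot_sketch(labels, num_classes=10):
    """Return a (len(labels), num_classes) float array with a 1 at each label index."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((labels.shape[0], num_classes))
    # Fancy indexing: row i gets a 1 in column labels[i].
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

encoded = onehot_sketch([5, 0, 4, 1, 9])
print(encoded.argmax(axis=1))  # [5 0 4 1 9]
```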

In [10]:
mean_px = X_train.mean().astype(np.float32)
std_px = X_train.std().astype(np.float32)

In [11]:
def norm_input(x): return (x-mean_px) / std_px
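
To see what `norm_input` does, here is a minimal sketch on made-up data. The array `fake_images` is a hypothetical stand-in for `X_train`; after normalization the pixel values should have a mean close to 0 and a standard deviation close to 1.

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical stand-in for X_train: random pixel values in [0, 255].
fake_images = rng.randint(0, 256, size=(100, 1, 28, 28)).astype(np.float64)

# A single mean and standard deviation over all pixels of all images.
mean_px = fake_images.mean()
std_px = fake_images.std()

def norm_input(x):
    return (x - mean_px) / std_px

normed = norm_input(fake_images)
print("mean %.3f, std %.3f" % (abs(normed.mean()), normed.std()))
```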

In [12]:
gen = image.ImageDataGenerator()
batches = gen.flow(X_train, y_train, batch_size=batch_size)
test_batches = gen.flow(X_test, y_test, batch_size=batch_size)

## Linear model

The linear model
- normalizes and flattens the input (treating each image as a single vector),
- has one Dense layer with 10 outputs.
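
As a rough picture of what this model computes, the sketch below does the forward pass by hand in NumPy: flatten, one matrix multiply plus bias, then softmax. The random weights and inputs are placeholders, not trained values, and the `softmax` helper is written here for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.RandomState(0)
x = rng.rand(4, 1, 28, 28)            # 4 fake images, standing in for normalized input
flat = x.reshape(len(x), -1)          # Flatten: (4, 784)
W = rng.randn(784, 10) * 0.01         # weights of a 10-output dense layer
b = np.zeros(10)                      # biases of that layer
probs = softmax(flat.dot(W) + b)      # (4, 10), each row sums to 1

n_params = W.size + b.size            # 784 * 10 + 10 = 7850
print(n_params)  # 7850
```

So the whole linear model has only 7,850 trainable parameters.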

In [12]:
def get_lin_model():
    model = Sequential([
        Lambda(norm_input, input_shape=(1,28,28)),
        Flatten(),
        Dense(10, activation='softmax')
        ])
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model

In [13]:
lm = get_lin_model()

### The best way to train a model

- Start with the default learning rate (LR = 0.001) and train for one epoch.
- Set the LR very high (LR = 0.1, the highest value used here) and train for another epoch.
- Then gradually reduce the LR by an order of magnitude at a time. For example:
    - Set the LR to 0.01 and train for a few epochs.
    - Set the LR to 0.001 and train for a few epochs.
    - ...
    - Keep going until the model starts overfitting.
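
The recipe above can be written down as a list of (learning rate, epochs) stages. This is only a sketch: `lr_stages`, `min_lr`, and `epochs_per_stage` are hypothetical names, and the fixed `min_lr` cutoff stands in for the real stopping rule, which is to watch validation accuracy and stop once the model overfits.

```python
def lr_stages(start_lr=0.001, high_lr=0.1, min_lr=0.0001, epochs_per_stage=4):
    """Yield (learning_rate, epochs) pairs following the recipe above."""
    yield start_lr, 1      # one epoch at the default rate
    yield high_lr, 1       # one epoch at the high rate
    lr = high_lr / 10.0
    # Drop by an order of magnitude per stage; the 0.999 factor tolerates float rounding.
    while lr >= min_lr * 0.999:
        yield lr, epochs_per_stage
        lr /= 10.0

stages = list(lr_stages())
for lr, epochs in stages:
    print("lr=%g for %d epoch(s)" % (lr, epochs))
```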

In [15]:
lm.fit_generator(batches, batches.N, nb_epoch=1,
                validation_data=test_batches, nb_val_samples=test_batches.N)

Epoch 1/1


<keras.callbacks.History at 0x7fb58e3d8ad0>

In [17]:
lm.optimizer.lr=0.1

In [18]:
lm.fit_generator(batches, batches.N, nb_epoch=1,
                validation_data=test_batches, nb_val_samples=test_batches.N)

Epoch 1/1


<keras.callbacks.History at 0x7fb58b003e90>

In [19]:
lm.optimizer.lr=0.01

In [20]:
lm.fit_generator(batches, batches.N, nb_epoch=4,
                validation_data=test_batches, nb_val_samples=test_batches.N)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fb5a6ade390>

## Single dense layer

Create a fully connected network with 1 hidden layer.

In [18]:
def get_fc_model():
    model = Sequential([
        Lambda(norm_input, input_shape=(1,28,28)),
        Flatten(),
        Dense(512, activation='relu'),  # hidden layer uses ReLU; softmax is kept for the output layer
        Dense(10, activation='softmax')
        ])
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model

In [19]:
fc = get_fc_model()

In [20]:
fc.fit_generator(batches, batches.N, nb_epoch=1,
                validation_data=test_batches, nb_val_samples=test_batches.N)

Epoch 1/1


<keras.callbacks.History at 0x7fdf384f4a10>

Train it the same way as the linear model: raise the learning rate, then reduce it.

In [None]:
fc.optimizer.lr=0.1

In [None]:
fc.fit_generator(batches, batches.N, nb_epoch=4,
                validation_data=test_batches, nb_val_samples=test_batches.N)

In [None]:
fc.optimizer.lr=0.01

In [None]:
fc.fit_generator(batches, batches.N, nb_epoch=4,
                validation_data=test_batches, nb_val_samples=test_batches.N)

## Basic 'VGG-style' CNN

Because VGG works quite well, we create an architecture in the same style, but much simpler because MNIST images are only 28x28.

VGG has a set of 3x3 convolutional layers followed by a MaxPooling layer, and then more such sets with twice as many filters.

**Intuition:**

- Two rounds of MaxPooling halve the spatial size twice: 28x28 to 14x14 to 7x7, if the convolutions are padded to preserve size. (Unpadded 3x3 convolutions also trim 2 pixels each, so the feature maps end up smaller.)
- At that point the feature maps are small, so I add my 2 Dense layers.
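
The size arithmetic can be checked with a few lines of plain Python. This sketch assumes stride-1 3x3 convolutions and 2x2 pooling; `conv_out` and `pool_out` are helper names made up for the illustration. It traces both cases: pooling alone (as if the convolutions were padded) and unpadded convolutions, which also shrink the image.

```python
def conv_out(size, kernel=3):
    """Output size of an unpadded convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output size of non-overlapping max pooling (floor division)."""
    return size // pool

layers = ["conv", "conv", "pool", "conv", "conv", "pool"]

padded = [28]     # convolutions keep the size; only pooling shrinks it
unpadded = [28]   # each 3x3 convolution also trims 2 pixels
for layer in layers:
    if layer == "pool":
        padded.append(pool_out(padded[-1]))
        unpadded.append(pool_out(unpadded[-1]))
    else:
        padded.append(padded[-1])
        unpadded.append(conv_out(unpadded[-1]))

print(padded)    # [28, 28, 28, 14, 14, 14, 7]
print(unpadded)  # [28, 26, 24, 12, 10, 8, 4]
```

With 64 filters in the last block, the flattened vector would hold 7 * 7 * 64 = 3136 values in the padded case and 4 * 4 * 64 = 1024 in the unpadded case.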

In [13]:
def get_model():
    model = Sequential([
        Lambda(norm_input, input_shape=(1,28,28)),
        Convolution2D(32, 3, 3, activation='relu'),
        Convolution2D(32, 3, 3, activation='relu'),
        MaxPooling2D(),
        Convolution2D(64, 3, 3, activation='relu'),
        Convolution2D(64, 3, 3, activation='relu'),
        MaxPooling2D(),
        Flatten(),
        Dense(512, activation='relu'),
        Dense(10, activation='softmax')
        ])
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model

In [14]:
model = get_model()

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=1, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

In [None]:
model.optimizer.lr=0.1

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=1, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

In [None]:
model.optimizer.lr=0.01

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=8, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

**Trick: Train until overfitting**

Because the model is capable of overfitting, I conclude that

- the model is complex enough to handle the data,
- so this is a good architecture to build on.

## Data augmentation

- Use the same model (architecture).
- Reduce overfitting, but reduce the complexity of the model no more than necessary.
- So, add a bit of data augmentation.
- Train it for a while.
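
To make the idea concrete, here is a deliberately simplified sketch of one kind of augmentation, a random shift. It is not how Keras's `ImageDataGenerator` is implemented (that applies combined transforms with its own fill behaviour); `random_shift` is a hypothetical helper that moves the image by a whole number of pixels and fills the gap with zeros.

```python
import numpy as np

def random_shift(img, max_frac=0.08, rng=np.random):
    """Shift a (rows, cols) image by up to max_frac of its size, filling with zeros."""
    rows, cols = img.shape
    max_dy = int(round(max_frac * rows))
    max_dx = int(round(max_frac * cols))
    dy = rng.randint(-max_dy, max_dy + 1)   # vertical shift in pixels
    dx = rng.randint(-max_dx, max_dx + 1)   # horizontal shift in pixels
    out = np.zeros_like(img)
    # Copy the part of the source that is still inside the frame after shifting.
    src = img[max(0, -dy):rows - max(0, dy), max(0, -dx):cols - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out

rs = np.random.RandomState(0)
img = np.zeros((28, 28))
img[10:18, 10:18] = 1.0                     # a fake "digit" in the middle
shifted = random_shift(img, rng=rs)
print(shifted.shape)  # (28, 28)
```

Each epoch then sees slightly different versions of the same digits, which makes it harder for the model to memorize the training set.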

In [15]:
model = get_model()

In [16]:
# data generator with data augmentation
gen = image.ImageDataGenerator(rotation_range=8, width_shift_range=0.08, shear_range=0.3,
                              height_shift_range=0.08, zoom_range=0.08)
batches = gen.flow(X_train, y_train, batch_size=batch_size)
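# validation data is left unaugmented so the validation metrics stay comparable
test_batches = image.ImageDataGenerator().flow(X_test, y_test, batch_size=batch_size)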

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=1, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

In [None]:
model.optimizer.lr=0.1

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=4, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

In [None]:
model.optimizer.lr=0.01

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=8, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

In [None]:
model.optimizer.lr=0.001

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=14, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

In [None]:
model.optimizer.lr=0.0001

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=10, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

By the end, it's overfitting again.

## Batchnorm + data augmentation

Data augmentation alone is not enough, so I add batchnorm.

- I use batchnorm on every layer.

**Notice:** when you use batchnorm on convolution layers, you have to add "axis=1".

To really understand batchnorm, understand why you need this here.
* Figure out why you need it by reading the documentation about batchnorm.
* We'll have a discussion about it on the forum.
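
As a starting point for that exercise, the sketch below normalizes fake convolution activations by hand. It assumes the channels-first layout used in this notebook, (batch, channels, rows, cols), where axis 1 indexes the channels, and it leaves out the learned scale and shift and the running averages that a real batchnorm layer also keeps.

```python
import numpy as np

rng = np.random.RandomState(0)
# Fake conv activations in channels-first layout: (batch, channels, rows, cols).
acts = rng.randn(8, 32, 26, 26) * 3.0 + 5.0

# One mean and one variance per channel: reduce over every axis except axis 1.
mean = acts.mean(axis=(0, 2, 3), keepdims=True)
var = acts.var(axis=(0, 2, 3), keepdims=True)
normed = (acts - mean) / np.sqrt(var + 1e-5)

print(mean.shape)  # (1, 32, 1, 1)
```

Each of the 32 filters gets its own statistics, shared across all images and all spatial positions, which is the behaviour one would usually want after a convolution.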

This model resembles a reasonably good modern network, with:
- convolution layers (3x3),
- batchnorm,
- MaxPooling,
- some Dense layers.

In [18]:
def get_model_bn():
    model = Sequential([
        Lambda(norm_input, input_shape=(1,28,28)),
        Convolution2D(32,3,3, activation='relu'),
        BatchNormalization(axis=1),
        Convolution2D(32,3,3, activation='relu'),
        MaxPooling2D(),
        BatchNormalization(axis=1),
        Convolution2D(64,3,3, activation='relu'),
        BatchNormalization(axis=1),
        Convolution2D(64,3,3, activation='relu'),
        MaxPooling2D(),
        Flatten(),
        BatchNormalization(),
        Dense(512, activation='relu'),
        BatchNormalization(),
        Dense(10, activation='softmax')
        ])
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model

In [19]:
model = get_model_bn()

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=1, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

In [None]:
model.optimizer.lr=0.1

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=4, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

In [None]:
model.optimizer.lr=0.01

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=12, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

In [None]:
model.optimizer.lr=0.001

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=12, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

By the end, it's overfitting. 
So I add dropout.

## Batchnorm + data augmentation + dropout

**The rule for dropout:** gradually increase it.

Try adding one dropout layer before the last layer.
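
For intuition about what a dropout rate of 0.5 does during training, here is a minimal NumPy sketch of the commonly used "inverted dropout" formulation. `dropout_train` is a hypothetical helper written for illustration, not the Keras layer itself: each activation is zeroed with probability `p`, and the survivors are scaled up so the expected value is unchanged.

```python
import numpy as np

def dropout_train(x, p=0.5, rng=np.random):
    """Zero each activation with probability p and rescale the survivors by 1/(1-p)."""
    keep = rng.uniform(size=x.shape) >= p
    return x * keep / (1.0 - p)

rs = np.random.RandomState(0)
acts = np.ones((1000, 512))                 # fake activations of a 512-unit layer
dropped = dropout_train(acts, p=0.5, rng=rs)

zero_fraction = float((dropped == 0).mean())
print("fraction zeroed: %.2f" % zero_fraction)
```

At test time no units are dropped; because of the rescaling during training, no extra correction is needed then.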

In [20]:
def get_model_bn_do():
    model = Sequential([
        Lambda(norm_input, input_shape=(1,28,28)),
        Convolution2D(32,3,3, activation='relu'),
        BatchNormalization(axis=1),
        Convolution2D(32,3,3, activation='relu'),
        MaxPooling2D(),
        BatchNormalization(axis=1),
        Convolution2D(64,3,3, activation='relu'),
        BatchNormalization(axis=1),
        Convolution2D(64,3,3, activation='relu'),
        MaxPooling2D(),
        Flatten(),
        BatchNormalization(),
        Dense(512, activation='relu'),
        BatchNormalization(),
        Dropout(0.5),
        Dense(10, activation='softmax')
        ])
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model

In [21]:
model = get_model_bn_do()

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=1, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

In [None]:
model.optimizer.lr=0.1

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=4, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

In [None]:
model.optimizer.lr=0.01

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=12, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

In [None]:
model.optimizer.lr=0.001

In [None]:
model.fit_generator(batches, batches.N, nb_epoch=1, 
                    validation_data=test_batches, nb_val_samples=test_batches.N)

After training for a while at learning rates of 0.1, 0.01, and 0.001, training accuracy and validation accuracy are pretty similar, so adding one layer of dropout seems to be enough.

## Ensembling

Ensembling means building multiple versions of your model and combining their predictions.

It is a trick that improves nearly every model.

Put all of the code from the last section into one function, *fit_model()*, which:

- uses the same model,
- trains it at learning rates of 0.1, 0.01, and 0.001,
- returns the trained model.

In [None]:
def fit_model():
    model = get_model_bn_do()
    model.fit_generator(batches, batches.N, nb_epoch=1, verbose=0,
                        validation_data=test_batches, nb_val_samples=test_batches.N)
    model.optimizer.lr=0.1
    model.fit_generator(batches, batches.N, nb_epoch=4, verbose=0,
                        validation_data=test_batches, nb_val_samples=test_batches.N)
    model.optimizer.lr=0.01
    model.fit_generator(batches, batches.N, nb_epoch=12, verbose=0,
                        validation_data=test_batches, nb_val_samples=test_batches.N)
    model.optimizer.lr=0.001
    model.fit_generator(batches, batches.N, nb_epoch=18, verbose=0,
                        validation_data=test_batches, nb_val_samples=test_batches.N)
    return model

In [None]:
# Fit the model 6 times and collect the trained models in a list
models = [fit_model() for i in range(6)]

The variable *models* contains 6 models trained in the same way, but from different random starting points. 

In [22]:
path = 'data/mnist/'
model_path = path + 'models/'

In [None]:
for i,m in enumerate(models):
    m.save_weights(model_path+'cnn-mnist23-'+str(i)+'.pkl')

In [None]:
evals = np.array([m.evaluate(X_test, y_test, batch_size=256) for m in models])

In [None]:
evals.mean(axis=0)

In [None]:
# Go through every one of those 6 models and 
# predict the output for the test set.
all_preds = np.stack([m.predict(X_test, batch_size=256) for m in models])

In [None]:
# np.stack puts the models first: 6 models by 10,000 test images by 10 outputs.
all_preds.shape

**Idea:**
Since the 6 models will make errors in different places, let's average the predictions across the 6 models:

In [None]:
avg_preds = all_preds.mean(axis=0)

In [None]:
keras.metrics.categorical_accuracy(y_test, avg_preds).eval()
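
A toy example shows why averaging can help. The numbers below are invented for illustration (3 hypothetical models, 4 examples, 3 classes), not results from the models trained above: each model is wrong on a different example, yet the averaged prediction gets all of them right.

```python
import numpy as np

y_true = np.array([0, 1, 2, 1])

# Predicted class probabilities, shape (models, examples, classes).
all_preds = np.array([
    # model 0: wrong on example 0
    [[0.4, 0.6, 0.0], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.2, 0.7, 0.1]],
    # model 1: wrong on example 1
    [[0.9, 0.1, 0.0], [0.6, 0.4, 0.0], [0.0, 0.2, 0.8], [0.1, 0.8, 0.1]],
    # model 2: wrong on example 2
    [[0.8, 0.1, 0.1], [0.1, 0.9, 0.0], [0.1, 0.5, 0.4], [0.0, 0.9, 0.1]],
])

def accuracy(preds):
    """Fraction of examples whose highest-probability class matches y_true."""
    return float((preds.argmax(axis=1) == y_true).mean())

single = [accuracy(p) for p in all_preds]
avg_preds = all_preds.mean(axis=0)          # average over the model axis
ensemble = accuracy(avg_preds)

print(single)    # [0.75, 0.75, 0.75]
print(ensemble)  # 1.0
```

The gain depends on the models making different mistakes; models that all fail on the same examples would not benefit from averaging.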