An attempt at building a Convolutional Neural Network (CNN) for automatic digit classification using the MNIST dataset.

This is my first attempt at image recognition. Sources that I used for inspiration/guidance/lookup when I got stuck were:

[https://www.kaggle.com/yassineghouzam/introduction-to-cnn-keras-0-997-top-6](https://www.kaggle.com/yassineghouzam/introduction-to-cnn-keras-0-997-top-6)

[https://www.kaggle.com/toregil/welcome-to-deep-learning-cnn-99](https://www.kaggle.com/toregil/welcome-to-deep-learning-cnn-99)

Running time on GPU: 5 min

**Importing Libraries**


First of all, the necessary libraries need to be imported.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

# importing libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting graphs
import matplotlib.image as mpimg # plotting images
%matplotlib inline
import seaborn as sns # more graphs

# some machine learning tools
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# neural network tools
from keras.utils import to_categorical # convert to one-hot-encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D, BatchNormalization
from keras.optimizers import RMSprop, Adam
from keras.preprocessing.image import ImageDataGenerator # for data augmentation
from keras.callbacks import ReduceLROnPlateau, LearningRateScheduler # for adapting learning rate

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
# adapting plot style
sns.set(style='white', context='notebook', palette='deep')

**The Data**

Time to load in the training and test datasets.

In [None]:
# Load data
X_train = pd.read_csv('../input/train.csv')
X_test = pd.read_csv('../input/test.csv')

Now for a first glance at the datasets:

In [None]:
# training dataset
print(X_train.shape)
print(X_train.info())
print(X_train.head())

In [None]:
# test dataset
print(X_test.shape)
print(X_test.info())
print(X_test.head())

So the training dataset has one additional column "label". This is what we're expected to predict on the test set. The correct labels need to be dropped from the training set and stored as the expected training output:

In [None]:
# drop label column and store it as expected output
y_train = X_train.pop('label')

# double check
print(y_train.shape)
print(X_train.shape, X_test.shape)
print(X_train.head())

Are any of the classes over- or underrepresented?

In [None]:
y_train.value_counts()

Good enough: no class looks drastically over- or underrepresented. Are there any missing or other NaN values? NaNs inside X_train or X_test could indicate a corrupted image file, and NaNs inside y_train would be missing classification labels.

In [None]:
# NaN in training input
print(X_train.isnull().values.any())
# NaN in test input
print(X_test.isnull().values.any())
# NaN in training expected output
print(y_train.isnull().values.any())

No, none. 

So what about the image data? What values represent the pixels?

In [None]:
print(X_train.apply(pd.Series.value_counts))

The values range from 0 to 255, so this is a dataset of 8-bit grayscale images. We should rescale them to the range 0 to 1: small, uniformly scaled inputs are easier to work with and usually help the optimizer converge faster.
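
As a small illustration of what the division does, here is a sketch on a made-up 2x2 array of 8-bit pixel values. The array `pixels` is hypothetical and only stands in for a few entries of the real data.

```python
import numpy as np

# hypothetical 8-bit grayscale values (not taken from the dataset)
pixels = np.array([[0, 51], [102, 255]], dtype=np.uint8)

# dividing by 255.0 promotes the integers to floats
# and maps the range [0, 255] onto [0.0, 1.0]
scaled = pixels / 255.0

print(scaled.dtype)
print(scaled.min(), scaled.max())
```

The relative differences between pixels are unchanged; only the scale is different.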

In [None]:
X_train = X_train / 255.0
X_test = X_test / 255.0

# check values
# print(X_train.apply(pd.value_counts))

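Next we need to reshape the images. Each row of the table is currently a flat vector of 784 pixel values, while the convolutional layers expect 28x28 images with an explicit channel dimension.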

In [None]:
# the shape should be 28x28x1 as keras requires an additional dimension for the channel
X_train = X_train.values.reshape(-1,28,28,1)
X_test = X_test.values.reshape(-1,28,28,1)

print(X_train.shape)
print(X_test.shape)
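
To see why a plain reshape is enough, here is a minimal sketch. It assumes that each CSV row stores its image row by row (row-major order). The array `flat` is a synthetic stand-in for the real pixel table.

```python
import numpy as np

# synthetic stand-in: 3 "images", each flattened to 784 values
flat = np.arange(3 * 784).reshape(3, 784)

# -1 lets NumPy infer the number of images; the trailing 1 is the channel axis
images = flat.reshape(-1, 28, 28, 1)
print(images.shape)  # (3, 28, 28, 1)

# with row-major order, flat pixel index i * 28 + j ends up at row i, column j
i, j = 5, 7
assert images[2, i, j, 0] == flat[2, i * 28 + j]
```

No pixel values are changed or reordered; the reshape only reinterprets how the same numbers are indexed.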

Right now the labels in y_train are single digit values ranging from 0 to 9. The model will work with one-hot vectors as its output though, so we need to change the encoding:

In [None]:
y_train = to_categorical(y_train, num_classes=10)
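
The encoding step above can be mimicked with plain NumPy. This is only a sketch of the idea, not how `to_categorical` is implemented, and the `labels` array is made up.

```python
import numpy as np

labels = np.array([3, 0, 9])  # hypothetical digit labels

# row k of the 10x10 identity matrix is the one-hot vector for digit k
one_hot = np.eye(10)[labels]
print(one_hot[0])

# argmax along the class axis recovers the original labels
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```

The same `argmax` trick is used later in the notebook to turn the model's predicted probability vectors back into digit labels.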

Next we will split the X_train input and y_train expected output into a training and a validation set.

In [None]:
## For now we will use a small training set (and consequently a large validation set)
## to speed up the running time during prototyping
## This will need to be changed before tuning the hyperparameters
## 
## temporary split
#X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.9, random_state=42)
## replace with final split

# final train-test split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)
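
Conceptually, the split shuffles the row indices and holds out a fraction of them. The sketch below shows that idea on toy data. It is a hand-written illustration, not scikit-learn's implementation, and the names `X`, `y`, `X_tr`, `X_va` etc. are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(42)    # fixed seed for reproducibility

X = np.arange(20).reshape(10, 2)   # 10 hypothetical samples, 2 features each
y = np.arange(10)                  # matching hypothetical labels

indices = rng.permutation(len(X))  # shuffled row order
n_val = int(round(0.1 * len(X)))   # hold out 10 % for validation

val_idx, train_idx = indices[:n_val], indices[n_val:]
X_tr, X_va = X[train_idx], X[val_idx]
y_tr, y_va = y[train_idx], y[val_idx]
print(X_tr.shape, X_va.shape)
```

Because inputs and labels are indexed with the same index arrays, every image keeps its own label after the shuffle.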

**The Model**

Now we can finally build and train the model. Since we are dealing with images, a convolutional neural network (CNN) seems like a good choice, as CNNs generally perform well at image recognition tasks. As the classification of an image does not depend on any previous input, we can stick with a conventional feed-forward (non-recurrent) CNN. This is easily done in Keras by instantiating a Sequential object and adding layers to it:

In [None]:
# create model
model = Sequential()

# (Conv2D -> BatchNormalization) * 2 -> MaxPool2D -> Dropout
model.add(Conv2D(filters=32, kernel_size=(5,5), padding='same', activation='relu', 
                 input_shape=(28,28,1)))
model.add(BatchNormalization())
model.add(Conv2D(filters=32, kernel_size=(5,5), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(4,4), strides=(2,2)))
model.add(Dropout(0.25))

# repeat above sequence
model.add(Conv2D(filters=32, kernel_size=(5,5), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(filters=32, kernel_size=(5,5), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(4,4), strides=(2,2)))
model.add(Dropout(0.25))

# Flatten -> Dense -> Dropout -> Dense
model.add(Flatten())
model.add(Dense(units=128, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(units=10, activation='softmax'))
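
As a rough sanity check, the feature-map sizes of the stack above can be worked out by hand. The sketch assumes that 'same'-padded convolutions with stride 1 keep the spatial size and that the pooling layers use Keras's default 'valid' padding. `pool_output_size` is a hypothetical helper, not a Keras function.

```python
def pool_output_size(size, pool, stride):
    """Output length of a 'valid' (unpadded) pooling window along one axis."""
    return (size - pool) // stride + 1

size = 28      # input height/width
filters = 32   # number of filters in every convolution above

# the convolutions keep the spatial size,
# so only the two pooling layers shrink the feature maps
after_block1 = pool_output_size(size, pool=4, stride=2)
after_block2 = pool_output_size(after_block1, pool=4, stride=2)

flattened = after_block2 * after_block2 * filters
print(after_block1, after_block2, flattened)  # 13 5 800
```

So the Flatten layer should hand 800 values to the first Dense layer. Calling `model.summary()` reports the actual shapes and is the authoritative check.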

Next we need to define an optimizer:

In [None]:
# the default parameter settings of RMSprop should work fine
# but maybe the learning rate needs to be changed later
optimizer = RMSprop()

Now the model can be compiled:

In [None]:
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

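To help the CNN converge, we lower the learning rate as training progresses. The built-in Keras callback ReduceLROnPlateau does this automatically: when the monitored metric has stopped improving for a given number of epochs (the patience), the learning rate is multiplied by a factor. With the settings below, the rate is multiplied by 0.2 after two epochs without improvement in validation accuracy, but it never drops below 0.000001.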

In [None]:
# note: newer Keras versions log this metric as 'val_accuracy' instead of 'val_acc'
reduce_lr = ReduceLROnPlateau(monitor='val_acc', factor=0.2,
                              patience=2, min_lr=0.000001, verbose=1)
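
The behaviour of the callback configured above can be approximated in a few lines of plain Python. This is a simplified sketch, not the Keras implementation: it ignores the cooldown and minimum-delta options, the function name `simulate_reduce_on_plateau` is hypothetical, the accuracy values are made up, and the starting rate of 0.001 is an assumption.

```python
def simulate_reduce_on_plateau(val_acc_per_epoch, lr, factor=0.2,
                               patience=2, min_lr=1e-6):
    """Return the learning rate in effect after each epoch (simplified)."""
    best = float('-inf')
    wait = 0          # epochs since the last improvement
    history = []
    for acc in val_acc_per_epoch:
        if acc > best:
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # shrink the rate, but never below the floor
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# made-up validation accuracies: improvement stalls after the third epoch
val_acc = [0.95, 0.97, 0.98, 0.98, 0.979, 0.981, 0.981, 0.98]
lrs = simulate_reduce_on_plateau(val_acc, lr=0.001)
print(lrs)
```

In this toy run the rate stays at its initial value while the accuracy keeps improving and is cut each time two consecutive epochs bring no new best value.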


In [None]:
# some additional parameter settings for the model
epochs = 30
batch_size = 86


**Data Augmentation**

When it comes to digit recognition or other computer vision tasks, the robustness of a neural network, that is, its ability to classify new input, depends a great deal on the size and quality of the training set. If all digits in the training set were written by the same person, the model would perform poorly on test images of digits in somebody else's handwriting. However, we want our model to generalize to handwriting it has not seen during training. To achieve this, we need a large and varied training set.

If, for example, the training images contain digits that are not always perfectly centered, that are rotated slightly to the left or right, or that differ in size, the network will focus on learning important features of the digits' form rather than their positions. One way to improve the model's ability to capture such variability in handwriting is to automatically create modified copies of existing images and train on them as well.
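
To make the idea concrete, here is a hand-written sketch of one such modification: shifting an image to the right and filling the gap by repeating the edge pixels. It only illustrates the principle on a toy array; `shift_right` is a hypothetical helper, and a real augmentation pipeline would also handle rotation, zoom and fractional shifts.

```python
import numpy as np

def shift_right(image, pixels):
    """Shift a 2-D image to the right, repeating the left edge column."""
    padded = np.pad(image, ((0, 0), (pixels, 0)), mode='edge')
    return padded[:, :image.shape[1]]

# toy 4x4 "image" with a single bright vertical stroke in column 1
image = np.zeros((4, 4))
image[:, 1] = 1.0

shifted = shift_right(image, 2)
print(shifted)
```

The stroke is the same shape as before, only in a different position, so the label of the image stays valid.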

Keras offers a simple way to do this with the ImageDataGenerator class:

In [None]:
datagen = ImageDataGenerator(rotation_range=15, width_shift_range=0.1, 
                             height_shift_range=0.1, zoom_range=0.1, 
                             fill_mode='nearest')

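# fit() only computes statistics for the featurewise/ZCA options, none of which
# are enabled here, so this call is harmless but not strictly needed
datagen.fit(X_train)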

Now we can fit the model to the augmented training data:

In [None]:
# note: fit_generator is the older Keras API; recent versions accept
# the generator directly in model.fit
fit_model = model.fit_generator(datagen.flow(X_train, y_train, batch_size=batch_size),
                                epochs=epochs, validation_data=(X_val, y_val), verbose=2,
                                steps_per_epoch=X_train.shape[0] // batch_size, 
                                callbacks=[reduce_lr])

**Evaluation**

Let's have a closer look at how the model performed during training.

In [None]:
# plot the loss functions
plt.plot(fit_model.history['loss'], color='b', label='Training loss')
plt.plot(fit_model.history['val_loss'], color='r', label='Validation loss')
plt.title('Loss functions')
plt.legend()
plt.show()

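# plot the development of the model's accuracy
# (older Keras versions log 'acc'/'val_acc', newer ones 'accuracy'/'val_accuracy')
acc_key = 'acc' if 'acc' in fit_model.history else 'accuracy'
plt.plot(fit_model.history[acc_key], color='b', label='Training accuracy')
plt.plot(fit_model.history['val_' + acc_key], color='r', label='Validation accuracy')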
plt.title('Accuracy')
plt.legend()
plt.show()


The loss curves don't look too bad. The model saturates after just a couple of epochs and the curves flatten out.

The accuracy seems to be rather satisfying as well for a first attempt.

Next we should have a look at the confusion matrix to see which digits are misclassified by the model and for which digits they were mistaken.
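
For readers new to the concept, here is a minimal sketch of how such a matrix is counted. It uses made-up labels for a 3-class problem and follows the same convention as scikit-learn's `confusion_matrix` (rows are expected labels, columns are predicted labels). `tiny_confusion_matrix` is a hypothetical helper.

```python
import numpy as np

def tiny_confusion_matrix(expected, predicted, num_classes):
    """Count matrix with expected labels as rows and predictions as columns."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for e, p in zip(expected, predicted):
        cm[e, p] += 1
    return cm

# made-up labels for a 3-class problem
expected = [0, 1, 2, 2, 1, 0]
predicted = [0, 2, 2, 2, 1, 0]

cm = tiny_confusion_matrix(expected, predicted, num_classes=3)
print(cm)
```

Correct predictions land on the diagonal. The single off-diagonal entry in row 1, column 2 is the one sample of class 1 that was mistaken for class 2.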

In [None]:
def create_model_confusion_matrix(model, X_input, y_expected):
    """
    This function creates the confusion matrix and plots it.
    """
    # let the model predict the output given X_input
    y_predicted = model.predict(X_input)
    # convert predicted and expected output from one-hot vector to label
    y_predicted_classes = np.argmax(y_predicted, axis=1)
    y_expected_classes = np.argmax(y_expected, axis=1)
    
    # calculate the confusion matrix and convert it
    # to a DataFrame object for plotting
    cm = confusion_matrix(y_expected_classes, y_predicted_classes)
    df_cm = pd.DataFrame(cm, range(10), range(10))
    
    # plot the confusion matrix
    # (rows of the matrix are the expected labels, columns the predicted ones)
    ax = sns.heatmap(df_cm)
    ax.set(xlabel='predicted', ylabel='expected')
    ax.set_title('Confusion Matrix')
    plt.show()
    
    return df_cm

create_model_confusion_matrix(model, X_val, y_val)

This pretty much confirms the findings of the accuracy curves above: the vast majority of digits are classified correctly. 

There are a couple of mistakes though. 9 and 4 seem to be confused on occasion, as are 6 and 5 or 7 and 1. These digits can indeed look similar in some people's handwriting. 

Maybe we should have a look at some random images and how they were classified by the CNN:

In [None]:
# the necessary functions for plotting images along with their 
# predicted and expected labels

def plot_labeled_images(model, X_input, y_expected, mode='random'):
    """
    This function plots a total of 9 images from the given set 
    along with their predicted and expected labels.
    
    The function has two modes: 'random' and 'errors'.
    If mode is set to 'random', random images are plotted.
    If mode is set to 'errors', only images are plotted where the predicted 
    label does not match the expected one.
    """
    num = 9
    if mode == 'random':
        selected_digits = get_random_digits(model, X_input, y_expected, num)
    elif mode == 'errors':
        selected_digits = get_error_digits(model, X_input, y_expected, num)
    else:
        raise ValueError("Unknown value for mode. Only 'random' and 'errors' are accepted.")
    
    # plot the digits
    n = 0
    rows = 3
    cols = 3
    fig, ax = plt.subplots(rows, cols, sharex=True, sharey=True)
    plt.subplots_adjust(top=1.5)
    for row in range(rows):
        for col in range(cols):
            ax[row, col].imshow(selected_digits[n][0].reshape((28, 28)))
            ax[row, col].set_title("Predicted label: {}\nExpected label: {}".format(
                selected_digits[n][1], selected_digits[n][2]
            ))
            n += 1

def get_random_digits(model, X_input, y_expected, num):
    """
    This function returns a total of num random digits from the dataset.
    The output is a tuple of length num whose entries are tuples containing
    an input array, predicted label and expected label each.
    """
    # let the model predict the output given X_input
    y_predicted = model.predict(X_input)
    # convert predicted and expected output from one-hot vector to label
    y_predicted_classes = np.argmax(y_predicted, axis=1)
    y_expected_classes = np.argmax(y_expected, axis=1)
    
    # get num random digits (image, predicted label, expected label)
    digit_sets = get_digit_sets(
        num, X_input, y_expected_classes, y_predicted_classes
    )
    
    return digit_sets

def get_error_digits(model, X_input, y_expected, num):
    """
    This function returns a total of num random digits from the dataset
    where the predicted label does not match the expected one.
    The output is a tuple of length num whose entries are tuples containing
    an input array, predicted label and expected label each.
    """
    # let the model predict the output given X_input
    y_predicted = model.predict(X_input)
    # convert predicted and expected output from one-hot vector to label
    y_predicted_classes = np.argmax(y_predicted, axis=1)
    y_expected_classes = np.argmax(y_expected, axis=1)
    
    # pick only instances where predicted and expected labels don't match
    errors = (y_predicted_classes != y_expected_classes)
    y_predicted_classes_errors = y_predicted_classes[errors]
    y_expected_classes_errors = y_expected_classes[errors]
    X_input_errors = X_input[errors]
    
    # get num random digits (image, predicted label, expected label)
    digit_sets = get_digit_sets(
        num, X_input_errors, 
        y_expected_classes_errors, y_predicted_classes_errors
    )
    
    return digit_sets
    
def get_digit_sets(num, X_possible, y_expected_classes, y_predicted_classes):
    """
    This function returns a tuple of length num containing random digit images
    along with their expected and predicted labels. 
    
    Each entry of the tuple is itself a tuple of the form 
    (image, y_predicted, y_expected).
    """
    # indices are drawn with replacement, so the same image can appear twice
    indices = np.random.randint(X_possible.shape[0], size=num)
    digit_sets = tuple((X_possible[i], y_predicted_classes[i], y_expected_classes[i])
                      for i in indices)
    return digit_sets


In [None]:
# plot some random images to see whether they are labeled correctly
plot_labeled_images(model, X_val, y_val, mode='random')

This seems good enough. These digits are usually classified as they should be.

So let's have a look at some images that were misclassified:

In [None]:
plot_labeled_images(model, X_val, y_val, mode='errors')

Most of the misclassified digits seem to be rather tricky indeed. More often than not even a human reader could struggle with them.

**Submitting Predictions**

Time to finally submit the results:

In [None]:
# predict results
y_test_pred = model.predict(X_test)

# select the indices with the highest probability
# these are our predicted labels
y_test_pred = np.argmax(y_test_pred, axis=1)

# convert to a pandas Series
y_test_pred = pd.Series(y_test_pred, name="Label")

# build the submission table (ImageId starts at 1) and write it to a CSV file as required
submission = pd.concat([pd.Series(range(1, len(y_test_pred) + 1), name="ImageId"), y_test_pred], axis=1)
submission.to_csv("cnn_mnist_datagen.csv", index=False)