# Digit Recogniser

In [3]:
import pandas as pd
import numpy as np
from keras import models
from scipy.stats import mode
from keras import layers
from keras import optimizers
from keras.utils.np_utils import to_categorical
from keras.callbacks import ReduceLROnPlateau
from keras.preprocessing.image import ImageDataGenerator
from keras.optimizers import RMSprop, Adam
from keras.models import load_model

Using TensorFlow backend.


In [4]:
train_data = pd.read_csv('./digit-recognizer/train.csv',delimiter=',')
test_data = pd.read_csv('./digit-recognizer/test.csv',delimiter=',')
# print(len(test_data))
train_labels = train_data['label']
train_data = train_data.drop(['label'],axis=1)
print(train_data.head())
print(train_data.shape)
print(train_labels.head())
print(train_labels.shape)


   pixel0  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  pixel8  \
0       0       0       0       0       0       0       0       0       0   
1       0       0       0       0       0       0       0       0       0   
2       0       0       0       0       0       0       0       0       0   
3       0       0       0       0       0       0       0       0       0   
4       0       0       0       0       0       0       0       0       0   

   pixel9  ...  pixel774  pixel775  pixel776  pixel777  pixel778  pixel779  \
0       0  ...         0         0         0         0         0         0   
1       0  ...         0         0         0         0         0         0   
2       0  ...         0         0         0         0         0         0   
3       0  ...         0         0         0         0         0         0   
4       0  ...         0         0         0         0         0         0   

   pixel780  pixel781  pixel782  pixel783  
0         0         0   

Next, check whether there are any null values.

In [3]:
null_values = train_data.isnull().any().describe()
print(null_values)

count       784
unique        1
top       False
freq        784
dtype: object


Reshape to include the grayscale channel dimension (1 channel). A colour (RGB) image would need 3 channels instead.

In [5]:
train_data = train_data.to_numpy().reshape((42000,28,28,1))
test_data = test_data.to_numpy().reshape((28000,28,28,1))
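
As a small illustration of what this reshape does, the sketch below uses a made-up batch of two flat 784-value rows (an assumption, standing in for the real CSV rows). It shows that flat pixel `k` ends up at row `k // 28`, column `k % 28`, with a trailing channel axis of size 1.

```python
import numpy as np

# Two fake "images" stored as flat rows of 784 values, like rows of train.csv.
flat = np.arange(2 * 784).reshape(2, 784)

# Same layout as the reshape above: (samples, height, width, channels).
images = flat.reshape((-1, 28, 28, 1))

k = 100                    # an arbitrary flat pixel index
row, col = divmod(k, 28)   # position of that pixel inside the 28x28 grid
print(images.shape)                          # (2, 28, 28, 1)
print(images[0, row, col, 0] == flat[0, k])  # True
```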

Encoding the target variable is not a must for multiclass classification. It depends on the loss function you use. If you don't want to encode the target variable, you must use sparse_categorical_crossentropy as the loss function. Otherwise, you use categorical_crossentropy as the loss function and you have to one-hot encode the target variable.
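
To make the point about the two loss functions concrete, here is a minimal pure-Python sketch (not the Keras implementation) showing that both compute the same value for a single sample. The probabilities and label are made up for illustration.

```python
import math

def categorical_crossentropy(one_hot, probs):
    # Sum of -y * log(p) over classes; only the true class contributes.
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(label, probs):
    # Same quantity, indexed directly by the integer class label.
    return -math.log(probs[label])

probs = [0.1, 0.7, 0.2]     # hypothetical softmax output for 3 classes
label = 1                   # integer label
one_hot = [0.0, 1.0, 0.0]   # the same label, one-hot encoded

a = categorical_crossentropy(one_hot, probs)
b = sparse_categorical_crossentropy(label, probs)
print(abs(a - b) < 1e-12)   # True
```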

In [5]:
train_labels = to_categorical(train_labels,num_classes=10)
print(train_labels[0:4])

[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]


Split the training data into training and validation sets.

In [6]:
val_data = train_data[:5000]
train_data = train_data[5000:]

Now to add random geometric transformations to each image. The generator produces freshly transformed copies of the training images on every epoch, which should help generalisation. The transformations used can be seen below.
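
To give a feel for what one such transformation does, here is a rough NumPy sketch of a random shift of up to 10% of the image size. It is an illustration only, not how ImageDataGenerator works internally: the function name `random_shift` is hypothetical, and it fills exposed pixels with zeros, which may differ from the generator's fill behaviour.

```python
import numpy as np

def random_shift(image, max_fraction=0.1, rng=None):
    """Shift a (H, W, C) image by a random offset, filling exposed pixels with 0."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    max_dy, max_dx = int(h * max_fraction), int(w * max_fraction)
    dy = int(rng.integers(-max_dy, max_dy + 1))   # vertical offset in pixels
    dx = int(rng.integers(-max_dx, max_dx + 1))   # horizontal offset in pixels
    shifted = np.zeros_like(image)
    # Source and destination windows that still overlap after the shift.
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    shifted[dst_y, dst_x] = image[src_y, src_x]
    return shifted

rng = np.random.default_rng(0)
image = np.ones((28, 28, 1))      # stand-in for one digit image
out = random_shift(image, rng=rng)
print(out.shape)                  # (28, 28, 1)
```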

In [7]:
datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=10,  # randomly rotate images in the range (degrees, 0 to 180)
        zoom_range = 0.1, # Randomly zoom image 
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=False,  # randomly flip images
        vertical_flip=False)  # randomly flip images
datagen.fit(train_data)
val_labels = train_labels[:5000]
train_labels = train_labels[5000:]

With a GPU I'd create many more variations of these models and throw them together in an ensemble. Unfortunately, I barely have enough processing power to make them overfit, and CNNs are notoriously large models. For this reason I've stuck with 3 models, though I'd expect improved results with a larger model.

In a regular (densely connected) neural net, every neuron in the first layer for this particular project would be connected to all 28*28*1 (for grayscale) = 784 input pixels. This is not scalable when considering the much larger pictures that often need classifying. This is why convolutional layers are used: they find local patterns rather than global patterns. Depending on the kernel size used (which I have varied in my ensemble to find different patterns), a layer looks for patterns in windows of that many pixels. After learning a pattern, it can recognise it again anywhere in the image. The earlier layers learn small patterns such as edges or corners, whereas the deeper layers in a conv net, which come after pooling, learn larger patterns made up of the smaller features learnt in the earlier layers. This is commonly referred to as a spatial hierarchy.
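
A quick back-of-the-envelope comparison of weight counts illustrates the scaling argument. The two helper functions are hypothetical. The 256-unit dense layer on the flattened image is only an assumed point of comparison, while the 5x5 kernel with 64 filters matches the first conv layer used later (counted here with biases for simplicity).

```python
def dense_params(n_inputs, n_units, use_bias=True):
    # One weight per (input, unit) pair, plus one bias per unit.
    return n_inputs * n_units + (n_units if use_bias else 0)

def conv2d_params(kernel, in_channels, filters, use_bias=True):
    # One kernel x kernel x in_channels weight block per filter, plus one bias each.
    return kernel * kernel * in_channels * filters + (filters if use_bias else 0)

print(dense_params(28 * 28 * 1, 256))   # 200960
print(conv2d_params(5, 1, 64))          # 1664
```

The dense layer's count grows with the image size, whereas the conv layer's count depends only on the kernel size and the number of channels.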

Conv layers operate over 3D tensors called feature maps, which have two spatial axes (x and y) and a depth/channel axis. The depth is 1 for a grayscale image and 3 for a colour (RGB) image. Convolution extracts patches of size kernel_size and applies the same operation to each one. A 28x28x1 input feature map passed through a layer with 32 filters and a 3x3 kernel_size (with no padding) will have a 26x26x32 output feature map. Each 26x26 grid is the response map of one filter at different locations of the input. This means every channel of the output depth axis is a feature / filter, and so the number of filters chosen represents the number of features that can be learnt by each layer.
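
The 28 to 26 arithmetic can be checked with a small helper. This is a sketch of the usual output-size rule for one spatial axis; the function name `conv_output_size` is made up, and the stride parameter is included because strides also change the output size.

```python
def conv_output_size(size, kernel, stride=1, padding="valid"):
    """Output width (or height) of a convolution along one spatial axis."""
    if padding == "same":
        # "same" pads the input so the output is ceil(size / stride).
        return -(-size // stride)
    # "valid" uses no padding, so the window must fit entirely inside the input.
    return (size - kernel) // stride + 1

print(conv_output_size(28, 3))                   # 26
print(conv_output_size(28, 3, padding="same"))   # 28
print(conv_output_size(28, 2, stride=2))         # 14
```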

Response map - 2D map of presence of feature at different locations of input.

Hyperparameters that required tuning:

- kernel_size = 5x5, 4x4, 3x3
- Depth / number of filters - picking too many can result in overfitting

The conv net slides the kernel across the 3D input, stopping at every possible location and extracting the 3D patch of surrounding features. Each patch is then transformed, via a tensor product with the same learned weight matrix (the convolution kernel), into a 1D vector whose length is the output depth. These vectors are then reassembled into a new 3D output feature map. Spatial locations in the output correspond to the same locations in the input.
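
The sliding-window procedure can be written out naively in NumPy. This is a slow, illustrative sketch with no padding and a stride of 1, using random values in place of learned weights; real libraries use much faster implementations.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """x: (H, W, C_in); kernel: (kh, kw, C_in, C_out). No padding, stride 1."""
    h, w, _ = x.shape
    kh, kw, _, c_out = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + kh, j:j + kw, :]   # 3D patch at location (i, j)
            # Tensor product with the shared kernel gives a vector of length C_out.
            out[i, j, :] = np.tensordot(patch, kernel, axes=3)
    return out

rng = np.random.default_rng(0)
x = rng.random((28, 28, 1))          # stand-in for one grayscale image
kernel = rng.random((3, 3, 1, 32))   # 32 random 3x3 filters
y = conv2d_valid(x, kernel)
print(y.shape)                       # (26, 26, 32)
```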

Output and input widths and heights can differ for 2 reasons:

- border effects (countered by padding the input)
- using strides

Adding padding ensures you can center convolution windows around every input element. It can take the following values:

1) "valid" - no padding
2) "same" - pad so the output has the same width and height as the input

Stride is the distance between 2 successive windows. A stride larger than 1 downsizes the output feature map.

Max pooling is often used instead: it takes the maximum value of each nxn window and strides to the next location by n. This rapidly shrinks the feature maps, so each window in the next layer covers a larger part of the original image, which allows a spatial hierarchy to be learnt. Without downsampling, the deeper layers would see windows barely larger than those of the early layers and there would not be much point. Additionally, coupling a large final feature map with a Dense layer of 256 would leave far too many parameters to train, which would lead to overfitting. I could also have used average pooling, but max pooling supposedly gives better results.
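
As an illustration of the pooling step, here is a small NumPy sketch of max pooling with an nxn window and a stride of n. It assumes the height and width are divisible by n, and the function name `max_pool2d` is made up.

```python
import numpy as np

def max_pool2d(x, n=2):
    """x: (H, W, C) with H and W divisible by n. Window n x n, stride n."""
    h, w, c = x.shape
    # Split each spatial axis into (blocks, n) and take the max inside each block.
    return x.reshape(h // n, n, w // n, n, c).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4, 1)   # values 0..15 laid out row by row
pooled = max_pool2d(x)
print(pooled.shape)                # (2, 2, 1)
print(pooled[:, :, 0].tolist())    # [[5.0, 7.0], [13.0, 15.0]]
```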

Batch normalization can help generalisation and reduce overfitting. It adaptively normalizes the data as its mean and variance change over the course of training, which allows for easier gradient propagation and deeper networks.
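
A minimal NumPy sketch of the training-time computation is shown below, under the assumption of a simple 2D batch of activations. At inference time, batch normalization layers use moving averages of the mean and variance instead of batch statistics, which this sketch leaves out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalise each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta   # gamma and beta are learned in a real layer

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # hypothetical batch of activations
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(np.allclose(y.mean(axis=0), 0.0, atol=1e-6))  # True
```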

I additionally toyed with the idea of using depthwise separable conv layers. These lightweight layers are sometimes considered a good and more efficient alternative to normal conv2d layers. However, directly substituting them for the conv2d layers made for more volatile validation accuracy and an overall lower accuracy. So I subbed the conv2d layers back in.
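
To show why separable layers are considered lightweight, the sketch below compares weight counts (ignoring biases and assuming a depth multiplier of 1) for a 3x3 layer going from 128 to 128 channels, the size of the later conv layers in this notebook. The helper names are hypothetical.

```python
def conv2d_weights(kernel, c_in, c_out):
    # Every filter spans the full kernel window across all input channels.
    return kernel * kernel * c_in * c_out

def separable_conv2d_weights(kernel, c_in, c_out):
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1x1 convolution that mixes channels
    return depthwise + pointwise

print(conv2d_weights(3, 128, 128))             # 147456
print(separable_conv2d_weights(3, 128, 128))   # 17536
```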

I used matplotlib to plot val acc against epochs to determine where the models overfit.
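
The plotting code itself is not included in this notebook, so the following is only a sketch of how such a plot might be made. The validation accuracies are the first nine values from the training log printed further down. With Keras they would normally come from the history object returned by training (for example `history.history['val_acc']`). Both function names are made up.

```python
def best_epoch(val_acc):
    """Return the 1-based epoch with the highest validation accuracy."""
    return max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1

def plot_val_acc(val_acc):
    # matplotlib is imported lazily so best_epoch works without it installed.
    import matplotlib.pyplot as plt
    epochs = range(1, len(val_acc) + 1)
    plt.plot(epochs, val_acc, marker="o")
    plt.xlabel("Epoch")
    plt.ylabel("Validation accuracy")
    plt.show()

# val_acc for the first nine epochs, copied from the training log in this notebook.
val_acc = [0.9734, 0.9884, 0.9836, 0.9896, 0.9912, 0.9872, 0.9920, 0.9872, 0.9928]
print(best_epoch(val_acc))   # 9
# plot_val_acc(val_acc)      # uncomment to draw the curve
```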

Learning rate reduction speaks for itself: when the validation accuracy stops improving (here for 3 epochs in a row), the learning rate is halved so as not to overshoot the minimum.
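
The idea can be sketched in a few lines of plain Python. This is a simplified stand-in for the ReduceLROnPlateau callback used below, which has further options such as a cooldown period and a minimum improvement threshold. The function name and the accuracy values are made up; the patience, factor and min_lr values mirror the ones passed to the callback.

```python
def reduce_on_plateau(val_accs, lr=0.001, patience=3, factor=0.5, min_lr=1e-5):
    """Return the learning rate in effect after each epoch."""
    best = float("-inf")
    wait = 0          # epochs since the last improvement
    lrs = []
    for acc in val_accs:
        if acc > best:
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the learning rate.
                lr = max(lr * factor, min_lr)
                wait = 0
        lrs.append(lr)
    return lrs

lrs = reduce_on_plateau([0.97, 0.98, 0.98, 0.98, 0.98, 0.99])
print(lrs)   # [0.001, 0.001, 0.001, 0.001, 0.0005, 0.0005]
```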

In [8]:
kernel_size_arr=[5]
ensemble_model=[]
for i in kernel_size_arr:
    model = models.Sequential()
    model.add(layers.Conv2D(filters = 64, kernel_size = (i,i),padding = 'Same',use_bias=False, input_shape = (28,28,1)))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('relu'))                 
    model.add(layers.Conv2D(filters = 64, kernel_size = (i,i),padding = 'Same', use_bias=False))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('relu'))
    model.add(layers.MaxPool2D(pool_size=(2,2)))
    model.add(layers.Dropout(0.25))


    model.add(layers.Conv2D(filters = 128, kernel_size = (3,3),padding = 'Same',use_bias=False))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('relu'))
    model.add(layers.Conv2D(filters = 128, kernel_size = (3,3),padding = 'Same',use_bias=False))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('relu'))
    model.add(layers.MaxPool2D(pool_size=(2,2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Conv2D(filters = 128, kernel_size = (3,3),padding = 'Same',use_bias=False))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('relu'))
    model.add(layers.Dropout(0.25))

    model.add(layers.Flatten())
    model.add(layers.Dense(256, use_bias=False))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('relu'))
    model.add(layers.Dropout(0.25))
    model.add(layers.Dense(10, activation='softmax'))
    # optimizer = RMSProp(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
    model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
    learning_rate_reduction = ReduceLROnPlateau(monitor='val_acc', 
                                                patience=3, 
                                                verbose=1, 
                                                factor=0.5, 
                                                min_lr=0.00001)
    model.fit_generator(datagen.flow(train_data,train_labels, batch_size=64),
                                epochs = 30, validation_data = (val_data,val_labels),
                                verbose = 2, steps_per_epoch=train_data.shape[0] // 64
                                , callbacks=[learning_rate_reduction])
    # history = model.fit_generator(datagen.flow(train_data,train_labels, batch_size=64),
    #                               epochs = 30, validation_data = (val_data,val_labels),
    #                               verbose = 2, steps_per_epoch=train_data.shape[0] // 64
    #                               , callbacks=[learning_rate_reduction])

    ensemble_model.append(model)
    model.save("model"+str(i)+".h5")
ensemble_model = np.array(ensemble_model)
final_predictions = []








Epoch 1/30
 - 402s - loss: 0.2211 - acc: 0.9304 - val_loss: 0.0825 - val_acc: 0.9734
Epoch 2/30
 - 417s - loss: 0.0783 - acc: 0.9768 - val_loss: 0.0390 - val_acc: 0.9884
Epoch 3/30
 - 401s - loss: 0.0643 - acc: 0.9801 - val_loss: 0.0524 - val_acc: 0.9836
Epoch 4/30
 - 432s - loss: 0.0545 - acc: 0.9830 - val_loss: 0.0329 - val_acc: 0.9896
Epoch 5/30
 - 458s - loss: 0.0527 - acc: 0.9835 - val_loss: 0.0258 - val_acc: 0.9912
Epoch 6/30
 - 397s - loss: 0.0449 - acc: 0.9862 - val_loss: 0.0400 - val_acc: 0.9872
Epoch 7/30
 - 395s - loss: 0.0430 - acc: 0.9873 - val_loss: 0.0245 - val_acc: 0.9920
Epoch 8/30
 - 394s - loss: 0.0410 - acc: 0.9877 - val_loss: 0.0413 - val_acc: 0.9872
Epoch 9/30
 - 394s - loss: 0.0391 - acc: 0.9886 - val_loss: 0.0214 - val_acc: 0.9928
Epoch 10/30
 - 397

In [6]:
model1 = load_model('model3.h5')
model2 = load_model('model4.h5')
model3 = load_model('model5.h5')
ensemb = [model1,model2,model3]
print('models loaded')








models loaded


In [25]:
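final_predictions = []  # start empty so re-running this cell does not accumulate earlier results
for i in range(len(val_data)):
    predictions = []
    for model in ensemb:
        # predict() expects a batch, so wrap the single image in an array
        predict = model.predict(np.array([val_data[i]]))
        # the predicted digit is the index of the largest softmax output
        predictions.append(int(np.argmax(predict[0])))
    # majority vote across the ensemble; np.ravel copes with mode() returning
    # either an array or a scalar, depending on the SciPy version
    most_votes = int(np.ravel(mode(predictions)[0])[0])
    final_predictions.append(bool(val_labels[i][most_votes] == 1))

val_acc = final_predictions.count(True) / len(final_predictions)
# note: this is the error rate (1 - accuracy), not a cross-entropy loss
val_loss = final_predictions.count(False) / len(final_predictions)
print('Validation Accuracy')
print(val_acc)
print('Validation Loss')
print(val_loss)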

Validation Accuracy
0.996128181893072
Validation Loss
0.003871818106928083
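
The per-image loop above is slow because it calls predict once per image and per model. A vectorised majority vote could look like the sketch below. The function name `majority_vote` and the probability arrays are made up for illustration. With Keras, each array would come from calling `model.predict` on the whole validation array at once.

```python
import numpy as np

def majority_vote(prob_list):
    """prob_list: one (n_samples, n_classes) array of class probabilities per model."""
    n_classes = prob_list[0].shape[1]
    # votes[m, s] is the class that model m predicts for sample s.
    votes = np.stack([p.argmax(axis=1) for p in prob_list])
    # Count votes per class for each sample; ties go to the smallest class index.
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)

# Hypothetical probabilities from three models for two samples and three classes.
p1 = np.array([[0.1, 0.8, 0.1], [0.6, 0.3, 0.1]])
p2 = np.array([[0.2, 0.7, 0.1], [0.2, 0.5, 0.3]])
p3 = np.array([[0.5, 0.4, 0.1], [0.7, 0.2, 0.1]])
winners = majority_vote([p1, p2, p3])
print(winners.tolist())   # [1, 0]
```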


I could attempt to retrain the models on the entire dataset rather than holding out validation data, but considering my computer nearly died I think I'll settle for this score. I could also use a larger ensemble, but I think I'll save my computer for a better competition. Time to predict the test data and send it to Kaggle.

In [9]:
test_predictions = []  # collect rows in a list rather than growing an array with np.append
for i in range(len(test_data)):
    predictions = []
    for model in ensemb:
        predict = model.predict(np.array([test_data[i]]))
        # the predicted digit is the index of the largest softmax output
        predictions.append(int(np.argmax(predict[0])))
    # majority vote; np.ravel copes with mode() returning an array or a scalar
    most_votes = int(np.ravel(mode(predictions)[0])[0])
    # ImageId is 1-based
    test_predictions.append([i + 1, most_votes])
    
df = pd.DataFrame(data=test_predictions)
print(df.head())
print('results are in!')

   0  1
0  1  2
1  2  0
2  3  9
3  4  0
4  5  3
results are in!


In [19]:
df.to_csv('results.csv',index=False, header=['ImageId','Label'])