**Contents**
* Loading files and Data Preprocessing
* Baseline Artificial Neural Network
* Evaluate Baseline Model
* CNN, Data Augmentation, & Preprocessing
* Construct CNN
* Evaluate CNN Model
* References

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.
np.random.seed(0)

In [2]:
# Visualization library
import matplotlib.pyplot as plt
# Function for splitting data into training and test sets
from sklearn.model_selection import train_test_split
# Sequential is used to define a model as a linear stack of layers in Keras
from keras.models import Sequential
# The layers that we might need to construct the neural networks
# Conv2D and MaxPool2D are used for the CNN
# Dropout is used to reduce overfitting
# Flatten turns the feature maps extracted by the CNN into a vector so that
# they can be passed into the fully connected layers
from keras.layers import Dense, Activation, Conv2D, MaxPool2D, Dropout, Flatten, BatchNormalization
from keras import optimizers
import keras.backend as K
# to_categorical is used to preprocess the label data. This is a multiclass
# classification problem, so labels like 4, 6, 1 must be converted to one-hot
# vectors like [0,0,0,0,1,0,0,0,0,0], [0,0,0,0,0,0,1,0,0,0]
from keras.utils.np_utils import to_categorical

# We might use data augmentation; this class generates augmented images
from keras.preprocessing.image import ImageDataGenerator
# Callback that lowers the learning rate during training to help convergence
from keras.callbacks import ReduceLROnPlateau

**Loading data from the csv file**

In [3]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

**Preprocessing Training Data**
In this step I only use the data from train.csv. By splitting train.csv into train, validation, and test sets, the model can be evaluated before it is applied to test.csv, which is used only to generate the submission.

**1. Get train and label data**
* Extract label column from train.csv
* Use the remaining columns as the training data

**2. Rescale data to the range 0 to 1 (neural networks work well with data in this range) by dividing by 255 (the images are grayscale, so each pixel value is an integer between 0 and 255).**
* Convert data type to 'float32' with numpy
* Divide all pixel data from train.csv by 255
* Divide all pixel data from test.csv by 255

**3. Split data into train, validation, and test sets**
* Extract a portion of the data as the validation set with sample()
* Randomly split the remaining data into train and test sets with train_test_split from sklearn

**4. One-hot encode the label data**
* Convert each label to a vector
* Each vector has length 10 because this is a multiclass classification problem with 10 classes
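
Steps 2 and 4 can be illustrated on a tiny made-up example. This is only a sketch: the pixel values and labels below are invented for illustration, and `np.eye` is used as a stand-in for what `to_categorical` produces for integer labels 0 to 9.

```python
import numpy as np

# Made-up grayscale pixel values (integers in 0..255), for illustration only
pixels = np.array([[0, 128, 255]], dtype='float32')
# Step 2: rescale to the range 0..1
scaled = pixels / 255.0

# Made-up digit labels, for illustration only
labels = np.array([4, 6, 1])
# Step 4: row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(10, dtype='float32')[labels]

print(scaled)
print(one_hot.shape)  # one vector of length 10 per label
print(one_hot[0])     # the 1 sits at index 4 for label 4
```

Each label becomes one row with a single 1 at the position of its class, which is the target format expected by the categorical cross-entropy loss.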


In [4]:
# Create an index column for convenience in splitting data
train['index'] = np.arange(len(train))
train.head(10)

In [5]:
# Check original data size in train
print('Original Size: ' + str(train.shape))
# Hold out 10% of the rows as the validation set
val = train.sample(int(len(train)*0.1))
# Check the size of the validation set
print('Validation Size: ' + str(val.shape))
# Drop the validation rows from the original dataframe so that no validation
# example ends up in the train or test set
X = train.drop(val['index'])
# Check the size in train set
print('Train Size before split: ' + str(X.shape))

In [None]:
X.head()

**At this point, we have a validation set with 10% of the data from the original dataset, and a training set that does not contain any data from the validation set**
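
The overall splitting scheme (hold out 10% for validation as in `int(len(train)*0.1)`, then split the rest 80/20 into train and test) can be sketched with plain index arrays. The function name `three_way_split` and the row count of 1000 are hypothetical and chosen only for illustration; the notebook itself uses `sample()` and `train_test_split`.

```python
import numpy as np

def three_way_split(n_rows, val_frac=0.1, test_frac=0.2, seed=0):
    """Return disjoint (train, val, test) index arrays covering range(n_rows)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    # Hold out the validation rows first
    n_val = int(n_rows * val_frac)
    val_idx = shuffled[:n_val]
    rest = shuffled[n_val:]
    # Split what is left into test and train
    n_test = int(round(len(rest) * test_frac))
    test_idx = rest[:n_test]
    train_idx = rest[n_test:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because all three index arrays come from one permutation, no row can appear in more than one set.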

Now, preprocess the data in training set

In [6]:
# Slice off the first column (label) and the last column (index) to keep only the pixel data
train_afsplit = X.iloc[:,1:-1]
# Get label data
y = X['label']
# Convert label data to one-hot vectors
y_cat = to_categorical(y)

# Change its type to float for scaling purpose
train_afsplit = train_afsplit.astype('float32')
test = test.astype('float32')

# Rescale the data
train_afsplit /= 255.0
test /= 255.0

# Split the remaining data into train and test sets (the validation set was held out earlier)
X_train, X_test, y_train, y_test = train_test_split(train_afsplit, y_cat, test_size = 0.2)

In [7]:
print('Train: ' + str(X_train.shape))
print('y train: ' + str(y_train.shape))
print('Test: ' + str(X_test.shape))
print('y test: ' + str(y_test.shape))

Now, preprocess the data in validation set

In [8]:
val_train = val.iloc[:, 1:-1]
val_cat = to_categorical(val['label'])

val_train = val_train.astype('float32')

val_train /= 255.0

print('val train: ' + str(val_train.shape))
print('val cat: ' + str(val_cat.shape))

In [None]:
print(type(val_train))
print(type(X_train))
print(type(X_test))

**Constructing a simple Artificial Neural Network (ANN) as the baseline model**
* In this model, I decided to add 4 hidden layers with 512, 512, 256, and 128 units respectively (the larger the architecture, the higher the risk of overfitting)
* The data is passed into the model as a long vector of length 28 * 28 = 784, which is the input shape of the model
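
To get a feeling for how large this architecture is, the number of trainable parameters can be computed by hand. The sketch below assumes the layer sizes used in the code cell that follows (784 inputs, then Dense layers of 512, 512, 256, 128, and 10 units); `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

# Input size followed by the units of each Dense layer
sizes = [784, 512, 512, 256, 128, 10]
per_layer = [dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
total = sum(per_layer)

print(per_layer)  # parameters of each Dense layer
print(total)      # 830090
```

The total should agree with the "Total params" line printed by `ANN.summary()`. Almost half of the parameters sit in the first layer, because it connects all 784 pixels to 512 units.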

In [9]:
# Define the input dimension for the model
# shape[1] is the length of each input vector; shape[0] is the number of
# training examples
dim = X_train.shape[1]

# Clear the previous Keras session to free memory
K.clear_session()
# Define the model
ANN = Sequential()
# Add a layer with 512 units to the model
# relu computes max(x, 0), where x is the input. If x is lower than 0,
# relu maps it to 0. If x is larger than 0, relu keeps it unchanged
ANN.add(Dense(512, input_dim = dim, activation = 'relu'))
# Add three more hidden layers
# In Keras, we only need to set input_dim for the first layer. For the
# remaining layers, Keras infers the input dimension automatically
ANN.add(Dense(512, activation = 'relu'))
ANN.add(Dense(256, activation = 'relu'))
ANN.add(Dense(128, activation = 'relu'))
# Add the last layer, and set the units to 10 because we have 10 classes
# Use softmax for multiclass classification
ANN.add(Dense(10, activation = 'softmax'))

# compile configures the model we just defined for training (loss, optimizer, metrics)
# Use categorical_crossentropy loss for multiclass classification
ANN.compile(loss = 'categorical_crossentropy',
              optimizer = 'rmsprop',
              metrics = ['accuracy'])
ANN.summary()

**Fit the model to the data and evaluate its performance**

In [10]:
# Define the batch size for the model => how many examples are fed to the model at once
bz = 128

ANN_pred = ANN.fit(X_train,
                   y_train,
                   batch_size = bz,
                   epochs = 15,
                   verbose = 2,
                   validation_data = (val_train, val_cat))

**Using the ANN with no data augmentation, the accuracy on the validation set is around 97%, while the accuracy on the training set is around 99.6%.**
* Plot the training accuracy and validation accuracy over the epochs to see whether the model can still be improved

In [11]:
plt.plot(ANN_pred.history['acc'])
plt.plot(ANN_pred.history['val_acc'])
plt.legend(['training accuracy', 'validation accuracy'])
plt.title('Accuracy')
plt.xlabel('Epochs')

The plot suggests that the training accuracy could still improve slightly with more epochs. However, there is a big gap between the training accuracy and the validation accuracy, which means that our model might be **overfitting**.
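
The gap can also be read directly from the history dictionary instead of from the plot. The following is a minimal sketch: `accuracy_gap` is a hypothetical helper, and the numbers in `example_history` are invented for illustration, not results from this notebook. The keys `'acc'` and `'val_acc'` follow the names used in the plotting cell.

```python
def accuracy_gap(history, train_key='acc', val_key='val_acc'):
    """Per-epoch difference between training and validation accuracy."""
    return [t - v for t, v in zip(history[train_key], history[val_key])]

# Invented values for illustration only
example_history = {'acc': [0.90, 0.97, 0.99],
                   'val_acc': [0.92, 0.96, 0.97]}

gaps = accuracy_gap(example_history)
print([round(g, 3) for g in gaps])
```

A gap that keeps growing while the validation accuracy stays flat is the usual sign of overfitting. With a real run, `ANN_pred.history` would be passed in place of `example_history`.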

**Evaluate the ANN model on the test set split from the training data**

In [12]:
# Get the predicted class probabilities
ANN_test_pred = ANN.predict_proba(X_test)
# Take the index of the maximum probability in each row => the predicted class
ANN_test_pred = ANN_test_pred.argmax(axis = 1)
ANN.evaluate(X_test, y_test)

**The evaluation result shows that the accuracy on the test set is around 97%, which serves as the baseline for the prediction**
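
How the predicted probabilities turn into class labels and an accuracy score can be shown on a small example. This is a sketch with made-up probabilities for 3 classes; the real model outputs 10 probabilities per image.

```python
import numpy as np

# Made-up softmax outputs for three examples and three classes
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1],
                  [0.3, 0.3, 0.4]])
# Made-up one-hot true labels
true_one_hot = np.array([[0, 1, 0],
                         [1, 0, 0],
                         [0, 1, 0]])

# argmax returns the index of the largest value in each row, i.e. the class
pred_class = probs.argmax(axis=1)
true_class = true_one_hot.argmax(axis=1)

# Accuracy is the fraction of rows where both indices agree
accuracy = (pred_class == true_class).mean()
print(pred_class, true_class, accuracy)
```

The same `argmax` trick converts one-hot labels such as `y_test` back to integer classes.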

**CNN with Data Augmentation**

**Construct a CNN model with data augmentation to see whether the result can be improved, because the accuracy plot suggests that the ANN model has almost reached its capacity**
* Before passing data to the CNN, the data must be reshaped into a 4D tensor
* Each vector is reshaped to (28, 28, 1), which means a 28 * 28 image with 1 color channel because the images are grayscale
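
The reshape itself can be checked on dummy data. The sketch below uses two fake "images" filled with increasing numbers, which makes it easy to see where each element of the flat vector ends up.

```python
import numpy as np

# Two dummy flattened images of length 784, for illustration only
flat = np.arange(2 * 784, dtype='float32').reshape(2, 784)

# -1 lets numpy infer the number of images
images = flat.reshape(-1, 28, 28, 1)
print(images.shape)  # (2, 28, 28, 1)

# Element k of a flat vector lands at row k // 28, column k % 28
k = 1 * 28 + 2
print(images[0, 1, 2, 0] == flat[0, k])  # True
```

No pixel value changes during the reshape; only the way the values are indexed does.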

In [13]:
print(type(val_train))
print(type(X_train))
print(type(X_test))

In [14]:
X_train = X_train.values
X_test = X_test.values

# -1 lets reshape infer that dimension (the number of images) by itself
X_train = X_train.reshape(-1,28,28,1)
X_test = X_test.reshape(-1,28,28,1)

# for validation data
val_train = val_train.values
val_train = val_train.reshape(-1,28,28,1)

# for test data
test = test.values
test = test.reshape(-1,28,28,1)
print('X_train: ' + str(X_train.shape))
print('X_test: ' + str(X_test.shape))
print('val_train: ' + str(val_train.shape))
print('test: ' + str(test.shape))

**Using ImageDataGenerator to complete the Data Augmentation**
* It helps to increase the number of training examples and to reduce overfitting
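
To make the idea of a shift augmentation concrete, here is a simplified numpy sketch. It is not how ImageDataGenerator is implemented: `shift_image` is a hypothetical helper that only handles whole-pixel shifts and fills the exposed border by repeating the edge values, similar in spirit to `fill_mode='nearest'`.

```python
import numpy as np

def shift_image(img, dx, dy):
    """Shift a 2D image right by dx and down by dy pixels (negative = left/up)."""
    h, w = img.shape
    pad = max(abs(dx), abs(dy))
    # Repeat the border values so the exposed area is filled with the nearest pixel
    padded = np.pad(img, pad, mode='edge')
    top = pad - dy
    left = pad - dx
    return padded[top:top + h, left:left + w]

# A shift range of 0.1 on a 28-pixel-wide image allows shifts of up to 2 whole pixels
max_shift = int(0.1 * 28)

img = np.arange(16).reshape(4, 4)
shifted = shift_image(img, 1, 0)
print(max_shift)
print(shifted)
```

Every shifted copy keeps its original label, so the model sees more variation of each digit without any new labeled data.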

In [15]:
datagen = ImageDataGenerator(
    # keep the ranges small to prevent misclassification like 6 and 9, 2 and 5
    # note: rotation_range is given in degrees, so 0.1 means almost no rotation
    rotation_range=0.1,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.2,
    fill_mode='nearest')

Fit the generator to the training data

In [16]:
datagen.fit(X_train)

In [23]:
# save memory
K.clear_session()

# Reduce the learning rate when the validation loss stops improving, to help convergence
LR_Adjust = ReduceLROnPlateau(monitor = 'val_loss',
                              patience = 2,
                              verbose = 1,
                              factor = 0.3,
                              min_lr = 0)

optimizer = optimizers.rmsprop()

CNN = Sequential()

CNN.add(Conv2D(filters = 32, kernel_size = (3,3), padding = 'same', activation = 'relu', input_shape = (28, 28, 1)))
CNN.add(Dropout(0.1))
CNN.add(MaxPool2D(pool_size = (2,2)))
CNN.add(Conv2D(filters = 32, kernel_size = (3,3), padding = 'same', activation = 'relu'))
CNN.add(Dropout(0.1))
CNN.add(MaxPool2D(pool_size = (2,2)))

CNN.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'same', activation = 'relu'))
CNN.add(Dropout(0.1))

CNN.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'same', activation = 'relu'))
CNN.add(MaxPool2D(pool_size = (2,2)))
CNN.add(Dropout(0.05))

CNN.add(Flatten())
CNN.add(Dense(512, activation = 'relu'))
CNN.add(Dense(256, activation = 'relu'))
CNN.add(Dense(128, activation = 'relu'))
CNN.add(Dense(10, activation = 'softmax'))

CNN.compile(loss = 'categorical_crossentropy',
            optimizer = optimizer,
            metrics = ['accuracy'])


CNN.summary()
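
The output shapes reported by `CNN.summary()` can be traced by hand. The sketch below assumes what the cell above uses: 3x3 convolutions with `padding = 'same'` (height and width unchanged) and 2x2 max pooling with the default stride (height and width halved, rounded down). Dropout layers are left out because they do not change the shape. `conv_same` and `max_pool` are hypothetical helpers.

```python
def conv_same(shape, filters):
    """A 'same'-padded convolution keeps height and width, sets the channel count."""
    h, w, _ = shape
    return (h, w, filters)

def max_pool(shape, pool=2):
    """Max pooling divides height and width by the pool size, rounding down."""
    h, w, c = shape
    return (h // pool, w // pool, c)

layers = [('conv', 32), ('pool', 2),
          ('conv', 32), ('pool', 2),
          ('conv', 64),
          ('conv', 64), ('pool', 2)]

shape = (28, 28, 1)
trace = [shape]
for kind, arg in layers:
    shape = conv_same(shape, arg) if kind == 'conv' else max_pool(shape, arg)
    trace.append(shape)

# Flatten turns the last feature map into one vector
flat_size = shape[0] * shape[1] * shape[2]
print(trace)
print(flat_size)  # 576
```

So the first Dense layer of the CNN receives a vector of length 576 instead of the 784 raw pixels used by the baseline.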

In [24]:
bz = 128

CNN_pred = CNN.fit_generator(datagen.flow(X_train,y_train, batch_size= bz),
                             epochs = 30, 
                             validation_data = (val_train,val_cat),
                             verbose = 2, 
                             steps_per_epoch=X_train.shape[0] // bz,
                             callbacks=[LR_Adjust])
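
The effect of the `LR_Adjust` callback can be simulated in a few lines. This is a simplified sketch, not the Keras implementation: it ignores options such as `min_delta` and `cooldown`, it assumes a starting learning rate of 0.001, and the validation losses are invented for illustration. `simulate_reduce_lr` is a hypothetical helper.

```python
def simulate_reduce_lr(val_losses, lr=0.001, factor=0.3, patience=2, min_lr=0.0):
    """Return the learning rate in effect after each epoch."""
    best = float('inf')
    wait = 0
    lrs = []
    for loss in val_losses:
        if loss < best:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the learning rate
                lr = max(lr * factor, min_lr)
                wait = 0
        lrs.append(lr)
    return lrs

# Invented validation losses for illustration only
lrs = simulate_reduce_lr([0.5, 0.4, 0.41, 0.42, 0.39])
print(lrs)
```

In this made-up run the loss fails to improve in epochs 3 and 4, so the learning rate is multiplied by 0.3 after epoch 4.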

In [25]:
plt.plot(CNN_pred.history['acc'])
plt.plot(CNN_pred.history['val_acc'])
plt.legend(['training accuracy', 'validation accuracy'])
plt.title('Accuracy')
plt.xlabel('Epochs')

**With the CNN and data augmentation, the training accuracy and validation accuracy are very close to each other, which means that the model generalizes comparatively well**
* After running for 30 epochs, the training accuracy is around 99.5% and the validation accuracy is around 99.3%

**The loss on the validation set is almost unchanged; therefore, the model might have already converged, and the accuracy might have reached the maximum capacity of this model**

**Evaluate the CNN model**

In [27]:
# Get the predicted class probabilities
CNN_test_pred = CNN.predict_proba(X_test)
# Take the index of the maximum probability in each row => the predicted class
CNN_test_pred = CNN_test_pred.argmax(axis = 1)
CNN.evaluate(X_test, y_test)

**The accuracy on the test set is 99.3%, which is about 2 percentage points higher than the baseline**

In [28]:
test_pred = CNN.predict_proba(test)

# select the index with the maximum probability
test_pred = np.argmax(test_pred,axis = 1)

test_pred = pd.Series(test_pred,name="Label")

# ImageId runs from 1 to the number of test images
sub_index = pd.Series(range(1, len(test_pred) + 1), name='ImageId')

test_pred = pd.concat([sub_index, test_pred],axis = 1)

test_pred.to_csv("submission.csv",index=False)

**Analyze Error Cases**
* Confusion Matrix
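
What the confusion matrix counts can be seen in a small numpy sketch. `confusion_counts` is a hypothetical helper written for illustration (the notebook itself uses `confusion_matrix` from sklearn), and the labels below are made up.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """cm[i, j] = number of examples with true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # Add 1 at position (true, predicted) for every example
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Made-up labels for three classes
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1])

cm = confusion_counts(y_true, y_pred, 3)
print(cm)
```

Correct predictions land on the diagonal; every off-diagonal entry is a specific kind of mistake, such as a true 1 predicted as a 2 in this example.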

In [43]:
from sklearn.metrics import confusion_matrix
import itertools

def plot_confusion_matrix(cm, classes, normalize = False,
                         title = 'Confusion matrix', cmap = plt.cm.Blues):
    plt.imshow(cm,interpolation = 'nearest', cmap = cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    # first parameter is where to place the x ticks
    # second parameter is the actual labels, here the classes 0 - 9
    plt.xticks(tick_marks, classes, rotation = 45)
    plt.yticks(tick_marks,classes)
    
    if normalize:
        cm = cm.astype('float32') / cm.sum(axis = 1)[:, np.newaxis]
        
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        # display the count in each cell of the matrix
        plt.text(j,i, cm[i,j],
                horizontalalignment = 'center',
                color = 'white' if cm[i,j] > thresh else 'black')
    
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [44]:
val_pred = CNN.predict_proba(val_train)
val_pred_class = np.argmax(val_pred, axis = 1)
val_true = np.argmax(val_cat, axis=1)
cm = confusion_matrix(val_true, val_pred_class)
plot_confusion_matrix(cm,classes = range(10))

**Based on the confusion matrix, there are 5 cases of a 7 being predicted as a 2**

**Visualize the error cases**

In [45]:
errors = (val_pred_class - val_true != 0)

In [47]:
# errors is a boolean array, where True marks a misclassified example
# val_pred_class[errors] returns the predicted (wrong) labels of those
# examples
val_pred_class_errors = val_pred_class[errors]
# predicted probability vectors of the misclassified examples
val_pred_errors = val_pred[errors]

In [49]:
val_true_errors = val_true[errors]
val_train_errors = val_train[errors]

In [53]:
def show_errors(error_index, img_errors, pred_errors, obs_errors):
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows,ncols,sharex=True,sharey=True)
    for row in range(nrows):
        for col in range(ncols):
            error = error_index[n]
            ax[row, col].imshow((img_errors[error]).reshape((28,28)))
            ax[row, col].set_title(f'Predicted label:{pred_errors[error]}\n True label:{obs_errors[error]}')
            n += 1
    

In [54]:
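# Probability assigned to the (wrong) predicted class of each error
val_pred_errors_prob = np.max(val_pred_errors, axis = 1)
# Probability assigned to the true class of each error
true_prob_errors = np.diagonal(np.take(val_pred_errors, val_true_errors, axis = 1))
# Difference between the predicted-class and true-class probabilities
delta_pred_true_errors = val_pred_errors_prob - true_prob_errors
# Sort the errors by that difference in ascending order
sorted_delta_errors = np.argsort(delta_pred_true_errors)

# Keep the 6 errors with the largest difference
most_important_errors = sorted_delta_errors[-6:]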

show_errors(most_important_errors, val_train_errors, val_pred_class_errors, val_true_errors)
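
The ranking used above can be reproduced on a small made-up example. The sketch also shows an alternative way to pick the true-class probability with integer indexing, which gives the same values as the `np.take` plus `np.diagonal` combination without building an intermediate square matrix. All numbers are invented for illustration.

```python
import numpy as np

# Made-up probability vectors of three misclassified examples (three classes)
probs = np.array([[0.60, 0.30, 0.10],   # predicted 0, true class 1
                  [0.05, 0.05, 0.90],   # predicted 2, true class 0
                  [0.40, 0.45, 0.15]])  # predicted 1, true class 0
true = np.array([1, 0, 0])

# Probability of the predicted class and of the true class, per example
pred_prob = probs.max(axis=1)
true_prob = probs[np.arange(len(true)), true]

# Same values as the np.take / np.diagonal approach
assert np.allclose(true_prob, np.diagonal(np.take(probs, true, axis=1)))

# A large difference means the model was confidently wrong
delta = pred_prob - true_prob
order = np.argsort(delta)  # ascending, so the worst errors come last
print(order)
```

Taking the last entries of `order` therefore selects the errors where the model was most confident in the wrong class.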

**References:**

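[https://www.kaggle.com/yassineghouzam/introduction-to-cnn-keras-0-997-top-6/log](https://www.kaggle.com/yassineghouzam/introduction-to-cnn-keras-0-997-top-6/log)

[http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py](http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py)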