# Digit Recognizer

## Table of Contents  

1. [Introduction](#section-1)
2. [Exploratory Data Analysis](#section-2)
3. [Data Preprocessing](#section-3)
4. [Machine Learning Models](#section-4)  
    4.1. [Neural Network with Densely Connected Layers](#section-4.1)  
    4.2. [Convolutional Neural Network](#section-4.2)  
5. [Final Comments](#section-5)

---
## Introduction <a id="section-1"></a>

This project classifies images of handwritten digits into one of 10 categories (0 to 9). To do so, I will train two neural networks that differ in the type of layers used:

 * A network containing only **densely connected** layers
 * A network containing a mix of **convolutional and densely connected** layers
 
I will then evaluate the performance of the two models on the test set to see which one gives the higher accuracy. To begin, I will give an overview of the dataset.

---
## Exploratory Data Analysis <a id="section-2"></a>

In [None]:
import pandas as pd

# storing the test and training data into variables
train_data = pd.read_csv("/kaggle/input/digit-recognizer/train.csv")
test_data = pd.read_csv("/kaggle/input/digit-recognizer/test.csv")
train_data.head()

In [None]:
train_data.shape

The training data contains 42000 observations (images in this case) and 785 attributes for each observation. The attributes and their meanings are as listed below:  

**label**: Ground truth of the handwritten digit, classified into 10 classes (0 to 9)  
**pixeln**: The nth pixel of the image, with a value that gives the grey level of that pixel

Each image has dimensions 28 x 28 x 1 (**image height**, **image width** and **image channels** respectively), so the 784 pixel columns plus the label column account for the 785 attributes.
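To make the flat layout concrete, here is a minimal sketch of how a pixel column maps back to a position in the 28 x 28 grid. It assumes row-major order (pixel `n` sits at row `n // 28`, column `n % 28`, both zero-indexed), which is also what the `reshape(-1, 28, 28)` call in the visualization cell relies on. The helper name `pixel_position` is made up for illustration.

```python
IMAGE_HEIGHT = 28
IMAGE_WIDTH = 28


def pixel_position(n, width=IMAGE_WIDTH):
    """Return the (row, column) of flat pixel index n, assuming row-major order."""
    return divmod(n, width)


print(pixel_position(0))    # top-left corner
print(pixel_position(27))   # last pixel of the first row
print(pixel_position(28))   # first pixel of the second row
print(pixel_position(783))  # bottom-right corner
```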

In [None]:
# plotting the distribution of the labels
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x = train_data['label'])
plt.title("Distribution of labels")
plt.show()

In [None]:
# creating a count table for labels
print("=== Label ===")
label_count = train_data['label'].value_counts().sort_values().reset_index()
label_count.columns = ["label", "counts"]
label_count = label_count.sort_values(by="label")
print(label_count.to_string(index=False))

The labels in the training set are roughly balanced: label 5 appears the least (3795 times) and label 1 appears the most (4684 times), against 4200 per class for a perfectly uniform distribution.
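As a quick check on how far the classes are from uniform, the counts quoted above can be turned into shares of the dataset. This is plain arithmetic on those two counts and the 42000 total; the variable names are only illustrative.

```python
total_images = 42000
least_common = 3795  # count for label 5
most_common = 4684   # count for label 1

uniform_share = 1 / 10                        # share per class if perfectly balanced
least_share = least_common / total_images
most_share = most_common / total_images
imbalance_ratio = most_common / least_common  # most frequent vs least frequent class

print(f"Uniform share:       {uniform_share:.2%}")
print(f"Least common share:  {least_share:.2%}")
print(f"Most common share:   {most_share:.2%}")
print(f"Most/least ratio:    {imbalance_ratio:.2f}")
```

Both shares sit within about one percentage point of the uniform 10%, which is why plain accuracy is a reasonable metric here.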

In [None]:
# visualizing the dataset
train_data_reshaped = train_data.drop(['label'], axis=1).values.reshape(-1, 28, 28)
plt.figure(figsize=(10, 15))
for i in range(10):
    plt.subplot(5, 5, i+1)
    plt.grid(False)
    plt.imshow(train_data_reshaped[i])
    plt.xlabel(train_data['label'].iloc[i])

The first 10 images and their respective labels are plotted above.

In [None]:
# getting the range of pixel values (label column excluded)
pixel_values = train_data.drop(['label'], axis=1)
max_channel = pixel_values.max().max()
min_channel = pixel_values.min().min()
print(f"Minimum channel: {min_channel}")
print(f"Maximum channel: {max_channel}")

Pixel values range from 0 to 255, which represents the greyscale intensity. I will now proceed with preprocessing the data.

---
## Data Preprocessing <a id="section-3"></a>

In [None]:
# extracting labels from training data
train_labels = train_data['label']
train_data.drop(['label'], axis=1, inplace= True)

In [None]:
# scale pixel values from [0, 255] to [0, 1]
def normalize_coef(data):
    return data.astype("float32") / 255

train_data = normalize_coef(train_data)
test_data = normalize_coef(test_data)
normalized_max_channel = max(train_data.max())
normalized_min_channel = min(train_data.min())
print(f"Normalized minimum channel: {normalized_min_channel}")
print(f"Normalized maximum channel: {normalized_max_channel}")

I scaled the pixel values of both the training and test sets so that they lie between 0 and 1. Small inputs on a common scale generally make gradient-based training better behaved than raw values between 0 and 255.

In [None]:
from sklearn.model_selection import train_test_split

# splitting the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(train_data, train_labels, train_size = 0.80, random_state = 42)
print(f"Number of samples in training set: {X_train.shape[0]}")
print(f"Number of samples in validation set: {X_val.shape[0]}")

I split the training data into training and validation sets in an 80/20 split. The validation set will be used to tune the hyperparameters of the model later on.

In [None]:
from keras.utils import to_categorical

y_train = to_categorical(y_train)
y_val = to_categorical(y_val)

Since there are only 10 possible labels, I one-hot encoded them so that each label becomes a vector of length 10, which is the format expected by the categorical cross-entropy loss used in the models.
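To show what the encoded labels look like, here is a standard-library sketch of one-hot encoding. It illustrates the idea only and is not how Keras implements `to_categorical`; `one_hot` is a hypothetical helper.

```python
def one_hot(label, num_classes=10):
    """Return a list of length num_classes with 1.0 at index `label` and 0.0 elsewhere."""
    encoded = [0.0] * num_classes
    encoded[label] = 1.0
    return encoded


# The digit 3 becomes a vector whose fourth entry (index 3) is 1.0.
print(one_hot(3))
```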

---
## Machine Learning Models <a id="section-4"></a>

### Neural Network with Densely Connected Layers<a id="section-4.1"></a>

For this portion, I will be trying out 3 neural networks with different hyperparameters (hidden units are called `num_hidden_inputs` in the code), namely:  
   1. Neural Network with 1 hidden layer and 256 hidden units  
   2. Neural Network with 1 hidden layer and 512 hidden units  
   3. Neural Network with 2 hidden layers and 512 hidden units per hidden layer  

Each network will be trained for 20 epochs using a batch size of 128. I will use the ReLU activation function for the hidden layers so that the network can model non-linear relationships, and the softmax activation function for the output layer to generate a probability distribution over the 10 labels. Dropout (with a rate of 0.3) is applied after each hidden layer to reduce overfitting. Lastly, I will evaluate the performance of each network on the validation set using the accuracy metric.
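For readers unfamiliar with softmax, the following standard-library sketch shows how 10 raw output scores become a probability distribution. The input scores are made up for illustration, and the function is a plain re-implementation of the formula rather than the Keras layer itself.

```python
import math


def softmax(logits):
    """Map raw scores to positive values that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp().
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for the 10 digit classes.
scores = [2.0, 1.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
probs = softmax(scores)
predicted = probs.index(max(probs))  # same role as np.argmax on the model output

print([round(p, 3) for p in probs])
print(f"Predicted class: {predicted}")
```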

In [None]:
from keras import models
from keras import layers

# creating a function that outputs a new model using the given parameters
def build_model(num_hidden_inputs, num_hidden_layers):
    model = models.Sequential()
    for i in range(num_hidden_layers):
        if i == 0:
            model.add(layers.Dense(num_hidden_inputs, activation="relu", input_shape=(784,)))
        else:
            model.add(layers.Dense(num_hidden_inputs, activation="relu"))
        model.add(layers.Dropout(0.3))
    model.add(layers.Dense(10, activation="softmax"))
    model.compile(optimizer="rmsprop",
                 loss="categorical_crossentropy",
                 metrics=["accuracy"])
    return model

In [None]:
# note: fit() returns a History object, so model1, model2 and model3 hold training histories
# neural network with 1 hidden layer and 256 hidden inputs
model1 = build_model(256, 1).fit(X_train, y_train, epochs=20, batch_size=128, validation_data=(X_val, y_val), verbose=0)

# neural network with 1 hidden layer and 512 hidden inputs
model2 = build_model(512, 1).fit(X_train, y_train, epochs=20, batch_size=128, validation_data=(X_val, y_val), verbose=0)

# neural network with 2 hidden layers and 512 hidden inputs per hidden layer
model3 = build_model(512, 2).fit(X_train, y_train, epochs=20, batch_size=128, validation_data=(X_val, y_val), verbose=0)

In [None]:
# extracting the validation accuracy values from each model
model1_val_acc = model1.history['val_accuracy']
model2_val_acc = model2.history['val_accuracy']
model3_val_acc = model3.history['val_accuracy']

# plotting validation accuracies against epoch
epochs = range(1, 21)
plt.plot(epochs, model1_val_acc, label="Model1 Val Acc")
plt.plot(epochs, model2_val_acc, label="Model2 Val Acc")
plt.plot(epochs, model3_val_acc, label="Model3 Val Acc")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

Based on the graph, model 3 seems to perform the best, as it has the highest validation accuracy of the 3 models. The graph also shows that performance on the validation set seems to deteriorate after the 17th epoch, likely due to overfitting on the training set. I will now train model 3 on all the available training data (training set + validation set) for a total of 17 epochs, and finally measure its accuracy on the test set.
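The epoch count was read off the plot, but the same pick can be made directly from the recorded accuracies. The sketch below is only illustrative: `best_epoch` is a hypothetical helper and the list is made-up data, not output from this notebook. In the notebook it would be called on `model3_val_acc`.

```python
def best_epoch(val_accuracies):
    """Return the 1-based epoch with the highest validation accuracy (first one on ties)."""
    best_index = max(range(len(val_accuracies)), key=lambda i: val_accuracies[i])
    return best_index + 1


# Made-up values for illustration only.
example_val_acc = [0.950, 0.962, 0.971, 0.969, 0.968]
print(best_epoch(example_val_acc))
```

Since validation curves are noisy, a single maximum can be misleading, so it is worth checking the pick against the plot.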

In [None]:
# training the final model
final_nn = build_model(512, 2)
final_nn.fit(train_data, to_categorical(train_labels), epochs=17, batch_size=128) 

In [None]:
# getting the predictions on the test set
final_nn_predictions = final_nn.predict(test_data)

In [None]:
import numpy as np

final_nn_predicted_classes = np.argmax(final_nn_predictions, axis=1)
ImageId = list(range(1, len(final_nn_predicted_classes) + 1))
submissions = pd.DataFrame({"ImageId": ImageId,
                           "Label": final_nn_predicted_classes})
submissions.to_csv("submission_nn.csv", index=False, header=True)

The model achieves 98.028% accuracy on the test set. Now, I will build a CNN and compare its performance against this model.

### Convolutional Neural Network<a id="section-4.2"></a>

For the convolutional neural network, I will use 3 convolutional layers with 32, 64 and 64 filters respectively, followed by a densely connected layer with 64 hidden units. I also included MaxPooling2D layers after the first two convolutional layers to downsample the feature maps, which helps the model learn the spatial hierarchy of features. The output of the model has the same format as the output of the previous models.
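It can help to trace how the spatial size of a 28 x 28 image shrinks through this stack. The sketch below assumes the Keras defaults used in this notebook: 3 x 3 convolutions with `padding="valid"` and stride 1, and 2 x 2 max pooling with a stride equal to the pool size. The helper names are made up for illustration.

```python
def conv_output(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_output(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool


size = 28
trace = [size]
# Layer order described above: conv, pool, conv, pool, conv.
for layer in ["conv", "pool", "conv", "pool", "conv"]:
    size = conv_output(size) if layer == "conv" else pool_output(size)
    trace.append(size)

# The last convolutional layer has 64 filters, so Flatten yields size * size * 64 values.
flattened = size * size * 64

print(trace)
print(flattened)
```

Under these assumptions the dense layer receives 576 inputs rather than the 784 raw pixels fed to the dense-only network.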

In [None]:
# building the CNN

def build_cnn():
    cnn_model = models.Sequential()
    cnn_model.add(layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28,28,1)))
    cnn_model.add(layers.MaxPooling2D((2, 2)))
    cnn_model.add(layers.Conv2D(64, (3, 3), activation="relu"))
    cnn_model.add(layers.MaxPooling2D((2, 2)))
    cnn_model.add(layers.Conv2D(64, (3, 3), activation="relu"))
    cnn_model.add(layers.Flatten())
    cnn_model.add(layers.Dense(64, activation="relu"))
    cnn_model.add(layers.Dense(10, activation="softmax"))
    cnn_model.compile(optimizer="rmsprop",
                 loss="categorical_crossentropy",
                 metrics=["accuracy"])
    return cnn_model

In [None]:
# reshaping the dataframes into 4D tensors (-1 infers the number of samples)
X_train_cnn = X_train.values.reshape((-1, 28, 28, 1))
X_val_cnn = X_val.values.reshape((-1, 28, 28, 1))
cnn_model = build_cnn()
cnn_model.fit(X_train_cnn, y_train, epochs=20, batch_size=64, validation_data=(X_val_cnn, y_val))

In [None]:
# extracting the validation accuracy values of the model
cnn_val_acc = cnn_model.history.history['val_accuracy']

# plotting validation accuracies against epoch
epochs = range(1, 21)
plt.plot(epochs, cnn_val_acc, label="CNN Val Acc")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

The model shows no significant improvement on the validation data after the 10th epoch, so I will now train the final CNN model on all available training data for 10 epochs.

In [None]:
# training the final cnn model on all available training data
cnn_train_data = train_data.values.reshape((-1, 28, 28, 1))
final_cnn = build_cnn()
final_cnn.fit(cnn_train_data, to_categorical(train_labels), epochs=10, batch_size=64)

In [None]:
# getting the predictions on the test set
test_data_reshaped = test_data.values.reshape((-1, 28, 28, 1))
final_cnn_predictions = final_cnn.predict(test_data_reshaped)

In [None]:
final_cnn_predicted_classes = np.argmax(final_cnn_predictions, axis=1)
cnn_submissions = pd.DataFrame({"ImageId": ImageId,
                           "Label": final_cnn_predicted_classes})
cnn_submissions.to_csv("submission_cnn.csv", index=False, header=True)

The model achieves 99.017% accuracy on the test set.
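Accuracies this close to 100% are easier to compare as error rates. The following is plain arithmetic on the two test accuracies reported in this notebook (98.028% for the dense network and 99.017% for the CNN); the variable names are illustrative.

```python
dense_accuracy = 98.028  # dense-only network, in percent
cnn_accuracy = 99.017    # CNN, in percent

dense_error = 100 - dense_accuracy
cnn_error = 100 - cnn_accuracy
# Fraction of the dense network's errors that the CNN removes.
relative_reduction = (dense_error - cnn_error) / dense_error

print(f"Dense error rate: {dense_error:.3f}%")
print(f"CNN error rate:   {cnn_error:.3f}%")
print(f"Relative reduction in errors: {relative_reduction:.1%}")
```

Seen this way, a gap of about one percentage point in accuracy corresponds to roughly half as many misclassified images.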

---
## Final Comments <a id="section-5"></a>

Overall, the CNN has a slightly higher accuracy than the network with only dense layers (99.017% vs 98.028%), at the cost of longer computation time. To further improve accuracy, I could tune the hyperparameters of the model more thoroughly and use k-fold cross-validation so that the evaluation does not depend on a single validation split.
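As a rough illustration of the k-fold idea, the sketch below splits the 42000 training indices into 5 folds using only the standard library. It is not part of the notebook's pipeline: `k_fold_indices` is a hypothetical helper, the folds are contiguous and unshuffled, and in practice scikit-learn's `KFold` with `shuffle=True` would be the more usual choice.

```python
def k_fold_indices(num_samples, k=5):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [num_samples // k + (1 if i < num_samples % k else 0) for i in range(k)]
    start = 0
    for fold_size in fold_sizes:
        stop = start + fold_size
        val_indices = list(range(start, stop))
        train_indices = list(range(0, start)) + list(range(stop, num_samples))
        yield train_indices, val_indices
        start = stop


folds = list(k_fold_indices(42000, k=5))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: {len(train_idx)} training, {len(val_idx)} validation")
```

Each fold would train a fresh model on its training indices and score it on its validation indices, and the k scores would then be averaged.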