# Introduction

This is my entry for Kaggle's [Digit Recognizer Competition](https://www.kaggle.com/c/digit-recognizer). 
In this challenge, I will use computer vision techniques, specifically image classification, to build a model that recognizes hand-written digits from the famous MNIST dataset. 
As references, I will draw on [Dr. Ghouzam's notebook](https://www.kaggle.com/yassineghouzam/introduction-to-cnn-keras-0-997-top-6) and [DataAI's notebook](https://www.kaggle.com/kanncaa1/recurrent-neural-network-with-pytorch/data#INTRODUCTION) to build the model in PyTorch.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# The Data

Before building any machine learning model, it is important to analyze and understand the data we are given.
In particular, we should know how much data is missing and whether there are outliers.
Fortunately, a thorough description of the dataset is given in the competition description.

I am using the MNIST dataset, described as the "hello world" dataset for computer vision. 

In [13]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

print('Training Set Shape:', train.shape)
num_train = train.shape[0]
num_feats = train.shape[1]

print('Testing Set Shape:', test.shape)
num_test = test.shape[0]

Training Set Shape: (42000, 785)
Testing Set Shape: (28000, 784)


In [14]:
train.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
test.head()

Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


As shown, there are 42,000 training examples and 28,000 testing examples. As described on the competition's homepage, each example is a 28px by 28px gray-scale image of a hand-written digit. Each image therefore has 784 pixels, and thus 784 features. The training set has one extra column, `label`, which comes first (see the output of `train.head()`) and holds the digit that the hand-written image represents. 
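To make the column layout concrete, here is a minimal sketch of separating the label from the pixels and reshaping each row into a 28x28 image. The `toy` DataFrame is a hypothetical stand-in with the same columns as `train.csv` (random pixel values, made-up labels), so the snippet runs without the competition files; it is not the preprocessing used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv: a `label` column followed by pixel0..pixel783.
rng = np.random.default_rng(0)
pixel_cols = [f"pixel{k}" for k in range(784)]
toy = pd.DataFrame(rng.integers(0, 256, size=(3, 784)), columns=pixel_cols)
toy.insert(0, "label", [1, 0, 4])  # made-up labels, placed first as in train.csv

# Split the label column from the 784 pixel features.
labels = toy["label"].to_numpy()
pixels = toy.drop(columns=["label"]).to_numpy()

# Each row of 784 values becomes one 28x28 image (filled row by row).
images = pixels.reshape(-1, 28, 28)

print(labels.shape, pixels.shape, images.shape)
```

With the real data, replacing `toy` with `train` should work the same way, assuming the columns are ordered as in the output above.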

Each feature (or pixel) is an integer between 0 and 255, inclusive, indicating how light or dark that specific pixel is. Darker pixels have larger numbers.

**Locating a pixel**

Say we want to locate pixel $x$, where $0 \leq x \leq 783$. To do so, we have to solve this equation: $$x = i \times 28 + j$$ where $0 \leq i \leq 27$ and $0 \leq j \leq 27$. Here, $i$ is the row index and $j$ is the column index of a 28x28 pixel image, both counted from zero. 

So, if I wanted to find `pixel543`, I would find it at row $i = 19$ and column $j = 11$ (counting from zero), since $543 = 19 \times 28 + 11$.
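The same calculation can be sketched in Python with integer division and remainder. The helper names `pixel_to_row_col` and `row_col_to_pixel` are hypothetical and chosen only for this illustration.

```python
WIDTH = 28  # images are 28 pixels wide and 28 pixels tall

def pixel_to_row_col(x, width=WIDTH):
    """Return the zero-indexed (row, column) of flat pixel index x."""
    if not 0 <= x < width * width:
        raise ValueError(f"pixel index must be between 0 and {width * width - 1}")
    # divmod gives (x // width, x % width), i.e. (i, j) in x = i * width + j
    return divmod(x, width)

def row_col_to_pixel(i, j, width=WIDTH):
    """Inverse mapping: flat pixel index from zero-indexed row i and column j."""
    return i * width + j

i, j = pixel_to_row_col(543)
print(i, j)                    # 19 11
print(row_col_to_pixel(i, j))  # 543
```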


--- 
Now, let's visualize the distributions of the data.

## Looking at the training set

Are there any missing values in the training set?

In [33]:
# Looking for missing labels
print('No. of missing labels:', train['label'].isnull().sum())

# Looking for missing pixel values
print('No. of missing pixels:', train.drop(columns=['label']).isnull().sum().sum())

No. of missing labels: 0
No. of missing pixels: 0


What does the distribution of labels look like? 

Why do we care about this? An unbalanced dataset can cause many problems for our image classification model. If some labels have far more examples than others, the model can become biased toward the majority class and categorize more images into that class. This can lead to a false sense of high accuracy. 

**Recall that accuracy is the number of correct predictions over the total number of predictions**, which, for a two-class problem, can be calculated by: $$\frac{TP + TN}{N}$$ where $TP$ is the number of true positives, $TN$ is the number of true negatives, and $N$ is the total number of observations. The model may seem to perform well, but in reality it may simply be biased toward the majority class and be predicting that class almost exclusively. 
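As a quick illustration of the formula, the sketch below computes accuracy on a small made-up set of binary labels in two ways: from the $TP$ and $TN$ counts, and by directly counting correct predictions. The labels, predictions, and function names are hypothetical. The second form is the one that carries over to a ten-class problem such as digit recognition, where accuracy is simply correct predictions over total predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def accuracy_from_counts(tp, tn, n):
    """Binary accuracy from true positives, true negatives, and total count."""
    return (tp + tn) / n

# Made-up binary labels (1 = positive, 0 = negative) and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

print(accuracy_from_counts(tp, tn, len(y_true)))  # 0.75
print(accuracy(y_true, y_pred))                   # 0.75
```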

For example, say there is a dataset of 100 observations, where 90 are labeled as *A*, and 10 are labeled as *B*. Obviously, we have an unbalanced dataset. Now suppose we train a model on this dataset without taking the necessary precautions to deal with the imbalance. If 