# MNIST: learning to recognize handwritten digits

## Dataset exploration

Before starting a machine learning or data science task, it is always useful to familiarize yourself with the dataset and its context.

### Required imports

In [None]:
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
from tensorflow import keras
# Import the loader from the same Keras that ships with TensorFlow.
from tensorflow.keras.datasets import mnist

%matplotlib inline

### Obtaining the dataset

Keras' `datasets` module provides a loader for the MNIST dataset used in this notebook. Download the training and test sets.

In [None]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

### Dimensions and types

Determine the shape and type of the training and the test set.

The training set has 60,000 examples, the test set 10,000. Each input is a $28 \times 28$ matrix of unsigned 8-bit integers; each output is a single unsigned 8-bit integer.
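One way to check these numbers is to print the `shape` and `dtype` attributes of each array. The sketch below wraps this in a small helper; the name `describe` and the stand-in arrays are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np


def describe(name, array):
    """Return a one-line summary of an array's shape and element type."""
    return f'{name}: shape = {array.shape}, dtype = {array.dtype}'


# Small stand-in arrays (hypothetical) with the same layout as MNIST.
# In the notebook, pass x_train, y_train, x_test and y_test instead.
x_demo = np.zeros((4, 28, 28), dtype=np.uint8)
y_demo = np.zeros((4,), dtype=np.uint8)
print(describe('x_demo', x_demo))
print(describe('y_demo', y_demo))
```

Applied to the real data, `describe('x_train', x_train)` should report the 60,000 training images mentioned above.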

### Data semantics

Each input represents a scanned grayscale image of a handwritten digit; the output is the corresponding integer. Visualize the first 35 training images and check their labels.

In [None]:
rows = 5
cols = 7
# Show the first rows*cols training images in a grid, hiding the axes.
figure, axes = plt.subplots(rows, cols, figsize=(5, 3))
plt.subplots_adjust(wspace=0.1, hspace=0.1)
for img_nr in range(rows*cols):
    # Grid position of image img_nr, filling the grid row by row.
    row, col = divmod(img_nr, cols)
    axes[row, col].get_xaxis().set_visible(False)
    axes[row, col].get_yaxis().set_visible(False)
    axes[row, col].imshow(x_train[img_nr], cmap='gray')

In [None]:
# Labels of the images above, arranged in the same grid layout.
y_train[:rows*cols].reshape(rows, cols)

So this proves that I'm certainly not the only one cursed with bad handwriting.

### Data distribution

An important question is whether all digits are represented in the training and test sets, and how they are distributed. A strongly skewed distribution can affect the accuracy of the trained model.

Although some digits, such as 1, are overrepresented and others, such as 5, are underrepresented, the distribution is reasonably uniform, so it is likely that no special care needs to be taken.
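The distribution can be computed with `collections.Counter`, which is already imported at the top of the notebook. This is a minimal sketch; the helper name `label_distribution` and the demo labels are assumptions made for illustration, not the original notebook's code.

```python
from collections import Counter

import numpy as np


def label_distribution(labels):
    """Return a dict mapping each label to its fraction of the examples."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}


# Hypothetical demo labels; in the notebook, pass y_train or y_test instead.
demo_labels = np.array([0, 1, 1, 2, 1, 0, 2, 1], dtype=np.uint8)
for digit, fraction in label_distribution(demo_labels).items():
    print(f'{digit}: {fraction:.3f}')
```

For a perfectly uniform distribution over the ten digits, every fraction would be 0.1, so the values for `y_train` and `y_test` can be compared against that baseline. Any digit missing from the data would simply not appear as a key.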