# Develop a CNN for MNIST Handwritten Digit Classification

The MNIST handwritten digit classification problem is a standard dataset used in computer vision and deep learning.
Although the dataset is effectively solved, it can be used as the basis for learning and practicing how to develop, evaluate, and use convolutional deep learning neural networks for image classification from scratch. 
This includes how to develop a robust test harness for estimating the performance of the model and how to explore improvements to the model.

**To focus on the network architecture, we ignore the train/validation/test split, hyper-parameter tuning, and others.**

## Outcome

In this tutorial, you will discover how to develop a convolutional neural network for handwritten digit classification from scratch. 
After completing this tutorial, you will know:
- How to develop a test harness to develop a robust evaluation of a model and establish a baseline of performance for a classification task.
- How to explore extensions to a baseline model to improve learning and model capacity.
- How to develop a CNN model.

## Tutorial Overview

This tutorial is divided into five parts.
- MNIST Handwritten Digit Classification Dataset
- Model Evaluation Methodology
- How to Develop a Baseline Model
- How to Develop an Improved Model
- How to Develop a CNN model


## MNIST Handwritten Digit Classification Dataset

The MNIST dataset is an acronym that stands for the Modified National Institute of Standards and Technology dataset.
It is a dataset of 60,000 small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9.
The task is to classify a given image of a handwritten digit into one of 10 classes representing integer values from 0 to 9, inclusively.

![](https://machinelearningmastery.com/wp-content/uploads/2019/02/Plot-of-a-Subset-of-Images-from-the-MNIST-Dataset-1024x768.png)

We can see that there are 60,000 examples in the training dataset and 10,000 in the test dataset and that images are indeed square with 28×28 pixels.

## How to Develop a Baseline Model

The first step is to develop a baseline model.
This is critical as it both involves developing the infrastructure for the test harness so that any model we design can be evaluated on the dataset, and it establishes a baseline in model performance on the problem, by which all improvements can be compared.
The design of the test harness is modular, and we can develop a separate function for each piece. 
This allows a given aspect of the test harness to be modified or inter-changed, if we desire, separately from the rest.
We can develop this test harness with five key elements. 
They are the loading of the dataset, the preparation of the dataset, the definition of the model, the evaluation of the model, and the presentation of results.

In [1]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import os
import random

seed = 12345
np.random.seed(seed)
random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

### Load Dataset

In PyTorch, you can use the torchvision.transforms module to apply the normalization transformation to the train and test datasets. Here's an example:

In [2]:
# Load the MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='./data',
                                           train=True,
                                           download=True,
                                           transform=transforms.ToTensor())
test_dataset = torchvision.datasets.MNIST(root='./data',
                                          train=False,
                                          download=True,
                                          transform=transforms.ToTensor())

# Normalize the pixel values
train_dataset.transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.1307, ), (0.3081, ))])
test_dataset.transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.1307, ), (0.3081, ))])

In a dataset, there are a lot of data sample or instances. 
You can ask the model to take one sample at a time but usually you would let the model to process one batch of several samples. 
You may create a batch by extracting a slice from the dataset, using the slicing syntax on the tensor. 
For a better quality of training, you may also want to shuffle the entire dataset on each epoch so no two batch would be the same in the entire training loop. 
Sometimes, you may introduce data augmentation to manually introduce more variance to the data. 
This is common for image-related tasks, which you can randomly tilt or zoom the image a bit to generate a lot of data sample from a few images.
You can imagine there can be a lot of code to write to do all these. 
But it is much easier with the **DataLoader**.

In [3]:
batch_size = 32
train_dataloader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=batch_size,
                                               shuffle=True,
                                               num_workers=4)
test_dataloader = torch.utils.data.DataLoader(train_dataset,
                                              batch_size=batch_size,
                                              shuffle=True,
                                              num_workers=4)

### Define a FCN Model

We define a MLP with a SGD optimizer.

In [4]:
mlp = nn.Sequential(nn.Linear(784, 10), nn.Softmax(dim=1))

lr = 0.1
num_inputs, num_outputs = 784, 10
optimizer = torch.optim.SGD(mlp.parameters(), lr=lr)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(5):
    epoch_loss = []
    for i_batch, sample_batched in enumerate(train_dataloader):
        inputs, labels = sample_batched
        inputs = inputs.reshape(-1, num_inputs)
        prob_distr = mlp(inputs)
        loss = criterion(prob_distr, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    accu_number = 0.
    for X, y in test_dataloader:
        inputs = X.reshape(-1, num_inputs)
        predicted_class = torch.argmax(mlp(inputs), dim=1)
        accu_number += torch.sum(predicted_class == y)
    print(
        f'{epoch+1}: testing accuracy: {accu_number / len(test_dataloader.dataset):0.4f}'
    )

1: testing accuracy: 0.9129
2: testing accuracy: 0.9201
3: testing accuracy: 0.9255
4: testing accuracy: 0.9265
5: testing accuracy: 0.9300


This is a traditional shadow model with low accuracy.
One may attempt to improve it by adding more layers.

### Define a Deep FCN

The model has two main aspects: the feature extraction front end comprised of convolutional and pooling layers, and the classifier backend that will make a prediction.

Given that the problem is a multi-class classification task, we know that we will require an output layer with 10 nodes in order to predict the probability distribution of an image belonging to each of the 10 classes. 
This will also require the use of a softmax activation function. 
Between the feature extractor and the output layer, we can add a dense layer to interpret the features, in this case with 100 nodes.

In [5]:
dnn = nn.Sequential(
    nn.Flatten(start_dim=1, end_dim=-1),
    nn.Linear(784, 100),
    nn.ReLU(),
    nn.Linear(100, 10),
    nn.Softmax(dim=1)
)

lr = 0.1
num_inputs, num_outputs = 784, 10
optimizer = torch.optim.SGD(dnn.parameters(), lr=lr)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(5):
    epoch_loss = []
    for i_batch, sample_batched in enumerate(train_dataloader):
        inputs, labels = sample_batched
        prob_distr = dnn(inputs)
        loss = criterion(prob_distr, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    accu_number = 0.
    for X, y in test_dataloader:
        inputs = X.reshape(-1, num_inputs)
        predicted_class = torch.argmax(dnn(inputs), dim=1)
        accu_number += torch.sum(predicted_class == y)
    print(
        f'{epoch+1}: testing accuracy: {accu_number / len(test_dataloader.dataset):0.4f}'
    )

1: testing accuracy: 0.9334
2: testing accuracy: 0.9484
3: testing accuracy: 0.9619
4: testing accuracy: 0.9682
5: testing accuracy: 0.9725


If we have more layers, i.e., a deeper model, the performance is (slightly) improved.
But it is very slow to train as there are so many parameters.
In addition, it is hard to train as there may be more unexpectation.
Compared with the effort, the improvement is not significant.

## How to Develop an Improved Model

There are many ways to change the model configuration in order to explore improvements over the baseline models.
We will implement the very famous **LeNet-5** by adding convolutions.

In [6]:
lenet = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=10, kernel_size=5, stride=1),
    nn.ReLU(), nn.MaxPool2d(kernel_size=2), nn.ReLU(),
    nn.Conv2d(in_channels=10, out_channels=20, kernel_size=5, stride=1),
    nn.ReLU(), nn.MaxPool2d(kernel_size=2), nn.Flatten(start_dim=1,
                                                       end_dim=-1),
    nn.Linear(in_features=20 * 4 * 4, out_features=100), nn.ReLU(),
    nn.Linear(in_features=100, out_features=10), nn.Softmax(dim=1))

lr = 0.1
optimizer = torch.optim.SGD(lenet.parameters(), lr=lr)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(5):
    epoch_loss = []
    for i_batch, sample_batched in enumerate(train_dataloader):
        inputs, labels = sample_batched
        prob_distr = lenet(inputs)
        loss = criterion(prob_distr, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    accu_number = 0.
    for X, y in test_dataloader:
        predicted_class = torch.argmax(lenet(X), dim=1)
        accu_number += torch.sum(predicted_class == y)
    print(
        f'{epoch+1}: testing accuracy: {accu_number / len(test_dataloader.dataset):0.4f}'
    )

1: testing accuracy: 0.9596
2: testing accuracy: 0.9694
3: testing accuracy: 0.9782
4: testing accuracy: 0.9831
5: testing accuracy: 0.9871


## Summary

In this tutorial, you discovered how to develop a convolutional neural network for handwritten digit classification from scratch.

- How to develop a test harness to develop a robust evaluation of a model and establish a baseline of performance for a classification task.
- How to explore extensions to a baseline model to improve learning and model capacity.
- How to develop a CNN model.