# Introduction to Neural Networks with PyTorch on the MNIST Dataset

In this live coding session, we build and train a simple neural network with PyTorch. We work with the MNIST dataset, a classic machine learning benchmark containing 70,000 grayscale images of handwritten digits (60,000 for training and 10,000 for testing), each 28x28 pixels.
https://en.wikipedia.org/wiki/MNIST_database


In [1]:
# Environment Setup
# Run this cell to install the required packages if you haven't already.


# !pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --extra-index-url https://download.pytorch.org/whl/cu117 
# !pip install matplotlib seaborn scikit-learn torchview


In [2]:
# Importing necessary libraries
# matplotlib.pyplot: Matplotlib interface for creating plots
# numpy: array library with support for large, multidimensional arrays and mathematical functions that operate on them
# seaborn: statistical data visualization library built on matplotlib
# torch: PyTorch library for tensor computations and deep learning
# torchvision: PyTorch library for computer vision tasks
# transforms: torchvision module for common image transformations
# confusion_matrix: scikit-learn function for computing a confusion matrix
# nn, optim: PyTorch modules for neural networks and optimization algorithms
# random_split: PyTorch function for randomly splitting a dataset into non-overlapping subsets
# draw_graph: torchview function for visualizing neural network architectures (optional; its import is commented out)

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import torch
import torchvision
import torchvision.transforms as transforms
from sklearn.metrics import confusion_matrix
from torch import nn, optim
from torch.utils.data import random_split
# from torchview import draw_graph

# Checking for a GPU
# torch.device: Returns a device object representing the device on which a torch.Tensor is or will be allocated.
# torch.cuda.is_available: Returns a bool indicating if CUDA is currently available.

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'Using device: {device}')


# Loading the MNIST Dataset

The MNIST dataset comes prepackaged with PyTorch's `torchvision` module. We'll download the dataset and set up `DataLoader` instances to batch and shuffle the data for us.


In [3]:
# MNIST Data Loaders

# Defining preprocessing steps for the dataset
# transforms.ToTensor: Converts a PIL Image or numpy.ndarray with integer values in [0, 255] to a float tensor of shape (C x H x W) with values in [0.0, 1.0].
# transforms.Normalize: Normalizes each channel of a tensor image: output[channel] = (input[channel] - mean[channel]) / std[channel].
# With mean 0.5 and std 0.5, pixel values in [0, 1] are mapped to [-1, 1].
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])

# Download and load the training data
train_set = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_set = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# Split the training set into training and validation sets
total_size = len(train_set)
val_size = int(total_size * 0.2)  # 20% for validation
train_size = total_size - val_size
train_set, val_set = random_split(train_set, [train_size, val_size])

# Data loaders
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=64, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64, shuffle=False)
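
To make the effect of `Normalize((0.5,), (0.5,))` concrete, here is a minimal sketch in plain Python that applies the same per-channel formula to single pixel values. The helper names `normalize` and `unnormalize` are ours, chosen for illustration; they are not part of torchvision.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Apply the per-channel formula used by transforms.Normalize to one value."""
    return (pixel - mean) / std


def unnormalize(value, mean=0.5, std=0.5):
    """Invert normalize(); with mean = std = 0.5 this equals value / 2 + 0.5."""
    return value * std + mean


for raw in (0, 128, 255):
    scaled = raw / 255  # ToTensor rescales 8-bit pixels to [0.0, 1.0]
    print(raw, round(normalize(scaled), 3))
```

Black pixels end up at -1 and white pixels at 1, and `unnormalize` undoes the mapping, which is what a display helper has to do before plotting.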


# Visualizing the Dataset


In [4]:
# Print statistics of the dataset
print(f'Training images: {len(train_set)}')
print(f'Validation images: {len(val_set)}')
print(f'Test images: {len(test_set)}')
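
As a sanity check on these numbers, the split sizes and the number of batches per epoch can be worked out by hand. This sketch assumes the standard MNIST sizes (60,000 training and 10,000 test images) and the default `DataLoader` behaviour of keeping the last, smaller batch; `num_batches` is a hypothetical helper.

```python
import math

total_size = 60_000                 # size of the MNIST training split
val_size = int(total_size * 0.2)    # same 20% rule as in the data loading cell
train_size = total_size - val_size
test_size = 10_000
batch_size = 64


def num_batches(n_samples, batch_size):
    """Batches per epoch when the final partial batch is kept."""
    return math.ceil(n_samples / batch_size)


print(train_size, val_size, test_size)
print(num_batches(train_size, batch_size),
      num_batches(val_size, batch_size),
      num_batches(test_size, batch_size))
```

`len(train_loader)` should report the same batch count as `num_batches(train_size, batch_size)`.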

In [5]:
# Visualizing some training images
images, labels = next(iter(train_loader))

# Print the shape of the images
print(f'Image shape: {images.shape}')
print(f'Label shape: {labels.shape}')

# Print the unique labels in this batch (not the whole dataset)
print(f'Unique Labels: {labels.unique()}')

In [6]:
# Function to show an image
def imshow(img):
    img = img / 2 + 0.5  # unnormalize
    np_img = img.numpy()
    plt.imshow(np.transpose(np_img, (1, 2, 0)))
    plt.show()

# Show images
imshow(torchvision.utils.make_grid(images))

# Defining the Neural Network

The model is a simple feed-forward neural network with two hidden layers and an output layer. Here's a breakdown of the structure:

1. **Input Layer**: The input to the model is a 28x28 pixel image, which is flattened into a 1D tensor of size 784 (28*28). This is done in the `forward` method of the model using the tensor's `view` method.

2. **First Hidden Layer (fc1)**: A fully connected (`nn.Linear`) layer that maps the 784-dimensional input to a 128-dimensional tensor using a 784x128 weight matrix and a bias vector of size 128. (PyTorch stores the weight transposed, so `fc1.weight` has shape [128, 784].) The ReLU activation function is applied to the output of this layer.

3. **Second Hidden Layer (fc2)**: Another fully connected layer that maps the 128-dimensional output of the previous layer to a 64-dimensional tensor using a 128x64 weight matrix and a bias vector of size 64. The ReLU activation function is applied to the output of this layer.

4. **Output Layer (fc3)**: The final fully connected layer maps the 64-dimensional output of the previous layer to a 10-dimensional tensor using a 64x10 weight matrix and a bias vector of size 10. Its output is the final output of the model: the logits for each of the 10 classes (digits 0-9). The model does not apply softmax itself; `nn.CrossEntropyLoss` applies log-softmax internally during training, and softmax can be applied to the logits afterwards when probabilities are needed.
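
The layer sizes listed above determine how many trainable parameters the network has. As a rough sketch (the helper `linear_params` is ours, for illustration), each fully connected layer contributes one weight per input-output pair plus one bias per output:

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


layer_sizes = [28 * 28, 128, 64, 10]  # input, fc1, fc2, fc3 widths from the text
per_layer = [linear_params(i, o) for i, o in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```

For an instantiated PyTorch model, `sum(p.numel() for p in model.parameters())` should give the same total.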

In [7]:
# Neural Network Definition
# nn.Module: Base class for all neural network modules.
# nn.Linear: Applies a linear transformation to the incoming data.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28*28, 128)  # 28*28 is the size of MNIST images
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)  # There are 10 classes in MNIST

    def forward(self, x):
        x = x.view(-1, 28*28)  # Flatten the image
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net().to(device)
print(net)
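
The network returns raw logits, not probabilities. The sketch below shows, in plain Python, how a softmax would turn one row of logits into a probability distribution. The logits are made up for illustration and were not produced by the model.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one image (10 classes)
example_logits = [1.2, -0.3, 0.5, 4.1, 0.0, -1.7, 0.9, 2.2, -0.8, 0.1]
probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, round(sum(probs), 6))
```

Softmax preserves the ordering of the scores, so the predicted class is the same whether we take the largest logit or the largest probability.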


# Defining a Convolutional Neural Network

The Convolutional Neural Network (CNN) consists of the following layers:

1. **Convolutional Layer (conv1)**: The first layer of the CNN. It takes a single-channel (grayscale) image as input and applies 32 filters, each of size 3x3, producing 32 feature maps. Padding of 1 pixel keeps the spatial dimensions at 28x28.

2. **Pooling Layer (pool)**: This layer reduces the spatial dimensions of its input. It uses 2x2 max pooling with stride 2, which keeps the maximum element of each 2x2 window and halves the height and width. The same pooling layer is applied after each of the two convolutional layers (28x28 to 14x14, then 14x14 to 7x7).

3. **Convolutional Layer (conv2)**: The second convolutional layer takes the 32 feature maps from the previous step as input and applies 64 filters, each of size 3x3. Padding of 1 pixel keeps the spatial dimensions at 14x14.

4. **Fully Connected Layer (fc1)**: After the second pooling step, the 64 feature maps of size 7x7 are flattened into a single vector (1D tensor) of 64*7*7 = 3136 values, which serves as input to this fully connected layer. This layer reduces the dimension from 3136 to 128.

5. **Fully Connected Layer (fc2)**: This is the output layer of the network. It takes the 128-dimensional vector from the previous layer and reduces it to a 10-dimensional vector of logits, one per class (digits 0-9). These are raw scores, not probabilities.

The ReLU activation function is applied after each convolutional and fully connected layer except for the last one. This function introduces non-linearity into the model, allowing it to learn more complex patterns. To obtain a probability distribution over the classes, the logits can be passed through a softmax function; during training, `nn.CrossEntropyLoss` does this internally.

![image.png](https://upload.wikimedia.org/wikipedia/commons/9/90/CNN-filter-animation-1.gif)
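
The `64*7*7` input size of `fc1` follows from how each layer changes the image size. The sketch below traces one spatial axis through the network using the standard output-size formula for convolution and pooling windows. The helper `conv_out` is ours, and the formula assumes no dilation.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output height/width of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28
size = conv_out(size, kernel=3, stride=1, padding=1)  # conv1 keeps 28
size = conv_out(size, kernel=2, stride=2)             # pool halves to 14
size = conv_out(size, kernel=3, stride=1, padding=1)  # conv2 keeps 14
size = conv_out(size, kernel=2, stride=2)             # pool halves to 7
flat_features = 64 * size * size                      # 64 channels after conv2
print(size, flat_features)
```

If the kernel size, padding, or number of pooling steps is changed, the input size of `fc1` has to be recomputed the same way.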

In [8]:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(64*7*7, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(-1, 64*7*7)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

cnn = CNN().to(device)
print(cnn)

# Visualizing the Neural Network Architecture

In [9]:
# model_graph = draw_graph(cnn, input_size=(64,1,28,28))
# model_graph.visual_graph

# Training the Model

We will now train our model using the training data. We will pass over the dataset multiple times (each pass is an "epoch"), updating the weights after every batch to improve the model's performance. After each epoch we also compute the loss on the validation set to monitor how well the model generalizes.
Before we train the model, we need to define a loss function and choose an optimizer. We'll use cross-entropy loss and the SGD optimizer with momentum.
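
To build some intuition for the loss, here is a minimal plain-Python sketch of cross-entropy for a single example, computed directly from logits. `nn.CrossEntropyLoss` combines log-softmax with the negative log-likelihood and, by default, averages over the batch. The function below handles one example only, and its inputs are made up.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class, computed via log-sum-exp."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]


# An untrained model that scores all 10 classes equally
uniform = cross_entropy([0.0] * 10, target=3)
# A model that is confident in the correct class
confident = cross_entropy([0.0, 0.0, 0.0, 8.0] + [0.0] * 6, target=3)
print(round(uniform, 4), round(confident, 4))
```

A model that guesses uniformly over 10 classes has a loss of ln(10), about 2.30, which is a useful reference point for the loss in the first training steps.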

In [None]:
# Training Loop
# optimizer.zero_grad: Resets the gradients of all optimized tensors.
# loss.backward: Computes the gradient of current tensor w.r.t. graph leaves.
# optimizer.step: Performs a single optimization step.

epochs = 5
train_losses = []  # Store losses here
val_losses = []  # Store validation losses here
model = cnn  # Select the model to train

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)


# Training loop
for epoch in range(epochs):
    running_loss = 0.0
    # Iterate over batches.
    for images, labels in train_loader:
        # Loading batch to device
        images, labels = images.to(device), labels.to(device)

        # Zero the gradients from previous batch
        optimizer.zero_grad()
        
        # Forward pass - make predictions and calculate loss
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward pass - compute gradient and update weights
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
    
    # Calculate average training loss for the epoch
    epoch_loss = running_loss/len(train_loader)
    train_losses.append(epoch_loss)
    
    # Validation
    running_val_loss = 0.0
    with torch.no_grad():  # No need to calculate gradients for validation, only for training
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            val_loss = criterion(outputs, labels)
            running_val_loss += val_loss.item()

    # Calculate average validation loss for the epoch
    epoch_val_loss = running_val_loss/len(val_loader)
    val_losses.append(epoch_val_loss)
    
    print(f'Epoch {epoch+1}/{epochs}, Training Loss: {epoch_loss:.4f}, Validation Loss: {epoch_val_loss:.4f}')
    
# Plot the training and validation losses
plt.plot(train_losses, label='Training loss')
plt.plot(val_losses, label='Validation loss')

# Add title, labels, and legend
plt.title('Training and Validation Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

# Draw the updated plot
plt.show()    
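
The optimizer in the training cell is SGD with `lr=0.01` and `momentum=0.9`. As a rough sketch of what one update does, the code below applies the momentum rule to a single scalar parameter on a toy problem. It follows the basic formulation (velocity = momentum * velocity + gradient) and leaves out dampening, weight decay, and Nesterov momentum. The function name and the toy objective are our assumptions.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.01, momentum=0.9):
    """One update: the velocity accumulates gradients, the parameter moves against it."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity


# Toy problem for illustration: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, v = 0.0, 0.0
for _ in range(300):
    grad = 2 * (w - 3)
    w, v = sgd_momentum_step(w, grad, v)
print(round(w, 4))
```

In the real training loop, `loss.backward()` supplies the gradients and `optimizer.step()` applies this kind of update to every parameter tensor.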

# Evaluating the Model

Let's evaluate the performance of our trained model on the test dataset, which the model has not seen during training. We will measure the accuracy of the model.


In [None]:
# Model Evaluation
# torch.no_grad: Disables gradient calculation, useful for inference (when you don't need to call Tensor.backward())
# torch.max(input, dim): Returns a (values, indices) tuple with the maximum along the given dimension; the indices are the predicted classes.

correct = 0
total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Accuracy of the network on the {total} test images: {100 * correct / total:.2f}%')

nb_classes = 10

# Initialize the prediction and label tensors
pred_list = torch.zeros(0, dtype=torch.long, device='cpu')
label_list = torch.zeros(0, dtype=torch.long, device='cpu')

with torch.no_grad():
    for inputs, classes in test_loader:
        inputs = inputs.to(device)
        classes = classes.to(device)
        outputs = model(inputs)
        _, preds = torch.max(outputs, 1)

        # Append batch prediction results
        pred_list = torch.cat([pred_list, preds.view(-1).cpu()])
        label_list = torch.cat([label_list, classes.view(-1).cpu()])

# Confusion matrix (rows: actual labels, columns: predicted labels)
conf_mat = confusion_matrix(label_list.numpy(), pred_list.numpy())
plt.figure(figsize=(10, 10))
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=range(nb_classes), yticklabels=range(nb_classes))
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
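
Beyond the heatmap, the confusion matrix can be summarized numerically. This sketch computes the per-class recall (the fraction of each actual digit that was classified correctly) and the overall accuracy from a matrix whose rows are actual labels and whose columns are predictions. The 3-class matrix is hypothetical and only keeps the example small; it is not a result from this notebook.

```python
def per_class_recall(conf_mat):
    """Fraction of each actual class (row) that was predicted correctly (diagonal)."""
    recalls = []
    for i, row in enumerate(conf_mat):
        row_total = sum(row)
        recalls.append(row[i] / row_total if row_total else 0.0)
    return recalls


# Hypothetical 3-class matrix; rows = actual, columns = predicted
toy = [[50, 2, 0],
       [3, 45, 2],
       [0, 5, 45]]
overall = sum(toy[i][i] for i in range(len(toy))) / sum(map(sum, toy))
print([round(r, 3) for r in per_class_recall(toy)], round(overall, 3))
```

The same function should work on the real matrix via `per_class_recall(conf_mat.tolist())`, which makes it easy to see which digits the model confuses most often.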

# Conclusion

We have successfully trained a neural network to recognize handwritten digits with PyTorch! We defined both a simple feed-forward network and a CNN, and trained and evaluated the CNN (selected with `model = cnn`). There are many ways we could improve this model, such as adding more layers, using different activation functions, applying more sophisticated optimizers, or implementing learning rate schedules.
