**NOTE: This notebook is written for the Google Colab platform, which provides free hardware acceleration. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook using a local GPU.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install datasets

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datasets import load_dataset
from matplotlib.colors import LogNorm
from sklearn.metrics import accuracy_score
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn
import torch

## Classifying MNIST Digits

This example will illustrate how to construct a simple convolutional network for image classification on the MNIST handwritten digits dataset.

### Loading the Dataset

We are going to start by loading the MNIST dataset. This step is very simple, because the `datasets` package from Hugging Face includes a built-in function that does it for us. We merely call the `load_dataset` function, specifying `"mnist"` as the dataset. We get a dataset that is already split into the train and test folds – you can retrieve them using `dataset['train']` and `dataset['test']`.

Each fold contains two lists, `'image'` and `'label'`; `'image'` is a list of `PIL` images, which can be cast to numpy arrays using `np.asarray`. Under `'label'`, you will find the class labels (the desired outputs).

When loading the data, we need to make sure that the tensors are properly scaled and of the correct shape. Our data is composed of $28 \times 28$ images with a single colour channel. In `PyTorch` colour channels are represented by the 1st dimension of the tensor (with the 0th dimension being the batch dimension). Our tensor's shape is `(batch, 28, 28)` and we need it to be `(batch, 1, 28, 28)` so we call `.unsqueeze(1)`.

Finally, the pixel values range from 0 to 255 – we are going to scale them into the range of 0 to 1, i.e. divide by 255.



In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
dataset = load_dataset("mnist")

X_train_np = np.asarray([np.asarray(img) for img in dataset['train']['image']])
Y_train_np = np.asarray(dataset['train']['label'])
X_test_np = np.asarray([np.asarray(img) for img in dataset['test']['image']])
Y_test_np = np.asarray(dataset['test']['label'])

X_train = torch.as_tensor(X_train_np).to(device)
Y_train = torch.as_tensor(Y_train_np).to(device)
X_test = torch.as_tensor(X_test_np).to(device)
Y_test = torch.as_tensor(Y_test_np).to(device)

X_train = X_train.unsqueeze(1) / 255.0
X_test = X_test.unsqueeze(1) / 255.0
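
As a quick sanity check of the two transformations described above, the following sketch applies them to a small synthetic batch. The random tensor `fake_images` is only a stand-in for real MNIST data and is an assumption made for illustration.

```python
import torch

# Hypothetical stand-in for a batch of 5 greyscale images with values 0..255.
fake_images = torch.randint(0, 256, (5, 28, 28), dtype=torch.uint8)

# Insert the channel dimension at position 1 and rescale to the range [0, 1].
scaled = fake_images.unsqueeze(1) / 255.0

print(scaled.shape)  # torch.Size([5, 1, 28, 28])
print(scaled.dtype)  # torch.float32 (with the default dtype settings)
print(scaled.min().item() >= 0.0, scaled.max().item() <= 1.0)  # True True
```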

We can now display a few randomly selected examples from the train set:



In [None]:
num_rows = 4; num_cols = 4
fig, axes = plt.subplots(num_rows, num_cols)

for row in axes:
    for ax in row:
        # randint's upper bound is exclusive, so this covers every valid index
        ax.imshow(X_train_np[np.random.randint(len(X_train_np))],
                  cmap='Greys')
        ax.set_xticks([])
        ax.set_yticks([])

### Datasets and Data Loaders

So far we have been working with our data in full-batch mode: we always ran our entire dataset through our network. This is, of course, only possible if your dataset is small enough to fit into memory all at once. If you are running your network on a GPU, you can then process all your data in parallel, which is computationally efficient.

In deep learning, however, most datasets are far too large to fit into memory at once – they can easily have tens or hundreds of gigabytes and some are even larger. If your dataset is that large, it is, of course, essential that you are able to load the data from the hard drive on the fly and train in mini-batch mode. In PyTorch, this aspect of deep learning is handled using `Dataset` and `DataLoader` objects.

#### The Dataset Class

The `Dataset` class provides a unified interface for accessing data. A number of classes derive from `Dataset` to support different dataset formats; `torchvision.datasets`, for example, provides `ImageNet`, `VOCDetection`, `Cityscapes`, `CelebA`, etc. There is even the slightly more generic `ImageFolder`, which simply loads images and class labels from a folder.

To define a custom dataset, you would need to implement the following interface:

```
class CustomDataset(Dataset):
    def __init__(self, ...):
        ...

    def __len__(self):
        """
        Returns the number of samples in the dataset.
        """

        ...

    def __getitem__(self, idx):
        """
        Returns the sample at index idx from the dataset.
        """

        ...
```
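
To make the interface concrete, here is a minimal sketch of a custom dataset that wraps two in-memory arrays. The class name `ArrayDataset` and the synthetic data are hypothetical; a real dataset would typically read each sample from disk inside `__getitem__`.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class ArrayDataset(Dataset):
    """Hypothetical dataset wrapping an image array and a label array."""

    def __init__(self, images, labels):
        assert len(images) == len(labels)
        self.images = images
        self.labels = labels

    def __len__(self):
        # Number of samples in the dataset.
        return len(self.images)

    def __getitem__(self, idx):
        # Convert a single sample to tensors on demand.
        image = torch.as_tensor(self.images[idx], dtype=torch.float32) / 255.0
        label = torch.tensor(int(self.labels[idx]), dtype=torch.long)
        return image.unsqueeze(0), label  # add the channel dimension


# Synthetic stand-in data: 10 blank 28x28 images with labels 0..9.
demo_dataset = ArrayDataset(
    np.zeros((10, 28, 28), dtype=np.uint8), np.arange(10))
image, label = demo_dataset[3]
print(len(demo_dataset), image.shape, label.item())
```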

#### The DataLoader Class

Given a dataset, a data loader is in charge of drawing mini-batches from it, making sure the data is properly shuffled, etc. Usually, you won't need to implement your own dataloader – you'll just be able to use the `DataLoader` class from `torch.utils.data`, e.g. like this:

```
from torch.utils.data import DataLoader

train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)
```
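
As an illustration of what a loader yields, the sketch below iterates over a small synthetic dataset. The list `toy_data` is an assumption made for the example (any object that implements `__len__` and `__getitem__` can be passed to `DataLoader`), and shuffling is switched off only to keep the output deterministic.

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical toy data: 100 (image, label) pairs.
toy_data = [(torch.zeros(1, 28, 28), i % 10) for i in range(100)]

toy_loader = DataLoader(toy_data, batch_size=32, shuffle=False)

batch_sizes = []
for images, labels in toy_loader:
    # The default collate function stacks samples along a new batch dimension.
    batch_sizes.append(images.shape[0])

print(batch_sizes)                 # [32, 32, 32, 4]
print(images.shape, labels.shape)  # last batch: torch.Size([4, 1, 28, 28]) torch.Size([4])
```

The last batch is smaller because 100 is not divisible by 32; passing `drop_last=True` to `DataLoader` would discard it instead.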

#### Our Example and TensorDataset

Our MNIST dataset is again small enough to fit into memory all at once. With datasets like this we can use the `TensorDataset` class, which merely wraps existing tensors in the dataset interface and allows them to be used with data loaders.



In [None]:
train_dataset = TensorDataset(X_train, Y_train)
test_dataset = TensorDataset(X_test, Y_test)

train_dataloader = DataLoader(train_dataset, batch_size=512, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=512, shuffle=False)  # no need to shuffle for evaluation

### Constructing the Convolutional Network

When constructing a convolutional net the procedure is usually to study literature that deals with similar tasks and use that knowledge to design a similar neural architecture for the problem at hand (and possibly to tune it).

Given that the MNIST dataset is not especially difficult, we will use it to illustrate an even simpler approach:

* We will keep chaining blocks of convolutional layers, ReLU functions and pooling layers.
* We will keep going until the dimensions of the inputs have decreased sufficiently.
* Once that happens we will append one or several standard linear layers and ReLUs.

To make it easier to keep track of what the dimensions of the output are after all the individual layers have been applied, we will not wrap our layers into a class just yet: we will instead experiment with them freely first. To this end we will take a few samples from the dataset, which will be used as a dummy input.



In [None]:
y = X_train[:5].to('cpu')

Let us now create our first block and apply it to tensor `y`. We will first create our 2D convolutional layer using the `nn.Conv2d` class. We need to specify a few parameters: namely the number of input and output channels and the kernel size. The number of input channels is 1, of course, because as we mentioned, we have a single colour channel. The number of output channels is a hyperparameter – we are going to begin with 8 because our data is so simple. Note that in a typical convolutional network, the dimensions of the feature maps tend to decrease in later layers, but the numbers of channels tend to increase (the intuition is that the deeper the layer is, the more abstract – and more numerous – the concepts that it is going to represent).

Convolutional kernels can be of different sizes, but the conventional wisdom based on empirical evidence is that $3 \times 3$ kernels tend to work well. We want to avoid making the kernel unnecessarily large, because both the number of weights and the amount of computation per output value grow with the kernel area.

After the convolutional layer we apply the ReLU activation function and max-pooling, for which we again need to specify a kernel size. With pooling, the larger the kernel size, the more rapidly our data will be downsampled. We are therefore using a small $2 \times 2$ kernel. A number of modern architectures have now dispensed with the use of pooling layers altogether and use strides or dilations in the convolutional layer to downsample the data.
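
To illustrate the last remark, the sketch below compares a convolution followed by max-pooling with a single strided convolution. The layer sizes are assumptions chosen for the illustration; on a $28 \times 28$ input both variants happen to produce $13 \times 13$ feature maps.

```python
import torch
import torch.nn as nn

x = torch.zeros(5, 1, 28, 28)  # dummy batch of 5 single-channel images

# Variant 1: convolution, then 2x2 max-pooling.
conv_pooled = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)
out_pooled = torch.max_pool2d(torch.relu(conv_pooled(x)), kernel_size=2)

# Variant 2: a strided convolution performs the downsampling itself.
conv_strided = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, stride=2)
out_strided = torch.relu(conv_strided(x))

print(out_pooled.shape)   # torch.Size([5, 8, 13, 13])
print(out_strided.shape)  # torch.Size([5, 8, 13, 13])
```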



In [None]:
conv1 = nn.Conv2d(
    in_channels=1, out_channels=8,
    kernel_size=(3, 3))

y = conv1(y)
y = torch.relu(y)
y = torch.max_pool2d(y, kernel_size=(2, 2))

After we have constructed our first block, let us check what effect this had on the dimensionality of our data.



In [None]:
np.prod(y.shape[1:])  # np.product was removed in NumPy 2.0

Alas, our data still has too many dimensions and we need to reduce its dimensionality further. Let's try to apply one more block to it. Note that while the feature map is getting smaller, we do actually increase the number of channels.



In [None]:
conv2 = nn.Conv2d(8, 16, (3, 3))
y = conv2(y)
y = torch.relu(y)
y = torch.max_pool2d(y, (2, 2))

In [None]:
np.prod(y.shape[1:])  # np.product was removed in NumPy 2.0
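
The same numbers can be obtained without running the layers. For a convolution or pooling window of size $k$ with stride $s$ and no padding, an input of width $n$ yields $\lfloor (n - k)/s \rfloor + 1$ outputs. The helper `out_size` below is a hypothetical name introduced for this sketch.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


size = 28
size = out_size(size, kernel=3)            # conv1: 28 -> 26
size = out_size(size, kernel=2, stride=2)  # pool:  26 -> 13
features_block1 = 8 * size * size          # 8 channels

size = out_size(size, kernel=3)            # conv2: 13 -> 11
size = out_size(size, kernel=2, stride=2)  # pool:  11 -> 5
features_block2 = 16 * size * size         # 16 channels

print(features_block1, features_block2)  # 1352 400
```

The second value, 400, is the input size that the first linear layer has to accept once the feature maps are flattened.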

The number of dimensions is much more reasonable now. We can now flatten the output (transform it from a stack of 2-dimensional feature maps into a 1-dimensional vector) and apply some standard linear layers and ReLUs. Again we make sure that the dimension of the data decreases gradually and the change from one layer to the next is not too drastic. The output layer is going to have 10 output neurons, because we are going to classify into 10 classes: the digits. Conceptually, its activation function is softmax; we do not apply it explicitly, because the `nn.CrossEntropyLoss` criterion used for training expects raw scores (logits) and applies log-softmax internally.
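
As a brief sketch of how softmax relates to the loss used later in this notebook: `nn.CrossEntropyLoss` takes raw scores (logits), so its value matches the negative log-likelihood computed from explicit log-softmax outputs. The logits and targets below are made-up values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up scores for 2 samples and 3 classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

probs = torch.softmax(logits, dim=1)  # each row sums to 1
loss_ce = nn.CrossEntropyLoss()(logits, targets)
loss_manual = F.nll_loss(torch.log_softmax(logits, dim=1), targets)

print(probs.sum(dim=1))
print(torch.allclose(loss_ce, loss_manual))  # True
```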



In [None]:
y = torch.flatten(y, 1)

fc1 = nn.Linear(400, 128)
y = fc1(y)
y = torch.relu(y)

fc2 = nn.Linear(128, 10)
y = fc2(y)

y.shape

Now that we have designed our architecture, we need to wrap it in a class again. As usual, layers with parameters need to be constructed in `__init__` and then used in `forward`. To make the architecture a bit better, we are going to use `nn.PReLU` instead of `relu` and add dropout (`nn.Dropout`) after each block.



In [None]:
class Net(nn.Module):
    def __init__(self, num_outputs):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 8, (3, 3))
        self.conv_acti1 = nn.PReLU()
        
        self.conv2 = nn.Conv2d(8, 16, (3, 3))
        self.conv_acti2 = nn.PReLU()

        self.fc1 = nn.Linear(400, 128)
        self.fc_acti1 = nn.PReLU()

        self.fc2 = nn.Linear(128, num_outputs)
        self.dropout = nn.Dropout(0.3)

    def forward(self, y):
        y = self.conv1(y)
        y = self.conv_acti1(y)
        y = torch.max_pool2d(y, kernel_size=(2, 2))
        y = self.dropout(y)
        
        y = self.conv2(y)
        y = self.conv_acti2(y)
        y = torch.max_pool2d(y, kernel_size=(2, 2))
        y = self.dropout(y)
        
        y = torch.flatten(y, 1)
        
        y = self.fc1(y)
        y = self.fc_acti1(y)
        y = self.dropout(y)

        y = self.fc2(y)
        return y
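
The `nn.Dropout` layer used above behaves differently depending on the mode of the module, which is why calling `model.train()` before training and `model.eval()` before evaluation matters. A minimal sketch on a tensor of ones (the input is an arbitrary choice for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(0.3)
x = torch.ones(1000)

dropout.train()
y_train_mode = dropout(x)  # about 30% of entries zeroed, the rest scaled by 1 / 0.7

dropout.eval()
y_eval_mode = dropout(x)   # identity: the input passes through unchanged

print(y_train_mode.unique())        # zeros and values close to 1 / 0.7
print(torch.equal(y_eval_mode, x))  # True
```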

### Constructing and Training the Classifier

Our training loop is going to be a bit different now that we are using `Dataset` and `DataLoader` objects.

There are going to be two nested loops now:

* The outer one iterates over epochs.
* The inner one iterates over mini-batches within the same epoch.

Note that we are now logging the loss for each mini-batch. Consequently, the loss plot is going to be a bit noisier – gradient estimates are, of course, more stable when computed over the entire dataset rather than over the smaller mini-batches.
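
For a sense of scale, the number of optimisation steps follows from the dataset size and the batch size. The sketch below assumes the standard MNIST training set of 60,000 images together with the batch size of 512 and the 50 epochs used in this notebook.

```python
import math

num_samples = 60_000  # size of the standard MNIST training set
batch_size = 512
num_epochs = 50

steps_per_epoch = math.ceil(num_samples / batch_size)  # the last batch is partial
last_batch_size = num_samples - (steps_per_epoch - 1) * batch_size
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, last_batch_size, total_steps)  # 118 96 5900
```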



In [None]:
num_outputs = 10
model = Net(num_outputs).to(device)

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_train = []

for epoch in range(50):
    model.train()

    for X_batch, Y_batch in train_dataloader:
        X_batch = X_batch.to(device)
        Y_batch = Y_batch.to(device)
        
        y_batch = model(X_batch)
        loss = criterion(y_batch, Y_batch)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        loss_train.append(loss.item())

    if epoch % 5 == 0:
        print(f"epoch {epoch}, loss: {np.mean(loss_train[-20:])}")

print(f"epoch {epoch}, loss: {np.mean(loss_train[-20:])}")

In [None]:
plt.plot(loss_train)
plt.xlabel("step")
plt.ylabel("loss")
plt.grid(ls='--')
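
Because the per-batch loss is noisy, it can help to look at a smoothed version of the curve. The sketch below uses a simple moving average, which is not part of the original notebook; the window size of 50 and the synthetic `noisy_loss` series are assumptions standing in for `loss_train`.

```python
import numpy as np

rng = np.random.default_rng(0)
steps = np.arange(1000)
# Synthetic stand-in for loss_train: a decaying curve plus noise.
noisy_loss = np.exp(-steps / 200) + 0.1 * rng.standard_normal(len(steps))

window = 50
kernel = np.ones(window) / window
# 'valid' keeps only positions where the window fully overlaps the data.
smoothed = np.convolve(noisy_loss, kernel, mode="valid")

print(len(noisy_loss), len(smoothed))  # 1000 951
```

The smoothed series is shorter by `window - 1` points, so it should be plotted against `np.arange(window - 1, len(noisy_loss))` to line up with the original steps.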

### Testing

Finally, we apply our standard testing procedure for classifiers: we display the confusion matrix and the accuracy.

#### On the Train Set



In [None]:
model.eval()
with torch.no_grad():
    y_train_logit = model(X_train)
    y_train = y_train_logit.argmax(dim=1)

cm = pd.crosstab(
    Y_train.cpu().numpy(),
    y_train.cpu().numpy(),
    rownames=['actual'],
    colnames=['predicted']
)
print(cm, "\n")

acc = accuracy_score(Y_train.cpu().numpy(), y_train.cpu().numpy())
print("Accuracy = {}".format(acc))

#### On the Test Set



In [None]:
model.eval()
with torch.no_grad():
    y_test_logit = model(X_test)
    y_test = y_test_logit.argmax(dim=1)

cm = pd.crosstab(
    Y_test.cpu().numpy(),
    y_test.cpu().numpy(),
    rownames=['actual'],
    colnames=['predicted']
)
print(cm, "\n")

acc = accuracy_score(Y_test.cpu().numpy(), y_test.cpu().numpy())
print("Accuracy = {}".format(acc))
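
Beyond overall accuracy, the confusion matrix also gives per-class figures. As a sketch, the recall of each class is the diagonal entry divided by the row sum; the tiny label arrays below are made up for the example.

```python
import numpy as np
import pandas as pd

actual = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
predicted = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0])

cm_demo = pd.crosstab(actual, predicted,
                      rownames=['actual'], colnames=['predicted'])

# Diagonal = correctly classified samples; row sums = samples per actual class.
# This assumes a square matrix, i.e. every class is predicted at least once.
recall = np.diag(cm_demo.values) / cm_demo.values.sum(axis=1)
accuracy = np.diag(cm_demo.values).sum() / cm_demo.values.sum()

print(cm_demo)
print(recall)    # 1/2, 2/3 and 3/4 for classes 0, 1 and 2
print(accuracy)  # 6 correct out of 9
```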