# MLP with PyTorch

At this point, I know what MLPs are. We built one from scratch in one of the previous folders. In this notebook, I want to implement an MLP again, but using PyTorch. This MLP will be trained and evaluated on the MNIST dataset.

# The MNIST dataset

Conveniently, the MNIST dataset is provided in PyTorch through the `torchvision` package, specifically through the `torchvision.datasets` module.

In the following cell, I import the `torchvision` and `transforms` modules. The second module, as the name suggests, lets us perform **common transformations on image data**. According to the [documentation](https://pytorch.org/vision/stable/transforms.html), transforms are common image transformations available in the `torchvision.transforms` module.

Another interesting feature is that transform operations can be **chained** together using `Compose`. We will use it in a couple of cells below.

In [1]:
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

With the modules loaded, I want to load the training and testing splits of the dataset, and specify hyperparameters such as the size of the mini-batches.

In [3]:
image_path = './'

transform = transforms.Compose([
    transforms.ToTensor()
])

mnist_train_dataset = torchvision.datasets.MNIST(
    root=image_path, train=True,
    transform=transform, download=True 
)

mnist_test_dataset = torchvision.datasets.MNIST(
    root=image_path, train=False,
    transform=transform, download=True  
)

batch_size = 64
torch.manual_seed(1)

<torch._C.Generator at 0x7f7b1967c250>

Okay, what just happened? I created an `image_path` variable to hold the directory where the dataset is stored. If the images are already there, PyTorch reads them from the filesystem; otherwise it downloads them into that directory.

Then I create a `transform` pipeline. Ours only has one operation: `transforms.ToTensor()`. The `ToTensor()` transform (1) converts the pixel values into a floating-point tensor and (2) scales them from the range [0, 255] to the range [0, 1].

After that, I create the training and testing datasets from MNIST. Since I do not have it on my machine, I ask PyTorch to download it for me using the `download` parameter. I also want PyTorch to apply the `transform` we created earlier to each image as it is read from the dataset, which I specify using the `transform` parameter.

I finish by specifying the batch size and manually setting the seed for random number generation.
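
Setting the seed makes the random numbers repeatable. A minimal sketch of that effect, using `torch.rand` as an arbitrary example of a random operation:

```python
import torch

torch.manual_seed(1)
first = torch.rand(3)

torch.manual_seed(1)   # reset the generator to the same state
second = torch.rand(3)

print(torch.equal(first, second))  # True: same seed, same numbers
```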

With that done, we cannot use the dataset just yet. We must pass the `Dataset` objects (`mnist_train_dataset` and `mnist_test_dataset`) into a `DataLoader` object. Remember that a `DataLoader` lets us iterate over a dataset in mini-batches. Okay, let's do it:

In [4]:
from torch.utils.data import DataLoader
train_dl = DataLoader(mnist_train_dataset,
                      batch_size, shuffle=True)

We successfully created the data loader, with batches of 64 samples. Let's move on :)
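
To see what a `DataLoader` hands back without downloading anything, here is a sketch on made-up data. The `TensorDataset` of 200 random "images" is an assumption used only to keep the example self-contained; it has the same per-sample shape as MNIST after `ToTensor()`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in dataset: 200 random 1x28x28 images with random digit labels.
images = torch.rand(200, 1, 28, 28)
labels = torch.randint(0, 10, (200,))
toy_dataset = TensorDataset(images, labels)

toy_dl = DataLoader(toy_dataset, batch_size=64, shuffle=True)

x_batch, y_batch = next(iter(toy_dl))
print(x_batch.shape)  # torch.Size([64, 1, 28, 28])
print(y_batch.shape)  # torch.Size([64])
print(len(toy_dl))    # 4 batches: 64 + 64 + 64 + 8 samples
```

The last batch is smaller because 200 is not a multiple of 64; the same happens with any dataset whose size is not divisible by the batch size, unless `drop_last=True` is passed.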

# Building the Model

This section of the notebook is concerned with building the MLP to classify digits from the dataset we downloaded earlier.

Our MLP will have:

- an input layer
- a hidden layer (32 activation units)
- a hidden layer (16 activation units)
- an output layer

Let's define the above layers in code:

In [5]:
hidden_units = [32, 16] #number of activation units in EACH hidden layer
image_size = mnist_train_dataset[0][0].shape #mnist_train_dataset[0] is a tuple (image[tensor], label)
input_size = image_size[0] * image_size[1] * image_size[2] #number of channels * image height * image width

# all the layers in the network 
all_layers = [nn.Flatten()]
for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit

all_layers.append(nn.Linear(hidden_units[-1], 10)) #output layer

We successfully created all the layers in the network and stored them in a list called `all_layers`.

Let's now create a model containing the layers we just created.

In [6]:
model = nn.Sequential(*all_layers)

That's it. Done! Since each layer comes one after the other in an MLP, we use the `torch.nn.Sequential` module to place those layers *sequentially*.
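
The layer loop and the `Sequential` call can also be wrapped into one function. This is only a sketch: `build_mlp` and its parameter names are hypothetical and not part of PyTorch or of this notebook.

```python
import torch
import torch.nn as nn

def build_mlp(input_size, hidden_units, num_classes):
    """Hypothetical helper: Flatten -> (Linear -> ReLU) per hidden layer -> Linear."""
    layers = [nn.Flatten()]
    for units in hidden_units:
        layers.append(nn.Linear(input_size, units))
        layers.append(nn.ReLU())
        input_size = units  # the next layer consumes this layer's output
    layers.append(nn.Linear(input_size, num_classes))  # output layer
    return nn.Sequential(*layers)

mlp = build_mlp(input_size=28 * 28, hidden_units=[32, 16], num_classes=10)
num_params = sum(p.numel() for p in mlp.parameters())
print(num_params)  # 25818 weights and biases in total
```

Using the running `input_size` for the output layer, rather than `hidden_units[-1]`, means the helper also works when the list of hidden layers is empty.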

Let's print the model and see all the layers we inserted.

In [7]:
model

Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=784, out_features=32, bias=True)
  (2): ReLU()
  (3): Linear(in_features=32, out_features=16, bias=True)
  (4): ReLU()
  (5): Linear(in_features=16, out_features=10, bias=True)
)

When an image, represented by a $(1 \times 28 \times 28)$ tensor, gets fed into the network, the `Flatten` layer flattens it to a $(1 \times 784)$ vector.

This flattened vector goes through the first `Linear` layer, which takes the **input** features and turns the $(1 \times 784)$ vector into a $(1 \times 32)$ vector.

Right after that, the vector goes through a `ReLU` activation function. The size of the vector remains the same: $(1 \times 32)$.

The $(1 \times 32)$ vector goes through a second `Linear` layer, which downsizes it to a $(1 \times 16)$ vector. This vector also goes through a `ReLU` activation.

Lastly, the $(1 \times 16)$ vector goes through the last `Linear` layer, the **output** layer, which turns it into a $(1 \times 10)$ vector. This vector holds one score per digit class, and it is our prediction.
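
The walk-through above can be checked by pushing a dummy input through the layers one at a time. This sketch rebuilds the same architecture with fresh random weights (the name `tracer` and the random input are assumptions for illustration) and prints the shape after each layer.

```python
import torch
import torch.nn as nn

# Same architecture as the printed model, with freshly initialized weights.
tracer = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 10),
)

x = torch.rand(1, 1, 28, 28)  # hypothetical batch holding one random image
shapes = []
with torch.no_grad():
    for layer in tracer:
        x = layer(x)  # feed the previous layer's output into this layer
        shapes.append(tuple(x.shape))
        print(f"{layer.__class__.__name__:<8} -> {tuple(x.shape)}")
```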

In [8]:
# move the model to the selected device (the GPU when one is available)
model = model.to(device)

Let's now train the model.

In [9]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
num_epochs = 20

for epoch in range(num_epochs):
    accuracy_hist_train = 0
    for x_batch, y_batch in train_dl:
        #moves the batch to the gpu so it can be there with the model
        x_batch, y_batch = x_batch.to(device), y_batch.to(device)
        
        #compute forward pass and loss value
        pred = model(x_batch)
        loss = loss_fn(pred, y_batch)
        
        #compute gradients through backpropagation
        loss.backward()

        #update weights based on gradients, then reset the gradients to zero 
        optimizer.step()
        optimizer.zero_grad()

        #how correct was our model on this batch?
        is_correct = (
            torch.argmax(pred, dim=1) == y_batch
        ).float()

        #Number of times model was correct on this batch
        accuracy_hist_train += is_correct.sum()
    
    accuracy_hist_train /= len(train_dl.dataset)
    print(f'Epoch {epoch} Accuracy: '
          f'{accuracy_hist_train:.4f}')

Epoch 0 Accuracy: 0.8514
Epoch 1 Accuracy: 0.9287
Epoch 2 Accuracy: 0.9422
Epoch 3 Accuracy: 0.9492
Epoch 4 Accuracy: 0.9538
Epoch 5 Accuracy: 0.9584
Epoch 6 Accuracy: 0.9622
Epoch 7 Accuracy: 0.9647
Epoch 8 Accuracy: 0.9673
Epoch 9 Accuracy: 0.9688
Epoch 10 Accuracy: 0.9710
Epoch 11 Accuracy: 0.9720
Epoch 12 Accuracy: 0.9737
Epoch 13 Accuracy: 0.9750
Epoch 14 Accuracy: 0.9769
Epoch 15 Accuracy: 0.9780
Epoch 16 Accuracy: 0.9788
Epoch 17 Accuracy: 0.9799
Epoch 18 Accuracy: 0.9814
Epoch 19 Accuracy: 0.9816
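
One detail about the training loop is worth spelling out: `nn.CrossEntropyLoss` takes the raw outputs of the last `Linear` layer (the logits) and applies the log-softmax itself, which is why the model has no softmax layer at the end. A small sketch on made-up logits and labels, comparing the built-in loss with a manual computation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # hypothetical raw outputs for 4 samples
targets = torch.tensor([3, 0, 9, 1])   # hypothetical true digit labels

loss_builtin = nn.CrossEntropyLoss()(logits, targets)

# Same quantity by hand: mean negative log-probability of the true class.
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(4), targets].mean()

print(torch.allclose(loss_builtin, loss_manual))  # True
```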


Cool, our model finished training. The model achieved 98% accuracy on the training set, but let's see how well it performs on the testing set, which is data it has never seen before.

In [13]:
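# NOTE: `.data` holds the raw uint8 pixels in [0, 255];
# the `ToTensor` transform is not applied to it here
pred = model(mnist_test_dataset.data.float().to(device))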

is_correct = (
    torch.argmax(pred, dim=1) == mnist_test_dataset.targets.to(device)
).float()

print(f'Accuracy on Test set: {is_correct.mean():.4f}')

Accuracy on Test set: 0.9610
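
An alternative way to evaluate is to go through a `DataLoader`, so that every test image passes through the same `transform` as the training images and the test set is processed in mini-batches. The sketch below is one way to write that; `evaluate` is a hypothetical helper, and the random data and untrained model only demonstrate the call, so the number it prints says nothing about MNIST.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, data_loader, device):
    """Hypothetical helper: fraction of correctly classified samples."""
    model.eval()
    num_correct = 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for x_batch, y_batch in data_loader:
            x_batch, y_batch = x_batch.to(device), y_batch.to(device)
            pred = model(x_batch)
            num_correct += (torch.argmax(pred, dim=1) == y_batch).sum().item()
    return num_correct / len(data_loader.dataset)

# Stand-in data and an untrained model, just to show the call.
toy_images = torch.rand(100, 1, 28, 28)
toy_labels = torch.randint(0, 10, (100,))
toy_dl = DataLoader(TensorDataset(toy_images, toy_labels), batch_size=64)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))

accuracy = evaluate(toy_model, toy_dl, torch.device("cpu"))
print(f"Accuracy: {accuracy:.4f}")
```

With the objects from this notebook, the call would take the trained `model`, a `DataLoader` built on `mnist_test_dataset`, and `device`.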


That's it!

We successfully trained and evaluated our model on the MNIST dataset. The accuracy on the training set is about 2 percentage points higher than on the testing set, so I think the model generalizes fairly well.

In [11]:
test_dl = DataLoader(mnist_test_dataset, batch_size=64, shuffle=True)

x, y = next(iter(test_dl))
x, y = x.to(device), y.to(device)
pred = model(x)
pred



tensor([[ 1.4759e+01, -1.7047e+01, -2.0928e-01, -6.8313e+00, -1.2147e+01,
         -8.0281e+00,  6.3435e-01, -5.0821e+00, -4.6575e+00, -5.3072e+00],
        [-7.0193e+00, -1.5395e+01, -7.0325e+00, -3.8280e+00,  1.4617e-01,
         -2.1121e+00, -1.3263e+01, -2.7679e+00, -1.2906e+00,  3.2976e+00],
        [-1.1946e+01, -5.2218e+00,  1.0317e+00,  1.3162e+00, -1.9276e+01,
         -1.7503e+01, -3.4389e+01,  1.2010e+01, -7.3719e+00, -5.3010e+00],
        [-9.4134e+00, -7.6419e+00, -7.7214e+00, -3.6080e+00,  3.5553e+00,
         -3.3164e+00, -5.1598e+00, -1.0543e+01,  8.3371e+00, -7.1106e+00],
        [-1.5993e+01, -7.3527e+00, -1.0119e+01, -1.1165e+01, -2.4158e+00,
          4.4177e+00, -3.0337e+00, -1.5408e+01,  1.4155e+01, -7.6252e+00],
        [-7.2085e+00, -6.5058e+00, -6.3147e+00, -1.0603e+01,  1.7418e+01,
         -6.7194e+00, -8.4649e+00, -6.6750e-01, -5.0204e+00, -1.3512e+00],
        [-9.4327e+00,  8.8173e+00,  6.9889e-01, -2.5305e+00, -7.7006e+00,
         -1.1856e+01, -9.7926e+0
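
The values printed above are raw scores (logits), one per digit class, not probabilities. As a sketch, the first two rows of that output (copied by hand below) can be turned into probabilities with softmax and into predicted digits with argmax:

```python
import torch

# First two rows of the output printed above.
logits = torch.tensor([
    [14.759, -17.047, -0.20928, -6.8313, -12.147,
     -8.0281, 0.63435, -5.0821, -4.6575, -5.3072],
    [-7.0193, -15.395, -7.0325, -3.8280, 0.14617,
     -2.1121, -13.263, -2.7679, -1.2906, 3.2976],
])

probs = torch.softmax(logits, dim=1)      # each row now sums to 1
predicted = torch.argmax(logits, dim=1)   # softmax does not change the argmax

print(probs.sum(dim=1))
print(predicted)  # tensor([0, 9])
```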

# Potential Experiments

Here are a few ideas to further the exploration:

1. Feed forward only ONE training example through the network. What is the shape of the output vector? Can you also print its values?

2. We used `ReLU` as our activation function. Modify the model so it uses sigmoid instead. Does it affect the model's performance on the testing set?

3. Using the model we built, can you reset it and graph the weights as the model learns, to see how they change over time? Might be a cool visual.
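
As a possible starting point for the first two experiments, here is a sketch with a swappable activation. The `build_mlp` helper is hypothetical, the input is a random stand-in for one MNIST image, and the model is untrained, so the printed values are meaningless until it has been trained.

```python
import torch
import torch.nn as nn

def build_mlp(activation):
    """Hypothetical helper: the notebook's architecture with a swappable activation."""
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 32), activation(),
        nn.Linear(32, 16), activation(),
        nn.Linear(16, 10),
    )

sigmoid_model = build_mlp(nn.Sigmoid)   # experiment 2: sigmoid instead of ReLU

# Experiment 1: a single example, with a leading batch dimension of 1.
one_example = torch.rand(1, 1, 28, 28)  # stand-in for one MNIST image
with torch.no_grad():
    output = sigmoid_model(one_example)

print(output.shape)  # torch.Size([1, 10])
print(output)
```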