
# PyTorch Assignment: Multi-Layer Perceptron (MLP)

### Multi-Layer Perceptrons

The simple logistic regression example we went over in the previous notebook is essentially a one-layer neural network, projecting straight from the input to the output predictions.
While this can be effective for linearly separable data, occasionally a little more complexity is necessary.
Neural networks with additional layers are typically able to learn more complex functions, leading to better performance.
These additional layers (called "hidden" layers) transform the input into one or more intermediate representations before making a final prediction.

In the logistic regression example, the way we performed the transformation was with a fully-connected layer, which consisted of a linear transform (matrix multiply plus a bias).
A neural network consisting of multiple successive fully-connected layers is commonly called a Multi-Layer Perceptron (MLP).
In the simple MLP below, a 4-d input is projected to a 5-d hidden representation, which is then projected to a single output that is used to make the final prediction.

![mlp](https://drive.google.com/uc?export=view&id=1xQdky_Wzzw0v7GuPj-eUGpElhD0SS6ah)
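As a rough illustration of the figure, the sketch below stacks two fully-connected layers with the same sizes (4 -> 5 -> 1). The batch size of 8 and the variable names are assumptions made for the example, and no nonlinearity is included yet since that is discussed in the next section.

```python
import torch
import torch.nn as nn

# Two successive fully-connected layers: 4-d input -> 5-d hidden -> 1 output.
tiny_mlp = nn.Sequential(
    nn.Linear(4, 5),
    nn.Linear(5, 1),
)

batch = torch.randn(8, 4)      # 8 hypothetical examples with 4 features each
hidden = tiny_mlp[0](batch)    # intermediate ("hidden") representation
output = tiny_mlp(batch)       # final prediction, one value per example

print(hidden.shape)  # torch.Size([8, 5])
print(output.shape)  # torch.Size([8, 1])
```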

For the assignment, you will be building an MLP for MNIST.
Mechanically, this is done very similarly to our logistic regression example, but instead of going straight to a 10-d vector representing our output predictions, we might first transform to a 500-d vector with a "hidden" layer, then to the output of dimension 10.
Before you do so, however, there's one more important thing to consider.

### Nonlinearities

We typically include nonlinearities between layers of a neural network.
There are a number of reasons to do so.
For one, without anything nonlinear between them, successive linear transforms (fully connected layers) collapse into a single linear transform, which means the model isn't any more expressive than a single layer.
On the other hand, intermediate nonlinearities prevent this collapse, allowing neural networks to approximate more complex functions.
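The collapse can be checked numerically. The sketch below uses small, arbitrary layer sizes (all shapes and names are hypothetical) and compares two stacked linear transforms against the single transform with weights `W1 @ W2` and bias `b1 @ W2 + b2`. It then repeats the comparison with an elementwise nonlinearity in between (here `torch.relu`, introduced next).

```python
import torch

torch.manual_seed(0)

# Two fully-connected layers with nothing nonlinear between them.
W1, b1 = torch.randn(4, 5), torch.randn(5)
W2, b2 = torch.randn(5, 3), torch.randn(3)
x = torch.randn(8, 4)

two_layers = (x @ W1 + b1) @ W2 + b2

# The same map written as one linear transform.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(torch.allclose(two_layers, one_layer, atol=1e-4))

# With a ReLU in between, the single-layer rewrite no longer reproduces the output.
with_relu = torch.relu(x @ W1 + b1) @ W2 + b2
print(torch.allclose(with_relu, one_layer, atol=1e-4))
```

The first comparison should agree up to floating-point error, while the second should not, since the ReLU zeroes out negative hidden values before the second layer sees them.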

There are a number of nonlinearities commonly used in neural networks, but one of the most popular is the [rectified linear unit (ReLU)](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)):

\begin{align}
\mathrm{ReLU}(x) = \max(0, x)
\end{align}

There are a number of ways to implement this in PyTorch.
We could do it with elementary PyTorch operations:

In [None]:
import torch

x = torch.rand(5, 3)*2 - 1
x_relu_max = torch.max(torch.zeros_like(x),x)

print("x: {}".format(x))
print("x after ReLU with max: {}".format(x_relu_max))

x: tensor([[ 0.1664, -0.8410,  0.6348],
        [ 0.3297, -0.2489, -0.9519],
        [ 0.8821,  0.4468, -0.1392],
        [-0.9356, -0.9706, -0.2056],
        [-0.7253,  0.9117, -0.6125]])
x after ReLU with max: tensor([[0.1664, 0.0000, 0.6348],
        [0.3297, 0.0000, 0.0000],
        [0.8821, 0.4468, 0.0000],
        [0.0000, 0.0000, 0.0000],
        [0.0000, 0.9117, 0.0000]])


Of course, PyTorch also has the ReLU implemented, for example in `torch.nn.functional`:

In [None]:
import torch.nn.functional as F

x_relu_F = F.relu(x)

print("x after ReLU with nn.functional: {}".format(x_relu_F))

x after ReLU with nn.functional: tensor([[0.1664, 0.0000, 0.6348],
        [0.3297, 0.0000, 0.0000],
        [0.8821, 0.4468, 0.0000],
        [0.0000, 0.0000, 0.0000],
        [0.0000, 0.9117, 0.0000]])


Same result.
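Two further options, sketched here for comparison: the module form `torch.nn.ReLU`, which is convenient inside an `nn.Module`, and `torch.clamp` with a lower bound of zero. The input tensor is regenerated in this block, so the values differ from the outputs shown above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.rand(5, 3) * 2 - 1   # values in [-1, 1)

relu_layer = nn.ReLU()                  # stateless module wrapping the same operation
x_relu_module = relu_layer(x)
x_relu_clamp = torch.clamp(x, min=0)    # clamp everything below zero up to zero

print(torch.equal(x_relu_module, F.relu(x)))
print(torch.equal(x_relu_clamp, F.relu(x)))
```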

### Assignment

Build a 2-layer MLP for MNIST digit classification. Feel free to play around with the model architecture and see how the training time/performance changes, but to begin, try the following:

Image (784 dimensions) ->  
fully connected layer (500 hidden units) -> nonlinearity (ReLU) ->  
fully connected layer (10 output units) -> softmax

Try building the model both with basic PyTorch operations, and then again with more object-oriented higher-level APIs.
You should get similar results!


*Some hints*:
- Even as we add additional layers, we still only require a single optimizer to learn the parameters.
Just make sure to pass all parameters to it!
- As you'll calculate in the Short Answer, this MLP model has many more parameters than the logistic regression example, which makes it more challenging to learn.
To get the best performance, you may want to play with the learning rate and increase the number of training epochs.
- Be careful using `torch.nn.CrossEntropyLoss()`.
If you look at the [PyTorch documentation](https://pytorch.org/docs/stable/nn.html#crossentropyloss), you'll see that `torch.nn.CrossEntropyLoss()` combines the softmax operation with the cross-entropy.
This means you need to pass in the logits (predictions pre-softmax) to this loss.
Computing the softmax separately and feeding the result into `torch.nn.CrossEntropyLoss()` will significantly degrade your model's performance!
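The last hint can be illustrated with a small sketch. The logits and labels below are made-up values, not MNIST outputs. The example compares the loss on raw logits, the equivalent manual log-softmax plus negative log-likelihood, and the mistaken version where softmax is applied before the loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10) * 5         # hypothetical pre-softmax scores, 4 examples
labels = torch.tensor([3, 1, 7, 0])     # hypothetical class labels

# Correct usage: raw logits go in, log-softmax is applied internally.
loss_logits = F.cross_entropy(logits, labels)

# Equivalent manual computation.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# Mistake: softmax first, so the loss effectively applies softmax twice.
loss_double = F.cross_entropy(F.softmax(logits, dim=1), labels)

print(loss_logits.item(), loss_manual.item(), loss_double.item())
```

In the mistaken version the loss only ever sees inputs confined to [0, 1], so the differences between classes are squashed and the training signal is much weaker.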

In [1]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import torch.optim as optim

In [2]:
def set_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():  # GPU operations use separate seeds
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
set_seed(42)

In [None]:
device = torch.device("cpu") if not torch.cuda.is_available() else torch.device("cuda:0")
print("Using device", device)

Using device cuda:0
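The training cells in this notebook keep their tensors on the CPU as written. If you want to use the selected device, both the parameters and each batch need to be moved to it. The following is a minimal sketch under assumed names: the `nn.Sequential` model and the random batch are stand-ins, not the assignment code.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in model with the assignment's layer sizes.
model = nn.Sequential(nn.Linear(784, 500), nn.ReLU(), nn.Linear(500, 10)).to(device)

# Hypothetical batch shaped like flattened MNIST images and their labels.
images = torch.rand(100, 784)
labels = torch.randint(0, 10, (100,))

# Inputs and targets must be on the same device as the model parameters.
images, labels = images.to(device), labels.to(device)
logits = model(images)

print(logits.device, logits.shape)
```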


### Implementing MLP with PyTorch Operations

In [3]:
from torchvision import datasets, transforms
from warnings import filterwarnings
filterwarnings('ignore')

mnist_train = datasets.MNIST(root="./datasets", train=True, transform=transforms.ToTensor(), download=True)
mnist_test = datasets.MNIST(root="./datasets", train=False, transform=transforms.ToTensor(), download=True)

print("Number of MNIST training examples: {}".format(len(mnist_train)))
print("Number of MNIST test examples: {}".format(len(mnist_test)))

Number of MNIST training examples: 60000
Number of MNIST test examples: 10000





In [5]:
#Loading data in batches
train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=100, shuffle=True)
test_loader = torch.utils.data.DataLoader(mnist_test, batch_size=100, shuffle=False)

#Params Initialize
W_1 = torch.randn(784, 500)/np.sqrt(784)
W_1.requires_grad_()

W_2 = torch.randn(500, 10)/np.sqrt(500)
W_2.requires_grad_()

# Initialize bias b as 0s
b_1 = torch.zeros(500, requires_grad=True)

b_2 = torch.zeros(10, requires_grad=True)

#Optimizer
optimizer = torch.optim.SGD([W_1,b_1, W_2, b_2], lr=0.1)

# Iterate through train set minibatches
for images, labels in tqdm(train_loader):
    # Zero out the gradients
    optimizer.zero_grad()

    # Forward pass
    x = images.view(-1, 28*28)
    y_1 = torch.matmul(x, W_1) + b_1
    y_relu = F.relu(y_1)
    y_2 = torch.matmul(y_relu, W_2) + b_2
    cross_entropy = F.cross_entropy(y_2, labels)

    # Backward pass
    cross_entropy.backward()
    optimizer.step()

#testing
correct = 0
total = len(mnist_test)

with torch.no_grad():
    # Iterate through test set minibatches
    for images, labels in tqdm(test_loader):

        # Forward pass
        x = images.view(-1, 28*28)

        y_1 = torch.matmul(x, W_1) + b_1

        y_relu = F.relu(y_1)

        y_2 = torch.matmul(y_relu, W_2) + b_2

        predictions = torch.argmax(y_2, dim=1)
        correct += torch.sum((predictions == labels).float())

print('Test accuracy: {}'.format(correct/total))

Test accuracy: 0.9272000193595886


### Implementing MLP with OOP / Higher-Level APIs

In [6]:
import torch.nn as nn

# input_size = 784
# hidden_size = 500
# num_classes = 10
# activation_function = nn.ReLU()

class MNIST_MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin1 = nn.Linear(784, 500)
        self.lin2 = nn.Linear(500, 10)
        self.activation_function = nn.ReLU()

    def forward(self, x):
        x = self.activation_function(self.lin1(x))
        x = self.lin2(x)
        return x


# Instantiate model
model = MNIST_MLP()

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Iterate through train set minibatches
for images, labels in tqdm(train_loader):
    # Zero out the gradients
    optimizer.zero_grad()

    # Forward pass
    x = images.view(-1, 28*28)
    y = model(x)

    loss = criterion(y, labels)
    # Backward pass
    loss.backward()
    optimizer.step()

## Testing
correct_mlp = 0
total_mlp = len(mnist_test)

with torch.no_grad():
    # Iterate through test set minibatches
    for images, labels in tqdm(test_loader):
        # Forward pass
        x = images.view(-1, 28*28)
        y = model(x)

        predictions = torch.argmax(y, dim=1)
        correct_mlp += torch.sum((predictions == labels).float())

print('Test accuracy: {}'.format(correct_mlp/total_mlp))

Test accuracy: 0.9223999977111816


In [None]:
num_params = 0
for name, p in model.named_parameters():
    # numel() is the number of scalar entries in this parameter tensor
    num_params += p.numel()
    print(name, tuple(p.shape), p.numel())

print("Total trainable parameters: {}".format(num_params))

### Short answer
How many trainable parameters does your model have?
How does this compare to the logistic regression example?

**MLP**

Input to Hidden Layer (Layer 1):

Weights: 784 (input) x 500 (hidden units) = 392,000 parameters

Biases: 500 (hidden units) = 500 parameters

Total for Layer 1: 392,000 + 500 = 392,500 parameters


Hidden Layer (Layer 1) to Output Layer (Layer 2):

Weights: 500 (hidden units) x 10 (output units) = 5,000 parameters

Biases: 10 (output units) = 10 parameters

Total for Layer 2: 5,000 + 10 = 5,010 parameters


Total Trainable Parameters: 392,500 + 5,010 = 397,510

-------------------------------------------------------

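**Logistic Regression**

Weights: 784 (input) x 10 (output units) = 7,840 parameters

Biases: 10 (output units) = 10 parameters

Total Trainable Parameters: 7,840 + 10 = 7,850

The MLP therefore has roughly 50 times as many trainable parameters as the logistic regression model (397,510 / 7,850 ≈ 50.6).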
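The hand calculation can be cross-checked in code. This is a sketch: `count_trainable` is a hypothetical helper, and the two models are rebuilt here from `nn.Linear` layers with the sizes used in the calculation rather than taken from the trained model.

```python
import torch.nn as nn

def count_trainable(module):
    """Sum the element counts of all parameters that require gradients."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Same layer sizes as in the calculation above.
mlp = nn.Sequential(nn.Linear(784, 500), nn.ReLU(), nn.Linear(500, 10))
logreg = nn.Linear(784, 10)

mlp_params = count_trainable(mlp)
logreg_params = count_trainable(logreg)

print(mlp_params)                             # 397510
print(logreg_params)                          # 7850
print(round(mlp_params / logreg_params, 1))   # 50.6
```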


### Changing the Number of Epochs

In [9]:
epochs = 4

# Instantiate model
model = MNIST_MLP()

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for e in range(0,epochs):

  for images, labels in tqdm(train_loader):
      # Zero out the gradients
      optimizer.zero_grad()

      # Forward pass
      x = images.view(-1, 28*28)
      y = model(x)

      loss = criterion(y, labels)
      # Backward pass
      loss.backward()
      optimizer.step()

## Testing
correct_mlp = 0
total_mlp = len(mnist_test)

with torch.no_grad():
    # Iterate through test set minibatches
    for images, labels in tqdm(test_loader):
        # Forward pass
        x = images.view(-1, 28*28)
        y = model(x)

        predictions = torch.argmax(y, dim=1)
        correct_mlp += torch.sum((predictions == labels).float())

print('Test accuracy: {}'.format(correct_mlp/total_mlp))

Test accuracy: 0.9567999839782715


### Changing the Learning Rate

In [10]:
learning_rate = 0.01
epochs = 4

# Instantiate model
model = MNIST_MLP()

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for e in range(0,epochs):

  for images, labels in tqdm(train_loader):
      # Zero out the gradients
      optimizer.zero_grad()

      # Forward pass
      x = images.view(-1, 28*28)
      y = model(x)

      loss = criterion(y, labels)
      # Backward pass
      loss.backward()
      optimizer.step()

## Testing
correct_mlp = 0
total_mlp = len(mnist_test)

with torch.no_grad():
    # Iterate through test set minibatches
    for images, labels in tqdm(test_loader):
        # Forward pass
        x = images.view(-1, 28*28)
        y = model(x)

        predictions = torch.argmax(y, dim=1)
        correct_mlp += torch.sum((predictions == labels).float())

print('Test accuracy: {}'.format(correct_mlp/total_mlp))

Test accuracy: 0.9016000032424927


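**Observation**

1. Increasing the number of epochs from 1 to 4 (learning rate 0.1) raised test accuracy from about 92.2% to about 95.7%.
2. Lowering the learning rate from 0.1 to 0.01 (still 4 epochs) reduced test accuracy to about 90.2%, most likely because the smaller update steps had not yet converged within 4 epochs.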