# Basics of MLP and Backpropagation on MNIST dataset

In this tutorial we implement a multilayer perceptron (MLP) with one hidden layer. The tutorial has two parts: (a) implementing backpropagation from scratch, and (b) using PyTorch's built-in autograd module to train the same MLP.

To keep data loading simple, we use the torchvision package, a companion to PyTorch that provides data loaders for standard datasets such as ImageNet, CIFAR10, and MNIST.

## Import all the required packages

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import numpy as np
print("Importing done!")

## Initialize the variables


In [None]:
batch_size = 32 # Batch size
input_dim = 784 # Input dimension (For MNIST dataset each image is of size 28 x 28 = 784)
num_of_hidden_nodes = 100 # number of hidden nodes in hidden layer
output_dim = 10 # Number of output nodes = number of classes in the dataset. In this case it is 10

learning_rate = 0.1
num_epochs = 5

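## Load the MNIST data
With `download=True`, torchvision fetches MNIST into the current directory (`'.'`) on the first run and reuses the saved copy afterwards. We then standardize the pixel values using the dataset mean (0.1307) and standard deviation (0.3081). This is a per-pixel normalization rather than whitening in the strict sense.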

In [None]:
train_loader = torch.utils.data.DataLoader(datasets.MNIST('.', train=True, download=True,
                                                          transform=transforms.Compose([
                                                              transforms.ToTensor(),
                                                              transforms.Normalize((0.1307,), (0.3081,))])),
                                           batch_size= batch_size, shuffle=True)
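
To see what the transform and the later `view(-1, 784)` call do without downloading anything, here is a minimal sketch on a synthetic batch. The tensor `fake_batch` is a hypothetical stand-in for one batch from `train_loader`, assuming pixel values in [0, 1] as produced by `ToTensor()`.

```python
import torch

# Hypothetical stand-in for one MNIST batch: 32 grayscale 28 x 28 images in [0, 1]
fake_batch = torch.rand(32, 1, 28, 28)

mean, std = 0.1307, 0.3081  # the constants passed to transforms.Normalize

# Normalize computes (x - mean) / std for every pixel
normalized = (fake_batch - mean) / std

# Flatten each image into a 784-dimensional row vector, as the training loop does
flat = normalized.view(-1, 28 * 28)

# The extreme pixel values 0 and 1 map to these normalized values
darkest = (0.0 - mean) / std
brightest = (1.0 - mean) / std

print(flat.shape)
print(round(darkest, 4), round(brightest, 4))
```

After normalization the inputs are no longer confined to [0, 1]; background pixels become slightly negative and bright strokes become positive.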

## Sigmoid activation function and its derivative

$\sigma(x)=\frac{1}{1+e^{-x}}$

$\sigma^{'}(x) = \sigma(x)(1-\sigma(x))$

In [None]:
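def sigmoid(x):
    # Logistic function 1 / (1 + exp(-x)), applied element-wise
    return 1 / (1 + torch.exp(-x))


def sigmoid_diff(x):
    # Derivative of the sigmoid with respect to its input x
    s = sigmoid(x)
    return s * (1 - s)

tensor = torch.tensor([[1., 2., 3.], [1., 2., 3.]])
print(sigmoid(tensor)) # You can use it for debugging
print(torch.sigmoid(tensor)) # The built-in version should print the same values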
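
As a quick sanity check, the analytic derivative can be compared with the gradient that autograd computes. The sketch below redefines both functions so that it runs on its own; the test points in `x` are an arbitrary choice for illustration.

```python
import torch

def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

def sigmoid_diff(x):
    s = sigmoid(x)
    return s * (1 - s)

# A few test points; requires_grad lets autograd track operations on x
x = torch.linspace(-4.0, 4.0, steps=9).requires_grad_()

# Summing the outputs gives a scalar whose gradient w.r.t. x[i] is sigmoid'(x[i])
sigmoid(x).sum().backward()
autograd_grad = x.grad.clone()

analytic_grad = sigmoid_diff(x.detach())
max_abs_error = (autograd_grad - analytic_grad).abs().max().item()
print(max_abs_error)
```

Note that `sigmoid_diff` expects the pre-activation `x`, not the already-activated value `sigmoid(x)`.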

## Initialize the weight matrices with some random values

$W_1 \in \mathbb{R}^{784 \times 100}$

$W_2 \in \mathbb{R}^{100 \times 10}$

In [None]:
# Initialize the weights
W_1 = torch.randn(input_dim, num_of_hidden_nodes).type(torch.FloatTensor) # Weights between input and hidden layer
W_2 = torch.randn(num_of_hidden_nodes, output_dim).type(torch.FloatTensor) # Weights between hidden layer and output

## The training loop with manual backpropagation

Each epoch iterates over the training set in mini-batches. For each batch we run the forward pass, measure the error, and back-propagate it to obtain the weight gradients.

![alt text](images/mlp.png "MLP with 3-layers")


Assume `batch_size = 1`, so the input $X \in \mathbb{R}^{1 \times 784}$ is a row vector. Below, $*$ denotes matrix multiplication and $.$ denotes element-wise multiplication.

### Mean-Squared Loss Function:

$L = \frac{1}{2}\sum_{k=1}^{10}(output_k - true\_output_k)^2$
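
As a worked example of this loss, the sketch below builds one-hot targets with `scatter_` (the same call the training loop uses) and evaluates the loss for a made-up prediction. The labels and the constant prediction of 0.1 are hypothetical values chosen so the arithmetic is easy to follow.

```python
import torch
import torch.nn.functional as F

num_classes = 10
labels = torch.tensor([3, 0, 9])  # hypothetical class labels for 3 samples

# One-hot encoding via scatter_: write 1.0 at column labels[i] of row i
onehot_scatter = torch.zeros(labels.size(0), num_classes)
onehot_scatter.scatter_(1, labels[:, None], 1.0)

# The same encoding via the built-in helper, for comparison
onehot_builtin = F.one_hot(labels, num_classes=num_classes).float()

# Hypothetical network output: every class gets a score of 0.1
output = torch.full((3, num_classes), 0.1)

# Per sample: nine wrong classes contribute 0.1^2 each and the true class (0.1 - 1)^2,
# so the sum of squares is 0.09 + 0.81 = 0.9, and the loss is 0.5 * 3 * 0.9 = 1.35
loss = 0.5 * (output - onehot_scatter).pow(2).sum()
print(loss.item())
```

For a batch, the loss is summed over both the classes and the samples, which is what the training loop accumulates.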

### Forward Pass:

$Z = \sigma(X * W_1)$           [$\mathbb{R}^{1 \times 100}$]

$output = \sigma(Z * W_2)$       [$\mathbb{R}^{1 \times 10}$]
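
The shapes in the forward pass can be confirmed with a short sketch. It uses the row-vector convention of the training code (`torch.mm(x, W)`), and the input `X` is a random, hypothetical stand-in for one flattened image.

```python
import torch

input_dim, num_of_hidden_nodes, output_dim = 784, 100, 10

# Hypothetical single flattened image (batch_size = 1) and random weights
X = torch.randn(1, input_dim)
W_1 = torch.randn(input_dim, num_of_hidden_nodes)
W_2 = torch.randn(num_of_hidden_nodes, output_dim)

# Hidden activations, then output activations
Z = torch.sigmoid(torch.mm(X, W_1))
output = torch.sigmoid(torch.mm(Z, W_2))

print(Z.shape, output.shape)
```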

### Backward Pass:

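Derivative of loss: $diff = (output - true\_output)$   [$\mathbb{R}^{1 \times 10}$]

Since $\sigma^{'}(a) = \sigma(a)(1-\sigma(a))$, the sigmoid derivative at each layer can be written using that layer's output:

$\delta_2 = diff . output . (1 - output)$    [$\mathbb{R}^{1 \times 10}$]

$\frac{\partial L}{\partial W_2} = Z^{T} * \delta_2$    [$\mathbb{R}^{100 \times 10}$]

$\frac{\partial L}{\partial W_1} = X^{T} * ((\delta_2 * W_2^{T}) . Z . (1 - Z))$ [$\mathbb{R}^{784 \times 100}$]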
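
The backward-pass expressions can be checked against autograd on a small network. The sketch below uses the identity $\sigma^{'}(a) = \sigma(a)(1-\sigma(a))$, so each sigmoid derivative is written in terms of the layer's output. The sizes (4 samples, 6 inputs, 5 hidden nodes, 3 classes) and the targets are hypothetical, chosen only to keep the check fast.

```python
import torch

torch.manual_seed(0)

# Tiny hypothetical problem: 4 samples, 6 inputs, 5 hidden nodes, 3 classes
X = torch.randn(4, 6, dtype=torch.float64)
Y = torch.zeros(4, 3, dtype=torch.float64)
Y[torch.arange(4), torch.tensor([0, 2, 1, 2])] = 1.0  # one-hot targets

W_1 = torch.randn(6, 5, dtype=torch.float64, requires_grad=True)
W_2 = torch.randn(5, 3, dtype=torch.float64, requires_grad=True)

# Forward pass and loss, tracked by autograd
Z = torch.sigmoid(X @ W_1)
output = torch.sigmoid(Z @ W_2)
loss = 0.5 * (output - Y).pow(2).sum()
loss.backward()

# Manual backward pass using activation * (1 - activation) as the sigmoid derivative
with torch.no_grad():
    delta_2 = (output - Y) * output * (1 - output)   # 4 x 3
    grad_w2 = Z.t() @ delta_2                        # 5 x 3
    delta_1 = (delta_2 @ W_2.t()) * Z * (1 - Z)      # 4 x 5
    grad_w1 = X.t() @ delta_1                        # 6 x 5

w1_matches = torch.allclose(grad_w1, W_1.grad, atol=1e-10)
w2_matches = torch.allclose(grad_w2, W_2.grad, atol=1e-10)
print(w1_matches, w2_matches)
```

With more than one sample per batch, the matrix products `Z.t() @ delta_2` and `X.t() @ delta_1` sum the per-sample gradients, which matches a loss that is summed over the batch.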

### Parameter Update:

$W_1 = W_1 - \eta \frac{\partial L}{\partial W_1}$

$W_2 = W_2 - \eta \frac{\partial L}{\partial W_2}$
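
To illustrate the update rule in isolation, here is a minimal sketch on a hypothetical one-parameter problem, $L(w) = \frac{1}{2}(w - 3)^2$, which is not part of the MLP. It only shows how repeated steps of $w = w - \eta \frac{\partial L}{\partial w}$ move the parameter toward the minimum.

```python
# Hypothetical one-parameter problem: L(w) = 0.5 * (w - 3)^2, so dL/dw = w - 3
def loss_fn(w):
    return 0.5 * (w - 3.0) ** 2

def grad_fn(w):
    return w - 3.0

eta = 0.1  # learning rate, the same value as learning_rate in this tutorial
w = 0.0    # starting point

history = [loss_fn(w)]
for step in range(100):
    w = w - eta * grad_fn(w)  # same rule as W = W - eta * dL/dW
    history.append(loss_fn(w))

print(w, history[0], history[-1])
```

Each step shrinks the distance to the minimum by a factor of $1 - \eta = 0.9$, so the loss falls steadily. For the MLP the loss surface is not this simple, and a learning rate that is too large can make the loss rise instead.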

In [None]:
for epoch in range(0, num_epochs):
    correct = 0
    loss = 0.0
    for batch_idx, (x_batch, y_batch) in enumerate(train_loader):
        # Forward Pass
        x_batch = x_batch.view(-1, input_dim)
        hidden_state_output = sigmoid(torch.mm(x_batch, W_1))
        output = sigmoid(torch.mm(hidden_state_output, W_2))
        
        # Convert the labels to one hot encoded format
        # (allocated per batch so a smaller final batch would also work)
        y_batch_onehot = torch.zeros(x_batch.size(0), output_dim)
        y_batch_onehot.scatter_(1, y_batch[:, None], 1)
        
        # Loss (Mean-Squared error), summed over the batch
        loss += (output - y_batch_onehot).pow(2).sum().item()*0.5
        _, predicted_class = output.max(1)
        correct += predicted_class.eq(y_batch).sum().item()
        
        # Backward Pass (Back-Propagation)
        # Derivative of MSE Loss
        diff = (output - y_batch_onehot)

        # output and hidden_state_output are already sigmoid activations, so the
        # sigmoid derivative at each layer is activation * (1 - activation)
        delta_2 = diff * output * (1 - output) # batch x 10
        delta_1 = torch.mm(delta_2, W_2.t()) * hidden_state_output * (1 - hidden_state_output) # batch x 100

        grad_w2 = torch.mm(hidden_state_output.t(), delta_2) # 100 x 10 dimensional
        grad_w1 = torch.mm(x_batch.t(), delta_1) # 784 x 100
        
        # Perform gradient descent
        W_1 -= learning_rate*grad_w1
        W_2 -= learning_rate*grad_w2
        
    print("Epoch: {0} | loss: {1} | accuracy: {2}".format(epoch, loss/len(train_loader)
                                                          , correct/float(len(train_loader.dataset))))

## Using in-built Autograd function

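`loss.backward()`: calculates the gradients of the loss function w.r.t. all the parameters in the network (every tensor with `requires_grad=True`) and accumulates them in each tensor's `.grad` attribute.

`optimizer.step()`: updates all the parameters of the network when a `torch.optim` optimizer is used. The cell below applies the same update by hand and then zeroes the gradients, because `backward()` adds to `.grad` rather than overwriting it.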
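
For comparison, here is a minimal sketch of the same kind of training step driven by `torch.optim.SGD`, so that `optimizer.step()` and `optimizer.zero_grad()` replace the manual update. The data is random and hypothetical (8 samples, 6 inputs, 5 hidden nodes, 3 classes) so that the block runs without MNIST.

```python
import torch
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical toy data: 8 samples, 6 inputs, 3 classes
X = torch.randn(8, 6)
labels = torch.randint(0, 3, (8,))
Y = torch.zeros(8, 3)
Y.scatter_(1, labels[:, None], 1.0)  # one-hot targets

# Parameters tracked by autograd
W_1 = torch.randn(6, 5, requires_grad=True)
W_2 = torch.randn(5, 3, requires_grad=True)
W_1_before = W_1.detach().clone()

optimizer = optim.SGD([W_1, W_2], lr=0.1)

losses = []
for step in range(200):
    optimizer.zero_grad()  # clear gradients left over from the previous step
    output = torch.sigmoid(torch.sigmoid(X @ W_1) @ W_2)
    loss = 0.5 * (output - Y).pow(2).sum()
    loss.backward()        # fill W_1.grad and W_2.grad
    optimizer.step()       # W = W - lr * W.grad for every parameter
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Printing the first and last entries of `losses` shows whether the loss went down on this toy problem. Plain SGD without momentum performs the same update as the manual `W -= learning_rate * W.grad`.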


In [None]:
learning_rate = 0.1

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

W_1 = torch.randn(input_dim, num_of_hidden_nodes, device=device, requires_grad=True)
W_2 = torch.randn(num_of_hidden_nodes, output_dim, device=device, requires_grad=True)

for epoch in range(0, num_epochs):
    
    correct = 0
    total_loss = 0
    for batch_idx, (x_batch, y_batch) in enumerate(train_loader):
        
        x_batch = x_batch.view(-1, input_dim).to(device)
        y_batch = y_batch.to(device)
        
        # Forward Pass
        hidden_state_output = torch.sigmoid(torch.mm(x_batch, W_1))
        output = torch.sigmoid(torch.mm(hidden_state_output, W_2))
        
        # Convert the labels to one hot encoded format
        y_batch_onehot = torch.zeros(x_batch.size(0), output_dim, device=device)
        y_batch_onehot.scatter_(1, y_batch[:, None], 1)
        
        # Loss (Mean-Squared error)
        loss = (output - y_batch_onehot).pow(2).sum().mul(0.5)
        total_loss += loss.item()
        loss.backward()

        # Calculate no of correct classifications
        _, predicted_class = output.max(1)
        correct += predicted_class.eq(y_batch).sum().item()
        
        # Gradient descent step; no_grad keeps the update out of the autograd graph
        with torch.no_grad():
            W_1 -= learning_rate * W_1.grad
            W_2 -= learning_rate * W_2.grad

        # Manually zero the gradients before the next backward pass
        W_1.grad.zero_()
        W_2.grad.zero_()

    print("Epoch: {0} | loss: {1} | accuracy: {2}".format(epoch, total_loss/len(train_loader)
                                                          , correct/float(len(train_loader.dataset))))
        
        