### Importing required packages

* First, we import PyTorch, the deep learning library we'll be using, and torchvision, which provides our dataset and data transformations.


* We also import torch.nn (PyTorch's neural network library), torch.nn.functional (which includes non-linear functions such as ReLU and sigmoid), and torch.optim, which implements various optimization algorithms.



In [None]:
# import libraries
import torch
import numpy as np
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import lr_scheduler
from torchvision import datasets, models, transforms
import matplotlib.pyplot as plt
import time
import os
import copy

import warnings
warnings.filterwarnings("ignore")

### Initializing CUDA

CUDA is NVIDIA's platform that acts as the interface between our code and the GPU.

By default, the code runs on the CPU. To run it on the GPU, we need CUDA. Check whether CUDA is available:

In [None]:
# Test whether a GPU is present in the system or not.
use_cuda = torch.cuda.is_available()
print('Using PyTorch version:', torch.__version__, 'CUDA:', use_cuda)

If it's False, then we run the program on CPU. If it's True, then we run the program on GPU.

Let us initialize some GPU-related variables:

In [None]:
device = torch.device("cuda" if use_cuda else "cpu")
device
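
As a minimal sketch of how the `device` object is used afterwards, the snippet below moves a small tensor to whichever device was selected. The tensor `x` and its shape are made up for illustration; they are not part of the notebook.

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors are created on the CPU by default.
x = torch.ones(2, 3)

# .to(device) returns the tensor on the chosen device
# (the same tensor if it is already there).
x_dev = x.to(device)

print(x.device, "->", x_dev.device)
```

The same `.to(device)` call is applied later to the model and to every batch, because a model and its inputs must live on the same device.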

### Load MNIST data

Now we'll load the MNIST data. The first time this runs, the data has to be downloaded, which can take a while.

* We load both the training set and the test set.

* We use transforms.Compose() to build the preprocessing pipeline. transforms.ToTensor() converts each image into a float tensor with values in [0, 1]. The data could also be standardized with transforms.Normalize() using the dataset mean and standard deviation, but the cell below applies only ToTensor().


In [None]:
# Convert each image to a torch.FloatTensor with values in [0, 1]
# No transforms.Normalize() is applied here (0.1307 and 0.3081 are the mean and std of MNIST data)
transform = transforms.Compose([transforms.ToTensor()])

In [None]:
# Choose the training and test datasets
train_data = datasets.MNIST(root='data', train=True,
                                   download=True, transform=transform)
test_data = datasets.MNIST(root='data', train=False,
                                  download=True, transform=transform)

In [None]:
# Verifying mean and std of MNIST data
train_data.data.float().mean() / 255, train_data.data.float().std() / 255

In [None]:
# Number of training samples
len(train_data)

In [None]:
# Size of one training image
train_data[0][0].size()



The **torch.utils.data.DataLoader** class represents a Python iterable over a dataset, with the following features:

1. Batching the data
2. Shuffling the data
3. Loading the data in parallel using multiprocessing workers


The batches of train and test data are provided via data loaders that provide iterators over the datasets to train our models.

In [None]:
# Initializing batch size
batch_size = 32

# Loading the train dataset
train_loader = torch.utils.data.DataLoader(dataset=train_data, 
                                           batch_size=batch_size, 
                                           shuffle=True)

# Loading the test dataset
test_loader = torch.utils.data.DataLoader(dataset=test_data, 
                                          batch_size=batch_size, 
                                          shuffle=True)
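
To make the batching behaviour concrete, the following sketch runs a DataLoader over a small synthetic dataset. The 100 random "images" and labels are placeholders standing in for MNIST, chosen only so that the last batch comes out smaller than the others.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 fake grayscale "images" and integer labels (placeholder data, not MNIST).
images = torch.rand(100, 1, 28, 28)
labels = torch.randint(0, 10, (100,))
dataset = TensorDataset(images, labels)

# Same settings as in the notebook: batches of 32, reshuffled every epoch.
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Iterating over the loader yields (inputs, targets) pairs, one per batch.
batch_sizes = [x.size(0) for x, y in loader]

print(len(loader), batch_sizes)  # 4 [32, 32, 32, 4]
```

Since 100 is not a multiple of 32, the final batch holds the remaining 4 samples. This is why code that depends on the batch size should read it from the batch itself, for example with `len(data)`.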

The first element of each training batch (X_train) is a 4th-order tensor of size (batch_size, 1, 28, 28), i.e. a batch of images of 1x28x28 pixels, where '1' is the single input channel (grayscale). y_train is a vector containing the correct class ("0", "1", ..., "9") for each training digit.

In [None]:
for (X_train, y_train) in train_loader:
    print('X_train:', X_train.size(), 'type:', X_train.type())
    print('y_train:', y_train.size(), 'type:', y_train.type())
    break

### Visualize a Batch of Training Data

In [None]:
pltsize=2
plt.figure(figsize=(15*pltsize, pltsize))

for i in range(10):
    plt.subplot(1,10,i+1)
    plt.axis('off')
   
    plt.imshow(X_train[i,:,:,:].numpy().reshape(28,28), cmap="gray")
    # .item() converts the 0-d label tensor to a plain Python int for the title
    plt.title('Class: '+str(y_train[i].item()))

In [None]:
# obtain one batch of training images
dataiter = iter(train_loader)
images, labels = next(dataiter)
images = images.numpy()

pltsize=2
plt.figure(figsize=(15*pltsize, pltsize))

for i in range(10):
    plt.subplot(1,10,i+1)
    plt.axis('off')  
    plt.imshow(images[i,:,:,:].reshape(28,28), cmap="gray")
    # print out the correct label for each image
    plt.title('Class: '+str(labels[i].item()))

### View an Image in More Detail

In [None]:
img = np.squeeze(images[1])

fig = plt.figure(figsize = (12,12)) 
ax = fig.add_subplot(111)
ax.imshow(img, cmap='gray')
width, height = img.shape
thresh = img.max()/2.5
for x in range(width):
    for y in range(height):
        val = round(img[x][y],2) if img[x][y] !=0 else 0
        ax.annotate(str(val), xy=(y,x),
                    horizontalalignment='center',
                    verticalalignment='center',
                    color='white' if img[x][y]<thresh else 'black')

### Define the Neural Network Architecture and Optimizer

Let's define the network as a Python class.

Three functions are relevant to this class:

- ### **\__init__()**:
In this function, we shall declare all the layers of our neural network, including the number of neurons, non-linear activations, etc.

- ### **forward()**:
This function computes the forward pass of the network. Here, we connect the layers declared in \__init__() according to the network architecture we want to build. In this case, $x \rightarrow fc1 \rightarrow relu \rightarrow fc2 \rightarrow relu \rightarrow fc3 \rightarrow relu \rightarrow fc4 \rightarrow relu \rightarrow fc5 \rightarrow out$.

forward() is invoked by calling an object of this class directly. For example:

```
model = Net_pretrained()
out, features = model(x)
```


- ### **backward()**:
This function computes the gradients across the entire network. It is called on the loss tensor computed at the end of the network.

```
loss.backward()
```

We have to write the **\__init__()** and **forward()** methods ourselves; PyTorch's autograd automatically provides **backward()** for computing the gradients in the backward pass.

In this case, we pass the input (X) through the first layer and pass its output through a ReLU, then repeat this for the second, third, and fourth layers. The output of the fourth ReLU goes through the fifth layer, which produces the raw class scores (logits). No softmax layer is applied inside the network. forward() returns both the logits and the activations of the last hidden layer.

In [None]:
class Net_pretrained(nn.Module):
    def __init__(self):
        super(Net_pretrained, self).__init__()
        # Fully connected layers: 784 -> 512 -> 512 -> 512 -> 512 -> 10
        self.fc1 = nn.Linear(28 * 28, 512) # First fully connected layer; takes the flattened 28x28 image (784 values)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, 512)
        self.fc4 = nn.Linear(512, 512)
        self.fc5 = nn.Linear(512, 10) # Last fully connected layer which outputs our 10 labels

    def forward(self, x):
        # Flatten each image: (batch_size, 1, 28, 28) -> (batch_size, 784)
        x = x.view(-1, 28 * 28)
        # Hidden layers with ReLU activations
        # ReLU passes positive values through unchanged and sets negative values to zero
        x1 = F.relu(self.fc1(x))
        x2 = F.relu(self.fc2(x1))
        x3 = F.relu(self.fc3(x2))
        x4 = F.relu(self.fc4(x3))
        output = self.fc5(x4)
        return output, x4
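
The flattening step in forward() can be checked in isolation. The sketch below uses an all-zero dummy batch (an assumption for illustration) and a standalone `nn.Linear` layer with the same shape as `fc1`, rather than the full network.

```python
import torch
import torch.nn as nn

# A dummy batch shaped like one MNIST batch: (batch_size, channels, height, width).
batch = torch.zeros(32, 1, 28, 28)

# view(-1, 784) keeps the batch dimension and flattens everything else.
flat = batch.view(-1, 28 * 28)
print(flat.shape)    # torch.Size([32, 784])

# A layer with the same dimensions as fc1 maps 784 inputs to 512 hidden units.
fc1_like = nn.Linear(28 * 28, 512)
hidden = torch.relu(fc1_like(flat))
print(hidden.shape)  # torch.Size([32, 512])
```

The `-1` lets PyTorch infer the batch dimension, so the same code works for the smaller final batch of an epoch.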

### Calling the instances of the network

Let us declare an object of class Net_pretrained and move it to the GPU if CUDA is available:

In [None]:
model_pretrained = Net_pretrained()
model_pretrained = model_pretrained.to(device) 

In [None]:
print(model_pretrained)

To get the parameter count of each layer, we can use model.named_parameters(), which returns an iterator over pairs of the parameter name and the parameter itself.

In [None]:
from prettytable import PrettyTable

def count_parameters(model):
    table = PrettyTable(["Modules", "Parameters"])
    total_params = 0
    for name, parameter in model.named_parameters():
        # count only the trainable parameters
        if not parameter.requires_grad:
            continue
        params = parameter.numel()
        table.add_row([name, params])
        # sum the number of elements for every parameter group:
        total_params+=params
    print(table)
    print(f"Total Trainable Params: {total_params}")
    return total_params
    
count_parameters(model_pretrained)
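
The total can also be worked out by hand: a fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The sketch below, which rebuilds only the layer shapes from the class above (the names `layer_sizes`, `by_hand`, and `by_pytorch` are illustrative), compares that arithmetic with PyTorch's own count.

```python
import torch.nn as nn

# (inputs, outputs) of fc1 ... fc5 as declared in Net_pretrained.
layer_sizes = [(784, 512), (512, 512), (512, 512), (512, 512), (512, 10)]

# Weights (n_in * n_out) plus biases (n_out) for every layer.
by_hand = sum(n_in * n_out + n_out for n_in, n_out in layer_sizes)

# The same count taken from actual nn.Linear layers.
layers = nn.ModuleList([nn.Linear(n_in, n_out) for n_in, n_out in layer_sizes])
by_pytorch = sum(p.numel() for p in layers.parameters() if p.requires_grad)

print(by_hand, by_pytorch)  # 1195018 1195018
```

The first layer contributes 401,920 parameters, each of the three 512x512 layers contributes 262,656, and the output layer contributes 5,130.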

### Declaring the loss function and the optimizer

We create the loss as criterion = nn.CrossEntropyLoss() and call it with two arguments: the output, which is the model's prediction for the given images (one raw score, or logit, per class), and the target, which holds the actual labels of those images. PyTorch's cross-entropy criterion applies a log-softmax to the output and then computes the negative log-likelihood loss against the target, so the network itself does not need a softmax layer.
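
The relationship between cross entropy, log-softmax, and the negative log-likelihood can be verified numerically. This is a small sketch on random logits; the batch of 4 samples and the target labels are arbitrary illustrative values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Raw scores for 4 samples and 10 classes, plus made-up target labels.
logits = torch.randn(4, 10)
target = torch.tensor([3, 0, 9, 1])

# The built-in criterion works directly on the logits.
ce = nn.CrossEntropyLoss()(logits, target)

# The same value computed in two explicit steps.
manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

print(torch.allclose(ce, manual))  # True
```

This is why forward() returns raw scores: applying a softmax inside the network and then nn.CrossEntropyLoss() would apply the normalization twice.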

Finally, we define an optimizer to update the model parameters based on the computed gradients. We select stochastic gradient descent (SGD) as the optimization algorithm and set the learning rate to 0.01. The cell below uses plain SGD; momentum can be enabled through the optimizer's momentum argument.

**Note** that there are several different options for the optimizer in PyTorch that we could use instead of SGD.

In [None]:
learning_rate = 0.01

# specify loss function
criterion = nn.CrossEntropyLoss()

# specify optimizer
optimizer = torch.optim.SGD(model_pretrained.parameters(), lr=learning_rate)
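
For plain SGD without momentum, one optimizer step amounts to `w = w - lr * grad`. The sketch below checks this on a two-element parameter with a toy quadratic loss; both are invented for illustration and unrelated to the MNIST model.

```python
import torch

# A single trainable "parameter" and an SGD optimizer over it.
w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.01)

# Toy loss sum(w^2), whose gradient is 2 * w.
loss = (w ** 2).sum()
loss.backward()

# What the update rule predicts, computed before the step is taken.
expected = w.detach() - 0.01 * w.grad

optimizer.step()

print(torch.allclose(w.detach(), expected))  # True
```

With momentum enabled the optimizer would additionally keep a running velocity per parameter, so this simple identity would only hold for the very first step.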

### Training and Testing the model

Let's now define functions to train() and test() the model.

In the training phase, we iterate over the batches of images in the train_loader. For each batch, we perform the following steps:

* First, we zero out the gradients using zero_grad()

* We pass the data to the model, i.e. we perform the forward pass, which calls forward()

* We calculate the loss from the predicted and actual labels

* We perform the backward pass using backward() to compute the gradients

* We update the weights using optimizer.step()

In [None]:
def train(epoch, log_interval=100):
    # Switch the module to training mode (this matters for layers such as dropout and batch normalization)
    model_pretrained.train()

    # Loop through each batch of images in train set
    for batch_idx, (data, target) in enumerate(train_loader):
       
        data, target = data.to(device), target.to(device)

        # Zero out the gradients from the previous step
        optimizer.zero_grad()

        # Forward pass (this calls the "forward" function of Net_pretrained)
        output, _ = model_pretrained(data)

        # Compute the Loss
        loss = criterion(output, target)

        # Do backward pass
        loss.backward()

        # optimizer.step() updates the weights accordingly
        optimizer.step()

        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
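
The reason for calling zero_grad() at every step is that backward() adds to any gradient already stored on a parameter. A minimal sketch with a single scalar parameter (a toy setup, not the MNIST model) shows the accumulation:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.01)

# First backward pass: d(3*w)/dw = 3.
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added on top.
(3 * w).sum().backward()
accumulated = w.grad.clone()

# After zero_grad(), the next backward pass starts from a clean slate.
optimizer.zero_grad()
(3 * w).sum().backward()
fresh = w.grad.clone()

print(first.item(), accumulated.item(), fresh.item())  # 3.0 6.0 3.0
```

Without the zero_grad() call in train(), each update would use the sum of the gradients from all previous batches instead of the current one.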

Now we are ready to train our model using the train() function. An epoch means one pass through the whole training data. After each epoch, we evaluate the model using test().

In the testing phase, we iterate over the batches of images in the test_loader. For each batch, we perform the following steps:

* We pass the images through the model (network) to get the outputs
* We pick the class / label with the highest score
* We calculate the accuracy

In [None]:
def test(loss_vector, accuracy_vector):
    model_pretrained.eval()                           # Set the PyTorch module to evaluation mode

    test_loss, correct = 0, 0

    # Gradients are not needed during evaluation
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)  # Move the batch to the selected device

            # Pass the images to the model, which returns the class scores (logits)
            output, _ = model_pretrained(data)

            # Accumulate the loss of this batch
            test_loss += criterion(output, target).item()

            # The predicted class is the index of the highest score
            _, pred = torch.max(output, 1)

            # Compare predictions to the true labels
            correct += (pred == target).sum().item()
    
    # Calculating the loss
    test_loss /= len(test_loader)
    loss_vector.append(test_loss)

    # Calculating the accuracy
    accuracy = 100. * correct / len(test_loader.dataset)

    accuracy_vector.append(accuracy)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset), accuracy))
    return accuracy_vector
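
The prediction and accuracy steps can be traced on a hand-written example. The scores and labels below are made up for illustration, with 3 classes instead of 10 to keep them readable.

```python
import torch

# Raw scores for 4 samples and 3 classes (invented values).
output = torch.tensor([[0.1, 2.0, -1.0],
                       [1.5, 0.2, 0.3],
                       [0.0, 0.1, 3.0],
                       [2.0, 2.5, 0.0]])
target = torch.tensor([1, 0, 2, 0])

# torch.max along dim 1 returns (max values, their indices);
# the indices are the predicted classes.
_, pred = torch.max(output, 1)

# Count the matches and convert to a percentage.
correct = (pred == target).sum().item()
accuracy = 100.0 * correct / len(target)

print(pred.tolist(), correct, accuracy)  # [1, 0, 2, 1] 3 75.0
```

Taking the maximum of the raw scores gives the same class as taking the maximum of the softmax probabilities, since softmax preserves the ordering of the scores.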

In [None]:
%%time
epochs = 10

lossv, accv = [], []
acc_vector = []
for epoch in range(1, epochs + 1):
    train(epoch)
    acc_vector = test(lossv, accv)

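The %%time output above shows how long training took. When CUDA is available, training on the GPU is typically much faster than on the CPU.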

Let's now visualize how the training progressed.

Loss is a function of the difference of the network output and the target values. We are minimizing the loss function during training so it should decrease over time.

### Plotting epoch vs test error

In [None]:
plt.figure(figsize=(8,5))
plt.plot(np.arange(1,epochs+1), lossv)
plt.title('test loss')
plt.xlabel("epoch")
plt.ylabel("error");

### Save the trained model

**Note:** For details on saving PyTorch models using `state_dict()`, refer to the following [link](https://pytorch.org/tutorials/recipes/recipes/what_is_state_dict.html).

In [None]:
# Specify a path
PATH = "pretrained_layer4_512n.pt"

# Save the parameters (state_dict) of the trained PyTorch model
torch.save(model_pretrained.state_dict(), PATH)

# Load
# model_pretrained = Net_pretrained()
# model_pretrained.load_state_dict(torch.load(PATH))
# model_pretrained.eval()
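
As a hedged illustration of the save/load round trip, the sketch below uses a small stand-in `nn.Linear` model and a temporary file (the file name `demo_linear.pt` is hypothetical) rather than the trained network. It checks that the restored parameters match the saved ones.

```python
import os
import tempfile

import torch
import torch.nn as nn

# A small stand-in model; the notebook's Net_pretrained works the same way.
model = nn.Linear(4, 2)

# Save only the parameters (the state_dict), not the whole Python object.
path = os.path.join(tempfile.mkdtemp(), "demo_linear.pt")
torch.save(model.state_dict(), path)

# To load, first build a model with the same architecture, then fill it in.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()  # switch to evaluation mode before inference

# Every tensor in the restored state_dict equals the original one.
same = all(torch.equal(a, b)
           for a, b in zip(model.state_dict().values(),
                           restored.state_dict().values()))
print(same)  # True
```

Passing `map_location="cpu"` is one way to load a checkpoint saved on a GPU machine onto a machine without CUDA; the model can then be moved to another device with `.to(device)`.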