# Coursework2: Convolutional Neural Networks 

## instructions

Please submit a version of this notebook containing your answers **together with your trained model** on CATe as CW2.zip. Write your answers in the cells below each question.

A PDF version of this notebook is also provided in case the figures do not render correctly.

**The deadline for submission is 19:00, Thu 14th February, 2019**

### Setting up working environment 

For this coursework you will need to train a large network, therefore we recommend you work with Google Colaboratory, which provides free GPU time. You will need a Google account to do so. 

Please log in to your account and go to the following page: https://colab.research.google.com. Then upload this notebook.

For GPU support, go to "Edit" -> "Notebook Settings", and select "Hardware accelerator" as "GPU".

You will need to install pytorch by running the following cell:

In [0]:
!pip install torch torchvision



## Introduction

For this coursework you will implement one of the most commonly used model for image recognition tasks, the Residual Network. The architecture is introduced in 2015 by Kaiming He, et al. in the paper ["Deep residual learning for image recognition"](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf). 
<br>

In a residual network, each block contains some convolutional layers, plus "skip" connections, which allow the activations to by pass a layer, and then be summed up with the activations of the skipped layer. The image below illustrates a building block in residual networks.

![resnet-block](utils/resnet-block.png)

Depending on the number of building blocks, resnets can have different architectures, for example ResNet-50, ResNet-101 and etc. Here you are required to build ResNet-18 to perform classification on the CIFAR-10 dataset, therefore your network will have the following architecture:

![resnet](utils/resnet.png)

## Part 1 (40 points)

In this part, you will use basic pytorch operations to define the 2D convolution and max pooling operation. 

### YOUR TASK

- implement the forward pass for Conv2D and MaxPool2D
- You can only fill in the parts which are specified as "YOUR CODE HERE"
- You are **NOT** allowed to use the torch.nn module and the conv2d/maxpooling functions in torch.nn.functional

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [0]:
class Conv2D(nn.Module):
    
    def __init__(self, inchannel, outchannel, kernel_size, stride, padding, bias = True):
        
        super(Conv2D, self).__init__()
        
        self.inchannel = inchannel
        self.outchannel = outchannel
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        
        self.weights = nn.Parameter(torch.Tensor(outchannel, inchannel, 
                                                 kernel_size, kernel_size))
        self.weights.data.normal_(-0.1, 0.1)
        
        if bias:
            self.bias = nn.Parameter(torch.Tensor(outchannel, ))
            self.bias.data.normal_(-0.1, 0.1)
        else:
            self.bias = None
            
        
    def forward(self, x):
        ##############################################################
        #                       YOUR CODE HERE                       #       
        ##############################################################
        
        W_out = int((x.shape[2] - self.kernel_size + 2*(self.padding)) / self.stride) + 1
        H_out = int((x.shape[3] - self.kernel_size + 2*(self.padding)) / self.stride) + 1
        C_out = self.outchannel
        
        # Unfold the input
        x_unf = torch.nn.functional.unfold(x, kernel_size=self.kernel_size, \
                                               stride = self.stride, padding = self.padding)
        # Multiply input with weights
        output_unf = x_unf.transpose(1, 2).matmul(self.weights.view(self.weights.size(0), -1).t()).transpose(1, 2)
        
        # View output in proper dimension size for the next layer
        output = output_unf.view(x.size(0), self.outchannel, W_out, H_out)  
        ##############################################################
        #                       END OF YOUR CODE                     #
        ##############################################################
        

        return output

In [0]:
class MaxPool2D(nn.Module):
    
    def __init__(self, pooling_size):
        # assume pooling_size = kernel_size = stride
        
        super(MaxPool2D, self).__init__()
        
        self.pooling_size = pooling_size
        

    def forward(self, x):
        ##############################################################
        #                       YOUR CODE HERE                       #       
        ##############################################################
        stride = self.pooling_size
        W_out = int((x.shape[2] - self.pooling_size) / stride) + 1
        H_out = int((x.shape[3] - self.pooling_size) / stride) + 1
        
        # Unfold the input
        x_unf = torch.nn.functional.unfold(x, kernel_size=self.pooling_size, \
                                               stride = self.pooling_size)
        
        # Kernel size
        w = self.pooling_size ** 2
        
        # View unfolded input in a beneficial dimension size
        x_unf = x_unf.view(x_unf.size(0), x.size(1), w, -1)
        
        # Get max values from the kernel window
        x_unf = x_unf.max(dim=2)[0]
        
        # View output in proper dimension size for the next layer
        output = x_unf.view(x.size(0), x.size(1), W_out, H_out)
        ##############################################################
        #                       END OF YOUR CODE                     #
        ##############################################################
                
        
        return output

In [0]:
# define resnet building blocks

class ResidualBlock(nn.Module): 
    def __init__(self, inchannel, outchannel, stride=1): 
        
        super(ResidualBlock, self).__init__() 
        
        self.left = nn.Sequential(Conv2D(inchannel, outchannel, kernel_size=3, 
                                         stride=stride, padding=1, bias=False), 
                                  nn.BatchNorm2d(outchannel), 
                                  nn.ReLU(inplace=True), 
                                  Conv2D(outchannel, outchannel, kernel_size=3, 
                                         stride=1, padding=1, bias=False), 
                                  nn.BatchNorm2d(outchannel)) 
        
        self.shortcut = nn.Sequential() 
        
        if stride != 1 or inchannel != outchannel: 
            
            self.shortcut = nn.Sequential(Conv2D(inchannel, outchannel, 
                                                 kernel_size=1, stride=stride, 
                                                 padding = 0, bias=False), 
                                          nn.BatchNorm2d(outchannel) ) 
            
    def forward(self, x): 
        
        out = self.left(x) 
        
        out += self.shortcut(x) 
        
        out = F.relu(out) 
        
        return out


In [0]:
# define resnet

class ResNet(nn.Module):
    
    def __init__(self, ResidualBlock, num_classes = 10):
        
        super(ResNet, self).__init__()
        
        self.inchannel = 64
        self.conv1 = nn.Sequential(Conv2D(3, 64, kernel_size = 3, stride = 1,
                                            padding = 1, bias = False), 
                                  nn.BatchNorm2d(64), 
                                  nn.ReLU())
        
        self.layer1 = self.make_layer(ResidualBlock, 64, 2, stride = 1)
        self.layer2 = self.make_layer(ResidualBlock, 128, 2, stride = 2)
        self.layer3 = self.make_layer(ResidualBlock, 256, 2, stride = 2)
        self.layer4 = self.make_layer(ResidualBlock, 512, 2, stride = 2)
        self.maxpool = MaxPool2D(4)
        self.fc = nn.Linear(512, num_classes)
        
    
    def make_layer(self, block, channels, num_blocks, stride):
        
        strides = [stride] + [1] * (num_blocks - 1)
        
        layers = []
        
        for stride in strides:
            
            layers.append(block(self.inchannel, channels, stride))
            
            self.inchannel = channels
            
        return nn.Sequential(*layers)
    
    
    def forward(self, x):
        
        x = self.conv1(x)
        
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        
        x = self.maxpool(x)
        
        x = x.view(x.size(0), -1)
        
        x = self.fc(x)
        
        return x
    
    
def ResNet18():
    return ResNet(ResidualBlock)

## Part 2 (40 points)

In this part, you will train the ResNet-18 defined in the previous part on the CIFAR-10 dataset. Code for loading the dataset, training and evaluation are provided. 

### Your Task

1. Train your network to achieve the best possible test set accuracy after a maximum of 10 epochs of training.

2. You can use techniques such as optimal hyper-parameter searching, data pre-processing

3. If necessary, you can also use another optimiser

4. **Answer the following question:**
Given such a network with a large number of trainable parameters, and a training set of a large number of data, what do you think is the best strategy for hyperparameter searching? 

**YOUR ANSWER FOR 2.4 HERE**

A:
Various strategies can be used for hyperparameter searching. One of them is Manual Tuning, where we would have to use a set of hyperparameters, observe the performance of the network and then tune them. This is solely based on our empirical knowledge and the knowledge of the dataset.

Another naive strategy is Grid Search, where we choose a range for all hyperparameters and loop over the possible configurations, in search of the best combination, i.e the one yielding the best results. This is very inefficient.

A less costly solution would be Random Search, where random values are sampled for the hyperparameters given a "hypersphere" with certain radius from the currently chosen hyperparameters. This is done until a certain criterion is met (iterations or acceptable performance results).

A good strategy would be Bayesian Optimisation, where a Gaussian process will learn the mapping from the hyperparameters combinations to the performance metric used to evaluate the system.

In [0]:
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data import sampler

import torchvision.datasets as dset

import numpy as np

import torchvision.transforms as T


transform = T.ToTensor()


# load data

NUM_TRAIN = 49000
print_every = 100


data_dir = './data'
cifar10_train = dset.CIFAR10(data_dir, train=True, download=True, transform=transform)
loader_train = DataLoader(cifar10_train, batch_size=64, 
                          sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))

cifar10_val = dset.CIFAR10(data_dir, train=True, download=True, transform=transform)
loader_val = DataLoader(cifar10_val, batch_size=64, 
                        sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))

cifar10_test = dset.CIFAR10(data_dir, train=False, download=True, transform=transform)
loader_test = DataLoader(cifar10_test, batch_size=64)


USE_GPU = True
dtype = torch.float32 

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified


In [0]:
def check_accuracy(loader, model):
    # function for test accuracy on validation and test set
    
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')   
    num_correct = 0
    num_samples = 0
    model.eval()  # set model to evaluation mode
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device, dtype=dtype)  # move to device
            y = y.to(device=device, dtype=torch.long)
            scores = model(x)
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))


def train_part(model, optimizer, epochs=1):
    """
    Train a model on CIFAR-10 using the PyTorch Module API.
    
    Inputs:
    - model: A PyTorch Module giving the model to train.
    - optimizer: An Optimizer object we will use to train the model
    - epochs: (Optional) A Python integer giving the number of epochs to train for
    
    Returns: Nothing, but prints model accuracies during training.
    """
    model = model.to(device=device)  # move the model parameters to CPU/GPU
    for e in range(epochs):
#         print(len(loader_train))
        for t, (x, y) in enumerate(loader_train):
            model.train()  # put model to training mode
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)

            scores = model(x)
            loss = F.cross_entropy(scores, y)

            # Zero out all of the gradients for the variables which the optimizer
            # will update.
            optimizer.zero_grad()

            loss.backward()

            # Update the parameters of the model using the gradients
            optimizer.step()

            if t % print_every == 0:
                print('Epoch: %d, Iteration %d, loss = %.4f' % (e, t, loss.item()))
                check_accuracy(loader_val, model)
                print()

In [0]:
# code for optimising your network performance

##############################################################
#                       YOUR CODE HERE                       #       
##############################################################
# Normalize data and do other transformations

transforms_1 = T.Compose([T.RandomCrop(32, padding=4),
                          T.RandomHorizontalFlip(),
                          T.ToTensor(), 
                          T.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261))])

cifar10_train = dset.CIFAR10(data_dir, train=True, download=True, transform=transforms_1)
loader_train = DataLoader(cifar10_train, batch_size=64, 
                          sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))

cifar10_val = dset.CIFAR10(data_dir, train=True, download=True, transform=transforms_1)
loader_val = DataLoader(cifar10_val, batch_size=64, 
                        sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))

cifar10_test = dset.CIFAR10(data_dir, train=False, download=True, transform=transforms_1)
loader_test = DataLoader(cifar10_test, batch_size=64)

##############################################################
#                       END OF YOUR CODE                     #
##############################################################


# define and train the network
model = ResNet18()
# Optimiser - Change Learning rate, Regularisation
optimizer = optim.Adam(model.parameters(), lr = 0.003, weight_decay = 0.0000038)
train_part(model, optimizer, epochs = 10)


# report test set accuracy

check_accuracy(loader_test, model)


# save the model
torch.save(model.state_dict(), 'model.pt')

# More test runs were made before this, but we didn't record them as they were
# unsatisfactory. After figuring a good learning rate at 0.003, we modify the
# weight decay to try to increase the accuracy
# Learning rate   |  Weight Decay   |   Accuracy
#     0.003          0.0000045           82.55
#     0.003          0.000004            83.75
#     0.003          0.0000035           83.29
#     0.003          0.0000042           83.49
#     0.003          0.0000038           84.01
#     0.003          0.00000385          83.69

Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Epoch: 0, Iteration 0, loss = 2.6842
Checking accuracy on validation set
Got 90 / 1000 correct (9.00)

Epoch: 0, Iteration 100, loss = 2.0768
Checking accuracy on validation set
Got 181 / 1000 correct (18.10)

Epoch: 0, Iteration 200, loss = 2.0792
Checking accuracy on validation set
Got 216 / 1000 correct (21.60)

Epoch: 0, Iteration 300, loss = 1.8175
Checking accuracy on validation set
Got 236 / 1000 correct (23.60)

Epoch: 0, Iteration 400, loss = 1.8259
Checking accuracy on validation set
Got 215 / 1000 correct (21.50)

Epoch: 0, Iteration 500, loss = 1.9387
Checking accuracy on validation set
Got 321 / 1000 correct (32.10)

Epoch: 0, Iteration 600, loss = 1.5631
Checking accuracy on validation set
Got 386 / 1000 correct (38.60)

Epoch: 0, Iteration 700, loss = 1.8573
Checking accuracy on validation set
Got 378 / 1000 correct (37.80)

Epoch: 1, Iteration 0, loss = 1.52

In [0]:
## Part 3 (20 points)

The code provided below will allow you to visualise the feature maps computed by different layers of your network. Run the code (install matplotlib if necessary) and **answer the following questions**: 

1. Compare the feature maps from low-level layers to high-level layers, what do you observe? 

2. Use the training log, reported test set accuracy and the feature maps, analyse the performance of your network. If you think the performance is sufficiently good, explain why; if not, what might be the problem and how can you improve the performance?

3. What are the other possible ways to analyse the performance of your network?

**YOUR ANSWER FOR PART 3 HERE**

A:
1.Feature maps "capture" different information in the data.  Neurons in lower layers detect simple features, e.g doing edge detection or finding other more generic patterns. Neurons in middle layers detect parts of an object and learn some localisation, and neurons in higher layers detect concepts (adding semantic value). The output of the higher layers is not very clear to see in the printed output below, but the "condensed" pixels hide lots of useful information for the network to process and get results.

2.The network has a performance of ~84%. We notice that during training, the accuracy on the validation set reaches and exceeds  70% in 3 epochs. The accuracy reaches 80% in epoch 6 and after that it fluctuates. This percentage of accuracy is not bad for such a training set, where there are some classes that can be hard to be distinguished (e.g.small dogs and cats, horses and deer). Their size compared to the backrgound, for example, along with their similar shape and colour could "fool" the network.
 
My approach towards tackling the task at hand was to manually tune the hyperparameters of learning rate and weight penalty of the optimiser.  Also, we tested the performance of various optimisers, like SGD and a few variations of Adam. Adam optimiser produced the best results. We also do some transformations on the data before training.  We used horizontal flipping such that objects and their mirrored view are recognised. We also used random cropping such that the classifier learns parts of objects and not specific features of the objects as a whole in case those are not present in all instances.  Finally, we normalised the data using values specific to the dataset. Doing that we center the data around 0 and see how many standard deviations away each input is from 0. This is necessary, since the different channels/features might come from different ranges of distributions and when multiplying with all weights and the learning rate, they might produce unwanted results (over and undercompensation). 
 
The performance of the network can be improved. We are already doing a couple of stuff to improve performance, including normalisation, shuffle and batch training, regularisation and by default use momentum adjustment as Adam is our choice of optimiser. Also, the training set is balanced, meaning there are 10 classes with equal number of instances.
One way to improve the performance of the network is to get a larger training set. This can be done by doing data augmentation, i.e.do shifts, rotations and distortions to existing images and adding them to the training set. Another way is to use a scheduler for the learning rate. Instead of having a fixed learning rate, we can introduce a scheduler where the learning rate decays over iterations.

3.One way of analysing the performance of the network is by producing more debugging information during training. Instead of only showing the validation accuracy and loss, we could also show a small overview of gradients during training. We could also show the relative changes of weights. This might become ugly due to the large amount of information to be printed and will be hard to go through, especially if the number of epochs is a lot.
Another way is to figure out what some types of results clusters of neurons are producing and modify their location in the network. If you find a neuron that does a certain job, you may choose to move it around for better accuracy. This requires some previous knowledge about the network and the current results, in order to fiddle around with it.

In [0]:
#!pip install matplotlib

import matplotlib.pyplot as plt

plt.tight_layout()


activation = {}
def get_activation(name):
    def hook(model, input, output):
        activation[name] = output.detach()
    return hook

vis_labels = ['conv1', 'layer1', 'layer2', 'layer3', 'layer4']

for l in vis_labels:

    getattr(model, l).register_forward_hook(get_activation(l))
    
    
data, _ = cifar10_test[0]
data = data.unsqueeze_(0).to(device = device, dtype = dtype)

output = model(data)


for idx, l in enumerate(vis_labels):

    act = activation[l].squeeze()

    if idx < 2:
        ncols = 8
    else:
        ncols = 32
        
    nrows = act.size(0) // ncols
    
    fig, axarr = plt.subplots(nrows, ncols)
    fig.suptitle(l)


    for i in range(nrows):
        for j in range(ncols):
            axarr[i, j].imshow(act[i * nrows + j].cpu())
            axarr[i, j].axis('off')

**=============== END OF CW2 ===============**