In the previous practical we worked with a relatively small CNN-based model and observed how different regularization approaches helped it reach higher accuracy with less overfitting.

In this practical we'll use those techniques, but we start by defining a more advanced model based on the [ResNet](https://arxiv.org/pdf/1512.03385.pdf) architecture.

In [None]:
!git clone https://github.com/mackopes/DNN_Practicals_Extras.git extras

In [None]:
import torch
import torch.nn as nn
import time
import os
from tensorboard import notebook

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
%matplotlib inline

from IPython.display import Image

import extras.p05_p09.miscFuncs as misc

# A Residual Architecture

We have seen several popular architectures, designed by the research community, that pushed the capabilities of image classification networks forward. Today, residual connections, introduced with [ResNet](https://arxiv.org/pdf/1512.03385.pdf), have become standard when training deep networks. They help during backpropagation by mitigating the vanishing gradient problem. Originally proposed in 2015, residual layers are still used in many state-of-the-art architectures in image classification and beyond.
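
One way to see why the skip connection helps, as a sketch in our own notation (the symbols $x$, $y$, $F$ and $\mathcal{L}$ are assumptions introduced here): write a residual block as $y = x + F(x)$, where $F$ is the stack of convolutions and $\mathcal{L}$ is the loss. The chain rule then gives

$$
\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)
$$

The identity term $I$ carries the gradient from $y$ back to $x$ unchanged, so even when $\partial F / \partial x$ is very small the gradient reaching earlier layers does not shrink to zero.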

The diagrams below illustrate the design of a standard residual layer (a) and a bottleneck layer (b), both proposed when ResNet was introduced. Shortly after, [SqueezeNet](https://arxiv.org/pdf/1602.07360.pdf) proposed the Fire Module (c), which could be seen as a variant of a residual layer when the input and output are concatenated.
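
To get a feel for why a bottleneck design is attractive, the rough sketch below counts convolution weights for a basic block (two 3x3 convolutions) and a bottleneck block (1x1 reduce, 3x3, 1x1 expand). The helper names and the channel counts (256 and 64) are illustrative assumptions, and biases and batch-norm parameters are ignored.

```python
def conv_weights(in_ch, out_ch, kernel):
    """Number of weights in a bias-free 2D convolution."""
    return in_ch * out_ch * kernel * kernel


def basic_block_weights(channels):
    # two 3x3 convolutions at full width
    return 2 * conv_weights(channels, channels, 3)


def bottleneck_block_weights(channels, mid):
    # 1x1 reduce -> 3x3 at the reduced width -> 1x1 expand
    return (conv_weights(channels, mid, 1)
            + conv_weights(mid, mid, 3)
            + conv_weights(mid, channels, 1))


basic = basic_block_weights(256)
bottleneck = bottleneck_block_weights(256, 64)
print(basic, bottleneck)
```

With these values the bottleneck block needs far fewer weights than the basic block at the same input and output width, which is what makes very deep variants affordable.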


In [None]:
Image(filename='./extras/p05_p09/layers.png') 

## Residual Layers in PyTorch

Implementing each of the layers above (which are composed of several `torch.nn.Conv2d` layers) should be straightforward. Below, we show how a standard residual layer can be implemented.

In [None]:
class basicResidualLayer(nn.Module):
    def __init__(self, inChannels, outChannels, kernelSize, stride=1):
        super(basicResidualLayer, self).__init__()
        # for odd kernel sizes, this padding keeps width x height unchanged when stride=1
        padding = kernelSize // 2

        # first conv: changes the number of channels and, if stride > 1, downsamples
        self.inLayer = nn.Sequential(
                    nn.Conv2d(inChannels, outChannels, kernelSize, stride=stride, padding=padding, bias=False),
                    nn.BatchNorm2d(outChannels),
                    nn.ReLU()
                    )
        
        self.outLayer = nn.Sequential(
                    nn.Conv2d(outChannels, outChannels, kernelSize, stride=1, padding=padding, bias=False),
                    nn.BatchNorm2d(outChannels)
                    )
        
        # If inChannels == outChannels (and stride == 1), the input can be added directly to the output of the second conv layer.
        # Otherwise we project the input with a 1x1 conv so that the `add` can be done: it expands the channel dimension and,
        # when the main path downsamples (stride > 1), it uses the same stride to fix the width x height mismatch.
        self.residualConnection = None
        
        if inChannels != outChannels or stride != 1:
            self.residualConnection = nn.Sequential(
                                    nn.Conv2d(inChannels, outChannels, kernel_size=1, stride=stride, bias=False),
                                    nn.BatchNorm2d(outChannels))
        
        self.relu = nn.ReLU()

    def forward(self, x):
        
        residual = x
        out = self.inLayer(x) 
        # print(out.shape)

        out = self.outLayer(out)
        # print(out.shape)
        # here we are adding the input to the output tensor
        if self.residualConnection is not None:
          residual = self.residualConnection(residual)

        out += residual
        # print(out.shape)
      
        # finally, pass the output through the activation
        return self.relu(out)

Now we can evaluate this single layer on a randomly initialized tensor to verify that it has been properly defined.

In [None]:
numImages = 128
width = height = 9
inCh = 32
outCh = 64

myResidual = basicResidualLayer(inCh, outCh, 3)

dummy_input = torch.randn(numImages, inCh, width, height)

# pass the input (this will call the forward() method of your module)
output = myResidual(dummy_input)

# the expected output is [numImages, outCh, width, height]
print(output.shape)
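
The comments in `basicResidualLayer` point out that the two tensors must have the same shape before they are added. As a quick check, the sketch below applies the standard output-size formula for a convolution (dilation 1) along one spatial dimension; `conv_out_size` is a hypothetical helper written for this example, not part of PyTorch.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a 2D convolution along one dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


# 3x3 conv, stride 1, padding 1: the 9x9 input keeps its size
print(conv_out_size(9, kernel=3, stride=1, padding=1))

# 3x3 conv, stride 2, padding 1: the main path is downsampled
print(conv_out_size(9, kernel=3, stride=2, padding=1))

# 1x1 conv, stride 2, no padding: a shortcut that matches the downsampled size
print(conv_out_size(9, kernel=1, stride=2, padding=0))
```

The last two calls return the same size, which is why giving the 1x1 shortcut convolution the same stride as the main path resolves the mismatch.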

**(optional) Exercise**: Implement a Bottleneck Layer. Follow the design of the `basicResidualLayer` module above and keep `inChannels` and `midChannels` (the reduced number of channels inside the bottleneck) as the only parameters to initialize the layer. Then, verify you've constructed it correctly by passing a randomly initialized tensor.

**Complete your code here:**

In [None]:
# Your Bottleneck Layer code
class bottleneckLayer(nn.Module):
    def __init__(self, inChannels, midChannels):
        super(bottleneckLayer, self).__init__()
        # complete your code here
        # 1x1 conv: squeeze inChannels down to midChannels (a 1x1 kernel needs no padding)
        self.inLayer = nn.Sequential(
                    nn.Conv2d(inChannels, midChannels, 1, stride=1, padding=0, bias=False),
                    nn.BatchNorm2d(midChannels),
                    nn.ReLU()
        )
        
        # 3x3 conv at the reduced width; padding=1 preserves width x height
        self.midLayer = nn.Sequential(
                    nn.Conv2d(midChannels, midChannels, 3, stride=1, padding=1, bias=False),
                    nn.BatchNorm2d(midChannels),
                    nn.ReLU()
        )
        
        # 1x1 conv: expand back to inChannels; no ReLU here, it is applied after the addition
        self.outLayer = nn.Sequential(
                    nn.Conv2d(midChannels, inChannels, 1, stride=1, padding=0, bias=False),
                    nn.BatchNorm2d(inChannels)
        )

        self.relu = nn.ReLU()

    def forward(self, x):
        # complete your code here
        residual = x
        out = self.inLayer(x)
        out = self.midLayer(out)
        out = self.outLayer(out)

        # input and output have the same shape, so the input can be added directly
        out += residual

        # finally, pass the output through the activation
        return self.relu(out)

And **verify your implementation here**:

In [None]:
# Verify your Bottleneck Layer implementation is correct
numImages = 128
width = height = 9
inCh = 128
midCh = 64

myBottleNeck = bottleneckLayer(inCh, midCh)

dummy_input = torch.randn(numImages, inCh, width, height)

# pass the input (this will call the forward() method of your module)
output = myBottleNeck(dummy_input)

# the bottleneck restores the channel count, so the expected output is [numImages, inCh, width, height]
print(output.shape)

## The ResNet-9 Architecture

So far, we have trained CNNs for image classification on the CIFAR-10 dataset using relatively simple models. To push the accuracy even higher, we need to either build a significantly larger model with more layers and parameters or use a more exotic architectural design. For this part of the practical (and in some future ones) we'll be using an architecture inspired by ResNets.

To keep training times to a minimum, we chose an optimized implementation of ResNet9. This architecture can reach **94% accuracy in just 24 epochs of training, taking a total of 26 seconds** on a single GPU. To achieve this, it makes use of several optimization techniques that are beyond the scope of this course. These include:

*   Using larger mini-batches of 512 images.
*   Using `cutout` data augmentation.
*   Using a custom piecewise `lr` schedule.
*   Using half-precision arithmetic.

(optional) For a complete description of how and why these optimizations were implemented, please refer to this [blog post](https://myrtle.ai/how-to-train-your-resnet/).
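
To give an idea of what `cutout` does, here is a minimal, simplified sketch: it zeroes one random square patch in each image of a batch. The function name, the patch size of 8 and the choice to keep the patch fully inside the image are assumptions made for this illustration; this is not the implementation used in the `extras` code.

```python
import torch


def cutout(images, size, generator=None):
    """Zero out one random size x size square per image (a simplified cutout)."""
    n, _, h, w = images.shape
    out = images.clone()
    # top-left corner of each patch, sampled so the patch lies fully inside the image
    tops = torch.randint(0, h - size + 1, (n,), generator=generator)
    lefts = torch.randint(0, w - size + 1, (n,), generator=generator)
    for i in range(n):
        t, l = int(tops[i]), int(lefts[i])
        out[i, :, t:t + size, l:l + size] = 0.0
    return out


batch = torch.ones(4, 3, 32, 32)  # stand-in for a CIFAR-10 mini-batch
augmented = cutout(batch, size=8, generator=torch.Generator().manual_seed(0))
print(augmented.shape)
print((augmented == 0).sum(dim=(1, 2, 3)))  # number of zeroed entries per image
```

Hiding a random patch forces the network to rely on more than one region of the image, which acts as a regularizer in the same spirit as the techniques from the previous practical.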

For the topics we'll be covering in this and the next practical, we have slightly updated the code originally provided in [the GitHub repo](https://github.com/davidcpage/cifar10-fast).


In [None]:
from extras.p05_p09.core import *
from extras.p05_p09.torch_backend import *
import extras.p05_p09.ResNet9Funcs as ResNet9
import extras.p05_p09.miscFuncs as misc
import extras.p05_p09.utilsFuncs as utils

Let's visualize the architecture of our ResNet9 network.

In [None]:
Image(filename='./extras/p05_p09/ResNet9Arch.png') 

Below, we define the `main()` function. It follows a structure similar to that of the previous practicals, although we had to adapt some parts to the structure of the ResNet9 repository we're building upon.

In [None]:
def main(epochs, batch_size, train_batches, val_batches, test_batches):
       
    # Create directory to store TensorBoard checkpoints
    writer = misc.createTensorBoardWriter('./results/p05/ResNet9')   
          
    # Construct the model
    model = ResNet9.getModel()
    
    # Scheduler  
    lr_schedule = PiecewiseLinear([0, 3, epochs], [0, 0.4, 0])
    lr = lambda step: lr_schedule(step/len(train_batches))/batch_size
    
    # Define optimizer
    optim = SGD(trainable_params(model), lr=lr)
    
    # Train
    train(model, writer, optim, train_batches, val_batches, epochs)

    # Evaluate on test set
    test(model, writer, test_batches)
    
    writer.close()
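
The schedule passed to the optimizer in `main()` can be pictured with the small sketch below. It assumes that `PiecewiseLinear(knots, vals)` interpolates linearly between the given knots; `piecewise_linear` is a hypothetical stand-in written for this example, not the class imported from `extras`.

```python
def piecewise_linear(knots, vals, t):
    """Linearly interpolate vals over knots; clamp outside the knot range."""
    if t <= knots[0]:
        return vals[0]
    segments = zip(zip(knots, vals), zip(knots[1:], vals[1:]))
    for (k0, v0), (k1, v1) in segments:
        if t <= k1:
            return v0 + (v1 - v0) * (t - k0) / (k1 - k0)
    return vals[-1]


epochs, batch_size = 10, 512
knots, vals = [0, 3, epochs], [0, 0.4, 0]

for epoch in [0, 1.5, 3, 6.5, 10]:
    rate = piecewise_linear(knots, vals, epoch)
    # main() additionally divides the scheduled value by the batch size
    print(epoch, rate, rate / batch_size)
```

Under this assumption the rate warms up linearly from 0 to 0.4 over the first 3 epochs and then decays linearly back to 0 at the last epoch. In `main()` the schedule is evaluated at `step/len(train_batches)`, i.e. at a fractional epoch, so the rate changes at every training step.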

Let's train this network for 10 epochs.

In [None]:
epochs = 10
batch_size = 512

# Get dataset
train_batches, val_batches, test_batches = ResNet9.getCIFAR10(batch_size)

# execute main()
main(epochs, batch_size, train_batches, val_batches, test_batches)

Even though we have trained this ResNet9 model for half the number of epochs used in the previous notebook, we are achieving ~10% better accuracy. This demonstrates the importance of having a good and optimized architecture.

# Other Optimizers

We have always been using the mini-batch SGD optimizer but, as covered in the slides, many different optimizers exist: SGD, SGD with momentum, Adam, Adamax, RMSprop, etc. [This blog](https://ruder.io/optimizing-gradient-descent/) gives a side-by-side comparison of the optimizers covered in the slides and of others recently proposed by the research community.

In general, when designing a DNN for a new problem, SGD with momentum and Adam are both reasonable choices to start with, as they are the two most widely used optimizers.

Let's see how the same ResNet9 model as before performs when adding `momentum` and `Nesterov`. You might want to take a look at the documentation for [`torch.optim.SGD`](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD).
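
Before changing the optimizer, it may help to see what these two options do to the update itself. The sketch below minimizes a one-dimensional quadratic using the update rule described in the `torch.optim.SGD` documentation (with dampening and weight decay set to zero). The function `run`, the quadratic and all hyperparameter values are assumptions chosen for illustration, and this is plain Python rather than the optimizer used in `main()`.

```python
def run(steps, lr, momentum=0.0, nesterov=False, curvature=0.1, x=1.0):
    """Minimise f(x) = 0.5 * curvature * x**2 and return the final x."""
    velocity = 0.0
    for _ in range(steps):
        grad = curvature * x
        # the velocity accumulates past gradients
        velocity = momentum * velocity + grad
        # Nesterov adds a look-ahead: current gradient plus the scaled velocity
        step = grad + momentum * velocity if nesterov else velocity
        x -= lr * step
    return x


plain = run(100, lr=0.1)
heavy = run(100, lr=0.1, momentum=0.9)
nest = run(100, lr=0.1, momentum=0.9, nesterov=True)
print(abs(plain), abs(heavy), abs(nest))
```

On this shallow quadratic, where plain SGD takes very small steps, both momentum variants end much closer to the minimum at 0 after the same number of updates. On steeper problems momentum can instead overshoot and oscillate, so the learning rate usually needs to be tuned together with it.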

**Complete your code here:**

In [None]:
def main(epochs, batch_size, train_batches, val_batches, test_batches):
       
    # Create directory to store TensorBoard checkpoints
    writer = misc.createTensorBoardWriter('./results/p05/ResNet9')   
          
    # Construct the model
    model = ResNet9.getModel()
    
    # Scheduler  
    lr_schedule = PiecewiseLinear([0, 3, epochs], [0, 0.4, 0])
    lr = lambda step: lr_schedule(step/len(train_batches))/batch_size
    
    # TODO: Define optimizer here
    optim = None

    # Train
    train(model, writer, optim, train_batches, val_batches, epochs)

    # Evaluate on test set
    test(model, writer, test_batches)
    
    writer.close()

Let's train the model with the new optimizer for 10 epochs.

In [None]:
epochs = 10
batch_size = 512

# Get dataset
train_batches, val_batches, test_batches = ResNet9.getCIFAR10(batch_size)


# print("Training for {} epochs using batch size {}".format(epochs, batch_size))
main(epochs, batch_size, train_batches, val_batches, test_batches)

Observe differences in TensorBoard.

In [None]:
%load_ext tensorboard

In [None]:
# Visualize TENSORBOARD

%tensorboard --logdir results/p05/

# What's Next

In this notebook we have covered the ResNet architecture and demonstrated how a good choice of optimizer can improve the performance of our network.

In the following notebooks, we will see how the quality of our dataset also plays a paramount role in solving machine learning tasks.