In the previous practical we worked on a relatively small CNN-based model and observed how different regularization approaches helped it reach higher accuracy with less overfitting.

In this practical we'll use those techniques, but start by defining a more advanced model based on the [ResNet](https://arxiv.org/pdf/1512.03385.pdf) architecture.

In [None]:
import torch
import torch.nn as nn
import time
import os 
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
%matplotlib inline

from IPython.display import Image

# Advanced PyTorch

Up until now, we have mostly been using PyTorch layers (e.g. `nn.Conv2d`, `nn.Sequential`, `nn.Dropout`, etc.) that are natively supported by the framework. The vast majority of architectures for various applications will make use of these.

As you have probably noticed, when making use of a PyTorch layer (e.g. a convolutional layer) we only need to (1) define it as an element of the model and (2) use it in the `forward()` method of the model. By now we know that training involves two steps: the forward and backward passes. **But we never have to write any code to do the backpropagation of each of the layers that make up our model!** This is because PyTorch automatically builds a computational graph recording the order in which the operations of the forward pass take place, and follows the reverse order to perform the backpropagation. In addition, all common layers (e.g. a convolutional layer) come with built-in rules on how to perform the backward pass (i.e. you don’t have to implement it).

However, you may want a particular feature for which PyTorch doesn’t offer an implementation. In this section we will show how to build a custom PyTorch `nn.Module` from scratch. We will create a quantization layer, which is useful to (1) reduce the size of the model, (2) make use of reduced-precision arithmetic and (3) act as a regularization mechanism.

## Custom `Autograd` modules
Autograd makes backpropagation (and therefore the differentiation of all functions applied in the forward pass) effortless for the engineer/developer training a neural network using PyTorch. From the documentation: *torch.autograd provides classes and functions implementing automatic differentiation of arbitrary scalar valued functions. It requires minimal changes to the existing code - you only need to declare Tensors for which gradients should be computed with the `requires_grad=True` keyword.*
You can read more about it in the [torch.autograd](https://pytorch.org/docs/stable/autograd.html) documentation.
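
To make the description above concrete, here is a minimal sketch of autograd at work. The tensor values and the function `y = sum(x^2)` are chosen purely for illustration and are not part of the practical.

```python
import torch

# Leaf tensor for which we ask autograd to track gradients
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Forward pass: every operation is recorded in the computational graph
y = (x ** 2).sum()   # scalar: 1 + 4 + 9

# Backward pass: autograd walks the recorded graph in reverse order
y.backward()

print(y.item())   # 14.0
print(x.grad)     # dy/dx = 2 * x -> tensor([2., 4., 6.])
```

Note that we never wrote the derivative of `x ** 2` or of `sum()` ourselves: both operations ship with their own backward rules.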



**Quantization as an example of an autograd module**

As briefly introduced above, we are going to implement a quantization layer that makes use of a custom `autograd` function. This function (see the cell below) takes as input a standard PyTorch tensor (which by default uses 32-bit precision, i.e. a `float`) and quantizes its contents to a user-defined number of bits.

A few common approaches to perform quantization are described in [R. Krishnamoorthi (2018)](). Here, we follow symmetric per-layer quantization. Mathematically, the forward pass implements:

`x_out = (clamp(round(x/scale + zero_point), quant_min, quant_max)-zero_point)*scale`
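
As a quick sanity check of the equation, the sketch below applies it step by step to a handful of hand-picked values. The 3-bit range (`quant_min = -4`, `quant_max = 3`) and `scale = 0.5` are assumptions chosen to keep the arithmetic easy to follow. They are not the values the layer would compute.

```python
import torch

x = torch.tensor([-3.0, -0.6, 0.2, 0.8, 2.4])

# Illustrative parameters (assumed, not derived from the data)
scale, zero_point = 0.5, 0.0
quant_min, quant_max = -4.0, 3.0   # signed 3-bit integer range

step1 = x / scale + zero_point                      # [-6.0, -1.2, 0.4, 1.6, 4.8]
step2 = torch.round(step1)                          # [-6., -1., 0., 2., 5.]
step3 = torch.clamp(step2, quant_min, quant_max)    # [-4., -1., 0., 2., 3.]
x_out = (step3 - zero_point) * scale                # back to the original scale

print(x_out)   # tensor([-2.0000, -0.5000,  0.0000,  1.0000,  1.5000])
```

The two values outside the representable range (`-3.0` and `2.4`) are saturated by `clamp`, while the others are snapped to the nearest multiple of `scale`.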

During back-propagation we apply the standard [Straight Through Estimator](https://arxiv.org/pdf/1308.3432.pdf). This is the standard procedure to train quantization-aware networks (often called *fake quantization* because, even though we quantize the data to the desired number of bits, the data type of the tensor is very often left unchanged as a `torch.float`).
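
To see why the Straight Through Estimator is needed, the sketch below compares the gradient of a plain `torch.round` with that of a rounding function whose backward pass simply forwards the incoming gradient. `RoundSTE` is a hypothetical helper written for this illustration only. It is much simpler than the quantizer defined in the next cell.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Hypothetical helper: round in the forward pass, identity gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: behave as if round() had been the identity function
        return grad_output

x = torch.tensor([0.3, 1.7, -2.2], requires_grad=True)

# Plain rounding: the derivative is zero almost everywhere, so no signal flows back
torch.round(x).sum().backward()
grad_plain = x.grad.clone()

# Rounding with the straight-through estimator
x.grad = None
RoundSTE.apply(x).sum().backward()
grad_ste = x.grad.clone()

print(grad_plain)   # tensor([0., 0., 0.])
print(grad_ste)     # tensor([1., 1., 1.])
```

With plain rounding, every layer before the quantizer would receive zero gradients and stop learning. The estimator avoids this at the cost of using an approximate gradient.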

In [None]:
class QuantizeSymmetric(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input, num_bits=8, min_value=None, max_value=None):

        output = input.clone()

        # computing the max/min values (centered around zero) that can be covered by using `num_bits` bits
        qmin = -1.0 * (2**num_bits)/2
        qmax = -qmin - 1

        # compute qparams --> scale and zero_point
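        # fall back to the range of the input when no min/max is provided
        if min_value is None:
            min_value = input.min()
        if max_value is None:
            max_value = input.max()
        max_val, min_val = float(max_value), float(min_value)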
        min_val = min(0.0, min_val)
        max_val = max(0.0, max_val)

        # Computing scale
        if max_val == min_val:
            scale = 1.0
        else:
            max_range = max(-min_val, max_val) # largest mag(value)
            scale = max_range / ((qmax - qmin) / 2)
            scale = max(scale, 1e-8) # for stability purposes

        # Zero_point
        zero_point = 0.0

        # Quantization (implementing the equation above)
        output.div_(scale).add_(zero_point)
        output.round_().clamp_(qmin, qmax)  # quantize
        output.add_(-zero_point).mul_(scale)  # dequantize

        return output

    @staticmethod
    def backward(ctx, grad_output):
        # straight-through estimator (STE) --> it lets the gradients from the next layer in the network pass straight to the previous one
        grad_input = grad_output
        return grad_input, None, None, None

## A Custom Layer

The code cell above only defines a function with custom `forward()` and `backward()` implementations. Normally, we'd like to make use of `autograd` functions inside layers of a neural network.

Below, we define a quantization layer. If you look closely, you'll see it has the same structure we use when defining a network architecture: (1) we define the elements of the layer in `__init__()` and (2) we state in `forward()` what happens when an input is fed to this quantization layer.

In [None]:
class Quant(nn.Module):

    def __init__(self, num_bits=8, momentum=0.01):
        super(Quant, self).__init__()
        self.min_val = self.max_val = None
        self.momentum = momentum
        self.num_bits = num_bits

    def forward(self, input):
        if self.training:

            min_val = self.min_val
            max_val = self.max_val

            if min_val is None or max_val is None: 
                # First step executing quantization
                min_val = input.detach().min() 
                max_val = input.detach().max() 
            else:
                # equivalent to --> min_val = min_val * (1 - self.momentum) + self.momentum * torch.min(input)
                min_val = min_val + self.momentum * (input.detach().min()  - min_val)
                max_val = max_val + self.momentum * (input.detach().max()  - max_val)

            self.min_val = min_val
            self.max_val = max_val

        # custom autograd functions are invoked through the class itself, not an instance
        return QuantizeSymmetric.apply(input, self.num_bits, self.min_val, self.max_val)
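
The running `min_val`/`max_val` update used in `forward()` above is an exponential moving average. The sketch below checks, on made-up numbers, that the form used in the layer matches the more familiar weighted-sum form. The values of `running_min` and `batch_min` are assumptions for illustration.

```python
import torch

momentum = 0.01
running_min = torch.tensor(-1.0)   # range estimate accumulated over earlier batches
batch_min = torch.tensor(-3.0)     # minimum observed in the current batch

# Form used in the layer: move a fraction `momentum` of the way toward the new value
update_a = running_min + momentum * (batch_min - running_min)

# Weighted-sum form of the same exponential moving average
update_b = (1 - momentum) * running_min + momentum * batch_min

print(update_a.item(), update_b.item())   # both approximately -1.02
```

A small `momentum` means that a single batch with an outlier barely moves the quantization range, which keeps the scale stable during training.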

Let's create a simple example: we instantiate a quantization layer that quantizes the input tensor (which by default is of type `torch.float`) to 8 bits. Then, we visualize both the input and output tensors, as well as their absolute difference (which can be seen as a measure of the error introduced by the quantization process).

Try changing `num_bits` to see what happens.

In [None]:
input = torch.randn(32,32)
quantizer = Quant(num_bits=8)

input_q = quantizer(input)


def showTensor(tensor, ax, label):
  im = ax.imshow(tensor)
  ax.set_title(label)
  divider = make_axes_locatable(ax)
  cax = divider.append_axes("right", size="5%", pad=0.1)
  fig.colorbar(im, cax=cax)


fig, axs = plt.subplots(1,3, figsize=(18, 9))

showTensor(input, axs[0], "Original")
showTensor(input_q, axs[1], "Quantized")
showTensor(abs(input-input_q), axs[2], "Absolute Diff")
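
To put numbers on the effect of `num_bits` rather than judging it from the plots, the sketch below measures the mean absolute error for a few bit widths. `fake_quantize` is a hypothetical stand-alone helper that follows the same symmetric scheme but takes the range directly from the tensor, so its results will not match the `Quant` layer exactly.

```python
import torch

def fake_quantize(x, num_bits):
    """Hypothetical helper: symmetric per-tensor fake quantization using the range of `x` itself."""
    qmin = -(2 ** num_bits) / 2          # e.g. -128 for 8 bits
    qmax = -qmin - 1                     # e.g.  127 for 8 bits
    max_range = x.abs().max().item()     # largest magnitude in the tensor
    scale = max(max_range / ((qmax - qmin) / 2), 1e-8)
    # quantize (round + clamp) and immediately dequantize (multiply by scale)
    return torch.clamp(torch.round(x / scale), qmin, qmax) * scale

torch.manual_seed(0)
x = torch.randn(32, 32)

errors = {}
for num_bits in (2, 4, 8):
    errors[num_bits] = (x - fake_quantize(x, num_bits)).abs().mean().item()
    print(f"{num_bits} bits -> mean absolute error {errors[num_bits]:.5f}")
```

The error shrinks as more bits become available, since each extra bit halves the spacing between representable values.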

Very recently, PyTorch released a new version of the framework that natively supports multiple types of quantization (e.g. post-training quantization, quantization-aware training, etc.). You can check all the details in [the documentation](https://pytorch.org/docs/stable/quantization.html).
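
As a small taste of the native support, the sketch below uses the built-in `torch.fake_quantize_per_tensor_affine` function and compares it against the equation from this section written out by hand. The `scale`, `zero_point` and 8-bit range are illustrative assumptions. In practice these parameters would be estimated from the data.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 4)

scale, zero_point = 0.25, 0          # illustrative values, not tuned
quant_min, quant_max = -128, 127     # signed 8-bit range

# Built-in fake quantization: the output remains a float tensor
native = torch.fake_quantize_per_tensor_affine(x, scale, zero_point, quant_min, quant_max)

# The same formula written out by hand
manual = (torch.clamp(torch.round(x / scale + zero_point), quant_min, quant_max) - zero_point) * scale

print(native.dtype)                     # torch.float32
print(torch.allclose(native, manual))
```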