In [None]:
%load_ext autoreload
%autoreload 2

import os
import requests
import time
import math
import pickle
import itertools
from typing import List

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data
import torch.nn.functional as F
import matplotlib.pyplot as plt
from matplotlib import gridspec
from ipywidgets import interact, interactive, fixed, IntSlider
from IPython.display import display, Video

# 1. Convolutional Neural Nets
### University of Cambridge
### Engineering Part IIB/Module 4F12: Computer Vision
### Deep Learning for Computer Vision

*Lecturer: Matthew Johnson*

In this lecture, we are going to focus on a kind of neural net which has seen some amazing successes in recent years. As you will remember from the Introduction lecture, this kind of network is capable of recognizing thousands of object categories, recognizing faces, detecting people and their emotions, and even writing human-level captions to describe arbitrary images. All of this is enabled by a series of engineering techniques that make deep learning practical, combined with a new type of layer: the _convolutional layer_.

Convolution is a concept that comes from signal processing. One takes an audio or visual signal, usually from a noisy sensor, and then applies a _kernel_ at multiple points in that signal to enable easier processing.

In [None]:
display(Video("https://lectures.blob.core.windows.net/media/conv.mp4", width=980, height=490))

## Convolution

For the purpose of this lecture, we are going to consider convolution with arbitrary filters. We will start by looking at some standard filters used in image processing:

1. Blur
2. Edge
3. Bar
4. Blob

For each of these we define a kernel that is convolved with the image. We can see the kernels below:

In [None]:
import filters

filters.show_kernels()

### Blur
As you can see, the "hot spots" of the response correspond to the center of the strokes made by the writer.

In [None]:
filters.show_convolution(filters.blur_kernel)

### Horizontal Edges
Note how the blue responses indicate a black-to-white edge, and the red responses indicate a white-to-black edge.

In [None]:
filters.show_convolution(filters.horizontal_edge_kernel)

### Horizontal Bars
See here how the filter fires on the top of the 5, and the cross bar of the 4.

In [None]:
filters.show_convolution(filters.horizontal_bar_kernel)

### Blobs
First, note how the blue response from the blob kernel follows the contour of the number very clearly. Then note the spots of dark red that show up in black areas that are at the center of a ring of white.

In [None]:
filters.show_convolution(filters.blob_kernel)

By changing the orientations and scales of these basic features, an image can be processed to find all kinds of low-level structures that have proven useful for a variety of tasks in computer vision. A combination of filters like this is called a "filter bank". The image below depicts a famous selection of filters known as the Leung-Malik filter bank.

## Filter banks and biological vision

![Leung-Malik Filter Banks](https://lectures.blob.core.windows.net/media/lmfilters.jpg "Leung Malik Filter Bank")

We see some familiar shapes here: edge filters in the top left, bar filters in the top right, blob filters in the bottom left and low-pass filters in the bottom right. What is so interesting about these filters is that they can also be detected in the vision systems of mammalian brains. Filters of this kind seem to be part of a working, intelligent vision system. Given that, how can we incorporate these filters into our artificial neural nets for use in image processing?

## Convolutional Layers

Below is a very simple image $i$, along with two simple convolutional filters $w$ and their corresponding outputs $o$:

![Convolution Example](https://lectures.blob.core.windows.net/media/convolutionExample.png "Convolution Example")

Each output node is computed in the following way:

$$
\begin{align*}
o_{f,r,c} &= w_{f00}i_{r-1,c-1} + w_{f01}i_{r-1,c} + w_{f02}i_{r-1,c+1} \\
         &~~+ w_{f10}i_{r,c-1} + w_{f11}i_{r,c} + w_{f12}i_{r,c+1} \\
         &~~+ w_{f20}i_{r+1,c-1} + w_{f21}i_{r+1,c} + w_{f22}i_{r+1,c+1}
\end{align*}
$$
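
The sum above can be written out directly in NumPy. This is only a minimal sketch for a single filter: the function name `correlate3x3`, the edge filter values and the toy image are illustrative assumptions, not the contents of the `filters` module. Note that, as in the formula, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np

def correlate3x3(image, w):
    """Apply a 3x3 filter w at every interior pixel, term by term as in the sum above."""
    rows, cols = image.shape
    out = np.zeros((rows - 2, cols - 2), dtype=np.float32)
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            patch = image[r - 1:r + 2, c - 1:c + 2]   # i[r-1..r+1, c-1..c+1]
            out[r - 1, c - 1] = np.sum(w * patch)
    return out

# Hypothetical horizontal edge filter (not necessarily filters.horizontal_edge_kernel)
edge = np.array([[-1, -1, -1],
                 [ 0,  0,  0],
                 [ 1,  1,  1]], dtype=np.float32)

# Toy image: dark top half, bright bottom half
image = np.zeros((6, 6), dtype=np.float32)
image[3:, :] = 1.0

response = correlate3x3(image, edge)
print(response)
```

The response is non-zero only in the two output rows whose windows straddle the dark-to-bright boundary.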

In the examples of convolution shown above, the values of $w$ were fixed to achieve particular effects that human engineers thought would be useful. What we want to do is to enable a neural net to learn these values automatically from the data using back-propagation, just like with the fully-connected layers we saw in the Multi-Layer Perceptron lecture. What makes this tricky, though, is that unlike the previous layers we have seen, the weights will touch multiple input values as they are convolved over the image. This means there are fewer parameters to train overall, but it complicates training considerably. We can solve this problem, however, by using more memory.

## im2col

Let's start with an image $\mathbf{x}$:

$$
\mathbf{x} = \left(
         \begin{array}{ccccc}
           a & b & c & d & e \\
           f & g & h & i & j \\
           k & l & m & n & o \\
           p & q & r & s & t \\
           u & v & w & x & y \\
         \end{array}
     \right)
$$

In this figure, each pixel of the image corresponds to a letter. We are going to extract all of the patches and embed them as the columns of a larger matrix $\mathbf{X}$ like so:

![Im2Col](https://lectures.blob.core.windows.net/media/im2col.svg)

We can now store the weights in a normal weight matrix, in which each row operates as a convolution filter, and perform convolution over the entire image as a simple matrix multiplication:

$$
\mathbf{F} = \left[
         \begin{array}{cccccccc}
            w_{000} & w_{001} & w_{002} & w_{010} & w_{011} & \ldots & w_{021} & w_{022} \\
            w_{100} & w_{101} & w_{102} & w_{110} & w_{111} & \ldots & w_{121} & w_{122} \\
         \end{array}
       \right]
$$

![Convolution as Multiplication](https://lectures.blob.core.windows.net/media/conv_mult.svg)


In [None]:
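def im2col(x, size):
    """Embed every size x size patch of the 2D image x as a column of the output matrix."""
    rows, cols = x.shape
    num_patches = (rows - size + 1) * (cols - size + 1)
    patch_size = size * size
    X = np.zeros((patch_size, num_patches), dtype='float32')
    patch = 0
    # Slide the window over every valid position, left to right, top to bottom
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            X[:, patch] = x[r:r+size, c:c+size].reshape(patch_size)
            patch += 1

    return X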

In [None]:
x = np.arange(25).reshape(5,5)
X = im2col(x, 3)

print("x =\n",x)
print("im2col(x, 3) =\n", X)
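
To check that the embedding really does turn convolution into a single matrix multiplication, here is a small self-contained sketch. It redefines a compact `im2col` with the same column layout as above, and the two filters (a box blur and a horizontal edge filter) are arbitrary choices made for illustration; `filter_matrix` plays the role of $\mathbf{F}$.

```python
import numpy as np

def im2col(x, size):
    """Stack every size x size patch of x as a column, in row-major patch order."""
    rows, cols = x.shape
    patches = [x[r:r + size, c:c + size].reshape(-1)
               for r in range(rows - size + 1)
               for c in range(cols - size + 1)]
    return np.stack(patches, axis=1).astype(np.float32)

x = np.arange(25, dtype=np.float32).reshape(5, 5)

# Two illustrative 3x3 filters; each one becomes a row of the filter matrix
filters_3x3 = np.stack([np.full((3, 3), 1 / 9),
                        np.array([[-1, -1, -1],
                                  [0, 0, 0],
                                  [1, 1, 1]])]).astype(np.float32)
filter_matrix = filters_3x3.reshape(2, 9)

# One matrix multiplication applies both filters at every patch position
O = (filter_matrix @ im2col(x, 3)).reshape(2, 3, 3)

# Reference: slide each filter over the image explicitly
reference = np.zeros_like(O)
for f in range(2):
    for r in range(3):
        for c in range(3):
            reference[f, r, c] = np.sum(filters_3x3[f] * x[r:r + 3, c:c + 3])

print(np.allclose(O, reference))
```

The price of this convenience is memory: each pixel is copied into up to nine columns of the embedded matrix.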

Let's try looking at replacing the first layer of a Multi-Layer Perceptron with a convolutional layer in PyTorch:

In [None]:
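class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        # 1 input channel -> 9 feature maps with 5x5 filters: 28x28 becomes 9x24x24
        self.c1 = nn.Conv2d(1, 9, 5)
        # 9 * 24 * 24 = 5184 flattened features -> 10 class scores
        self.f3 = nn.Linear(5184, 10)

        self.num_layers = 4
        self.outputs = [0] * self.num_layers

    def forward(self, x: torch.Tensor):
        h = self.c1(x)
        h = h.tanh().flatten(1)
        output = self.f3(h)

        return output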


As you can see, PyTorch provides us with an easy-to-use Conv2d layer. Let's train the model and see how it does.

In [None]:
import cnn
from datasets import MulticlassDataset

simple_net_mnist = cnn.SimpleCNN()
dataset_mnist = MulticlassDataset.mnist(28).to_torch()
path = "mnist_simple_cnn.results"

if os.path.exists(path):
    results = torch.load(path)
    print(path, "loaded")
else:
    criterion = nn.CrossEntropyLoss()
    snapshots = cnn.train_model(dataset_mnist, simple_net_mnist, criterion, num_epochs=5)
    results = {"snapshots": snapshots, "net": simple_net_mnist.state_dict()}
    torch.save(results, path)

snapshots = results["snapshots"]
simple_net_mnist.load_state_dict(results["net"])
cnn.evaluate_model(dataset_mnist, simple_net_mnist, snapshots)
cnn.visualize_layer_output(simple_net_mnist, 1, 3, [3, 4, 3], 4)
cnn.visualize_layer_output(simple_net_mnist, 3, 3, [3, 4, 3], 4)
plt.show()


The learned filters are recognisable to us as oriented bars, edges, and corners, but have been tailored to the dataset and objective. A different objective function or dataset may result in different features. However, we have created a new problem. Let's look at this network as a diagram:

![MNIST CNN](https://lectures.blob.core.windows.net/media/cnn_nopool.svg)

In the original MLP the final layer had $64 \times 10 = 640$ weights. Since each convolution produces an image, the final layer here has $9 \times 24 \times 24 \times 10 = 51,840$ weights. We need a way to reduce the dimensionality of the filter layer output.
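
The arithmetic behind these numbers can be sketched in a few lines of plain Python. The variable names are illustrative; only weights are counted (biases are ignored), and the convolution is assumed to use no padding and a stride of 1, as in `nn.Conv2d(1, 9, 5)`.

```python
image_side = 28     # MNIST images are 28x28
kernel = 5          # 5x5 filters
num_filters = 9
num_classes = 10

# Without padding, a 5x5 filter fits at 28 - 5 + 1 = 24 positions along each axis
side = image_side - kernel + 1

conv_weights = num_filters * kernel * kernel      # shared across all positions
features = num_filters * side * side              # one 24x24 image per filter
linear_weights = features * num_classes

print(side, conv_weights, features, linear_weights)
```

The convolutional layer itself is tiny; it is the fully-connected layer reading its output that dominates the weight count.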

For this, we shall first return to the simple filters we discussed at the beginning of the lecture. A low-pass filter has some interesting properties apart from reducing noise. Because it mixes together samples from a spatial area, it allows us to subsample the image in space.

## Subsampling

In [None]:
filters.show_convolution(filters.blur_kernel, stride=2)

As you can see, the resulting image is scaled down to half the size, or a quarter of the pixels, of the original. How can we incorporate this into CNN training? We are going to introduce a new layer called a _pooling layer_. The concept of pooling is similar to sub-sampling, whereby an image is reduced in size by representing a region in the input image via a single pixel. In standard sub-sampling this value is a single pixel in the input region, _e.g._ the center. The first kind of pooling we are going to look at is average pooling, where the value is the mean value of the pixels in the input region.

## Average Pooling

As can be seen below, a filter of size 3 is applied at a _stride_ of 2, though the size and stride can be altered in practice as needed by the problem. This pooling acts as a _non-parameterised_ layer in the neural network, much like non-linearities. An output value is computed as follows:

$$o[r,c] = \frac{1}{9}\sum_{k = 0}^{2} \sum_{l = 0}^{2} i[2r + k - 1, 2c + l - 1]$$

This is equivalent to using the embedding function _im2col_ and then multiplying by a weight matrix:

$$\mathbf{W} = \left\{W_{i,j} | \forall i \forall j ~ W_{i,j} = \frac{1}{9}\right\}$$
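
As a rough sketch of this equivalence, the snippet below collects $3 \times 3$ patches at a stride of 2 and multiplies them by a row of weights that are all $\frac{1}{9}$. The helper `im2col_strided` is a hypothetical variant of _im2col_ written for this example, and it assumes no padding, so windows start at $(2r, 2c)$ rather than being centred there.

```python
import numpy as np

def im2col_strided(x, size, stride):
    """Collect size x size patches taken every `stride` pixels as columns."""
    rows, cols = x.shape
    out_rows = (rows - size) // stride + 1
    out_cols = (cols - size) // stride + 1
    patches = [x[r * stride:r * stride + size, c * stride:c * stride + size].reshape(-1)
               for r in range(out_rows)
               for c in range(out_cols)]
    return np.stack(patches, axis=1), (out_rows, out_cols)

x = np.arange(49, dtype=np.float32).reshape(7, 7)
X, out_shape = im2col_strided(x, size=3, stride=2)

W = np.full((1, 9), 1 / 9, dtype=np.float32)   # every weight is 1/9
pooled = (W @ X).reshape(out_shape)

print(pooled)
```

Because the toy image increases linearly, each average equals the value at the centre of its window, which makes the result easy to check by eye.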

In [None]:
import pooling

pooling.avg_pool()

## Max Pooling

An alternative to average pooling is maximum, or max, pooling. As with average pooling, the max pooling operator
has a set size and is applied at a predetermined stride. However, instead of taking the arithmetic mean of all of the
pixel values it selects only the maximum value as its output value.
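
A minimal NumPy sketch of the operator, assuming a window size of 2, a stride of 2 and no padding (the function name `max_pool` is illustrative and is not the `pooling` module used in this notebook):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Keep the largest value in each size x size window, taken every `stride` pixels."""
    rows, cols = x.shape
    out_rows = (rows - size) // stride + 1
    out_cols = (cols - size) // stride + 1
    out = np.empty((out_rows, out_cols), dtype=x.dtype)
    for r in range(out_rows):
        for c in range(out_cols):
            window = x[r * stride:r * stride + size, c * stride:c * stride + size]
            out[r, c] = window.max()
    return out

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 0, 5, 6],
              [1, 2, 7, 8]])
print(max_pool(x))
```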

In [None]:
pooling.max_pool()

Let's see what happens to our CNN when we add in max pooling.

In [None]:
class PoolingCNN(nn.Module):
    def __init__(self):
        super(PoolingCNN, self).__init__()
        self.c1 = nn.Conv2d(1, 9, 5)
        # Pooling halves each side: 9 * 12 * 12 = 1296 flattened features
        self.f3 = nn.Linear(1296, 10)

        self.num_layers = 4
        self.outputs = [0] * self.num_layers

    def forward(self, x):
        h = self.c1(x)
        # 2x2 max pooling at stride 2: 9x24x24 -> 9x12x12
        h = F.max_pool2d(h, 2, stride=2)
        h = h.tanh()
        output = self.f3(h.flatten(1))
        return output

In [None]:
mnist_pooling_cnn = cnn.PoolingCNN()
path = "mnist_pooling_cnn.results"

if os.path.exists(path):
    results = torch.load(path)
    print(path, "loaded")
else:
    criterion = nn.CrossEntropyLoss()
    snapshots = cnn.train_model(dataset_mnist, mnist_pooling_cnn, criterion, num_epochs = 5)
    results = {"snapshots": snapshots, "net": mnist_pooling_cnn.state_dict()}
    torch.save(results, path)

mnist_pooling_cnn.load_state_dict(results["net"])
snapshots = results["snapshots"]
cnn.evaluate_model(dataset_mnist, mnist_pooling_cnn, snapshots)
cnn.visualize_layer_output(mnist_pooling_cnn, 1, 3, [3, 4, 3], 4)
cnn.visualize_layer_output(mnist_pooling_cnn, 3, 3, [3, 4, 3], 4)
plt.show()

Here is a diagram of this network:

![MNIST Pooling CNN](https://lectures.blob.core.windows.net/media/cnn.svg)

As a result of the max pooling, the output layer of this model has a quarter of the weights ($9\times12\times12\times10 = 12,960$). As you chain together convolutional layers and max pooling layers and the dimensionality of the images decreases, more and more high-level features of the image can be detected and used for classification by the final layer.

We are quickly exhausting the possibilities of the MNIST dataset. For the remainder of this lecture we will be working with the CIFAR-10 dataset. 

## CIFAR-10

CIFAR-10, while it has many of the same characteristics as MNIST, is a considerable step up in difficulty. It is drawn from the "80 million tiny images" dataset which consists of (almost) 80 million $32 \times 32$ RGB images downloaded from the internet and labeled with one of 75,062 non-abstract nouns in English.

![CIFAR-10](https://lectures.blob.core.windows.net/media/cifar10.jpg "CIFAR-10")

CIFAR-10 (named after the Canadian Institute for Advanced Research where it was gathered) consists of
60,000 images drawn from this larger set by Alex Krizhevsky, Vinod Nair and
Geoffrey Hinton.  These images belong to 10 classes: (in order from left to right above)
airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. As can be seen, the dataset contains
a lot of intra-class variance, in addition to complications brought about by color and the much larger variety inherent
in non-curated natural images.

What kind of network do we need in order to deal with this much more difficult dataset? Using the building blocks we have seen in this lecture, we can now fully examine the anatomy of a modern Deep Neural Net or DNN.

![CIFAR-10 DNN](https://lectures.blob.core.windows.net/media/dnn.svg "CIFAR-10 DNN")

<table>
  <tr><th>C1</th><td>Extracts low-level features in the image.</td></tr>
  
  <tr><th>P2</th><td>Provides some flexibility of location</td></tr>
  
  <tr><th>C3</th><td>Looks for parts that are combinations of features</td></tr>
  
  <tr><th>P4</th><td>Smooths the part responses before subsampling</td></tr>
  
  <tr><th>C5</th><td>Finds structures that are built from parts</td></tr>
  
  <tr><th>P6</th><td>Smooths and subsamples the structural responses.</td></tr>
      
  <tr><th>F7</th><td>Sub-category latent space</td></tr>
  
  <tr><th>F8</th><td>Final classifier</td></tr>
</table>

In [None]:
class CifarCNN(nn.Module):
    def __init__(self):
        super(CifarCNN, self).__init__()
        # padding=2 with 5x5 filters keeps the spatial size unchanged
        self.c1 = nn.Conv2d(3, 32, 5, stride=1, padding=2)
        self.c3 = nn.Conv2d(32, 32, 5, stride=1, padding=2)
        self.c5 = nn.Conv2d(32, 64, 5, stride=1, padding=2)
        # 64 channels * 4 * 4 pixels = 1024 features after the last pooling layer
        self.f7 = nn.Linear(1024, 64)
        self.f8 = nn.Linear(64, 10)

        self.num_layers = 9
        self.outputs = [0] * self.num_layers

    def forward(self, x):
        # Keep every intermediate output so that it can be visualised later
        self.outputs[0] = x
        self.outputs[1] = h = self.c1(x)
        self.outputs[2] = h = F.max_pool2d(h, 3, stride=2).relu()
        self.outputs[3] = h = self.c3(h).relu()
        self.outputs[4] = h = F.avg_pool2d(h, 3, stride=2, padding=1)
        self.outputs[5] = h = self.c5(h).relu()
        self.outputs[6] = h = F.avg_pool2d(h, 3, stride=2, padding=1)
        self.outputs[7] = h = self.f7(h.flatten(1))
        self.outputs[8] = h = self.f8(h)
        return h
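
To see where the 1024 inputs of `f7` come from, the spatial size can be traced through the network with plain Python. This is only a sketch: the helper `output_size` and the layer table are written for this example, using the kernel, stride and padding values from the class above and the floor rounding that PyTorch applies by default.

```python
def output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1

# (layer, kernel, stride, padding) for the spatial layers of CifarCNN
spatial_layers = [
    ("c1", 5, 1, 2),
    ("p2", 3, 2, 0),
    ("c3", 5, 1, 2),
    ("p4", 3, 2, 1),
    ("c5", 5, 1, 2),
    ("p6", 3, 2, 1),
]

size = 32                      # CIFAR-10 images are 32x32
sizes = {}
for name, kernel, stride, padding in spatial_layers:
    size = output_size(size, kernel, stride, padding)
    sizes[name] = size
    print(name, size)

flattened = 64 * size * size   # c5 produces 64 channels
print("f7 inputs:", flattened)
```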

In [None]:
dataset_cifar = MulticlassDataset.cifar().normalize()
cifar10_cnn = cnn.CifarCNN()
path = "cifar10.results"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
image = dataset_cifar.unnormalize(dataset_cifar.val.values[0])
dataset_cifar = dataset_cifar.to_torch()

if os.path.exists(path):
    results = torch.load(path)
    print(path, "loaded")
else:
    print("Running on", device)
    criterion = nn.CrossEntropyLoss()
    snapshots = cnn.train_model(dataset_cifar, cifar10_cnn, criterion, device = device)
    results = {"snapshots": snapshots, "net": cifar10_cnn.state_dict()}
    torch.save(results, path)

cifar10_cnn.load_state_dict(results["net"])
snapshots = results["snapshots"]

plt.figure(figsize=(4, 4))
image = np.swapaxes(image, 0, 1)
image = np.swapaxes(image, 1, 2)
plt.imshow(image, interpolation='nearest')
plt.xticks([])
plt.yticks([])
plt.tight_layout()

cnn.evaluate_model(dataset_cifar, cifar10_cnn, snapshots, device=device)
cnn.visualize_layer_output(cifar10_cnn, 1, 0, [3, 4, 3], 4)
cnn.visualize_layer_output(cifar10_cnn, 3, 0, [3, 4, 3], 4)
cnn.visualize_layer_output(cifar10_cnn, 5, 0, [3, 4, 3], 4)
plt.show()

As you can see, the features learned by these more complex networks are harder to analyze and understand. While the features in the first layer map to the simple filters we used at the beginning of the lecture, the filters in later layers are much harder to interpret. That said, from the output images we can clearly see that they are pulling out parts of the input object, identifying certain structures that are later used to perform classification.

Now that we have examined a deep net operating on the CIFAR-10 dataset, we are going to end the lecture by looking at the state of the art in Deep Convolutional Neural Nets. To do that, however, we are going to have to introduce one final dataset.

## Deep CNN Training

1. Divide the dataset into training $\mathbf{D}$, validation $\mathbf{V}$ and test $\mathbf{T}$ partitions
2. Compute the "mean" image of $\mathbf{D}$, and subtract it from all images
3. Scale all images so that pixel values are in the range $[-1,1]$
4. Design a network architecture, or adapt a known-good architecture to the dataset
5. Select an optimisation algorithm (_i.e._ a version of SGD)
6. For a pre-determined set of epochs, do the following:
    1. Randomly sample (without replacement) $\mathbf{D}_B$ images from $\mathbf{D}$
    2. Compute $\nabla \mathbf{W}^{\tau}$ for all $\mathbf{W}$ in the model
    3. Update all $\mathbf{W}^{\tau} \rightarrow \mathbf{W}^{\tau + 1} $
    4. Once all images in $\mathbf{D}$ have been seen (_i.e._ when an epoch is complete), evaluate the network on $\mathbf{D}$ and $\mathbf{V}$ to monitor changes in the loss
7. Evaluate final performance on $\mathbf{V}$, and repeat the above with different optimisation hyperparameters as necessary
8. Retrain the network using $\mathbf{D} \cup \mathbf{V}$ with the final hyperparameters, and evaluate on $\mathbf{T}$
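
The data-handling parts of this recipe (steps 1 to 3 and the batch sampling in step 6) can be sketched in NumPy. Everything here is a stand-in: the random "images", the 70/15/15 split and the `minibatches` helper are assumptions made for illustration, and the gradient computation and weight update are left out entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 100 random 8x8 greyscale "images" with pixel values in [0, 255]
images = rng.integers(0, 256, size=(100, 8, 8)).astype(np.float32)

# Step 1: shuffle, then split 70/15/15 into D, V and T
order = rng.permutation(len(images))
D, V, T = np.split(images[order], [70, 85])

# Step 2: subtract the mean image of D from every partition
mean_image = D.mean(axis=0)
D, V, T = D - mean_image, V - mean_image, T - mean_image

# Step 3: scale by the largest magnitude in D, so D lies in [-1, 1]
# (V and T use the same scale and may fall marginally outside that range)
scale = np.abs(D).max()
D, V, T = D / scale, V / scale, T / scale

def minibatches(data, batch_size, rng):
    """Yield batches that together cover `data` exactly once (no replacement)."""
    order = rng.permutation(len(data))
    for start in range(0, len(data), batch_size):
        yield data[order[start:start + batch_size]]

# Step 6: each epoch visits every training image once, in a fresh random order
for epoch in range(2):
    seen = sum(len(batch) for batch in minibatches(D, batch_size=16, rng=rng))
    print("epoch", epoch, "saw", seen, "images")
```

Note that the mean image and the scale are computed from $\mathbf{D}$ alone and then reused for $\mathbf{V}$ and $\mathbf{T}$, so that no information from the held-out partitions leaks into training.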

## Data Augmentation

Sometimes we want to augment a dataset. One way of doing that is through simple image transforms:

![Data Augmentation](https://lectures.blob.core.windows.net/media/data_augmentation.png "Data Augmentation")

In this way one training image can become 10.

When it comes to deep networks, the only thing better than a lot of data is a whole lot of data. However, gathering
supervised data is very expensive. One way to overcome this problem is via _data augmentation_, whereby on each epoch each
training image is turned into several additional images by way of translated cropping and horizontal reflection as seen above.
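
One way this could look in NumPy is sketched below. It assumes that the ten images are the four corner crops and the centre crop, each together with its horizontal reflection; the exact crops used in the figure may differ, and the function name `augment` and the crop size of 28 are illustrative.

```python
import numpy as np

def augment(image, crop):
    """Return the corner and centre crops of a (rows, cols, channels) image, plus their mirrors."""
    rows, cols = image.shape[:2]
    top_lefts = [(0, 0), (0, cols - crop), (rows - crop, 0), (rows - crop, cols - crop),
                 ((rows - crop) // 2, (cols - crop) // 2)]
    crops = [image[r:r + crop, c:c + crop] for r, c in top_lefts]
    mirrored = [patch[:, ::-1] for patch in crops]   # reverse the column axis
    return np.stack(crops + mirrored)

image = np.arange(32 * 32 * 3).reshape(32, 32, 3)    # stand-in for a 32x32 RGB image
augmented = augment(image, crop=28)
print(augmented.shape)
```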
    
Another common usage of data augmentation is during the testing phase, whereby instead of simply computing $P(\mathbf{x}=i)$ from a single image, $\mathbf{x}$ is altered as shown above and then each version $\mathbf{x}_a$ of $\mathbf{x}$ is shown to the network. In this scenario, the predicted label is the one chosen most often across the altered versions:

$$
\DeclareMathOperator*{\argmax}{argmax}
i = \argmax_j \left|\left\{a : \argmax_k P(\mathbf{x}_a=k) = j\right\}\right|
$$

This can result in significant improvement regardless of the network model used.
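
Reading the prediction rule as a majority vote over the augmented versions, it can be sketched as follows. The probabilities are made up for illustration, and the function name `vote` is hypothetical; averaging the probabilities before taking the maximum is a common alternative that is not shown here.

```python
import numpy as np

def vote(probabilities):
    """probabilities[a, k] = P(x_a = k); return the label predicted most often."""
    labels = probabilities.argmax(axis=1)     # top label for each augmented version
    counts = np.bincount(labels, minlength=probabilities.shape[1])
    return int(counts.argmax())               # ties resolve to the smallest label

# Made-up network outputs for five augmented versions of one image, three classes
probabilities = np.array([[0.6, 0.3, 0.1],
                          [0.2, 0.7, 0.1],
                          [0.1, 0.8, 0.1],
                          [0.5, 0.4, 0.1],
                          [0.3, 0.6, 0.1]])
print(vote(probabilities))
```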

## Batch Normalisation

Batch Normalisation is a technique that is essential to the deepest neural net architectures, but also useful in general for stabilising and accelerating the training process in all circumstances. The concept is straightforward: if we knew the statistics of each layer's output over the entire dataset, we could normalise those outputs so that all output vectors had zero mean and unit variance. In mathematical terms, we want to do the following:

$$
\begin{align}
\mu_{\mathbf{D}} &= \frac{1}{|\mathbf{D}|}\sum_{\mathbf{x} \in \mathbf{D}} \mathbf{x} \\
\sigma^2_{\mathbf{D}} &= \frac{1}{|\mathbf{D}|}\sum_{\mathbf{x} \in \mathbf{D}} (\mathbf{x} - \mu_{\mathbf{D}})^{2} \\
\hat{\mathbf{x}} &= \frac{\mathbf{x} - \mu_{\mathbf{D}}}{\sqrt{\sigma^2_{\mathbf{D}} + \epsilon}}
\end{align}
$$

The problem is that we do not know $\mu_{\mathbf{D}}$ or $\sigma^2_{\mathbf{D}}$, nor can we compute them as they will change as the network itself changes during training. We overcome this by using the same batch trick to estimate these values, _i.e._ by computing them for each batch during training. Once the network is trained, we can then compute the true values and use those during testing. We can even learn parameters for scaling and shifting the normalised vectors:

$$
\mathbf{y} = \gamma\hat{\mathbf{x}} + \beta
$$
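
A minimal NumPy sketch of the training-time computation for a batch of feature vectors is given below. The function name `batch_norm` and the toy data are assumptions made for illustration; $\epsilon$ is placed inside the square root here, as PyTorch's batch normalisation layers do, and the statistics used at test time are not covered.

```python
import numpy as np

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalise each feature over the batch dimension, then scale and shift."""
    mu = batch.mean(axis=0)                    # per-feature batch mean
    var = batch.var(axis=0)                    # per-feature batch variance
    x_hat = (batch - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # 64 samples, 4 features

y = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))
```

With $\gamma = 1$ and $\beta = 0$ the output has approximately zero mean and unit variance per feature, whatever the scale of the input.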

## ImageNet

The ImageNet dataset consists of over 14 million images mapped to around 20,000 labels. The images are extremely varied, coming from a variety of different sources all over the internet, and have each been hand-labeled by a human. 

![ImageNet](https://lectures.blob.core.windows.net/media/imagenet.jpg "ImageNet")

![ImageNet Samples](https://lectures.blob.core.windows.net/media/imagenet_samples.png "ImageNet Samples")

Left | Center | Right
--- | --- | ---
Handheld Computer | Dandie Dinmont Terrier | Teapot

## AlexNet

When ImageNet was first introduced, it was prohibitively expensive to train a neural network on it. However, in 2012 Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton introduced AlexNet, a CNN architecture that was able to achieve a top-5 error rate of 15.3% on the ILSVRC 2012 dataset, an improvement of 10.8 percentage points over the previous state-of-the-art. To achieve this they harnessed a new technology for use in training DNNs.

![AlexNet Diagram](https://lectures.blob.core.windows.net/media/alexnet.svg)

Graphics Processing Units, or GPUs, were originally designed for graphically intensive workflows, like video games. In particular, they are specialised towards the execution of per-pixel code in which linear algebra is used to project 3D model information to the camera plane. One can coerce the hardware into performing other computationally intensive image tasks, *e.g.* those in computer vision. As a demonstration, see this implementation of Canny edge detection using GPU shaders:

[Canny Edge Detection with shaders](https://matajoh.github.io/canny/)

Krizhevsky *et al.* realised that this same hardware could be used to perform the matrix multiplications required by CNNs, and that the parallel nature of GPUs would allow them to train much faster than on CPUs.

## Residual Networks

Even with all the tricks we have introduced so far, very deep networks fail to train due to a version of the vanishing gradient problem. In 2015, Kaiming He *et al.* introduced Residual Networks, a new architecture that was able to achieve a top-5 error rate of 3.57% on the ILSVRC 2012 dataset despite being 152 layers deep. Here is the 34-layer configuration:

![ResNet](https://lectures.blob.core.windows.net/media/resnet.svg)

The key contribution is the repeating *residual block* within this diagram. It has two branches: an identity branch which copies the input, and the following miniature network:

![ResNet Block](https://lectures.blob.core.windows.net/media/resblock.svg)

The identity branch acts as a shortcut, allowing the network to ignore large blocks of its capacity if they are not needed. In essence, the network grows deeper each time a residual block saturates during training.
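
The shortcut idea can be illustrated with a deliberately simplified NumPy sketch. It replaces the convolutions of the real block with small dense layers and omits batch normalisation, so it shows only the structure "input plus learned correction"; the names `residual_block`, `W1` and `W2` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, W1, W2):
    """Toy residual block: identity shortcut plus a two-layer branch, then ReLU."""
    branch = W2 @ relu(W1 @ x)
    return relu(x + branch)

rng = np.random.default_rng(0)
x = relu(rng.normal(size=8))        # non-negative, like the output of a previous ReLU

# With the branch weights at zero the block passes its input through unchanged
zeros = np.zeros((8, 8))
print(np.allclose(residual_block(x, zeros, zeros), x))

# With non-zero weights the branch adds a correction on top of the input
W1, W2 = rng.normal(scale=0.1, size=(2, 8, 8))
print(residual_block(x, W1, W2) - x)
```

A block whose branch contributes nothing therefore costs the network nothing, which is what lets unused depth be ignored.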

One really helpful feature of PyTorch is the ability to easily download pre-trained models. Here's some code that runs a ResNet-50, trained on ImageNet, on images from the COCO dataset:

In [None]:
import resnet

num_images = 32
batch_size = 8
dataset_coco = load_coco("minival")
images_coco = dataset_coco["images"][:num_images]
categories, images, top5 = resnet.classify(images_coco, batch_size)

def plot(fig: plt.Figure, i: int):
    ax = fig.add_subplot(1, 2, 1)
    ax.imshow(images[i])
    ax.axis('off')
    ax = fig.add_subplot(1, 2, 2)
    width = [top5[i][j].score for j in range(5)][::-1]
    names = [categories[top5[i][j].id].replace(" ", "\n") for j in range(5)][::-1]
    ax.barh(range(5), width, tick_label=names)
    ax.set_xlim(0, 1)
    fig.tight_layout()

def slide_resnet(frame: int):
    fig = plt.figure(figsize=(8,4))
    plot(fig, frame)

interact(slide_resnet, frame=IntSlider(value=0, min=0, max=num_images-1))