[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lorenzobasile/DeepLearning2022/blob/main/4_cnn.ipynb)

# Lab 4

In [None]:
from google.colab import drive
import matplotlib.pyplot as plt
import torch
import torchvision


drive.mount('/content/drive')
%cd drive/MyDrive/DeepLearning2022

# Convolutional Neural Networks

As introduced in the previous lectures, Convolutional Neural Networks (CNNs) are the go-to architecture to deal with Computer Vision tasks, including image classification, segmentation and recognition.

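The main advantages of CNNs come from the convolutional layer, which introduces a useful position-invariance inductive bias (strictly speaking, translation equivariance: the same kernel is applied at every position, so a pattern is detected wherever it appears) while keeping the number of parameters very small.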


## The building blocks of CNNs

### Convolutional layers

The basic building block for a CNN is the convolutional layer, accessible as `torch.nn.Conv<s>d`, where `<s>` represents the number of **spatial dimensions** of our data:
* `Conv1d` for 1-dimensional sequences. Example: audio. Audio is organized as a sequence of a given length (the single spatial dimension), where each value in the sequence represents the intensity/amplitude of the signal at a given time point. Audio data can be organized in multiple **channels** (e.g., stereo data has 2 channels). The convolution operation uses a one-dimensional kernel;
* `Conv2d` for 2-dimensional data, like images;
* `Conv3d` for 3-dimensional data. An example might be a volumetric (3D) reconstruction of an object. A convolution in this domain amounts to sliding a cubic kernel along all three dimensions.
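
As a rough illustration of the three variants listed above, the sketch below applies each layer to a random tensor of the matching layout: `(N, C, L)` for `Conv1d`, `(N, C, H, W)` for `Conv2d` and `(N, C, D, H, W)` for `Conv3d`. The batch size, channel counts and input sizes are arbitrary values chosen for the example.

```python
import torch

batch = 4  # arbitrary batch size for the example

# 1D: a stereo audio clip, 2 channels and 1000 time steps -> (N, C, L)
audio = torch.rand(batch, 2, 1000)
conv1d = torch.nn.Conv1d(in_channels=2, out_channels=8, kernel_size=5)
out1d = conv1d(audio)

# 2D: an RGB image -> (N, C, H, W)
image = torch.rand(batch, 3, 32, 32)
conv2d = torch.nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)
out2d = conv2d(image)

# 3D: a single-channel volume -> (N, C, D, H, W)
volume = torch.rand(batch, 1, 16, 16, 16)
conv3d = torch.nn.Conv3d(in_channels=1, out_channels=8, kernel_size=5)
out3d = conv3d(volume)

# With kernel_size=5, stride 1 and no padding, every spatial dimension shrinks by 4
print(out1d.shape)  # torch.Size([4, 8, 996])
print(out2d.shape)  # torch.Size([4, 8, 28, 28])
print(out3d.shape)  # torch.Size([4, 8, 12, 12, 12])
```

In all three cases the channel dimension goes from `in_channels` to `out_channels`, while only the spatial dimensions are affected by the kernel size.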

(Some) parameters for constructors:
```
Conv2d(in_channels: int, out_channels: int, kernel_size: Union[int, Tuple[int, int]], stride: Union[int, Tuple[int, int]] = 1, padding: Union[int, Tuple[int, int]] = 0)
```
* in_channels: the number of channels of the incoming data
* out_channels: the number of channels of the output data, i.e., the number of kernels (filters) that are applied
* kernel_size: the kernel size of each convolution. An int $k$ is interpreted as a tuple $(k, k)$ (i.e., a square kernel); for a rectangular kernel, pass a tuple.
* stride: the step used when moving the kernel over the input data
* padding: if set to >0, the incoming image is enlarged with `padding` rows and columns on each side, filled with zeros by default (other fill modes can be selected with `padding_mode`)

To visualize these and other parameters and how they affect the convolution operation, please have a look at [this page](https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md).
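
The effect of `kernel_size`, `stride` and `padding` on the output size can also be checked numerically. The sketch below assumes the standard formula for a convolution without dilation, $\lfloor (n + 2p - k)/s \rfloor + 1$, wrapped in a helper called `conv_output_size` (a name introduced here for illustration, not a PyTorch function), and compares it with what `Conv2d` actually produces on a $28 \times 28$ input.

```python
import torch


def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension (dilation assumed to be 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


configs = [
    dict(kernel_size=3, stride=1, padding=0),
    dict(kernel_size=3, stride=1, padding=1),
    dict(kernel_size=3, stride=2, padding=1),
    dict(kernel_size=5, stride=2, padding=0),
]

size = 28  # side of a square input image
results = []
for cfg in configs:
    layer = torch.nn.Conv2d(in_channels=1, out_channels=4, **cfg)
    actual = layer(torch.rand(1, 1, size, size)).shape[-1]
    predicted = conv_output_size(size, **cfg)
    results.append((predicted, actual))
    print(cfg, "predicted:", predicted, "actual:", actual)
```

For these four configurations the predicted sizes are 26, 28, 14 and 12.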

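#### Note that a convolutional layer does NOT require a specific spatial size for its input: the same kernel slides over inputs of any height and width (at least as large as the kernel), and the output size simply follows from the input size.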

In [None]:

conv_layer = torch.nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
print("Parameters of convolution\n", "Weights\n", conv_layer.weight.shape, "\nBias\n", conv_layer.bias.shape)

print("Conv2d is applied independently of the input spatial dimension")
y = conv_layer(torch.rand(1,3,10,10))
print("Shape of y ", y.shape)

z = conv_layer(torch.rand(1,3,6,6))
print("Shape of z ", z.shape)

To better visualize how convolutions work with multi-channel data, have a look at [this](https://www.coursera.org/lecture/convolutional-neural-networks/convolutions-over-volume-ctQZz) short video by Andrew Ng.
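
To make the multi-channel case concrete, the following sketch recomputes a single output value by hand. Each output channel has one kernel that spans all input channels; the output pixel is the sum of the element-wise product between that kernel and the corresponding input patch, plus the bias. The layer sizes and the seed are arbitrary choices for the example.

```python
import torch

torch.manual_seed(0)

conv = torch.nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3)
x = torch.rand(1, 3, 5, 5)

with torch.no_grad():
    y = conv(x)  # shape (1, 2, 3, 3)

    # Top-left output pixel of output channel 0, computed by hand
    patch = x[0, :, 0:3, 0:3]   # (3, 3, 3): all input channels, top-left 3x3 window
    kernel = conv.weight[0]     # (3, 3, 3): the kernel of output channel 0
    manual = (patch * kernel).sum() + conv.bias[0]

matches = torch.allclose(manual, y[0, 0, 0, 0], atol=1e-5)
print("Manual value matches Conv2d output:", matches)
```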

### Pooling layers

Pooling layers are essentially convolutions without trainable kernels. For each overlap between the image and the kernel, they output the maximum (→ _maxpooling_) or the average (→ _avgpooling_) of the image in that specific region.

![](https://production-media.paperswithcode.com/methods/MaxpoolSample2.png)

```MaxPool2d(kernel_size: Union[int, Tuple[int, ...]], stride: Union[int, Tuple[int, ...], NoneType] = None, padding: Union[int, Tuple[int, ...]] = 0)```

Notice that we now have no input or output channels as parameters, because MaxPool/AvgPool act independently on each channel, so the number of output channels always equals the number of input channels.
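
A small worked example may help here. The sketch below applies max and average pooling with a $2 \times 2$ window to a hand-written $4 \times 4$ input (values chosen arbitrarily), then checks that a multi-channel input keeps its channel count and that the pooling layer has no trainable parameters.

```python
import torch

x = torch.tensor([[1., 2., 5., 6.],
                  [3., 4., 7., 8.],
                  [9., 10., 13., 14.],
                  [11., 12., 15., 16.]]).reshape(1, 1, 4, 4)

max_pool = torch.nn.MaxPool2d(kernel_size=2)  # stride defaults to kernel_size
avg_pool = torch.nn.AvgPool2d(kernel_size=2)

max_out = max_pool(x)
avg_out = avg_pool(x)
print(max_out[0, 0])  # tensor([[ 4.,  8.], [12., 16.]])
print(avg_out[0, 0])  # tensor([[ 2.5000,  6.5000], [10.5000, 14.5000]])

# Pooling acts on each channel separately: 7 channels in, 7 channels out
multi = torch.rand(2, 7, 8, 8)
pooled = max_pool(multi)
print(pooled.shape)  # torch.Size([2, 7, 4, 4])

# No trainable parameters
n_params = sum(p.numel() for p in max_pool.parameters())
print(n_params)  # 0
```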

#### Adaptive Pooling

Adaptive (Max/Average) Pooling is still a pooling layer, but instead of parameters such as kernel size and padding we specify the desired spatial size of the output.

PyTorch works out by itself the parameters required for the pooling to produce an output of the desired size.

Perhaps the most common use of this layer is channel-wise (global) average pooling at the end of the stack of convolutional layers. In this case we specify a fixed output size of $(1,1)$, so that PyTorch averages each channel over all of its spatial positions.

In [None]:
layer = torch.nn.AdaptiveAvgPool2d(output_size=(1, 1))
layer(torch.rand(1,3,32,32)).shape
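
As a quick check of the claim above, the sketch below feeds inputs of different spatial sizes (arbitrary values) through the same `AdaptiveAvgPool2d` layer and compares the result with a plain mean over the two spatial dimensions.

```python
import torch

gap = torch.nn.AdaptiveAvgPool2d(output_size=(1, 1))

shapes = []
checks = []
for side in (7, 32, 100):
    x = torch.rand(2, 3, side, side)
    out = gap(x)
    # Reference: average each channel over height and width
    reference = x.mean(dim=(2, 3), keepdim=True)
    shapes.append(tuple(out.shape))
    checks.append(torch.allclose(out, reference, atol=1e-6))
    print(side, tuple(out.shape), checks[-1])
```

Whatever the input size, the output has shape `(2, 3, 1, 1)`, which is what makes this layer convenient in front of a fully connected head.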

### Another minor "layer"

To feed these data to a linear layer, we need one more building block: a flattening layer (actually more of an operation).

In [None]:
torch.nn.Flatten()(layer(torch.rand(1,3,32,32))).shape

## Building a CNN

We will work with a dataset we have already encountered: MNIST.

In [None]:

transforms = torchvision.transforms.Compose([
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.1307,), (0.3081,))
    ])

trainset = torchvision.datasets.MNIST('./data/', transform=transforms,  train=True, download=True)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=256, shuffle=True)

testset = torchvision.datasets.MNIST('./data/', transform=transforms, train=False, download=True)
testloader = torch.utils.data.DataLoader(testset, batch_size=512, shuffle=False)


Our Convolutional Network follows a standard architectural paradigm: we have a sequence of interleaved convolutional and pooling layers followed by a fully-connected classification head.

The thing we have to be the most careful about is the way data are reshaped by the conv and pooling layers, which depends on the parameters we set.

In [None]:
class CNN(torch.nn.Module):
    def __init__(self):
        super().__init__()
    
        self.conv = torch.nn.Sequential(
                torch.nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=0),
                torch.nn.LeakyReLU(),
                torch.nn.MaxPool2d(kernel_size=2),
                torch.nn.Dropout(p=0.2),
                torch.nn.Conv2d(in_channels=16, out_channels=8, kernel_size=3, stride=2, padding=1),
                torch.nn.LeakyReLU(),
                torch.nn.MaxPool2d(kernel_size=2,stride=1),
                torch.nn.Flatten(),
        )
        self.head = torch.nn.Linear(8*6*6, 10)
        
    def forward(self, x):
        return self.head(self.conv(x))
        
        
model = CNN()
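
One way to verify the `8*6*6` input size of the head is to push a dummy MNIST-sized tensor through the layers one at a time and record the shape after each step. The sketch below rebuilds the same sequence of layers as in the class above so that it can run on its own.

```python
import torch

layers = torch.nn.Sequential(
    torch.nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=0),
    torch.nn.LeakyReLU(),
    torch.nn.MaxPool2d(kernel_size=2),
    torch.nn.Dropout(p=0.2),
    torch.nn.Conv2d(in_channels=16, out_channels=8, kernel_size=3, stride=2, padding=1),
    torch.nn.LeakyReLU(),
    torch.nn.MaxPool2d(kernel_size=2, stride=1),
    torch.nn.Flatten(),
)

x = torch.rand(1, 1, 28, 28)  # one dummy grayscale 28x28 image
trace = []
with torch.no_grad():
    for layer in layers:
        x = layer(x)
        trace.append((type(layer).__name__, tuple(x.shape)))
        print(f"{type(layer).__name__:<10} -> {tuple(x.shape)}")
```

The spatial side goes 28 → 26 → 13 → 7 → 6, so the flattened vector has $8 \cdot 6 \cdot 6 = 288$ entries.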

In [None]:
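def get_accuracy(model, dataloader):
    # Evaluation mode disables dropout; no_grad skips gradient tracking
    model.eval()
    with torch.no_grad():
        correct = 0
        for x, y in dataloader:
            out = model(x)
            # Predicted class = index of the largest logit along the class dimension
            correct += (torch.argmax(out, dim=1) == y).sum().item()
        return correct / len(dataloader.dataset)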

In [None]:
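def train(model, optimizer, trainloader, testloader):
    # Note: `loss` is read from the global scope, so it must be defined before calling train()
    epochs = 5
    for epoch in range(epochs):
        # Accuracy before this epoch's updates
        print("Test accuracy: ", get_accuracy(model, testloader))
        model.train()  # get_accuracy switched to eval mode: re-enable dropout
        print("Epoch: ", epoch)
        for x, y in trainloader:
            out = model(x)
            l = loss(out, y)
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
    print("Final accuracy: ", get_accuracy(model, testloader))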

In [None]:
optimizer=torch.optim.Adam(model.parameters(), lr=1e-2)
loss=torch.nn.CrossEntropyLoss()

In [None]:
train(model, optimizer, trainloader, testloader)

Our CNN performs better (with the same training length) than the MLP we trained a couple of labs ago.

Somewhat counterintuitively, the number of parameters of our current model is much smaller than it was for the MLP.

What really makes CNNs so good at CV tasks is not the model complexity (number of parameters), but the inductive bias they introduce.

In [None]:
def get_params_num(net):
    return sum(map(torch.numel, net.parameters()))

In [None]:
get_params_num(model)
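
To see where the parameters come from, the sketch below counts them layer by layer for the three parametric layers of the CNN defined above (rebuilt here so the block runs on its own). For scale it also counts a single fully connected layer from the 784 input pixels to 512 hidden units; this layer is a hypothetical comparison chosen for the example and is not necessarily the MLP used in the earlier lab.

```python
import torch


def count(module):
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())


conv1 = torch.nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)
conv2 = torch.nn.Conv2d(in_channels=16, out_channels=8, kernel_size=3, stride=2, padding=1)
head = torch.nn.Linear(8 * 6 * 6, 10)

counts = {"conv1": count(conv1), "conv2": count(conv2), "head": count(head)}
total = sum(counts.values())
print(counts)  # {'conv1': 160, 'conv2': 1160, 'head': 2890}
print(total)   # 4210

# Hypothetical comparison: one dense layer on flattened 28x28 images
hypothetical_dense = torch.nn.Linear(28 * 28, 512)
dense_count = count(hypothetical_dense)
print(dense_count)  # 401920
```

A convolutional layer has `out_channels * in_channels * k * k` weights plus `out_channels` biases, independently of the image size, which is why the two convolutions together account for fewer parameters than the small linear head.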

In the following part we will need to go back to the current state of the network (its weight and bias values). To make this possible, we can save the state dictionary of the model to a file with the following line:

In [None]:
torch.save(model.state_dict(), "mnist_cnn.pt")

# Transfer Learning

In Deep Learning practice, people very rarely train large models from scratch (i.e. from randomly initialized weights). This is especially true in Computer Vision, where pre-trained weights for many standard models are openly available online.

In a usual setting, when facing a Computer Vision task (unless for some reason you want to use a customized architecture), you load a pre-trained model and **fine-tune** it.

Fine-tuning can follow different paths: one possibility is to freeze all the layers of the network except the last one (the classification head); another is to train the whole classifier end-to-end, starting from the pre-trained weights.

## Fine-tuning the whole network

For our first example, we will see how to fine-tune an end-to-end classifier on a new dataset.

### kMNIST

For this example, we will work with kMNIST (Kuzushiji-MNIST), a drop-in replacement for MNIST containing images of handwritten cursive Japanese (Kuzushiji) Hiragana characters belonging to 10 classes:

In [None]:
transforms = torchvision.transforms.Compose([
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.1307,), (0.3081,))
    ])

k_trainset = torchvision.datasets.KMNIST('./data/', transform=transforms,  train=True, download=True)
k_trainloader = torch.utils.data.DataLoader(k_trainset, batch_size=256, shuffle=True)

k_testset = torchvision.datasets.KMNIST('./data/', transform=transforms, train=False, download=True)
k_testloader = torch.utils.data.DataLoader(k_testset, batch_size=512, shuffle=False)


In [None]:
x,y=next(iter(k_trainloader))
first_img=x[0]
plt.imshow(first_img.reshape(28,28), cmap='gray')

In [None]:
optimizer=torch.optim.Adam(model.parameters(), lr=1e-2)

In [None]:
train(model, optimizer, k_trainloader, k_testloader)

## Feature extractor freezing

Because our task is so similar to MNIST classification, it may make sense to keep the feature extraction section of the network frozen (in our case, the convolutional part of the model), while training only the fully connected classification head.

In this way we can make the training process lighter, drastically reducing the number of trainable parameters.

First of all, we have to restore the parameters we had after training on MNIST by loading them from the file we saved earlier:

In [None]:
model.load_state_dict(torch.load("mnist_cnn.pt"))

Now we can turn off training for some layers by filtering them by name and setting their `requires_grad` attribute to `False`.

In [None]:
for name,param in model.named_parameters():
    print(name)
    if "head" not in name:
        param.requires_grad = False
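
The effect of freezing can be checked by counting the trainable parameters and confirming that a frozen weight does not move after an optimizer step. The sketch below uses `TinyCNN`, a small hypothetical stand-in with the same `conv`/`head` naming as our model, so that it runs on its own. It also passes only the trainable parameters to the optimizer, which is an optional variation: an optimizer built on all parameters also works, since parameters without gradients are skipped.

```python
import torch


class TinyCNN(torch.nn.Module):
    """Hypothetical stand-in model with a `conv` feature extractor and a `head`."""

    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3),
            torch.nn.AdaptiveAvgPool2d((1, 1)),
            torch.nn.Flatten(),
        )
        self.head = torch.nn.Linear(4, 10)

    def forward(self, x):
        return self.head(self.conv(x))


net = TinyCNN()

# Freeze everything that is not part of the classification head
for name, param in net.named_parameters():
    if not name.startswith("head"):
        param.requires_grad = False

trainable = [p for p in net.parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in net.parameters())
print(f"Trainable parameters: {n_trainable} / {n_total}")  # 50 / 90

# One training step on random data: the frozen convolution must not change
tiny_optimizer = torch.optim.Adam(trainable, lr=1e-2)
frozen_before = net.conv[0].weight.clone()
inputs = torch.rand(8, 1, 28, 28)
targets = torch.randint(0, 10, (8,))
batch_loss = torch.nn.functional.cross_entropy(net(inputs), targets)
batch_loss.backward()
tiny_optimizer.step()

frozen_unchanged = torch.equal(frozen_before, net.conv[0].weight)
print("Frozen weights unchanged:", frozen_unchanged)
```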

In [None]:
optimizer=torch.optim.Adam(model.parameters(), lr=1e-2)

In [None]:
train(model, optimizer, k_trainloader, k_testloader)