# Exercise 1: Convolutional neural networks (CNN)

## Import dependencies (run the following cells)

In [1]:
# @title import dependencies

# __future__ imports must be the first statement in the cell
from __future__ import print_function, division

from typing import Mapping, Union, Optional

import numpy as np
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import plotly.graph_objects as go
import torchvision
from torchvision import datasets, models, transforms
import os
from tqdm.notebook import tqdm

## CNNs in practice

Building a CNN today is very easy thanks to modern deep learning frameworks. As you have seen in the last notebook, the `torch.nn` package reduces adding a convolutional layer to instantiating a single module (e.g. `nn.Conv2d` for 2D data) and calling it on your input.

Nevertheless, CNNs involve many different operations, including non-trivial changes in the shape of the tensor as it flows through the layers.

Keep in mind the two basic transformations introduced by CNNs:
- Convolution

![conv](https://upload.wikimedia.org/wikipedia/commons/1/19/2D_Convolution_Animation.gif)
- Pooling (Max pooling in the example below)

![pooling](https://upload.wikimedia.org/wikipedia/commons/e/e9/Max_pooling.png)

*Images from Wikipedia*
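
To make these shape changes concrete, here is a minimal sketch that pushes a random tensor through one convolution and one max-pooling layer. The input size (3 channels, 32 by 32) and the layer hyperparameters are arbitrary choices for illustration, not values used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

# Arbitrary example input: [batch, channels, height, width]
x = torch.randn(1, 3, 32, 32)

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)  # no padding, stride 1
pool = nn.MaxPool2d(kernel_size=2)                              # stride defaults to kernel_size

h = conv(x)  # spatial size shrinks by kernel_size - 1
p = pool(h)  # spatial size is halved

print(x.shape)  # torch.Size([1, 3, 32, 32])
print(h.shape)  # torch.Size([1, 8, 30, 30])
print(p.shape)  # torch.Size([1, 8, 15, 15])
```

The convolution changes the number of channels and trims the borders, while pooling leaves the channels untouched and only downsamples the spatial dimensions.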



---

**Under the hood: how does a convolution work?**

Let's compute by hand a single output value of a convolution operation.

In [None]:
# Define an arbitrary input tensor, i.e.: [batch, channels, height, width]
batch_size, num_channels, height, width = 1, 2, 4, 4
a = torch.arange(batch_size * num_channels * height * width).reshape(batch_size, num_channels, height, width).float()
a, a.shape

(tensor([[[[ 0.,  1.,  2.,  3.],
           [ 4.,  5.,  6.,  7.],
           [ 8.,  9., 10., 11.],
           [12., 13., 14., 15.]],
 
          [[16., 17., 18., 19.],
           [20., 21., 22., 23.],
           [24., 25., 26., 27.],
           [28., 29., 30., 31.]]]]),
 torch.Size([1, 2, 4, 4]))

In [None]:
# Define a convolution
c = nn.Conv2d(in_channels=2, out_channels=3, kernel_size=2, bias=False)

In [None]:
# Let's look under the hood... somewhere there must be learnable weights...
# They have shape [3, 2, 2, 2], i.e.: [out, in, kernel, kernel]
c.weight, c.weight.shape

(Parameter containing:
 tensor([[[[ 0.3432,  0.2381],
           [-0.0373, -0.1192]],
 
          [[ 0.2985,  0.2948],
           [-0.0664,  0.3083]]],
 
 
         [[[-0.0435, -0.0830],
           [ 0.1895, -0.1782]],
 
          [[-0.3184,  0.3128],
           [ 0.2530, -0.2484]]],
 
 
         [[[-0.1078,  0.1343],
           [-0.1081,  0.3484]],
 
          [[ 0.2352, -0.3167],
           [ 0.2134,  0.1555]]]], requires_grad=True),
 torch.Size([3, 2, 2, 2]))

The weights you see above are random, so don't look for any particular meaning. What we care about is the **shape** of `c.weight`:
- We asked for $2\times 2$ kernels.
- `nn.Conv2d` creates $2$ such kernels: one per input channel.
- Each kernel is convolved with its own input channel, resulting in $2$ feature maps.
- These feature maps are [summed together](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html).
- All of the above is done $3$ times: once per output channel.

As a result, the weight of `nn.Conv2d` is a tensor with shape `[3, 2, 2, 2]` that will be convolved with the input.
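
The steps in the list above can be checked numerically. The sketch below uses a random input and assumes the default `stride=1` with no padding. It convolves each input channel with its own slice of the weight tensor via `F.conv2d`, sums the resulting feature maps, and compares the sum with what the layer computes in a single call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 2, 4, 4)  # random input, for illustration only
conv = nn.Conv2d(in_channels=2, out_channels=3, kernel_size=2, bias=False)

with torch.no_grad():
    full = conv(x)  # the layer applied in one call

    # Convolve each input channel separately with the matching kernel slice
    per_channel = [
        F.conv2d(x[:, i:i + 1], conv.weight[:, i:i + 1])
        for i in range(x.shape[1])
    ]
    # ...then sum the per-channel feature maps
    summed = per_channel[0] + per_channel[1]

print(torch.allclose(full, summed, atol=1e-6))
```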

> **EXERCISE:** What's the shape of the _output_ resulting from applying these convolutions?

Let's do a bit of unrolling to get a deeper insight into convolution.

In [None]:
# We define custom kernel weights for the first out channel
# In this way we can easily reproduce the computation
my_custom_2dkernel_in_channel1 = torch.tensor([[ .1, .2], [-.4, -.5, ]])
my_custom_2dkernel_in_channel2 = torch.tensor([[.4, .3], [-.42, -.45, ]])

In [None]:
my_custom_2dkernel_in_channel1, my_custom_2dkernel_in_channel1.shape

(tensor([[ 0.1000,  0.2000],
         [-0.4000, -0.5000]]),
 torch.Size([2, 2]))

In [None]:
my_custom_2dkernel_in_channel2, my_custom_2dkernel_in_channel2.shape

(tensor([[ 0.4000,  0.3000],
         [-0.4200, -0.4500]]),
 torch.Size([2, 2]))

In [None]:
# Modify the weights of the convolution

# out-channel 0, in-channel 0
c.weight.data[0, 0, ...] = my_custom_2dkernel_in_channel1

# out-channel 0, in-channel 1
c.weight.data[0, 1, ...] = my_custom_2dkernel_in_channel2

# ...note that accessing the .data attribute of a tensor directly bypasses some safety checks!
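
As a possible alternative to `.data`, the same assignment can be sketched inside a `torch.no_grad()` block, which keeps autograd out of the in-place update. The names `k1` and `k2` below are stand-ins for the two custom kernels defined above.

```python
import torch
import torch.nn as nn

c = nn.Conv2d(in_channels=2, out_channels=3, kernel_size=2, bias=False)

# Stand-ins for the two custom kernels of the first output channel
k1 = torch.tensor([[0.1, 0.2], [-0.4, -0.5]])
k2 = torch.tensor([[0.4, 0.3], [-0.42, -0.45]])

# In-place copies on a parameter are allowed when gradient tracking is disabled
with torch.no_grad():
    c.weight[0, 0].copy_(k1)  # out-channel 0, in-channel 0
    c.weight[0, 1].copy_(k2)  # out-channel 0, in-channel 1

print(c.weight[0])
```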

In [None]:
# Let's check: the first out-channel is correctly set
c.weight, c.weight.shape

(Parameter containing:
 tensor([[[[ 0.1000,  0.2000],
           [-0.4000, -0.5000]],
 
          [[ 0.4000,  0.3000],
           [-0.4200, -0.4500]]],
 
 
         [[[-0.0435, -0.0830],
           [ 0.1895, -0.1782]],
 
          [[-0.3184,  0.3128],
           [ 0.2530, -0.2484]]],
 
 
         [[[-0.1078,  0.1343],
           [-0.1081,  0.3484]],
 
          [[ 0.2352, -0.3167],
           [ 0.2134,  0.1555]]]], requires_grad=True),
 torch.Size([3, 2, 2, 2]))

As we mentioned, we have a $2 \times 2$ kernel for each input channel, in this case two, and we have as many kernel pairs as we have output channels. For simplicity, we will only consider one output channel in our unrolling.

> **EXERCISE**
>
> Stop for a second. Try to apply the convolution yourself!
>
> Here is an input image with shape `torch.Size([1, 2, 4, 4])`, i.e. `[batch, channels, height, width]`:
>
> ```python
> a = tensor([[[[ 0.,  1.,  2.,  3.],
>               [ 4.,  5.,  6.,  7.],
>               [ 8.,  9., 10., 11.],
>               [12., 13., 14., 15.]],
>     
>              [[16., 17., 18., 19.],
>               [20., 21., 22., 23.],
>               [24., 25., 26., 27.],
>               [28., 29., 30., 31.]]]])
> ```
>
> This is the kernel tensor from before, having shape `torch.Size([3, 2, 2, 2])`, i.e. `[out_channels, in_channels, kernel_size, kernel_size]`:
>
> ```python
> c.weight = tensor([[[[ 0.1000,  0.2000],
>                      [-0.4000, -0.5000]],
>            
>                     [[ 0.4000,  0.3000],
>                      [-0.4200, -0.4500]]],
>            
>            
>                    [[[-0.0435, -0.0830],
>                      [ 0.1895, -0.1782]],
>            
>                     [[-0.3184,  0.3128],
>                      [ 0.2530, -0.2484]]],
>            
>            
>                    [[[-0.1078,  0.1343],
>                      [-0.1081,  0.3484]],
>            
>                     [[ 0.2352, -0.3167],
>                      [ 0.2134,  0.1555]]]], requires_grad=True)
> ```
>
> **Question: What value will be in `output[0, 0, 0, 0]`, i.e. the first element of the output tensor after the convolution?**

> **Answer:**

In [None]:
# Here's the complete result when we apply this convolution
o = c(a)
o, o.shape

Is `[1, 3, 3, 3]` the shape you predicted in the earlier exercise? If so, you're on the right track!
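
The spatial size follows the output-size formula given in the `nn.Conv2d` documentation. Below is a small sketch of it; `conv_output_size` is a hypothetical helper written for this notebook, not a PyTorch function.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a convolution.

    Hypothetical helper implementing the formula from the nn.Conv2d docs:
    floor((size + 2*padding - dilation*(kernel_size - 1) - 1) / stride + 1)
    """
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# Our example: 4x4 input, 2x2 kernel, default stride and padding
print(conv_output_size(4, kernel_size=2))  # 3
```

Applied to both spatial dimensions of our $4 \times 4$ input, this gives $3 \times 3$; the batch size stays at $1$ and the channel dimension becomes `out_channels`, i.e. $3$.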

In [None]:
# Let's compute the first value of the first out channel manually, i.e. this one:
o[0, 0, 0, 0]  # [batch, channels, height, width]

In [None]:
# Take the first window, with the same size as the kernel, from the first in_channel of the input
f1 = a[0, 0, :2, :2]
f1

In [None]:
# And the second input channel
f2 = a[0, 1, :2, :2]
f2

In [None]:
# Perform an element-wise multiplication between each input window and its kernel,
# then add up the results over the (input) channel dimension.
# In this case we can do it manually like this:

f = f1 * my_custom_2dkernel_in_channel1 + f2 * my_custom_2dkernel_in_channel2
f

In [None]:
# Sum up the result
s = f.sum()
s

In [None]:
# i.e. the first element in the first out_channel of our output tensor
o
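
The same recipe extends to every position of the first output channel. The following self-contained sketch rebuilds the input and the custom kernels from above, slides the window over all $3 \times 3$ positions with explicit loops, and compares the result with the layer's output. The loops are written for clarity rather than speed.

```python
import torch
import torch.nn as nn

# Same input as above: values 0..31 arranged as [batch, channels, height, width]
a = torch.arange(32).reshape(1, 2, 4, 4).float()

# The two custom 2x2 kernels of the first output channel, stacked as [in, k, k]
k = torch.tensor([[[0.1, 0.2], [-0.4, -0.5]],
                  [[0.4, 0.3], [-0.42, -0.45]]])

c = nn.Conv2d(in_channels=2, out_channels=3, kernel_size=2, bias=False)
with torch.no_grad():
    c.weight[0].copy_(k)
    o = c(a)

# Slide the window manually over every output position
manual = torch.zeros(3, 3)
for i in range(3):
    for j in range(3):
        window = a[0, :, i:i + 2, j:j + 2]  # both input channels, shape [2, 2, 2]
        manual[i, j] = (window * k).sum()   # multiply element-wise, sum everything

print(torch.allclose(manual, o[0, 0], atol=1e-4))
```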

> **EXERCISE**
>
> How many parameters does this layer have? Note that this time there is a _bias_ as well! Check the docs to see how bias is defined for 2d convolutions.
>
> ```python
> nn.Conv2d(in_channels=5, out_channels=10, kernel_size=2, bias=True)
> ```
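
Once you have worked out your answer on paper, one way to check it is to let PyTorch count the parameters for you. This is a minimal sketch using `parameters()` and `numel()`.

```python
import torch.nn as nn

layer = nn.Conv2d(in_channels=5, out_channels=10, kernel_size=2, bias=True)

# Sum the number of elements of every learnable tensor (weight and bias)
num_params = sum(p.numel() for p in layer.parameters())
print(num_params)
```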

### Break the symmetry!

A small interlude before we implement our first CNN. Let's address two questions.

**_Why would I want more output channels than input channels?_**

Having multiple output channels (i.e. _feature maps_) allows the network to simultaneously learn a variety of features from the same piece of input data. For instance, one channel might become specialized in detecting horizontal edges, while another might focus on vertical edges, and a third might detect areas of high contrast.
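
To illustrate the idea, the sketch below applies two hand-crafted kernels (Sobel filters, chosen here as an assumption for illustration) to a toy image containing a single vertical edge. A trained network learns its own filters, which need not look like these.

```python
import torch
import torch.nn.functional as F

# Toy 6x6 image with one vertical edge: left half 0, right half 1
image = torch.zeros(1, 1, 6, 6)
image[..., 3:] = 1.0

# Two hand-crafted 3x3 kernels, stacked as two output channels
vertical_edges = torch.tensor([[-1., 0., 1.],
                               [-2., 0., 2.],
                               [-1., 0., 1.]])
horizontal_edges = vertical_edges.t()
weight = torch.stack([vertical_edges, horizontal_edges]).unsqueeze(1)  # [2, 1, 3, 3]

features = F.conv2d(image, weight)  # [1, 2, 4, 4]

print(features[0, 0])  # responds around the vertical edge
print(features[0, 1])  # all zeros: there is no horizontal edge in this image
```

The same input therefore yields two different feature maps, one per output channel.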

**_If I initialize the kernel weights equally for all the output channels, will I get the same feature maps?_**

If you initialize the convolutional filter weights identically for all output channels, every filter computes the same output and, as long as the rest of the network treats the channels symmetrically, receives the same gradient, so the filters keep learning identical features. This is why **random initialization** is critical and standard practice: it ensures that each filter starts from a slightly different state, allowing the filters to follow different paths and capture different features of the input data.
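
The symmetry problem can be reproduced in a few lines. The sketch below is a toy setup with assumptions of our own: a single layer with two output channels, a constant initial weight of `0.1`, random input, and a loss that treats both channels identically. After a few SGD steps the weights have moved, yet the two filters should still be equal to each other.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=3, bias=False)

# Give both output channels exactly the same initial weights
with torch.no_grad():
    conv.weight.fill_(0.1)

x = torch.randn(4, 1, 8, 8)  # random input batch, for illustration only
opt = torch.optim.SGD(conv.parameters(), lr=0.1)

for _ in range(5):
    opt.zero_grad()
    loss = conv(x).pow(2).mean()  # a loss that is symmetric in the output channels
    loss.backward()
    opt.step()

# Compare the two filters after training
still_identical = torch.allclose(conv.weight[0], conv.weight[1])
print(still_identical)
```

Replacing the constant fill with the layer's default random initialization breaks the tie, and the two filters can then evolve differently.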