# TUTORIAL 4: Convolutional Neural Networks

In [None]:
# Imports
import torch
import torch.nn.functional as F
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch import optim
from torch import nn
from torch.utils.data import DataLoader
from tqdm import tqdm

# CNNs

Convolutional Neural Networks (CNNs) use the convolution operation to learn features from 1D, 2D, or 3D inputs.

The **convolution operation** combines two functions to produce a third. In a CNN, sliding a **filter** across the entire input image lets the filter detect a particular **feature** wherever it occurs. The resulting output is called a feature map.
In deep learning, we call these filters **kernels**. A kernel is a 1D/2D/3D array of weights that is *placed* on top of the input to compute and extract the features of the area it covers.

A kernel always starts at the top-left corner of the input and always moves from left to right, then top to bottom. **Stride** defines how many positions the kernel moves between computations. **Kernel size** defines the size (height, width and possibly depth) of a kernel. For example, if we set kernel_size to (3, 3), we get a 2D 3x3 kernel.

**Feature Extraction**

When we apply a kernel to the input for feature extraction, we start at the top left, aligning the kernel with the edges of the input. The **weights** of the kernel are then multiplied by the input values at the corresponding locations and summed to produce the **feature**, which becomes the output for the centre location of the kernel. You can think of this as using the neighbouring values of the centre value to extract its feature.

![](https://miro.medium.com/v2/resize:fit:919/1*tNQvssqUaiYteDpREHQyFw.png)
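
To make the multiply-and-sum step concrete, here is a minimal sketch in plain Python. The function name `conv2d_single_channel` and the example values are made up for illustration. It assumes a single channel and no padding, and, like convolution layers in deep learning libraries, it does not flip the kernel (strictly speaking this is cross-correlation).

```python
def conv2d_single_channel(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    feature_map = []
    for i in range(out_h):          # top to bottom
        row = []
        for j in range(out_w):      # left to right
            top, left = i * stride, j * stride
            total = 0
            # Multiply each kernel weight by the input value under it, then sum
            for a in range(k_h):
                for b in range(k_w):
                    total += kernel[a][b] * image[top + a][left + b]
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
# A simple vertical-edge style kernel: left column minus right column
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
feature_map = conv2d_single_channel(image, kernel)
print(feature_map)  # [[-2, -2], [2, -2]]
```

Note that the 4x4 input yields only a 2x2 feature map, because the 3x3 kernel fits in just two positions along each axis.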

If we do the operation this way, the output will be smaller than the input: because of where the kernel starts and how the convolution is defined, there are no outputs corresponding to the edges. To mitigate this, we use **padding**, which means adding dummy values around the edges to expand the input, so that the kernel can also be centred on the edge positions and extract their features.


<img src="https://maucher.home.hdm-stuttgart.de/Pics/gif/same_padding_no_strides_transposed.gif" width="400">
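
The effect of kernel size, stride and padding on the output size can be sketched with a small helper. The name `conv_output_size` is made up for this example; the formula is the standard one for a single spatial dimension, assuming the same padding on both sides.

```python
def conv_output_size(n, kernel_size, stride=1, padding=0):
    """Output length along one dimension for an input of length n."""
    return (n + 2 * padding - kernel_size) // stride + 1


no_padding = conv_output_size(28, kernel_size=3)
with_padding = conv_output_size(28, kernel_size=3, padding=1)
strided = conv_output_size(28, kernel_size=3, stride=2, padding=1)

print(no_padding)    # 26: the edges are lost
print(with_padding)  # 28: padding of 1 preserves the size for a 3x3 kernel
print(strided)       # 14: a stride of 2 roughly halves the size
```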

**Channels**

Until this point, we assumed that the input has a single channel. Most inputs have more: RGB images have 3 channels, and latent representations can have many more. If the input has more than one channel, the convolution is applied across all channels. For example, with a 3x3 kernel and a 1-channel input, the very first convolution step covers the 3x3 = 9 top-left positions. If the input has 3 channels, as with RGB images, we apply a 3x3 kernel to the top-left 3x3 positions of **each** channel and then sum all the products to get the feature for a **single** output position. That gives (3x3) * 3 = 27 multiplications in the sum for a single output position. With suitable padding, such a layer maps an input of shape 3xHxW to an output of shape HxW. **Note** that in this case the number of 3x3 kernels equals the number of input channels: instead of passing the same 3x3 kernel over every channel, we pass a separate kernel over channel 1, another over channel 2, and so on.

As we learnt with linear layers, most deep learning models first expand the number of channels in the hidden layers in order to have more weights and therefore more representational capacity. This applies to convolutional layers as well. We can use multiple kernels in a convolutional layer to expand the channels of the input. For example, if we used 4 kernels in a single conv layer in our previous RGB image example, each kernel would be applied to the whole input **separately**, creating an output of shape 4xHxW. You can imagine this as using the first kernel to get an output of HxW, then using the second kernel to get another output of HxW, and so on, and finally stacking these HxW outputs into a single "3D" output.
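
The channel arithmetic above can be checked against `nn.Conv2d`. This is a small sketch with an arbitrary 8x8 random input; it recomputes one output value by hand as the sum of 27 products plus the bias and compares it with the layer's result.

```python
import torch
from torch import nn

torch.manual_seed(0)

# 3 input channels (e.g. RGB), 4 kernels -> 4 output channels
conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1)
x = torch.randn(1, 3, 8, 8)  # (batch, channels, height, width)
y = conv(x)

print(conv.weight.shape)  # torch.Size([4, 3, 3, 3]): 4 kernels, each 3x3x3
print(y.shape)            # torch.Size([1, 4, 8, 8])

# Recompute output channel 0 at position (4, 4) by hand.
# With padding=1, that output is centred on input rows/cols 3..5.
patch = x[0, :, 3:6, 3:6]                          # shape (3, 3, 3)
manual = (conv.weight[0] * patch).sum() + conv.bias[0]
matches = torch.allclose(manual, y[0, 0, 4, 4], atol=1e-5)
print(matches)
```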

**Weights and Biases**

So what are the weights in these convolutional layers?
As shown in the visual example of the convolution operation, a kernel has a value at each of its positions. These values are the weights of the convolutional layer, and they are updated at every training step. Because the same kernel is passed over the whole input and every position contributes to the gradient, the kernel is shaped by all of the input data rather than by a single region of it.

The number of weights in a convolutional layer depends on the kernel size, the number of input channels and the number of output channels. For example, if the input has 3 channels, we want the output to have 1 channel, and the kernel is 3x3, then this layer has 3 separate 3x3 kernels, making the learnable weight count 3x3x3 = 27. If we instead want the output to have 5 channels, we need 3 separate 3x3 kernels for **each** output channel, so the total weight count is 5*(3x3x3) = 135. Regardless of the number of input channels and the kernel size, there is always a **single bias term per output channel**.
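
These counts can be confirmed directly from the layer objects. A short sketch, where `count_parameters` is a helper defined only for this example:

```python
from torch import nn


def count_parameters(layer):
    """Total number of learnable values (weights and biases) in a layer."""
    return sum(p.numel() for p in layer.parameters())


one_out = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3)
five_out = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=3)

print(one_out.weight.numel(), one_out.bias.numel())    # 27 1
print(five_out.weight.numel(), five_out.bias.numel())  # 135 5
print(count_parameters(five_out))                      # 140 = 135 weights + 5 biases
```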

**Dilation**

**Dilation** (also called *atrous convolution*) means inserting gaps between the elements of a convolutional kernel.  
Instead of sampling only neighboring pixels, a dilated kernel skips pixels at regular intervals, allowing it to cover a larger area of the image without increasing the number of parameters.

A dilation rate of `1` corresponds to a normal convolution, while higher dilation rates spread out the kernel elements.  
This helps the layer **capture more context** (see a wider part of the image) and is especially useful in tasks like **semantic segmentation** or **audio sequence modeling**, where understanding broader spatial or temporal patterns matters.

<img src="https://i.sstatic.net/WfD07.gif" width="400">
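
As a quick illustration of how dilation widens the area a kernel covers without adding weights, the sketch below computes which input offsets a 1D kernel samples and the resulting span. The helper names are made up for this example.

```python
def sampled_offsets(kernel_size, dilation=1):
    """Input offsets (relative to the kernel start) that a dilated kernel reads."""
    return [i * dilation for i in range(kernel_size)]


def effective_kernel_size(kernel_size, dilation=1):
    """Span of input positions covered by the dilated kernel."""
    return (kernel_size - 1) * dilation + 1


for d in (1, 2, 4):
    print(d, sampled_offsets(3, d), effective_kernel_size(3, d))
# 1 [0, 1, 2] 3
# 2 [0, 2, 4] 5
# 4 [0, 4, 8] 9
```

In every case the kernel still has only 3 weights per dimension; only the spacing between them changes.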


# Creating and training Convolutional Neural Networks

As we discussed previously, convolutional layers make up most of a CNN.
torch.nn provides these convolutional layers just like the Linear layers discussed in earlier weeks. They are **torch.nn.Conv1d, torch.nn.Conv2d and torch.nn.Conv3d**. Conv2d is the most widely used, as it is the go-to choice for most image-related tasks, so we will use it today for our CNN. Note, however, that the variants are closely related: for example, a 1D convolution with a kernel size of 3 computes the same thing as a 2D convolution with a 3x1 kernel along one spatial axis.

These layers have some key parameters:
- in_channels (# of channels in our input data)
- out_channels (# of channels we want the output to have; this corresponds to the # of kernels in this single layer)
- kernel_size (a single number or a tuple. If we give a single number n in the 2D case, torch automatically sets the kernel size to (n, n))
- stride (how many positions the kernels move in each step)
- padding (how many padding values to put around the input. For example, if we set padding to 1, as in most cases, values are added around the edges of the input, turning NxN data into (N+2) x (N+2). Usually the padded regions are filled with 0s, but other approaches exist.)

Example: Let's create a 2D convolutional layer that uses a 3x3 kernel, takes inputs with 3 channels, expands the channels to 64, and uses a padding and stride of 1. (stride=1 is the default, so we do not need to provide it. padding, however, defaults to 0, meaning no padding, so we set it explicitly.)

conv_layer = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=(3,3), padding=1)
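
To see what this layer does to a batch, here is a short sketch that passes random data through it. The batch of 8 images of size 32x32 is an arbitrary choice for illustration. A second layer with stride=2 is included for comparison.

```python
import torch
from torch import nn

conv_layer = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=(3, 3), padding=1)

# Random stand-in for a batch of 8 RGB images: (batch, channels, height, width)
images = torch.randn(8, 3, 32, 32)
features = conv_layer(images)
print(features.shape)  # torch.Size([8, 64, 32, 32]): channels expanded, size preserved

# Same layer but with stride=2: the spatial size is halved
strided_layer = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=2, padding=1)
strided_features = strided_layer(images)
print(strided_features.shape)  # torch.Size([8, 64, 16, 16])
```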

In addition to convolutional layers, we also have pooling layers that are widely used in CNNs. One example of these pooling layers is the max pooling layer:

**Max Pooling** is a pooling operation that calculates the maximum value for patches of a feature map, and uses it to create a downsampled (pooled) feature map. It is usually used after a convolutional layer. It adds a small amount of translation invariance - meaning translating the image by a small amount does not significantly affect the values of most pooled outputs.
![](https://production-media.paperswithcode.com/methods/MaxpoolSample2.png)
You can check the documentation for max_pool layers here:
https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html#torch.nn.MaxPool2d

There are also other pooling operations such as the average pooling: https://pytorch.org/docs/stable/generated/torch.nn.AvgPool2d.html

Of course, just like convolution operations, we can do 1D, 2D, and 3D pooling depending on our task and input data.

You can check the other pooling layers torch provides here: https://pytorch.org/docs/stable/nn.html#pooling-layers
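
As a small worked example of pooling, the sketch below applies 2x2 max pooling and average pooling with stride 2 to a hand-written 4x4 feature map (the values are arbitrary).

```python
import torch
import torch.nn.functional as F

# One sample, one channel, 4x4 feature map: shape (1, 1, 4, 4)
x = torch.tensor([[1., 3., 2., 4.],
                  [5., 6., 1., 0.],
                  [7., 2., 9., 8.],
                  [1., 0., 3., 4.]]).reshape(1, 1, 4, 4)

# Each output value summarises one non-overlapping 2x2 patch
max_pooled = F.max_pool2d(x, kernel_size=2, stride=2)
avg_pooled = F.avg_pool2d(x, kernel_size=2, stride=2)

print(max_pooled.squeeze().tolist())  # [[6.0, 4.0], [7.0, 9.0]]
print(avg_pooled.squeeze().tolist())  # [[3.75, 1.75], [2.5, 6.0]]
```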



Enough theory, let's move on to coding a simple CNN and training it!

# Exercise 1:
Let's start with creating a simple CNN.
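
When sizing the fully connected head in this exercise, it can help to trace the spatial size through the layers before writing any model code. The sketch below does this in plain Python. It assumes 2x2 max pooling with stride 2 and compares conv padding of 0 and 1; these choices are not fixed by the exercise, so adjust them to match your own design. The helper names are made up for this example.

```python
def conv_out(n, kernel_size=3, stride=1, padding=0):
    """Output size of a conv layer along one spatial dimension."""
    return (n + 2 * padding - kernel_size) // stride + 1


def pool_out(n, kernel_size=2, stride=2):
    """Output size of a pooling layer along one spatial dimension."""
    return (n - kernel_size) // stride + 1


def flattened_features(input_size, out_channels, padding):
    """Trace Conv -> Pool -> Conv -> Pool and return the flattened feature count."""
    n = input_size
    for _ in range(2):
        n = pool_out(conv_out(n, padding=padding))
    return out_channels * n * n


# FashionMNIST images are 1x28x28; the second conv layer has 16 output channels
print(flattened_features(28, 16, padding=0))  # 400  (28 -> 26 -> 13 -> 11 -> 5)
print(flattened_features(28, 16, padding=1))  # 784  (28 -> 28 -> 14 -> 14 -> 7)
```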

In [None]:
# Simple CNN
class CNN(nn.Module):
    def __init__(self, in_channels=1, num_classes=10):
        super(CNN, self).__init__()
        # Instructions: Implement a simple CNN with the following structure:
        #               Conv -> ReLU -> MaxPool2D -> Conv -> ReLU -> MaxPool2D -> FC
        #               You can use a kernel_size of 3 and a stride of 1 for both conv layers.
        #               For the first conv layer you can use out_channels=8, and for the second
        #               conv layer you can use out_channels=16. Your fully connected head will have an output size of num_classes.
        #               Make sure the input dimension of your fully connected head matches the flattened conv output. Do not forget to flatten x before passing it to the fc layer.
        #               NOTE: each item in FashionMNIST has shape 1x28x28

        # TODO: Implement CNN here


    def forward(self, x):
        # TODO: FORWARD PASS.
        raise NotImplementedError("Implement the forward pass")


# Exercise 2:
Set the training hyperparameters.
We will use the FashionMNIST dataset, so our data will have a single channel, and since there are 10 possible classes in the dataset, our outputs will have size 10.

In [None]:
# Set device

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hyperparameters
in_channels = 1
num_classes = 10
learning_rate = 1e-3
batch_size = 8
num_epochs = 3

# Exercise 3:

Set up some data augmentations that can be used in this image classification task. You can check last week's tutorial and use the transformations that were mentioned there.

In [None]:
# TODO: Set up transformations
my_transforms = transforms.Compose(
    [
      # ADD TRANSFORMATIONS HERE. Start with ToTensor()
    ]
)

# Exercise 4:
Set loaders, initialize the network and create optimizer and the criterion. As this is a classification task, you can use Cross Entropy.

In [None]:
train_dataset = datasets.FashionMNIST(root="dataset/", train=True, transform=my_transforms, download=True)
test_dataset = datasets.FashionMNIST(root="dataset/", train=False, transform=my_transforms, download=True)
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

# TODO: Initialize network


# TODO: set up criterion (cross entropy) and optimizer, and lr scheduler


In [None]:
def check_accuracy(loader, model, device):
    """
    Check accuracy of our trained model given a loader and a model

    Parameters:
        loader: torch.utils.data.DataLoader
            A loader for the dataset you want to check accuracy on
        model: nn.Module
            The model you want to check accuracy on
        device: torch.device
            The device to run the evaluation on

    Returns:
        acc: float
            The accuracy of the model on the dataset given by the loader
    """

    num_correct = 0
    num_samples = 0
    model.eval()

    # We don't need to keep track of gradients here so we wrap it in torch.no_grad()
    with torch.no_grad():
        # Loop through the data
        for x, y in loader:

            # Move data to device
            x = x.to(device=device)
            y = y.to(device=device)

            # Forward pass
            scores = model(x)
            _, predictions = scores.max(1)

            # Check how many we got correct
            num_correct += (predictions == y).sum().item()

            # Keep track of number of samples
            num_samples += predictions.size(0)

    model.train()
    return num_correct / num_samples

# Exercise 5:

Code a training loop from scratch and train your CNN. Print the accuracy of your model on both the training set and the test set at the end of each epoch.
You should reach around 90.00% accuracy on the test set.

In [None]:
# TODO: Write the training loop and train your model. Lastly, evaluate it.
model.train()
for i in range(num_epochs):
  losses = []
  for batch_idx, (data, targets) in enumerate(tqdm(train_loader)):
    data = data.to(device)
    targets = targets.to(device)

    # TODO: forward pass, compute `loss`, backward pass, and optimizer step.


    losses.append(loss.item())
    mean_loss = sum(losses) / len(losses)
  print(f"Epoch: [{i+1}/{num_epochs}] | Loss: {mean_loss:.3f}")

model.eval()
# Check accuracy on training & test sets to see how good our model is
print(f"Accuracy on training set: {check_accuracy(train_loader, model, device)*100:.2f}")
print(f"Accuracy on test set: {check_accuracy(test_loader, model, device)*100:.2f}")

The effective receptive field of a point in a given layer is the region of the input image whose information reaches that point. A point in layer $k$ combines information from the points around it in the previous layer, and each of those points in turn carries information from an even wider neighbourhood in the layers before it. This builds a hierarchy of where the information has come from:

![](https://miro.medium.com/1*k97NVvlMkRXau-uItlq5Gw.png)

To calculate the receptive field at a layer, we can use the following formula:

$$
  r_{k} = r_{k-1}+(f_k - 1) * \prod_{i=1}^{k-1} s_i
$$

Here, $r_k$ is the size of the receptive field at layer $k$ (with $r_0 = 1$ for a single input pixel), $f_k$ is the kernel size at layer $k$ (assuming a square kernel), and $s_i$ is the stride of layer $i$.

Thus, we can see that larger strides in early layers make the receptive field grow much more quickly in all later layers.
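
As a worked example of the formula, the sketch below applies it step by step in plain Python, starting from $r_0 = 1$. The function name `receptive_fields` is made up for this example, and dilation is ignored. The first configuration uses five 5x5 layers where only the first has stride 3; the second uses stride 1 everywhere for comparison.

```python
def receptive_fields(layers):
    """layers: list of (kernel_size, stride) tuples, first layer first.

    Returns the receptive field size after each layer.
    """
    r = 1     # receptive field of a single input pixel
    jump = 1  # product of the strides of all previous layers
    sizes = []
    for kernel_size, stride in layers:
        r = r + (kernel_size - 1) * jump
        jump *= stride
        sizes.append(r)
    return sizes


with_stride = receptive_fields([(5, 3), (5, 1), (5, 1), (5, 1), (5, 1)])
without_stride = receptive_fields([(5, 1)] * 5)

print(with_stride)     # [5, 17, 29, 41, 53]
print(without_stride)  # [5, 9, 13, 17, 21]
```

The stride of 3 in the first layer does not change that layer's own receptive field, but every later layer grows by 12 instead of 4.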

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from collections import OrderedDict

def receptive_field_progression(named_modules):
    """Return list of (name, RF_size, layer_type, k, s, d) for Conv2d / Pool2d layers."""
    results = []
    j, r = 1, 1
    for name, m in named_modules:
        if isinstance(m, (nn.Conv2d, nn.MaxPool2d, nn.AvgPool2d)):
            k = m.kernel_size if isinstance(m.kernel_size, tuple) else (m.kernel_size, m.kernel_size)
            s = m.stride if isinstance(m.stride, tuple) else (m.stride, m.stride)
            dval = getattr(m, "dilation", 1)
            d = dval if isinstance(dval, tuple) else (dval, dval)
            k_eff = (k[0] - 1) * d[0] + 1
            r = r + (k_eff - 1) * j
            j = j * s[0]
            results.append((name, r, m.__class__.__name__, k, s, d))
    return results

def draw_rf_grid(layer_name, rf_size, layer_type, k, s, d, canvas_size=None):
    """Render a centered square RF on a canvas grid."""
    if canvas_size is None:
        canvas_size = max(2*rf_size + 5, 40)
    if canvas_size % 2 == 0:
        canvas_size += 1

    grid = np.zeros((canvas_size, canvas_size))
    c = canvas_size // 2
    half = rf_size // 2
    top, left = c - half, c - half
    bottom, right = c + half, c + half
    grid[top:bottom+1, left:right+1] = 1.0

    plt.figure(figsize=(4,4))
    plt.imshow(grid, interpolation='nearest')
    plt.title(f"{layer_name} ({layer_type})\nRF size: {rf_size}×{rf_size} | k={k[0]} s={s[0]} d={d[0]}")
    plt.axis('off')
    plt.show()

def visualize_model_rf(model, module_attr="net"):
    """Extract named modules from model.<module_attr> (or model if absent) and plot RF grids."""
    root = getattr(model, module_attr, model)
    named_modules = [(n, m) for n, m in root.named_modules() if n]
    rf_layers = receptive_field_progression(named_modules)
    max_rf = max(r for _, r, *_ in rf_layers) if rf_layers else 0
    canvas_size = max(2*max_rf + 5, 40)
    if canvas_size % 2 == 0:
        canvas_size += 1
    for name, r, t, k, s, d in rf_layers:
        draw_rf_grid(name, r, t, k, s, d, canvas_size=canvas_size)

class DemoNet(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(OrderedDict([
            ("conv1", nn.Conv2d(in_ch, 16, kernel_size=5, stride=3, padding=1)),
            ("conv2", nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=1)),
            ("conv3", nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=1)),
            ("conv4", nn.Conv2d(64, 64, kernel_size=5, stride=1, padding=1)),
            ("conv5", nn.Conv2d(64, 64, kernel_size=5, stride=1, padding=1)),
        ]))
    def forward(self, x):
        return self.net(x)

model = DemoNet()
visualize_model_rf(model, module_attr="net")
