# Understanding convolutions

## a. Cross-correlation operation

$$
\begin{bmatrix}
0 & 1 & 2\\
3 & 4 & 5\\
6 & 7 & 8
\end{bmatrix} * \begin{bmatrix}
0 & 1\\
2 & 3
\end{bmatrix} = \begin{bmatrix}
19 & 25\\
37 & 43
\end{bmatrix}
$$

In the two-dimensional cross-correlation operation, we begin with the convolution window positioned at the upper-left corner of the input tensor and slide it across the input tensor, both from left to right and top to bottom. When the convolution window slides to a certain position, the input subtensor contained in that window and the kernel tensor are multiplied elementwise, and the resulting tensor is summed up, yielding a single scalar value. This result gives the value of the output tensor at the corresponding location. Here, the output tensor has a height of 2 and a width of 2, and its four elements are derived from the two-dimensional cross-correlation operation:

$$
\begin{split}0\times0+1\times1+3\times2+4\times3=19,\\
1\times0+2\times1+4\times2+5\times3=25,\\
3\times0+4\times1+6\times2+7\times3=37,\\
4\times0+5\times1+7\times2+8\times3=43.\end{split}
$$

Note that along each axis, the output size is slightly smaller than the input size. Because the kernel has width and height greater than one, we can only properly compute the cross-correlation at locations where the kernel fits wholly within the image. The output size is therefore the input size $n_h \times n_w$ reduced by the size of the convolution kernel $k_h \times k_w$: $(n_h - k_h + 1) \times (n_w - k_w + 1)$.

This is the case since we need enough space to shift the convolution kernel across the image. It is possible to keep the size unchanged by padding the image with zeros around its boundary so that there is enough space to shift the kernel.
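As a quick check of the output-size formula, the following sketch compares it with a brute-force count of the positions where the kernel fits inside the input. It uses plain Python so that it runs without any dependencies; the function names and the 28×28 input with a 5×5 kernel are illustrative choices, not part of the original notebook.

```python
def valid_output_shape(n_h, n_w, k_h, k_w):
    """Output shape of an unpadded, stride-1 cross-correlation."""
    return (n_h - k_h + 1, n_w - k_w + 1)


def count_window_positions(n_h, n_w, k_h, k_w):
    """Count, by brute force, the top-left corners where the kernel fits."""
    rows = sum(1 for i in range(n_h) if i + k_h <= n_h)
    cols = sum(1 for j in range(n_w) if j + k_w <= n_w)
    return (rows, cols)


# The 3x3 input and 2x2 kernel from the worked example
shape_example = valid_output_shape(3, 3, 2, 2)
print(shape_example)  # (2, 2)

# A hypothetical 28x28 input with a 5x5 kernel
shape_large = valid_output_shape(28, 28, 5, 5)
print(shape_large)  # (24, 24)

# The formula and the brute-force count agree
print(count_window_positions(28, 28, 5, 5) == shape_large)  # True
```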

In [10]:
import torch
def corr2d(X, K):
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

In [11]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)

tensor([[19., 25.],
        [37., 43.]])

## b. Convolutional layer

A convolutional layer cross-correlates the input and kernel and adds a scalar bias to produce an output. The two parameters of a convolutional layer are the kernel and the scalar bias. When training models based on convolutional layers, we typically initialize the kernels randomly, just as we would with a fully connected layer.

We are now ready to implement a two-dimensional convolutional layer based on the `corr2d` function defined above. In the `__init__` constructor method, we declare `weight` and `bias` as the two model parameters. The forward propagation method calls the `corr2d` function and adds the bias.



In [12]:
from torch import nn
import torch
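class Conv2D(nn.Module):
    """Minimal 2D convolutional layer built on corr2d (single channel, no batch)."""

    def __init__(self, kernel_size):
        super().__init__()
        # Learnable kernel, initialized uniformly at random in [0, 1)
        self.weight = nn.Parameter(torch.rand(kernel_size))
        # Learnable scalar bias, initialized to zero
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # Cross-correlate the 2D input with the kernel, then add the bias
        return corr2d(x, self.weight) + self.bias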

Now let’s see whether we can learn the kernel that generated Y from X by looking at the input–output pairs only. We first construct a convolutional layer and initialize its kernel as a random tensor. Next, in each iteration, we will use the squared error to compare Y with the output of the convolutional layer. We can then calculate the gradient to update the kernel. For the sake of simplicity, in the following we use the built-in class for two-dimensional convolutional layers and ignore the bias.

In [28]:
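# 6x8 image: ones everywhere except a band of zeros in columns 2-5
X = torch.ones((6, 8))
X[:, 2:6] = 0
print(X.shape)
# Kernel whose output is nonzero only where horizontally adjacent pixels differ
K = torch.tensor([[1.0, -1.0]])

# Target output that the convolutional layer should learn to reproduce
Y = corr2d(X, K)
print(Y.shape)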

# Construct a two-dimensional convolutional layer with 1 output channel and a
# kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
conv2d = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=(1, 2), bias=False)

# The two-dimensional convolutional layer uses four-dimensional input and
# output in the format of (example, channel, height, width), where the batch
# size (number of examples in the batch) and the number of channels are both 1
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
lr = 3e-2  # Learning rate

for i in range(20):
    Y_hat = conv2d(X)
    l = ((Y_hat - Y) ** 2)
    conv2d.zero_grad()
    l.sum().backward()
    # Update the kernel
    conv2d.weight.data[:] -= lr * conv2d.weight.grad
    if (i + 1) % 5 == 0:
        print(f'Epoch {i + 1}, loss {l.sum():.3f}')

torch.Size([6, 8])
torch.Size([6, 7])
Epoch 5, loss 0.743
Epoch 10, loss 0.025
Epoch 15, loss 0.002
Epoch 20, loss 0.000


In [29]:
conv2d.weight.data

tensor([[[[ 1.0013, -0.9983]]]])

## c. Padding and Stride

In the following, we will explore two techniques that offer more control over the size of the output. As motivation, note that since kernels generally have width and height greater than 1, after applying many successive convolutions, we tend to wind up with outputs that are considerably smaller than our input. If we start with a $240 \times 240$ pixel image, 10 layers of $5 \times 5$ convolutions reduce the image to $200 \times 200$ pixels, slicing off about $30\%$ of the image's area and removing any interesting information on the boundaries of the original image.
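The arithmetic behind this example can be reproduced in a few lines. This is a minimal sketch assuming unpadded, stride-1 convolutions; the helper name `shrink` is hypothetical.

```python
def shrink(size, kernel, layers):
    """Side length after `layers` unpadded, stride-1 convolutions."""
    for _ in range(layers):
        size = size - kernel + 1  # each layer removes (kernel - 1) pixels
    return size


side = shrink(240, 5, 10)
lost_fraction = 1 - (side * side) / (240 * 240)

print(side)                    # 200
print(f"{lost_fraction:.1%}")  # 30.6%
```

Each $5 \times 5$ layer removes 4 pixels per axis, so 10 layers remove 40, and the remaining $200 \times 200$ area is about $69.4\%$ of the original.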

### i. Padding

$$
\begin{bmatrix}
0 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 2 & 0\\
0 & 3 & 4 & 5 & 0\\
0 & 6 & 7 & 8 & 0\\
0 & 0 & 0 & 0 & 0
\end{bmatrix} * \begin{bmatrix}
0 & 1\\
2 & 3
\end{bmatrix} = \begin{bmatrix}
0 & 3 & 8 & 4\\
9 & 19 & 25 & 10\\
21 & 37 & 43 & 16\\
6 & 7 & 8 & 0
\end{bmatrix}
$$

If we add a total of $p_h$ rows of padding (roughly half on top and half on bottom) and a total of $p_w$ columns of padding (roughly half on the left and half on the right), the output shape will be $(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1)$.

To keep the output shape the same as the input shape, we commonly set $p_h = k_h - 1$ and $p_w = k_w - 1$.
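The padded example can be reproduced with a short sketch. It uses nested Python lists instead of tensors so that it runs without dependencies; `pad2d` and `corr2d_lists` are hypothetical helpers written for this illustration, with `corr2d_lists` mirroring the logic of `corr2d`.

```python
def pad2d(X, top, bottom, left, right):
    """Return a zero-padded copy of a 2D nested list."""
    width = len(X[0]) + left + right
    padded = [[0] * width for _ in range(top)]
    for row in X:
        padded.append([0] * left + list(row) + [0] * right)
    padded.extend([0] * width for _ in range(bottom))
    return padded


def corr2d_lists(X, K):
    """Stride-1 cross-correlation on nested lists."""
    k_h, k_w = len(K), len(K[0])
    out_h, out_w = len(X) - k_h + 1, len(X[0]) - k_w + 1
    return [[sum(X[i + a][j + b] * K[a][b]
                 for a in range(k_h) for b in range(k_w))
             for j in range(out_w)]
            for i in range(out_h)]


X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
K = [[0, 1], [2, 3]]

# One row/column of zeros on every side, i.e. p_h = p_w = 2
Y_padded = corr2d_lists(pad2d(X, 1, 1, 1, 1), K)
for row in Y_padded:
    print(row)
# [0, 3, 8, 4]
# [9, 19, 25, 10]
# [21, 37, 43, 16]
# [6, 7, 8, 0]

# p_h = k_h - 1 = 1 and p_w = k_w - 1 = 1 keep the 3x3 input shape.
# With an even kernel size the padding cannot be split evenly; placing it
# on the top and left is an arbitrary choice made for this sketch.
Y_same = corr2d_lists(pad2d(X, 1, 0, 1, 0), K)
print(len(Y_same), len(Y_same[0]))  # 3 3
```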

### ii. Stride

In the previous examples, we defaulted to sliding one element at a time. However, sometimes, either for computational efficiency or because we wish to downsample, we move our window more than one element at a time, skipping the intermediate locations. This is particularly useful if the convolution kernel is large since it captures a large area of the underlying image. We refer to the number of rows and columns traversed per slide as stride. So far, we have used strides of 1, both for height and width. Sometimes, we may want to use a larger stride.

For a stride $(s_h, s_w)$, the output shape is $\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor$.

For example, applying a stride of 3 vertically and 2 horizontally to the previous padded example gives:

$$
\begin{bmatrix}
0 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 2 & 0\\
0 & 3 & 4 & 5 & 0\\
0 & 6 & 7 & 8 & 0\\
0 & 0 & 0 & 0 & 0
\end{bmatrix} * \begin{bmatrix}
0 & 1\\
2 & 3
\end{bmatrix} = \begin{bmatrix}
0 & 8\\
6 & 8
\end{bmatrix}
$$
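The strided result can be checked the same way. The sketch below evaluates the output-shape formula and then slides the kernel over the already padded input, moving 3 rows and 2 columns at a time. It uses plain Python lists, and both function names are hypothetical.

```python
def strided_output_shape(n_h, n_w, k_h, k_w, p_h, p_w, s_h, s_w):
    """Output shape for total padding (p_h, p_w) and stride (s_h, s_w)."""
    return ((n_h - k_h + p_h + s_h) // s_h,
            (n_w - k_w + p_w + s_w) // s_w)


def corr2d_strided(X, K, s_h, s_w):
    """Cross-correlation with strides on an already padded nested list."""
    k_h, k_w = len(K), len(K[0])
    rows = range(0, len(X) - k_h + 1, s_h)  # top edges of each window
    cols = range(0, len(X[0]) - k_w + 1, s_w)  # left edges of each window
    return [[sum(X[i + a][j + b] * K[a][b]
                 for a in range(k_h) for b in range(k_w))
             for j in cols]
            for i in rows]


X_padded = [[0, 0, 0, 0, 0],
            [0, 0, 1, 2, 0],
            [0, 3, 4, 5, 0],
            [0, 6, 7, 8, 0],
            [0, 0, 0, 0, 0]]
K = [[0, 1], [2, 3]]

# 3x3 input, 2x2 kernel, 2 rows and 2 columns of padding in total
shape = strided_output_shape(3, 3, 2, 2, 2, 2, 3, 2)
print(shape)  # (2, 2)

Y_strided = corr2d_strided(X_padded, K, s_h=3, s_w=2)
print(Y_strided)  # [[0, 8], [6, 8]]
```

The windows start at rows 0 and 3 and at columns 0 and 2, so the output keeps only those four entries of the stride-1 result.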

## d. LeNet5

In [73]:
import torch
from torch import nn
from torch.nn import functional as F
class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 20, 5, 1)
        self.fc1 = nn.Linear(20*5*5, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        print("Input shape:", x.shape)
        x = self.conv1(x)
        print("Conv2d:", x.shape)
        x = F.relu(x)
        x = F.max_pool2d(x, 2, 2)
        print("MaxPool2d:", x.shape)
        x = self.conv2(x)
        print("Conv2d:", x.shape)
        x = F.relu(x)
        x = F.max_pool2d(x, 2, 2)
        print("MaxPool2d:", x.shape)
        x = x.view(-1, 20*5*5)
        print("Flatten:", x.shape)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        print(x.shape)
        # Softmax is monotonic, so it does not change which class argmax picks
        x = F.softmax(x, dim=1)
        return torch.argmax(x, dim=1)

In [74]:
input = torch.randn(1, 1, 32, 32)

In [75]:
lenet = LeNet5()

In [76]:
lenet(input)

Input shape: torch.Size([1, 1, 32, 32])
Conv2d: torch.Size([1, 20, 28, 28])
MaxPool2d: torch.Size([1, 20, 14, 14])
Conv2d: torch.Size([1, 20, 10, 10])
MaxPool2d: torch.Size([1, 20, 5, 5])
Flatten: torch.Size([1, 500])
torch.Size([1, 10])


tensor([4])
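The printed shapes follow from the stride formula with no padding, which also explains why `fc1` expects `20*5*5` input features. The following sketch traces the spatial size through the network in plain Python; `window_out` is a hypothetical helper, and the layer list is transcribed from the `LeNet5` class above.

```python
def window_out(size, kernel, stride, padding=0):
    """Output side length for one axis; `padding` is the total padding."""
    return (size - kernel + padding + stride) // stride


# (name, kernel size, stride) for each convolution and pooling step
layers = [("conv1", 5, 1), ("pool1", 2, 2), ("conv2", 5, 1), ("pool2", 2, 2)]

size = 32
trace = [("input", size)]
for name, kernel, stride in layers:
    size = window_out(size, kernel, stride)
    trace.append((name, size))

for name, side in trace:
    print(f"{name}: {side}x{side}")
# input: 32x32
# conv1: 28x28
# pool1: 14x14
# conv2: 10x10
# pool2: 5x5

# conv2 has 20 output channels, so flattening yields 20 * 5 * 5 features
flat_features = 20 * size * size
print(flat_features)  # 500
```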