# Convolutions for Images

Now that we understand how convolutional layers work in theory,
we are ready to see how they work in practice.
Building on our motivation of convolutional neural networks
as efficient architectures for exploring structure in image data,
we stick with **images** as our running example.

In [None]:
import torch
from torch import nn

# Implementation of two-dimensional Cross-correlation Operation
 - Though the proper name is *cross-correlation*, it is commonly called *convolution*

We want to write a `corr2d` function implementing this operation.  
Needed elements:
 - input tensor `X` (a greyscale image) with shape: $n_h\times n_w$
 - kernel tensor `K` for the convolution, with window: $k_h \times k_w$
   * of course its shape must fit within the image
 - $\Rightarrow$ output tensor `Y` with shape: $(n_h-k_h+1) \times (n_w-k_w+1)$
 - The formula for the cross-correlation between two matrices $X$ and $K$ follows directly from the explanation given in class:
 $$
 Y[i,j] = \sum_{s=0}^{k_h-1} \sum_{t=0}^{k_w-1} X[i+s,\,j+t]\cdot K[s,t]
 $$
 where the multi-index $(s,t)$ runs over the kernel window, while $(i,j)$ runs over the shape of `Y`.

In [None]:
def corr2d(X, K):
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

Example: let us construct an input tensor `X` and a kernel tensor `K`
to validate the cross-correlation operation


In [None]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)

tensor([[19., 25.],
        [37., 43.]])
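
As a sanity check, the result can be compared with PyTorch's own `torch.nn.functional.conv2d`, which, despite its name, also computes a cross-correlation. The sketch below repeats `corr2d` so that it runs on its own; the `[None, None]` indexing is only there because the built-in function expects a (batch, channels, height, width) layout, and the names `Y_manual` and `Y_builtin` are introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def corr2d(X, K):
    """Compute 2D cross-correlation (same definition as above)."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])

Y_manual = corr2d(X, K)
# Add batch and channel axes, call the built-in op, then drop the axes again
Y_builtin = F.conv2d(X[None, None], K[None, None])[0, 0]

print(torch.allclose(Y_manual, Y_builtin))  # True
```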

# Modules

We learned that `nn.Sequential` is a container of layers that are **chained** in cascade.

There is a more general and flexible container: a *PyTorch module*, namely the `nn.Module` class (not to be confused with the Python concept of a module, i.e. a library).

- A PyTorch module can contain other layers and modules as well, i.e. modules can be **nested** in a complicated manner
  - a `nn.Module` can be just one layer
- `nn.Sequential` is a subclass of `nn.Module`
  * it can be a sequence of modules
- price to pay for flexibility of `nn.Module`: writing more code
  * we are required to define a subclass implementing (the constructor `__init__` and) the `forward()` method
  * when a network is used in *forward computation*, the `forward()` method of the class is called
    - for `nn.Sequential` PyTorch suitably wires outputs/inputs of the chained layers automatically
    - while for a `nn.Module` instance the `forward()` method must be **implemented manually** by the programmer
- A PyTorch module can be used to implement a *block*, which loosely stands for a group of layers that recurs several times within the whole architecture, as in certain modern deep network architectures
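
The points above can be sketched with a small custom block that is then reused inside an `nn.Sequential`. The class name `LinearReLUBlock` and the layer sizes are hypothetical choices made only for this illustration.

```python
import torch
from torch import nn

class LinearReLUBlock(nn.Module):  # hypothetical block, not part of PyTorch
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.act = nn.ReLU()

    def forward(self, x):
        # The wiring between sub-layers is written by hand here
        return self.act(self.linear(x))

# Modules nest: nn.Sequential (itself an nn.Module) chains two custom blocks
net = nn.Sequential(
    LinearReLUBlock(4, 8),
    LinearReLUBlock(8, 8),
    nn.Linear(8, 2),
)

x = torch.rand(3, 4)  # batch of 3 examples with 4 features each
out = net(x)
n_params = sum(p.numel() for p in net.parameters())
print(out.shape)                   # torch.Size([3, 2])
print(isinstance(net, nn.Module))  # True
```

Calling `net(x)` triggers the `forward()` method of each nested module in turn.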

## Convolutional Layers

We have the `corr2d` function implementing the 2D convolution.

This is enough for constructing a **convolutional layer**
- we need to suitably define a subclass of the PyTorch module class
- `nn.Parameter` lets us define the intrinsic parameters of the layer, e.g. the kernel's weights

In [None]:
class Conv2D(nn.Module): # Conv2D inherits from nn.Module class
    # constructor: we need to initialize Conv2D object using the superclass' constructor first
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1)) # not very useful in this specific example, but convolutional layers may have an additive bias

    # forward() method needed in order to implement network's forward computation
    def forward(self, x):
        return corr2d(x, self.weight) + self.bias

In an $h \times w$ convolution,
or with an $h \times w$ convolution kernel,
the height and width of the convolution kernel are $h$ and $w$, respectively.
We also refer to
a convolutional layer with an $h \times w$
convolution kernel simply as an $h \times w$ convolutional layer.
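
As a quick illustration, the sketch below instantiates a $1 \times 2$ convolutional layer from the `Conv2D` class and inspects its output shape and parameters. Both `corr2d` and `Conv2D` are repeated so that the snippet runs on its own; the variable name `layer` and the all-ones input are assumptions made for the example.

```python
import torch
from torch import nn

def corr2d(X, K):
    """Compute 2D cross-correlation (same definition as above)."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias

layer = Conv2D(kernel_size=(1, 2))  # a 1x2 convolutional layer
X = torch.ones((6, 8))
Y = layer(X)                        # calls forward() under the hood

print(Y.shape)  # torch.Size([6, 7])
for name, p in layer.named_parameters():
    print(name, tuple(p.shape))     # weight (1, 2), then bias (1,)
```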

## Elementary convolutional layer implementation: Object Edge Detection in Images
Let us detect the edge of an object in a "synthetic image"
by finding the **location of the pixel change**.

1. construct a $6\times 8$ pixel image: middle 4 columns' values are 0=black and the rest are 1=white

In [None]:
X = torch.ones((6, 8))
X[:, 2:6] = 0
X

tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.]])

2. construct a  $1 \times 2$ kernel `K`

In [None]:
K = torch.tensor([[1.0, -1.0]])

It behaves as an elementary vertical **edge detector**: when performing the cross-correlation operation with the input...
- if the **horizontally adjacent elements** are the **same**,
the output is 0
- otherwise, the output is $\neq 0$


3. perform cross-correlation
- edge from white to black will yield 1
- edge from black to white will yield -1

In [None]:
Y = corr2d(X, K)
Y

tensor([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
        [ 0.,  1.,  0.,  0.,  0., -1.,  0.]])

Does this particular convolutional kernel detect **only vertical** edges in an image?
 - the kernel compares horizontally adjacent pixels only, so its directionality suggests that it does
 - to check, transpose the image, so that the edges become horizontal
   * then apply the same kernel

In [None]:
corr2d(X.t(), K)

tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])

This kernel simply computes the difference between two horizontally adjacent entries: in numerical mathematics we would say that it computes a *finite difference*, which (up to sign and grid spacing) is the numerical approximation of a first-order directional derivative.
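
The finite-difference reading can be checked numerically. The sketch below compares `corr2d(X, K)` with `torch.diff` (a PyTorch function not used elsewhere in this section) and also shows that the transposed kernel `K.t()` picks up the horizontal edges of the transposed image; `corr2d` is repeated so that the snippet is self-contained.

```python
import torch

def corr2d(X, K):
    """Compute 2D cross-correlation (same definition as above)."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = torch.ones((6, 8))
X[:, 2:6] = 0
K = torch.tensor([[1.0, -1.0]])

Y = corr2d(X, K)
# corr2d gives X[i, j] - X[i, j+1]; torch.diff gives X[i, j+1] - X[i, j]
print(torch.equal(Y, -torch.diff(X, dim=1)))  # True

# The 2x1 kernel K.t() differences vertically adjacent pixels instead,
# so it responds to the horizontal edges of the transposed image
Y_t = corr2d(X.t(), K.t())
print(torch.equal(Y_t, Y.t()))  # True
```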

## Using built-in Convolutional Layer: learn a kernel
We had the synthetic image `X`, the given kernel `K`, and then calculated the output `Y`

**Goal**: let us try to learn the kernel for vertical edge detection, based on examples: (`X`, `Y`)

1. construct a convolutional layer
2. initialize the kernel randomly
  - PyTorch has a default method for randomly initializing the parameters in convolutional (and linear) layers. You can read the details here: https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html
3. compute the error wrt the known edges
4. update the kernel using the gradient


Instead of manually implementing a **2D convolutional layer**, we can rely on the one PyTorch provides:
 - PyTorch's built-in class `nn.Conv2d`
   * arguments: # input channels, # output channels, kernel size, stride (default: 1), padding (default: 0), bias (default: True), ...  
   https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html
   * images as input to such 2D convolutional layer: 4-dimensional tensor with shape (batch size, # channels, height, width)

In [None]:
# Construct a 2D convolutional layer: 1 input channel, 1 output channel, 1x2 kernel
# For the sake of simplicity, we disable the bias here
conv2d = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)

# 1 example per batch
# 1 channel in the images
# 6x8 images
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))

# This loop runs 10 iterations of gradient descent by hand, with step size (learning rate) = 0.03
for i in range(10):
    Y_hat = conv2d(X)
    l = (Y_hat - Y) ** 2
    conv2d.zero_grad()
    l.sum().backward() # This calls the backpropagation, calculating the gradient(s) only
    # Then update the kernel weights
    conv2d.weight.data[:] -= 3e-2 * conv2d.weight.grad
    if (i + 1) % 2 == 0:
        print(f'batch {i + 1}, loss {l.sum():.3f}')

batch 2, loss 13.581
batch 4, loss 4.422
batch 6, loss 1.620
batch 8, loss 0.632
batch 10, loss 0.253


Note that the error has dropped to a small value after 10 iterations. Now we will take a look at the kernel tensor we learned.


- Learnt kernel weights:

In [None]:
conv2d.weight.data.reshape((1, 2))

tensor([[ 1.0403, -0.9371]])

Indeed, the learned kernel tensor is remarkably close
to the kernel tensor `K` we defined earlier.
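
The manual weight update can also be delegated to an optimizer. The following is a minimal sketch, assuming `torch.optim.SGD` with the same learning rate and 100 iterations instead of 10; the fixed seed and the use of `nn.functional.conv2d` to build the target `Y` are choices made here only to keep the snippet self-contained and repeatable.

```python
import torch
from torch import nn

torch.manual_seed(0)  # assumption: fixed seed, only for repeatability

X = torch.ones((6, 8))
X[:, 2:6] = 0
X = X.reshape((1, 1, 6, 8))
K = torch.tensor([[1.0, -1.0]])
# Target edges, computed with the built-in cross-correlation
Y = nn.functional.conv2d(X, K.reshape((1, 1, 1, 2)))

conv2d = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)
optimizer = torch.optim.SGD(conv2d.parameters(), lr=3e-2)

for i in range(100):
    loss = ((conv2d(X) - Y) ** 2).sum()
    optimizer.zero_grad()  # clear gradients from the previous iteration
    loss.backward()        # compute gradients
    optimizer.step()       # plain gradient-descent update of the kernel

learned = conv2d.weight.detach().reshape((1, 2))
print(learned)  # approaches the hand-designed kernel [[1, -1]]
```

With plain SGD (no momentum, no weight decay), `optimizer.step()` performs the same update as the manual `weight.data[:] -= lr * weight.grad` line.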


## Feature Map and Receptive Field
*Feature map* of the 2D convolution: the result of the 2D convolution of an image with a kernel

*Receptive field* of **one** entry in the feature map: the patch in the input of the convolutional layer that is used for calculating that entry
 - for a single convolutional layer: shape of a receptive field = shape of the kernel
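
One way to make the receptive field visible is to ask autograd which input pixels influence a given output entry. The sketch below is an illustration under stated assumptions: it uses `torch.nn.functional.conv2d` with an all-ones $2 \times 2$ kernel (so every pixel in the field gets a nonzero gradient) on a random $5 \times 5$ input, and it also stacks two such layers to show the field growing with depth.

```python
import torch
import torch.nn.functional as F

X = torch.rand(1, 1, 5, 5, requires_grad=True)
K = torch.ones(1, 1, 2, 2)  # all-ones kernel: nonzero gradient over the whole field

# One layer: entry (1, 2) of the feature map
Y1 = F.conv2d(X, K)
Y1[0, 0, 1, 2].backward()
field1 = X.grad[0, 0] != 0
print(field1.nonzero().tolist())  # [[1, 2], [1, 3], [2, 2], [2, 3]]

# Two stacked layers: entry (1, 1) of the second feature map
X.grad = None  # reset the gradient accumulated by the first backward pass
Y2 = F.conv2d(F.conv2d(X, K), K)
Y2[0, 0, 1, 1].backward()
field2 = X.grad[0, 0] != 0
print(int(field2.sum()))  # 9, i.e. a 3x3 patch of the input
```

For one layer the field is a $2 \times 2$ patch, matching the kernel shape; after two layers each entry depends on a $3 \times 3$ patch of the original input.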





## Summary

* The core computation of a two-dimensional convolutional layer is a two-dimensional cross-correlation operation. In its simplest form, this performs a cross-correlation operation on the two-dimensional input data and the kernel, and then adds a bias.
* We can design a kernel to detect edges in images.
* We can learn the kernel's parameters from data.
* Because kernels are learned from data, the outputs of convolutional layers are unaffected by which operation the layers perform (strict convolution or cross-correlation): the learned kernel simply comes out flipped in one case relative to the other.
* When any element in a feature map needs a larger receptive field to detect broader features on the input, a deeper network can be considered.

## Exercises

1. Construct an image `X` with diagonal edges.
    1. What happens if you apply the kernel `K` in this section to it?
    1. What happens if you transpose `X`?
    1. What happens if you transpose `K`?
1. How do you represent a cross-correlation operation as a matrix multiplication by changing the input and kernel tensors?
1. Design some kernels manually.
    1. What is the form of a kernel for the ("discretized" version of the) second derivative?
    1. What is the kernel for (the "discrete" version of) an integral?
    1. What is the minimum size of a kernel to obtain a ("discrete" version of the) derivative of order *d*?

## Solutions below

1. Construct an image `X` with diagonal edges.
    1. What happens if you apply the kernel `K` in this section to it?

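    It detects the diagonal edges, because a diagonal edge also produces a change between horizontally adjacent pixels.
    1. What happens if you transpose `X`?

    For a symmetric image (e.g. `torch.eye(n)`) nothing changes, since `X.t()` equals `X`; in general the diagonal edges are still detected.
    1. What happens if you transpose `K`?

    The diagonal edges are still detected, now through differences between vertically adjacent pixels; the result `corr2d(X, K.t())` equals `corr2d(X.t(), K).t()`.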
1. How do you represent a cross-correlation operation as a matrix multiplication by changing the input and kernel tensors?

  Vectorize $K$ into a column vector $K'$ and build a matrix $X'$ whose rows are the vectorized patches of $X$ having the shape of $K$. Then compute the matrix product $X'K'$, whose entries are the values of the resulting feature map, listed in the order in which the patches were extracted.
1. Design some kernels manually.
    1. What is the form of a kernel for the ("discretized" version of the) second derivative?

    The second-order central difference gives the kernel `[[1, -2, 1]]`. See 2nd order differences: https://en.wikipedia.org/wiki/Finite_difference
    1. What is the kernel for (the "discrete" version of) an integral?

    The answer is simple for integrals of one-variable functions; extending it to double integrals is conceptually feasible but technically challenging. See the Trapezoid Rule for numerically approximating a (single) integral. Assume that the values in the vector $X$ are values sampled from a function $f$ at some points.
    1. What is the minimum size of a kernel to obtain a ("discrete" version of the) derivative of order *d*?

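    A finite difference of order *d* needs at least *d* + 1 sample points, so the kernel needs at least *d* + 1 entries along the direction of differentiation. See higher order differences: https://en.wikipedia.org/wiki/Finite_difference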
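
The matrix-multiplication answer to exercise 2 can be sketched as follows. This is one possible realization, assuming `torch.nn.functional.unfold` to extract the patches; the names `X_prime` and `K_prime` mirror $X'$ and $K'$ in the solution text, and the tensors are the ones from the first example of this section.

```python
import torch
import torch.nn.functional as F

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
h, w = K.shape

# F.unfold returns shape (batch, h*w, number_of_patches): one column per patch
patches = F.unfold(X[None, None], kernel_size=(h, w))[0]
X_prime = patches.t()       # one vectorized patch per row
K_prime = K.reshape(-1, 1)  # kernel as a column vector

# A single matrix product, reshaped back to the feature-map layout
Y = (X_prime @ K_prime).reshape(X.shape[0] - h + 1, X.shape[1] - w + 1)
print(Y)
# tensor([[19., 25.],
#         [37., 43.]])
```

The result matches the output of `corr2d(X, K)` computed at the beginning of the section.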