# Convolutional Neural Networks

Recall that in previous notebooks we flattened images into long vectors and fed them into MLPs. This is unsatisfying since it discards spatial information: an MLP treats its inputs as an unordered collection of features, so applying the same fixed permutation to the pixels of every image would leave it with an equally easy learning problem. It would be nice to build our prior knowledge that nearby pixels are related into the structure of our networks.

For tasks such as object detection and classification, we can abstract away two desirable properties: 1) **translation invariance** and 2) **locality**. For the former, the output of the network should not depend on where the object is placed in the image. For the latter, we want early layers of the network to focus on local regions of the image, creating representations that will later be aggregated to make image-level predictions, perhaps by an MLP classification subnetwork.

## Convolution layer

In this section, we look at ways to modify the action of a dense layer but with the above two constraints enforced. Consider a black MNIST digit, with a large enough white background so that we can translate the digit around. Our input image $\mathbf X$ and its hidden representation $\mathbf H$ will be assumed to have corresponding pixels which will be indexed by $ij.$ Thus, spatial information in the input will be passed on to the hidden representation. 

The idea of translation invariance is that if we have a digit centered on $(i, j)$, then our layer should respond in the same way to the same digit centered on $(i+\Delta_i, j+\Delta_j).$ One way to do this is to have a **filter** just large enough to cover the digit, which we slide along the input image to generate the hidden representation of the image. Thus, it does not matter where the digit is in the image, since we go through each region of the image *using the same filter*. To enforce locality, we limit the filter to extend at most $\Delta$ pixels from its center in each spatial direction, so that it has size $(2\Delta + 1) \times (2\Delta + 1).$ Thus, we have the following operation:

$$
\begin{aligned}
{\mathbf H}_{ij} 
&= u + \sum_{a = -\Delta}^{\Delta} \sum_{b=-\Delta}^{\Delta} {\mathbf K}_{ab}\, {\mathbf X}_{i + a, j + b}.
\end{aligned}
$$

Note that the bias $u$ is independent of $ij$, and that we take a dot product between the kernel $\mathbf K$ and a local patch of the input $\mathbf X$ centered on $ij.$ This operation is called a [convolution](https://en.wikipedia.org/wiki/Convolution) (strictly speaking, a cross-correlation, since the kernel is not flipped). Writing the operation this way reduces the number of parameters by several orders of magnitude: a dense layer between a $28 \times 28$ input and a hidden representation of the same size needs $784^2 \approx 6 \times 10^5$ weights, whereas a $5 \times 5$ filter needs only 25. Moreover, to enforce locality, $\Delta$ is typically much smaller than the input image resolution.
One important consequence of this is **parameter sharing** between output neurons. This greatly improves the ability of networks to learn, since the number of parameters is significantly reduced. From the perspective of learning, weights that detect an object or some aspect of it, such as an edge, at one location in the image should also be useful at any other location, due to the translation-invariant structure of images, and hence should be shared by all output neurons.
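
To make the formula concrete, here is a minimal pure-Python sketch of the single-channel operation above. The helper names `conv2d_single` and `shift` are hypothetical, and treating pixels outside the image as zeros is an assumption of this sketch rather than part of the formula. It checks that translating the input translates the hidden representation by the same amount, which is the sense in which one shared filter handles a digit regardless of its position.

```python
def conv2d_single(X, K, u=0.0):
    """H[i][j] = u + sum_{a,b} K[a][b] * X[i+a][j+b] for a, b in [-delta, delta].

    Pixels outside X are treated as zeros (an assumption of this sketch).
    """
    n, m = len(X), len(X[0])
    delta = len(K) // 2  # K has shape (2*delta + 1, 2*delta + 1)
    H = [[u] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for a in range(-delta, delta + 1):
                for b in range(-delta, delta + 1):
                    if 0 <= i + a < n and 0 <= j + b < m:
                        H[i][j] += K[a + delta][b + delta] * X[i + a][j + b]
    return H


def shift(X, di, dj):
    """Translate image X by (di, dj), filling vacated pixels with zeros."""
    n, m = len(X), len(X[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if 0 <= i - di < n and 0 <= j - dj < m:
                out[i][j] = X[i - di][j - dj]
    return out


# A tiny "digit": two bright pixels on a 7 x 7 zero background.
X = [[0.0] * 7 for _ in range(7)]
X[2][2] = X[2][3] = 1.0

# A 3 x 3 filter (delta = 1) that responds to horizontal intensity changes.
K = [[1.0, 0.0, -1.0],
     [2.0, 0.0, -2.0],
     [1.0, 0.0, -1.0]]

H = conv2d_single(X, K)
H_shifted = conv2d_single(shift(X, 1, 2), K)

# Shifting the input and then filtering equals filtering and then shifting.
print(H_shifted == shift(H, 1, 2))
```

The check only holds exactly while the digit and its filter response stay away from the image border, which is why the example keeps the bright pixels in the interior.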

### Adding channels

Images have channels (e.g. RGB images have three), and we want to be able to capture that. Moreover, analogous to having more than one neuron in the next hidden layer, we also add output channels. We can imagine each channel as capturing a different aspect of the image, much like each neuron in a hidden layer of an MLP learns a long vector which we can interpret as a global filter for capturing large-scale aspects of the input vector. Recall that we measure similarity with the kernel using the dot product. Our convolution operation therefore looks like:

$$
\begin{aligned}
{\mathbf H}_{ij,\, c_\text{out}} 
&= u_{c_\text{out}} + \sum_{c_\text{in}} \sum_{a = -\Delta}^{\Delta} \sum_{b=-\Delta}^{\Delta} {\mathbf K}_{ab,\,{c_\text{in}},\,{c_\text{out}}} \, {\mathbf X}_{i + a,\, j + b,\, c_\text{in}}.
\end{aligned}
$$

So the first two indices navigate the spatial dimensions of the input image. This is done for each input channel, hence the third index, and the contributions of all input channels are summed.
Finally, we have one such 3-dimensional tensor for each output channel, so that the full weight tensor is 4-dimensional. Writing $k = 2\Delta + 1$ for the filter width, we can think of this geometrically as having $c_\text{out}$ filters of shape $(k, k, c_\text{in})$, which together form a $(k, k, c_\text{in}, c_\text{out})$ weight tensor. Note that we anchor everything on the indices $ij$ since our goal is to construct the output image by using the spatial information in the input image.  
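
As a quick sanity check on the shape of the weight tensor, the sketch below counts the learnable parameters of a convolution layer. The helper name `conv_param_count` is hypothetical, and using one bias per output channel is an assumption of this sketch (it matches the default of Keras' `Conv2D`). The example uses 96 filters of size 11 x 11 x 3, the first-layer configuration of AlexNet.

```python
def conv_param_count(k, c_in, c_out, bias=True):
    """Number of learnable parameters of a k x k convolution layer."""
    weights = k * k * c_in * c_out   # c_out filters, each of shape (k, k, c_in)
    biases = c_out if bias else 0    # assumption: one bias per output channel
    return weights + biases


# 96 filters of shape (11, 11, 3).
n_params = conv_param_count(k=11, c_in=3, c_out=96)
print(n_params)  # 34944
```

Note that the count does not depend on the spatial size of the input, since the same filters are reused at every position.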

```{figure} ../../img/kernels_cnn.jpeg
---

---
Filters learned by the first layer of AlexNet [[Krizhevsky et al.]](https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html). Each of the 96 filters shown here is of size 11x11x3 and is shared by the 55x55 neurons in its output channel. Note that the layer learns shared edge detectors.
```

```python
import tensorflow as tf
from tensorflow import keras as kr
from inefficient_networks.config import config as core_config

core_config.set_tensorflow_seeds(0)
# core_config.set_matplotlib()
```

Let us look at framework implementations of convolution layers. Consistent with the above equation, TensorFlow expects input images to be of shape `(B, H, W, c)` for a batch of `B` images of size `H x W` with `c` channels. Since these layers will be stacked, this is also the shape of the output of a convolutional layer.

```python
c_in = 1
c_out = 3

X = tf.random.normal(shape=(256, 28, 28, c_in))                 # (B, H_in, W_in, c_in)
conv_kr = kr.layers.Conv2D(filters=c_out, kernel_size=(5, 5))   # c_out many (5, 5, c_in) filters.

# Get out image, and check shapes
Y = conv_kr(X)
K = conv_kr.weights[0]
print(Y.shape)   # (B, H_out, W_out, c_out)
print(K.shape)   # (5, 5, c_in, c_out)
```

```
(256, 24, 24, 3)
(5, 5, 1, 3)
```


Note that the spatial shape of the output differs from that of the input. This is because, along each spatial dimension, the first pixel of the kernel can only be placed on pixels 1 through $28 - 5 + 1 = 24$ without the kernel leaving the image. Let us confirm that this Keras layer implements the above formula.

```python
import numpy as np

def conv(X, K):
    """Naive convolution (stride 1, no padding, no bias) for checking Conv2D."""
    k1, k2, c_in, c_out = K.shape
    B, H, W, _ = X.shape
    H_out, W_out = H - k1 + 1, W - k2 + 1   # output shape derived from the inputs

    KK = tf.reshape(K, (-1, c_out))         # (k1 * k2 * c_in, c_out)
    Y = np.zeros((B, H_out, W_out, c_out), dtype=np.float32)
    for i in range(H_out):
        for j in range(W_out):
            # Flatten each patch in (k1, k2, c_in) order so that it lines up with KK.
            XX = tf.reshape(X[:, i:i+k1, j:j+k2, :], (B, -1))
            Y[:, i, j, :] = (XX @ KK).numpy()
    return tf.constant(Y)


# The bias of the Keras layer is initialized to zeros, so it can be ignored here.
mean_absolute_error = tf.reduce_mean(tf.abs(conv(X, K) - conv_kr(X)))
print(mean_absolute_error)
```

```
tf.Tensor(4.974943e-08, shape=(), dtype=float32)
```


### Receptive field

For any element (or pixel) `h` of a hidden layer, the **receptive field** refers to all the elements (from all the previous layers) that may affect the calculation of `h` during forward propagation. We have to take this into account, since otherwise some outputs may be unaffected by parts of the input. In particular, elements of the final classification layer should have a receptive field that contains the whole input image. Otherwise, some parts of the input cannot contribute to the prediction for that class. 
Ideally, we want large receptive fields, e.g. exponentially increasing with depth.

```{figure} ../../img/receptive_field.png
---
width: 30em
---
Source: [[Lin et al.](https://www.researchgate.net/figure/The-receptive-field-of-each-convolution-layer-with-a-3-3-kernel-The-green-area-marks_fig4_316950618)]
```
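
The size of the receptive field can be computed layer by layer. The following is a minimal sketch, assuming a plain stack of convolution or pooling layers without dilation; the function name `receptive_field` and the `(kernel_size, stride)` encoding of a layer are hypothetical choices made for this illustration.

```python
def receptive_field(layers):
    """Receptive field size, along one spatial dimension, of one output element.

    `layers` is a list of (kernel_size, stride) pairs ordered from input to output.
    """
    r = 1     # an input pixel only sees itself
    jump = 1  # distance, in input pixels, between adjacent elements of the current layer
    for k, s in layers:
        r += (k - 1) * jump  # each layer widens the field by (k - 1) steps of size `jump`
        jump *= s            # strides compound multiplicatively
    return r


# Three 3 x 3 convolutions with stride 1: the field grows linearly with depth.
rf_conv_only = receptive_field([(3, 1)] * 3)
print(rf_conv_only)  # 7

# Alternating 3 x 3 convolutions with 2 x 2, stride-2 pooling: much faster growth.
rf_with_pooling = receptive_field([(3, 1), (2, 2)] * 3)
print(rf_with_pooling)  # 22
```

With stride-1 convolutions alone, each layer adds only `k - 1` pixels, so strided or pooling layers are what let the receptive field cover the whole image in a modest number of layers.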

### Improving convolutions

In this section we look at techniques used to improve the performance of convolutional networks.

<br>

**Padding.** A convolution can only be evaluated where the kernel lies entirely within the input domain. This is often undesirable and inconvenient, as it couples the network architecture to the input size. Moreover, border pixels are covered by fewer kernel positions than central pixels, so convolutions are biased towards the center, resulting in "information loss" at the borders. One way to fix this is to simply pad the edges with zeros, i.e. replacing `?` with `0` in the figure below.


```{figure} ../../img/padding.png
---
width: 30em
---
```
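
Zero padding itself is a simple operation. Below is a minimal pure-Python sketch on a nested-list image; the helper name `zero_pad` is hypothetical, and padding the same amount `p` on all four sides is an assumption of this sketch.

```python
def zero_pad(X, p):
    """Surround the 2D image X (a list of rows) with p rows and columns of zeros."""
    width = len(X[0]) + 2 * p
    top = [[0] * width for _ in range(p)]
    middle = [[0] * p + list(row) + [0] * p for row in X]
    bottom = [[0] * width for _ in range(p)]
    return top + middle + bottom


X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

X_padded = zero_pad(X, 1)
for row in X_padded:
    print(row)
```

After padding with `p = 1`, a 3 x 3 kernel can be centered on every pixel of the original image, including those on the border.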




<br>

**Stride.** Stride is the number of pixels by which the kernel moves at each step as it slides over the input image. If objects are large relative to the dimensions of the image, then it can be beneficial to use a larger stride. In fact, AlexNet uses a stride of 4 in its first layer, since objects in ImageNet images tend to occupy many pixels. Moreover, since different objects are typically separated by a few pixels, the size of the kernel can be relatively large at the input layer, e.g. 11 x 11 for AlexNet.

Observe that the spatial dimensions of the output are directly influenced by the padding $p$ and stride $s$. Let the input image have size $w$ along a spatial dimension and let the kernel have size $k$ in that dimension; then the corresponding spatial dimension of the output has size $\left\lfloor ({w+2 p-k} ) / {s} \right\rfloor + 1.$ For odd-sized filters and $s = 1$, we can use $2p = k-1$ so that the output has the same spatial dimension as the input.
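
The output-size formula is easy to check numerically. The sketch below uses a hypothetical helper `conv_output_size`; the 227-pixel input in the last example is an assumption, chosen because it reproduces the 55 x 55 output of an 11 x 11 kernel with stride 4.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output size along one spatial dimension: floor((w + 2p - k) / s) + 1."""
    return (w + 2 * p - k) // s + 1


# 5 x 5 kernel on a 28-pixel input, no padding, stride 1.
print(conv_output_size(28, 5))         # 24

# Same kernel with 2p = k - 1 = 4: the output keeps the input size.
print(conv_output_size(28, 5, p=2))    # 28

# 11 x 11 kernel with stride 4 on an (assumed) 227-pixel input.
print(conv_output_size(227, 11, s=4))  # 55
```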

<br>

**Pooling.** Downsampling, i.e. reducing the spatial dimensions of the output, can be achieved by a pooling operation. This is an important operation as it increases the receptive field of elements in later layers. Pooling requires no parameters and typically takes either the **max** or the **mean** of the pixels covered by its filter.
A pooling layer operates like a convolutional layer in that it acts on local regions of the input and can have stride and padding. Hence, we specify a kernel size. But unlike the convolution operation, pooling is non-parametric: it only applies a fixed reduction, such as the maximum, to each region. 

```{figure} ../../img/pooling.png
---
width: 25em
---

Max pooling with `k=2` and `s=2`.
```

Note that pooling is applied to each channel separately, as it doesn't make sense to combine values from different channels, so the number of channels is the same in the input and the output. Max pooling works well with the ReLU activation, since all activations are then nonnegative. Also, max pooling provides invariance to small translations of the input at the cost of some information loss (which may be good for the network). Finally, although pooling has no parameters of its own, by shrinking the feature maps it significantly reduces the number of parameters in later dense layers, which helps mitigate overfitting.

In practice, there are only two commonly used settings: overlapping pooling with `k=3` and `s=2` which results in little information loss, and more commonly `k=2` and `s=2` so that the pooling regions are nonoverlapping. Note that larger kernel sizes can be too aggressive.
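
The pooling operation described above can be sketched in a few lines of pure Python for a single channel. The names `pool2d` and `mean` are hypothetical helpers for this illustration, and the sketch assumes no padding.

```python
def pool2d(X, k=2, s=2, reduce=max):
    """Apply `reduce` to every k x k window of the 2D image X, moving with stride s."""
    n, m = len(X), len(X[0])
    out = []
    for i in range(0, n - k + 1, s):
        row = []
        for j in range(0, m - k + 1, s):
            window = [X[i + a][j + b] for a in range(k) for b in range(k)]
            row.append(reduce(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


X = [[1, 3, 2, 4],
     [5, 6, 7, 8],
     [3, 2, 1, 0],
     [1, 2, 3, 4]]

# Non-overlapping max pooling with k=2 and s=2.
print(pool2d(X))               # [[6, 8], [3, 4]]

# Mean pooling uses the same windows with a different reduction.
print(pool2d(X, reduce=mean))  # [[3.75, 5.25], [2.0, 2.0]]
```

For an input with several channels, the same function would be applied to each channel independently.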

## LeNet