# From Fully Connected Layers to Convolutions

Image data is stored in a 2D grid of pixels, whether the image is monochromatic, full-colour, or even hyper-spectral. In previous examples, we flattened this 2D array into a linear vector, because that was what our models demanded. It is pretty unsatisfying to throw away all that spatial information just to feed our 1D fully-connected input. In theory, because a fully-connected layer does not care about the order of its inputs, we could even have applied the same fixed permutation to the pixels of every image and still achieved the same accuracy.
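To make the permutation point concrete, here is a minimal sketch in plain Python. The `dense` helper and the tiny sizes are made up for illustration; it only checks that a single fully-connected layer gives the same output when the inputs and the matching weight columns are reordered in the same way, which is why the pixel order carries no meaning for such a model.

```python
import random

def dense(weights, bias, x):
    """Fully-connected layer: out[r] = bias[r] + sum_c weights[r][c] * x[c]."""
    return [b + sum(w * v for w, v in zip(row, x)) for row, b in zip(weights, bias)]

random.seed(0)
n_in, n_out = 6, 3  # e.g. a flattened 2x3 "image" feeding 3 hidden units

x = [random.random() for _ in range(n_in)]
W = [[random.random() for _ in range(n_in)] for _ in range(n_out)]
b = [random.random() for _ in range(n_out)]

# A fixed, arbitrary reordering of the input pixels.
perm = list(range(n_in))
random.shuffle(perm)

x_perm = [x[p] for p in perm]                    # permuted input
W_perm = [[row[p] for p in perm] for row in W]   # same permutation on weight columns

out = dense(W, b, x)
out_perm = dense(W_perm, b, x_perm)

# The two outputs agree up to floating-point rounding.
max_diff = max(abs(u - v) for u, v in zip(out, out_perm))
print(max_diff < 1e-12)
```

Training would simply find `W_perm` instead of `W`, so nothing in the model knows which pixels were neighbours.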

Here, we begin to discuss convolutional neural networks (CNNs), which are designed to exploit the spatial information in data types like images, making use of the fact that pixels which are close to one another in space are related. CNNs are ubiquitous in the field of computer vision.

Modern CNNs are popular because they are computationally efficient: they require fewer parameters than fully-connected networks and are easy to parallelize across GPU cores. They have even found numerous applications outside of computer vision.

First, we cover in more depth the motivation behind CNNs. Then we get into the nitty-gritty details of the implementation: the convolutional layers themselves, padding, stride, the pooling layers used to aggregate information across adjacent spatial regions, the use of multiple channels in each layer, and some discussion of the structure of modern architectures. We'll conclude with a full implementation of LeNet, the first convolutional network successfully deployed.

# From Fully Connected Layers to Convolution

The fully connected MLP approaches we have discussed up 'til now remain valid when the data are tabular, i.e. when we don't assume any structure relating the features _a priori_.

However, amongst other reasons, fully connected architectures become extremely unwieldy for certain types of data. Imagine using 1 megapixel images for image recognition with 1000 units in a hidden layer. This would require a fully-connected layer characterised by $10^6 \times 10^3 = 10^9$ parameters.
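The arithmetic above is easy to check. The helper below is a hypothetical name, not part of any library, and the memory figure assumes 4-byte (float32) weights, which is an assumption made here rather than something these notes state.

```python
def dense_param_count(n_inputs, n_hidden, with_bias=False):
    """Number of parameters in one fully-connected layer."""
    weights = n_inputs * n_hidden
    return weights + (n_hidden if with_bias else 0)

megapixel = 1000 * 1000          # 10^6 input pixels
n_hidden = 1000                  # 10^3 hidden units

n_weights = dense_param_count(megapixel, n_hidden)
n_params_with_bias = dense_param_count(megapixel, n_hidden, with_bias=True)

# Assumption: each weight is stored as a 4-byte float32.
bytes_float32 = n_weights * 4

print(n_weights)        # 1000000000
print(bytes_float32)    # 4000000000, i.e. about 4 GB for a single layer
```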

Convolutional neural networks are designed to get around this, while at the same time making use of the spatial relationship between pixels.

## Invariance

We want our model to recognise objects regardless of where in the image they are found. This is termed _spatial invariance_, and CNNs formalize this idea.

1. In the earliest layers of our network, we should see the same response to a particular "patch" of image, regardless of where it appears in the image. This is _translational invariance_.
2. The earliest layers of the network should focus on local regions, and not be much affected by areas far away in the image. This is known as the _locality_ principle.
3. As we proceed to deeper layers, the model should be able to capture longer-range features in the image.

## Constraining the MLP

Start off with two-dimensional images $\mathbf{X}$ as inputs and their hidden representations $\mathbf{H}$, where both $\mathbf{X}$ and $\mathbf{H}$ have the same shape. Let's also imagine that not just the inputs, but also the hidden representations have spatial structure.

Let $[\mathbf{X}]_{i, j}$ and $[\mathbf{H}]_{i, j}$ denote the pixel at location $(i, j)$ in the input image and in the hidden representation, respectively. To have each of the hidden units receive input from each of the input pixels, we would expand our weights from a matrix (a 2nd-order tensor) to a 4th-order tensor $\mathsf{W}$, with biases $\mathbf{U}$. We could represent the values of the hidden layer by:

$$ [\mathbf{H}]_{i, j} = [\mathbf{U}]_{i, j} + \sum_k \sum_l [\mathsf{W}]_{i, j, k, l} [\mathbf{X}]_{k, l} $$

$$ [\mathbf{H}]_{i, j} = [\mathbf{U}]_{i, j} + \sum_a \sum_b [\mathsf{V}]_{i, j, a, b} [\mathbf{X}]_{i + a , j + b} $$

Here, for now, the swap from $\mathsf{W}$ to $\mathsf{V}$ is entirely cosmetic: we simply reindex so that $k = i + a$ and $l = j + b$, i.e. $[\mathsf{V}]_{i, j, a, b} = [\mathsf{W}]_{i, j, i + a, j + b}$. The indices $a$ and $b$ run over both positive and negative offsets and cover the entire image. For any location $(i, j)$, we compute $[\mathbf{H}]_{i, j}$ by summing over pixels in the input, centered around $(i, j)$ and weighted by $[\mathsf{V}]_{i, j, a, b}$. For a 1 megapixel image mapped to a hidden representation of the same size, this still requires $10^6 \times 10^6 = 10^{12}$ parameters, which is totally untenable.
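As a sanity check that the reindexing really is cosmetic, here is a small sketch on a tiny $3 \times 3$ "image" with random integer weights. The function names are invented for this example; `V(i, j, a, b)` is just a lookup into the same tensor `W` at position `(i + a, j + b)`.

```python
import random

random.seed(0)
n = 3  # tiny n x n image so the 4th-order tensor stays small (n**4 = 81 weights)

X = [[random.randint(-5, 5) for _ in range(n)] for _ in range(n)]
U = [[random.randint(-5, 5) for _ in range(n)] for _ in range(n)]
W = [[[[random.randint(-5, 5) for _ in range(n)] for _ in range(n)]
      for _ in range(n)] for _ in range(n)]

def hidden_from_W(i, j):
    """[H]_{i,j} = [U]_{i,j} + sum_k sum_l [W]_{i,j,k,l} [X]_{k,l}"""
    return U[i][j] + sum(W[i][j][k][l] * X[k][l] for k in range(n) for l in range(n))

def V(i, j, a, b):
    """Reindexed weights: [V]_{i,j,a,b} = [W]_{i,j,i+a,j+b}."""
    return W[i][j][i + a][j + b]

def hidden_from_V(i, j):
    """[H]_{i,j} = [U]_{i,j} + sum_a sum_b [V]_{i,j,a,b} [X]_{i+a,j+b}"""
    total = U[i][j]
    for a in range(-i, n - i):        # offsets with 0 <= i + a < n
        for b in range(-j, n - j):    # offsets with 0 <= j + b < n
            total += V(i, j, a, b) * X[i + a][j + b]
    return total

H_w = [[hidden_from_W(i, j) for j in range(n)] for i in range(n)]
H_v = [[hidden_from_V(i, j) for j in range(n)] for i in range(n)]

print(H_w == H_v)   # both indexings give the same hidden representation
print(n ** 4)       # number of weights grows as (number of pixels)^2
```

Note that the offsets `a` and `b` take both negative and positive values, exactly so that `(i + a, j + b)` can reach every pixel from every location.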

### Translation Invariance

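Invoking our requirement of translational invariance means that a shift in the input $\mathbf{X}$ should simply lead to a shift in the hidden representation $\mathbf{H}$. This is only possible if the model depends only on the offsets $(a, b)$ and not on the location $(i, j)$. So $\mathbf{U}$ and $\mathsf{V}$ cannot depend on $(i, j)$: we have $[\mathsf{V}]_{i, j, a, b} = [\mathbf{V}]_{a, b}$, and $\mathbf{U}$ is a constant, say $u$. We rewrite with this in mind: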

$$ [\mathbf{H}]_{i, j} = u + \sum_a \sum_b [\mathbf{V}]_{a, b} [\mathbf{X}]_{i + a , j + b} $$

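This is an example of a convolution. We are effectively weighting pixels at $(i + a, j + b)$, around a location $(i, j)$, with coefficients $[\mathbf{V}]_{a, b}$ to obtain the value of our hidden layer at location $(i, j)$. We have also substantially reduced the number of parameters, because $[\mathbf{V}]_{a, b}$ no longer depends on the location $(i, j)$: two of the four indices of $\mathsf{V}$ are gone. For the 1 megapixel example, where $a$ and $b$ each span roughly 2000 offsets, that is about $4 \times 10^6$ coefficients instead of $10^{12}$.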
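The "shift in, shift out" behaviour can be checked directly. The sketch below implements the formula above with one shared set of weights, stored as a dictionary keyed by offset $(a, b)$. Treating pixels outside the image as 0 is an assumption made here to keep the code short, and the `correlate` and `shift` helpers are names invented for this example.

```python
def correlate(X, V, u=0):
    """H[i][j] = u + sum over (a, b) of V[(a, b)] * X[i+a][j+b].

    Assumption: pixels outside the image count as 0.
    """
    rows, cols = len(X), len(X[0])
    H = [[u] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for (a, b), w in V.items():
                if 0 <= i + a < rows and 0 <= j + b < cols:
                    H[i][j] += w * X[i + a][j + b]
    return H

def shift(img, di, dj, fill):
    """Move the content of img down by di rows and right by dj columns."""
    rows, cols = len(img), len(img[0])
    return [[img[i - di][j - dj] if 0 <= i - di < rows and 0 <= j - dj < cols else fill
             for j in range(cols)] for i in range(rows)]

# One shared 3x3 set of weights, indexed by offsets a, b in {-1, 0, 1}.
V = {(a, b): 2 * a - b + 5 for a in (-1, 0, 1) for b in (-1, 0, 1)}
u = 7

# An 8x8 image that is zero except for a small patch away from the border.
X = [[0] * 8 for _ in range(8)]
X[2][2], X[2][3], X[3][2], X[3][3] = 1, 2, 3, 4

H = correlate(X, V, u)
H_of_shifted_X = correlate(shift(X, 2, 1, fill=0), V, u)

# Shifting the input by (2, 1) shifts the hidden representation by (2, 1).
# Locations that only see zeros respond with the bias u, hence fill=u.
print(H_of_shifted_X == shift(H, 2, 1, fill=u))
```

The equality holds here because the patch stays away from the image border before and after the shift; near the border, the zero-padding assumption would start to matter.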

### Locality

Invoking locality, we assume that we should not have to look very far from $(i, j)$ to understand what is going on at $[\mathbf{H}]_{i, j}$. This means that outside some range, i.e. whenever $|a| > \Delta$ or $|b| > \Delta$, we should set $[\mathbf{V}]_{a, b} = 0$.

Equivalently, we can rewrite $[\mathbf{H}]_{i,j}$ as:

$$ [\mathbf{H}]_{i,j} = u + \sum_{a = -\Delta}^\Delta \sum_{b = -\Delta}^\Delta [\mathbf{V}]_{a, b} [\mathbf{X}]_{i + a, j + b} $$

Here we have again reduced the number of parameters, from about $4 \times 10^6$ (the full range of offsets in a 1 megapixel image) to $(2\Delta + 1)^2$. With $\Delta$ typically smaller than 10, that is roughly another 4 orders of magnitude. The above equation is an example of a convolutional layer, and convolutional neural networks are the special family of neural networks that contain convolutional layers. In the deep learning community, $\mathbf{V}$ is called a convolution kernel, a filter, or simply the layer's weights, which are learnable parameters.
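Here is a minimal sketch of the local layer above, under the same assumption as before that pixels outside the image count as 0 (the notes get to padding properly later). The names `local_layer` and `kernel_param_count` are invented for this example, and the kernel is stored as a $(2\Delta + 1) \times (2\Delta + 1)$ nested list so that offset $(a, b)$ lives at index `[a + delta][b + delta]`.

```python
def local_layer(X, V, u, delta):
    """H[i][j] = u + sum_{a=-delta..delta} sum_{b=-delta..delta} V[a][b] * X[i+a][j+b].

    V is a (2*delta + 1) x (2*delta + 1) nested list; offset (a, b) is stored
    at V[a + delta][b + delta]. Assumption: pixels outside X count as 0.
    """
    rows, cols = len(X), len(X[0])
    H = [[u] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for a in range(-delta, delta + 1):
                for b in range(-delta, delta + 1):
                    if 0 <= i + a < rows and 0 <= j + b < cols:
                        H[i][j] += V[a + delta][b + delta] * X[i + a][j + b]
    return H

def kernel_param_count(delta):
    """Number of weights in a kernel that looks delta pixels in each direction."""
    return (2 * delta + 1) ** 2

delta = 1
V = [[1, 1, 1],
     [1, 1, 1],
     [1, 1, 1]]          # sums each pixel's 3x3 neighbourhood
X = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]

H = local_layer(X, V, u=0, delta=delta)
print(H[1][1])                  # 54 = 1+2+3+5+6+7+9+10+11
print(H[0][0])                  # 14 = 1+2+5+6 (the rest of the window is outside X)
print(kernel_param_count(1))    # 9
print(kernel_param_count(9))    # 361, versus 10**12 for the unconstrained layer
```

The parameter count depends only on $\Delta$, not on the image size, which is the whole point of the two constraints.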

We have reduced the number of parameters substantially here, but the price we pay is some measure of inductive bias, which comes in the form of our translational invariance and locality assumptions. If these assumptions do not hold for the data we are trying to model, then we may struggle to fit even the training data.

## Convolutions

Let's briefly review the mathematics of why these are called _convolutions_.

Or not, I'm too tired; let's come back to this later.

## Channels
