# Lab 4: Convolutional Neural Networks
When your input is shaped "like a grid" (think of audio as a 1-D grid, images as a 2-D grid, 3-D scans, ...), there's a good chance it has _local structure_: points closer in the grid are more related to each other than they are to points further away.
In images, for example, small groups of nearby pixels can form simple patterns like edges or corners, or more complex patterns like faces.
Pixels in the upper-left of an image are almost unrelated to pixels in the bottom-right corner, except that they're part of the same image.

Another observation is that these local patterns can show up anywhere in the image.
If you're trying to detect faces, it doesn't really matter where they are in the input.
This suggests that we should use the same structure to detect a given pattern everywhere in an image.

Convolutions let us take advantage of both of these observations.
Instead of connecting every part of the input to every part of the output in the same way, as dense layers do, convolutional layers learn small filters which recognize local patterns and "slide" them around the input.

# Convolutions
The heart of convolutional layers is an operation called "discrete convolution" which formalizes how we "slide filters around the grid."

## What does "like a grid" mean?
What this really means is that the input data has some kind of _ordering_ along at least one of its axes.
A 128x128-pixel color image is represented by a tensor of shape [128, 128, 3]: the first two axes pick out a pixel (row and column), and the third picks an RGB **channel**.
The channel axis (also sometimes called the depth axis) is not ordered, since red is no closer to green than it is to blue.
But the first two axes are ordered: pixel (1, 1) is closer to pixel (1, 2) than it is to (1, 3) or (3, 3).

Convolutional layers are designed to take advantage of this ordering, by assuming that close points are more related than further ones.
Convolutions will only act on the ordered axes.

## Kernels
Instead of a set of dense weights, convolutional layers learn a set of **kernels**, which are formatted like small patches of grid.
In 1-D, they're 1-dimensional grid segments, in 2-D they're little squares, and in 3-D they're volumes.

To compute the **activation** of a kernel at a particular location in the grid, place the kernel centered there, then compute the dot product of the kernel and the patch of input it covers: multiply the elements that line up and sum the results.

The full result of **convolving** a kernel over an input is the set of activations produced by "sliding" the kernel to all locations on the input, resulting in a grid laid out like the input (and the same size as the input, if the input is padded at its edges).
This is often called an **activation map**.
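
To make the "place, multiply, sum, slide" recipe concrete, here is a minimal 1-D sketch in plain Python. The function name `convolve_1d`, the example values, and the choice to treat positions past the ends of the input as zeros are all assumptions made for illustration. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve_1d(signal, kernel):
    """Slide `kernel` over `signal`, giving one activation per input position.

    Positions past either end of the signal count as zeros (an assumption
    made here so the activation map has the same length as the input).
    """
    half = len(kernel) // 2  # assumes an odd-length kernel, so it has a center
    activations = []
    for center in range(len(signal)):
        total = 0
        for offset, weight in enumerate(kernel):
            position = center + offset - half
            if 0 <= position < len(signal):
                total += weight * signal[position]
        activations.append(total)
    return activations


signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]
activation_map = convolve_1d(signal, kernel)
print(activation_map)  # [-2, -2, -2, -2, 4]
```

The last activation differs from the others only because the kernel hangs off the end of the input there and sees a zero instead of a real value.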

One really important thing to notice here is that _convolution preserves order_.
When you convolve a kernel over an image, the upper-left of the activation map corresponds to the upper-left of the image.
This means that convolutional layers can _preserve local structure_ even as you apply multiple convolutional layers in sequence.
Compare this with dense layers, which connect every output to every input and so lose all ordering information.
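
One way to see order preservation is to move a pattern in the input and check that its response moves by the same amount in the activation map. The sketch below assumes a hypothetical "bump" kernel and only evaluates positions where the kernel fits entirely inside the input (no padding).

```python
def convolve_1d(signal, kernel):
    """Activations at every position where the kernel fits inside the signal."""
    width = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, signal[start:start + width]))
        for start in range(len(signal) - width + 1)
    ]


kernel = [-1, 2, -1]                  # hypothetical "bump" detector
signal = [0, 0, 1, 0, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 0, 1, 0, 0]    # the same bump, moved 3 steps right

original_map = convolve_1d(signal, kernel)
shifted_map = convolve_1d(shifted, kernel)

print(original_map)  # [-1, 2, -1, 0, 0, 0]
print(shifted_map)   # [0, 0, 0, -1, 2, -1]

# The strongest response moves by exactly the same 3 steps as the bump did.
print(shifted_map.index(max(shifted_map)) - original_map.index(max(original_map)))  # 3
```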

Another important property is that convolutions find small local patterns.
You can think of the kernel as a "template".
Its activation on a patch of image indicates how well that patch matches the template. Large positive values mean high "similarity" (note that the dot product is not actually normalized), large negative values mean similar structure but opposite sign, and values near zero mean that the kernel doesn't structurally match the input there.
These "templates" are applied everywhere on the input, able to match the same pattern (e.g. an edge or a face) again and again.
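
The three cases (positive, negative, and near zero) show up clearly with a small hand-written template. This is only a sketch: the kernel `[-1, 0, 1]` is a hypothetical "rising edge" detector chosen for illustration, not a learned one, and the input is a made-up 1-D signal containing one rising and one falling edge.

```python
def convolve_1d(signal, kernel):
    """Activations at every position where the kernel fits inside the signal."""
    width = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, signal[start:start + width]))
        for start in range(len(signal) - width + 1)
    ]


rising_edge = [-1, 0, 1]                  # hypothetical template: low, then high
signal = [0, 0, 0, 1, 1, 1, 0, 0, 0]      # flat, rising edge, flat, falling edge, flat

activations = convolve_1d(signal, rising_edge)
print(activations)  # [0, 1, 1, 0, -1, -1, 0]
```

Positive activations line up with the rising edge, negative ones with the falling edge (the same structure with opposite sign), and the flat regions give zero.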

### 1D
![1-D convolution](https://files.realpython.com/media/njanakiev-1d-convolution.d7afddde2776.png)
(Image source: [Deep Learning with Python](https://realpython.com/asins/1617294438/))

### 2D
![2-D convolution gif](https://cdn-images-1.medium.com/max/1440/1*Zx-ZMLKab7VOCQTxdZ1OAw.gif)

In this animation, the input image is the blue grid on the left, the output is the teal grid on the right, and the kernel is the square grid of weights:
$$\begin{bmatrix}0 & 1 & 2 \\ 2 & 2 & 0 \\ 0 & 1 & 2 \end{bmatrix}$$
The activation at a particular location is shown by shading where the kernel is on the input and the resulting space in the output.

(Image source: ["A guide to convolution arithmetic for deep learning"](https://arxiv.org/abs/1603.07285))
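
The same computation can be sketched in plain Python using the kernel given above. The 5x5 input values here are an illustrative assumption rather than something read off the animation, and the sketch assumes no padding, so the kernel only visits positions where it fits entirely inside the input and the output is smaller than the input.

```python
def convolve_2d(image, kernel):
    """Slide `kernel` over every position where it fits entirely inside `image`."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Dot product of the kernel with the patch whose top-left corner is (r, c).
            row.append(sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(k_rows)
                for j in range(k_cols)
            ))
        output.append(row)
    return output


kernel = [
    [0, 1, 2],
    [2, 2, 0],
    [0, 1, 2],
]
image = [  # illustrative values (assumed)
    [3, 3, 2, 1, 0],
    [0, 0, 1, 3, 1],
    [3, 1, 2, 2, 3],
    [2, 0, 0, 2, 2],
    [2, 0, 0, 0, 1],
]

activation_map = convolve_2d(image, kernel)
for row in activation_map:
    print(row)
# [12, 12, 17]
# [10, 17, 19]
# [9, 6, 14]
```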

## Filters
Kernels act only on a single channel, and produce a single channel.
When dealing with multi-channel data (like RGB images, or multi-band satellite imagery), the output of a convolutional layer should be based on all of the channels together.
So, instead of using a kernel, we use a **filter**: a collection of kernels, one for each channel in the input's depth axis.

<img src="https://cdn-images-1.medium.com/max/1080/1*lRpx5pTrVewFTD8YXjhIKA.png" width=400>

(Image source: ["Intuitively understanding convolutions for deep learning"](https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1))

The activation of a filter at a point on an image is the sum of the activations of its kernels, each applied to its corresponding channel.
So, each filter takes in a grid of multiple channels and produces a new grid of a single channel (the activation map).
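
As a rough sketch of that sum, the code below computes one filter activation on a tiny 3-channel patch. The layout `image[row][col][channel]` mirrors the (pixel, pixel, channel) convention used in this lab; the function name `filter_activation` and all of the numbers are made up for illustration.

```python
def filter_activation(image, filt, row, col):
    """Activation of one filter with its top-left corner at (row, col).

    `image` is indexed [row][col][channel]; `filt` is indexed [i][j][channel].
    Returns the total and the per-channel (per-kernel) activations.
    """
    k_rows, k_cols, n_channels = len(filt), len(filt[0]), len(filt[0][0])
    per_kernel = []
    for ch in range(n_channels):
        # One kernel of the filter, applied to its own channel of the image.
        per_kernel.append(sum(
            filt[i][j][ch] * image[row + i][col + j][ch]
            for i in range(k_rows)
            for j in range(k_cols)
        ))
    return sum(per_kernel), per_kernel


# A 2x2 image with 3 channels and a 2x2x3 filter (all values assumed).
image = [
    [[1, 0, 2], [0, 1, 0]],
    [[3, 1, 1], [1, 0, 2]],
]
filt = [
    [[1, 0, -1], [0, 1, 0]],
    [[0, 1, 0], [1, 0, 1]],
]

total, per_kernel = filter_activation(image, filt, 0, 0)
print(per_kernel)  # [2, 2, 0]
print(total)       # 4
```

Three channels go in, but only a single number comes out for each location, which is why a filter's activation map has no channel axis of its own.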

To increase the capacity of a convolutional layer, we use multiple filters in parallel, analogous to the "width" or "number of units" hyperparameter for dense layers.
The activation maps of all the filters are stacked along a new channel axis, one channel per filter.

So, the first layer of a convolutional neural network that operates on 128x128-pixel RGB images and uses 64 5x5 filters has the following shapes:
 - Input images have shape [128, 128, 3] (pixel, pixel, channel)
 - Each kernel has shape [5, 5] (pixel, pixel)
 - Each filter has shape [5, 5, 3] (pixel, pixel, channel)
 - Each filter's activation map has shape [128, 128] (pixel, pixel)
 - The output of the layer has shape [128, 128, 64] (pixel, pixel, channel)
(assuming that the layer is padded so the activation maps are still 128x128 pixels)

Note: the next layer's filters would have shape [5, 5, 64] (assuming 5x5 convolutions).
Examples typically come in batches, so you can think of each shape also having a batch axis.
With a batch size of 8, for example, input image batches would have a shape of [8, 128, 128, 3].
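
The shape bookkeeping above can be written down as a small helper. This is a sketch under stated assumptions: `conv_layer_shapes` is a hypothetical function, not a library call; it assumes square kernels and padding that keeps the height and width unchanged; the weight count ignores any bias terms; and giving the second layer 64 filters is an assumption, since the text only fixes the shape of each of its filters.

```python
def conv_layer_shapes(input_shape, kernel_size, num_filters):
    """Shapes for one padded convolutional layer (height and width preserved)."""
    height, width, in_channels = input_shape
    return {
        "kernel": (kernel_size, kernel_size),
        "filter": (kernel_size, kernel_size, in_channels),
        "activation_map": (height, width),
        "output": (height, width, num_filters),
        # One weight per kernel entry, per input channel, per filter (no biases).
        "weights": kernel_size * kernel_size * in_channels * num_filters,
    }


first = conv_layer_shapes((128, 128, 3), kernel_size=5, num_filters=64)
print(first["filter"])   # (5, 5, 3)
print(first["output"])   # (128, 128, 64)
print(first["weights"])  # 4800

# The next layer sees 64 input channels, so each of its filters is 5x5x64.
second = conv_layer_shapes(first["output"], kernel_size=5, num_filters=64)
print(second["filter"])  # (5, 5, 64)

# Adding a batch axis in front, for a batch of 8 images.
batch_shape = (8,) + (128, 128, 3)
print(batch_shape)       # (8, 128, 128, 3)
```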

For a good visual summary, check out this diagram from ["A guide to convolution arithmetic for deep learning"](https://arxiv.org/abs/1603.07285): 
<img src="./images/convolution_channels.png" width=500>

# Convolutional neural networks
Since the outputs of a convolutional layer preserve ordering along ordered axes, and multiple filters result in the output having channels, _the outputs have the same format as the inputs_.
In the example above, an input 128x128 image with 3 channels is turned into an output 128x128 image with 64 channels.
This means that we can apply another convolutional layer to the output in order to build deep convolutional neural networks!

Much like when we stack dense layers, early convolutional layers capture simple patterns (like edges, corners, and textures) and later convolutional layers use patterns in those patterns to make higher-level features (like object parts and objects).

<img src="./images/feature_hierarchy.png" width=400>

(Image source: skymind.ai, ["A Beginner's Guide to Neural Networks and Deep Learning"](https://skymind.ai/wiki/neural-network))

This is great for the "feature extraction" part of a neural network, but how do we use the final activation maps to do classification or regression?
Often, the activation maps from the last convolutional layer are "flattened" into a single vector, which is then fed into a dense layer.
Usually one or two dense layers, followed by an output layer, are sufficient.

For example, if the output of the last convolutional layer of a model has shape [10, 10, 16], it will be flattened to a vector of shape [10 * 10 * 16] = [1600].
Then, every unit in the first dense layer will have 1600 inputs.
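
Flattening is just a reordering of the same numbers into one long list. The sketch below fills a [10, 10, 16] grid with distinct values so the ordering is visible; reading it row by row, then pixel by pixel, then channel by channel is one common convention and is assumed here, as is the dense layer width of 32 units.

```python
def flatten(activation_maps):
    """Unroll a [rows][cols][channels] grid into a single flat list."""
    return [
        value
        for row in activation_maps
        for pixel in row
        for value in pixel
    ]


rows, cols, channels = 10, 10, 16
# Give every entry a distinct value so the flattened order is easy to check.
activation_maps = [
    [[(r * cols + c) * channels + ch for ch in range(channels)] for c in range(cols)]
    for r in range(rows)
]

vector = flatten(activation_maps)
print(len(vector))   # 1600
print(vector[:5])    # [0, 1, 2, 3, 4]

dense_units = 32     # hypothetical width for the first dense layer
dense_weights = len(vector) * dense_units
print(dense_weights)  # 51200
```

After this step the dense layer treats all 1600 values as an unordered list of features, so the grid structure is no longer used.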

# Interpreting convolutional layers

## Learning filters
TODO

## A variant of dense layers
TODO

## Modules
TODO

# Padding, stride, receptive field
TODO

# Pooling
TODO

# Common architecture choices
TODO

# Uses for convolutional networks
TODO

# Example
TODO