# Lab 4: Convolutional Neural Networks
When your input is shaped "like a grid" (think of audio as a 1-D grid, images as a 2-D grid, 3-D scans, ...), there's a good chance it has _local structure_: points closer in the grid are more related to each other than they are to points further away.
In images, for example, small groups of nearby pixels can form simple patterns like edges or corners, or more complex patterns like faces.
Pixels in the upper-left of an image are almost unrelated to pixels in the bottom-right corner, except that they're part of the same image.

Another observation is that these local patterns can show up anywhere in the image.
If you're trying to detect faces, it doesn't really matter where they are in the input.
This suggests that we should use the same structure to detect a given pattern everywhere in an image.

Convolutions let us take advantage of both of these observations.
Instead of connecting every part of the input to every part of the output in the same way, as dense layers do, convolutional layers learn small filters which recognize local patterns and "slide" them around the input.

# Convolutions
The heart of convolutional layers is an operation called "discrete convolution" which formalizes how we "slide filters around the grid."

## What does "like a grid" mean?
What this really means is that the input data has some kind of _ordering_ along at least one of its axes.
128x128-pixel color images will be represented by tensors of shape [128, 128, 3] -- the first two axes refer to the pixel, and the third picks an RGB **channel**.
The channel axis (also sometimes called the depth axis) is not ordered, since red is no closer to green than it is to blue.
But the first two axes are ordered: pixel (1, 1) is closer to pixel (1, 2) than it is to (1, 3) or (3, 3).

Convolutional layers are designed to take advantage of this ordering, by assuming that close points are more related than further ones.
Convolutions will only act on the ordered axes.

## Kernels
Instead of a set of dense weights, convolutional layers learn a set of **kernels**, which are formatted like small patches of grid.
In 1-D, they're 1-dimensional grid segments, in 2-D they're little squares, and in 3-D they're volumes.

To compute the **activation** of a kernel in a particular location in the grid, place the kernel centered there, then compute the dot product of the kernel and the input by multiplying elements that line up and summing the results.

The full result of **convolving** a kernel over an input is the set of activations produced by "sliding" the kernel to all locations on the input, resulting in a grid shaped like the input.
This is often called an **activation map**.
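
As a rough illustration of the sliding dot product described above, here is a minimal 1-D sketch in NumPy. The function name `convolve1d_same` and the example values are made up for this lab; it assumes an odd-length kernel and zero padding so the activation map has the same length as the input. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve1d_same(signal, kernel):
    """Slide `kernel` over `signal`, zero-padding the ends so the
    activation map has the same length as the input."""
    k = len(kernel)
    pad = k // 2  # assumes an odd-length kernel so it can be centered
    padded = np.pad(signal, pad)  # np.pad defaults to zero padding
    # One dot product (one activation) per location in the input.
    return np.array([np.dot(padded[i:i + k], kernel)
                     for i in range(len(signal))])

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])
activation_map = convolve1d_same(signal, kernel)
print(activation_map)  # activations: -2, -2, -2, -2, 4
```

The last activation differs from the others only because the kernel hangs over the zero-padded edge there.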

One really important thing to notice here is that _convolution preserves order_.
When you convolve a kernel over an image, the upper-left of the activation map corresponds to the upper-left of the image.
This means that convolutional layers can _preserve local structure_ even as you apply multiple convolutional layers in sequence.
Compare this with dense layers, which connect every output to every input and so lose all ordering information.

Another important property is that convolutions find small local patterns.
You can think of the kernel as a "template".
Its activation on a patch of image indicates how well that image patch matches the template: high positive values mean they have high "similarity" (note that the dot product is not actually normalized), high negative values mean they have similar structure but opposite sign, and values near zero mean that the kernel doesn't structurally match the input there.
These "templates" are applied everywhere on the input, able to match the same pattern (e.g. an edge or a face) again and again.
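
To make the "template" idea concrete, here is a small sketch with a hypothetical two-element "rising edge" template. The signal and template values are invented for illustration; the point is only to show positive, negative, and near-zero activations.

```python
import numpy as np

# A hypothetical "rising edge" template: low on the left, high on the right.
template = np.array([-1.0, 1.0])

# A signal with a rising edge, a flat region, and a falling edge.
signal = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0])

# Sliding dot product with no padding.
activations = np.array([
    np.dot(signal[i:i + len(template)], template)
    for i in range(len(signal) - len(template) + 1)
])
print(activations)  # activations: 0, 1, 0, 0, -1, 0
```

The activation is positive where the signal matches the template (the rising edge), negative where it has the same structure with the opposite sign (the falling edge), and zero on the flat regions.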

### 1D
![1-D convolution](https://files.realpython.com/media/njanakiev-1d-convolution.d7afddde2776.png)
(Image source: [Deep Learning with Python](https://realpython.com/asins/1617294438/))

### 2D
![2-D convolution gif](https://cdn-images-1.medium.com/max/1440/1*Zx-ZMLKab7VOCQTxdZ1OAw.gif)

In this animation, the input image is the blue grid on the left, the output is the teal grid on the right, and the kernel is the square grid of weights
$$\begin{bmatrix}0 & 1 & 2 \\ 2 & 2 & 0 \\ 0 & 1 & 2 \end{bmatrix}$$
The activation at a particular location is shown by shading where the kernel is on the input and the resulting space in the output.

(Image source: ["A guide to convolution arithmetic for deep learning"](https://arxiv.org/abs/1603.07285))
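
Here is a minimal sketch of the same operation in NumPy, using the kernel shown above. The helper name `convolve2d_valid` and the 5x5 input values are assumptions made for illustration (they are not the values in the animation), and no padding is applied, so the output is smaller than the input.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over every location where it fits entirely on `image`."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Multiply elements that line up, then sum.
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

kernel = np.array([[0, 1, 2],
                   [2, 2, 0],
                   [0, 1, 2]])
image = np.arange(25).reshape(5, 5)  # made-up input: 0..24 in row-major order

activation_map = convolve2d_valid(image, kernel)
print(activation_map.shape)  # (3, 3)
print(activation_map[0, 0])  # 62.0
```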

## Filters
Kernels act only on a single channel, and produce a single channel.
When dealing with multi-channel data (like RGB images, or multi-band satellite imagery), the output of a convolutional layer should be based on all of the channels together.
So, instead of using a kernel, we use a **filter**: a collection of kernels, one for each channel in the input's depth axis.

![filter](https://cdn-images-1.medium.com/max/1080/1*lRpx5pTrVewFTD8YXjhIKA.png)

(Image source: ["Intuitively understanding convolutions for deep learning"](https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1))

The activation of a filter at a point on an image is the sum of the activations of its kernels, each applied to its corresponding channel.
So, each filter takes in a grid of multiple channels and produces a new grid of a single channel (the activation map).

To increase the capacity of a convolutional layer, we use multiple filters in parallel, analogous to the "width" or "number of units" hyperparameter for dense layers.
The activation maps of the filters are stacked along the channel axis, so each filter contributes one channel of the output.
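
A naive sketch of a whole convolutional layer might look like the following. The function name `conv_layer_valid` is hypothetical, the weights are random rather than learned, there is no bias or padding, and the `[height, width, in_channels, out_channels]` weight layout is an assumption chosen to match the convention TensorFlow uses.

```python
import numpy as np

def conv_layer_valid(image, filters):
    """image: [H, W, C_in]; filters: [kh, kw, C_in, C_out].
    Returns activation maps of shape [H - kh + 1, W - kw + 1, C_out]."""
    kh, kw, c_in, c_out = filters.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w, c_out))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + kh, c:c + kw, :]  # [kh, kw, C_in]
            # For each filter: sum over both pixel axes and the channel axis.
            out[r, c, :] = np.tensordot(patch, filters, axes=3)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8, 3))       # small made-up RGB image
filters = rng.normal(size=(3, 3, 3, 4))  # four 3x3 filters, three kernels each

out = conv_layer_valid(image, filters)
print(out.shape)  # (6, 6, 4): one activation map per filter
```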

So, the first layer of a convolutional neural network that operates on 128x128-pixel RGB images and uses 64 5x5 filters has the following shapes:
 - Input images have shape [128, 128, 3] (pixel, pixel, channel)
 - Each kernel has shape [5, 5] (pixel, pixel)
 - Each filter has shape [5, 5, 3] (pixel, pixel, channel)
 - Each filter's activation map has shape [128, 128] (pixel, pixel)
 - The output of the layer has shape [128, 128, 64] (pixel, pixel, channel)
(assuming that the layer is padded so the activation maps are still 128x128 pixels)

Note: the next layer's filters would have shape [5, 5, 64] (assuming 5x5 convolutions).
Examples typically come in batches, so you can think of each shape also having a batch axis.
With a batch size of 8, for example, input image batches would have a shape of [8, 128, 128, 3].
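
The shape bookkeeping above can be written out as plain arithmetic. This sketch only restates the example's numbers; the weight count ignores biases, which is an assumption since the text doesn't discuss them.

```python
batch, height, width, in_channels = 8, 128, 128, 3
kernel_size, num_filters = 5, 64

kernel_shape = (kernel_size, kernel_size)               # (5, 5)
filter_shape = (kernel_size, kernel_size, in_channels)  # (5, 5, 3)
input_shape = (batch, height, width, in_channels)       # (8, 128, 128, 3)
output_shape = (batch, height, width, num_filters)      # with same padding

# Weights in the layer (ignoring biases): one filter per output channel.
num_weights = kernel_size * kernel_size * in_channels * num_filters
print(num_weights)  # 4800

# The next layer's filters need one kernel per channel of this layer's output.
next_filter_shape = (kernel_size, kernel_size, num_filters)  # (5, 5, 64)
```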

For a good visual summary, check out this diagram from ["A guide to convolution arithmetic for deep learning"](https://arxiv.org/abs/1603.07285): 
![multi-channel convolution](./images/convolution_channels.png)

# Convolutional neural networks
Since the outputs of a convolutional layer preserve ordering along ordered axes, and multiple filters result in the output having channels, _the outputs have the same format as the inputs_.
In the example above, an input 128x128 image with 3 channels is turned into an output 128x128 image with 64 channels.
This means that we can apply another convolutional layer to the output in order to build deep convolutional neural networks!

Much like when we stack dense layers, early convolutional layers capture simple patterns (like edges, corners, and textures) and later convolutional layers use patterns in those patterns to make higher-level features (like object parts and objects).

![feature hierarchy](./images/feature_hierarchy.png)

(Image source: skymind.ai, ["A Beginner's Guide to Neural Networks and Deep Learning"](https://skymind.ai/wiki/neural-network))

![feature hierarchy 2](http://teleported.in/post_imgs/11-zeiler-fertus.jpg)
(Image source: ["Visualizing and Understanding Convolutional Networks"](https://arxiv.org/abs/1311.2901))

This is great for the "feature extraction" part of a neural network, but how do we use the final activation maps to do classification or regression?
Often, the last activation map is "flattened" into a vector, then fed into a dense layer.
Usually one or two dense layers, followed by an output layer, are sufficient.

For example, if the output of the last convolutional layer of a model has shape [10, 10, 16], it will be flattened to a vector of shape [10 * 10 * 16] = [1600].
Then, every unit in the first dense layer will have 1600 inputs.
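
In NumPy terms, the flattening step is just a reshape. In this sketch the dense layer width of 32 is a made-up value, and the arrays are filled with zeros because only the shapes matter here.

```python
import numpy as np

activation_maps = np.zeros((10, 10, 16))  # output of the last conv layer
flattened = activation_maps.reshape(-1)   # shape (1600,)

num_units = 32  # hypothetical width of the first dense layer
dense_weights = np.zeros((flattened.size, num_units))  # 1600 inputs per unit
dense_output = flattened @ dense_weights               # shape (32,)

print(flattened.shape, dense_weights.shape, dense_output.shape)
```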

# Interpreting convolutional layers
Like dense layers, convolutional layers have lots of interpretations.

## Learning filters
The idea of convolving filters over images didn't originate with machine learning.
In signal processing and image processing, hand-designed kernels are common.
For example, hand-designed kernels exist for smoothing, sharpening, and edge detection on images.

![filters](./images/filters.png)

(Image source: https://en.wikipedia.org/wiki/Kernel_(image_processing))

Convolutional layers can be interpreted as learning filters of this sort that produce useful features when convolved over images.
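
For a feel of what hand-designed kernels do, here is a small sketch that applies a box blur, a sharpen kernel, and an edge-detection kernel (the 3x3 versions listed on the Wikipedia page above) to a tiny made-up image containing a vertical edge. The helper `convolve2d_valid` is written for this example and uses no padding.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

box_blur = np.ones((3, 3)) / 9.0
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)
edge = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)

# Made-up image: dark on the left, bright on the right.
image = np.zeros((5, 6))
image[:, 3:] = 1.0

blurred = convolve2d_valid(image, box_blur)
sharpened = convolve2d_valid(image, sharpen)
edges = convolve2d_valid(image, edge)

print(blurred[0])    # the step is smeared out: 0, 1/3, 2/3, 1
print(sharpened[0])  # the step is exaggerated: 0, -1, 2, 1
print(edges[0])      # nonzero only near the edge: 0, -3, 3, 0
```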

## A variant of dense layers
Mathematically, convolutional layers are equivalent to dense layers with two very important constraints:
 - Many of the weights are set to zero. This corresponds to the property that the pattern a kernel matches is local: it's based on a small number of adjacent inputs, and the weights connecting those inputs to other (non-local) outputs are zero.
 - The same weights are repeated many times in a fixed pattern. This corresponds to the property that the same kernel is applied many times on the same input.

For instance, a dense layer acting on a 4x4 image to produce a 2x2 image requires (4\*4) \* (2\*2) = 64 weights, arranged in a 4x16 matrix (taking in a 16-vector and producing a 4-vector through affine transformation):
![dense layer](https://cdn-images-1.medium.com/max/1800/1*Nq-Za2-OzW8J5n7Tu7QIWw.png)

However, a 3x3 kernel $K$ acting on this image just requires 3\*3 = 9 weights.

$$K = \begin{bmatrix}k_{1,1} & k_{1,2} & k_{1 ,3} \\ k_{2,1} & k_{2,2} & k_{2 ,3} \\ k_{3,1} & k_{3,2} & k_{3 ,3} \end{bmatrix}$$

Convolving this kernel with the input can be represented as multiplying by a specific 4x16 matrix:
![convolution as a dense layer](https://cdn-images-1.medium.com/max/1800/1*cr0IabpKu4zIyvDgCTQ64A.png)

(Source of images: ["Intuitively Understanding Convolutions for Deep Learning"](https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1))
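
We can check this equivalence numerically. The sketch below builds the 4x16 matrix for a 3x3 kernel acting on a 4x4 image (no padding) and compares it against a direct sliding-window computation. The kernel and image values are made up, and pixels are flattened in row-major order, which is an assumption about the layout.

```python
import numpy as np

K = np.arange(1.0, 10.0).reshape(3, 3)  # made-up kernel values 1..9
image = np.arange(16.0).reshape(4, 4)   # made-up 4x4 input

# One row per output pixel (2x2 = 4), one column per input pixel (4x4 = 16).
M = np.zeros((4, 16))
for out_r in range(2):
    for out_c in range(2):
        row = out_r * 2 + out_c
        for i in range(3):
            for j in range(3):
                col = (out_r + i) * 4 + (out_c + j)
                M[row, col] = K[i, j]  # the same 9 weights reused in every row

as_matrix = (M @ image.reshape(16)).reshape(2, 2)

direct = np.array([[np.sum(image[r:r + 3, c:c + 3] * K) for c in range(2)]
                   for r in range(2)])

print(np.allclose(as_matrix, direct))  # True
print(np.count_nonzero(M))             # 36 of the 64 entries are nonzero
```

Both constraints are visible in `M`: 28 of its 64 entries are fixed at zero, and the remaining 36 are copies of just 9 distinct weights.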

These properties lead to the interpretation of convolutions as dense layers with three very strong priors:
 - Patterns are purely local. This is also called the property of having "sparse interactions." This vastly reduces the number of parameters a layer has, which makes convolutional layers much faster to run and train and vastly more statistically efficient.
 - Parameters are shared throughout the layer (called "weight sharing" or "weight tying"). This is the prior that the same pattern can appear anywhere in an input.
 - The activation map should be **equivariant** with respect to translation. That means that if something is shifted in an image, the resulting activation is shifted by the same amount in the activation map. 

These priors work extremely well in practice on images and other kinds of grid-shaped data, including many kinds of time-series data (such as speech waveforms).
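
Translation equivariance is easy to check on a toy example. In this sketch (1-D, no padding, made-up values), shifting a small pattern two steps to the right shifts its activations two steps to the right as well. The pattern is kept away from the borders so that nothing falls off the edge.

```python
import numpy as np

def convolve1d_valid(signal, kernel):
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])

kernel = np.array([1.0, 2.0, 1.0])

signal = np.zeros(12)
signal[3:5] = [1.0, 3.0]  # a small pattern away from the borders

# Shift the pattern right by 2. np.roll wraps around, but only zeros wrap here.
shifted = np.roll(signal, 2)

out = convolve1d_valid(signal, kernel)
out_shifted = convolve1d_valid(shifted, kernel)

# The activation map is shifted by the same amount as the input.
print(np.allclose(out_shifted[2:], out[:-2]))  # True
```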

## Modules / functions
One interpretation of the value of weight sharing is that it corresponds to using the same neuron or group of neurons in many places in the input, much like a programmer might write a function and use it many times in a program.

From this perspective, forcing the network to apply the same group of neurons multiple times forces it to learn representations that are useful in many places, leading to robust representations.
If the prior that these "functions" are a good fit for the input is accurate, then learning one function and applying it many times is much easier than learning the function many times, once in each place it could be applied.
In practice this seems to be the case.

![modular conv net](https://colah.github.io/posts/2014-07-Conv-Nets-Modular/img/Conv-9-Conv2Conv2.png)

In the image above, taken from Chris Olah's ["Conv Nets: A Modular Perspective"](https://colah.github.io/posts/2014-07-Conv-Nets-Modular/), the network:
 1. Applies a learned function, A, to every point on the input
 2. Applies a second learned function, B, to the outputs of A
 3. Feeds the outputs of B into a dense layer F, which produces the output.
In this view, F is like a large "block of code", kind of like a `main()` function, that chooses how to apply the functions A and B by deciding which of their activations on the input are important and how to use them.

If you'd like to read more about this view, check out the blog post linked above.

# Output size, padding, receptive field, and stride
Note: throughout this section, the animations used have the input on the bottom (blue) and the output on top (teal).
Source: https://github.com/vdumoulin/conv_arithmetic.

## Output size
Because you can't center the kernel on the edge of the image, the output of a convolution will be slightly smaller than the input:
![convolution without padding](https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/no_padding_no_strides.gif)
This effect gets more significant with larger kernel sizes.

## Padding
However, it's common in many cases to pad the edges of the input so that the output and input have the same dimensions:

![same padding](https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/same_padding_no_strides.gif)

This is called "same padding," and sometimes not padding the input is called "valid padding" or "valid convolution" (because you only place the kernel on the input where it's "valid").
Usually the input is padded with zeroes (this is the only kind of padding Keras supports), though [there are some issues with this](https://twitter.com/karpathy/status/720622989289644033?lang=en).
Sometimes "border reflection" is used instead.
Zero-padding usually works fine in practice, though.

Same padding is useful for a few reasons:
 - It enables deeper networks by preventing shrinkage. Even losing 2 pixels at the boundary per layer can be a big deal for deep networks operating on low-resolution inputs.
 - It prevents the model from losing information at the edges.
 - It enables more advanced architectures like Inception to use multiple different sizes of kernels within the same layer, since their outputs will all have the same size.
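
As a small sketch of what padding does to the array itself, the snippet below uses NumPy's `np.pad` (not Keras) on a made-up 3x3 input. It assumes an odd kernel size and a stride of 1, in which case padding each side by `(kernel_size - 1) // 2` gives same padding. `mode="reflect"` is shown as one possible form of border reflection.

```python
import numpy as np

image = np.arange(1, 10).reshape(3, 3)  # made-up 3x3 input with values 1..9
kernel_size = 3
pad = (kernel_size - 1) // 2            # 1 pixel on every side

zero_padded = np.pad(image, pad, mode="constant", constant_values=0)
reflect_padded = np.pad(image, pad, mode="reflect")

print(zero_padded.shape)     # (5, 5)
print(zero_padded[0])        # top row is all zeros
print(reflect_padded[0])     # top row mirrors the image's second row

# Output size along one axis: n - k + 1
valid_size = image.shape[0] - kernel_size + 1       # 1: the 3x3 input shrinks to 1x1
same_size = zero_padded.shape[0] - kernel_size + 1  # 3: same as the input
print(valid_size, same_size)
```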

## Receptive field
An important quantity to think about when designing convolutional networks is **receptive field**.
The receptive field of a pixel in an activation map is the set of pixels in the input image that contributed to making that activation.
For example, the receptive field of a first-layer 3x3 convolution has size 3 (a 3x3 patch of the input), and the receptive field of a second-layer 3x3 convolution built on top of it has size 5 (a 5x5 patch).
Activations which are based on the entire image are said to have a "full receptive field." 
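
The numbers in the example above can be reproduced with a short calculation. This is a sketch of the usual recurrence for a plain stack of convolution (or pooling) layers, tracked along one axis. The function name and the `(kernel_size, stride)` input format are choices made for this example, and it assumes no dilation.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, first layer first.
    Returns the receptive field size, along one axis, of one output unit."""
    size = 1
    jump = 1  # distance in input pixels between adjacent units at this depth
    for kernel_size, stride in layers:
        size += (kernel_size - 1) * jump
        jump *= stride
    return size

print(receptive_field([(3, 1)]))          # 3
print(receptive_field([(3, 1), (3, 1)]))  # 5
print(receptive_field([(3, 2), (3, 1)]))  # 7
```

The last line shows that giving the first layer a stride of 2 makes the second layer's receptive field grow faster.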

Receptive field is important because it constrains what the network could theoretically learn.
No matter how much data you have or how powerful your model, a 1-D convolutional network on audio won't be able to transcribe (word-by-word) a full audio clip if its output units don't have a receptive field big enough to encompass an entire word.

Note:
 - Not every input in the receptive field contributes equally to the output. There is the concept of an ["effective receptive field"](http://www.cs.toronto.edu/~wenjie/papers/nips16/top.pdf) which is much smaller than the theoretical receptive field.
 - Just because a model's receptive field theoretically enables it to learn a task doesn't mean it necessarily will. Always consider whether you have enough model capacity to carry out the task and if you have enough data for the capacity.

## Stride
A common modification is **strided convolution**, which moves the kernel more than one space per step, effectively skipping steps in order to intentionally reduce the output size:
![strided convolution](https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/no_padding_strides.gif)

This is equivalent to performing a normal convolution and then _downsampling_ the output, for example by discarding every other pixel when stride=2.
Reducing the output size is useful for the internal "feature extraction" layers of a CNN, since it makes training and running the network dramatically cheaper by reducing the number of times we must slide the kernel.
Strided convolution also significantly increases the receptive field of the layers that follow.
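
The equivalence between striding and downsampling can be checked directly. This is a 1-D sketch with made-up values and no padding; `convolve1d_valid` is a helper written for this example.

```python
import numpy as np

def convolve1d_valid(signal, kernel, stride=1):
    k = len(kernel)
    starts = range(0, len(signal) - k + 1, stride)
    return np.array([np.dot(signal[s:s + k], kernel) for s in starts])

signal = np.arange(10.0) ** 2
kernel = np.array([1.0, 0.0, -1.0])

full = convolve1d_valid(signal, kernel)               # 8 activations
strided = convolve1d_valid(signal, kernel, stride=2)  # 4 activations

# Stride 2 gives the same values as keeping every other output of stride 1,
# but only computes half as many dot products.
print(np.allclose(strided, full[::2]))  # True
```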

The cost is that the features become less precisely localized, since we're discarding some of the fine-grained position.

#### Aside: other kinds of convolutions
There are a number of different varieties of convolution that have been developed recently, and are useful in certain situations.
Transposed convolution, dilated convolution, and separable convolution, along with related operations like global average pooling, are all important ideas in some modern convolutional networks.

For more info, you might want to skim through ["A guide to convolution arithmetic for deep learning"](https://arxiv.org/abs/1603.07285) or play around with a [receptive field and output size calculator](https://fomoro.com/projects/project/receptive-field-calculator).
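
If you'd rather compute output sizes by hand, the general relationship from the guide linked above is $o = \lfloor (i + 2p - k) / s \rfloor + 1$ along each axis. Here is a tiny sketch of it; the function name is made up, and `padding` means the amount added to each side of the input.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one axis; `padding` is added to each side."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(128, 5))             # 124: no padding shrinks the output
print(conv_output_size(128, 5, padding=2))  # 128: same padding for a 5x5 kernel
print(conv_output_size(5, 3, stride=2))     # 2: striding shrinks it further
```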

# Pooling
**Pooling** is an operation very commonly used in convolutional networks.
Pooling acts on the activation of a layer by "grouping together" nearby values; the most common variety is **max pooling**, which replaces every $n \times n$ square in the activation map with the maximum activation value in that square.
Pooling can be thought of as a sliding-window operation, much like convolution, that uses some other function instead of the linear combination function used by convolution.

![maxpooling](https://computersciencewiki.org/images/8/8a/MaxpoolSample2.png)
(Source: [Computer Science Wiki](https://computersciencewiki.org/index.php/Max-pooling_/_Pooling))
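
A minimal sketch of non-overlapping max pooling in NumPy is below. The function name and input values are invented for this example (they are not the values in the figure), and it assumes the height and width are divisible by $n$.

```python
import numpy as np

def max_pool2d(activation_map, n):
    """Non-overlapping n x n max pooling; assumes H and W are divisible by n."""
    h, w = activation_map.shape
    # Split each axis into (block index, position within block).
    blocks = activation_map.reshape(h // n, n, w // n, n)
    return blocks.max(axis=(1, 3))

activation_map = np.array([[1, 3, 2, 1],
                           [4, 6, 5, 0],
                           [7, 2, 9, 8],
                           [1, 0, 3, 4]])

pooled = max_pool2d(activation_map, 2)
print(pooled)
# [[6 5]
#  [7 9]]
```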

Much like striding, pooling reduces the dimensionality of the output and increases receptive field size.
In addition, it makes the output approximately _invariant to small translations_: when you shift the input a little, most of the pooled outputs don't change.
This makes pooling appropriate when you care more about _whether a feature is present_ than _exactly where it is_.
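
Here is a toy 1-D sketch of that invariance, with made-up values and a pooling window of 2. Note that the invariance is only approximate: a shift that stays inside a pooling window leaves the output unchanged, while a shift that crosses a window boundary moves the output.

```python
import numpy as np

def max_pool1d(x, n):
    """Non-overlapping max pooling; assumes len(x) is divisible by n."""
    return x.reshape(-1, n).max(axis=1)

x = np.array([0, 0, 5, 0, 0, 0, 0, 0])
x_shift1 = np.array([0, 0, 0, 5, 0, 0, 0, 0])  # feature moved within its window
x_shift2 = np.array([0, 0, 0, 0, 5, 0, 0, 0])  # feature moved into the next window

print(max_pool1d(x, 2))         # [0 5 0 0]
print(max_pool1d(x_shift1, 2))  # [0 5 0 0]  (unchanged)
print(max_pool1d(x_shift2, 2))  # [0 0 5 0]  (changed)
```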

# Common architecture choices
TODO

# Convolution in TensorFlow
TODO

# Convolution in Keras
TODO

# Example
TODO