# Convolutional Neural Networks
Jason Saporta, Department of Statistics, Iowa State University

## Neural Networks
<img src="http://neuralnetworksanddeeplearning.com/images/tikz41.png" style="margin-left: auto; margin-right: auto;"></img>
A typical neural network has fully connected layers: every neuron in one layer is connected to every neuron in the next.

Each connection is associated with a different weight parameter.

That's a lot of parameters!
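
To get a feel for the numbers, here is a minimal sketch that counts the parameters of one fully connected layer. The sizes (784 inputs, as for a flattened 28x28 image, and 100 hidden neurons) and the helper name `dense_param_count` are illustrative assumptions, not values from these slides.

```python
def dense_param_count(n_inputs: int, n_hidden: int) -> int:
    """Number of weights plus biases in one fully connected layer."""
    return n_inputs * n_hidden + n_hidden


# Hypothetical sizes: 28 * 28 = 784 inputs feeding 100 hidden neurons.
n_params = dense_param_count(28 * 28, 100)
print(n_params)  # 78500
```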

## Image Data
Suppose our inputs have the form of 2-dimensional grayscale images.

Then our input layer looks like:
<img src="http://neuralnetworksanddeeplearning.com/images/tikz42.png" style="margin-left: auto; margin-right: auto;"></img>

## Local Receptive Fields
Instead of using a fully connected "dense" layer, neurons in the first hidden layer will each only be connected to a small patch of the input image. The size of this patch is a hyperparameter that we choose before training.

This patch is called the **local receptive field** for that neuron.
<img src="http://neuralnetworksanddeeplearning.com/images/tikz43.png" style="margin-left: auto; margin-right: auto;"></img>

## Local Receptive Fields
We will have one neuron in the hidden layer for each possible local receptive field in the input image (for now).
<img src="http://neuralnetworksanddeeplearning.com/images/tikz44.png" style="margin-left: auto; margin-right: auto;"></img>
<img src="http://neuralnetworksanddeeplearning.com/images/tikz45.png" style="margin-left: auto; margin-right: auto;"></img>
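
The size of the hidden layer follows from counting how many positions the patch can occupy. A small sketch, assuming a square patch that moves `stride` pixels at a time; the `stride` argument and the 28x28 and 5x5 sizes are assumptions made for illustration.

```python
def hidden_layer_shape(image_shape, field_size, stride=1):
    """Count the positions a square receptive field can take in the image."""
    height, width = image_shape
    rows = (height - field_size) // stride + 1
    cols = (width - field_size) // stride + 1
    return rows, cols


# One hidden neuron per possible 5x5 patch of a 28x28 image.
print(hidden_layer_shape((28, 28), 5))  # (24, 24)
```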

## Shared Weights
The weights connecting each of these local receptive fields to their respective hidden neurons are shared across the hidden layer.

This means that our hidden layer is really the result of a **convolution** between the input image and a kernel matrix consisting of learnable values. Convolutions are frequently used in image processing to detect the presence of certain features in an image.

Hidden layer neurons will activate when the feature specified by the kernel is present in their respective local receptive fields. The exact details of when they will activate will depend on the kernel, the bias value, and the activation function.
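
The shared-weight computation can be sketched in a few lines of NumPy. This is a sketch under stated assumptions: the kernel is applied without flipping (the cross-correlation convention most deep learning libraries call "convolution"), the activation is a sigmoid, and the edge-detecting kernel and bias are hand-picked rather than learned.

```python
import numpy as np


def conv2d_valid(image, kernel, bias=0.0):
    """Slide one kernel over the image and return the pre-activation map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]          # local receptive field
            out[i, j] = np.sum(patch * kernel) + bias  # same weights at every position
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Toy image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked kernel that responds to a dark-to-bright vertical edge.
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

pre_activation = conv2d_valid(image, edge_kernel, bias=-1.5)
hidden = sigmoid(pre_activation)
print(pre_activation[0])  # [-1.5  1.5  1.5 -1.5]
```

Only the two middle columns of hidden neurons see the edge in their receptive fields, so only they activate above 0.5.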

## Diagram of a Convolution
<img src="cnnpics/conv.png" style="margin-left: auto; margin-right: auto;"  height="50%" width="50%"></img>

## Multiple Feature Maps
So far we have discussed the case where there is one learned kernel that is convolved with the input image to produce the hidden layer.

In practice, we usually have multiple kernels, each of which is convolved with the input image. The hidden layer will then consist of multiple **feature maps**, one per kernel.
<img src="http://neuralnetworksanddeeplearning.com/images/tikz46.png" style="margin-left: auto; margin-right: auto;"></img>
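
A minimal sketch of a layer with several kernels, assuming three random 5x5 kernels and a random 28x28 image (both stand-ins for learned weights and real data). Each kernel has its own bias and yields its own feature map.

```python
import numpy as np


def feature_maps(image, kernels, biases):
    """Apply each kernel to the image; returns shape (n_kernels, out_h, out_w)."""
    n_kernels, kh, kw = kernels.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    maps = np.empty((n_kernels, out_h, out_w))
    for k in range(n_kernels):
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i:i + kh, j:j + kw]
                maps[k, i, j] = np.sum(patch * kernels[k]) + biases[k]
    return maps


rng = np.random.default_rng(0)
image = rng.random((28, 28))          # stand-in for a grayscale image
kernels = rng.normal(size=(3, 5, 5))  # stand-in for three learned kernels
biases = np.zeros(3)

maps = feature_maps(image, kernels, biases)
print(maps.shape)  # (3, 24, 24)
```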

## Pooling Layers
In addition to convolutional layers, we also have **pooling layers**, which frequently come directly after convolutional layers.

The distinction between the two is fuzzy, but the idea here is that the *exact location* of a feature in an image is frequently unimportant. A pooling layer therefore summarizes each small patch of a feature map with a single value, which gives us a coarser, *smaller* representation of each feature map computed in the previous layer.

Here, the maximum of the input units in the image patch is computed:
<img src="http://neuralnetworksanddeeplearning.com/images/tikz47.png" style="margin-left: auto; margin-right: auto;"></img>
Pooling is also commonly done by averaging the units in the patch.
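
Both variants can be sketched with one helper. This assumes non-overlapping square patches (2x2 by default) and silently drops any rows or columns that do not fill a whole patch; the name `pool2d` is ours.

```python
import numpy as np


def pool2d(feature_map, size=2, mode="max"):
    """Summarize each non-overlapping size x size patch with one value."""
    h_out = feature_map.shape[0] // size
    w_out = feature_map.shape[1] // size
    trimmed = feature_map[:h_out * size, :w_out * size]
    # After the reshape, axes 1 and 3 index positions inside each patch.
    blocks = trimmed.reshape(h_out, size, w_out, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError(f"unknown pooling mode: {mode}")


feature_map = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(feature_map, mode="max")      # values 5, 7, 13, 15
avg_pooled = pool2d(feature_map, mode="average")  # values 2.5, 4.5, 10.5, 12.5
print(max_pooled.shape)  # (2, 2)
```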

## Pooling Layers
Again, this can be replicated separately for each feature map under consideration.
<img src="http://neuralnetworksanddeeplearning.com/images/tikz48.png" style="margin-left: auto; margin-right: auto;"></img>

## A (Very Simple) Complete Convolutional Neural Network
Armed with the output of the pooling layer, we can imagine adding on a softmax layer (for example) to perform classification.
<img src="http://neuralnetworksanddeeplearning.com/images/tikz49.png" style="margin-left: auto; margin-right: auto;"></img>
In a real-life network, we will typically have several pairs of convolutional and pooling layers stacked on top of each other, and potentially a fully-connected layer as well before our output layer.
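
Putting the pieces together, a forward pass through this very simple network might look like the sketch below. The layer sizes (three 5x5 kernels, 2x2 max pooling, ten output classes), the ReLU activation, and the random parameters are all assumptions made for illustration; a real network would learn the parameters by back-propagation.

```python
import numpy as np


def conv_layer(image, kernels, biases):
    """Convolutional layer: one feature map per kernel, followed by ReLU."""
    n_kernels, kh, kw = kernels.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((n_kernels, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[:, i, j] = np.sum(kernels * patch, axis=(1, 2)) + biases
    return np.maximum(out, 0.0)


def max_pool(maps, size=2):
    """Max pooling applied separately to each feature map."""
    n_kernels, h, w = maps.shape
    h_out, w_out = h // size, w // size
    trimmed = maps[:, :h_out * size, :w_out * size]
    blocks = trimmed.reshape(n_kernels, h_out, size, w_out, size)
    return blocks.max(axis=(2, 4))


def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def forward(image, params):
    maps = conv_layer(image, params["kernels"], params["conv_biases"])
    pooled = max_pool(maps)
    logits = params["W_out"] @ pooled.ravel() + params["b_out"]
    return softmax(logits)


rng = np.random.default_rng(0)
params = {
    "kernels": rng.normal(scale=0.1, size=(3, 5, 5)),
    "conv_biases": np.zeros(3),
    "W_out": rng.normal(scale=0.1, size=(10, 3 * 12 * 12)),
    "b_out": np.zeros(10),
}
probs = forward(rng.random((28, 28)), params)
print(probs.shape)  # (10,)
```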

## Terminological Considerations
<img src="cnnpics/term.png" style="margin-left: auto; margin-right: auto;" height="75%" width="75%"></img>

## Application Areas
- Image Classification
- Object Detection
- Object Tracking
- Pose Estimation
- Text Detection and Recognition
- Visual Saliency Detection
- Action Recognition
- Scene Labeling
- Speech Processing
- Natural Language Processing

## Extensions to the Basic Architecture
There are at least seven areas where improvements have been proposed for CNNs:
- Convolutional Layers
- Pooling Layers
- Activation Functions
- Loss Functions
- Regularization Methods
- Optimization Methods
- Fast Processing of CNNs

## Convolutional Layers
- **Tiled Convolution:** Uses multiple kernels when creating a single feature map, cycling between them as the sliding window moves across the image. The goal here is to learn features that are invariant to more than just small shifts.
- **Deconvolution:** Connects a single input activation to multiple outputs, done by performing a standard convolution on upsampled input values. Useful for visualizing the innards of a CNN by mapping feature maps back to pixel space.
- **Network in Network:** Replaces the linear filter of a convolutional layer by a small multilayer perceptron capable of approximating a more abstract representation of latent concepts.

## Dense, Tiled Convolutional, and Standard Convolutional Layers
<img src="cnnpics/tiled.png" style="margin-left: auto; margin-right: auto;" height="50%" width="50%"></img>

## Pooling Layers
- **$\boldsymbol{L_p}$ pooling**: $$y_{ij} = \left[ \sum_{(m, n) \in \mathcal{R}_{ij}} a_{(m, n)}^p \right]^{1 / p}$$ For non-negative activations, this is proportional to average pooling when $p = 1$ (it equals the sum over the region, i.e. the average times the region size) and approaches max pooling as $p \to \infty$.
- **Mixed Pooling**: During the forward propagation process, randomly decide whether to use average or max pooling. Retain the choice for when you back-propagate through the network.
- **Stochastic Pooling:** Randomly select one of the nodes in the pooling region based on a multinoulli distribution, where the node probabilities are determined by scaling the activation values so they sum to 1. The selected node's activation value is then used.
- **Spatial Pyramid Pooling:** Pool the image based on size-proportional regions rather than a sliding window. This can output a fixed-length layer even when dealing with images of multiple sizes.
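
Two of these variants are easy to sketch for a single pooling region. Both helpers assume non-negative activations (for example, the output of a ReLU); `stochastic_pool` also assumes the activations do not all equal zero. The function names are ours.

```python
import numpy as np


def lp_pool(region, p):
    """L_p pooling of one region of non-negative activations."""
    return np.sum(region ** p) ** (1.0 / p)


def stochastic_pool(region, rng):
    """Sample one activation with probability proportional to its value."""
    a = np.asarray(region, dtype=float).ravel()
    probs = a / a.sum()  # scale the activations so they sum to 1
    return rng.choice(a, p=probs)


region = np.array([[1.0, 2.0], [3.0, 4.0]])

print(lp_pool(region, 1))   # 10.0, the sum: 4 times the average of 2.5
print(lp_pool(region, 50))  # very close to the maximum, 4.0

rng = np.random.default_rng(0)
sample = stochastic_pool(region, rng)  # one of 1.0, 2.0, 3.0, 4.0
```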

## Activation Functions
- **ReLU:** $\max\{0, a\}$. This is cheaper to compute than sigmoid or tanh activation functions, and induces sparsity in the activations of the hidden units.
- **Leaky ReLU:** $\max\{0, a\} + \lambda \min\{0, a\}$ for some predefined $\lambda \in (0, 1)$. Allows for a non-zero gradient when the unit is inactive.
- **Randomized Leaky ReLU:** $\max\{0, a\} + \lambda \min\{0, a\}$, where $\lambda \sim \text{Unif}(0, 1)$ is sampled during forward propagation (separately for each neuron with this activation function) and the sampled value is reused during back-propagation.
- **Maxout:** Max of the weighted inputs over multiple channels (such as color for images). Generalizes the ReLU and absolute value activation functions. Designed for use with dropout regularization.
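
The formulas above translate directly into NumPy. The default $\lambda = 0.01$ is an arbitrary choice, and the maxout sketch assumes the common formulation in which the unit takes the maximum over `k` separately weighted affine pieces; the array shapes are our own convention.

```python
import numpy as np


def relu(a):
    return np.maximum(0.0, a)


def leaky_relu(a, lam=0.01):
    return np.maximum(0.0, a) + lam * np.minimum(0.0, a)


def randomized_leaky_relu(a, rng):
    """Training-time forward pass; returns the sampled slopes for the backward pass."""
    lam = rng.uniform(0.0, 1.0, size=np.shape(a))
    return np.maximum(0.0, a) + lam * np.minimum(0.0, a), lam


def maxout(x, W, b):
    """W has shape (k, n_out, n_in) and b has shape (k, n_out)."""
    return np.max(W @ x + b, axis=0)


a = np.array([-2.0, 0.0, 3.0])
print(relu(a))        # negative input mapped to 0, positive input unchanged
print(leaky_relu(a))  # negative input scaled by 0.01 instead

# Maxout over the pieces {x, 0} reproduces ReLU; over {x, -x} it gives |x|.
x = np.array([-2.0])
no_bias = np.zeros((2, 1))
as_relu = maxout(x, np.array([[[1.0]], [[0.0]]]), no_bias)
as_abs = maxout(x, np.array([[[1.0]], [[-1.0]]]), no_bias)
```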

## Loss Functions
- **Hinge Loss:** This can be used to create the equivalent of an SVM classifier which uses the features automatically created in the network.
- **Softmax Loss:** This creates the equivalent of a multinoulli GLM using the network's features.
- **Contrastive Loss:** Used to train Siamese networks, which are two identical CNNs used together for the purpose of identifying matching and non-matching input values.
- **Triplet Loss:** Takes in three data points: an *anchor*, a *positive instance*, and a *negative instance*. The goal here is to minimize the distance between the anchor and the positive instance, and maximize that between the anchor and the negative instance.
- **K-L Divergence:** Used to train autoencoders. Its symmetric form, the Jensen-Shannon divergence, has been used to train Generative Adversarial Networks (GANs).
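
Several of these losses fit in a few lines each. The sketch below uses one common form of each loss for a single example; the margin of 1, the squared Euclidean distance in the triplet loss, and the particular multiclass hinge variant (summing over the incorrect classes) are assumptions, since several variants exist.

```python
import numpy as np


def multiclass_hinge_loss(scores, label, margin=1.0):
    """Penalize every wrong class whose score comes within `margin` of the true class."""
    margins = np.maximum(0.0, scores - scores[label] + margin)
    margins[label] = 0.0
    return margins.sum()


def softmax_loss(scores, label):
    """Negative log-probability of the true class under a softmax."""
    shifted = scores - np.max(scores)  # shift for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]


def contrastive_loss(x1, x2, same, margin=1.0):
    """Pull matching pairs together; push non-matching pairs at least `margin` apart."""
    d = np.linalg.norm(x1 - x2)
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is farther from the anchor than the positive by `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)


scores = np.array([2.0, 1.0, 0.0])
print(multiclass_hinge_loss(scores, label=0))  # 0.0: class 0 wins by the full margin
print(multiclass_hinge_loss(scores, label=2))  # 5.0: both other classes violate the margin
```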

## Regularization Methods
- **$\boldsymbol{L_p}$ Norm Regularization**: A very widespread regularization method; when $p = 2$ this is known as *weight decay*.
- **Dropout:** Randomly removes neurons from the network on each iteration of SGD. This prevents the network from becoming too dependent on any small set of neurons, and forces it to be accurate even in the absence of certain information.
- **DropConnect:** Similar to Dropout, but randomly sets weights to $0$ rather than removing neurons from the network. The distinction is important because of the shared weights in CNNs.
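
The difference between the two methods is easiest to see in where the random mask is applied. This sketch shows the training-time forward pass of a fully connected layer only. The "inverted" rescaling by `1 / keep_prob` in dropout is one common convention and an assumption here; the DropConnect version applies no rescaling.

```python
import numpy as np


def dropout_forward(activations, keep_prob, rng):
    """Zero whole neurons at random, rescaling the survivors."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob, mask


def dropconnect_forward(x, W, b, keep_prob, rng):
    """Zero individual weights at random; every neuron stays in the network."""
    mask = rng.random(W.shape) < keep_prob
    return (W * mask) @ x + b, mask


rng = np.random.default_rng(0)

activations = np.ones(8)
dropped, neuron_mask = dropout_forward(activations, keep_prob=0.5, rng=rng)
print(dropped)  # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)

x = np.ones(3)
W = np.ones((4, 3))
b = np.zeros(4)
output, weight_mask = dropconnect_forward(x, W, b, keep_prob=0.5, rng=rng)
print(output)   # each output sums only the weights that survived in its row
```

The masks are returned so the same choices can be reused when back-propagating through the layer.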

## Dropout Diagram
<img src="http://neuralnetworksanddeeplearning.com/images/tikz31.png" style="margin-left: auto; margin-right: auto;"></img>

## Optimization Methods
- **Data Augmentation:** Deep CNNs are dependent on large quantities of training data. When there is a small amount of data, one might want to *augment* it by adding new data points created from the old ones. This may be done by rotating, shifting, mirroring, etc. the original data.
- **(Nesterov) (Momentum) SGD:** SGD is the standard method for training CNNs, and it may be augmented with a velocity term that accumulates past gradients (momentum). Nesterov momentum evaluates the gradient after applying the current velocity, rather than before.
- **Batch Normalization:** Makes the normalization of the data part of the model architecture, rather than a separate pre-processing step. This has the effect of normalizing data based on their specific mini-batch rather than the whole dataset.
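
The update rules can be sketched as follows. The learning rate `lr`, momentum coefficient `mu`, and the toy objective $f(\theta) = \theta^2 / 2$ are illustrative assumptions, and `batch_norm` shows only the training-time forward pass over one mini-batch (no running statistics for test time).

```python
import numpy as np


def momentum_step(theta, v, grad_fn, lr=0.1, mu=0.9):
    """Classical momentum: gradient taken at the current parameters."""
    v = mu * v - lr * grad_fn(theta)
    return theta + v, v


def nesterov_step(theta, v, grad_fn, lr=0.1, mu=0.9):
    """Nesterov momentum: gradient taken after applying the velocity."""
    v = mu * v - lr * grad_fn(theta + mu * v)
    return theta + v, v


def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch (axis 0), then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


def grad(theta):
    return theta  # gradient of the toy objective theta**2 / 2


theta_m, v_m = 1.0, 0.0
theta_n, v_n = 1.0, 0.0
for _ in range(200):
    theta_m, v_m = momentum_step(theta_m, v_m, grad)
    theta_n, v_n = nesterov_step(theta_n, v_n, grad)
print(abs(theta_m) < 1e-2, abs(theta_n) < 1e-2)  # both approach the minimum at 0

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 3))
normalized = batch_norm(batch, gamma=np.ones(3), beta=np.zeros(3))
```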

## Fast Processing of CNNs
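- **FFT:** Used to carry out the convolution operation more cheaply: a convolution in the spatial domain becomes an element-wise product in the frequency domain, which pays off most for large kernels.
- **Low Precision/Binarized Operations:** Full-precision parameters and updates may contain a lot of redundant information. To reduce the redundancies, it can be useful to restrict some or all of the arithmetic within the neural network to low-precision or binary operations.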
- **Weight Compression:** Used to reduce the number of parameters in the convolutional and fully-connected layers.
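
The FFT approach can be checked against a direct computation. This sketch computes the full two-dimensional convolution (with the kernel flipped, as in the mathematical definition) both ways; the function names and the random test data are ours.

```python
import numpy as np


def conv2d_full_direct(image, kernel):
    """Full convolution by summing shifted, scaled copies of the image."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih + kh - 1, iw + kw - 1))
    for m in range(kh):
        for n in range(kw):
            out[m:m + ih, n:n + iw] += kernel[m, n] * image
    return out


def conv2d_full_fft(image, kernel):
    """Same result through an element-wise product in the frequency domain."""
    shape = (image.shape[0] + kernel.shape[0] - 1,
             image.shape[1] + kernel.shape[1] - 1)
    # Zero-padding both inputs to the full output size avoids wrap-around.
    spectrum = np.fft.rfft2(image, s=shape) * np.fft.rfft2(kernel, s=shape)
    return np.fft.irfft2(spectrum, s=shape)


rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.normal(size=(3, 3))

direct = conv2d_full_direct(image, kernel)
via_fft = conv2d_full_fft(image, kernel)
print(np.allclose(direct, via_fft))  # True
```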

## Acknowledgements
- Nielsen, Neural Networks and Deep Learning, Determination Press, 2015
- Goodfellow et al., Deep Learning, MIT Press, 2016
- Gu et al., Recent Advances in Convolutional Neural Networks, arXiv, 2017
- Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv, 2015