# CNNs 101
In this notebook, we'll work through how to make a convolutional neural network (CNN) using Keras and TensorFlow. If you haven't yet installed these packages, please refer to the Installation notebook.

## Theory
CNNs are a class of neural networks in which one or more of the layers perform convolutions. A CNN does not need to consist entirely of convolution layers (indeed, one or more dense layers typically follow the convolutions), but it usually starts with several convolution layers. CNNs are well suited to image recognition tasks because they encode information about patterns among neighboring pixels in their input. A single "convolution layer" generally consists of three discrete operations: (1) a convolution operation, (2) an activation operation, and (3) a max pooling operation. We will now discuss these further.

### Convolutions
A _convolution_ is a mathematical operation in which each value of a function is combined with its neighboring values, weighted by a second function, to generate a "smoothed" output value. Mathematically, we can write a convolution as:

$$
(f * g)(t) \equiv \int_{-\infty}^\infty f(\tau) g(t-\tau) d\tau
$$
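
To make the integral more concrete, here is a minimal sketch of its discrete one-dimensional analogue, in which the integral becomes a sum over list indices. The function name `convolve_1d` and the example values are illustrative assumptions, and the window used here is a normalized constant window (a three-point moving average).

```python
def convolve_1d(f, g):
    """Discrete 'full' convolution: out[t] = sum over tau of f[tau] * g[t - tau]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for tau, f_val in enumerate(f):
        for k, g_val in enumerate(g):
            # Here k = t - tau, so this term contributes to out[tau + k].
            out[tau + k] += f_val * g_val
    return out


signal = [0.0, 0.0, 3.0, 0.0, 0.0]   # a single sharp spike
window = [1 / 3, 1 / 3, 1 / 3]        # normalized constant window

smoothed = convolve_1d(signal, window)
print(smoothed)
```

The spike is spread evenly over three neighboring output values, while the total (the sum of all values) is unchanged because the window sums to 1.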

In the continuous case, the function $g$ is typically a Gaussian function (or an approximation to one on a finite domain) or a "top hat" window (which has a normalized constant amplitude over some range). For a two-dimensional discrete convolution (which is the case for CNNs), given a two-dimensional input image $\vec{A} \in \mathbb{R}^{X\times Y}$ and a convolutional weights matrix $\vec{W} \in \mathbb{R}^{M \times N}$, the value in the output matrix $B(i,j)$ is given by:

$$
B(i,j) \equiv (A * W)(i, j) = \sum_{m=-\lfloor M/2 \rfloor}^{\lfloor M/2 \rfloor} \sum_{n=-\lfloor N/2 \rfloor}^{\lfloor N/2 \rfloor} W(m, n) \cdot A(i-m, j-n)
$$
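
The double sum can be written out directly as nested loops. The following is a minimal sketch, not an efficient implementation: the helper name `convolve_2d` is hypothetical, the kernel dimensions are assumed to be odd so that the offsets $m$ and $n$ are centered on zero, and pixels that fall outside the image are assumed to contribute zero.

```python
def convolve_2d(A, W):
    """B(i, j) = sum_m sum_n W(m, n) * A(i - m, j - n) for an odd-sized kernel W.

    Offsets m and n run from -half to +half; pixels outside A are treated as 0.
    """
    X, Y = len(A), len(A[0])
    M, N = len(W), len(W[0])
    half_m, half_n = M // 2, N // 2
    B = [[0.0] * Y for _ in range(X)]
    for i in range(X):
        for j in range(Y):
            total = 0.0
            for m in range(-half_m, half_m + 1):
                for n in range(-half_n, half_n + 1):
                    ii, jj = i - m, j - n
                    if 0 <= ii < X and 0 <= jj < Y:
                        # Shift the centered offsets to 0-based list indices.
                        total += W[m + half_m][n + half_n] * A[ii][jj]
            B[i][j] = total
    return B


# An image with a single bright pixel in the center.
A = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
W = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

B = convolve_2d(A, W)
print(B)  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
```

Convolving a single bright pixel reproduces the kernel itself, which is a handy sanity check on the index conventions. Note that deep learning libraries commonly skip the kernel flip implied by $A(i-m, j-n)$ and compute a cross-correlation instead; because the weights are learned, the distinction does not matter for training.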

The convolution thus has the effect of "mixing together" information in neighboring pixels in the input image. Initially, the values of the convolution window $\vec{W}$ are random, but as training progresses the values are updated to extract specific features of the input and hidden layers. These values can be positive or negative, and their magnitudes are typically of order 1 or smaller. When the network is fully trained, the convolution operation can be thought of as recognizing local patterns in the data.

When applying the convolution, one can specify the _padding_ and the _stride_ to use. _Padding_ adds extra values (typically zeros) around the edges of the input image so that the output image can be the same size as the input image. Without padding, if either dimension of the convolution kernel is greater than 1 (i.e., $M > 1$ or $N > 1$), the output image is necessarily smaller than the input. The _stride_ specifies how many pixels the convolutional kernel is shifted across the input between adjacent pixels of the output image. For instance, a stride of 2 means that the kernel uses only _every other_ pixel as a starting location, decreasing the size of the output image by a factor of 2 along the dimension of the stride.
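
The combined effect of kernel size, padding, and stride on the output size along one dimension can be summarized with the standard formula $\lfloor (n + 2p - k)/s \rfloor + 1$. The sketch below assumes that `padding` counts the values added to each side of the input; the helper name `conv_output_size` and the 28-pixel input are illustrative assumptions.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


no_padding = conv_output_size(28, kernel_size=3)                        # 26
same_size = conv_output_size(28, kernel_size=3, padding=1)              # 28
strided = conv_output_size(28, kernel_size=3, stride=2, padding=1)      # 14
print(no_padding, same_size, strided)
```

With a 3-pixel kernel, one pixel of padding on each side preserves the input size, and a stride of 2 then halves it.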

### Max Pooling
After the activation function has been applied to each pixel of the convolution output, a _max pooling_ step is typically applied. This reduces the dimensionality of the data, which means less computation in subsequent layers and fewer parameters to train in any dense layer that follows. It also serves to emphasize the most prominent features present in an output image.

A max pool, as its name implies, divides the image into equal chunks (or pools) and simply returns the maximum value within each pool. A typical pool size is a 2x2 square, in which case the max pool returns the largest of those 4 values as the new "output pixel". Like an unpadded convolution or a stride greater than 1, this reduces the size of the output layer.
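
As a small worked example, the sketch below applies a 2x2 max pool to a 4x4 image using plain Python lists. The helper name `max_pool_2d` and the pixel values are illustrative assumptions, and the pools are assumed not to overlap.

```python
def max_pool_2d(image, pool=2):
    """Split the image into non-overlapping pool x pool blocks and keep each block's maximum."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r + dr][c + dc] for dr in range(pool) for dc in range(pool))
            for c in range(0, cols - pool + 1, pool)
        ]
        for r in range(0, rows - pool + 1, pool)
    ]


image = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 3],
]

pooled = max_pool_2d(image)
print(pooled)  # [[4, 5], [6, 7]]
```

Each 2x2 block of the input collapses to a single output pixel, so the 4x4 image becomes 2x2.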

#### Visualize it!
Try to sketch out what is happening when a max pooling step is applied.

# Developing a CNN in Keras
Now that we've covered some of the basics of CNNs, let's build one using Keras. For the sake of example, we'll work through the same application as yesterday (classifying hand-written digits), but this time using a CNN architecture.