- so far, we've talked about neural networks in general
- but if you know something about your data, for example, if it's an image or a sequence of things, you can do a lot better
- one of the very popular ways to structure a neural network is called a Convolutional Neural Network
- it's become incredibly popular for things like image processing and processing large datasets

# Color

- if your data has some structure, and your network doesn't have to learn that structure from scratch, it's going to perform better
- imagine, for example, that you're trying to classify those letters, and you know that color is really not a factor in what makes an A an A

<img src="resources/color_doesnt_matter.png" style="width: 70%;"/>

- what do you think would be easier for your classifier to learn?
  - a model that uses the color image, or a model that only looks at the grayscale version?
    - intuitively, if a new letter comes up in a color that you've never seen before, a model that ignores color to begin with will have a much easier time classifying that letter

# Statistical Invariance

- here's another example
  - you have an image, and you want your network to say it's an image with a cat in it
  - it doesn't really matter where the cat is, it's still an image with a cat
  - if your network has to learn about kittens in the left corner, and about kittens in the right corner independently, that's a lot of work that it has to do
  - how about telling it explicitly, instead, that objects in images are largely the same whether they're on the left or on the right of the picture?
    - that's what's called **translation invariance**
      - different positions, same kitten

<img src="resources/translation_invariance_kittens.png" style="width: 30%;"/>

- yet another example
  - imagine you had a long text that talks about kittens
  - does the meaning of kitten change depending on whether it's in the first sentence or in the second one?
  - mostly not, so if you're trying to train a network on text, maybe you want the part of the network that learns what a kitten is to be reused every time you see the word kitten, and not have to re-learn it every time
  - the way you achieve this in your own networks is by using what's called **weight sharing**
    - when you know that two inputs can contain the same kind of information, then you share their weights
    - and train the weights jointly for those inputs
    - *statistical invariants*, things that don't change on average across time or space, are everywhere
    - for images, the idea of weight sharing will get us to study convolutional networks
      - for text and sequences in general, it will lead us to embeddings and recurrent neural networks
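- as a rough illustration of weight sharing (a toy sketch, not from the original notes; the function name and all numbers are made up), the same small set of weights can be reused at every position of a sequence, so a pattern gives the same response wherever it appears:
```python
# Hypothetical toy example: one shared "detector" reused at every position
# of a sequence. Names and values are illustrative assumptions.

def apply_shared_weights(sequence, weights, bias):
    """Slide one set of weights over the sequence, reusing it at each position."""
    width = len(weights)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        outputs.append(sum(w * x for w, x in zip(weights, window)) + bias)
    return outputs


pattern = [1.0, 2.0, 1.0]      # the "kitten" we want to detect
weights = [1.0, 2.0, 1.0]      # a single shared set of weights
early = pattern + [0.0] * 4    # pattern at the start of the sequence
late = [0.0] * 4 + pattern     # the same pattern at the end

early_response = apply_shared_weights(early, weights, bias=0.0)
late_response = apply_shared_weights(late, weights, bias=0.0)

# The peak response is identical; only its position moves.
print(max(early_response), early_response.index(max(early_response)))  # 6.0 0
print(max(late_response), late_response.index(max(late_response)))     # 6.0 4
```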

<img src="resources/weight_sharing_kittens.png" style="width: 70%;"/>

# Convolutional Networks

- let's talk about Convolutional Networks, or **ConvNets**
- ConvNets are neural networks that share their parameters across space


- imagine you have an image
- it can be represented as a flat pancake
  - it has a width, a height, and because you typically have red, green, and blue channels, it also has a depth
  - in this instance, depth is three
- that's your input

<img src="resources/input_pancake.png"/>

- now, imagine taking a small patch of this image, and running a tiny neural network on it, with say, $K$ outputs
- let's represent those outputs vertically in a tiny column like this

<img src="resources/outputs_column_slider.png"/>

- now, let's slide that little neural network across the image without changing the weights
  - just slide it across horizontally and vertically, like we're painting the image with a brush
- on the output, we've drawn another image

<img src="resources/convolution.png"/>

- it's got a different width, a different height and more importantly, it's got a different depth
- instead of just R, G, and B, now you have an output that's got many color channels, $K$ of them
- this operation is called the **convolution**
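- the sliding-patch idea can be sketched in plain Python (a minimal, unoptimized illustration under assumed shapes; `convolve` is a hypothetical helper, and the "tiny neural network" is reduced to one linear unit per output channel with no padding):
```python
def convolve(image, filters, biases, stride=1):
    """Slide K small filters over an H x W x D image (no padding).

    image:   nested lists, shape [H][W][D]
    filters: nested lists, shape [K][F][F][D] (the shared weights)
    biases:  list of K numbers
    Returns nested lists of shape [H_out][W_out][K].
    """
    height, width = len(image), len(image[0])
    size = len(filters[0])
    output = []
    for top in range(0, height - size + 1, stride):
        row = []
        for left in range(0, width - size + 1, stride):
            column = []  # the K outputs for this patch
            for kernel, bias in zip(filters, biases):
                total = bias
                for i in range(size):
                    for j in range(size):
                        pixel = image[top + i][left + j]
                        total += sum(w * x for w, x in zip(kernel[i][j], pixel))
                column.append(total)
            row.append(column)
        output.append(row)
    return output


# Toy input: a 6 x 6 image of depth 3 (all ones) and K = 5 filters of size 3 x 3.
image = [[[1.0] * 3 for _ in range(6)] for _ in range(6)]
filters = [[[[0.1 * (k + 1)] * 3 for _ in range(3)] for _ in range(3)]
           for k in range(5)]
biases = [0.0] * 5

result = convolve(image, filters, biases, stride=1)
print(len(result), len(result[0]), len(result[0][0]))  # 4 4 5
```
- the output is narrower and shorter than the input (4 x 4 instead of 6 x 6) but deeper (5 channels instead of 3), and the same `filters` are used at every position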


- if your patch size were the size of the whole image, it would be no different from a regular layer of a neural network
- but because we have this small patch instead, we have many fewer weights and they are shared across space
- a ConvNet is going to basically be a deep network where instead of having stacks of matrix multiply layers, we're going to have stacks of convolutions
  - the general idea is that they will form a pyramid
  - at the bottom, you have this big image, but very shallow: just R, G, and B
  - you're going to apply convolutions that progressively squeeze the spatial dimensions while increasing the depth, which corresponds roughly to the semantic complexity of your representation
  - at the top, you can put your classifier; you have a representation where all this spatial information has been squeezed out, and only parameters that map to the content of the image remain

<img src="resources/pyramid_convolutions.png" style="width:70%"/>

- so that’s the general idea
- if you're going to implement this, there are lots of little details to get right, and a fair bit of lingo to get used to
- you've met the concepts of *patch* and *depth*
- patches are sometimes called *kernels*

<img src="resources/patch_and_depth.png" style="width:70%"/>

- each pancake in your stack is called a feature map
- here, you're mapping three feature maps to $K$ feature maps

<img src="resources/feature_maps.png" style="width:70%"/>

- another term that you need to know is *stride*
  - it's the number of pixels that you're shifting each time you move your filter
  - a stride of one makes the output roughly the same size as the input
  - a stride of two means it's about half the size
  
<img src="resources/stride_of_one.png"/>

<img src="resources/stride_of_two.png"/>


- I say roughly because it depends a bit on what you do at the edge of your image
  - either you don't go past the edge
    - it's often called *valid padding* as a shortcut
  - or you go off the edge and pad with zeros in such a way that the output map size is exactly the same size as the input map
    - that is often called *same padding* as a shortcut
    
<img src="resources/valid_padding.png"/>

<img src="resources/same_padding.png"/>

# Intuition for CNN

- let's develop better intuition for how Convolutional Neural Networks (CNN) work
- we'll examine how humans classify images, and then see how CNNs use similar approaches


- let’s say we wanted to classify the following image of a dog as a Golden Retriever
- as humans, how do we do this?
  - one thing we do is that we identify certain parts of the dog, such as the nose, the eyes, and the fur
  - we essentially break up the image into smaller pieces, recognize the smaller pieces, and then combine those pieces to get an idea of the overall dog
  - in this case, we might break down the image into a combination of the following:
    - a nose, two eyes, golden fur
  - but let’s take this one step further
    - how do we determine what exactly a nose is?
      - a Golden Retriever's nose can be seen as an oval with two black holes inside it
      - one way of classifying a Retriever’s nose is to break it up into smaller pieces and look for black holes (nostrils) and curves that define an oval, as shown below


- broadly speaking, this is what a CNN learns to do
- it learns to recognize basic lines and curves, then shapes and blobs, and then increasingly complex objects within the image
- finally, the CNN classifies the image by combining the larger, more complex objects


- in our case, the levels in the hierarchy are:
  - simple shapes, like ovals and dark circles
  - complex objects (combinations of simple shapes), like eyes, nose, and fur
  - the dog as a whole (a combination of complex objects)


- with deep learning, we don't actually program the CNN to recognize these specific features
- rather, the CNN learns on its own to recognize such objects through forward propagation and backpropagation!
- it's amazing how well a CNN can learn to classify images, even though we never program the CNN with information about specific features to look for

<img src="resources/golder_retriever_hierarchy_diagram.jpg"/>

- a CNN might have several layers, and each layer might capture a different level in the hierarchy of objects
  - the first layer is the lowest level in the hierarchy, where the CNN generally classifies small parts of the image into simple shapes like horizontal and vertical lines and simple blobs of colors
  - the subsequent layers tend to be higher levels in the hierarchy and generally classify more complex ideas like shapes (combinations of lines), and eventually full objects like dogs.


- once again, the CNN ***learns all of this on its own***
- we don't ever have to tell the CNN to go looking for lines or curves or noses or fur
- the CNN just learns from the training set and discovers which characteristics of a Golden Retriever are worth looking for

# Filters

## Breaking up an Image

- the first step for a CNN is to break up the image into smaller pieces
- we do this by selecting a width and height that defines a filter
- the filter looks at small pieces, or patches, of the image
  - these patches are the same size as the filter

<img src="resources/patches_kitten.png" style="width:60%"/>

- we then simply slide this filter horizontally or vertically to focus on a different piece of the image
- the amount by which the filter slides is referred to as the *stride*
  - the stride is a hyperparameter which you, the engineer, can tune
  - increasing the stride reduces the size of your model by reducing the number of total patches each layer observes
    - however, this usually comes with a reduction in accuracy


- let’s look at an example
- in this zoomed in image of the dog, we first start with the patch outlined in red
- the width and height of our filter define the size of this square

<img src="resources/golden_retriever_one_patch_1.png"/>

- we then move the square over to the right by a given stride (2 in this case) to get another patch

<img src="resources/golden_retriever_one_patch_2.png"/>

- what's important here is that we are grouping together adjacent pixels and treating them as a collective
- in a normal, non-convolutional neural network, we would have ignored this adjacency
- in a normal network, we would have connected every pixel in the input image to a neuron in the next layer
- in doing so, we would not have taken advantage of the fact that pixels in an image are close together for a reason and have special meaning
- by taking advantage of this local structure, our CNN learns to classify local patterns, like shapes and objects, in an image

- https://www.youtube.com/watch?v=YRhxdVk_sIs

## Filter Depth

- it's common to have more than one filter
- different filters pick up different qualities of a patch
- for example, one filter might look for a particular color, while another might look for a kind of object of a specific shape
- the number of filters in a convolutional layer is called the *filter depth*


- how many neurons does each patch connect to?
  - that’s dependent on our filter depth
  - if we have a depth of $K$, we connect each patch of pixels to $K$ neurons in the next layer
    - this gives us the height of $K$ in the next layer, as shown below
    - in practice, $K$ is a hyperparameter we tune, and most CNNs tend to pick the same starting values
    
<img src="resources/filter_depth.png"/>

- but why connect a single patch to multiple neurons in the next layer; isn’t one neuron good enough?
  - multiple neurons can be useful because a patch can have multiple interesting characteristics that we want to capture
  - for example, one patch might include some white teeth, some blonde whiskers, and part of a red tongue
  - in that case, we might want a filter depth of at least three - one for each of teeth, whiskers, and tongue


- having multiple neurons for a given patch ensures that our CNN can learn to capture whatever characteristics the CNN learns are important
- remember that the CNN isn't "programmed" to look for certain characteristics
  - rather, it learns on its own which characteristics to notice

# Feature Map Sizes

- imagine that you have a 28 x 28 image
- you run a 3 by 3 convolution on it, with an input depth of 3, and an output depth of 8
- what are the dimensions of your output feature maps?
  - when you're using same padding with a stride of 1?
  - when you're using valid padding with a stride of 1?
  - when you're using valid padding with a stride of 2?

<img src="resources/feature_map_original.png"/>

- if you're using the so-called *same padding* with a stride of 1, the output width and height are the same as the input
  - we just add zeros to the input image to make the sizes match
  - in this case, width and height are 28 and 28

<img src="resources/feature_map_same_padding.png"/>

- if you use the so-called *valid padding* with a stride of 1, then there is no padding at all
  - if you want to fit your little filter on the input image without doing any padding, you're going to have to remove one row and one column of the image on each side
  - so in this case you're left with 26 features along each dimension of the maps at the output

<img src="resources/feature_map_valid_padding.png"/>

- if in addition you use a stride of two, then you only get half as many outputs
  - so in this case, 13 in width and 13 in height

<img src="resources/feature_map_stride_two.png"/>

- in all cases, the output depth isn't changed
  - in this case, it stays at 8
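- one way to sanity-check these sizes is to count the positions where the filter can sit along one dimension (a small sketch; `count_positions` is a hypothetical helper, and it assumes same padding for a 3-wide filter at stride 1 means one row or column of zeros on each side):
```python
def count_positions(size, filter_size, stride, pad_each_side=0):
    """Count where a filter can sit along one dimension of a (padded) input."""
    padded = size + 2 * pad_each_side
    return len(range(0, padded - filter_size + 1, stride))


# 28 x 28 input, 3 x 3 filter
same_stride_1 = count_positions(28, 3, stride=1, pad_each_side=1)
valid_stride_1 = count_positions(28, 3, stride=1)
valid_stride_2 = count_positions(28, 3, stride=2)
print(same_stride_1, valid_stride_1, valid_stride_2)  # 28 26 13
```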

- https://www.youtube.com/watch?v=qSTv_m-KFk0

# Convolutions continued

- that's it, you can build a simple ConvNet with just this
- stack up your convolutions which thankfully you don't have to implement yourselves
- then use strides to reduce the dimensionality and increase the depth of your network layer after layer
- once you have a deep and narrow representation, connect the whole thing to a few regular, fully connected layers, and you're ready to train your classifier

<img src="resources/convolutions_layers_classifier.png"/>

- you might wonder what happens to training, and to the chain rule in particular, when you use shared weights like this
  - nothing really happens, the math just works
  - you just add up the derivatives for all the possible locations on the image: $\dfrac{\Delta \mathcal{L}}{\Delta W} = \dfrac{\Delta \mathcal{L}}{\Delta W}(X_1) + \dfrac{\Delta \mathcal{L}}{\Delta W}(X_2)$
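- a tiny numerical check of this rule (a sketch under assumed values: a single shared weight `w`, two input locations, and a squared-error loss chosen only for illustration):
```python
def loss(w, inputs, targets):
    """Squared-error loss when the same weight w is applied to every input."""
    return sum(0.5 * (w * x - t) ** 2 for x, t in zip(inputs, targets))


def gradient_at_location(w, x, t):
    """Derivative of one location's loss term with respect to the shared weight."""
    return (w * x - t) * x


w = 0.5
inputs = [2.0, -1.0]   # X_1 and X_2: two locations that share w
targets = [3.0, 1.0]

# Sum of the per-location derivatives.
summed = sum(gradient_at_location(w, x, t) for x, t in zip(inputs, targets))

# Central finite difference on the full loss, as an independent check.
eps = 1e-6
numeric = (loss(w + eps, inputs, targets) - loss(w - eps, inputs, targets)) / (2 * eps)

print(summed, numeric)  # both are approximately -2.5
```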

<img src="resources/chain_rule_shared_weights.png"/>

# Parameters

## Parameter Sharing

- when we are trying to classify a picture of a cat, we don’t care where in the image a cat is
- if it’s in the top left or the bottom right, it’s still a cat in our eyes
- we would like our CNNs to also possess this ability known as **translation invariance**
- as we saw earlier, the classification of a given patch in an image is determined by the weights and biases corresponding to that patch
- if we want a cat that’s in the top left patch to be classified in the same way as a cat in the bottom right patch, we need the weights and biases corresponding to those patches to be the same, so that they are classified the same way
- this is exactly what we do in CNNs
  - the weights and biases we learn for a given output layer are shared across all patches in a given input layer
  - note that as we increase the depth of our filter, the number of weights and biases we have to learn still increases, as the weights aren't shared across the output channels


- there’s an additional benefit to sharing our parameters
- if we did not reuse the same weights across all patches, we would have to learn new parameters for every single patch and hidden layer neuron pair
- this does not scale well, especially for higher fidelity images
- thus, sharing parameters not only helps us with translation invariance, but also gives us a smaller, more scalable model

## Padding

<img src="resources/5v5_grid_with_3v3_padding.png"/>

- let's say we have a $5x5$ grid and a filter of size $3x3$ (as shown above) with a stride of $1$
- what's the width and height of the next layer?
  - it's $3x3$
  - we see that we can fit at most three patches in each direction, giving us a dimension of $3x3$ in our next layer


- as we can see, the width and height of each subsequent layer decreases in the above scheme
- in an ideal world, we'd be able to maintain the same width and height across layers so that we can continue to add layers without worrying about the dimensionality shrinking and so that we have consistency
- one way to achieve this is to simply add a border of $0$s to our original $5x5$ image
  - you can see what this looks like in the below image

<img src="resources/5v5_grid_with_3v3_padding_and_0s_border.png"/>

- this would expand our original image to a $7x7$
- with this, we now see how our next layer's size is again a $5x5$, keeping our dimensionality consistent
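- the same bookkeeping in code (a minimal sketch; `zero_pad` is a hypothetical helper written for this example):
```python
def zero_pad(grid, pad):
    """Surround a 2-D grid (list of rows) with `pad` rings of zeros."""
    width = len(grid[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in grid:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


grid = [[1] * 5 for _ in range(5)]   # the 5x5 input
padded = zero_pad(grid, pad=1)       # becomes 7x7

filter_size, stride = 3, 1
positions = len(range(0, len(padded) - filter_size + 1, stride))
print(len(padded), len(padded[0]), positions)  # 7 7 5
```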

## Dimensionality

- from what we've learned so far, how can we calculate the number of neurons of each layer in our CNN?
- given:
  - our input layer has a width of $W$ and a height of $H$
  - our convolutional layer has a filter size $F$
  - we have a stride of $S$
  - a padding of $P$
  - the number of filters $K$


- the following formula gives us the width of the next layer: $W_{out} = [(W-F+2P)/S] + 1$
- the output height would be $H_{out} = [(H-F+2P)/S] + 1$
- the output depth would be equal to the number of filters $D_{out} = K$
- the output volume would be $W_{out} * H_{out} * D_{out}$


- knowing the dimensionality of each additional layer helps us understand how large our model is and how our decisions around filter size and stride affect the size of our network
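- the formulas translate directly into a small helper (a sketch; `conv_output_volume` is a hypothetical name, and the square brackets are assumed to mean rounding down):
```python
def conv_output_volume(width, height, filter_size, stride, padding, num_filters):
    """Apply W_out = (W - F + 2P) / S + 1 (and likewise for H), rounding down."""
    w_out = (width - filter_size + 2 * padding) // stride + 1
    h_out = (height - filter_size + 2 * padding) // stride + 1
    d_out = num_filters
    return w_out, h_out, d_out, w_out * h_out * d_out


# 28x28 input, 3x3 filters, stride 2, no padding, 8 filters
print(conv_output_volume(28, 28, 3, 2, 0, 8))  # (13, 13, 8, 1352)

# 5x5 input, 3x3 filter, stride 1, one ring of zero padding, 1 filter
print(conv_output_volume(5, 5, 3, 1, 1, 1))    # (5, 5, 1, 25)
```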

## Quiz: Convolution Output Shape

- for the next few quizzes we'll test your understanding of the dimensions in CNNs
- understanding dimensions will help you make accurate tradeoffs between model size and performance
- as you'll see, some parameters have a much bigger impact on model size than others


- setup
  - H = height, W = width, D = depth
  - we have an input of shape $32x32x3$ (HxWxD)
  - $20$ filters of shape $8x8x3$ (HxWxD)
  - a stride of $2$ for both the height and width (S)
  - with padding of size $1$ (P)


- recall the formula for calculating the new height or width:
```python
new_height = (input_height - filter_height + 2 * P)/S + 1
new_width = (input_width - filter_width + 2 * P)/S + 1
```

**Q:** What's the shape of the output?
<br/>
**A:** It's 14x14x20 (HxWxD).


- this would correspond to the following code:
```python
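import tensorflow as tf  # written against the TensorFlow 1.x API (tf.placeholder, tf.truncated_normal)

# a batch of 32x32 RGB images; the batch size is left unspecified
input = tf.placeholder(tf.float32, (None, 32, 32, 3))
filter_weights = tf.Variable(tf.truncated_normal((8, 8, 3, 20))) # (height, width, input_depth, output_depth)
filter_bias = tf.Variable(tf.zeros(20))
strides = [1, 2, 2, 1] # (batch, height, width, depth)
padding = 'SAME'
# the bias is broadcast over the last (channel) dimension
conv = tf.nn.conv2d(input, filter_weights, strides, padding) + filter_bias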
```

- note the output shape of `conv` will be [None, 16, 16, 20], where `None` stands for the unspecified batch size
- it's 4D to account for batch size, but more importantly, the height and width are 16, not 14
- this is because the padding algorithm TensorFlow uses is not exactly the same as the one above
- an alternative is to switch `padding` from `'SAME'` to `'VALID'`, which would result in an output shape of [None, 13, 13, 20]
- if you're curious how padding works in TensorFlow, read [this document](https://www.tensorflow.org/api_docs/python/tf/nn/convolution)
  - in summary, TensorFlow uses the following equation for `'SAME'` vs `'VALID'`
  - SAME Padding, the output height and width are computed as:
    - `out_height` = ceil(float(in_height) / float(strides[1]))
    - `out_width` = ceil(float(in_width) / float(strides[2]))
  - VALID Padding, the output height and width are computed as:
    - `out_height` = ceil(float(in_height - filter_height + 1) / float(strides[1]))
    - `out_width` = ceil(float(in_width - filter_width + 1) / float(strides[2]))
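- these two formulas can be checked in plain Python without TensorFlow (a sketch; `tf_style_output_size` is a hypothetical helper, not a TensorFlow function):
```python
import math


def tf_style_output_size(in_size, filter_size, stride, padding):
    """Output height or width using the 'SAME' / 'VALID' formulas listed above."""
    if padding == 'SAME':
        return math.ceil(in_size / stride)
    if padding == 'VALID':
        return math.ceil((in_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


# The quiz setup: 32x32 input, 8x8 filter, stride 2.
print(tf_style_output_size(32, 8, 2, 'SAME'))   # 16
print(tf_style_output_size(32, 8, 2, 'VALID'))  # 13
```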

## Quiz: Number of Parameters

- we're now going to calculate the number of parameters of the convolutional layer
- the answer from the last quiz will come into play here!
- being able to calculate the number of parameters in a neural network is useful since we want to have control over how much memory a neural network uses


- setup
  - H = height, W = width, D = depth
  - we have an input of shape $32x32x3$ (HxWxD)
  - $20$ filters of shape $8x8x3$ (HxWxD)
  - a stride of $2$ for both the height and width (S)
  - zero padding of size $1$ (P)


- output layer: $14x14x20$ (HxWxD)


- hint
  - without parameter sharing, each neuron in the output layer must connect to each neuron in the filter
  - in addition, each neuron in the output layer must also connect to a single bias neuron
  

**Q:** How many parameters does the convolutional layer have (without parameter sharing)?
<br/>
**A:** There are $756560$ total parameters: $(8 * 8 * 3 + 1) * (14 * 14 * 20) = 756560$. That's a HUGE amount!
- 8 * 8 * 3 is the number of weights, we add 1 for the bias
- remember, each weight is assigned to every single part of the output (14 * 14 * 20)
- so we multiply these two numbers together and we get the final answer

## Quiz: Parameter Sharing

- now we'd like you to calculate the number of parameters in the convolutional layer, if every neuron in the output layer shares its parameters with every other neuron in its same channel
- this is the number of parameters actually used in a convolution layer (`tf.nn.conv2d()`)


- setup
  - H = height, W = width, D = depth
  - we have an input of shape $32x32x3$ (HxWxD)
  - $20$ filters of shape $8x8x3$ (HxWxD)
  - a stride of $2$ for both the height and width (S)
  - zero padding of size $1$ (P)


- output layer: $14x14x20$ (HxWxD)


- hint
  - with parameter sharing, each neuron in an output channel shares its weights with every other neuron in that channel
  - so the number of parameters is equal to the number of neurons in the filter, plus a bias neuron, all multiplied by the number of channels in the output layer


**Q:** How many parameters does the convolution layer have (with parameter sharing)?
<br/>
**A:** There are $3860$ total parameters: $(8 * 8 * 3 + 1) * 20 = 3840 + 20 = 3860$. That's $196$ times fewer parameters!

- (8 * 8 * 3 + 1) * 20 = 3840 + 20 = 3860
  - that's 3840 weights and 20 biases
- this should look similar to the answer from the previous quiz; the difference being it's just 20 instead of (14 * 14 * 20)
- remember, with weight sharing we use the same filter for an entire depth slice
  - because of this we can get rid of 14 * 14 and be left with only 20
- with weight sharing there's no longer a connection between every parameter in the filter and every neuron in the output
- rather, the same filter is used throughout an entire depth slice
  - so, we need to figure out how many depth slices we have
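- both parameter counts can be reproduced with a few lines of arithmetic (a sketch using the quiz numbers; the variable names are only illustrative):
```python
filter_height, filter_width, input_depth = 8, 8, 3
out_height, out_width, out_depth = 14, 14, 20

params_per_neuron = filter_height * filter_width * input_depth + 1  # weights + 1 bias

# Without sharing: every output neuron owns its own copy of the filter.
without_sharing = params_per_neuron * (out_height * out_width * out_depth)

# With sharing: one filter (plus bias) per output channel.
with_sharing = params_per_neuron * out_depth

print(without_sharing, with_sharing, without_sharing // with_sharing)  # 756560 3860 196
```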

# Visualizing CNNs

- let’s look at an example CNN to see how it works in action
- the CNN we will look at is trained on ImageNet as described in [this paper](https://arxiv.org/abs/1311.2901) by Zeiler and Fergus
- in the images below (from the same paper), we’ll see *what* each layer in this network detects and see *how* each layer detects more and more complex ideas
  - the images are from Matthew Zeiler and Rob Fergus' [deep visualization toolbox](https://www.youtube.com/watch?v=ghEmQSxT6tw), which lets us visualize what each layer in a CNN focuses on

## Layer 1

<img src="resources/cnn_visualization_layer_1_grid.png"/>

- each image in the above grid represents an example pattern that causes the neurons in the first layer to activate
  - in other words, they are patterns that the first layer recognizes
  - the top left image shows a -45 degree line, while the middle top square shows a +45 degree line
  - these squares are shown below again for reference


- let's now see some example images that cause such activations
- the below grid of images all activated the -45 degree line
- notice how they are all selected despite the fact that they have different colors, gradients, and patterns

<img src="resources/cnn_visualization_layer_1_-45deg_grid.png"/>

- so, the first layer of our CNN clearly picks out very simple shapes and patterns like lines and blobs

## Layer 2

<img src="resources/cnn_visualization_layer_2_grid.png"/>

- the image above is a visualization of the second layer in the CNN
- notice how we are picking up more complex ideas like circles (second row, second column), stripes (first row, second column), and rectangles (bottom right)
- the gray grid on the left represents how this layer of the CNN activates (or "what it sees") based on the corresponding images from the grid on the right


- the second layer of the CNN captures complex ideas
- **the CNN learns to do this on its own**
  - there is no special instruction for the CNN to focus on more complex objects in deeper layers
  - that's just how it normally works out when you feed training data into a CNN

## Layer 3

<img src="resources/cnn_visualization_layer_3_grid.png"/>

- the third layer picks out complex combinations of features from the second layer
- these include things like grids, and honeycombs (top left), wheels (second row, second column), and even faces (third row, third column)


- we'll skip layer 4, which continues this progression, and jump right to the fifth and final layer of this CNN

## Layer 5

<img src="resources/cnn_visualization_layer_5_grid.png"/>

- the last layer picks out the highest order ideas that we care about for classification, like dog faces, bird faces, and bicycles
- the gray grid on the left represents how this layer of the CNN activates (or "what it sees") based on the corresponding images from the grid on the right

# TensorFlow Convolution Layer

- TensorFlow provides the `tf.nn.conv2d()` and `tf.nn.bias_add()` functions to create your own convolutional layers

```python
import tensorflow as tf  # written against the TensorFlow 1.x API

# Output depth
k_output = 64

# Image Properties
image_width = 10
image_height = 10
color_channels = 3

# Convolution filter
filter_size_width = 5
filter_size_height = 5

# Input/Image
input = tf.placeholder(
    tf.float32,
    shape=[None, image_height, image_width, color_channels])

# Weight and bias
weight = tf.Variable(tf.truncated_normal(
    [filter_size_height, filter_size_width, color_channels, k_output]))
bias = tf.Variable(tf.zeros(k_output))

# Apply Convolution
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
# Add bias
conv_layer = tf.nn.bias_add(conv_layer, bias)
# Apply activation function
conv_layer = tf.nn.relu(conv_layer)
```

- the code above uses the `tf.nn.conv2d()` function to compute the convolution with `weight` as the filter and `[1, 2, 2, 1]` for the strides
- TensorFlow uses a stride for each `input` dimension, `[batch, input_height, input_width, input_channels]`
- we are generally going to set the stride for `batch` and `input_channels` (i.e. the first and fourth element in the `strides` array) to be $1$


- you'll focus on changing `input_height` and `input_width` while setting `batch` and `input_channels` to $1$
- the `input_height` and `input_width` strides are for striding the filter over `input`
- this example code uses a stride of $2$ with a $5x5$ filter over `input`


- the `tf.nn.bias_add()` function adds a 1-d bias to the last dimension in a matrix
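- to make the bias and activation steps concrete, here is a plain-Python sketch of what they do to a feature map (the helpers below are hypothetical stand-ins written for illustration, not the TensorFlow functions themselves):
```python
def bias_add(feature_map, bias):
    """Add bias[k] to channel k of every pixel in an [H][W][K] feature map."""
    return [[[value + b for value, b in zip(pixel, bias)] for pixel in row]
            for row in feature_map]


def relu(feature_map):
    """Clamp negative activations to zero, element by element."""
    return [[[max(0.0, value) for value in pixel] for pixel in row]
            for row in feature_map]


# Toy 1 x 2 feature map with 3 channels (values are illustrative).
feature_map = [[[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]]
bias = [0.5, 0.5, -1.0]

activated = relu(bias_add(feature_map, bias))
print(activated)  # [[[1.5, 0.0, 0.0], [0.5, 3.5, 0.0]]]
```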

# Explore The Design Space

- now that you've seen what a simple ConvNet looks like, there are many things that we can do to improve it
- we're going to talk about three of them: pooling, 1x1 convolutions, and something a bit more advanced called the inception architecture

# Pooling

- **pooling** is a better way to reduce the spatial extent of your feature maps in the convolutional pyramid
- until now, we've used striding to shift the filters by a few pixels each time and reduce the feature map size
- this is a very aggressive way to downsample an image; it removes a lot of information
- what if, instead of skipping one in every two convolutions, we still ran with a very small stride, say $1$, but then took all the convolutions in a neighborhood and combined them somehow?
  - that operation is called *pooling*, and there are a few ways to go about it

<img src="resources/conv_stride_pooling.png"/>

- the most common is the *max pooling*
  - $y = max(X_i)$
  - at every point on the feature map, look at a small neighborhood around that point and compute the maximum of all the responses around it
  - there are some advantages to using max pooling
    - it doesn't add to your number of parameters, so you don't risk an increase in overfitting
    - it often simply yields a more accurate model
  - however, since the convolutions below it run at a lower stride, the model becomes a lot more expensive to compute
  - and now you have even more hyperparameters to worry about: the pooling region size and the pooling stride
    - they don't have to be the same


- a very typical architecture for a ConvNet is a few layers alternating convolutions and max pooling, followed by a few fully connected layers at the top
- the first famous model to use this architecture was LeNet-5, designed by Yann LeCun to do character recognition back in 1998
- modern convolutional networks, such as AlexNet, which famously won the competitive ImageNet object recognition challenge in 2012, used the same architecture with a few wrinkles

<img src="resources/CovNet_architecture_example.png"/>

- another notable form of pooling is *average pooling*
  - $y = mean(X_i)$
  - instead of taking the max, just take an average over the window of pixels around a specific location
  - it's a little bit like providing a blurred low resolution view of the feature map below
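- max and average pooling differ only in how each window is reduced, which a short sketch can show (plain Python; `pool` and `mean` are hypothetical helpers and the feature map values are made up):
```python
def pool(feature_map, size, stride, reduce_fn):
    """Pool a 2-D feature map (list of rows) with a size x size window."""
    rows, cols = len(feature_map), len(feature_map[0])
    output = []
    for top in range(0, rows - size + 1, stride):
        out_row = []
        for left in range(0, cols - size + 1, stride):
            window = [feature_map[top + i][left + j]
                      for i in range(size) for j in range(size)]
            out_row.append(reduce_fn(window))
        output.append(out_row)
    return output


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Illustrative 4 x 4 feature map (made-up values).
feature_map = [[1, 3, 2, 0],
               [5, 7, 4, 2],
               [0, 2, 8, 6],
               [2, 4, 6, 4]]

print(pool(feature_map, size=2, stride=2, reduce_fn=max))   # [[7, 4], [4, 8]]
print(pool(feature_map, size=2, stride=2, reduce_fn=mean))  # [[4.0, 2.0], [2.0, 6.0]]
```
- the pooling region size and the pooling stride are separate arguments here, so they can be set independently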

- https://www.youtube.com/watch?v=ZjM_XQa5s6s

# TensorFlow Max Pooling

<img src="resources/max_pooling_example.png"/>

- the image above is an example of [max pooling](https://en.wikipedia.org/wiki/Convolutional_neural_network#Pooling_layer) with a $2x2$ filter and stride of $2$
  - the four $2x2$ colors represent each time the filter was applied to find the maximum value
  - for example, $[[1, 0], [4, 6]]$ becomes $6$, because $6$ is the maximum value in this set
  - similarly, $[[2, 3], [6, 8]]$ becomes $8$


- conceptually, the benefit of the max pooling operation is to reduce the size of the input, and allow the neural network to focus on only the most important elements
- max pooling does this by only retaining the maximum value for each filtered area, and removing the remaining values


- TensorFlow provides the `tf.nn.max_pool()` function to apply max pooling to your convolutional layers

```python
...
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
conv_layer = tf.nn.bias_add(conv_layer, bias)
conv_layer = tf.nn.relu(conv_layer)
# Apply Max Pooling
conv_layer = tf.nn.max_pool(
    conv_layer,
    ksize=[1, 2, 2, 1],
    strides=[1, 2, 2, 1],
    padding='SAME')
```

- the `tf.nn.max_pool()` function performs max pooling with the `ksize` parameter as the size of the filter and the `strides` parameter as the length of the stride
  - $2x2$ filters with a stride of $2x2$ are common in practice
- the `ksize` and `strides` parameters are structured as 4-element lists, with each element corresponding to a dimension of the input tensor (`[batch, height, width, channels]`)
- for both `ksize` and `strides`, the batch and channel dimensions are typically set to $1$

## Quiz: Pooling Intuition

- the next few quizzes will test your understanding of pooling layers

**Q:** A pooling layer is generally used to ...
<br/>
**A:** The correct answer is *decrease the size of the output* and *prevent overfitting*. Reducing overfitting is a consequence of reducing the output size, which, in turn, reduces the number of parameters in future layers.

- recently, pooling layers have fallen out of favor
- some reasons are:
  - recent datasets are so big and complex we're more concerned about underfitting
  - *dropout* is a much better regularizer
  - pooling results in a loss of information (think about the max pooling operation as an example; we only keep the largest of $n$ numbers, thereby disregarding $n-1$ numbers completely)

## Quiz: Pooling Mechanics

- setup
  - H = height, W = width, D = depth
  - we have an input of shape $4x4x5$ (HxWxD)
  - filter of shape $2x2$ (HxW)
  - a stride of $2$ for both the height and width (S)


- recall the formula for calculating the new height or width:
  - `new_height = (input_height - filter_height)/S + 1`
  - `new_width = (input_width - filter_width)/S + 1`


- note
  - for a pooling layer the output depth is the same as the input depth
  - additionally, the pooling operation is applied individually for each depth slice


- the image below gives an example of how a max pooling layer works
- in this case, the max pooling filter has a shape of $2x2$
- as the max pooling filter slides across the input layer, the filter will output the maximum value of the $2x2$ square

<img src="resources/pooling_mechanics_quiz.png"/>

**Q:** What's the shape of the output? Format is HxWxD.
<br/>
**A:** It's $2x2x5$. It's calculated using the formulas above. The depth stays the same.

- here's the corresponding code:
```python
input = tf.placeholder(tf.float32, (None, 4, 4, 5))
filter_shape = [1, 2, 2, 1]
strides = [1, 2, 2, 1]
padding = 'VALID'
pool = tf.nn.max_pool(input, filter_shape, strides, padding)
```
- the output shape of `pool` will be `[None, 2, 2, 5]` (the placeholder leaves the batch dimension unspecified), even if `padding` is changed to `'SAME'`
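- as a rough sketch of why both padding modes give the same height and width here, the helper below applies the output-size rules TensorFlow documents for `'VALID'` and `'SAME'`
  - the function name `pooled_size` is hypothetical and only used for illustration

```python
import math


def pooled_size(input_size, filter_size, stride, padding):
    """Output height or width of a pooling (or convolution) layer."""
    if padding == 'VALID':
        # No padding: only windows that fit entirely inside the input count.
        return math.ceil((input_size - filter_size + 1) / stride)
    if padding == 'SAME':
        # The input is padded as needed, so only the stride matters.
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'VALID' or 'SAME'")


for padding in ('VALID', 'SAME'):
    size = pooled_size(4, 2, 2, padding)
    print(padding, (size, size, 5))  # the depth of 5 is unchanged by pooling
```

- the two modes only disagree when the filter does not tile the input evenly, for example a $5x5$ input with the same filter and stride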

## Quiz: Pooling Practice

- now let's practice doing some pooling operations manually

**Q:** What's the result of a max pooling operation on the input:
```python
[[[0, 1, 0.5, 10],
   [2, 2.5, 1, -8],
   [4, 0, 5, 6],
   [15, 1, 2, 3]]]
```
Assume the filter is 2x2 and the stride is 2 for both height and width. The output shape is 2x2x1.
The answering format will be 4 numbers, each separated by a comma, such as: 1,2,3,4.
Work from the top left to the bottom right.
<br/>
<br/>
**A:** It's $2.5,10,15,6$. We start with the four numbers in the top left corner. Then we work left-to-right and top-to-bottom, moving 2 units each time.
```python
max(0, 1, 2, 2.5) = 2.5
max(0.5, 10, 1, -8) = 10
max(4, 0, 15, 1) = 15
max(5, 6, 2, 3) = 6
```
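
- if you want to check the answer by machine, here is a minimal NumPy sketch (NumPy is an assumption here, the quiz itself is meant to be done by hand)
  - it relies on the filter size and the stride both being $2$, so the input splits cleanly into non-overlapping blocks

```python
import numpy as np

x = np.array([[0, 1, 0.5, 10],
              [2, 2.5, 1, -8],
              [4, 0, 5, 6],
              [15, 1, 2, 3]])

# Split the 4x4 input into a 2x2 grid of 2x2 blocks:
# axes are (block_row, row_in_block, block_col, col_in_block).
blocks = x.reshape(2, 2, 2, 2)

# Take the maximum inside each block.
pooled = blocks.max(axis=(1, 3))
print(pooled.ravel().tolist())  # [2.5, 10.0, 15.0, 6.0]
```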

## Quiz: Average Pooling

**Q:** What's the result of an average (or mean) pooling operation on the input:
```python
[[[0, 1, 0.5, 10],
   [2, 2.5, 1, -8],
   [4, 0, 5, 6],
   [15, 1, 2, 3]]]
```
Assume the filter is 2x2 and the stride is 2 for both height and width. The output shape is 2x2x1.
The answering format will be 4 numbers, each separated by a comma, such as: 1,2,3,4.
Answer to 3 decimal places. Work from the top left to the bottom right.
<br/>
<br/>
**A:** It's $1.375,0.875,5,4$. We start with the four numbers in the top left corner. Then we work left-to-right and top-to-bottom, moving 2 units each time.
```python
mean(0, 1, 2, 2.5) = 1.375
mean(0.5, 10, 1, -8) = 0.875
mean(4, 0, 15, 1) = 5
mean(5, 6, 2, 3) = 4
```
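
- the sliding-window view of pooling can be sketched with an explicit loop; swapping the reduction function switches between average and max pooling
  - `pool2d` is a hypothetical helper written for this note, not a TensorFlow function, and it assumes no padding

```python
import numpy as np


def pool2d(x, k, stride, reduce_fn):
    """Apply reduce_fn to every k x k window of a 2D array (no padding)."""
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = reduce_fn(window)
    return out


x = np.array([[0, 1, 0.5, 10],
              [2, 2.5, 1, -8],
              [4, 0, 5, 6],
              [15, 1, 2, 3]])

print(pool2d(x, k=2, stride=2, reduce_fn=np.mean).ravel().tolist())
```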

# 1x1 Convolutions

- I want to introduce you to another idea: *1 x 1 convolutions*
- you might wonder why anyone would ever want to use 1 x 1 convolutions
  - they’re not really looking at a patch of the image, just at one pixel

<img src="resources/1x1_convolutions.png"/>

- look at the classic convolution setting: it’s basically a small classifier for a patch of the image, but it's only a linear classifier
- but if you add a 1 x 1 convolution in the middle, suddenly you have a mini neural network running over the patch instead of a linear classifier
- interspersing your convolutions with 1 x 1 convolutions is a very inexpensive way to make your models deeper and have more parameters, without completely changing their structure
- they're also very cheap: if you go through the math, they're not really convolutions at all, just matrix multiplies, and they have relatively few parameters
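
- the "just a matrix multiply" claim can be checked with a small NumPy sketch
  - the feature map size and the channel counts below are made up for illustration

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 4x4 feature map with 8 input channels and 3 output channels.
height, width, in_depth, out_depth = 4, 4, 8, 3
feature_map = rng.standard_normal((height, width, in_depth))
weights = rng.standard_normal((in_depth, out_depth))  # a bank of 1x1 filters

# 1x1 convolution written as an explicit loop over pixels:
# each output pixel depends only on the input pixel at the same position.
conv_out = np.empty((height, width, out_depth))
for i in range(height):
    for j in range(width):
        conv_out[i, j] = feature_map[i, j] @ weights

# The same result as a single matrix multiply over the flattened pixels.
matmul_out = (feature_map.reshape(-1, in_depth) @ weights).reshape(
    height, width, out_depth)

print(np.allclose(conv_out, matmul_out))  # True
print(weights.size)  # number of weights: in_depth * out_depth = 24
```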

# Inception Module

- I mentioned all of these, average pooling and 1 x 1 convolutions, because I want to talk about a general strategy that has been very successful at creating ConvNets that are both smaller and better than ConvNets that simply use a pyramid of convolutions
  - it's called an *inception module*


- the idea is that at each layer of your ConvNet, you can make a choice: have a pooling operation or have a convolution
- then you need to decide: is it a 1x1, a 3x3, or a 5x5 convolution?
- all of these are actually beneficial to the modeling power of your network, so why choose?
  - let's use them all


- here is what an inception module looks like
  - instead of having a single convolution, you have a composition of four branches: average pooling followed by a 1 x 1 convolution, a 1 x 1 convolution on its own, a 1 x 1 followed by a 3 x 3, and a 1 x 1 followed by a 5 x 5
  - and at the top, you simply concatenate the output of each of them

<img src="resources/inception_module.png"/>

- it looks complicated, but what's interesting is that you can choose these parameters in such a way that the total number of parameters in your model is very small
- yet the model performs better than if you had a simple convolution
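
- to get a feel for how the parameter count can stay small, here is a back-of-the-envelope sketch for a single branch
  - it compares a plain 5 x 5 convolution with a 1 x 1 convolution followed by a 5 x 5
  - the channel counts are hypothetical and not taken from any particular network

```python
def conv_params(filter_size, in_depth, out_depth):
    """Weights plus biases of a filter_size x filter_size convolution."""
    return filter_size * filter_size * in_depth * out_depth + out_depth


# Hypothetical channel counts, chosen only for illustration.
in_depth, reduced_depth, out_depth = 256, 32, 64

# A plain 5x5 convolution applied directly to the input.
direct = conv_params(5, in_depth, out_depth)

# A 1x1 convolution that shrinks the depth first, followed by the 5x5.
bottleneck = (conv_params(1, in_depth, reduced_depth)
              + conv_params(5, reduced_depth, out_depth))

print(direct, bottleneck)  # 409664 59488
```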

# Convolutional Network in TensorFlow

- it's time to walk through an example Convolutional Neural Network (CNN) in TensorFlow
- the structure of this network follows the classic structure of CNNs, which is a mix of convolutional layers and max pooling, followed by fully-connected layers
- the code you'll be looking at is similar to what you saw in the segment on Deep Neural Network in TensorFlow in the previous lesson, except we restructured the architecture of this network as a CNN


- just like in that segment, here you'll study the line-by-line breakdown of the code
- if you want, you can even [download the code](https://d17h27t6h515a5.cloudfront.net/topher/2017/February/58a61ca1_cnn/cnn.zip) and run it yourself
- thanks to [Aymeric Damien](https://github.com/aymericdamien/TensorFlow-Examples) for providing the original TensorFlow model on which this segment is based

## Dataset

- you've seen this section of code from previous lessons
- here we're importing the MNIST dataset and using a convenient TensorFlow function to batch, scale, and One-Hot encode the data

```python
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets(".", one_hot=True, reshape=False)

import tensorflow as tf

# Parameters
learning_rate = 0.00001
epochs = 10
batch_size = 128

# Number of samples to calculate validation and accuracy
# Decrease this if you're running out of memory to calculate accuracy
test_valid_size = 256

# Network Parameters
n_classes = 10  # MNIST total classes (0-9 digits)
dropout = 0.75  # Dropout, probability to keep units
```

## Weights and Biases

```python
# Store layers weight & bias
weights = {
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
    'out': tf.Variable(tf.random_normal([1024, n_classes]))}

biases = {
    'bc1': tf.Variable(tf.random_normal([32])),
    'bc2': tf.Variable(tf.random_normal([64])),
    'bd1': tf.Variable(tf.random_normal([1024])),
    'out': tf.Variable(tf.random_normal([n_classes]))}
```

## Convolutions

<img src="resources/convolution_schematic.gif"/>

- the above is an example of a [convolution](https://en.wikipedia.org/wiki/Convolution) with a $3x3$ filter and a stride of $1$ being applied to data with a range of $0$ to $1$
- the convolution for each $3x3$ section is calculated against the weight, $[[1, 0, 1], [0, 1, 0], [1, 0, 1]]$, then a bias is added to create the convolved feature on the right
- in this case, the bias is $0$
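
- the same computation can be sketched by hand in NumPy
  - the input values of the image are not reproduced in the text, so the $5x5$ binary array below is only an assumed stand-in; the weight and bias are the ones listed above

```python
import numpy as np

# Assumed input: an illustrative 5x5 array with values in the range 0 to 1.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
weight = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
bias = 0

out_size = image.shape[0] - weight.shape[0] + 1  # stride 1, no padding
convolved = np.zeros((out_size, out_size), dtype=int)
for i in range(out_size):
    for j in range(out_size):
        # Multiply the 3x3 patch element-wise by the weight, sum, add the bias.
        patch = image[i:i + 3, j:j + 3]
        convolved[i, j] = np.sum(patch * weight) + bias

print(convolved)
```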


- in TensorFlow, this is all done using `tf.nn.conv2d()` and `tf.nn.bias_add()`

```python
def conv2d(x, W, b, strides=1):
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)
```
- the `tf.nn.conv2d()` function computes the convolution against weight $W$ as shown above


- in TensorFlow, `strides` is an array of $4$ elements
  - the first element in this array indicates the stride for batch
  - last element indicates stride for features
  - the middle two elements are the strides for height and width respectively  
- it's good practice to remove the batches or features you want to skip from the data set rather than use a stride to skip them
- you can always set the first and last element to $1$ in `strides` in order to use all batches and features
- I've mentioned stride as one number because you usually have a square stride where `height = width`
- when someone says they are using a stride of $3$, they usually mean `tf.nn.conv2d(x, W, strides=[1, 3, 3, 1])` 


- to make life easier, the code is using `tf.nn.bias_add()` to add the bias
- it is a special case of `tf.add()` that restricts the bias to one dimension and adds it along the channel dimension of `x`

## Max Pooling

<img src="resources/maxpool.jpeg"/>

- the above is an example of [max pooling](https://en.wikipedia.org/wiki/Convolutional_neural_network#Pooling_layer) with a $2x2$ filter and stride of $2$
  - the left square is the input and the right square is the output
  - the four $2x2$ colors in the input represent each time the filter was applied to create the max on the right side
  - for example, $[[1, 1], [5, 6]]$ becomes $6$ and $[[3, 2], [1, 2]]$ becomes $3$

```python
def maxpool2d(x, k=2):
    return tf.nn.max_pool(
        x,
        ksize=[1, k, k, 1],
        strides=[1, k, k, 1],
        padding='SAME')
```
- the `tf.nn.max_pool()` function does exactly what you would expect: it performs max pooling with the `ksize` parameter as the size of the filter

## Model

- in the code below, we're creating layers that alternate between convolutions and max pooling, followed by a fully connected layer and an output layer
- the transformation of each layer to its new dimensions is shown in the comments
- for example, the first layer shapes the images from $28x28x1$ to $28x28x32$ in the convolution step
- the next step applies max pooling, turning each sample into $14x14x32$
- all the layers are applied from `conv1` to output, producing $10$ class predictions

```python
def conv_net(x, weights, biases, dropout):
    # Layer 1 - 28*28*1 to 14*14*32
    conv1 = conv2d(x, weights['wc1'], biases['bc1'])
    conv1 = maxpool2d(conv1, k=2)

    # Layer 2 - 14*14*32 to 7*7*64
    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
    conv2 = maxpool2d(conv2, k=2)

    # Fully connected layer - 7*7*64 to 1024
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
    fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
    fc1 = tf.nn.relu(fc1)
    fc1 = tf.nn.dropout(fc1, dropout)

    # Output Layer - class prediction - 1024 to 10
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return out
```
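
- the shapes in the comments can be traced without running TensorFlow; this is a sketch that assumes `'SAME'` padding everywhere, as in the `conv2d` and `maxpool2d` helpers above
  - `same_size` is a hypothetical helper name

```python
import math


def same_size(size, stride):
    """Output height/width under 'SAME' padding: only the stride matters."""
    return math.ceil(size / stride)


h = w = 28
shapes = [(h, w, 1)]

for depth in (32, 64):  # output depths of wc1 and wc2
    # conv2d: stride 1 with 'SAME' padding keeps height and width.
    h, w = same_size(h, 1), same_size(w, 1)
    shapes.append((h, w, depth))
    # maxpool2d: k=2 halves height and width, the depth is unchanged.
    h, w = same_size(h, 2), same_size(w, 2)
    shapes.append((h, w, depth))

flattened = h * w * 64  # input size of the fully connected layer
print(shapes)
print(flattened)  # 3136, which matches 7*7*64 in weights['wd1']
```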

## Session

- now let's run it!

```python
# tf Graph input
x = tf.placeholder(tf.float32, [None, 28, 28, 1])
y = tf.placeholder(tf.float32, [None, n_classes])
keep_prob = tf.placeholder(tf.float32)

# Model
logits = conv_net(x, weights, biases, keep_prob)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Accuracy
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    for epoch in range(epochs):
        for batch in range(mnist.train.num_examples//batch_size):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            sess.run(optimizer, feed_dict={
                x: batch_x,
                y: batch_y,
                keep_prob: dropout})

            # Calculate batch loss and accuracy
            loss = sess.run(cost, feed_dict={
                x: batch_x,
                y: batch_y,
                keep_prob: 1.})
            valid_acc = sess.run(accuracy, feed_dict={
                x: mnist.validation.images[:test_valid_size],
                y: mnist.validation.labels[:test_valid_size],
                keep_prob: 1.})

            print('Epoch {:>2}, Batch {:>3} - '
                  'Loss: {:>10.4f} Validation Accuracy: {:.6f}'.format(
                epoch + 1,
                batch + 1,
                loss,
                valid_acc))

    # Calculate Test Accuracy
    test_acc = sess.run(accuracy, feed_dict={
        x: mnist.test.images[:test_valid_size],
        y: mnist.test.labels[:test_valid_size],
        keep_prob: 1.})
    print('Testing Accuracy: {}'.format(test_acc))
```

## TensorFlow Convolutional Layer Workspaces

- let's now apply what we've learned to build real CNNs in TensorFlow
- in the below exercise, you'll be asked to set up the dimensions of the Convolution filters, the weights, the biases
  - this is in many ways the trickiest part to using CNNs in TensorFlow
- once you have a sense of how to set up the dimensions of these attributes, applying CNNs will be far more straightforward


- review
  - you should go over the TensorFlow documentation for [2D convolutions](https://www.tensorflow.org/guide#Convolution)
  - most of the documentation is straightforward, except perhaps the `padding` argument
    - the padding might differ depending on whether you pass `'VALID'` or `'SAME'`
- here are a few more things worth reviewing:
  - introduction to TensorFlow -> [TensorFlow Variables](https://www.tensorflow.org/guide/variable)
  - how to determine the dimensions of the output based on the input size and the filter size (shown below)
    - you'll use this to determine what the size of your filter should be
```python
new_height = (input_height - filter_height + 2 * P)/S + 1
new_width = (input_width - filter_width + 2 * P)/S + 1
```
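
- the formula above can be wrapped in a small helper to try out different filter sizes
  - the name `conv_output_size` is hypothetical, and rounding down with integer division is an assumption for cases where the filter does not fit evenly

```python
def conv_output_size(input_size, filter_size, padding, stride):
    """Output height or width for an explicit padding P and stride S."""
    return (input_size - filter_size + 2 * padding) // stride + 1


# 4x4 input, 2x2 filter, no padding, stride 2 -> 2x2 output
print(conv_output_size(4, 2, padding=0, stride=2))
# 28x28 input, 5x5 filter, padding 2, stride 1 -> the size is preserved
print(conv_output_size(28, 5, padding=2, stride=1))
```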


- instructions
  - finish off each `TODO` in the `conv2d` function
  - set up the `strides`, `padding` and filter weight/bias (`F_W` and `F_b`) such that the output shape is `(1, 2, 2, 3)`
    - note that the filter weight and bias should be TensorFlow variables, while `strides` and `padding` are plain Python values

In [2]:
# NOTE: there's more than 1 way to get the correct output shape. Your answer might differ from mine

import tensorflow as tf
import numpy as np

"""
Setup the strides, padding and filter weight/bias such that
the output shape is (1, 2, 2, 3).
"""
# `tf.nn.conv2d` requires the input be 4D (batch_size, height, width, depth)
# (1, 4, 4, 1)
x = np.array(
    [[0, 1, 0.5, 10], [2, 2.5, 1, -8], [4, 0, 5, 6], [15, 1, 2, 3]], dtype=np.float32
).reshape((1, 4, 4, 1))
X = tf.constant(x)


def conv2d(input_array):
    # Filter (weights and bias)
    # The shape of the filter weight is (height, width, input_depth, output_depth)
    # The shape of the filter bias is (output_depth,)

    # TODO: Define the filter weights `F_W` and filter bias `F_b`.
    # NOTE: Remember to wrap them in `tf.Variable`, they are trainable parameters after all.
    F_W = tf.Variable(tf.truncated_normal((2, 2, 1, 3)))
    F_b = tf.Variable(tf.zeros(3))

    # TODO: Set the stride for each dimension (batch_size, height, width, depth)
    strides = [1, 2, 2, 1]

    # TODO: set the padding, either 'VALID' or 'SAME'.
    padding = "VALID"

    # https://www.tensorflow.org/versions/r0.11/api_docs/python/nn.html#conv2d
    # `tf.nn.conv2d` does not include the bias computation so we have to add it ourselves after.
    return tf.nn.conv2d(input_array, F_W, strides, padding) + F_b


output = conv2d(X)
output

<tf.Tensor 'add:0' shape=(1, 2, 2, 3) dtype=float32>

- I want to transform the input shape `(1, 4, 4, 1)` to `(1, 2, 2, 3)`
- I choose 'VALID' for the padding algorithm
  - I find it simpler to understand and it achieves the result I'm looking for
```python
out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width  = ceil(float(in_width - filter_width + 1) / float(strides[2]))
```

- plugging in the values:
```python
out_height = ceil(float(4 - 2 + 1) / float(2)) = ceil(1.5) = 2
out_width  = ceil(float(4 - 2 + 1) / float(2)) = ceil(1.5) = 2
```

- in order to change the depth from $1$ to $3$, I have to set the output depth of my filter appropriately:
```python
F_W = tf.Variable(tf.truncated_normal((2, 2, 1, 3))) # (height, width, input_depth, output_depth)
F_b = tf.Variable(tf.zeros(3)) # (output_depth)
```
- the input has a depth of $1$, so I set that as the `input_depth` of the filter
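
- to see where the `(1, 2, 2, 3)` shape comes from without a TensorFlow session, here is a naive NumPy sketch of a `'VALID'` convolution
  - `conv2d_valid` is a hypothetical helper, and the all-ones weights are placeholders rather than the random values used in the exercise

```python
import numpy as np


def conv2d_valid(x, filters, bias, stride):
    """Naive 'VALID' convolution for x of shape (batch, height, width, depth)."""
    f_h, f_w, _, out_depth = filters.shape
    batch, in_h, in_w, _ = x.shape
    out_h = (in_h - f_h) // stride + 1
    out_w = (in_w - f_w) // stride + 1
    out = np.zeros((batch, out_h, out_w, out_depth), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i * stride:i * stride + f_h,
                      j * stride:j * stride + f_w, :]
            # Sum over filter height, width and input depth per output channel.
            out[:, i, j, :] = np.einsum('bhwc,hwco->bo', patch, filters)
    return out + bias


x = np.array(
    [[0, 1, 0.5, 10], [2, 2.5, 1, -8], [4, 0, 5, 6], [15, 1, 2, 3]],
    dtype=np.float32,
).reshape((1, 4, 4, 1))
F_W = np.ones((2, 2, 1, 3), dtype=np.float32)  # placeholder weights
F_b = np.zeros(3, dtype=np.float32)

result = conv2d_valid(x, F_W, F_b, stride=2)
print(result.shape)  # (1, 2, 2, 3)
```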

## Using Pooling Layers in TensorFlow

- in the below exercise, you'll be asked to set up the dimensions of the pooling filters, strides, as well as the appropriate padding
- you should go over the TensorFlow documentation for `tf.nn.max_pool()`
- padding works the same as it does for a convolution


- instructions
  - finish off each *TODO* in the `maxpool` function
  - set up the `strides`, `padding` and `ksize` such that the output shape after pooling is `(1, 2, 2, 1)`

In [3]:
# NOTE: there's more than 1 way to get the correct output shape. Your answer might differ from mine

"""
Set the values to `strides` and `ksize` such that
the output shape after pooling is (1, 2, 2, 1).
"""
import tensorflow as tf
import numpy as np

# `tf.nn.max_pool` requires the input be 4D (batch_size, height, width, depth)
# (1, 4, 4, 1)
x = np.array(
    [[0, 1, 0.5, 10], [2, 2.5, 1, -8], [4, 0, 5, 6], [15, 1, 2, 3]], dtype=np.float32
).reshape((1, 4, 4, 1))
X = tf.constant(x)


def maxpool(input):
    # TODO: Set the ksize (filter size) for each dimension (batch_size, height, width, depth)
    ksize = [1, 2, 2, 1]
    # TODO: Set the stride for each dimension (batch_size, height, width, depth)
    strides = [1, 2, 2, 1]
    # TODO: set the padding, either 'VALID' or 'SAME'.
    padding = "VALID"
    # https://www.tensorflow.org/versions/r0.11/api_docs/python/nn.html#max_pool
    return tf.nn.max_pool(input, ksize, strides, padding)


out = maxpool(X)
out

<tf.Tensor 'MaxPool:0' shape=(1, 2, 2, 1) dtype=float32>

- I want to transform the input shape `(1, 4, 4, 1)` to `(1, 2, 2, 1)`
- I choose 'VALID' for the padding algorithm
  - I find it simpler to understand and it achieves the result I'm looking for
```python
out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width  = ceil(float(in_width - filter_width + 1) / float(strides[2]))
```

- plugging in the values:
```python
out_height = ceil(float(4 - 2 + 1) / float(2)) = ceil(1.5) = 2
out_width  = ceil(float(4 - 2 + 1) / float(2)) = ceil(1.5) = 2
```
- the depth doesn't change during a pooling operation so I don't have to worry about that

# CNNs - Additional Resources

- there are many wonderful free resources that allow you to go into more depth around Convolutional Neural Networks
- in this course, our goal is to give you just enough intuition to start applying these concepts to real-world problems, and enough exposure to explore further on your own
- we strongly encourage you to explore some of these resources more to reinforce your intuition and explore different ideas


- these are the resources we recommend in particular:
  - Andrej Karpathy's [CS231n Stanford course](https://cs231n.github.io/) on Convolutional Neural Networks
  - Michael Nielsen's [free book](http://neuralnetworksanddeeplearning.com/) on Deep Learning
  - Goodfellow, Bengio, and Courville's more advanced [free book](https://www.deeplearningbook.org/) on Deep Learning