1. [Weight Sharing](#Weight Sharing)
2. [Convolutional Networks](#Convolutional Networks)
    1. [Intuition](#Intution)
3. [Filters](#Filters)
    1. [Breaking up an Image](#Breaking up an Image)
    2. [Filter Depth](#Filter Depth)
4. [Parameters](#Parameters)
    1. [Parameter Sharing](#Parameter Sharing)
    2. [Padding](#Padding)
    3. [Dimensionality](#Dimensionality)
5. [Visualizing CNNs](#Visualizing CNNs)
    1. [Layer 1](#Layer 1)
    2. [Layer 2](#Layer 2)
    3. [Layer 3](#Layer 3)
    4. [Layer 5](#Layer 5)
6. [TensorFlow Convolution Layer](#TensorFlow Convolution Layer)
7. [Explore The Design Space](#Explore The Design Space)
    1. [Pooling](#Pooling)
    2. [TensorFlow Max Pooling](#TensorFlow Max Pooling)
    3. [1 x 1 Convolutions](#1 x 1 Convolutions)
    4. [Inception Module](#Inception Module)
8. [Convolutional Network in TensorFlow](#Convolutional Network in TensorFlow)
    1. [Dataset](#Dataset)
    2. [Weights and Biases](#Weights and Biases)
    3. [Convolutions](#Convolutions)
    4. [Max Pooling](#Max Pooling)
    5. [Model](#Model)

# 1. Weight Sharing <a name='Weight Sharing'></a>

If our data has some structure and our network doesn't have to learn that structure from scratch, it's going to perform better.

Imagine, for example, that we're trying to classify letters, and we know that color is really not a factor in what makes an A an A. What would be easier for the classifier to learn: a model that uses the color image, or a model that only looks at the grayscale version?

The answer is grayscale. If color doesn't matter, it might help to reduce the complexity of the problem by combining color channels into a single monochromatic channel. Taking the average $\frac{R+G+B}{3}$ is one way of doing it; however, there are other transformations that might be more effective or closer to how humans perceive color (e.g. converting to YUV and using the Y channel).
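
The two conversions mentioned above can be compared with a minimal sketch. It assumes channel values are floats in the range 0 to 255 and uses the ITU-R BT.601 luma weights for the Y channel; the function names are made up for illustration.

```python
# Two ways to collapse an RGB pixel into one grayscale value.
# Assumption: channel values are floats in [0, 255].

def gray_average(r, g, b):
    """Plain average of the three channels, (R + G + B) / 3."""
    return (r + g + b) / 3.0

def gray_luma(r, g, b):
    """Y channel of YUV, using the ITU-R BT.601 luma weights."""
    return 0.299 * r + 0.587 * g + 0.114 * b

pure_green = (0.0, 255.0, 0.0)
pure_blue = (0.0, 0.0, 255.0)

# The plain average cannot tell these two colors apart ...
avg_green = gray_average(*pure_green)
avg_blue = gray_average(*pure_blue)

# ... while the luma weighting keeps green brighter than blue,
# which is closer to how people perceive the two colors.
luma_green = gray_luma(*pure_green)
luma_blue = gray_luma(*pure_blue)

print(avg_green, avg_blue)
print(luma_green, luma_blue)
```

Either way, the classifier receives one channel instead of three, so the first layer has a third as many input values to deal with.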

Another example is that we have an image and we want our network to say it's an image with a cat in it. It doesn't really matter where the cat is; it's still an image with a cat. If the network has to learn about kittens in the left corner and about kittens in the right corner independently, that's a lot of work that it has to do.

<img src='Figures5/Screen Shot 2017-03-27 at 23.47.12.png' width=400>

How about telling it instead, explicitly, that objects in images are largely the same whether they're on the left or on the right of the picture? That's what's called translation invariance. Different positions, same kitten.

Another example. Imagine we have a long text that talks about kittens. Does the meaning of kitten change depending on whether it's in the first sentence or in the second one? 

<img src='Figures5/Screen Shot 2017-03-27 at 23.50.38.png' width=400>

Mostly not. So if we're trying to train a network on text, maybe we want the part of the network that learns what a kitten is to be reused every time we see the word kitten, and not have to relearn it every time.

The way we achieve this in our own networks is using what is called __weight sharing__. When we know that two inputs can contain the same kind of information, we share the weights and train them jointly for those inputs. It is a very important idea.

<img src='Figures5/Screen Shot 2017-03-27 at 23.53.11.png' width=400>

__Statistical invariants__, things that don't change on average across time or space, are everywhere. 

For images, the idea of weight sharing will get us to study convolutional networks. For text and sequences in general,
it will lead us to embeddings and recurrent neural networks.



# 2. Convolutional Networks <a name='Convolutional Networks'></a>

<img src='Figures5/Screen Shot 2017-03-27 at 23.59.49.png' width=400>

Let's talk about convolutional networks, or convnets. Convnets are neural networks that share their parameters across space.

Imagine we have an image. It can be represented as a flat pancake. 

<img src='Figures5/Screen Shot 2017-03-28 at 00.01.46.png' width=400>

It has a width and a height. And because we typically have red, green, and blue channels, it also has a depth. In this instance, the depth is 3. That's our input.

Now imagine taking a small patch of this image and running a tiny neural network on it, with say, $K$ outputs.

<img src='Figures5/Screen Shot 2017-03-28 at 00.03.43.png' width=400>

Let's represent those outputs vertically, in a tiny column like this.

<img src='Figures5/Screen Shot 2017-03-28 at 00.04.42.png' width=400>

Now let's slide that little neural network across the image without changing the weights. Just slide it horizontally and vertically, like we're painting the image with a brush.

<img src='Figures5/Screen Shot 2017-03-28 at 00.06.26.png' width=400>
<img src='Figures5/Screen Shot 2017-03-28 at 00.08.41.png' width=400>
<img src='Figures5/Screen Shot 2017-03-28 at 00.09.20.png' width=400>
<img src='Figures5/Screen Shot 2017-03-28 at 00.09.29.png' width=400>
<img src='Figures5/Screen Shot 2017-03-28 at 00.09.51.png' width=400>
<img src='Figures5/Screen Shot 2017-03-28 at 00.09.58.png' width=400>
<img src='Figures5/Screen Shot 2017-03-28 at 00.07.40.png' width=400>

On the output, we've drawn another image. It's got a different width, a different height, and more importantly, it's got a different depth.

<img src='Figures5/Screen Shot 2017-03-28 at 00.12.53.png' width=400>

Instead of just R, G and B, now we have an output that's got many color channels, $K$ of them. This operation is called a __convolution__. 

If the patch size were the size of the whole image, it would be no different than a regular layer of a neural network. But because we have this small patch instead, we have many fewer weights and they are shared across space.
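
The sliding operation described above can be sketched in plain Python. This is only an illustration, not how a library implements it: the function name, the nested-list layout and the sizes (a 6x6 RGB image, 3x3 patches, $K = 4$) are assumptions chosen to keep the example small.

```python
def conv2d_valid(image, weights, biases, stride=1):
    """Slide one small linear layer over `image` without going past the edge.

    image:   H x W x D nested lists
    weights: F x F x D x K nested lists (the shared patch weights)
    biases:  list of K values
    Returns an out_h x out_w x K nested list.
    """
    height, width, depth = len(image), len(image[0]), len(image[0][0])
    f = len(weights)
    k_out = len(biases)
    out_h = (height - f) // stride + 1
    out_w = (width - f) // stride + 1
    output = []
    for oy in range(out_h):
        row = []
        for ox in range(out_w):
            column = list(biases)  # the K outputs for this patch
            for dy in range(f):
                for dx in range(f):
                    for d in range(depth):
                        pixel = image[oy * stride + dy][ox * stride + dx][d]
                        for k in range(k_out):
                            # Same weights at every (oy, ox): weight sharing.
                            column[k] += pixel * weights[dy][dx][d][k]
            row.append(column)
        output.append(row)
    return output

# Hypothetical sizes: a 6x6 RGB image, 3x3 patches, K = 4 outputs.
H, W, D, F, K = 6, 6, 3, 3, 4
image = [[[1.0] * D for _ in range(W)] for _ in range(H)]
weights = [[[[0.5] * K for _ in range(D)] for _ in range(F)] for _ in range(F)]
biases = [0.0] * K

out = conv2d_valid(image, weights, biases)
print(len(out), len(out[0]), len(out[0][0]))  # 4 4 4

# Number of learned values, no matter how large the image is.
n_shared = F * F * D * K + K
print(n_shared)  # 112
```

The output is a new "image" with depth $K$, and the parameter count depends only on the patch size and the two depths, not on the width or height of the input.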

<img src='Figures5/Screen Shot 2017-03-28 at 00.15.39.png' width=400>

A convnet is going to basically be a deep network where instead of having stacks of matrix multiply layers, we're going to have stacks of convolutions.

The general idea is that they will form a pyramid. At the bottom we have this big image but very shallow, just R, G, and B. We're going to apply convolutions that are going to progressively squeeze the spatial dimensions, while increasing the depth, which corresponds roughly to the semantic complexity of our representation. 

<img src='Figures5/Screen Shot 2017-03-28 at 00.17.01.png' width=500>

At the top we can put our classifier. We have a representation where all the spatial information has been squeezed out and only parameters that map to contents of the image remain.
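
To make the pyramid concrete, here is a small sketch that tracks only the shapes. It assumes each layer uses a stride of 2 with padding that keeps the output at about half the input size; the 32x32x3 input and the depths 16, 32 and 64 are hypothetical values, not taken from a particular network.

```python
from math import ceil

def halve_spatial(shape, stride, out_depth):
    """Shape after one strided convolution that pads to keep ceil(size / stride)."""
    height, width, _ = shape
    return (ceil(height / stride), ceil(width / stride), out_depth)

shape = (32, 32, 3)          # hypothetical input: big but shallow
pyramid = [shape]
for depth in (16, 32, 64):   # hypothetical output depths
    shape = halve_spatial(shape, 2, depth)
    pyramid.append(shape)

for level in pyramid:
    print(level)
```

Each level is narrower and deeper than the one before it, which is the pyramid shape described above.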


So that's the general idea. If we're going to implement this, there are lots of little details to get right and a fair bit of lingo to get used to.

<img src='Figures5/Screen Shot 2017-03-28 at 11.53.11.png' width=500>

__Patches__ are sometimes called kernels. Each pancake in our stack
is called a __feature map__. Here, we're mapping three feature maps to $K$ feature maps. 

The __stride__ is the number of pixels by which we shift the filter each time we move it.

<img src='Figures5/Screen Shot 2017-03-28 at 11.58.54.png' width=500>

A stride of 1 makes the output roughly the same size as the input.

<img src='Figures5/Screen Shot 2017-03-28 at 12.00.46.png' width=500>

A stride of 2 means it's about half the size. 

<img src='Figures5/Screen Shot 2017-03-28 at 12.07.12.png' width=300>
<img src='Figures5/Screen Shot 2017-03-28 at 12.07.50.png' width=300>
<img src='Figures5/Screen Shot 2017-03-28 at 12.07.58.png' width=300>
<img src='Figures5/Screen Shot 2017-03-28 at 12.08.04.png' width=300>
<img src='Figures5/Screen Shot 2017-03-28 at 12.08.13.png' width=300>
<img src='Figures5/Screen Shot 2017-03-28 at 12.08.21.png' width=300>
<img src='Figures5/Screen Shot 2017-03-28 at 12.08.30.png' width=300>
<img src='Figures5/Screen Shot 2017-03-28 at 12.08.38.png' width=300>
<img src='Figures5/Screen Shot 2017-03-28 at 12.08.46.png' width=300>
<img src='Figures5/Screen Shot 2017-03-28 at 12.08.53.png' width=300>
<img src='Figures5/Screen Shot 2017-03-28 at 12.09.03.png' width=300>
<img src='Figures5/Screen Shot 2017-03-28 at 12.09.11.png' width=300>
<img src='Figures5/Screen Shot 2017-03-28 at 12.09.26.png' width=300>

We say "roughly" because the output size depends a bit on what we do at the edge of our image. Either we don't go past the edge, which is often called __valid padding__ as a shortcut.

<img src='Figures5/Screen Shot 2017-03-28 at 14.29.44.png' width=300>
$$ \text{Valid Padding}$$

Or we go off the edge and pad with zeros in such a way that the output map size is exactly the same size as the input map. That is often called __same padding__ as a shortcut.

<img src='Figures5/Screen Shot 2017-03-28 at 14.30.09.png' width=300>
$$ \text{Same Padding}$$

### 2.1. Intuition <a name='Intution'></a>

Let's develop better intuition for how Convolutional Neural Networks (CNN) work. We'll examine how humans classify images, and then see how CNNs use similar approaches.

Let’s say we wanted to classify the following image of a dog as a Golden Retriever.

<img src='Figures5/dog-1210559-1280.jpg' width=400>

As humans, how do we do this?

One thing we do is that we identify certain parts of the dog, such as the nose, the eyes, and the fur. We essentially break up the image into smaller pieces, recognize the smaller pieces, and then combine those pieces to get an idea of the overall dog.

In this case, we might break down the image into a combination of the following:
- A nose
- Two eyes
- Golden fur

These pieces can be seen below:

<img src='Figures5/screen-shot-2016-11-24-at-12.49.08-pm.png' width=150>
<img src='Figures5/screen-shot-2016-11-24-at-12.49.43-pm.png' width=150>
<img src='Figures5/screen-shot-2016-11-24-at-12.50.54-pm.png' width=150>

But let’s take this one step further. How do we determine what exactly a nose is? A Golden Retriever nose can be seen as an oval with two black holes inside it. Thus, one way of classifying a Retriever’s nose is to break it up into smaller pieces and look for black holes (nostrils) and curves that define an oval, as shown below.

<img src='Figures5/screen-shot-2016-11-24-at-12.51.47-pm.png' width=200>
$$ \text{A curve that we can use to determine a nose.} $$

<img src='Figures5/screen-shot-2016-11-24-at-12.51.51-pm.png' width=70>
$$ \text{A nostril that we can use to classify a nose of the dog.} $$

Broadly speaking, this is what a CNN learns to do. It learns to recognize basic lines and curves, then shapes and blobs, and then increasingly complex objects within the image. Finally, the CNN classifies the image by combining the larger, more complex objects.

In our case, the levels in the hierarchy are:
- Simple shapes, like ovals and dark circles
- Complex objects (combinations of simple shapes), like eyes, nose, and fur
- The dog as a whole (a combination of complex objects)

With deep learning, we don't actually program the CNN to recognize these specific features. Rather, the CNN learns on its own to recognize such objects through forward propagation and backpropagation!

It's amazing how well a CNN can learn to classify images, even though we never program the CNN with information about specific features to look for.

<img src='Figures5/heirarchy-diagram.jpg' width=500>
$$ \text{An example of what each layer in a CNN might recognize when classifying a picture of a dog.} $$

A CNN might have several layers, and each layer might capture a different level in the hierarchy of objects. The first layer is the lowest level in the hierarchy, where the CNN generally classifies small parts of the image into simple shapes like horizontal and vertical lines and simple blobs of colors. The subsequent layers tend to be higher levels in the hierarchy and generally classify more complex ideas like shapes (combinations of lines), and eventually full objects like dogs.

Once again, the CNN __learns all of this on its own__. We don't ever have to tell the CNN to go looking for lines or curves or noses or fur. The CNN just learns from the training set and discovers which characteristics of a Golden Retriever are worth looking for.

Next, let’s look at some implementation details.

# 3. Filters <a name='Filters'></a>

### 3.1. __Breaking up an Image__ <a name='Breaking up an Image'></a>

The first step for a CNN is to break up the image into smaller pieces. We do this by selecting a width and height that defines a filter.

The filter looks at small pieces, or patches, of the image. These patches are the same size as the filter.

<img src='Figures5/vlcsnap-2016-11-24-15h52m47s438.png' width=400>
$$ \text{A CNN uses filters to split an image into smaller patches.} \\ \text{The size of these patches matches the filter size.} $$

We then simply slide this filter horizontally or vertically to focus on a different piece of the image.

The amount by which the filter slides is referred to as the 'stride'. The stride is a hyperparameter which we, the engineers, can tune. Increasing the stride reduces the size of our model by reducing the total number of patches each layer observes. However, this usually comes with a reduction in accuracy.

Let’s look at an example. In this zoomed in image of the dog, we first start with the patch outlined in red. The width and height of our filter define the size of this square.

<img src='Figures5/retriever-patch.png' width=400>
$$ \text{One patch of the Golden Retriever image.} $$

We then move the square over to the right by a given stride (2 in this case) to get another patch.

<img src='Figures5/retriever-patch-shifted.png' width=400>
$$ \text{We move our square to the right by two pixels to create another patch.} $$
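
As a rough sketch of this sliding step, the function below lists the top-left corner of every patch a filter visits. The 8x8 image and 4x4 filter are hypothetical sizes, and the function name is made up for illustration.

```python
def patch_corners(image_size, filter_size, stride):
    """Top-left (row, col) of every patch that fits inside a square image."""
    last = image_size - filter_size
    starts = range(0, last + 1, stride)
    return [(row, col) for row in starts for col in starts]

# Hypothetical sizes: an 8x8 image and a 4x4 filter.
corners_stride_1 = patch_corners(8, 4, stride=1)
corners_stride_2 = patch_corners(8, 4, stride=2)

print(len(corners_stride_1))  # 25
print(len(corners_stride_2))  # 9
print(corners_stride_2[:3])   # [(0, 0), (0, 2), (0, 4)]
```

Going from a stride of 1 to a stride of 2 cuts the number of patches from 25 to 9 here, which is why a larger stride makes the next layer smaller.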

What's important here is that we are __grouping together adjacent pixels__ and treating them as a collective.

In a normal, non-convolutional neural network, we would have ignored this adjacency. In a normal network, we would have connected every pixel in the input image to a neuron in the next layer. In doing so, we would not have taken advantage of the fact that pixels in an image are close together for a reason and have special meaning.

By taking advantage of this local structure, our CNN learns to classify local patterns, like shapes and objects, in an image.

### 3.2. Filter Depth <a name='Filter Depth'></a>

It's common to have more than one filter. Different filters pick up different qualities of a patch. For example, one filter might look for a particular color, while another might look for a kind of object of a specific shape. _The number of filters in a convolutional layer is called the_ ___filter depth___.

<img src='Figures5/neilsen-pic.png' width=400>
$$ \text{In the above example, a patch is connected to a neuron in the next layer.}\\ \text{Source: Michael Nielsen.} $$

How many neurons does each patch connect to?

That’s dependent on our filter depth. If we have a depth of $k$, we connect each patch of pixels to $k$ neurons in the next layer. This gives us the height of $k$ in the next layer, as shown below. In practice, $k$ is a hyperparameter we tune, and most CNNs tend to pick the same starting values.

<img src='Figures5/filter-depth.png' width=200>
$$ \text{Choosing a filter depth of k connects each patch to k neurons in the next layer.} $$

But why connect a single patch to multiple neurons in the next layer? Isn’t one neuron good enough?

Multiple neurons can be useful because a patch can have multiple interesting characteristics that we want to capture.

For example, one patch might include some white teeth, some blonde whiskers, and part of a red tongue. In that case, we might want a filter depth of at least three - one for each of teeth, whiskers, and tongue.

<img src='Figures5/teeth-whiskers-tongue.png' width=400>
$$ \text{This patch of the dog has many interesting features we may want to capture.} \\ \text{These include the presence of teeth, the presence of whiskers, and the pink color of the tongue.} $$

Having multiple neurons for a given patch ensures that our CNN can learn to capture whatever characteristics the CNN learns are important.

Remember that the CNN isn't "programmed" to look for certain characteristics. Rather, it learns __on its own__ which characteristics to notice.

---
__Quiz:__

Imagine that we have a 28 by 28 image. We run a 3 by 3 convolution on it with an input depth of 3 and an output depth of 8. What are the dimensions of the output feature maps when we're using same padding with a stride of 1, when we're using valid padding with a stride of 1, and when we're using valid padding with a stride of 2?

<img src='Figures5/Screen Shot 2017-03-28 at 20.43.51.png' width=500>

__Answer:__

__SAME__ padding equation:

```Python
out_height = ceil(float(in_height) / float(strides[1]))
out_width  = ceil(float(in_width) / float(strides[2]))
```

__VALID__ padding equation:

```Python
out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width  = ceil(float(in_width - filter_width + 1) / float(strides[2]))
```

<img src='Figures5/Screen Shot 2017-03-28 at 20.48.39.png' width=500>

If we're using the so-called same padding and a stride of 1, the output width and height are the same as the input. We just add zeroes to the input image to make the sizes match. 

If we use the so-called valid padding and a stride of 1, then there is no padding at all. So if we want to fit our little filter on the input image without doing any padding, we're going to have to remove one row and one column of the image on each side.

So we're left with 26 by 26 features in each of the maps at the output. If in addition we use a stride of 2, then we only get half as many outputs: 13 in width and 13 in height. In all cases, the output depth is 8, since neither padding nor stride changes it.
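
The three answers can be checked by plugging the quiz numbers into the SAME and VALID equations given above. This is a minimal sketch; the helper names are made up, and a single size is used because the image and the filter are both square.

```python
from math import ceil

def same_output(in_size, stride):
    """Output size along one axis with SAME padding."""
    return ceil(in_size / stride)

def valid_output(in_size, filter_size, stride):
    """Output size along one axis with VALID padding."""
    return ceil((in_size - filter_size + 1) / stride)

in_size, filter_size, out_depth = 28, 3, 8

same_stride_1 = (same_output(in_size, 1), same_output(in_size, 1), out_depth)
valid_stride_1 = (valid_output(in_size, filter_size, 1),
                  valid_output(in_size, filter_size, 1), out_depth)
valid_stride_2 = (valid_output(in_size, filter_size, 2),
                  valid_output(in_size, filter_size, 2), out_depth)

print(same_stride_1)   # (28, 28, 8)
print(valid_stride_1)  # (26, 26, 8)
print(valid_stride_2)  # (13, 13, 8)
```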



There are more sophisticated methods for padding an image which avoid some of the problems introduced by simply adding zeros.

If we're interested in learning more, check out this discussion on [Boundary Issues](https://classroom.udacity.com/courses/ud810/lessons/3461518740/concepts/34384591060923) in our [Introduction to Computer Vision](https://www.udacity.com/courses/ud810) course.

---

That's it. We can build a simple convnet with just this.

<img src='Figures5/Screen Shot 2017-03-28 at 20.59.48.png' width=500>

Stack up our convolutions, which thankfully we don't have to implement ourselves, then use strides to reduce the dimensionality and increase the depth of the network, layer after layer. And once we have a deep and narrow representation, connect the whole thing to a few regular fully connected layers and we're ready to train our classifier.

We might wonder what happens to training, and to the chain rule in particular, when we use shared weights like this.

<img src='Figures5/Screen Shot 2017-03-28 at 21.03.20.png' width=400>

Nothing really happens. The math just works. We just add up the derivatives from all of the possible locations on the image.
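
A toy example may help here. It is a sketch under simplifying assumptions: a single shared weight `w` is applied at three positions of a 1-D input, the loss is a squared error, and all the numbers are made up. The gradient of the shared weight is the sum of the per-position derivatives, and a finite-difference check agrees with it.

```python
def loss(w, xs, targets):
    """Squared error when one shared weight w is applied at every position."""
    return 0.5 * sum((w * x - t) ** 2 for x, t in zip(xs, targets))

def shared_weight_gradient(w, xs, targets):
    """Chain rule: add up the derivative contributed by each position."""
    per_position = [(w * x - t) * x for x, t in zip(xs, targets)]
    return sum(per_position)

xs = [1.0, 2.0, 3.0]        # made-up inputs at three positions
targets = [2.0, 4.0, 7.0]   # made-up targets
w = 1.5

analytic = shared_weight_gradient(w, xs, targets)

# Central finite difference as an independent check.
eps = 1e-6
numeric = (loss(w + eps, xs, targets) - loss(w - eps, xs, targets)) / (2 * eps)

print(analytic, numeric)
```

The same bookkeeping applies to a 2-D filter: every position where the filter was applied contributes one term to the gradient of each shared weight.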

__Note:__ A "Fully Connected" layer is a standard, non-convolutional layer, where all inputs are connected to all output neurons. This is also referred to as a "dense" layer, and is what we used in the previous two lessons.

# 4. Parameters <a name='Parameters'></a>

### 4.1. Parameter Sharing <a name='Parameter Sharing'></a>

<img src='Figures5/vlcsnap-2016-11-24-16h01m35s262.png' width=400>
$$ \text{The weights, w, are shared across patches for a given layer in a CNN} \\ \text{to detect the cat above regardless of where in the image it is located.} $$

When we are trying to classify a picture of a cat, we don’t care where in the image a cat is. If it’s in the top left or the bottom right, it’s still a cat in our eyes. We would like our CNNs to also possess this ability known as translation invariance. How can we achieve this?

As we saw earlier, the classification of a given patch in an image is determined by the weights and biases corresponding to that patch.

If we want a cat that’s in the top left patch to be classified in the same way as a cat in the bottom right patch, we need the weights and biases corresponding to those patches to be the same, so that they are classified the same way.

This is exactly what we do in CNNs. The weights and biases we learn for a given output layer are shared across all patches in a given input layer. Note that as we increase the depth of our filter, the number of weights and biases we have to learn still increases, as the weights aren't shared across the output channels.

There’s an additional benefit to sharing our parameters. If we did not reuse the same weights across all patches, we would have to learn new parameters for every single patch and hidden layer neuron pair. This does not scale well, especially for higher fidelity images. Thus, sharing parameters not only helps us with translation invariance, but also gives us a smaller, more scalable model.

### 4.2. Padding <a name='Padding'></a>

<img src='Figures5/screen-shot-2016-11-24-at-10.05.37-pm.png' width=200>
$$ \text{A 5x5 grid with a 3x3 filter. Source: Andrej Karpathy.} $$

Let's say we have a $5\times 5$ grid (as shown above) and a filter of size $3 \times 3$ with a stride of 1. What's the width and height of the next layer? We see that we can fit at most three patches in each direction, giving us a dimension of $3 \times 3$ in our next layer. As we can see, the width and height of each subsequent layer decrease in such a scheme.

In an ideal world, we'd be able to maintain the same width and height across layers so that we can continue to add layers without worrying about the dimensionality shrinking and so that we have consistency. How might we achieve this? One way is to simply add a border of 0s to our original $ 5 \times 5 $ image. We can see what this looks like in the below image.

<img src='Figures5/screen-shot-2016-11-24-at-10.05.46-pm.png' width=250>
$$ \text{The same grid with 0 padding. Source: Andrej Karpathy.} $$

This would expand our original image to a $ 7 \times 7 $. With this, we now see how our next layer's size is again a $5 \times 5$, keeping our dimensionality consistent.
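
The effect of the zero border can be reproduced with a short sketch. The grid is filled with ones only so the padding is easy to see, and both helper names are made up for illustration.

```python
def zero_pad(grid, pad):
    """Surround a 2-D list with `pad` rows and columns of zeros on every side."""
    width = len(grid[0]) + 2 * pad
    top = [[0] * width for _ in range(pad)]
    middle = [[0] * pad + list(row) + [0] * pad for row in grid]
    bottom = [[0] * width for _ in range(pad)]
    return top + middle + bottom

def patches_per_side(size, filter_size, stride=1):
    """How many filter positions fit along one side without leaving the grid."""
    return (size - filter_size) // stride + 1

grid = [[1] * 5 for _ in range(5)]
padded = zero_pad(grid, 1)

for row in padded:
    print(row)

print(patches_per_side(len(grid), 3))    # 3
print(patches_per_side(len(padded), 3))  # 5
```

Without padding the $3 \times 3$ filter fits three times per side; with a one-pixel border of zeros it fits five times, so the next layer keeps the original $5 \times 5$ size.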

### 4.3. Dimensionality <a name='Dimensionality'></a>

From what we've learned so far, how can we calculate the number of neurons of each layer in our CNN?

Given:
- our input layer has a width of W and a height of H
- our convolutional layer has a filter size F
- we have a stride of S
- a padding of P
- and a filter depth (the number of filters) of K,

the following formula gives us the width of the next layer: $W_{out} = (W - F + 2P)/S + 1$.

The output height would be $H_{out} = (H - F + 2P)/S + 1$.

And the output depth would be equal to the filter depth: $D_{out} = K$.

The output volume would be $W_{out} \times H_{out} \times D_{out}$.

Knowing the dimensionality of each additional layer helps us understand how large our model is and how our decisions around filter size and stride affect the size of our network.
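
These formulas translate directly into a small helper. The sketch below assumes integer division when the stride does not divide evenly (the formulas above do not say what to do in that case), and the layer sizes are hypothetical.

```python
def conv_output_volume(w, h, f, s, p, k):
    """Return (W_out, H_out, D_out, volume) for one convolutional layer.

    w, h: input width and height    f: filter size
    s: stride    p: padding         k: filter depth
    Assumption: floor division when (size - f + 2p) is not a multiple of s.
    """
    w_out = (w - f + 2 * p) // s + 1
    h_out = (h - f + 2 * p) // s + 1
    d_out = k
    return w_out, h_out, d_out, w_out * h_out * d_out

# Hypothetical layer: 28x28 input, 5x5 filter, stride 1, padding 2, 6 filters.
print(conv_output_volume(28, 28, 5, 1, 2, 6))  # (28, 28, 6, 4704)

# Same input with a 4x4 filter, stride 2 and no padding.
print(conv_output_volume(28, 28, 4, 2, 0, 6))  # (13, 13, 6, 1014)
```

Doubling the stride roughly quarters the output volume in this example, which is the kind of tradeoff these formulas let us estimate before building the network.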

---
__Quiz 1: Convolution Output Shape__

Understanding dimensions will help us make accurate tradeoffs between model size and performance. As we'll see, some parameters have a much bigger impact on model size than others. 

H = height, W = width, D = depth

- We have an input of shape 32x32x3 (HxWxD)
- 20 filters of shape 8x8x3 (HxWxD)
- A stride of 2 for both the height and width (S)
- With padding of size 1 (P)

Recall the formula for calculating the new height or width:

```Python
new_height = ((input_height - filter_height + 2 * padding_height)/ stride_height) + 1
new_width = ((input_width - filter_width + 2 * padding_width)/ stride_width) + 1
```

What's the shape of the output?

__Answer:__

__14x14x20__

We can get the new height and width with the formula resulting in:

```Python
new_height = (32 - 8 + 2 * 1) // 2 + 1  # 14
new_width = (32 - 8 + 2 * 1) // 2 + 1   # 14
```

The new depth is equal to the number of filters, which is 20.

This would correspond to the following code:

```Python
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder)

input = tf.placeholder(tf.float32, (None, 32, 32, 3))
filter_weights = tf.Variable(tf.truncated_normal((8, 8, 3, 20))) # (height, width, input_depth, output_depth)
filter_bias = tf.Variable(tf.zeros(20))
strides = [1, 2, 2, 1] # (batch, height, width, depth)
padding = 'SAME'
conv = tf.nn.conv2d(input, filter_weights, strides, padding) + filter_bias
```

Note the output shape of __conv__ will be [1, 16, 16, 20]. It's 4D to account for batch size, but more importantly, it's not [1, 14, 14, 20]. This is because the padding algorithm TensorFlow uses is not exactly the same as the one above. An alternative algorithm is to switch __padding__ from __'SAME'__ to __'VALID'__ which would result in an output shape of [1, 13, 13, 20]. If we're curious how padding works in TensorFlow, read [this document](https://www.tensorflow.org/api_guides/python/nn#Convolution).

In summary, TensorFlow uses the following equations for 'SAME' vs 'VALID' padding:

- __SAME Padding__, the output height and width are computed as:
    - out_height = ceil(float(in_height) / float(strides[1]))
    - out_width = ceil(float(in_width) / float(strides[2]))
- __VALID Padding__, the output height and width are computed as:
    - out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
    - out_width = ceil(float(in_width - filter_width + 1) / float(strides[2]))
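
To see where the 16 comes from, the sketch below works out how much zero padding 'SAME' implies along one axis. The padding rule used here (total padding of `max((out - 1) * stride + filter - in, 0)`, split as evenly as possible with any extra pixel going after) is our reading of TensorFlow's documented behaviour; treat it as an assumption and check the linked document for the authoritative description. The function name is made up.

```python
from math import ceil

def same_padding(in_size, filter_size, stride):
    """Output size and (before, after) zero padding for 'SAME' along one axis."""
    out_size = ceil(in_size / stride)
    pad_total = max((out_size - 1) * stride + filter_size - in_size, 0)
    pad_before = pad_total // 2
    pad_after = pad_total - pad_before
    return out_size, pad_before, pad_after

# Quiz 1 numbers: 32-pixel input, 8-pixel filter, stride 2.
out_size, pad_before, pad_after = same_padding(32, 8, 2)
print(out_size, pad_before, pad_after)  # 16 3 3

# Plugging that padding into the explicit formula gives the same size ...
explicit_same = (32 - 8 + pad_before + pad_after) // 2 + 1
# ... while the quiz's padding of 1 on each side gives the smaller answer.
explicit_quiz = (32 - 8 + 2 * 1) // 2 + 1
print(explicit_same, explicit_quiz)  # 16 14
```

Under this rule 'SAME' pads three pixels on each side rather than one, which is why the TensorFlow output is 16 wide instead of 14.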
    
---
__Quiz 2: Number of Parameters__

We're now going to calculate the number of parameters of the convolutional layer. The answer from the last quiz will come into play here!

Being able to calculate the number of parameters in a neural network is useful since we want to have control over how much memory a neural network uses.

H = height, W = width, D = depth
- We have an input of shape 32x32x3 (HxWxD)
- 20 filters of shape 8x8x3 (HxWxD)
- A stride of 2 for both the height and width (S)
- Zero padding of size 1 (P)
- Output Layer, 14x14x20 (HxWxD)

Hint: Without parameter sharing, each neuron in the output layer must connect to each neuron in the filter. In addition, each neuron in the output layer must also connect to a single bias neuron.

__Answer 2:__

There are 756560 total parameters. That's a HUGE amount! Here's how we calculate it:

$ (8 * 8 * 3 + 1) * (14 * 14 * 20) = 756560$

$8 * 8 * 3$ is the number of weights, and we add 1 for the bias. Remember, without parameter sharing every neuron in the output $(14 * 14 * 20)$ has its own set of weights and its own bias. So we multiply these two numbers together and we get the final answer.

---
__Quiz 3: Parameter Sharing__

Now we'd like to calculate the number of parameters in the convolutional layer, if every neuron in the output layer shares its parameters with every other neuron in its same channel.

This is the number of parameters actually used in a convolution layer (__tf.nn.conv2d()__).

H = height, W = width, D = depth

- We have an input of shape 32x32x3 (HxWxD)
- 20 filters of shape 8x8x3 (HxWxD)
- A stride of 2 for both the height and width (S)
- Zero padding of size 1 (P)
- Output Layer, 14x14x20 (HxWxD)

Hint: With parameter sharing, each neuron in an output channel shares its weights with every other neuron in that channel. So the number of parameters is equal to the number of neurons in the filter, plus a bias neuron, all multiplied by the number of channels in the output layer.

__Answer 3:__

There are 3860 total parameters. That's 196 times fewer parameters! Here's how the answer is calculated:

$(8 * 8 * 3 + 1) * 20 = 3840 + 20 = 3860$

That's 3840 weights and 20 biases. This should look similar to the answer from the previous quiz. The difference is that the multiplier is just 20 instead of $(14 * 14 * 20)$. Remember, with weight sharing we use the same filter for an entire depth slice. Because of this we can get rid of $14 * 14$ and be left with only 20.
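
Both parameter counts, and the ratio between them, can be reproduced in a few lines. The variable names are only for illustration; the numbers are the ones from the two quizzes.

```python
filter_height, filter_width, input_depth = 8, 8, 3
out_height, out_width, out_depth = 14, 14, 20

# Every output neuron looks at one 8x8x3 patch and has one bias.
params_per_neuron = filter_height * filter_width * input_depth + 1

# Without sharing: a separate set of parameters for every output neuron.
without_sharing = params_per_neuron * (out_height * out_width * out_depth)

# With sharing: one set of parameters per output channel.
with_sharing = params_per_neuron * out_depth

print(without_sharing)                  # 756560
print(with_sharing)                     # 3860
print(without_sharing // with_sharing)  # 196
```

The ratio is exactly $14 * 14 = 196$, the number of positions that share each filter.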

# 5. Visualizing CNNs <a name='Visualizing CNNs'></a>

Let’s look at an example CNN to see how it works in action.

The CNN we will look at is trained on ImageNet as described in [this paper](http://www.matthewzeiler.com/pubs/arxive2013/eccv2014.pdf) by Zeiler and Fergus. In the images below (from the same paper), we’ll see what each layer in this network detects and see how each layer detects more and more complex ideas.

### 5.1. Layer 1 <a name='Layer 1'></a>

<img src='Figures5/layer-1-grid.png' width=200>
$$ \text{Example patterns that cause activations in the first layer of the network.} \\ \text{These range from simple diagonal lines (top left) to green blobs (bottom middle).} $$

The images above are from Matthew Zeiler and Rob Fergus' [deep visualization toolbox](https://www.youtube.com/watch?v=ghEmQSxT6tw), which lets us visualize what each layer in a CNN focuses on.

Each image in the above grid represents a pattern that causes the neurons in the first layer to activate - in other words, they are patterns that the first layer recognizes. The top left image shows a -45 degree line, while the middle top square shows a +45 degree line. These squares are shown below again for reference.

<img src='Figures5/diagonal-line-1.png' width=80>
$$ \text{As visualized here, the first layer of the CNN can recognize -45 degree lines.} $$

<img src='Figures5/diagonal-line-2.png' width=80>
$$ \text{The first layer of the CNN is also able to recognize +45 degree lines, like the one above.} $$

Let's now see some example images that cause such activations. The below grid of images all activated the -45 degree line. Notice how they are all selected despite the fact that they have different colors, gradients, and patterns.

<img src='Figures5/grid-layer-1.png' width=150>
$$ \text{Example patches that activate the -45 degree line detector in the first layer.} $$

So, the first layer of our CNN clearly picks out very simple shapes and patterns like lines and blobs.

### 5.2. Layer 2 <a name='Layer 2'></a>

<img src='Figures5/screen-shot-2016-11-24-at-12.09.02-pm.png' width=500>

$$ \text{A visualization of the second layer in the CNN. Notice how we are picking up} \\ \text{more complex ideas like circles and stripes. The gray grid on the left represents} \\ \text{how this layer of the CNN activates (or "what it sees") based on the corresponding} \\ \text{images from the grid on the right.} $$

The second layer of the CNN captures more complex ideas.

As we see in the image above, the second layer of the CNN recognizes circles (second row, second column), stripes (first row, second column), and rectangles (bottom right).

__The CNN learns to do this on its own__. There is no special instruction for the CNN to focus on more complex objects in deeper layers. That's just how it normally works out when we feed training data into a CNN.

### 5.3. Layer 3 <a name='Layer 3'></a>

<img src='Figures5/screen-shot-2016-11-24-at-12.09.24-pm.png' width=500>

$$ \text{A visualization of the third layer in the CNN. The gray grid on the left represents} \\ \text{how this layer of the CNN activates (or "what it sees") based on the corresponding} \\ \text{images from the grid on the right.} $$

The third layer picks out complex combinations of features from the second layer. These include things like grids and honeycombs (top left), wheels (second row, second column), and even faces (third row, third column).

### 5.4. Layer 5 <a name='Layer 5'></a>

<img src='Figures5/screen-shot-2016-11-24-at-12.08.11-pm.png' width=500>

$$ \text{A visualization of the fifth and final layer of the CNN. The gray grid on the left} \\ \text{represents how this layer of the CNN activates (or "what it sees") based on the} \\ \text{corresponding images from the grid on the right.} $$

We'll skip layer 4, which continues this progression, and jump right to the fifth and final layer of this CNN.

The last layer picks out the highest order ideas that we care about for classification, like dog faces, bird faces, and bicycles.

# 6. TensorFlow Convolution Layer <a name='TensorFlow Convolution Layer'></a>

Let's examine how to implement a CNN in TensorFlow.

TensorFlow provides the [tf.nn.conv2d()](https://www.tensorflow.org/api_docs/python/tf/nn/conv2d) and [tf.nn.bias_add()](https://www.tensorflow.org/api_docs/python/tf/nn/bias_add) functions to create our own convolutional layers.

```Python
# Output depth
k_output = 64

# Image Properties
image_width = 10
image_height = 10
color_channels = 3

# Convolution filter
filter_size_width = 5
filter_size_height = 5

# Input/Image
input = tf.placeholder(
    tf.float32,
    shape=[None, image_height, image_width, color_channels])

# Weight and bias
weight = tf.Variable(tf.truncated_normal(
    [filter_size_height, filter_size_width, color_channels, k_output]))
bias = tf.Variable(tf.zeros(k_output))

# Apply Convolution
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
# Add bias
conv_layer = tf.nn.bias_add(conv_layer, bias)
# Apply activation function
conv_layer = tf.nn.relu(conv_layer)
```

The code above uses the [tf.nn.conv2d()](https://www.tensorflow.org/api_docs/python/tf/nn/conv2d) function to compute the convolution with __weight__ as the filter and __[1, 2, 2, 1]__ for the strides. TensorFlow uses a stride for each __input__ dimension, __[batch, input_height, input_width, input_channels]__. We almost always set the stride for __batch__ and __input\_channels__ (i.e. the first and fourth elements of the __strides__ array) to __1__.

That leaves the __input\_height__ and __input\_width__ strides, which control how far the filter moves over __input__ at each step. This example code uses a stride of 2 with a 5x5 filter over __input__.

The [tf.nn.bias_add()](https://www.tensorflow.org/api_docs/python/tf/nn/bias_add) function adds a 1-d bias to the last dimension in a matrix.
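
To see what spatial size the example layer produces, here is a minimal sketch in plain Python of the output-size rules TensorFlow documents for `'SAME'` and `'VALID'` padding. The helper name `conv_output_size` is hypothetical and not part of TensorFlow; the numbers come from the example above.

```Python
import math

def conv_output_size(in_size, filter_size, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    if padding == 'SAME':
        # 'SAME' pads the input, so only the stride shrinks the output.
        return math.ceil(in_size / stride)
    if padding == 'VALID':
        # 'VALID' uses no padding, so the filter must fit inside the input.
        return math.ceil((in_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

# Values from the example above: 10x10 image, 5x5 filter, stride 2, 64 filters.
out_h = conv_output_size(10, 5, 2, 'SAME')
out_w = conv_output_size(10, 5, 2, 'SAME')
print((out_h, out_w, 64))                    # (5, 5, 64)
print(conv_output_size(10, 5, 2, 'VALID'))   # 3
```

Under these rules the example layer would map each 10x10x3 image to a 5x5x64 volume.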

# 7. Explore The Design Space <a name='Explore The Design Space'></a>

We've seen what a simple convnet looks like. There are many things that we can do to improve it. We're going to talk about three of them: _pooling_, _one by one convolutions_, and something a bit more advanced called the _inception architecture_.

### 7.1. Pooling <a name='Pooling'></a>

<img src='Figures5/Screen Shot 2017-03-30 at 16.07.22.png' width=400>

The first improvement is a better way to reduce the spatial extent of feature maps in the convolutional pyramid. Until now, we've used striding to shift the filters by a few pixels each time and reduce the feature map size. This is a very aggressive way to downsample an image, and it removes a lot of information.

What if, instead of skipping one in every two convolutions, we still ran with a very small stride (say, one) but then took all the convolutions in a neighborhood and combined them somehow?

That operation is called pooling, and there are a few ways to go about it.

The most common is max pooling.

<img src='Figures5/Screen Shot 2017-03-30 at 16.11.00.png' width=500>

At every point in the feature map, look at a small neighborhood around that point and compute the maximum of all the responses around it.

There are some advantages to using max pooling. First, it doesn't add to the number of parameters, so we don't risk increased overfitting. Second, it often simply yields more accurate models.

However, since the convolutions below run at a lower stride, the model becomes a lot more expensive to compute. We also have more hyperparameters to worry about: the pooling region size and the pooling stride, which don't have to be the same.

A very typical architecture for a convnet is a few layers alternating convolutions and max pooling, followed by a few fully connected layers at the top (figure below).

<img src='Figures5/Screen Shot 2017-03-30 at 16.13.56.png' width=500>

The first famous model to use this architecture was __LeNet-5__, designed by Yann LeCun for character recognition back in 1998. Modern convolutional networks such as __AlexNet__, which famously won the competitive ImageNet object recognition challenge in 2012, used a very similar architecture with a few wrinkles.

Another notable form of pooling is __average pooling__. Instead of taking the max, just take an average over the window of pixels around a specific location. It's a little bit like providing a blurred low resolution view of the feature map below. We're going to take advantage of that shortly.



### 7.2. TensorFlow Max Pooling <a name='TensorFlow Max Pooling'></a>

<img src='Figures5/max-pooling.png' width=400>
$$ \text{By Aphex34 (Own work) [CC BY-SA 4.0], via Wikimedia Commons} $$

[CC BY-SA 4.0](http://creativecommons.org/licenses/by-sa/4.0)

The image above is an example of [max pooling](https://en.wikipedia.org/wiki/Convolutional_neural_network#Pooling_layer) with a 2x2 filter and stride of 2. The four 2x2 colors represent each time the filter was applied to find the maximum value.

For example, __[[1, 0], [4, 6]]__ becomes __6__, because __6__ is the maximum value in this set. Similarly, __[[2, 3], [6, 8]]__ becomes __8__.

Conceptually, the benefit of the max pooling operation is to reduce the size of the input, and allow the neural network to focus on only the most important elements. Max pooling does this by only retaining the maximum value for each filtered area, and removing the remaining values.

TensorFlow provides the [tf.nn.max_pool()](https://www.tensorflow.org/api_docs/python/tf/nn/max_pool) function to apply __max pooling__ to the convolutional layers.

```Python
...
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
conv_layer = tf.nn.bias_add(conv_layer, bias)
conv_layer = tf.nn.relu(conv_layer)
# Apply Max Pooling
conv_layer = tf.nn.max_pool(
    conv_layer,
    ksize=[1, 2, 2, 1],
    strides=[1, 2, 2, 1],
    padding='SAME')
```

The [tf.nn.max_pool()](https://www.tensorflow.org/api_docs/python/tf/nn/max_pool) function performs max pooling with the __ksize__ parameter as the size of the filter and the __strides__ parameter as the length of the stride. 2x2 filters with a stride of 2x2 are common in practice.

The __ksize__ and __strides__ parameters are structured as 4-element lists, with each element corresponding to a dimension of the input tensor (__[batch, height, width, channels]__). For both __ksize__ and __strides__, the batch and channel dimensions are typically set to __1__.

---
__Quiz 1: Pooling Intuition__ 

A pooling layer is generally used to ...

1. Increase the size of the output
2. Decrease the size of the output
3. Prevent overfitting
4. Gain information

__Answer 1: Pooling Intuition__

The correct answer is __decrease the size of the output__ and __prevent overfitting__. Preventing overfitting is a consequence of reducing the output size, which, in turn, reduces the number of parameters in future layers.

Recently, pooling layers have fallen out of favor. Some reasons are:

- Recent datasets are so big and complex we're more concerned about underfitting.
- Dropout is a much better regularizer.
- Pooling results in a loss of information. Think about the max pooling operation as an example. We only keep the largest of n numbers, thereby disregarding n-1 numbers completely.

---
__Quiz 2: Pooling Mechanics__

H = height, W = width, D = depth

- We have an input of shape 4x4x5 (HxWxD)
- Filter of shape 2x2 (HxW)
- A stride of 2 for both the height and width (S)

Recall the formula for calculating the new height or width:

```Python
new_height = (input_height - filter_height)/S + 1
new_width = (input_width - filter_width)/S + 1
```

__NOTE:__ For a pooling layer the output depth is the same as the input depth. Additionally, the pooling operation is applied individually for each depth slice.

The image below gives an example of how a max pooling layer works. In this case, the max pooling filter has a shape of 2x2. As the max pooling filter slides across the input layer, the filter will output the maximum value of the 2x2 square.

<img src='Figures5/convolutionalnetworksquiz.png' width=500>

What is the shape of the output?

__Answer 2: Pooling Mechanics__

The answer is __2x2x5__. Here's how it's calculated using the formula:

```Python
(4 - 2)/2 + 1 = 2
(4 - 2)/2 + 1 = 2
```

The depth stays the same.

Here's the corresponding code:

```Python
input = tf.placeholder(tf.float32, (None, 4, 4, 5))
filter_shape = [1, 2, 2, 1]
strides = [1, 2, 2, 1]
padding = 'VALID'
pool = tf.nn.max_pool(input, filter_shape, strides, padding)
```

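The output shape of __pool__ will be [None, 2, 2, 5] (the placeholder leaves the batch dimension unspecified), even if __padding__ is changed to __'SAME'__, because the 4x4 input divides evenly by the stride of 2.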
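
The arithmetic in this answer can be sketched in plain Python. The helper name `pool_output_shape` is hypothetical; it simply applies the formula quoted in the quiz and leaves the depth untouched.

```Python
def pool_output_shape(height, width, depth, filter_size, stride):
    """Output (H, W, D) of a pooling layer with no padding."""
    new_height = (height - filter_size) // stride + 1
    new_width = (width - filter_size) // stride + 1
    # Pooling is applied to each depth slice separately, so depth is unchanged.
    return (new_height, new_width, depth)

# Quiz values: 4x4x5 input, 2x2 filter, stride 2.
print(pool_output_shape(4, 4, 5, filter_size=2, stride=2))  # (2, 2, 5)
```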

---
__Quiz 3: Pooling Practice__

What's the result of a __max pooling__ operation on the input?

```Python
[[[0, 1, 0.5, 10],
   [2, 2.5, 1, -8],
   [4, 0, 5, 6],
   [15, 1, 2, 3]]]
```

Assume the filter is 2x2 and the stride is 2 for both height and width. The output shape is 2x2x1.

Work from the top left to the bottom right.

__Answer 3: Pooling Practice__

The correct answer is __2.5, 10, 15, 6__. We start with the four numbers in the top left corner. Then we work left-to-right and top-to-bottom, moving 2 units each time.

```Python
max(0, 1, 2, 2.5) = 2.5
max(0.5, 10, 1, -8) = 10
max(4, 0, 15, 1) = 15
max(5, 6, 2, 3) = 6
```
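
The same result can be reproduced with a short plain-Python sketch of max pooling. The function name `max_pool_2d` is an illustrative assumption; it works on a single 2-D slice and uses no padding.

```Python
def max_pool_2d(grid, k=2, stride=2):
    """Max pooling over a 2-D list of numbers, with no padding."""
    rows = range(0, len(grid) - k + 1, stride)
    cols = range(0, len(grid[0]) - k + 1, stride)
    # For each window position, keep only the largest of the k*k values.
    return [[max(grid[r + i][c + j] for i in range(k) for j in range(k))
             for c in cols]
            for r in rows]

quiz_input = [[0, 1, 0.5, 10],
              [2, 2.5, 1, -8],
              [4, 0, 5, 6],
              [15, 1, 2, 3]]
print(max_pool_2d(quiz_input))  # [[2.5, 10], [15, 6]]
```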

---
__Quiz 4: Average Pooling__

What's the result of an __average (or mean) pooling__ operation on the input?

```Python
[[[0, 1, 0.5, 10],
   [2, 2.5, 1, -8],
   [4, 0, 5, 6],
   [15, 1, 2, 3]]]
```

Assume the filter is 2x2 and the stride is 2 for both height and width. The output shape is 2x2x1.

Answer to 3 decimal places. Work from the top left to the bottom right.

__Answer 4: Average Pooling__

The correct answer is __1.375,0.875,5,4__. We start with the four numbers in the top left corner. Then we work left-to-right and top-to-bottom, moving 2 units each time.

```Python
mean(0, 1, 2, 2.5) = 1.375
mean(0.5, 10, 1, -8) = 0.875
mean(4, 0, 15, 1) = 5
mean(5, 6, 2, 3) = 4
```
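
As a cross-check, average pooling can be sketched the same way in plain Python. The function name `average_pool_2d` is an illustrative assumption; it handles a single 2-D slice with no padding.

```Python
def average_pool_2d(grid, k=2, stride=2):
    """Average pooling over a 2-D list of numbers, with no padding."""
    rows = range(0, len(grid) - k + 1, stride)
    cols = range(0, len(grid[0]) - k + 1, stride)
    # For each window position, average the k*k values it covers.
    return [[sum(grid[r + i][c + j] for i in range(k) for j in range(k)) / (k * k)
             for c in cols]
            for r in rows]

quiz_input = [[0, 1, 0.5, 10],
              [2, 2.5, 1, -8],
              [4, 0, 5, 6],
              [15, 1, 2, 3]]
print(average_pool_2d(quiz_input))  # [[1.375, 0.875], [5.0, 4.0]]
```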
---

### 7.3. 1 x 1 Convolutions <a name='1 x 1 Convolutions'></a>

<img src='Figures5/Screen Shot 2017-03-30 at 20.53.05.png' width=400>

Why might we ever want to use one by one convolutions? They're not really looking at a patch of the image, just that one pixel. Look at the classic convolution setting.

<img src='Figures5/Screen Shot 2017-03-30 at 20.55.05.png' width=400>

It's basically a small classifier for a patch of the image, but it's only a linear classifier. 

But if we add a one by one convolution in the middle, suddenly we have a mini neural network running over the patch instead of a linear classifier. 

<img src='Figures5/Screen Shot 2017-03-30 at 20.56.31.png' width=400> 

Interspersing our convolutions with one by one convolutions is a very inexpensive way to make models deeper and have more parameters without completely changing their structure. They're also very cheap, because if we go through the math, they're not really convolutions at all. They're really just matrix multiplies,
and they have relatively few parameters. 
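
To make the "just matrix multiplies" point concrete, here is a minimal plain-Python sketch. The image and weight values are made up for illustration, and the helper names `matvec` and `conv_1x1` are hypothetical.

```Python
def matvec(pixel, weights):
    """Multiply a length-C_in vector by a C_in x C_out matrix."""
    c_out = len(weights[0])
    return [sum(pixel[c] * weights[c][o] for c in range(len(pixel)))
            for o in range(c_out)]

def conv_1x1(image, weights):
    """1x1 convolution: the same matrix multiply applied at every pixel."""
    return [[matvec(pixel, weights) for pixel in row] for row in image]

# Hypothetical 2x2 image with 3 channels, mapped to 2 output channels.
image = [[[1, 2, 3], [0, 1, 0]],
         [[2, 0, 1], [1, 1, 1]]]
weights = [[1, 0],
           [0, 1],
           [1, 1]]  # shape (3, 2): only 3 * 2 = 6 parameters

print(conv_1x1(image, weights))
# [[[4, 5], [0, 1]], [[3, 1], [2, 2]]]
```

The height and width of the image are unchanged; only the depth goes from 3 to 2.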

Average pooling and one by one convolutions have been very successful at creating convnets that are both smaller and better than convnets that simply use a pyramid of convolutions.

### 7.4. Inception Module <a name='Inception Module'></a>

The idea behind the inception module is that at each layer of our convnet we can make a choice: have a pooling operation or have a convolution. If it's a convolution, we then need to decide whether it's a 1 by 1, a 3 by 3, or a 5 by 5. All of these are actually beneficial to the modeling power of the network.

<img src='Figures5/Screen Shot 2017-03-31 at 00.00.34.png' width=400>

So why choose? Instead, let's choose them all. 

<img src='Figures5/Screen Shot 2017-03-31 at 00.01.26.png' width=500>

Instead of having a single convolution, we have a composition of four branches: average pooling followed by a 1 by 1 convolution, a 1 by 1 convolution on its own, a 1 by 1 followed by a 3 by 3, and a 1 by 1 followed by a 5 by 5. At the top, we simply concatenate the output of each of them. It looks complicated, but what's interesting is that we can choose these parameters in such a way that the total number of parameters in our model is very small. Yet the model performs better than if we had a simple convolution.
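
One way to see why the parameter count can stay small is to count weights for a single branch. The sketch below is illustrative only: the depths (192, 16, 32) are assumed values chosen for the example, and `conv_params` is a hypothetical helper that ignores biases.

```Python
def conv_params(filter_size, in_depth, out_depth):
    """Number of weights in a convolution layer (biases ignored)."""
    return filter_size * filter_size * in_depth * out_depth

in_depth = 192  # hypothetical input depth

# A 5x5 convolution applied directly to the input.
direct = conv_params(5, in_depth, 32)

# A 1x1 convolution that first reduces the depth to 16, then the 5x5.
reduced = conv_params(1, in_depth, 16) + conv_params(5, 16, 32)

print(direct)   # 153600
print(reduced)  # 15872
```

With these assumed sizes, placing a 1 by 1 in front of the 5 by 5 cuts the weights in that branch by roughly a factor of ten.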

# 8. Convolutional Network in TensorFlow <a name='Convolutional Network in TensorFlow'></a>

It's time to walk through an example Convolutional Neural Network (CNN) in TensorFlow.

The structure of this network follows the classic structure of CNNs, which is a mix of convolutional layers and max pooling, followed by fully-connected layers.

The code we'll be looking at is similar to what we saw in the segment on Deep Neural Network in TensorFlow, except we restructured the architecture of this network as a CNN.

Just like in that segment, here we'll study the line-by-line breakdown of the code. 

Thanks to Aymeric Damien for providing the original TensorFlow model on which this segment is based.

### 8.1. Dataset <a name='Dataset'></a>

We're importing the MNIST dataset and using a convenient TensorFlow function to batch, scale, and one-hot encode the data.

```Python
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets(".", one_hot=True, reshape=False)

import tensorflow as tf

# Parameters
learning_rate = 0.00001
epochs = 10
batch_size = 128

# Number of samples to calculate validation and accuracy
# Decrease this if we're running out of memory to calculate accuracy
test_valid_size = 256

# Network Parameters
n_classes = 10  # MNIST total classes (0-9 digits)
dropout = 0.75  # Dropout, probability to keep units
```

### 8.2. Weights and Biases <a name='Weights and Biases'></a>

```Python
# Store layers weight & bias
weights = {
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
    'out': tf.Variable(tf.random_normal([1024, n_classes]))}

biases = {
    'bc1': tf.Variable(tf.random_normal([32])),
    'bc2': tf.Variable(tf.random_normal([64])),
    'bd1': tf.Variable(tf.random_normal([1024])),
    'out': tf.Variable(tf.random_normal([n_classes]))}
```
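
As a rough sanity check on these shapes, the sketch below counts how many trainable values each tensor holds. It is plain Python with the shapes copied from the dictionaries above; the helper name `count` is hypothetical.

```Python
# Shapes copied from the weights and biases dictionaries above.
weight_shapes = {
    'wc1': [5, 5, 1, 32],
    'wc2': [5, 5, 32, 64],
    'wd1': [7 * 7 * 64, 1024],
    'out': [1024, 10]}
bias_shapes = {'bc1': [32], 'bc2': [64], 'bd1': [1024], 'out': [10]}

def count(shape):
    """Number of values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

for name, shape in weight_shapes.items():
    print(name, count(shape))
# wc1 800
# wc2 51200
# wd1 3211264
# out 10240

total = (sum(count(s) for s in weight_shapes.values())
         + sum(count(s) for s in bias_shapes.values()))
print(total)  # 3274634
```

Almost all of the parameters sit in the fully connected weight `wd1`; the two convolution filters are tiny by comparison because their weights are shared across the image.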

### 8.3. Convolutions <a name='Convolutions'></a>

<img src='Figures5/convolution-schematic.gif' width=400>

The source is [Convolution with 3×3 Filter](http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution).

The above is an example of a [convolution](https://en.wikipedia.org/wiki/Convolution) with a 3x3 filter and a stride of 1 being applied to data with a range of 0 to 1. The convolution for each 3x3 section is calculated against the weight, __[[1, 0, 1], [0, 1, 0], [1, 0, 1]]__, then a bias is added to create the convolved feature on the right. In this case, the bias is zero. In TensorFlow, this is all done using [tf.nn.conv2d()](https://www.tensorflow.org/api_docs/python/tf/nn/conv2d) and [tf.nn.bias_add()](https://www.tensorflow.org/api_docs/python/tf/nn/bias_add).

```Python
def conv2d(x, W, b, strides=1):
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)
```

The [tf.nn.conv2d()](https://www.tensorflow.org/api_docs/python/tf/nn/conv2d) function computes the convolution against weight W as shown above.

In TensorFlow, __strides__ is an array of 4 elements; the first element indicates the stride for batch and the last element indicates the stride for features. It's good practice to remove the batches or features we want to skip from the data set rather than use a stride to skip them. We can always set the first and last elements of __strides__ to 1 in order to use all batches and features.

The middle two elements are the strides for height and width respectively. I've mentioned stride as one number because we usually have a square stride where __height = width__. When someone says they are using a stride of 3, they usually mean __tf.nn.conv2d(x, W, strides=[1, 3, 3, 1])__.

To make life easier, the code uses [tf.nn.bias_add()](https://www.tensorflow.org/api_docs/python/tf/nn/bias_add) to add the bias. It is essentially a special case of [tf.add()](https://www.tensorflow.org/api_docs/python/tf/add) in which the bias is restricted to a 1-D tensor that is added along the last (channel) dimension.
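
The sliding-window computation described in this section can be sketched in plain Python. The 3x3 weight is the one quoted above; the 5x5 binary input is an assumed example and is not taken from the text. Unlike the `conv2d` helper, this sketch uses no padding, so a 5x5 input shrinks to 3x3. The function name `convolve_2d` is hypothetical.

```Python
def convolve_2d(image, kernel, bias=0, stride=1):
    """2-D convolution with no padding; the kernel is not flipped."""
    k = len(kernel)
    rows = range(0, len(image) - k + 1, stride)
    cols = range(0, len(image[0]) - k + 1, stride)
    # Multiply each window element-wise by the kernel, sum, then add the bias.
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k)) + bias
             for c in cols]
            for r in rows]

# Assumed 5x5 input with values in the range 0 to 1.
image = [[1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]

print(convolve_2d(image, kernel))
# [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```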

### 8.4. Max Pooling <a name='Max Pooling'></a>

<img src='Figures5/maxpool.jpg' width= 400>

Max Pooling with 2x2 filter and stride of 2. [Source](http://cs231n.github.io/convolutional-networks/)

The above is an example of [max pooling](https://en.wikipedia.org/wiki/Convolutional_neural_network#Pooling_layer) with a 2x2 filter and stride of 2. The left square is the input and the right square is the output. The four 2x2 colors in the input represent each time the filter was applied to create the max on the right side. For example, __[[1, 1], [5, 6]]__ becomes 6 and __[[3, 2], [1, 2]]__ becomes 3.

```Python
def maxpool2d(x, k=2):
    return tf.nn.max_pool(
        x,
        ksize=[1, k, k, 1],
        strides=[1, k, k, 1],
        padding='SAME')
```

The [tf.nn.max_pool()](https://www.tensorflow.org/api_docs/python/tf/nn/max_pool) function does exactly what we would expect: it performs max pooling with the __ksize__ parameter as the size of the filter.

### 8.5. Model <a name='Model'></a>

<img src='Figures5/arch.png' width=500>

In the code below, we create two layers that each pair a convolution with max pooling, followed by a fully connected layer and an output layer. The transformation of each layer to new dimensions is shown in the comments. For example, the first layer reshapes the images from 28x28x1 to 28x28x32 in the convolution step. The next step applies max pooling, turning each sample into 14x14x32. All the layers are applied from __conv1__ to __out__, producing 10 class predictions.

```Python
def conv_net(x, weights, biases, dropout):
    # Layer 1 - 28*28*1 to 14*14*32
    conv1 = conv2d(x, weights['wc1'], biases['bc1'])
    conv1 = maxpool2d(conv1, k=2)

    # Layer 2 - 14*14*32 to 7*7*64
    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
    conv2 = maxpool2d(conv2, k=2)

    # Fully connected layer - 7*7*64 to 1024
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
    fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
    fc1 = tf.nn.relu(fc1)
    fc1 = tf.nn.dropout(fc1, dropout)

    # Output Layer - class prediction - 1024 to 10
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return out
```
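
The shape comments in `conv_net` can be checked with a small plain-Python trace. This is only a sketch: it assumes `'SAME'` padding throughout (as in the `conv2d` and `maxpool2d` helpers above), and `same_size` is a hypothetical helper.

```Python
import math

def same_size(size, stride):
    """Output size along one dimension with 'SAME' padding."""
    return math.ceil(size / stride)

size, depth = 28, 1
print('input', (size, size, depth))

for out_depth in (32, 64):
    size = same_size(size, 1)   # conv2d: stride 1 keeps height and width
    depth = out_depth
    print('conv', (size, size, depth))
    size = same_size(size, 2)   # maxpool2d: k=2 with stride 2 halves them
    print('pool', (size, size, depth))

flat = size * size * depth
print('flatten', flat)  # 3136, which matches 7*7*64 in weights['wd1']
```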



### 8.6. Session

```Python
# tf Graph input
x = tf.placeholder(tf.float32, [None, 28, 28, 1])
y = tf.placeholder(tf.float32, [None, n_classes])
keep_prob = tf.placeholder(tf.float32)

# Model
logits = conv_net(x, weights, biases, keep_prob)

# Define loss and optimizer
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate).minimize(cost)

# Accuracy
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    for epoch in range(epochs):
        for batch in range(mnist.train.num_examples//batch_size):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            sess.run(optimizer, feed_dict={
                x: batch_x,
                y: batch_y,
                keep_prob: dropout})

            # Calculate batch loss and accuracy
            loss = sess.run(cost, feed_dict={
                x: batch_x,
                y: batch_y,
                keep_prob: 1.})
            valid_acc = sess.run(accuracy, feed_dict={
                x: mnist.validation.images[:test_valid_size],
                y: mnist.validation.labels[:test_valid_size],
                keep_prob: 1.})

            print('Epoch {:>2}, Batch {:>3} - '
                  'Loss: {:>10.4f} Validation Accuracy: {:.6f}'.format(
                epoch + 1,
                batch + 1,
                loss,
                valid_acc))

    # Calculate Test Accuracy
    test_acc = sess.run(accuracy, feed_dict={
        x: mnist.test.images[:test_valid_size],
        y: mnist.test.labels[:test_valid_size],
        keep_prob: 1.})
    print('Testing Accuracy: {}'.format(test_acc))
```
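
For intuition about the loss and accuracy lines, here is a plain-Python sketch of softmax cross-entropy and an argmax comparison for a single example. It illustrates the math only and is not TensorFlow's implementation; the logits and label are made-up values, and the helper names are hypothetical.

```Python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, one_hot_label):
    """Cross-entropy between softmax(logits) and a one-hot label."""
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs))

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical logits for a 3-class problem; the true class is index 0.
logits = [2.0, 1.0, 0.1]
label = [1, 0, 0]

print(cross_entropy(logits, label))
print(argmax(logits) == argmax(label))  # True
```

In the session code, the cost averages this per-example loss over the batch, and the accuracy averages the argmax comparison.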

__Quiz 1: Using Convolution Layers in TensorFlow__

Let's now apply what we've learned to build real CNNs in TensorFlow. In the exercise below, we'll be asked to set up the dimensions of the convolution filters, the weights, and the biases. This is in many ways the trickiest part of using CNNs in TensorFlow. Once we have a sense of how to set up the dimensions of these attributes, applying CNNs will be far more straightforward.

We should go over the TensorFlow documentation for [2D convolutions](https://www.tensorflow.org/api_guides/python/nn#Convolution). Most of the documentation is straightforward, except perhaps the __padding__ argument. The __padding__ might differ depending on whether we pass '__VALID__' or '__SAME__'.

Here are a few more things worth reviewing:

1. Introduction to TensorFlow -> TensorFlow Variables.
2. How to determine the dimensions of the output based on the input size and the filter size (shown below). We'll use this to determine what the size of our filter should be.

```Python
new_height = (input_height - filter_height + 2 * P)/S + 1
new_width = (input_width - filter_width + 2 * P)/S + 1
```

Set up the strides, padding, and filter weight/bias such that the output shape is (1, 2, 2, 3).

```Python
import tensorflow as tf
import numpy as np

# `tf.nn.conv2d` requires the input be 4D (batch_size, height, width, depth)
# (1, 4, 4, 1)
x = np.array([
    [0, 1, 0.5, 10],
    [2, 2.5, 1, -8],
    [4, 0, 5, 6],
    [15, 1, 2, 3]], dtype=np.float32).reshape((1, 4, 4, 1))
X = tf.constant(x)


def conv2d(input):
    # Filter (weights and bias)
    # The shape of the filter weight is (height, width, input_depth, output_depth)
    # The shape of the filter bias is (output_depth,)
    # TODO: Define the filter weights `F_W` and filter bias `F_b`.
    # NOTE: Remember to wrap them in `tf.Variable`, they are trainable parameters after all.
    F_W = tf.Variable(tf.random_normal([2, 2, 1, 3]))
    F_b = tf.Variable(tf.zeros(3))
    # TODO: Set the stride for each dimension (batch_size, height, width, depth)
    strides = [1, 2, 2, 1]
    # TODO: set the padding, either 'VALID' or 'SAME'.
    padding = 'VALID'
    # https://www.tensorflow.org/versions/r0.11/api_docs/python/nn.html#conv2d
    # `tf.nn.conv2d` does not include the bias computation so we have to add it ourselves after.
    return tf.nn.conv2d(input, F_W, strides, padding) + F_b

convolution = conv2d(X)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    # Evaluate the convolution tensor built above
    output = sess.run(convolution)
    print('Output shape:', output.shape)
    print('Convolution result:', output)
```

```
Output shape: (1, 2, 2, 3)
Convolution result: [[[[  3.52205014   6.0473938   -9.88167572]
   [ -2.84982872 -14.56329346   1.54506218]]

  [[  7.6381855   13.48333549 -45.65987778]
   [ 12.39241028  10.22605324   2.58307552]]]]
```



__Answer 1: Using Convolution Layers in TensorFlow__

```Python
def conv2d(input):
    # Filter (weights and bias)
    F_W = tf.Variable(tf.truncated_normal((2, 2, 1, 3)))
    F_b = tf.Variable(tf.zeros(3))
    strides = [1, 2, 2, 1]
    padding = 'VALID'
    return tf.nn.conv2d(input, F_W, strides, padding) + F_b
```

We want to transform the input shape __(1, 4, 4, 1)__ to __(1, 2, 2, 3)__. We choose __'VALID'__ for the padding algorithm because it is simpler to understand and it achieves the result we're looking for.

```Python
out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width  = ceil(float(in_width - filter_width + 1) / float(strides[2]))
```

Plugging in the values:

```Python
out_height = ceil(float(4 - 2 + 1) / float(2)) = ceil(1.5) = 2
out_width  = ceil(float(4 - 2 + 1) / float(2)) = ceil(1.5) = 2
```

In order to change the depth from 1 to 3, we have to set the output depth of our filter appropriately:

```Python
F_W = tf.Variable(tf.truncated_normal((2, 2, 1, 3))) # (height, width, input_depth, output_depth)
F_b = tf.Variable(tf.zeros(3)) # (output_depth)
```

The input has a depth of 1, so we set that as the __input_depth__ of the filter.
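
The 2x2 filter with stride 2 is not the only combination that reaches a 2x2 output. As a sketch, the plain-Python search below applies the `'VALID'` formula above to every filter size and stride from 1 to 4; the helper name `valid_output_size` is hypothetical.

```Python
import math

def valid_output_size(in_size, filter_size, stride):
    """Output size along one dimension with 'VALID' padding."""
    return math.ceil((in_size - filter_size + 1) / stride)

# All (filter_size, stride) pairs that turn a size-4 input into size 2.
options = [(f, s)
           for f in range(1, 5)
           for s in range(1, 5)
           if valid_output_size(4, f, s) == 2]
print(options)  # [(1, 2), (1, 3), (2, 2), (3, 1)]
```

The answer above picks `(2, 2)`, but a 3x3 filter with stride 1 would satisfy the same shape requirement.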

__Quiz 2: Using Pooling Layers in TensorFlow__

We should go over the TensorFlow documentation for [tf.nn.max_pool()](https://www.tensorflow.org/api_docs/python/tf/nn/max_pool).

Padding works the same as it does for a convolution.

Set the values to `strides` and `ksize` such that
the output shape after pooling is (1, 2, 2, 1).

```Python
import tensorflow as tf
import numpy as np

# `tf.nn.max_pool` requires the input be 4D (batch_size, height, width, depth)
# (1, 4, 4, 1)
x = np.array([
    [0, 1, 0.5, 10],
    [2, 2.5, 1, -8],
    [4, 0, 5, 6],
    [15, 1, 2, 3]], dtype=np.float32).reshape((1, 4, 4, 1))
X = tf.constant(x)

def maxpool(input):
    # TODO: Set the ksize (filter size) for each dimension (batch_size, height, width, depth)
    ksize = [1, 2, 2, 1]
    # TODO: Set the stride for each dimension (batch_size, height, width, depth)
    strides = [1, 2, 2, 1]
    # TODO: set the padding, either 'VALID' or 'SAME'.
    padding = 'VALID'
    # https://www.tensorflow.org/versions/r0.11/api_docs/python/nn.html#max_pool
    return tf.nn.max_pool(input, ksize, strides, padding)

maxpooling = maxpool(X)

init = tf.global_variables_initializer()

with tf.Session() as session:
    session.run(init)
    output = session.run(maxpooling)
    print('Output shape:', output.shape)
    print('maxpool result:', output)
```

```
Output shape: (1, 2, 2, 1)
maxpool result: [[[[  2.5]
   [ 10. ]]

  [[ 15. ]
   [  6. ]]]]
```


__Answer 2: Using Pooling Layers in TensorFlow__

```Python
def maxpool(input):
    ksize = [1, 2, 2, 1]
    strides = [1, 2, 2, 1]
    padding = 'VALID'
    return tf.nn.max_pool(input, ksize, strides, padding)
```

We want to transform the input shape (1, 4, 4, 1) to (1, 2, 2, 1). We choose 'VALID' for the padding algorithm because it is simpler to understand and it achieves the result we're looking for.

```Python
out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width  = ceil(float(in_width - filter_width + 1) / float(strides[2]))
```

Plugging in the values:

```Python
out_height = ceil(float(4 - 2 + 1) / float(2)) = ceil(1.5) = 2
out_width  = ceil(float(4 - 2 + 1) / float(2)) = ceil(1.5) = 2
```

The depth doesn't change during a pooling operation so we don't have to worry about that.