# Lesson13 Convolutional Neural Network

## 1. Intuition

### Intuition
Let's develop better intuition for how Convolutional Neural Networks (CNN) work. We'll examine how humans classify images, and then see how CNNs use similar approaches.

Let’s say we wanted to classify the following image of a dog as a Golden Retriever.

|![Golden Retriever](https://video.udacity-data.com/topher/2016/November/58377b77_dog-1210559-1280/dog-1210559-1280.jpg)|
|:---:|
|An image that we'd like to classify as a Golden Retriever.|

As humans, how do we do this?

One thing we do is that we identify certain parts of the dog, such as the nose, the eyes, and the fur. We essentially break up the image into smaller pieces, recognize the smaller pieces, and then combine those pieces to get an idea of the overall dog.

In this case, we might break down the image into a combination of the following:

- A nose
- Two eyes
- Golden fur

These pieces can be seen below:

|<img src="https://video.udacity-data.com/topher/2016/November/58377bdb_screen-shot-2016-11-24-at-12.49.08-pm/screen-shot-2016-11-24-at-12.49.08-pm.png" width="200"/>|<img src="https://video.udacity-data.com/topher/2016/November/58377bed_screen-shot-2016-11-24-at-12.49.43-pm/screen-shot-2016-11-24-at-12.49.43-pm.png" width="200"/>|<img src="https://video.udacity-data.com/topher/2016/November/58377bff_screen-shot-2016-11-24-at-12.50.54-pm/screen-shot-2016-11-24-at-12.50.54-pm.png" width="200"/>|
|:----------------:|:-----------------:|:----------------:|
|The eye of the dog|The nose of the dog|The fur of the dog|

### Going One Step Further
But let’s take this one step further. How do we determine what exactly a nose is? A Golden Retriever nose can be seen as an oval with two black holes inside it. Thus, one way of classifying a Retriever’s nose is to break it up into smaller pieces and look for black holes (nostrils) and curves that define an oval, as shown below.

|<img src="https://video.udacity-data.com/topher/2016/November/58377c52_screen-shot-2016-11-24-at-12.51.47-pm/screen-shot-2016-11-24-at-12.51.47-pm.png" width = "200"/>|<img src = "https://video.udacity-data.com/topher/2016/November/58377c68_screen-shot-2016-11-24-at-12.51.51-pm/screen-shot-2016-11-24-at-12.51.51-pm.png" width = "200"/>|
|:------------------------------------:|:------------------------------------------------------:|
|A curve we can use to determine a nose|A nostril that we can use to classify a nose of the dog.|

Broadly speaking, this is what a CNN learns to do. It learns to recognize basic lines and curves, then shapes and blobs, and then increasingly complex objects within the image. Finally, the CNN classifies the image by combining the larger, more complex objects.

In our case, the levels in the hierarchy are:

- Simple shapes, like ovals and dark circles
- Complex objects (combinations of simple shapes), like eyes, nose, and fur
- The dog as a whole (a combination of complex objects)

With deep learning, we don't actually program the CNN to recognize these specific features. Rather, the CNN learns on its own to recognize such objects through forward propagation and backpropagation!

It's amazing how well a CNN can learn to classify images, even though we never program the CNN with information about specific features to look for.

|![](https://video.udacity-data.com/topher/2016/November/583cb19d_heirarchy-diagram/heirarchy-diagram.jpg)|
|:---:|
|An example of what each layer in a CNN might recognize when classifying a picture of a dog.|

A CNN might have several layers, and each layer might capture a different level in the hierarchy of objects. The first layer is the lowest level in the hierarchy, where the CNN generally classifies small parts of the image into simple shapes like horizontal and vertical lines and simple blobs of colors. The subsequent layers tend to be higher levels in the hierarchy and generally classify more complex ideas like shapes (combinations of lines), and eventually full objects like dogs.

Once again, the CNN **learns all of this on its own**. We don't ever have to tell the CNN to go looking for lines or curves or noses or fur. The CNN just learns from the training set and discovers which characteristics of a Golden Retriever are worth looking for.

That's a good start! Hopefully you've developed some intuition about how CNNs work.

Next, let’s look at some implementation details.



## 2. Filters

### Breaking up an Image
The first step for a CNN is to break up the image into smaller pieces. We do this by selecting a width and height that defines a filter.

The filter looks at small pieces, or patches, of the image. These patches are the same size as the filter.

|<img src = "https://video.udacity-data.com/topher/2016/November/58377d67_vlcsnap-2016-11-24-15h52m47s438/vlcsnap-2016-11-24-15h52m47s438.png" width = "700"/>|
|:---:|
|A CNN uses filters to split an image into smaller patches. The size of these patches matches the filter size.|

We then simply slide this filter horizontally or vertically to focus on a different piece of the image.

The amount by which the filter slides is referred to as the 'stride'. The stride is a hyperparameter which you, the engineer, can tune. Increasing the stride reduces the size of your model by reducing the number of total patches each layer observes. However, this usually comes with a reduction in accuracy.

Let’s look at an example. In this zoomed in image of the dog, we first start with the patch outlined in red. The width and height of our filter define the size of this square.

|<img src = "https://video.udacity-data.com/topher/2016/December/5840fdac_retriever-patch/retriever-patch.png" width = "400"/>|
|:---:|
|One patch of the Golden Retriever image.|

We then move the square over to the right by a given stride (2 in this case) to get another patch.

|<img src = "https://video.udacity-data.com/topher/2016/December/5840fe04_retriever-patch-shifted/retriever-patch-shifted.png" width = "400"/>|
|:---:|
|We move our square to the right by two pixels to create another patch.|

What's important here is that we are **grouping together adjacent pixels** and treating them as a collective.

In a normal, non-convolutional neural network, we would have ignored this adjacency. In a normal network, we would have connected every pixel in the input image to a neuron in the next layer. In doing so, we would not have taken advantage of the fact that pixels in an image are close together for a reason and have special meaning.

By taking advantage of this local structure, our CNN learns to classify local patterns, like shapes and objects, in an image.
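The patch-and-stride idea above can be sketched in plain Python. This is only an illustration: `extract_patches` and the 6x6 toy image are hypothetical names and values, not part of any library, and no padding is applied.

```python
def extract_patches(image, filter_size, stride):
    """Return (top, left, patch) for every position the filter visits (no padding)."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - filter_size + 1, stride):
        for left in range(0, width - filter_size + 1, stride):
            # Group adjacent pixels into one filter-sized patch.
            patch = [row[left:left + filter_size]
                     for row in image[top:top + filter_size]]
            patches.append((top, left, patch))
    return patches


# A hypothetical 6x6 "image" whose pixel values are just 0..35.
image = [[r * 6 + c for c in range(6)] for r in range(6)]

patches_stride_1 = extract_patches(image, filter_size=2, stride=1)
patches_stride_2 = extract_patches(image, filter_size=2, stride=2)

print(len(patches_stride_1))  # 25
print(len(patches_stride_2))  # 9
```

A larger stride visits fewer patches (9 instead of 25 here), which is why increasing the stride shrinks the next layer.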

### Filter Depth
It's common to have more than one filter. Different filters pick up different qualities of a patch. For example, one filter might look for a particular color, while another might look for a kind of object of a specific shape. The number of filters in a convolutional layer is called the filter depth.

|<img src = "https://video.udacity-data.com/topher/2016/November/58377e4f_neilsen-pic/neilsen-pic.png" width = "400"/>|
|:---:|
|In the above example, a patch is connected to a neuron in the next layer. Source: Michael Nielsen.|

How many neurons does each patch connect to?

That’s dependent on our filter depth. If we have a depth of `k`, we connect each patch of pixels to `k` neurons in the next layer. This gives the next layer a depth of `k`, as shown below. In practice, `k` is a hyperparameter we tune, and most CNNs tend to pick the same starting values.

|<img src = "https://video.udacity-data.com/topher/2016/December/5840ffda_filter-depth/filter-depth.png" width = "200"/>|
|:---:|
|Choosing a filter depth of `k` connects each patch to `k` neurons in the next layer.|

But why connect a single patch to multiple neurons in the next layer? Isn’t one neuron good enough?

Multiple neurons can be useful because a patch can have multiple interesting characteristics that we want to capture.

For example, one patch might include some white teeth, some blonde whiskers, and part of a red tongue. In that case, we might want a filter depth of at least three - one for each of teeth, whiskers, and tongue.

|<img src = "https://video.udacity-data.com/topher/2016/December/584104c8_teeth-whiskers-tongue/teeth-whiskers-tongue.png" width = "400"/>|
|:---:|
|This patch of the dog has many interesting features we may want to capture. These include the presence of teeth, the presence of whiskers, and the pink color of the tongue.|

Having multiple neurons for a given patch ensures that our CNN can learn to capture whatever characteristics the CNN learns are important.

Remember that the CNN isn't "programmed" to look for certain characteristics. Rather, it learns **on its own** which characteristics to notice.
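As a rough sketch of what "connecting one patch to `k` neurons" means, the snippet below computes one weighted sum per filter for a single patch. The function `apply_filters` and the three hand-written filters are made up for illustration; in a real CNN the weights are learned, not chosen by hand.

```python
def apply_filters(patch, filters, biases):
    """Connect one patch to k neurons: one weighted sum per filter."""
    flat_patch = [value for row in patch for value in row]
    outputs = []
    for weights, bias in zip(filters, biases):
        flat_weights = [w for row in weights for w in row]
        outputs.append(sum(p * w for p, w in zip(flat_patch, flat_weights)) + bias)
    return outputs


patch = [[1.0, 2.0],
         [3.0, 4.0]]

# Three made-up 2x2 filters, so the filter depth k is 3.
filters = [
    [[1.0, 0.0], [0.0, 0.0]],      # responds to the top-left pixel
    [[0.25, 0.25], [0.25, 0.25]],  # averages the patch
    [[-1.0, 1.0], [-1.0, 1.0]],    # responds to a left-to-right increase
]
biases = [0.0, 0.0, 0.0]

neuron_outputs = apply_filters(patch, filters, biases)
print(neuron_outputs)  # [1.0, 2.5, 2.0]
```

Each of the three outputs summarizes a different characteristic of the same patch.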


## 3. Parameters
### Parameter Sharing
|<img src = "https://video.udacity-data.com/topher/2016/November/58377f77_vlcsnap-2016-11-24-16h01m35s262/vlcsnap-2016-11-24-16h01m35s262.png" width = "700">|
|:---:|
|The weights, `w`, are shared across patches for a given layer in a CNN to detect the cat above regardless of where in the image it is located.|

When we are trying to classify a picture of a cat, we don’t care where in the image a cat is. If it’s in the top left or the bottom right, it’s still a cat in our eyes. We would like our CNNs to also possess this ability known as translation invariance. How can we achieve this?

As we saw earlier, the classification of a given patch in an image is determined by the weights and biases corresponding to that patch.

If we want a cat that’s in the top left patch to be classified in the same way as a cat in the bottom right patch, we need the weights and biases corresponding to those patches to be the same, so that they are classified the same way.

This is exactly what we do in CNNs. The weights and biases we learn for a given output layer are shared across all patches in a given input layer. Note that as we increase the depth of our filter, the number of weights and biases we have to learn still increases, as the weights aren't shared across the output channels.

There’s an additional benefit to sharing our parameters. If we did not reuse the same weights across all patches, we would have to learn new parameters for every single patch and hidden layer neuron pair. This does not scale well, especially for higher fidelity images. Thus, sharing parameters not only helps us with translation invariance, but also gives us a smaller, more scalable model.
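To make weight sharing concrete, here is a minimal sketch in which one filter (a single set of weights and a bias) is slid over every patch. The helper names (`convolve`, `place`, `blank`) and the 2x2 "cat" pattern are hypothetical. Because the weights are the same everywhere, the pattern produces the same peak response wherever it sits in the image.

```python
def convolve(image, weights, bias, stride=1):
    """Slide one shared filter (weights, bias) over every patch of the image."""
    f = len(weights)
    out = []
    for top in range(0, len(image) - f + 1, stride):
        row_out = []
        for left in range(0, len(image[0]) - f + 1, stride):
            total = bias
            for i in range(f):
                for j in range(f):
                    total += image[top + i][left + j] * weights[i][j]
            row_out.append(total)
        out.append(row_out)
    return out


def blank(size):
    return [[0] * size for _ in range(size)]


def place(pattern, top, left, size=5):
    """Draw the pattern onto an otherwise empty size x size image."""
    image = blank(size)
    for i, row in enumerate(pattern):
        for j, value in enumerate(row):
            image[top + i][left + j] = value
    return image


# A made-up 2x2 "cat" pattern and a filter that matches it.
pattern = [[1, 2], [3, 4]]
weights = [[1, 2], [3, 4]]

top_left = convolve(place(pattern, 0, 0), weights, bias=0)
bottom_right = convolve(place(pattern, 3, 3), weights, bias=0)

print(top_left[0][0], bottom_right[3][3])  # 30 30
```

The response moves with the pattern but its strength does not change, and only one set of four weights had to be stored.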

### Padding

|<img src = "https://video.udacity-data.com/topher/2016/November/5837d4d5_screen-shot-2016-11-24-at-10.05.37-pm/screen-shot-2016-11-24-at-10.05.37-pm.png" width = "200"/>|
|:---:|
|A `5x5` grid with a `3x3` filter. Source: Andrej Karpathy.|

#### Question 1
Let's say we have a 5x5 grid (as shown above) and a filter of size 3x3 with a stride of 1. What's the width and height of the next layer?

Answer: 3x3

As we can see, the width and height of each subsequent layer decreases in the above scheme.

In an ideal world, we'd be able to maintain the same width and height across layers so that we can continue to add layers without worrying about the dimensionality shrinking and so that we have consistency. How might we achieve this? One way is to simply add a border of `0`s to our original `5x5` image. You can see what this looks like in the below image.

|<img src = "https://video.udacity-data.com/topher/2016/November/5837d4ee_screen-shot-2016-11-24-at-10.05.46-pm/screen-shot-2016-11-24-at-10.05.46-pm.png" width = "200"/>|
|:---:|
|The same grid with `0` padding. Source: Andrej Karpathy.|
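A small sketch of zero padding, assuming a stride of 1 and a 3x3 filter as in the figures. `zero_pad` and `output_size` are illustrative helpers, not library functions.

```python
def zero_pad(grid, pad):
    """Surround a 2D grid with a border of zeros that is `pad` cells wide."""
    width = len(grid[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in grid:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


def output_size(input_size, filter_size, stride=1):
    """Number of filter positions along one dimension, without extra padding."""
    return (input_size - filter_size) // stride + 1


grid = [[1] * 5 for _ in range(5)]
padded = zero_pad(grid, pad=1)

print(len(padded), len(padded[0]))  # 7 7
print(output_size(len(grid), 3))    # 3
print(output_size(len(padded), 3))  # 5
```

With one ring of zeros, the 3x3 filter produces a 5x5 output again, so the width and height no longer shrink from layer to layer.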

### Dimensionality
From what we've learned so far, how can we calculate the number of neurons of each layer in our CNN?

Given:
- our input layer has a width of `W` and a height of `H`
- our convolutional layer has a filter size `F`
- we have a stride of `S`
- a padding of `P`
- and the number of filters `K`,

the following formula gives us the width of the next layer: `W_out = [(W - F + 2P)/S] + 1`.

The output height would be `H_out = [(H-F+2P)/S] + 1`.

And the output depth would be equal to the number of filters `D_out = K`.

The output volume would be `W_out * H_out * D_out`.

Knowing the dimensionality of each additional layer helps us understand how large our model is and how our decisions around filter size and stride affect the size of our network.
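The formulas above translate directly into a helper function. This sketch assumes the square brackets mean rounding down to an integer; `conv_output_shape` is a hypothetical name.

```python
def conv_output_shape(W, H, F, S, P, K):
    """Return (W_out, H_out, D_out) using the formulas above (brackets taken as floor)."""
    W_out = (W - F + 2 * P) // S + 1
    H_out = (H - F + 2 * P) // S + 1
    D_out = K
    return W_out, H_out, D_out


# The 5x5 grid with a 3x3 filter and stride 1, without and with padding of 1.
no_padding = conv_output_shape(W=5, H=5, F=3, S=1, P=0, K=1)
with_padding = conv_output_shape(W=5, H=5, F=3, S=1, P=1, K=1)

print(no_padding)    # (3, 3, 1)
print(with_padding)  # (5, 5, 1)
```

Multiplying the three returned values gives the output volume, that is, the number of neurons in the layer.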

### Quiz : Convolution Output shape
**Introduction**   
For the next few quizzes we'll test your understanding of the dimensions in CNNs. Understanding dimensions will help you make accurate tradeoffs between model size and performance. As you'll see, some parameters have a much bigger impact on model size than others.

**Setup**   
H = height, W = width, D = depth
- We have an input of shape 32x32x3 (HxWxD)
- 20 filters of shape 8x8x3 (HxWxD)
- A stride of 2 for both the height and width (S)
- With padding of size 1 (P)

Recall the formula for calculating the new height or width:
```
new_height = (input_height - filter_height + 2 * P)/S + 1
new_width = (input_width - filter_width + 2 * P)/S + 1
```

**Question - Convolutional Layer Output Shape**   
What's the shape of the output?

The answer format is HxWxD, so if you think the new height is 9, new width is 9, and new depth is 5, then type 9x9x5.

- **Solution**   
    We can get the new height and width with the above formula resulting in:
    ```
    (32 - 8 + 2 * 1)/2 + 1 = 14
    (32 - 8 + 2 * 1)/2 + 1 = 14
    ```
    The new depth is equal to the number of filters, which is 20.
    
    This would correspond to the following code:
    ```python
    import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder)

    input = tf.placeholder(tf.float32, (None, 32, 32, 3))
    filter_weights = tf.Variable(tf.truncated_normal((8, 8, 3, 20))) # (height, width, input_depth, output_depth)
    filter_bias = tf.Variable(tf.zeros(20))
    strides = [1, 2, 2, 1] # (batch, height, width, depth)
    padding = 'SAME'
    conv = tf.nn.conv2d(input, filter_weights, strides, padding) + filter_bias
    ```
    Note the output shape of `conv` will be [None, 16, 16, 20] (or [1, 16, 16, 20] for a batch of one image). It's 4D to account for batch size, but more importantly, the height and width are 16, not 14. This is because the padding algorithm TensorFlow uses is not exactly the same as the one above. Alternatively, switching `padding` from `'SAME'` to `'VALID'` would result in a height and width of 13. If you're curious how padding works in TensorFlow, read [this document](https://www.tensorflow.org/api_docs/python/tf/nn/convolution).

    In summary, TensorFlow uses the following equations for `'SAME'` vs `'VALID'`:

    **SAME Padding**, the output height and width are computed as:

    `out_height = ceil(float(in_height) / float(strides[1]))`

    `out_width = ceil(float(in_width) / float(strides[2]))`

    **VALID Padding**, the output height and width are computed as:

    `out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))`

    `out_width = ceil(float(in_width - filter_width + 1) / float(strides[2]))`
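The two padding rules can be checked without TensorFlow. The sketch below simply re-implements the equations quoted above for one dimension; the function names are made up for illustration.

```python
import math


def same_output_size(in_size, stride):
    """'SAME' padding: the output size depends only on the input size and stride."""
    return math.ceil(in_size / stride)


def valid_output_size(in_size, filter_size, stride):
    """'VALID' padding: no padding, so the filter must fit entirely inside the input."""
    return math.ceil((in_size - filter_size + 1) / stride)


# The quiz setup: 32x32 input, 8x8 filter, stride 2.
same_size = same_output_size(32, 2)
valid_size = valid_output_size(32, 8, 2)

print(same_size)   # 16
print(valid_size)  # 13
```

Neither result equals the 14 obtained from the textbook formula with `P = 1`, which is why the TensorFlow output shape differs from the quiz answer.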

### Quiz: Number of Parameters Without Parameter sharing
We're now going to calculate the number of parameters of the convolutional layer. The answer from the last quiz will come into play here!

Being able to calculate the number of parameters in a neural network is useful since we want to have control over how much memory a neural network uses.

**Setup**   
H = height, W = width, D = depth

- We have an input of shape 32x32x3 (HxWxD)
- 20 filters of shape 8x8x3 (HxWxD)
- A stride of 2 for both the height and width (S)
- Zero padding of size 1 (P)

**Output Layer**
- 14x14x20 (HxWxD)

**Hint**   
Without parameter sharing, each neuron in the output layer must connect to each neuron in the filter. In addition, each neuron in the output layer must also connect to a single bias neuron.

**Question - Convolution Layer Parameters 1**   
How many parameters does the convolutional layer have (**without parameter sharing**)?

- **Solution**
    There are `756560` total parameters. That's a HUGE amount! Here's how we calculate it:
    ```
    (8 * 8 * 3 + 1) * (14 * 14 * 20) = 756560
    ```
    `8 * 8 * 3` is the number of weights, and we add `1` for the bias. Remember, without parameter sharing every neuron in the output (`14 * 14 * 20` of them) gets its own copy of these weights. So we multiply these two numbers together to get the final answer.


### Quiz: Number of Parameters With Parameter sharing
Now we'd like you to calculate the number of parameters in the convolutional layer, if every neuron in the output layer shares its parameters with every other neuron in its same channel.

This is the number of parameters actually used in a convolution layer (`tf.nn.conv2d()`).

**Setup**   
H = height, W = width, D = depth

- We have an input of shape 32x32x3 (HxWxD)
- 20 filters of shape 8x8x3 (HxWxD)
- A stride of 2 for both the height and width (S)
- Zero padding of size 1 (P)

**Output Layer**
- 14x14x20 (HxWxD)

**Hint**   
With parameter sharing, each neuron in an output channel shares its weights with every other neuron in that channel. So the number of parameters is equal to the number of neurons in the filter, plus a bias neuron, all multiplied by the number of channels in the output layer.

**Question - Convolution Layer Parameters 2**   
How many parameters does the convolutional layer have (**with parameter sharing**)?

- **Solution**
    There are `3860` total parameters. That's 196 times fewer parameters! Here's how the answer is calculated:
    ```
    (8 * 8 * 3 + 1) * 20 = 3840 + 20 = 3860
    ```
    That's `3840` weights and `20` biases. This should look similar to the answer from the previous quiz. The difference being it's just `20` instead of (`14 * 14 * 20`). Remember, with weight sharing we use the same filter for an entire depth slice. Because of this we can get rid of `14 * 14` and be left with only `20`.
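The arithmetic from the last two quizzes can be reproduced in a few lines of Python. The variable names are illustrative only.

```python
filter_height, filter_width, input_depth = 8, 8, 3
output_height, output_width, output_depth = 14, 14, 20

# Each output neuron looks at one 8x8x3 patch and has one bias.
params_per_neuron = filter_height * filter_width * input_depth + 1

# Without sharing: every output neuron owns its own weights and bias.
without_sharing = params_per_neuron * (output_height * output_width * output_depth)

# With sharing: one set of weights and one bias per output channel.
with_sharing = params_per_neuron * output_depth

print(without_sharing)                  # 756560
print(with_sharing)                     # 3860
print(without_sharing // with_sharing)  # 196
```

The ratio of 196 is exactly `14 * 14`, the number of spatial positions that now reuse the same filter.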


## 4. Visualizing CNNs
### Visualizing CNNs
Let’s look at an example CNN to see how it works in action.

The CNN we will look at is trained on ImageNet as described in [this paper](https://arxiv.org/abs/1311.2901) by Zeiler and Fergus. In the images below (from the same paper), we’ll see what each layer in this network detects and see how each layer detects more and more complex ideas.

### Layer 1
|<img src = "https://video.udacity-data.com/topher/2016/November/583cbd42_layer-1-grid/layer-1-grid.png" width = "150"/>|
|:---:|
|Example patterns that cause activations in the first layer of the network. These range from simple diagonal lines (top left) to green blobs (bottom middle).|

The images above are from Matthew Zeiler and Rob Fergus' [deep visualization toolbox](https://www.youtube.com/watch?v=ghEmQSxT6tw), which lets us visualize what each layer in a CNN focuses on.

Each image in the above grid represents a pattern that causes the neurons in the first layer to activate - in other words, they are patterns that the first layer recognizes. The top left image shows a -45 degree line, while the middle top square shows a +45 degree line. These squares are shown below again for reference.

|<img src = "https://video.udacity-data.com/topher/2016/November/583cbba2_diagonal-line-1/diagonal-line-1.png" width = "100"/>|
|:---:|
|As visualized here, the first layer of the CNN can recognize -45 degree lines.|

|<img src = "https://video.udacity-data.com/topher/2016/November/583cbc02_diagonal-line-2/diagonal-line-2.png" width = "100"/>|
|:---:|
|The first layer of the CNN is also able to recognize +45 degree lines, like the one above.|

Let's now see some example images that cause such activations. The below grid of images all activated the -45 degree line. Notice how they are all selected despite the fact that they have different colors, gradients, and patterns.

|<img src = "https://video.udacity-data.com/topher/2016/November/583cbace_grid-layer-1/grid-layer-1.png" width = "150"/>|
|:---:|
|Example patches that activate the -45 degree line detector in the first layer.|

So, the first layer of our CNN clearly picks out very simple shapes and patterns like lines and blobs.

### Layer 2
|<img src = "https://video.udacity-data.com/topher/2016/November/583780f3_screen-shot-2016-11-24-at-12.09.02-pm/screen-shot-2016-11-24-at-12.09.02-pm.png" width = "300"/>|
|:---:|
|A visualization of the second layer in the CNN. Notice how we are picking up more complex ideas like circles and stripes. The gray grid on the left represents how this layer of the CNN activates (or "what it sees") based on the corresponding images from the grid on the right.|

The second layer of the CNN captures complex ideas.

As you see in the image above, the second layer of the CNN recognizes circles (second row, second column), stripes (first row, second column), and rectangles (bottom right).

**The CNN learns to do this on its own.** There is no special instruction for the CNN to focus on more complex objects in deeper layers. That's just how it normally works out when you feed training data into a CNN.

### Layer 3
|<img src = "https://video.udacity-data.com/topher/2016/November/5837811f_screen-shot-2016-11-24-at-12.09.24-pm/screen-shot-2016-11-24-at-12.09.24-pm.png" width = "300"/>|
|:---:|
|A visualization of the third layer in the CNN. The gray grid on the left represents how this layer of the CNN activates (or "what it sees") based on the corresponding images from the grid on the right.|

The third layer picks out complex combinations of features from the second layer. These include things like grids and honeycombs (top left), wheels (second row, second column), and even faces (third row, third column).

We'll skip layer 4, which continues this progression, and jump right to the fifth and final layer of this CNN.

### Layer 5
|<img src = "https://video.udacity-data.com/topher/2016/November/58378151_screen-shot-2016-11-24-at-12.08.11-pm/screen-shot-2016-11-24-at-12.08.11-pm.png" width = "400"/>|
|:---:|
|A visualization of the fifth and final layer of the CNN. The gray grid on the left represents how this layer of the CNN activates (or "what it sees") based on the corresponding images from the grid on the right.|

The last layer picks out the highest order ideas that we care about for classification, like dog faces, bird faces, and bicycles.

### On to TensorFlow
This concludes our high-level discussion of Convolutional Neural Networks.

Next you'll practice actually building these networks in TensorFlow.


## 6. TensorFlow Convolution Layer

### TensorFlow Convolution Layer
Let's examine how to implement a CNN in TensorFlow.

TensorFlow provides the `tf.nn.conv2d()` and `tf.nn.bias_add()` functions to create your own convolutional layers.
```python
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder)

# Output depth
k_output = 64

# Image Properties
image_width = 10
image_height = 10
color_channels = 3

# Convolution filter
filter_size_width = 5
filter_size_height = 5

# Input/Image
input = tf.placeholder(
    tf.float32,
    shape=[None, image_height, image_width, color_channels])

# Weight and bias
weight = tf.Variable(tf.truncated_normal(
    [filter_size_height, filter_size_width, color_channels, k_output]))
bias = tf.Variable(tf.zeros(k_output))

# Apply Convolution
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
# Add bias
conv_layer = tf.nn.bias_add(conv_layer, bias)
# Apply activation function
conv_layer = tf.nn.relu(conv_layer)
```

The code above uses the `tf.nn.conv2d()` function to compute the convolution with `weight` as the filter and `[1, 2, 2, 1]` for the strides. TensorFlow uses a stride for each input dimension, `[batch, input_height, input_width, input_channels]`. We are generally always going to set the stride for `batch` and `input_channels` (i.e. the first and fourth element in the `strides` array) to be `1`.

You'll focus on changing `input_height` and `input_width` while setting `batch` and `input_channels` to 1. The `input_height` and `input_width` strides are for striding the filter over `input`. This example code uses a stride of 2 with 5x5 filter over `input`.

The `tf.nn.bias_add()` function adds a 1-d bias to the last dimension in a matrix.
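To illustrate what "adding a 1-d bias to the last dimension" means, here is a plain-Python sketch that mimics the bias and ReLU steps on nested lists. It is not TensorFlow code: `bias_add` and `relu` below are hand-written stand-ins, and the tiny feature map is made up.

```python
def bias_add(feature_map, bias):
    """Add bias[d] to every value in channel d of a [height][width][depth] list."""
    return [[[value + b for value, b in zip(pixel, bias)]
             for pixel in row]
            for row in feature_map]


def relu(feature_map):
    """Replace negative values with zero, element by element."""
    return [[[max(0.0, value) for value in pixel] for pixel in row]
            for row in feature_map]


# A made-up 1x2 feature map with depth 3 (a single image, no batch dimension).
feature_map = [[[1.0, -2.0, 0.5], [-1.0, 0.0, 3.0]]]
bias = [0.5, 0.5, -1.0]

activated = relu(bias_add(feature_map, bias))
print(activated)  # [[[1.5, 0.0, 0.0], [0.0, 0.5, 2.0]]]
```

The bias has one entry per output channel, so its length must match the depth of the feature map (`k_output` in the TensorFlow example).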

## 7. TensorFlow Max Pooling
### TensorFlow Max Pooling
|<img src = "https://video.udacity-data.com/topher/2016/November/582aac09_max-pooling/max-pooling.png" width = "300">|
|:---:|
|By Aphex34 (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons|

The image above is an example of [max pooling](https://en.wikipedia.org/wiki/Convolutional_neural_network#Pooling_layer) with a 2x2 filter and stride of 2. The four 2x2 colors represent each time the filter was applied to find the maximum value.

For example, `[[1, 0], [4, 6]]` becomes `6`, because `6` is the maximum value in this set. Similarly, `[[2, 3], [6, 8]]` becomes `8`.

Conceptually, the benefit of the max pooling operation is to reduce the size of the input, and allow the neural network to focus on only the most important elements. Max pooling does this by only retaining the maximum value for each filtered area, and removing the remaining values.

TensorFlow provides the `tf.nn.max_pool()` function to apply [max pooling](https://en.wikipedia.org/wiki/Convolutional_neural_network#Pooling_layer) to your convolutional layers.
```python
...
conv_layer = tf.nn.conv2d(input, weight, strides=[1, 2, 2, 1], padding='SAME')
conv_layer = tf.nn.bias_add(conv_layer, bias)
conv_layer = tf.nn.relu(conv_layer)
# Apply Max Pooling
conv_layer = tf.nn.max_pool(
    conv_layer,
    ksize=[1, 2, 2, 1],
    strides=[1, 2, 2, 1],
    padding='SAME')
```

The `tf.nn.max_pool()` function performs max pooling with the `ksize` parameter as the size of the filter and the `strides` parameter as the length of the stride. 2x2 filters with a stride of 2x2 are common in practice.

The `ksize` and `strides` parameters are structured as 4-element lists, with each element corresponding to a dimension of the input tensor (`[batch, height, width, channels]`). For both `ksize` and `strides`, the batch and channel dimensions are typically set to `1`.

### Quiz : Pooling intuition
**QUIZ QUESTION**   
A pooling layer is generally used to ...

- **Answer**
    - Decrease the size of the output
    - Prevent overfitting
   
   
- **Solution**   
    The correct answer is **decrease the size of the output** and **prevent overfitting**. Reducing overfitting is a consequence of reducing the output size, which in turn reduces the number of parameters in future layers.   
    
    Recently, pooling layers have fallen out of favor. Some reasons are:

    - Recent datasets are so big and complex we're more concerned about underfitting.
    - Dropout is a much better regularizer.
    - Pooling results in a loss of information. Think about the max pooling operation as an example. We only keep the largest of n numbers, thereby disregarding n-1 numbers completely.
    
### Quiz : Pooling Mechanics

**Setup**   
H = height, W = width, D = depth

- We have an input of shape 4x4x5 (HxWxD)
- Filter of shape 2x2 (HxW)
- A stride of 2 for both the height and width (S)
- Recall the formula for calculating the new height or width:
```
new_height = (input_height - filter_height)/S + 1
new_width = (input_width - filter_width)/S + 1
```

NOTE: For a pooling layer the output depth is the same as the input depth. Additionally, the pooling operation is applied individually for each depth slice.

The image below gives an example of how a max pooling layer works. In this case, the max pooling filter has a shape of 2x2. As the max pooling filter slides across the input layer, the filter will output the maximum value of the 2x2 square.

![](https://video.udacity-data.com/topher/2017/February/58a5fe3e_convolutionalnetworksquiz/convolutionalnetworksquiz.png)

**Question - Pooling Layer Output Shape**   
What's the shape of the output? Format is HxWxD.

- **Answer**   
    2x2x5
    
    
- **Solution**   
    The answer is 2x2x5. Here's how it's calculated using the above formula:
    ```
    (4 - 2)/2 + 1 = 2
    (4 - 2)/2 + 1 = 2
    ```
    The depth stays the same.

- **Code**   
    Here's the corresponding code:
    ```python
    import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder)

    input = tf.placeholder(tf.float32, (None, 4, 4, 5))
    filter_shape = [1, 2, 2, 1]
    strides = [1, 2, 2, 1]
    padding = 'VALID'
    pool = tf.nn.max_pool(input, filter_shape, strides, padding)
    ```
    The output shape of `pool` will be [None, 2, 2, 5] (or [1, 2, 2, 5] for a batch of one image), even if `padding` is changed to `'SAME'`.
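The claim that `'VALID'` and `'SAME'` give the same pooled shape here can be checked with a small sketch. `pool_output_shape` is a hypothetical helper; the `'VALID'` branch uses the quiz formula and the `'SAME'` branch uses the `ceil(in / stride)` rule quoted earlier in this lesson.

```python
import math


def pool_output_shape(height, width, depth, filter_size, stride, padding="VALID"):
    """Output shape (H, W, D) of a pooling layer; the depth is unchanged."""
    if padding == "SAME":
        new_height = math.ceil(height / stride)
        new_width = math.ceil(width / stride)
    else:  # "VALID": no padding, as in the formula above
        new_height = (height - filter_size) // stride + 1
        new_width = (width - filter_size) // stride + 1
    return new_height, new_width, depth


valid_shape = pool_output_shape(4, 4, 5, filter_size=2, stride=2, padding="VALID")
same_shape = pool_output_shape(4, 4, 5, filter_size=2, stride=2, padding="SAME")

print(valid_shape)  # (2, 2, 5)
print(same_shape)   # (2, 2, 5)
```

The two agree because a 2x2 filter with stride 2 tiles a 4x4 input exactly, so no padding is needed.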
    
### Quiz: Pooling Practice
Great, now let's practice doing some pooling operations manually.

**Question - Max Pooling**   
What's the result of a max pooling operation on the input:
```
[[[0, 1, 0.5, 10],
   [2, 2.5, 1, -8],
   [4, 0, 5, 6],
   [15, 1, 2, 3]]]
```
Assume the filter is 2x2 and the stride is 2 for both height and width. The output shape is 2x2x1.

The answering format will be 4 numbers, each separated by a comma, such as: `1,2,3,4`.

**Work from the top left to the bottom right**

- **Solution**
    The correct answer is `2.5, 10, 15, 6`. We start with the four numbers in the top left corner. Then we work left-to-right and top-to-bottom, moving 2 units each time.
    ```
    max(0, 1, 2, 2.5) = 2.5
    max(0.5, 10, 1, -8) = 10
    max(4, 0, 15, 1) = 15
    max(5, 6, 2, 3) = 6    
    ```
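The same result can be reproduced with a short plain-Python sketch. `pool2d` is an illustrative helper that handles a single depth slice, not a TensorFlow function.

```python
def pool2d(grid, size, stride, reduce_fn):
    """Apply reduce_fn to each size x size window, moving by stride."""
    output = []
    for top in range(0, len(grid) - size + 1, stride):
        row_out = []
        for left in range(0, len(grid[0]) - size + 1, stride):
            window = [grid[top + i][left + j]
                      for i in range(size) for j in range(size)]
            row_out.append(reduce_fn(window))
        output.append(row_out)
    return output


grid = [[0, 1, 0.5, 10],
        [2, 2.5, 1, -8],
        [4, 0, 5, 6],
        [15, 1, 2, 3]]

max_pooled = pool2d(grid, size=2, stride=2, reduce_fn=max)
print(max_pooled)  # [[2.5, 10], [15, 6]]
```

Reading the output left-to-right and top-to-bottom gives `2.5, 10, 15, 6`.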
    
### Quiz : Average Pooling
**Question - Mean Pooling**
What's the result of an **average (or mean) pooling** operation on the input:
```
[[[0, 1, 0.5, 10],
   [2, 2.5, 1, -8],
   [4, 0, 5, 6],
   [15, 1, 2, 3]]]
```
Assume the filter is 2x2 and the stride is 2 for both height and width. The output shape is 2x2x1.

The answering format will be 4 numbers, each separated by a comma, such as: 1,2,3,4.

**Answer to 3 decimal places. Work from the top left to the bottom right**

- **Solution**
    The correct answer is `1.375, 0.875, 5, 4`. We start with the four numbers in the top left corner. Then we work left-to-right and top-to-bottom, moving 2 units each time.
    ```
    mean(0, 1, 2, 2.5) = 1.375
    mean(0.5, 10, 1, -8) = 0.875
    mean(4, 0, 15, 1) = 5
    mean(5, 6, 2, 3) = 4
    ```
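A matching sketch for average pooling, again in plain Python on a single depth slice. `average_pool` is a hypothetical helper name.

```python
def average_pool(grid, size=2, stride=2):
    """Replace each size x size window with the mean of its values."""
    output = []
    for top in range(0, len(grid) - size + 1, stride):
        row_out = []
        for left in range(0, len(grid[0]) - size + 1, stride):
            window = [grid[top + i][left + j]
                      for i in range(size) for j in range(size)]
            row_out.append(sum(window) / len(window))
        output.append(row_out)
    return output


grid = [[0, 1, 0.5, 10],
        [2, 2.5, 1, -8],
        [4, 0, 5, 6],
        [15, 1, 2, 3]]

avg_pooled = average_pool(grid)
print(avg_pooled)  # [[1.375, 0.875], [5.0, 4.0]]
```

Unlike max pooling, every value in a window contributes to the output, so the large `10` and the `-8` in the top-right window largely cancel out.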

## 8. 1x1 Convolution
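[Source: https://hwiyong.tistory.com/45]

GoogLeNet, that is, the Inception network, used 1x1 convolutions to reduce the amount of computation. In later research as well, 1x1 convolutions proved very helpful for reducing computation and for capturing channel-wise features.

In particular, they made it possible to build deeper networks that achieved excellent results, and they shaped the direction of later research. A look at just a few image classification papers makes this clear right away.

This section gives a simple explanation of 1x1 convolution.

A 1x1 convolution has three main advantages:

1. Adjusting the number of channels
2. Reducing the amount of computation
3. Nonlinearity

---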

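### (1) Adjusting the Number of Channels
What is a 1x1 convolution?

Intuitively, it is a convolutional layer that uses convolution filters of size 1x1.
When we use a convolutional layer we have to decide on the number of channels, so how does that work here?

This is where the advantage of the 1x1 convolution shows. We can choose as many channels as we want. With ordinary convolutions we could not choose the number of channels freely, because the number of parameters grows sharply with it. With a 1x1 convolution we don't need to worry about this.

Because of this, except in special cases, we can adjust the number of channels freely and try out a variety of model designs.

As in the figure, by stacking 1x1x128 convolution filters we can build a layer with exactly the number of channels we want.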
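A 1x1 convolution can be viewed as applying the same small linear map to the channel vector at every pixel. The sketch below shows this in plain Python; `conv1x1`, the 2x2 feature map, and the weights that reduce 4 channels to 2 are all made up for illustration, and the bias and activation are left out.

```python
def conv1x1(feature_map, weights):
    """feature_map: [height][width][in_channels]; weights: [in_channels][out_channels]."""
    out_channels = len(weights[0])
    return [[[sum(pixel[c] * weights[c][k] for c in range(len(pixel)))
              for k in range(out_channels)]
             for pixel in row]
            for row in feature_map]


# A made-up 2x2 feature map with 4 channels.
feature_map = [[[1, 2, 3, 4], [0, 1, 0, 1]],
               [[2, 2, 2, 2], [4, 3, 2, 1]]]

# Hypothetical weights that reduce 4 channels to 2.
weights = [[1, 0],
           [0, 1],
           [1, 0],
           [0, 1]]

reduced = conv1x1(feature_map, weights)
print(reduced)  # [[[4, 6], [0, 2]], [[4, 4], [6, 4]]]
```

The height and width stay 2x2 while the depth drops from 4 to 2. The number of output channels is simply the number of columns in `weights`.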

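### (2) Reducing the Amount of Computation
* Note: the "number of parameters" here mostly refers to the size of the result. Strictly speaking, the (`28*28`) factor in the figure below has to be removed to get the true number of parameters. For convenience, however, we will still call it the number of parameters. Please keep this in mind.

![](https://blog.kakaocdn.net/dn/KqguG/btqAyyU6Dlp/4k2pZ9IiJeZT3lB0KBotgK/img.png)

As the figure above shows, **adjusting the number of channels directly reduces the amount of computation**, which helps us build deeper networks. In particular, a structure like the one above, which reduces the number of channels and then increases it again, is sometimes called a bottleneck structure. When there are too many parameters, even a good GPU struggles to cope, so this is a technique well worth considering.

Let's look at the result first. In the figure above, the difference between the top part and the bottom part is whether a 1x1 convolution is used.

The `red text` is the total number of parameters used in that process. The top part needs 160 million parameters, and the bottom part needs about 44 million. That is roughly a 4x difference. If you have read a number of papers you will already know this, but the parameter counts of models that use 1x1 convolutions and models that don't differ sharply.

The `blue text` is the number of values in the result after the convolution
(the `28*28` factor can be ignored).

The `green text` is the number of parameters in the filters used in the convolution.

Use the figure above to think through what each number means. If the meaning doesn't come to you clearly, your understanding of the convolution operation may not be fully settled yet. It won't take long, but it does take some time to sink in.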
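The totals quoted above can be approximately reproduced with a short calculation. The layer sizes in this sketch are assumptions, not values read from the figure: a 28x28x128 input, a 5x5 convolution with 64 output channels, and a 32-channel 1x1 bottleneck were chosen because they give totals close to the quoted 160 million and 44 million. The output is assumed to stay 28x28.

```python
def conv_multiplications(height, width, in_channels, out_channels, filter_size):
    """Multiplications needed to produce a height x width x out_channels output."""
    return height * width * out_channels * filter_size * filter_size * in_channels


# Assumed sizes: 28x28x128 input, 64 output channels, 32-channel bottleneck.
direct = conv_multiplications(28, 28, 128, 64, filter_size=5)
bottleneck = (conv_multiplications(28, 28, 128, 32, filter_size=1)
              + conv_multiplications(28, 28, 32, 64, filter_size=5))

print(direct)                         # 160563200
print(bottleneck)                     # 43352064
print(round(direct / bottleneck, 1))  # 3.7
```

Under these assumptions the 1x1 step itself costs only about 3 million multiplications, and it makes the expensive 5x5 convolution four times cheaper by shrinking its input from 128 to 32 channels.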

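### (3) Nonlinearity
If you study the various Inception model versions, including GoogLeNet, into which the Google team put a great deal of thought, you will see that they use a large number of 1x1 convolutions. Each one is followed by a ReLU activation, which increases the nonlinearity of the model. One of the purposes of ReLU is precisely to add nonlinearity to a model. More nonlinearity roughly means that the model can recognize more complex patterns.

Using 1x1 convolutions reduces the number of parameters, and that in turn lets us build deeper models. A deeper model uses more nonlinear activation functions than before, so it can capture increasingly specific patterns and improve its performance.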

## 9. Inception Module
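- Inception (GoogLeNet)  
  This is the model that won first place at ILSVRC 2014. After the competition it was presented under the name Inception in the paper Going Deeper with Convolutions. Several versions of Inception have been released since then; the one described here is Inception v1. It is also called GoogLeNet, which was the team name used when entering ILSVRC. To avoid confusion with the other versions, Inception v1 is referred to as GoogLeNet below.

- Google's hypothesis  
  In deep learning, the accepted view is that when training on large amounts of data, a deeper network with wider layers generally performs better. In practice, however, making the network larger adds many parameters, the amount of computation grows sharply as the network grows, and problems such as overfitting and vanishing gradients make training very difficult.

  One of the proposed solutions is sparse connectivity. The convolution operations we have learned so far are densely connected. The idea was that if only highly correlated nodes were connected, in other words if the connections between nodes were made sparse, the amount of computation and the number of parameters would drop, and overfitting would improve as a result. It was expected to work much like Dropout in a fully connected network.

  In practice, however, sparse matrix operations use more computational resources than dense matrix operations. Even CNNs from the LeNet era used sparse connections. Later, dense connections were adopted so that computation could be parallelized, and dense matrix computation techniques advanced. Sparse matrix computation did not advance as much and is less efficient than dense matrix computation. Using sparse connectivity directly was therefore not a workable way to reach the goals above.

  The question Google then worked on was how to reduce the connections between nodes (sparse connectivity) while still performing dense matrix operations. The result of that work is the Inception module.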



![](https://datascienceschool.net/upfiles/f7fbb88f80234a6da36de2d250705914.png)
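As a rough sketch of how an Inception module combines its parallel branches, the snippet below adds up the branch output depths, since the branch outputs are concatenated along the depth dimension. It assumes the figure shows the dimension-reduced module from the Going Deeper with Convolutions paper; the channel counts are those listed for the "inception (3a)" module in that paper's architecture table, and the dictionary itself is only an illustration.

```python
# Output channels of each parallel branch (values for "inception (3a)"
# in the GoogLeNet paper; the dictionary layout is illustrative).
branches = {
    "1x1 conv": 64,
    "1x1 conv -> 3x3 conv": 128,
    "1x1 conv -> 5x5 conv": 32,
    "3x3 max pool -> 1x1 conv": 32,
}

# Every branch keeps the 28x28 spatial size, so the outputs can be
# concatenated along the depth dimension.
height, width = 28, 28
output_depth = sum(branches.values())

print((height, width, output_depth))  # (28, 28, 256)
```

Each branch is an ordinary dense convolution, yet each output channel only depends on one filter size, which is one way to read the "sparse connections, dense computation" goal described above.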


Blog explanation 1: https://ikkison.tistory.com/86   
Blog explanation 2: https://datascienceschool.net/view-notebook/8d34d65bcced42ef84996b5d56321ba9/