# TensorFlow for Deep Learning - Convolutional Neural Networks

**Why not simply use a deep neural network with fully connected layers for image recognition tasks?**

Unfortunately, although this works fine for small images (e.g., MNIST), it breaks down for larger images because of the huge number of parameters it requires. For example, a 100 x 100-pixel image has 10,000 pixels, and if the first layer has just 1,000 neurons (which already severely restricts the amount of information transmitted to the next layer), this means a total of 10 million connections. And that's just the first layer. CNNs solve this problem using partially connected layers and weight sharing.
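The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `dense_layer_parameters` is made up for this example and is not a TensorFlow function.

```python
def dense_layer_parameters(n_inputs, n_neurons):
    """Return (weights, biases) for one fully connected layer.

    Hypothetical helper for illustration only.
    """
    weights = n_inputs * n_neurons  # one connection per (input, neuron) pair
    biases = n_neurons              # one bias term per neuron
    return weights, biases


height, width = 100, 100            # image size used in the text
n_inputs = height * width           # one input per pixel (single channel)
weights, biases = dense_layer_parameters(n_inputs, 1000)

print(f"inputs: {n_inputs:,}")
print(f"connection weights in the first layer: {weights:,}")
print(f"bias terms in the first layer: {biases:,}")
```

The count grows with the product of the image size and the layer width, which is why fully connected first layers become impractical for larger images.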

## Convolutional Layers

<img src="images/TF_CNN1.png" width="400"/>
<img src="images/TF_CNN2.png" width="348"/>
<img src="images/TF_CNN3.png" width="373"/>

**Local receptive fields**: the most important building block of a CNN is the _convolutional layer_: neurons in the first convolutional layer are not connected to every single pixel in the input image (as they were in the fully connected layers of a DNN), but only to pixels in their receptive fields. In turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer. This architecture allows the network to concentrate on small low-level features in the first hidden layer, then assemble them into larger higher-level features in the next hidden layer, and so on. This hierarchical structure is common in real-world images, which is one of the reasons why CNNs work so well for image recognition.

**Zero Padding**: a neuron located in row $i$, column $j$ of a given layer is connected to the outputs of the neurons in the previous layer located in rows $i$ to $i + f_h - 1$, columns $j$ to $j + f_w - 1$, where $f_h$ and $f_w$ are the height and width of the receptive field. In order for a layer to have the same height and width as the previous layer, it is common to add zeros around the inputs. This is called zero padding.
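The index ranges and the effect of zero padding can be sketched in plain Python. The function names below are hypothetical helpers, not TensorFlow calls, and the sketch assumes a stride of 1 and indices that refer to the zero-padded input.

```python
def receptive_field(i, j, fh, fw):
    """Inclusive (row range, column range) seen by the neuron at (i, j).

    Assumes stride 1 and indices into the zero-padded input.
    """
    return (i, i + fh - 1), (j, j + fw - 1)


def same_padding(f):
    """Zeros to add (before, after) along one dimension so that a
    stride-1 layer keeps the size of its input."""
    total = f - 1
    before = total // 2
    return before, total - before


def output_size(n, f, pad_total):
    """Number of receptive-field positions along one dimension, stride 1."""
    return n + pad_total - f + 1


rows, cols = receptive_field(2, 3, fh=3, fw=3)
print("neuron (2, 3) sees rows", rows, "and columns", cols)

pad = same_padding(3)
print("padding for a 3-wide field:", pad)
print("output size without padding:", output_size(5, 3, 0))
print("output size with padding:", output_size(5, 3, sum(pad)))
```

Without padding, a 3-wide receptive field shrinks a dimension of 5 down to 3; adding one zero on each side restores it to 5.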

**Stride**: it is possible to connect a large input layer to a much smaller layer by spacing out the receptive fields. This dramatically reduces the model's computational complexity. The shift from one receptive field to the next is called the stride. For example, a 5x7 input layer (plus zero padding) can be connected to a 3x4 layer using 3x3 receptive fields and a stride of 2.
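The 5x7 to 3x4 example can be reproduced with a short calculation. This is a sketch under one assumption: that the zero padding follows the convention TensorFlow calls `"SAME"`, where the output size along a dimension is the input size divided by the stride, rounded up. The helper names are hypothetical.

```python
import math


def strided_output_size(n, stride):
    """Output size along one dimension, assuming "SAME"-style padding."""
    return math.ceil(n / stride)


def padding_needed(n, f, stride):
    """Total zeros to add along one dimension so every receptive field fits."""
    out = strided_output_size(n, stride)
    return max((out - 1) * stride + f - n, 0)


def strided_receptive_field(i, j, fh, fw, sh, sw):
    """Inclusive (row range, column range) in the padded input for neuron (i, j)."""
    return (i * sh, i * sh + fh - 1), (j * sw, j * sw + fw - 1)


in_h, in_w = 5, 7      # input layer from the text
fh = fw = 3            # 3x3 receptive fields
sh = sw = 2            # stride of 2 in both directions

out_h = strided_output_size(in_h, sh)
out_w = strided_output_size(in_w, sw)
print("output layer:", out_h, "x", out_w)
print("padding rows/cols:", padding_needed(in_h, fh, sh), padding_needed(in_w, fw, sw))
print("last neuron sees:", strided_receptive_field(out_h - 1, out_w - 1, fh, fw, sh, sw))
```

With two rows and two columns of zeros added, the padded input is 7x9, and the receptive field of the last neuron ends exactly at its bottom-right corner.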

<img src="images/TF_CNN4.png" width="430"/>
<img src="images/TF_CNN5.png" width="320"/>

**Filters**: a neuron's weights can be represented as a small image the size of the receptive field. For example, consider two possible sets of weights, called filters (or convolution kernels). The first one is represented as a black square with a vertical white line in the middle (it is a 7x7 matrix full of 0s except for the central column, which is full of 1s); neurons using these weights will ignore everything in their receptive field except for the central vertical line (since all inputs will get multiplied by 0, except for the ones located in the central vertical line). The second filter is a black square with a horizontal white line in the middle. Once again, neurons using these weights will ignore everything in their receptive field except for the central horizontal line.

Now if all neurons in a layer use the same vertical line filter (and the same bias term), and you feed the network the input image, the layer will output the top-left image. Notice that the vertical white lines get enhanced while the rest gets blurred. Similarly, the upper-right image is what you get if all neurons use the same horizontal line filter; notice that the horizontal white lines get enhanced while the rest is blurred out. Thus, a layer full of neurons using the same filter outputs a feature map, which highlights the areas in an image that activate the filter the most. Of course, _you do not have to define the filters manually_: instead, during training the convolutional layer will automatically learn the most useful filters for its task, and the layers above will learn to combine them into more complex patterns.
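The effect of the two line filters can be illustrated on a tiny synthetic image. This is a minimal pure-Python sketch, not TensorFlow code: it uses hypothetical helper names, 3x3 filters instead of the 7x7 ones described above, a made-up 5x5 image, stride 1, zero padding, and no activation function.

```python
def make_line_filter(size, vertical=True):
    """Square filter of 0s with a central line of 1s (vertical or horizontal)."""
    mid = size // 2
    if vertical:
        return [[1.0 if c == mid else 0.0 for c in range(size)] for _ in range(size)]
    return [[1.0 if r == mid else 0.0 for _ in range(size)] for r in range(size)]


def feature_map(image, kernel, bias=0.0):
    """Apply one shared filter at every position (stride 1, zero padding)."""
    f = len(kernel)
    pad = f // 2
    h, w = len(image), len(image[0])
    # Surround the image with zeros so the output keeps the input size.
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            padded[r + pad][c + pad] = image[r][c]
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            total = bias  # every neuron shares the same weights and bias
            for u in range(f):
                for v in range(f):
                    total += padded[i + u][j + v] * kernel[u][v]
            row.append(total)
        out.append(row)
    return out


# Made-up image: a vertical white line in column 1, a horizontal one in row 3.
image = [[1.0 if (c == 1 or r == 3) else 0.0 for c in range(5)] for r in range(5)]

vertical_map = feature_map(image, make_line_filter(3, vertical=True))
horizontal_map = feature_map(image, make_line_filter(3, vertical=False))

for name, fmap in (("vertical", vertical_map), ("horizontal", horizontal_map)):
    print(name, "filter feature map:")
    for row in fmap:
        print("  ", row)
```

In each feature map the strongest responses lie along the line whose orientation matches the filter, while the other line produces only a weak response.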

**Stacking Multiple Feature Maps**: usually a convolutional layer has multiple filters (you decide how many) and outputs one feature map per filter, so it is more accurately represented in higher dimensions. It has one neuron per pixel in each feature map, and all neurons within a given feature map share the same parameters (i.e., the same weights and bias term, for a total of $f_h \times f_w \times c_{in} + 1$ parameters per filter, where $f_h \times f_w$ is the size of the receptive field and $c_{in}$ is the number of feature maps in the previous layer). Neurons in different feature maps use different parameters. A neuron's receptive field is the same as described earlier, but it extends across all the previous layer's feature maps. In short, a convolutional layer simultaneously applies multiple trainable filters to its inputs, making it capable of detecting multiple features anywhere in its inputs.

- _CNN vs DNN_ : the fact that all neurons in a feature map share the same parameters dramatically reduces the number of parameters in the model. Once the CNN has learned to recognize a pattern in one location, it can recognize it in any other location. In contrast, once a regular DNN has learned to recognize a pattern in one location, it can recognize it only in that particular location.
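To get a feel for the savings from weight sharing, the parameter counts can be compared directly. The sizes below (a 100x100 RGB input, 3x3 filters, 32 feature maps) are assumptions chosen for illustration, and both helper functions are hypothetical.

```python
def conv_layer_parameters(fh, fw, c_in, n_filters):
    """Each filter has fh * fw * c_in weights plus one bias, shared by
    every neuron in its feature map."""
    return (fh * fw * c_in + 1) * n_filters


def dense_layer_parameters(n_inputs, n_neurons):
    """Each neuron has its own weight per input plus its own bias."""
    return (n_inputs + 1) * n_neurons


height, width, c_in = 100, 100, 3   # assumed input: 100x100 RGB image
n_filters = 32                      # assumed number of feature maps

conv_params = conv_layer_parameters(3, 3, c_in, n_filters)

# A dense layer with as many outputs as the convolutional layer
# (stride 1 with zero padding keeps the 100x100 spatial size).
n_outputs = height * width * n_filters
dense_params = dense_layer_parameters(height * width * c_in, n_outputs)

print(f"convolutional layer parameters: {conv_params:,}")
print(f"equivalent dense layer parameters: {dense_params:,}")
```

The convolutional layer's parameter count does not depend on the image size at all, only on the filter size and the number of input and output feature maps.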

<img src="images/TF_CNN6.png" width="400"/>
