# Chapter 14 Deep Computer Vision Using Convolutional Neural Networks

#### 1. What are the advantages of a CNN over a fully connected DNN for image classification?

1. CNNs use far fewer parameters than fully connected DNNs when processing images, which makes processing images much more efficient. This is achieved by connecting each neuron only to a small region of the previous layer and by sharing each kernel's weights across every position of the image.
2. Because of how kernels and convolutions work, each layer can combine the low-level features detected by the previous layers into higher-level features, while taking into account the position of the pixels in the image. This matters for images, since these high-level features are what the network uses to analyze them.
3. Once a CNN has learned a kernel that detects a feature, it can detect that feature anywhere in the image, whereas a DNN can only detect a feature in the location where it learned it, so it does not generalize as well.
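
To make the first point concrete, the sketch below compares the parameter count of a dense layer and a convolutional layer applied to the same image. The sizes (a 200x300 RGB image, 100 units or filters, 3x3 kernels) are assumptions picked for illustration only.

```python
# Hypothetical layer sizes, chosen only for illustration.
height, width, channels = 200, 300, 3   # RGB input image
units = 100                             # neurons in the dense layer / filters in the conv layer
kernel = 3                              # 3x3 convolution kernel

# Dense layer: every neuron connects to every input value, plus one bias per neuron.
dense_params = (height * width * channels + 1) * units

# Conv layer: each filter reuses the same kernel weights at every image position.
conv_params = (kernel * kernel * channels + 1) * units

print(f"dense: {dense_params:,} parameters")
print(f"conv:  {conv_params:,} parameters")
```

With these sizes the dense layer needs about 18 million parameters, while the convolutional layer needs 2,800.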

#### 2. Consider a CNN composed of three convolutional layers, each with 3x3 kernels, a stride of 2, and `same` padding. The lowest layer outputs 100 feature maps, the middle one outputs 200, and the top one outputs 400. The input images are RGB images of 200x300 pixels:
##### a. What is the total number of parameters in the CNN?
##### b. If we are using 32-bit floats, at least how much RAM will this network require when making a prediction for a single instance?
##### c. What about when training on a mini-batch of 50 images?

<img src="../../reports/figures/Screenshot%202022-12-16%20at%2017.35.56.png" alt="CNN architecture" style="height: 500px;display: block; margin-left: auto; margin-right: auto;">

a. Each layer has kernel weights and bias terms. The kernel weights have 4 dimensions, `[kernel_height x kernel_width x input_channels x output_channels]`, and the biases have 1 dimension, `[output_channels]`. The first layer therefore has $3*3*3*100 + 100 = 2800$ parameters, the second layer has $3*3*100*200 + 200 = 180200$ parameters, and the third layer has $3*3*200*400 + 400 = 720400$ parameters. The total is $2800 + 180200 + 720400 = 903400$ parameters in the three layers.
b. The memory required by a layer's output is `feature map size * output channels * 32 bits`. With `same` padding and a stride of 2, each layer halves the height and width (rounding up), so the feature maps are 100x150, 50x75, and 25x38. The first layer needs $100*150*100*32 = 48\text{ Mbits} = 6 \text{ MB}$, the second layer $50*75*200*32 = 24\text{ Mbits} = 3 \text{ MB}$, and the third layer $25*38*400*32 = 12.16\text{ Mbits} \approx 1.5 \text{ MB}$. When making a prediction, only the outputs of two consecutive layers need to be in memory at the same time, since a layer's output can be released once the next layer has been computed. We therefore need at least enough memory for the two largest consecutive layers: $6 \text{ MB} + 3 \text{ MB} = 9 \text{ MB}$. The parameters also have to be stored in memory. The network has 903,400 parameters, and each one uses 32 bits (4 bytes), for a total of $903400 * 4 \approx 3.6 \text{ MB}$. The model therefore requires at least $9 \text{ MB} + 3.6 \text{ MB} = 12.6 \text{ MB}$ for inference. Keep in mind that this is only a rough estimate.
c. When training, the outputs of all the layers must be kept in memory because they are needed for the backward pass. The only layer not counted in the previous answer was the third one, which uses about $1.5 \text{ MB}$, so the three layers require $10.5 \text{ MB}$ per image. Multiplying by the batch size gives $10.5 * 50 = 525\text{ MB}$. We also need to account for the input images, which take $200 * 300 * 3 * 32 * 50 = 288 \text{ Mbits} = 36\text{ MB}$. Adding the memory required by the layers ($525\text{ MB}$), the images ($36\text{ MB}$), and the parameters ($3.6\text{ MB}$) gives a total of $564.6\text{ MB}$. Again, this is only a rough estimate for reference.
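
The arithmetic in answers a, b, and c can be checked with a few lines of plain Python. This is a rough sketch under the same simplifying assumptions as the text (1 MB taken as $10^6$ bytes, only layer outputs, inputs, and parameters counted). The helper name `conv_params` is made up for this example.

```python
import math

def conv_params(kernel, in_ch, out_ch):
    """Kernel weights plus one bias per output feature map."""
    return (kernel * kernel * in_ch + 1) * out_ch

height, width, in_ch = 200, 300, 3
bytes_per_float = 4            # 32-bit floats
filters = [100, 200, 400]

total_params = 0
activation_bytes = []          # output size of each layer for a single image
for out_ch in filters:
    total_params += conv_params(3, in_ch, out_ch)
    # "same" padding with a stride of 2 gives ceil(size / 2)
    height, width = math.ceil(height / 2), math.ceil(width / 2)
    activation_bytes.append(height * width * out_ch * bytes_per_float)
    in_ch = out_ch

param_bytes = total_params * bytes_per_float

# Inference: only two consecutive layer outputs are held at once
largest_pair = max(a + b for a, b in zip(activation_bytes, activation_bytes[1:]))
inference_bytes = largest_pair + param_bytes

# Training on 50 images: keep every layer output, plus the input batch
batch = 50
input_bytes = 200 * 300 * 3 * bytes_per_float * batch
training_bytes = sum(activation_bytes) * batch + input_bytes + param_bytes

print(f"parameters: {total_params:,}")
print(f"inference:  {inference_bytes / 1e6:.1f} MB")
print(f"training:   {training_bytes / 1e6:.1f} MB")
```

Because the script keeps the third layer's output unrounded (1.52 MB rather than 1.5 MB), its training figure comes out about 1 MB higher than the hand calculation.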

In [1]:
import tensorflow as tf
from tensorflow import keras 

In [6]:
# Same architecture as in exercise 2: three 3x3 conv layers with stride 2 and "same" padding
cnn_model = keras.Sequential([
    keras.layers.Conv2D(filters=100, kernel_size=3, strides=2, padding="same", activation="relu", input_shape=[200,300,3]),
    keras.layers.Conv2D(filters=200, kernel_size=3, strides=2, padding="same", activation="relu"),
    keras.layers.Conv2D(filters=400, kernel_size=3, strides=2, padding="same", activation="relu"),
])

# input_shape is set on the first layer, so build() needs no arguments here
cnn_model.build()


In [7]:
cnn_model.save('../../models/chapter_14/cnn_model.h5')



#### 3. If your GPU runs out of memory while training a CNN, what are five things you could try to solve the problem?

1. Decrease the number of images in the mini-batch.
2. Remove some convolutional layers.
3. Use 16-bit floats instead of 32-bit floats.
4. Reduce dimensionality by using larger strides in the convolutional layers.
5. Distribute the training across multiple devices.
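
As a rough illustration of how much the first four options can save, the sketch below estimates the activation memory of the three-layer CNN from exercise 2 under each change. The function `activation_mb` is a hypothetical helper written for this example. It counts only layer outputs (1 MB taken as $10^6$ bytes) and does not model distribution across devices.

```python
import math

def activation_mb(batch, bytes_per_float, filters, stride, height=200, width=300):
    """Rough activation memory (MB) for a stack of "same"-padded conv layers."""
    total = 0
    for out_ch in filters:
        height, width = math.ceil(height / stride), math.ceil(width / stride)
        total += height * width * out_ch * bytes_per_float
    return total * batch / 1e6

baseline = activation_mb(batch=50, bytes_per_float=4, filters=[100, 200, 400], stride=2)

options = {
    "smaller mini-batch (25)": activation_mb(25, 4, [100, 200, 400], 2),
    "remove the last layer": activation_mb(50, 4, [100, 200], 2),
    "16-bit floats": activation_mb(50, 2, [100, 200, 400], 2),
    "stride 3 instead of 2": activation_mb(50, 4, [100, 200, 400], 3),
}

print(f"baseline: {baseline:.1f} MB")
for name, mb in options.items():
    print(f"{name}: {mb:.1f} MB")
```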

#### 4. Why would you want to add a max pooling layer rather than a convolutional layer with the same stride?

A max pooling layer has no parameters, whereas a convolutional layer has quite a few, so max pooling is a computationally cheaper way to reduce dimensionality.
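
The sketch below shows why: 2x2 max pooling with a stride of 2 is a fixed operation with nothing to learn, while a strided convolutional layer carries a full set of weights. The 4x4 feature map and the 200-channel layer are made-up values for illustration.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D list (height and width assumed even)."""
    pooled = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            window = (feature_map[i][j], feature_map[i][j + 1],
                      feature_map[i + 1][j], feature_map[i + 1][j + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[4, 2], [2, 8]]

# Parameter cost of halving the resolution of 200 feature maps
channels = 200
pool_params = 0                                   # max pooling learns nothing
conv_params = (3 * 3 * channels + 1) * channels   # 3x3 conv with stride 2, 200 filters
print(pool_params, conv_params)
```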

#### 5. When would you want to add a local response normalization layer?

A local response normalization layer makes the most strongly activated neurons inhibit the neurons at the same location in neighboring feature maps. This pushes different feature maps to specialize and detect different features, which increases the variety of features detected and helps the network generalize better. In that sense it acts as a form of regularization. It is typically added in the lower layers, so that the upper layers have a larger pool of low-level features to build on.
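
A minimal NumPy sketch of the idea follows. Each activation is divided by a term that grows with the squared activations at the same position in nearby feature maps. The parameter names follow `tf.nn.local_response_normalization`, but this loop is only an illustration, and the hyperparameter values in the demo are picked to make the effect visible rather than taken from any model.

```python
import numpy as np

def local_response_normalization(a, depth_radius=2, bias=1.0, alpha=1e-4, beta=0.75):
    """Normalize a (height, width, channels) array across neighboring channels."""
    channels = a.shape[-1]
    b = np.empty_like(a, dtype=float)
    for i in range(channels):
        low = max(0, i - depth_radius)
        high = min(channels - 1, i + depth_radius)
        # Sum of squared activations in the neighboring feature maps
        sq_sum = np.sum(a[..., low:high + 1] ** 2, axis=-1)
        b[..., i] = a[..., i] / (bias + alpha * sq_sum) ** beta
    return b

# One pixel, four feature maps: the first one fires strongly
a = np.array([[[10.0, 1.0, 1.0, 1.0]]])
b = local_response_normalization(a, depth_radius=1, alpha=1.0)

# Channel 1 sits next to the strong activation, channel 3 does not
print(b[0, 0, 1], b[0, 0, 3])
```

Channels 1 and 3 start with the same activation, but channel 1 ends up smaller because it is next to the strongly activated channel 0.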

#### 6. Can you name the main innovations in AlexNet, as compared to LeNet-5? What about the main innovations in GoogLeNet, ResNet, SENet, Xception, and EfficientNet?

The main innovations introduced by each model are the following:

1. AlexNet was the first model to stack convolutional layers directly on top of one another, instead of placing a pooling layer after every convolutional layer. It also included two forms of regularization: dropout in the dense layers and data augmentation.
2. GoogLeNet's main innovation was the inception module, which includes 1x1 convolutional layers. These act as bottleneck layers that capture patterns along the depth (channel) dimension and reduce dimensionality before the 3x3 or 5x5 convolutional layers, which capture more complex spatial patterns.
3. ResNet added residual units, which introduced residual learning through skip connections: the input of the unit is added to its output.
4. SENet added SE blocks to inception modules and residual units. An SE block computes the average value of each feature map, passes these averages through a small network that creates a low-dimensional embedding and outputs one weight per feature map, and then rescales every feature map of the output by its weight.
5. Xception introduced depthwise separable convolutional layers, which look for spatial patterns and cross-channel (depth) patterns separately.
6. EfficientNet introduced the compound scaling method, which scales a network's depth, width, and input resolution jointly in a principled way.
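
The SE block described in item 4 can be sketched in NumPy as follows. This is an illustration under simplifying assumptions, not the SENet implementation: the weights are random, the dense layers have no biases, and the sizes (16 feature maps, a reduction ratio of 4) are hypothetical.

```python
import numpy as np

def se_block(feature_maps, w_squeeze, w_excite):
    """Recalibrate a (height, width, channels) array, one weight per feature map."""
    # Global average pooling: one mean value per feature map
    means = feature_maps.mean(axis=(0, 1))
    # Squeeze: small dense layer with ReLU gives a low-dimensional embedding
    embedding = np.maximum(0.0, means @ w_squeeze)
    # Excite: dense layer with sigmoid gives one weight between 0 and 1 per feature map
    weights = 1.0 / (1.0 + np.exp(-(embedding @ w_excite)))
    # Rescale every feature map by its weight
    return feature_maps * weights, weights

rng = np.random.default_rng(0)
channels, ratio = 16, 4   # hypothetical sizes
x = rng.normal(size=(8, 8, channels))
w_squeeze = rng.normal(size=(channels, channels // ratio))
w_excite = rng.normal(size=(channels // ratio, channels))

recalibrated, weights = se_block(x, w_squeeze, w_excite)
print(recalibrated.shape, weights.shape)
```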

#### 7. What is a fully convolutional network? How can you convert a dense layer into a convolutional layer?

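A fully convolutional network (FCN) is a network composed only of convolutional and pooling layers. This allows it to process images of any size (above a minimum size). To convert a dense layer into a convolutional layer, replace it with a convolutional layer that has as many filters as the dense layer has neurons, a kernel size equal to the size of the layer's input feature maps, `padding="valid"`, normally `strides=1`, and the same activation function. Then copy over the dense layer's weights, reshaped to fit the kernel.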
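
The equivalence can be checked numerically. The NumPy sketch below reshapes the weights of a dense layer into a convolution kernel and compares the two outputs. The layer sizes are hypothetical, the activation is left out (it would be the same on both sides), and `conv_valid` is a naive helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, c, units = 7, 7, 3, 10          # hypothetical sizes

dense_w = rng.normal(size=(h * w * c, units))
dense_b = rng.normal(size=units)

# Copy the dense weights into a kernel of shape (height, width, in_channels, filters)
kernel = dense_w.reshape(h, w, c, units)

def conv_valid(image, kernel, bias):
    """Stride-1 "valid" convolution (cross-correlation, as in deep learning libraries)."""
    kh, kw = kernel.shape[:2]
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w, kernel.shape[-1]))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw, :]
            out[i, j] = np.tensordot(patch, kernel, axes=3) + bias
    return out

x = rng.normal(size=(h, w, c))
dense_out = x.reshape(-1) @ dense_w + dense_b      # shape (units,)
conv_out = conv_valid(x, kernel, dense_b)          # shape (1, 1, units)
print(np.allclose(conv_out[0, 0], dense_out))

# The convolutional version also accepts larger images
larger = rng.normal(size=(9, 12, c))
larger_out = conv_valid(larger, kernel, dense_b)
print(larger_out.shape)
```

On a 7x7 input the convolutional layer produces the same numbers as the dense layer. On the larger 9x12 input it produces a 3x6 grid of outputs, one per position of the 7x7 window, which is what lets an FCN handle images of different sizes.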

#### 8. What is the main technical difficulty of semantic segmentation?

As the image passes through the convolutional layers up to the higher layers, a lot of spatial information is lost, for example in layers with a stride greater than 1. This spatial information has to be restored somehow in order to predict a class for every pixel. A good way to do this is to upsample the output (for example with transposed convolutional layers) and to add skip connections from lower layers, whose outputs retain more spatial information.
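
The sketch below illustrates the shape bookkeeping behind this idea in NumPy. Nearest-neighbour upsampling is used as a simple stand-in for a learned upsampling layer, and the feature map sizes are made up for this example.

```python
import numpy as np

def upsample_2x(feature_map):
    """Nearest-neighbour 2x upsampling of a (height, width, channels) array."""
    return feature_map.repeat(2, axis=0).repeat(2, axis=1)

rng = np.random.default_rng(0)
lower = rng.normal(size=(8, 8, 4))    # output of a lower layer: finer spatial detail
coarse = rng.normal(size=(4, 4, 4))   # output of a higher layer: half the resolution

# Skip connection: add the lower layer's output to the upsampled coarse map
upsampled = upsample_2x(coarse)
restored = upsampled + lower
print(restored.shape)
```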