Some notes to take from deeplearning.ai Convolutional Neural Networks Course
=============================================================================

Foundations of Convolutional Neural Networks
--------------------------------------------
 - Most importantly, know how to compute the output height and width from the input height and width, filter size, stride and padding: $n_{out} = \lfloor \frac{n_{in} + 2 \times padding - filter\_size}{stride} \rfloor + 1$. The depth of each filter always equals the depth of the input, and the depth of the output always equals the number of filters.
<p></p>
 - Pooling has no weights, so it is not counted as a layer in architectures named by layer count, such as VGG16 or ResNet-152.
 - Output depth of pooling operation is same as input depth.
 - Constraint on striding: the value of <p></p>$$ \frac{n_H - filter\_size + 2 \times padding}{stride} $$ <p></p>should be an integer. Otherwise the filter with that stride does not fit evenly and symmetrically across the input feature map.
 - This is the formula for computing $dA$, the gradient of the cost with respect to the layer input, for a certain filter $W_c$ and a given training example:
<p></p>$$ dA += \sum _{h=0} ^{n_H} \sum_{w=0} ^{n_W} W_c \times dZ_{hw} \tag{1}$$<p></p>
Where $W_c$ is a filter and $dZ_{hw}$ is a scalar: the gradient of the cost with respect to the conv layer output $Z$ at the $h$th row and $w$th column (the dot product taken at the $h$th stride down and $w$th stride across). Note that each time we multiply the same filter $W_c$ by a different $dZ_{hw}$ when updating $dA$. We do so because in forward propagation each filter is multiplied elementwise with, and summed over, a different a_slice. Therefore, when computing the backprop for $dA$, we are just adding up the gradients of all the a_slices.
 - This is the formula for computing $dW_c$, the gradient of the loss with respect to one filter:
<p></p>$$ dW_c  += \sum _{h=0} ^{n_H} \sum_{w=0} ^ {n_W} a_{slice} \times dZ_{hw}  \tag{2}$$<p></p>
Where $a_{slice}$ is the slice that was used to generate the activation $Z_{hw}$. Hence, this gives us the gradient for $W$ with respect to that slice. Since the same $W$ is used for every slice, we add up all such gradients to get $dW$.
 - For max pooling, apply a mask so that only the position of the max value receives the backpropagated gradient. For average pooling, distribute the backpropagated gradient evenly over the entire pool_size window, because each input contributes equally to the output.
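
The output-size rule and the striding constraint from the bullets above can be sketched in a few lines of Python. This is only an illustration: the function names `conv_output_shape` and `stride_fits` are made up for these notes and do not come from the course.

```python
def conv_output_shape(n_h, n_w, filter_size, stride=1, padding=0, num_filters=1):
    """Return (height, width, depth) of a conv layer's output.

    Height and width follow floor((n + 2*padding - filter_size) / stride) + 1.
    The output depth is simply the number of filters.
    """
    out_h = (n_h + 2 * padding - filter_size) // stride + 1
    out_w = (n_w + 2 * padding - filter_size) // stride + 1
    return out_h, out_w, num_filters


def stride_fits(n, filter_size, stride, padding=0):
    """True if the filter tiles the padded input with nothing left over."""
    return (n + 2 * padding - filter_size) % stride == 0


# 5x5 "same" convolution (padding 2, stride 1) with 32 filters on a 28x28 input.
same_conv = conv_output_shape(28, 28, filter_size=5, stride=1, padding=2, num_filters=32)

# 3x3 filter with stride 2 on a 7x7 input fits evenly; on an 8x8 input it does not.
fits_7 = stride_fits(7, filter_size=3, stride=2)
fits_8 = stride_fits(8, filter_size=3, stride=2)
```

For pooling, the same height and width rule applies, but the depth stays equal to the input depth.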

Deep convolutional models: case studies
--------------------------------------------
 - Residual nets have two kinds of block: the identity block, used when the height and width are the same before and after the block, and the convolutional block, used when the height and width change across the block.

 - For the identity block, the skip connection does not go through any convolution layer, while for the conv block the skip connection goes through conv filters to reshape the height and width.

 - ResNets work well because the skip connection lets the convnet learn identity mappings when needed: an identity mapping is easy to learn if the learning algorithm deems that a block is not really needed.

 - fast.ai lesson 7 also mentions that, as the ResNet paper shows through some reformulation, the skip connections make each block learn the residual relative to the previous block's output. This has an effect similar to how boosting works.

 - 1x1 convolutions can be used to reduce dimensions if the number of 1x1 filters is less than the depth of the input tensor. This works because the spatial arrangement (the height and width) is retained while the depth is reduced. In the Inception network, 1x1 convolutions are applied before the more expensive 3x3 and 5x5 convolutions, because the reduced depth means fewer computations in the 3x3 and 5x5 convolutions. The 1x1 convolution is sometimes also called the bottleneck layer. ResNet also uses the bottleneck architecture, as shown below. Do note that there's a batch norm layer after each conv layer.
<p></p>
<p align="center">
  <img src="images/bottleneck.png" width="400" height="100">
  <p style="text-align:center">(Source: Resnet Paper)</p>
</p>
 - Using an example: a 28x28x192 input convolved with 32 filters of size 5x5 with same padding gives a 28x28x32 output, and the number of multiplications is 28x28x32x5x5x192 = 120 million.<p></p>
Using 1x1 convolutions to first reduce the depth to 16 before applying the 5x5 convolution with 32 filters gives the same output shape. The number of multiplications is now (1x1x192x28x28x16) + (28x28x32x5x5x16) = 12.4 million, far fewer computations thanks to the bottleneck layer.
 - Inception network uses the idea of letting the network choose the type of convolutions it needs (1x1, 3x3, 5x5) instead of choosing for it.
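
The multiplication counts in the 28x28x192 example above can be checked with a short script. The helper name `conv_multiplications` is hypothetical and counts only multiplications, ignoring additions and biases.

```python
def conv_multiplications(out_h, out_w, num_filters, filter_size, in_depth):
    """Multiplications needed to produce an out_h x out_w x num_filters output."""
    return out_h * out_w * num_filters * filter_size * filter_size * in_depth


# Direct 5x5 convolution: 28x28x192 -> 28x28x32.
direct = conv_multiplications(28, 28, 32, 5, 192)

# Bottleneck: 1x1 convolution down to depth 16, then the 5x5 convolution.
reduce_step = conv_multiplications(28, 28, 16, 1, 192)
expand_step = conv_multiplications(28, 28, 32, 5, 16)
bottleneck = reduce_step + expand_step

print(f"direct:     {direct:,}")      # 120,422,400
print(f"bottleneck: {bottleneck:,}")  # 12,443,648
print(f"saving:     {direct / bottleneck:.1f}x")
```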

Object Detection
--------------------------------------------
 - The first key concept introduced is the convolutional implementation of sliding windows. The gist is that instead of cropping a window at every position of the image and running the network on each crop, we apply the network convolutionally over the whole image once, so that the "sliding" happens over the feature maps and the shared computation is reused. To do this, the fully connected layers are replaced by convolutions: what Yann LeCun calls convolution layers with 1x1 kernels, which serve the same purpose as a fully connected layer (see [here](https://www.facebook.com/yann.lecun/posts/10152820758292143?pnref=story)). The slide below explains it best.
<p></p>
<p align="left">
  <img src="images/convolutional_sliding_window.png" width="600" height="200">
  <p style="text-align:center">(Source: Andrew Ng Coursera CNN Notes)</p>
</p>
<p></p>
This concept is applied in YOLO, Faster-RCNN and SSD as well.
 - For bounding boxes, one common parameterization is x, y, w, h, where x and y are the coordinates of the mid point and w and h are the width and height of the bounding box respectively. The prediction vector is usually
$$\left[\begin{array}
{rrr}
p_c \\
b_x \\
b_y \\
b_h \\
b_w \\
c_1 \\
c_2 \\
c_3
\end{array}\right]
$$
where
 - $p_c$: prediction, 1 if an object is present in the grid cell and 0 if it is not
 - $b_x, b_y, b_h, b_w$: x, y coordinates of the mid point and height/width of the bounding box
 - $c_1, c_2, c_3$: indicators of whether the object belongs to class 1, 2 or 3 <p></p>
If object is not present then it becomes this:
$$\left[\begin{array}
{rrr}
0 \\
? \\
? \\
? \\
? \\
? \\
? \\
?
\end{array}\right]
$$<p></p>
where a question mark means the value is not part of the computations. For YOLO, if no object appears in a grid cell, the loss function computes only the loss on whether there is an object there or not, and skips the other terms.<p></p>
 - IoU (intersection over union), or the Jaccard index, measures how much a predicted bounding box overlaps the ground-truth box: the area of their intersection divided by the area of their union. In some papers, an IoU of > 0.5 is considered a good bounding box.
 - Non-max suppression suppresses multiple bounding boxes that appear for the same object. Usually it is a two-step process:
   1. In the first step, $p_c$ is compared against a threshold (e.g. 0.6). All predicted bounding boxes below that threshold are deleted.
   2. In the second step, start from the prediction with the highest confidence and add it to the final prediction list. For each prediction added, check whether any other predictions have a Jaccard overlap of > 0.5 with it, and discard those. Repeat over the remaining predictions to get the final prediction list.
 - Anchor boxes are used to predict multiple different shapes that may appear at the same location, such as a car overlapped by a person when both are objects of interest. In Faster-RCNN, the anchor boxes are used to constrain the regression for each object to a specific area of the image. This is why, when calculating the loss of proposal regions in Faster-RCNN, only anchor boxes with a high IoU with a ground-truth box (above 0.7 in the paper, or the highest-IoU anchor for that box) are considered for the regression loss. The regression loss involves the predicted coordinates, the anchor box coordinates and the ground-truth coordinates. This is how anchor boxes constrain the regression head to output only certain values. See [this](https://www.quora.com/Why-does-Faster-R-CNN-use-anchor-boxes) and [this](https://stats.stackexchange.com/questions/265875/anchoring-faster-rcnn).
With anchor boxes, the prediction vector becomes:
$$\left[\begin{array}
{rrr}
p_c^1 \\
b_x^1 \\
b_y^1 \\
b_h^1 \\
b_w^1 \\
c_1^1 \\
c_2^1 \\
c_3^1 \\
p_c^2 \\
b_x^2 \\
b_y^2 \\
b_h^2 \\
b_w^2 \\
c_1^2 \\
c_2^2 \\
c_3^2
\end{array}\right]
$$ <p></p>
where the superscript indicates which anchor box the values belong to. As before, if the Jaccard similarity between the ground truth and the default anchor box $i$ is below a certain value, the prediction value $p_c^i$ will be 0 and the rest of the values will not be considered in the loss computations (what Andrew Ng calls "don't care").
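
The IoU measure and the two-step non-max suppression described in this section can be sketched as follows. This is a minimal illustration, not the course's implementation: it assumes boxes are given as corner coordinates `(x1, y1, x2, y2)` rather than mid point plus width and height, and the function names are made up for these notes.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest score first."""
    # Step 1: drop every box whose confidence is below the threshold.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)

    # Step 2: repeatedly keep the most confident box and discard the
    # remaining boxes that overlap it by more than iou_threshold.
    keep = []
    while candidates:
        best = candidates.pop(0)
        keep.append(best)
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = non_max_suppression(boxes, scores)  # box 1 overlaps box 0, box 3 is low confidence
```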

RCNN
--------------------------------------------
See this [post](https://blog.athelas.com/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-34ea83205de4) for greater detail.

 - In RCNN, the first step is to generate proposal regions (bounding boxes) using a method called [Selective Search](https://ivi.fnwi.uva.nl/isis/publications/bibtexbrowser.php?key=UijlingsIJCV2013&bib=all.bib), which proposes regions for classification by looking at the image through windows of different sizes and, for each size, trying to group adjacent pixels by texture, color or intensity to identify objects.
<p align="left">
  <img src="images/RCNN.png" width="400" height="200">
  <p style="text-align:center">(Source: RCNN Paper)</p>
</p>
 - On the final layer of the CNN, an SVM is added to classify the images into different classes using the convolutional features. The algorithm thus follows these 3 steps:
   - Step 1: Generate a set of proposals for bounding boxes
   - Step 2: Run the image regions in the bounding boxes through a pretrained AlexNet and finally an SVM for object classification
   - Step 3: Run the bounding box through a linear regression model to refine its coordinates
<p></p> 
<p align="left">
  <img src="images/RCNN_2.png" width="400" height="200">
</p>

Fast-RCNN
--------------------------------------------
 - RCNN has some disadvantages, including:
   1. Each proposal has to run through a forward pass of the CNN which can be computationally expensive considering the number of proposals generated (can be up to 2000).
   2. Need to train 3 models (CNN, SVM and linear regression model for bounding box).
   
   
 - Hence Fast-RCNN was proposed to solve these issues. Some of the new ideas in Fast-RCNN include:
   1. Region of Interest (RoI) Pooling: RoI pooling allows for just one forward pass of the image instead of ~2000 passes in the original RCNN paper. For this, we project the RoIs produced by the region proposal algorithm directly onto the Conv5 feature map. [Here](https://blog.deepsense.ai/region-of-interest-pooling-explained/) is a link that explains more about RoI pooling.
     <p align="left">
       <img src="images/ROIPool.png" width="400" height="200">
       <p style="text-align:center">(Source: Stanford’s CS231N slides by Fei Fei Li, Andrei Karpathy, and Justin Johnson.)</p>
     </p>
     RoI pooling maps the RoI from the image onto the feature map and then applies max/average pooling so that the output has a fixed spatial extent of $H \times W$ that is consistent for all the RoIs. In the Fast-RCNN paper, RoI pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of subwindows, each of approximate size $\frac{h}{H} \times \frac{w}{W}$, and then max pooling within each subwindow to get the final $H \times W$ feature map. Below is an example.
     <p align="left">
       <img src="images/faster-rcnn-pr012-17-638.jpg" width="400" height="200">
     </p>     
     As you can see the outcome is to map it to a 7x7 feature map through different max pooling strategies for different RoI height and width.
     
   2. Combining the CNN, regressor and classifier into one single network.
   Instead of training multiple models, we place fully connected layers after the RoI feature maps and connect them to (1) a regressor head and (2) a softmax classifier, so that classification and bounding box regression happen in the same model.
     <p align="left">
       <img src="images/1-E_P1vAEbGT4HNYjqMtIz4g.png" width="400" height="200">
     </p>    
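
The RoI max pooling idea described above (split an $h \times w$ window into an $H \times W$ grid and take the max in each subwindow) can be sketched with NumPy. This is a simplified illustration under a few assumptions: the RoI is already given in feature-map coordinates as `(row_start, col_start, row_end, col_end)` with exclusive ends, the subwindow boundaries use one particular rounding choice (floor for the start, ceil for the end), and the name `roi_max_pool` is made up for these notes.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one RoI of a feature map to a fixed output_size grid.

    feature_map: array of shape (rows, cols) or (rows, cols, channels).
    roi: (row_start, col_start, row_end, col_end), ends exclusive.
    """
    r0, c0, r1, c1 = roi
    region = feature_map[r0:r1, c0:c1]
    h, w = region.shape[:2]
    out_h, out_w = output_size

    pooled = np.empty((out_h, out_w) + region.shape[2:], dtype=feature_map.dtype)
    for i in range(out_h):
        # Integer bin edges: floor for the start, ceil for the end,
        # so every subwindow contains at least one row.
        rs = (i * h) // out_h
        re = ((i + 1) * h + out_h - 1) // out_h
        for j in range(out_w):
            cs = (j * w) // out_w
            ce = ((j + 1) * w + out_w - 1) // out_w
            pooled[i, j] = region[rs:re, cs:ce].max(axis=(0, 1))
    return pooled


feature_map = np.arange(36).reshape(6, 6)
pooled_a = roi_max_pool(feature_map, (0, 0, 4, 6))  # 4x6 RoI -> 2x2
pooled_b = roi_max_pool(feature_map, (1, 1, 6, 4))  # 5x3 RoI -> 2x2
```

Both RoIs end up with the same 2x2 output shape even though their sizes differ, which is what lets the fully connected layers that follow accept any RoI.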

Faster-RCNN
--------------------------------------------
One main bottleneck still lies in the RoI proposal algorithm, which considers many shapes and sizes in the original image and hence takes a long time to generate RoIs. To tackle this issue, Faster-RCNN was proposed. The authors observed that the convolutional feature maps used by region-based detectors like Fast-RCNN can also be used for generating region proposals. Hence, they would not need to run Selective Search on the images nor train additional models.
<p align="left">
  <img src="images/0-_nNI03ESXm2P6YXO-.png" width="400" height="200">
</p>  

Faster-RCNN has a Region Proposal Network (RPN) that shares its convolutional layers with a Fast R-CNN network for object detection. In the paper, they use ZF and VGG-16 for the convolutional layers. For VGG-16, after the 13 sharable convolutional layers, they slide a small $n \times n$ window over the feature map output by the last shared convolutional layer. Each window is mapped to a low-dimensional feature (512-d for VGG), followed by a ReLU activation. This is followed by two separate fully connected layers, one for regressing the coordinates of the proposed bounding box and one for the probability of whether there is an object or not.
<p align="left">
  <img src="images/0-n6pZEyvW47nlcdQz-.png" width="400" height="200">
</p>  
For each sliding window position, $k$ anchor boxes with different scales and aspect ratios are used. Hence the regression layer outputs $4k$ values (4 coordinates for each anchor) and the classification layer outputs $2k$ scores (for each anchor, one for the probability that an object exists and one for the probability that it does not).
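
As a rough sketch of what "$k$ anchor boxes with different scales and aspect ratios" means, the snippet below builds the anchors for a single sliding-window position. The 3 scales and 3 aspect ratios (so $k = 9$) are assumed defaults chosen to match the setup commonly quoted from the Faster-RCNN paper, and `make_anchors` is a hypothetical helper, not code from the paper.

```python
import numpy as np


def make_anchors(center_x, center_y,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return a (k, 4) array of anchors as (x1, y1, x2, y2).

    Each anchor has area scale**2 and height/width equal to ratio.
    """
    anchors = []
    for scale in scales:
        area = float(scale) ** 2
        for ratio in ratios:
            width = np.sqrt(area / ratio)
            height = width * ratio
            anchors.append((center_x - width / 2, center_y - height / 2,
                            center_x + width / 2, center_y + height / 2))
    return np.array(anchors)


anchors = make_anchors(center_x=300, center_y=200)
k = len(anchors)
num_regression_outputs = 4 * k      # 4 coordinates per anchor
num_classification_outputs = 2 * k  # object / not object per anchor
```
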
<p align="left">
  <img src="images/loss.png" width="300" height="100">
</p><p></p>
There is a separate loss function for training the RPN, which has two terms: the classification loss on whether an object exists in the proposed region for that anchor box and, if it does, the regression loss of the bounding box. In the Faster-RCNN paper, an anchor box is given a positive label if its Jaccard similarity (IoU) with a ground-truth box is greater than 0.7, or if it is the anchor with the highest IoU for that box; anchors with IoU below 0.3 for all ground-truth boxes are labelled negative.
Finally, after the RPN has been trained, the proposed regions are fed back into the convolutional net and ROI pooling followed by fully connected layers are used to predict class probabilities of the object, similar to how Fast-RCNN works.
<p align="left">
  <img src="images/RCNN-pooling.PNG" width="500" height="200">
</p><p></p>
In the four-step training process, during the first two steps when the RPN and the Fast-RCNN are trained, the convolutional layers are not frozen and their weights can be adjusted. After that, the convolutional layers are frozen and the RPN-specific layers as well as the R-CNN layers are fine-tuned.


Single-Shot Multi-Box Detector (SSD)
--------------------------------------------
This is another object detection algorithm, more similar to YOLO in that it requires training only a single convolutional net, without having to explicitly train an RPN and an RoI-pooling classifier separately. Like the detectors above, SSD uses a pre-existing architecture to extract convolutional features, after which it runs multiple classifiers across different scales using different anchor boxes (or prior boxes). This produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections.
<p align="left">
  <img src="images/ssd_vs_yolo.jpeg" width="800" height="200">
  <p style="text-align:center">(Source: SSD Paper)</p>
</p><p></p>
From the SSD architecture, we can see that the main difference between SSD and YOLO is the use of multiple feature maps to perform classification. This allows it to handle different shapes and sizes across different scales, especially when combined with prior boxes of different aspect ratios and scales.
<p align="left">
  <img src="images/feature_map.png" width="600" height="200">
  <p style="text-align:center">(Source: SSD Paper)</p>
</p><p></p>
At each feature layer, starting from the Conv4_3 layer, it runs a 3x3 convolutional filter over the feature map and, at each location, predicts for each of the $k$ default boxes:
<ul>
  <li>Probability it belongs to a certain class (num_classes)</li>
  <li>x and y offsets to center of default box (2 values)</li>
  <li>width and height scales to width and height of default box (2 values)</li>
</ul>

From this, we have num_classes + 4 outputs per prior box. The default boxes need to be mapped to coordinates on the input image so that they can be compared to the ground truth. A more detailed explanation can be found [here](https://medium.com/towards-data-science/learning-note-single-shot-multibox-detector-with-pytorch-part-2-dd96bdf4f434). Then, prior boxes with a Jaccard overlap of > 0.5 with a ground-truth box are given a positive label for that particular class, and negative labels otherwise.

For losses, the loss formula is as follows:

<p style="text-align: center; font-size: 1.2em">$L(x,c,l,g) = \frac{1}{n}(L_{conf}(x,c)+\alpha L_{loc}(x,l,g))$</p>
<p></p>
There are two parts of this objective function:
<ol>
 <li>The confidence loss: how accurately the model predicts the class of each object.</li>
 <li>The localization loss: how close the bounding boxes the model created are to the ground truth boxes.</li>
</ol>

<p align="left">
  <img src="images/confidence_loss.png" width="600" height="200">
  <p style="text-align:center">(Source: SSD Paper)</p>
</p><p></p>
<h4>Confidence Loss</h4><p></p>
For the confidence loss, because of the high number of prior boxes generated from the different feature maps at different positions, most boxes are negatives. Hard negative mining is therefore applied, so that only the negative samples with the highest confidence loss are used in calculating the confidence loss. The idea is that by training on the hardest samples instead of the many negatives that are easy to classify, the model is pushed to learn non-trivial solutions.
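
A minimal sketch of hard negative mining is shown below. It assumes the per-box confidence losses have already been computed, and it caps the negatives at 3 per positive. That 3:1 cap is an assumption taken from the SSD paper's description rather than from these notes, and `hard_negative_mask` is a hypothetical helper name.

```python
import numpy as np


def hard_negative_mask(conf_loss, is_positive, neg_pos_ratio=3):
    """Boolean mask of the boxes that contribute to the confidence loss.

    Keeps every positive box plus the negatives with the highest loss,
    at most neg_pos_ratio negatives per positive.
    """
    conf_loss = np.asarray(conf_loss, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)

    num_neg = min(neg_pos_ratio * int(is_positive.sum()),
                  int((~is_positive).sum()))

    # Give positives a loss of -inf so they are never picked as negatives.
    neg_loss = np.where(is_positive, -np.inf, conf_loss)
    hardest = np.argsort(-neg_loss)[:num_neg]

    mask = is_positive.copy()
    mask[hardest] = True
    return mask


conf_loss = [0.1, 2.0, 0.5, 3.0, 0.05, 1.0, 0.2, 0.7]
is_positive = [True, False, False, False, False, False, False, False]
mask = hard_negative_mask(conf_loss, is_positive)
# One positive, so only the 3 negatives with the largest loss are kept.
```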

<p align="left">
  <img src="images/localisation_loss.png" width="400" height="200">
  <p style="text-align:center">(Source: SSD Paper)</p>
</p><p></p>
<p align="left">
  <img src="images/smooth_l1_loss.png" width="300" height="200">
  <p style="text-align:center">(Source: Fast-RCNN Paper)</p>
</p><p></p>

<h4>Localization Loss</h4><p></p>
For the localization loss, only positive labels contribute. It measures the difference between the correct and predicted offsets to the center point coordinates, and between the correct and predicted scaled width and height. The absolute difference is then smoothed (smooth L1). Do note that the coordinates and sizes are expressed relative to the default box's height and width, similar to Faster-RCNN.
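
The smooth L1 function from the Fast-RCNN paper, and how it is summed over positive boxes only, can be sketched as follows. The function `localization_loss` is a simplified illustration: it assumes the offsets are already encoded relative to the default boxes and leaves out the $\frac{1}{n}$ and $\alpha$ factors of the full objective.

```python
import numpy as np


def smooth_l1(x):
    """0.5 * x**2 where |x| < 1, and |x| - 0.5 elsewhere."""
    x = np.asarray(x, dtype=float)
    abs_x = np.abs(x)
    return np.where(abs_x < 1.0, 0.5 * x ** 2, abs_x - 0.5)


def localization_loss(pred_offsets, true_offsets, is_positive):
    """Sum of smooth L1 over the 4 offsets of every positive box.

    pred_offsets, true_offsets: arrays of shape (num_boxes, 4).
    is_positive: boolean array of shape (num_boxes,).
    """
    diff = np.asarray(pred_offsets, dtype=float) - np.asarray(true_offsets, dtype=float)
    per_box = smooth_l1(diff).sum(axis=1)
    return float(per_box[np.asarray(is_positive, dtype=bool)].sum())


pred = [[0.5, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0]]
true = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = localization_loss(pred, true, is_positive=[True, False])
# Only the first box is positive, so the large error on the second box is ignored.
```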

<h4>Inference</h4><p></p>

Finally, at inference, because multiple boxes may be output for a single ground truth, we use non-max suppression to restrict the output to a single box per ground truth. First we remove all output boxes with confidence < 0.6, and then apply non-max suppression on the remaining ones, discarding boxes with a Jaccard overlap > 0.45.