# Introduction

**1. Object Classification** - Identifies the "main subject" of the image.

**2. Object Localization** - Predicts and draws a bounding box around an object in an image.

**3. Object Detection** - Finds multiple objects, classifies them, and locates where they are in the image.

![](image/intro.jpeg)

**Why it is difficult**

1. An image can contain a varying number of objects, and we do not know ahead of time how many to expect.
2. Choosing the right crop is not trivial, because an object:
    1. can be at any position in the image.
    2. can have any aspect ratio.
    3. can be of any size.


# General object detection framework

Typically, there are three steps in an object detection framework. 
1. **Object localisation component** 
<br>
A model or algorithm is used to generate regions of interest or region proposals. These region proposals are a large set of bounding boxes spanning the full image.

Some well-known approaches:
* **Selective Search** - A clustering-based approach that groups pixels and generates proposals from the resulting clusters.
* **Region proposal using a deep learning model** - Regions are generated from features that a deep learning model extracts from the image.
* **Brute force** - Similar to a sliding window applied to the image over several ratios and scales. These regions are generated automatically, without taking the image features into account.
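
The brute-force approach can be sketched in a few lines. This is a minimal illustration, not any particular framework's implementation: the function name `sliding_window_boxes`, the definition of `ratio` as width / height, and the scale, ratio and stride values are all assumptions made for the example.

```python
def sliding_window_boxes(img_w, img_h, scales, ratios, stride):
    """Generate (x_min, y_min, x_max, y_max) boxes without looking at image content.

    Assumptions for this sketch: each box has an area of scale * scale pixels,
    and ratio is defined as width / height.
    """
    boxes = []
    for scale in scales:
        for ratio in ratios:
            w = scale * ratio ** 0.5
            h = scale / ratio ** 0.5
            y = 0.0
            while y + h <= img_h:          # slide down the image
                x = 0.0
                while x + w <= img_w:      # slide across the image
                    boxes.append((x, y, x + w, y + h))
                    x += stride
                y += stride
    return boxes


boxes = sliding_window_boxes(300, 300, scales=[100, 200], ratios=[0.5, 1.0, 2.0], stride=50)
print(len(boxes), "candidate regions")
```

Adding scales or ratios, or reducing the stride, quickly multiplies the number of regions that the later stages have to process.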


![](image/anchor_boxes.PNG)


**Things to note:**
* The trade-off in region proposal generation is the number of regions versus the computational cost.
* Use problem-specific information to reduce the number of ROIs (e.g. pedestrians typically have an aspect ratio of approximately 1.5, so it is not useful to generate ROIs with a ratio of 0.25).
![](image/ped_car.JPG)


2. **Object classification component** 
<br>
In the second step, visual features are extracted for each bounding box. Based on these features, it is determined whether an object is present in each proposal and, if so, which one.

**Some well-known approaches:**
* Use pretrained image classification models to extract visual features
* Traditional computer vision (filter-based approaches, histogram methods, etc.)

![](image/cnn_layer.PNG)
![](image/feature_map.PNG)

3. **Non-maximum suppression (NMS)**
<br>
In the final post-processing step, the number of detections in a frame is reduced to the actual number of objects present, so that overlapping boxes for the same object are reduced to a single bounding box.
<br>
NMS techniques are largely standard across detection frameworks, but the step is important and may require hyperparameter tuning depending on the scenario.

Predicted
![](image/NMS_1.svg)

Desired
![](image/NMS_2.svg)
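
A minimal sketch of greedy NMS, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples with one confidence score each. The helper names `iou` and `nms` and the default threshold of 0.5 are assumptions for illustration; the threshold is the hyperparameter that typically needs tuning.

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first and is suppressed
```

In practice NMS is usually applied separately per class, so that overlapping objects of different classes do not suppress each other.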

**Evaluation**
<br>
Evaluation metric used:
* Mean Average Precision (mAP, often reported at a fixed IoU threshold, e.g. mAP@0.5 or mAP@0.25)
    * It is a number from 0 to 100; higher values are typically better.

Steps to calculate mAP:
1. Predict a score for each bounding box (the likelihood of the box containing an object).
2. From the predictions, compute a precision-recall curve (PR curve) for each class by varying the score threshold.
3. Compute the AP for each class as the area under its PR curve, then average the APs over all classes. The result is the mAP.
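
The steps above can be sketched as follows. This is a simplified illustration: it assumes each detection has already been marked as a true or false positive (for example by matching it to a ground-truth box at an IoU threshold of 0.5), and it uses a plain rectangle-rule area under the PR curve. Benchmark toolkits typically use interpolated variants, so their numbers will differ slightly. The function names and the toy data are hypothetical.

```python
def average_precision(scores, is_true_positive, n_ground_truth):
    """Area under the precision-recall curve for one class (no interpolation)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for i in order:  # lowering the score threshold one detection at a time
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / n_ground_truth
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


def mean_average_precision(per_class):
    """per_class maps a class name to (scores, is_true_positive, n_ground_truth)."""
    aps = [average_precision(*args) for args in per_class.values()]
    return 100.0 * sum(aps) / len(aps)


detections = {
    "cat": ([0.9, 0.8, 0.6], [True, False, True], 2),
    "dog": ([0.7], [True], 1),
}
print(round(mean_average_precision(detections), 2))
```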


# Concepts

## Bounding Box Representation

A bounding box is represented by its boundary coordinates:
x_min, y_min, x_max, y_max
![](image/bounding_1.PNG)

But pixel values are next to useless if we don't know the actual dimensions of the image. A better way is to represent all coordinates in fractional form, i.e. as fractions of the image width and height.
![](image/bounding_2.PNG)

1. From boundary coordinates to centre-size coordinates
<br>
Function - **xy_to_cxcy**
<br>
x_min, y_min, x_max, y_max -> c_x, c_y, w, h

2. From centre-size coordinates to boundary coordinates
<br>
Function - **cxcy_to_xy**
<br>
c_x, c_y, w, h -> x_min, y_min, x_max, y_max

![](image/bounding_3.PNG)
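
A minimal sketch of the two conversions on plain tuples. The function names `xy_to_cxcy` and `cxcy_to_xy` follow the text above, but the bodies are an illustration rather than the referenced implementation (which would normally operate on tensors of many boxes). The helper `to_fractional` and the example box are assumptions added for this example.

```python
def to_fractional(box, img_w, img_h):
    """Convert pixel boundary coordinates to fractions of the image size."""
    x_min, y_min, x_max, y_max = box
    return (x_min / img_w, y_min / img_h, x_max / img_w, y_max / img_h)


def xy_to_cxcy(box):
    """(x_min, y_min, x_max, y_max) -> (c_x, c_y, w, h)"""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2, x_max - x_min, y_max - y_min)


def cxcy_to_xy(box):
    """(c_x, c_y, w, h) -> (x_min, y_min, x_max, y_max)"""
    c_x, c_y, w, h = box
    return (c_x - w / 2, c_y - h / 2, c_x + w / 2, c_y + h / 2)


pixel_box = (60, 90, 240, 210)                  # on a 300 x 300 image
frac_box = to_fractional(pixel_box, 300, 300)   # roughly (0.2, 0.3, 0.8, 0.7)
centre_box = xy_to_cxcy(frac_box)               # roughly (0.5, 0.5, 0.6, 0.4)
print(frac_box, centre_box)
```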

3. From centre-size coordinates to offsets relative to a prior box (used in the loss function)
<br>
Function - **cxcy_to_gcxgcy**
<br>
    * **g_c_x, g_c_y** - find the offset with respect to the prior box, and scale by the size of the prior box.
    * **g_w, g_h** - scale by the size of the prior box, and convert to log-space.

4. Decoding the predicted offsets back to centre-size coordinates
<br>
Function - **gcxgcy_to_cxcy**
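
A minimal sketch of the encoding and decoding described above, assuming both the box and the prior are given in centre-size form (c_x, c_y, w, h). The function names follow the text, but the bodies are an illustration on plain tuples. Some implementations additionally multiply the encoded values by constant scaling factors; those are omitted here.

```python
import math


def cxcy_to_gcxgcy(box, prior):
    """Encode a centre-size box as offsets relative to a prior box."""
    c_x, c_y, w, h = box
    p_cx, p_cy, p_w, p_h = prior
    return (
        (c_x - p_cx) / p_w,   # centre offset, scaled by the prior width
        (c_y - p_cy) / p_h,   # centre offset, scaled by the prior height
        math.log(w / p_w),    # size relative to the prior, in log-space
        math.log(h / p_h),
    )


def gcxgcy_to_cxcy(offsets, prior):
    """Decode offsets back to a centre-size box (inverse of cxcy_to_gcxgcy)."""
    g_cx, g_cy, g_w, g_h = offsets
    p_cx, p_cy, p_w, p_h = prior
    return (
        g_cx * p_w + p_cx,
        g_cy * p_h + p_cy,
        math.exp(g_w) * p_w,
        math.exp(g_h) * p_h,
    )


prior = (0.5, 0.5, 0.2, 0.2)
box = (0.55, 0.45, 0.3, 0.1)
offsets = cxcy_to_gcxgcy(box, prior)
print(offsets)
print(gcxgcy_to_cxcy(offsets, prior))  # recovers the original box up to rounding
```

A box identical to its prior encodes to all zeros, which gives the regression targets a convenient centre.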



## IOU (Jaccard Index)
 
To measure how well one box matches another, we compute the IoU (intersection-over-union, also known as the Jaccard index) between the two bounding boxes.

Steps to Calculate:
1. Find Intersection
2. Find Union

Jaccard Overlap = Intersection / Union

![](image/IOU.PNG)

Check out the Excel sheet for the calculations:
![](image/IOU_excel.PNG)
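
The two steps can be sketched for a single pair of boxes as follows, assuming boundary coordinates (x_min, y_min, x_max, y_max). The function name `jaccard_overlap` and the example boxes are assumptions for illustration.

```python
def jaccard_overlap(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Step 1: intersection (zero if the boxes do not overlap).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h

    # Step 2: union = sum of both areas minus the part counted twice.
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


print(jaccard_overlap((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7
```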



# SSD
**2016 - Google**

![](image/SSD.PNG)

Single Shot MultiBox Detector

1. **Single Shot** - Object localization and object classification are both done in a single forward pass of the network.
2. **MultiBox** - The name of a bounding box regression technique developed earlier by Szegedy et al.
3. **Detector** - The network detects objects and classifies the detected objects.

VGG-16 
![](image/vgg_16.PNG)


https://towardsdatascience.com/review-ssd-single-shot-detector-object-detection-851a94607d11

https://towardsdatascience.com/review-retinanet-focal-loss-object-detection-38fba6afabe4

https://lilianweng.github.io/lil-log/


**Modifications to the VGG network:**

![](image/cnn_layer.PNG)

1. **Input image**
300 x 300 instead of 224 x 224.
<br>
2. **3rd pooling layer**
Uses ceil mode. This matters when the dimensions of the preceding feature map are odd rather than even. Here the input is 75 x 75, which is halved to 38 x 38 instead of an inconvenient 37 x 37.
<br>
3. **5th pooling layer**
Changed from a 2 x 2 kernel with stride 2 to a 3 x 3 kernel with stride 1 and padding 1. As a result, it no longer halves the dimensions of the feature map from the preceding convolutional layer.
<br>
4. **Fully connected layers**
fc8, the classification layer, is discarded.
fc6 and fc7 are reworked (using decimation) into convolutional layers conv6 and conv7.
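
The effect of the two pooling changes can be checked with the standard output-size formula for pooling layers. This is a sketch for the arithmetic only: the helper `pool_output_size` is hypothetical, and deep learning libraries may apply extra edge-case rules in ceil mode that do not matter for these sizes.

```python
import math


def pool_output_size(size, kernel, stride, padding=0, ceil_mode=False):
    """Output size of a pooling layer along one spatial dimension."""
    span = (size + 2 * padding - kernel) / stride
    rounded = math.ceil(span) if ceil_mode else math.floor(span)
    return rounded + 1


# 3rd pooling layer: 2 x 2 kernel, stride 2, on a 75 x 75 feature map.
print(pool_output_size(75, kernel=2, stride=2))                  # floor mode
print(pool_output_size(75, kernel=2, stride=2, ceil_mode=True))  # ceil mode

# 5th pooling layer: 3 x 3 kernel, stride 1, padding 1 keeps the size unchanged.
print(pool_output_size(19, kernel=3, stride=1, padding=1))
```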

Rework strategy - reparameterize a fully connected layer into a convolutional layer
<br>
**Things to note**

* For an image of size H, W with I input channels, a fully connected layer of output size N is equivalent to a convolutional layer with kernel size equal to the image size H, W and N output channels.
* fc6 with a flattened input size of 7 * 7 * 512 and an output size of 4096 has parameters of dimensions 4096, 7 * 7 * 512. The equivalent convolutional layer conv6 has a 7, 7 kernel size and 4096 output channels, with reshaped parameters of dimensions 4096, 7, 7, 512.
* fc7 with an input size of 4096 (i.e. the output size of fc6) and an output size of 4096 has parameters of dimensions 4096, 4096. The input could be considered as a 1, 1 image with 4096 input channels. The equivalent convolutional layer conv7 has a 1, 1 kernel size and 4096 output channels, with reshaped parameters of dimensions 4096, 1, 1, 4096.
* These filters are numerous and large, and therefore computationally expensive. Hence we reduce both their number and the size of each filter by subsampling (decimating) the parameters:
    * fc6 -> conv6: 1024 filters of size 3 x 3
    * fc7 -> conv7: 1024 filters of size 1 x 1

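
The shape arithmetic behind decimation can be sketched as follows. Decimating a dimension here means keeping every m-th index along it. The helper `decimated_shape` is hypothetical, and the step values are assumptions chosen so that the results match the filter counts and sizes stated above; dimensions are ordered as in the text (output channels, kernel height, kernel width, input channels).

```python
from math import prod


def decimated_shape(shape, steps):
    """Shape after keeping every steps[d]-th index along dimension d (None keeps all)."""
    return tuple(
        size if step is None else len(range(0, size, step))
        for size, step in zip(shape, steps)
    )


fc6_as_conv = (4096, 7, 7, 512)
fc7_as_conv = (4096, 1, 1, 4096)

conv6 = decimated_shape(fc6_as_conv, (4, 3, 3, None))
# conv7 also drops input channels, because conv6 now outputs fewer channels.
conv7 = decimated_shape(fc7_as_conv, (4, None, None, 4))

print(conv6, prod(fc6_as_conv), "->", prod(conv6), "parameters")
print(conv7, prod(fc7_as_conv), "->", prod(conv7), "parameters")
```
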
**Auxiliary Connection**
<br>
Some more convolutional layers are stacked on top of the base network.
These convolutions provide additional feature maps, each progressively smaller than the last.
<br>
Conv8_1, Conv8_2
<br>
Conv9_1, Conv9_2
<br>
Conv10_1, Conv10_2
<br>
Conv11_1, Conv11_2
<br>
The feature maps of Conv8_2, Conv9_2, Conv10_2 and Conv11_2 will be used for detection.

**Multiple output feature maps used for detection:**

1. Conv4_3 - 38 x 38 x 512
2. Conv7 - 19 x 19 x 1024
3. Conv8_2 - 10 x 10 x 512
4. Conv9_2 - 5 x 5 x 256
5. Conv10_2 - 3 x 3 x 256
6. Conv11_2 - 1 x 1 x 256
  

![](image/anchor_boxes_count.PNG)

**Prediction Convolution**
<br>
Each feature map gets two convolutional layers: one for class prediction and one for localization prediction.

For each prior at each location on each feature map, we want to predict:
1. the offsets (g_c_x, g_c_y, g_w, g_h) for a bounding box.
2. a set of n_classes scores for the bounding box, where n_classes represents the total number of object types (including a background class).

What we do:
<br>
We need two convolutional layers for each feature map:
1. A **localization prediction convolutional layer** with a 3, 3 kernel evaluating at each location (i.e. with padding and stride of 1) with 4 filters for each prior present at the location.

The 4 filters for a prior calculate the four encoded offsets (g_c_x, g_c_y, g_w, g_h) for the bounding box predicted from that prior.

2. A **class prediction convolutional layer** with a 3, 3 kernel evaluating at each location (i.e. with padding and stride of 1) with n_classes filters for each prior present at the location.

The n_classes filters for a prior calculate a set of n_classes scores for that prior.


Let us take one output feature map as an example.

Consider the Conv9_2 output feature map of size 5 x 5 x 256.

1. For localization:
    1. Convolution: 3 x 3 kernel with 24 filters [6 (anchor boxes) x 4 (offsets)]
    2. Output size: 5 x 5 x 24
    3. Reshape to 150 (5 x 5 x 6) x 4

2. For class prediction:
    1. Convolution: 3 x 3 kernel with 126 filters [6 (anchor boxes) x 21 (20 class labels + 1 background)]
    2. Output size: 5 x 5 x 126
    3. Reshape to 150 (5 x 5 x 6) x 21

We do the same for all output feature maps and stack the results together. The output of the prediction convolution module is therefore:
1. locs = 8732 x 4
2. class scores = 8732 x 21
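
The total of 8732 can be reproduced with a short calculation. The feature map sizes come from the list of detection feature maps; the number of priors per location for each map (4, 6, 6, 6, 4, 4) is an assumption taken from the usual SSD300 configuration, since the text only states the value 6 for the 5 x 5 map. The names `feature_maps` and `prediction_shapes` are hypothetical.

```python
n_classes = 21  # 20 class labels + 1 background

# name: (spatial size, priors per location); prior counts are an assumption
feature_maps = {
    "conv4_3": (38, 4),
    "conv7": (19, 6),
    "conv8_2": (10, 6),
    "conv9_2": (5, 6),
    "conv10_2": (3, 4),
    "conv11_2": (1, 4),
}


def prediction_shapes(size, n_priors, n_classes):
    """Return (localization filters, class filters, number of predicted boxes)."""
    loc_filters = n_priors * 4
    cls_filters = n_priors * n_classes
    n_boxes = size * size * n_priors
    return loc_filters, cls_filters, n_boxes


for name, (size, n_priors) in feature_maps.items():
    print(name, prediction_shapes(size, n_priors, n_classes))

total_boxes = sum(prediction_shapes(s, p, n_classes)[2] for s, p in feature_maps.values())
print("locs:", (total_boxes, 4), "class scores:", (total_boxes, n_classes))
```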



# MultiBox Loss

The MultiBox loss is a loss function for object detection.

It is a combination of:
1. A localization loss for the predicted locations of the boxes.
2. A confidence loss for the predicted class scores.
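
As a sketch of how the two terms are combined, the SSD paper writes the overall objective as a weighted sum normalised by the number of matched priors. The notation below follows that paper and is an assumption here, since the text above does not give the formula: $N$ is the number of priors matched to a ground-truth object, and $\alpha$ is a weight on the localization term (unrelated to the $\alpha$ used later for focal loss).

$$
L = \frac{1}{N}\left(L_{conf} + \alpha \, L_{loc}\right)
$$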

# Performance

![](image/map.PNG)

SSD300\* and SSD512\* apply data augmentation for small objects to improve mAP.

Data:
"07": VOC2007 trainval, "07+12": union of VOC2007 and VOC2012 trainval.
"07+12+COCO": first train on COCO trainval35k, then fine-tune on 07+12.

# Issues with One Shot Detectors



## Class Imbalance 

One-stage detectors suffer from an extreme foreground-background class imbalance.

![](image/issue1.PNG)

## Not Focussed on Hard Examples

**Hard samples** - Examples where the difference between the true label and the predicted label is large, resulting in a higher loss.
<br>
**Easy samples** - Examples where the difference between the true label and the predicted label is small, resulting in a lower loss.

Standard cross entropy loss treats hard and easy samples equally. As a result, the many small loss values of easy samples can overwhelm the loss from the rare class.

![](image/ce.PNG)

![](image/ce_graph.PNG)

    The loss from easy examples = 100000 × 0.1 = 10000
    The loss from hard examples = 100 × 2.3 = 230
    The loss from easy examples is about 40× bigger (10000 / 230 ≈ 43).

Thus, CE loss is not a good choice when there is extreme class imbalance.

## Speed vs Accuracy Tradeoff

# Approaches to solve:

**CLASS IMBALANCE**

1. **Sampling heuristics**
<br>
Use a fixed foreground-to-background ratio (1:3).
2. **Online hard example mining (OHEM)**
<br>
Select a small set of anchors (e.g., 256) for each minibatch.
3. **$\alpha$-balanced cross entropy loss**
<br>
Add a weighting factor α for class 1 and 1 - α for class -1.
![](image/alpha_ce.PNG)

**$\alpha$** is set by inverse class frequency or treated as a hyperparameter to be set by cross validation.

Or
<br>
**$\alpha$** is implicitly implemented by selecting a foreground-to-background ratio of 1:3.
<br>
However, the training procedure is still dominated by easily classified background examples.
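
The fixed 1:3 sampling heuristic from item 1 can be sketched as follows. This illustration assumes the hard-negative-mining flavour of the heuristic, in which the background examples that are kept are the ones with the highest loss; the function name `hard_negative_mining` and the toy losses are hypothetical.

```python
def hard_negative_mining(losses, is_positive, neg_pos_ratio=3):
    """Keep all positives and only the hardest negatives, at a fixed ratio."""
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    n_keep = neg_pos_ratio * len(positives)
    # Rank background examples by loss and keep the top n_keep.
    hardest = sorted(negatives, key=lambda i: losses[i], reverse=True)[:n_keep]
    return positives, sorted(hardest)


losses = [2.0, 0.1, 0.9, 0.05, 0.4, 0.02, 0.7]
is_positive = [True, False, False, False, False, False, False]
pos_idx, neg_idx = hard_negative_mining(losses, is_positive)
print(pos_idx, neg_idx)  # one positive, three hardest negatives
```

Only the selected indices contribute to the confidence loss, so the easy background examples no longer dominate through sheer numbers.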

**CLASS IMBALANCE + Not Focussed on Hard Examples** 

1. **Focal Loss**

The loss function is reshaped to down-weight easy examples and thus focus training on hard negatives. A modulating factor $(1 - p_t)^\gamma$ is added to the cross entropy loss, where $\gamma$ is tested over the range [0, 5] in the experiments.
![](image/fl.PNG)

Properties of focal loss:
1. When an example is misclassified and **pt** is small, the modulating factor is near 1 and the loss is unaffected. As **pt → 1**, the factor goes to 0 and the loss for well-classified examples is down-weighted.
2. The focusing parameter $\gamma$ smoothly adjusts the rate at which easy examples are down-weighted. When $\gamma$ = 0, FL is equivalent to CE. When $\gamma$ is increased, the effect of the modulating factor is likewise increased. ($\gamma$ = 2 works best in the experiments.)

For instance, with $\gamma$ = 2, an example classified with pt = 0.9 would have a 100× lower loss compared with CE, and with pt = 0.968 it would have a roughly 1000× lower loss. This in turn increases the importance of correcting misclassified examples.
The loss is scaled down by at most 4× for pt ≤ 0.5 and γ = 2.
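
The down-weighting factors quoted above can be checked numerically. This is a minimal sketch for a single example with probability `p_t` assigned to the true class; the function names are assumptions for illustration.

```python
import math


def cross_entropy(p_t):
    """CE loss for an example whose true class has predicted probability p_t."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Focal loss: cross entropy scaled by the modulating factor (1 - p_t)^gamma."""
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)


def down_weighting(p_t, gamma=2.0):
    """How many times smaller the focal loss is than the cross entropy loss."""
    return cross_entropy(p_t) / focal_loss(p_t, gamma)


for p_t in (0.5, 0.9, 0.968):
    print(p_t, round(down_weighting(p_t)))
```

With `gamma=0` the modulating factor is 1 and `focal_loss` returns the same value as `cross_entropy`.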


**To handle Class Imbalance:**

2. **$\alpha$-Balanced FL**
![](image/alpha_fl.PNG)

  - γ: Focus more on hard examples.
  - α: Offset class imbalance of number of examples. 
    
**From Paper**
   - α is added to the equation, which yields slightly improved accuracy over the form without α.
   - The loss layer combines the sigmoid operation for computing p with the loss computation, resulting in greater numerical stability.
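
A minimal sketch of the α-balanced focal loss for one binary prediction, applied to the toy batch from the cross entropy discussion (100000 easy examples with a CE loss of 0.1 and 100 hard examples with a CE loss of 2.3). It assumes the easy examples are background and the hard ones are foreground. The default α = 0.25 is the value reported in the focal loss paper alongside γ = 2 and is an assumption here, as is the function name.

```python
import math


def alpha_balanced_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """p: predicted foreground probability; y: 1 for foreground, -1 for background."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Probabilities chosen so that the plain CE losses are 0.1 and 2.3.
p_easy_background = 1.0 - math.exp(-0.1)   # background predicted with p_t of about 0.905
p_hard_foreground = math.exp(-2.3)         # foreground predicted with p_t of about 0.100

easy_total = 100000 * alpha_balanced_focal_loss(p_easy_background, y=-1)
hard_total = 100 * alpha_balanced_focal_loss(p_hard_foreground, y=1)
print(easy_total, hard_total, easy_total / hard_total)
```

In this toy batch the easy examples contribute less than twice the loss of the hard examples, compared with about 43 times under plain cross entropy.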

**CLASS IMBALANCE + Not Focussed on Hard Examples + Accuracy** 

1. RetinaNet

# References

https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection
<br>
https://towardsdatascience.com/understanding-2d-dilated-convolution-operation-with-examples-in-numpy-and-tensorflow-with-d376b3972b25
<br>
https://machinethink.net/blog/object-detection/
<br>
https://towardsdatascience.com/going-deep-into-object-detection-bed442d92b34
<br>
https://d2l.ai/chapter_computer-vision/anchor.html
<br>
https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173
<br>
https://medium.com/@jonathan_hui/object-detection-speed-and-accuracy-comparison-faster-r-cnn-r-fcn-ssd-and-yolo-5425656ae359
<br>
https://www.coursera.org/lecture/convolutional-neural-networks/non-max-suppression-dvrjH
<br>
https://lilianweng.github.io/lil-log/2017/10/29/object-recognition-for-dummies-part-1.html
<br>
https://towardsdatascience.com/review-retinanet-focal-loss-object-detection-38fba6afabe4
<br>
https://medium.com/@jonathan_hui/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c
<br>
https://medium.com/@smallfishbigsea/notes-on-focal-loss-and-retinanet-9c614a2367c6
<br>
R-CNN
http://www.telesens.co/2018/03/11/object-detection-and-classification-using-r-cnns/