# Chapter: Object Detection Frameworks

- Topic-1 Naive Deep learning approach for object detection
- Topic-2 Problems with naive approach and its solution
- Topic-3 Detection frameworks
  - 3.1 two stage
  - 3.2 one stage
  - 3.3 Advantages and disadvantages of two stage and one stage frameworks

## 1. Naive Deep Learning approach for object detection

In the first section we saw how object detection is performed using a sliding window (defined by a `kernel_size` and strides), image pyramids, a HOG descriptor and an SVM classifier. That approach can be divided into two parts.
- Part-1: the sliding window (`kernel_size`, strides) and image pyramids, which together extract different parts of the image.
- Part-2: the HOG descriptor + SVM classifier, which vectorize and classify the object in each part.


Let us start with Part-2.


### 1.2 Part-2
We can replace the HOG descriptor and the SVM classifier with a deep learning model. The model takes a patch of the image as input and predicts which object is present in that patch.  
![Naive object detection](../images/naive_dl_od.png)

Usually the deep learning architecture is an ImageNet-pretrained VGG, ResNet or any other classification architecture. One problem with this naive approach is that the sliding window and image pyramids generate thousands of proposals, and sending every proposal through the network takes a lot of time, which increases the inference time. Deep learning researchers have found smart ways around this bottleneck, and we will discuss them in detail later.
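To get a feel for the number of proposals, the sketch below counts the patches produced by a sliding window over an image pyramid. The window size, stride and pyramid scales are illustrative assumptions, not values taken from this chapter.

```python
def count_windows(height, width, kernel_size, stride):
    """Number of sliding-window positions on an image of the given size."""
    if height < kernel_size or width < kernel_size:
        return 0
    rows = (height - kernel_size) // stride + 1
    cols = (width - kernel_size) // stride + 1
    return rows * cols


# Assumed settings, for illustration only
image_size = 800
kernel_size = 64
stride = 16
pyramid_scales = [1.0, 0.75, 0.5, 0.375, 0.25]

total = 0
for scale in pyramid_scales:
    side = int(image_size * scale)
    n = count_windows(side, side, kernel_size, stride)
    total += n
    print(f"scale {scale}: {side} x {side} image -> {n} windows")

# Every one of these patches needs its own forward pass through the classifier
print("total patches sent to the classifier:", total)
```

Each patch is a separate forward pass, which is why the inference time grows so quickly with this approach.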


### 1.3 Part-1
Image pyramids increase the inference time by up to ~5x if we use 5 scales. Objects also have different aspect ratios: humans are taller than they are wide, while cars and aeroplanes are wider than they are tall. Using a simple sliding window with a single kernel_size might not be the best solution, so researchers introduced a new tool called **Anchor boxes**.  
Anchor boxes are similar to a sliding window, but they do not have a fixed height and width. They are defined using base_size, anchor_ratios and anchor_scales. Let's say we have an image of size (800 x 800) and we divide it into cells using a sub-sampling ratio of 32. In this simple case we have 625 (25 x 25) anchor boxes, one (32 x 32) box per cell.

![Anchor box version 1](gifs/naive_01.gif "anchor_box")

You can generate the above GIF using the following code.
```python
## Load the libraries
import os

import numpy as np
import imageio
import cv2

## Blank 800 x 800 image (height x width); uint8 keeps the 625 frames small in memory
img = np.zeros((800, 800, 3), dtype=np.uint8)

sub_sample = 32  # each (32, 32) cell of the image is one window

## Number of cells along each side of the image
fe_size = 800 // sub_sample

## Right edge (x_max) and bottom edge (y_max) of each base anchor box
ctr_x = np.arange(sub_sample, (fe_size + 1) * sub_sample, sub_sample)
ctr_y = np.arange(sub_sample, (fe_size + 1) * sub_sample, sub_sample)

## Center (y_ctr, x_ctr) of each anchor box, stored row by row
ctr = np.zeros((fe_size * fe_size, 2))
index = 0
for y in range(len(ctr_y)):
    for x in range(len(ctr_x)):
        ctr[index, 1] = ctr_x[x] - 16
        ctr[index, 0] = ctr_y[y] - 16
        index += 1

## (y1, x1, y2, x2) of each anchor box
anchor_boxes = np.concatenate([ctr - 16, ctr + 16], axis=1)
print(anchor_boxes.shape)  # (625, 4)

## Draw one bounding box per frame
imgs = []
for box in anchor_boxes:
    y1, x1, y2, x2 = [int(i) for i in box]
    clone = img.copy()
    cv2.rectangle(clone, (x1, y1), (x2, y2), (0, 255, 0), 2)
    imgs.append(clone)

## Generate the GIF (create the output folder if it does not exist)
os.makedirs('gifs', exist_ok=True)
imageio.mimsave('gifs/naive_01.gif', imgs)
```

### 1.4 Anchor scales and ratios
Objects come in different shapes and sizes, so a single standard anchor box of size (32 x 32) might not be sufficient. At each location there might be objects of different aspect ratios, and there might be a small, a medium or a large object. To handle this, we use

```python
anchor_ratios = [0.5, 1, 2]
anchor_scales = [8, 16, 32]
```

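With the height and width formulas used in the code below, the anchor ratio is height divided by width. An anchor ratio of 0.5 therefore gives a box whose width is greater than its height, useful for detecting objects like cars and aeroplanes. An anchor ratio of 2 gives a box whose height is greater than its width, useful for detecting objects like humans.  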

Anchor scales bring size variation into the window: small objects need a small window and large objects need a large window. We will use anchor scales of 8, 16 and 32, with 8 responsible for small objects, 16 for medium objects and 32 for large objects. With a base_size of 16, these scales correspond to boxes of about 128, 256 and 512 pixels per side (exactly so for a ratio of 1).
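A small sketch of the box shapes these settings produce, using the same height and width formulas as the anchor-generation code in this section (with `base_size = 16`). With these formulas the ratio equals height / width, and the area of the box depends only on the scale.

```python
import numpy as np

base_size = 16
anchor_ratios = [0.5, 1, 2]
anchor_scales = [8, 16, 32]

shapes = []
for ratio in anchor_ratios:
    for scale in anchor_scales:
        h = base_size * scale * np.sqrt(ratio)
        w = base_size * scale * np.sqrt(1.0 / ratio)
        shapes.append((ratio, scale, h, w))
        print(f"ratio={ratio}, scale={scale}: h={h:.1f}, w={w:.1f}, area={h * w:.0f}")
```

Keeping the area fixed per scale means the three ratios at a given scale cover objects of similar size but different shape.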

In total there will be 9 anchor boxes (3 ratios x 3 scales) at each anchor location. We can generate the anchors at each anchor location using the following piece of code.

```python
anchors = np.zeros(((fe_size * fe_size * 9), 4))
base_size = 16

anchor_ratios = [0.5, 1, 2]
anchor_scales = [8, 16, 32]

index = 0
for c in ctr:
    ctr_y, ctr_x = c
    for i in range(len(anchor_ratios)):
        for j in range(len(anchor_scales)):
            # Height and width of this anchor: ratio = h / w, area = (base_size * scale) ** 2
            h = base_size * anchor_scales[j] * np.sqrt(anchor_ratios[i])
            w = base_size * anchor_scales[j] * np.sqrt(1. / anchor_ratios[i])

            # Store the box as (y1, x1, y2, x2) around the center
            anchors[index, 0] = ctr_y - h / 2.
            anchors[index, 1] = ctr_x - w / 2.
            anchors[index, 2] = ctr_y + h / 2.
            anchors[index, 3] = ctr_x + w / 2.
            index += 1

print(anchors.shape)
# (5625, 4)
```

There are 5625 anchor boxes in total. Some of them extend beyond the image boundary, since a large box centered near an edge cannot fit inside the image. Let's ignore all those boxes.

```python
valid_anchors = anchors[np.where((anchors[:, 0] >= 0) &
        (anchors[:, 1] >= 0) &
        (anchors[:, 2] <= 800) &
        (anchors[:, 3] <= 800))[0]]
print(valid_anchors.shape)
# 2257, 4
```
![Anchor box version 2](gifs/naive_02.gif "anchor_box2")
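To see which anchors survive the boundary check, the sketch below recounts the valid anchors per scale. It is a self-contained re-derivation that uses the same centers and the same height and width formulas as this section; the per-axis counting is my own shortcut, not the chapter's code.

```python
import numpy as np

image_size = 800
sub_sample = 32
base_size = 16
anchor_ratios = [0.5, 1, 2]
anchor_scales = [8, 16, 32]

# Anchor centers: 16, 48, ..., 784 along each axis
centers = np.arange(sub_sample // 2, image_size, sub_sample)

valid_per_scale = {}
for scale in anchor_scales:
    count = 0
    for ratio in anchor_ratios:
        h = base_size * scale * np.sqrt(ratio)
        w = base_size * scale * np.sqrt(1.0 / ratio)
        # A box fits if its center is at least half its size away from each edge
        fits_y = np.sum((centers - h / 2 >= 0) & (centers + h / 2 <= image_size))
        fits_x = np.sum((centers - w / 2 >= 0) & (centers + w / 2 <= image_size))
        count += int(fits_y * fits_x)
    valid_per_scale[scale] = count

print(valid_per_scale)
print(sum(valid_per_scale.values()))
```

Most of the discarded anchors are the large ones: a 512-pixel box only fits when its center is close to the middle of the 800 x 800 image.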

### 1.5 Part-1 and Part-2 combined
We have now seen how anchor boxes are generated across an image and how they cover objects of different sizes and shapes at different locations. Part-1 and Part-2 can be combined in the following way. At each sliding window location we have 9 anchor boxes. Each of these 9 anchor boxes goes through the network and a prediction is made. Once predictions have been made for all sliding window locations, we apply non-maximum suppression to remove redundant bounding boxes and finalize the detections.

![Deep learning using object detection](../images/naive_03.png)
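The non-maximum suppression step can be sketched as below. This is a minimal greedy version written for illustration; the function name `nms`, the IoU threshold of 0.5 and the example boxes are assumptions, not values from this chapter.

```python
import numpy as np


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) array.
    Returns the indices of the boxes that are kept, best score first.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Intersection of the best box with every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        overlap = inter / (areas[best] + areas[rest] - inter)
        # Drop the boxes that overlap the best box too much
        order = rest[overlap <= iou_threshold]
    return keep


# Hypothetical detections: two overlapping boxes and one separate box
boxes = [[100, 100, 200, 200], [110, 110, 210, 210], [400, 400, 500, 500]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed and only the first and third boxes remain.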


### 1.6 How do we detect the specific location of the object?
So far we have only done classification, so the detection is still an approximation. We are not predicting exactly where an object is; we are only finding the window that contains it. To tell accurately where an object is, we also need a regression layer. Let's see how this regression layer works. In the above example we have 2257 valid anchor boxes. If we compute the IoU of each one with the ground truth objects in the image, only a small number will be positive (IoU > 0.5, i.e. they contain the object) and the rest will be negative. Suppose there is an object at [200, 300, 400, 500] in the 800 x 800 image. Let's see how many anchor boxes are positive.

```python
gt_box = [200, 300, 400, 500]

def iou(bbox1, bbox2):
    """
    Gives the IoU between two bounding boxes.
    Both boxes must use the same corner order.
    :param bbox1: (x1, y1, x2, y2)
    :param bbox2: (x1, y1, x2, y2)
    """
    xA = max(bbox1[0], bbox2[0])
    yA = max(bbox1[1], bbox2[1])
    xB = min(bbox1[2], bbox2[2])
    yB = min(bbox1[3], bbox2[3])
    intersection = max(0, xB - xA) * max(0, yB - yA)
    box1_a = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    box2_a = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])

    # Union = sum of both areas minus the part counted twice
    return intersection / float(box1_a + box2_a - intersection)


ious = []
for x in valid_anchors:
    iou_ = iou(x, gt_box)
    ious.append(iou_)
    
pos_ious = len([i for i in ious if i> 0.5])
print(pos_ious) 
## 19

max_ious = max(ious)
print(max_ious)
# 0.61
```

There are only 19 positive anchor boxes, and the maximum IoU with the gt_box is only 0.61. So even if our algorithm picks the best anchor box, the detection will overlap the object with an IoU of only 0.61. This is the problem with using only a classification layer. To fix it, researchers started using **Regression layers**, which predict the offsets of the object with respect to the anchor box.

![architecture_0](../images/arch_0.png)

In the architecture above we have 2257 valid anchors, of which only 19 are positive. While training the deep learning model, we use [0, 0, 0, 0, 0] as the target for negative anchor boxes and [1, x_offset, y_offset, gt_h/a_h, gt_w/a_w] as the target for positive anchor boxes. In this way the deep learning model not only tells whether there is a cow (object) or not, but also gives the location of the object with respect to the anchor box.
As shown in the image above:
- x_offset is the horizontal distance between the anchor box center and the object center.
- y_offset is the vertical distance between the anchor box center and the object center.
- a_h is the anchor height, gt_h is the ground truth height, and gt_h/a_h is the height of the ground truth relative to the anchor height.
- a_w is the anchor width, gt_w is the ground truth width, and gt_w/a_w is the width of the ground truth relative to the anchor width.
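The target vector described above can be sketched as follows. The function name `encode_target`, the sign convention (ground truth center minus anchor center) and the example anchor are assumptions made for illustration; boxes are taken as (x1, y1, x2, y2).

```python
import numpy as np


def encode_target(anchor, gt_box, iou_value, pos_threshold=0.5):
    """Build the 5-dim training target for one anchor.

    Returns [1, x_offset, y_offset, gt_h / a_h, gt_w / a_w] for a positive
    anchor and [0, 0, 0, 0, 0] for a negative one.
    """
    if iou_value <= pos_threshold:
        return np.zeros(5)
    a_w, a_h = anchor[2] - anchor[0], anchor[3] - anchor[1]
    gt_w, gt_h = gt_box[2] - gt_box[0], gt_box[3] - gt_box[1]
    # Box centers
    a_cx, a_cy = anchor[0] + a_w / 2, anchor[1] + a_h / 2
    gt_cx, gt_cy = gt_box[0] + gt_w / 2, gt_box[1] + gt_h / 2
    # Assumed convention: offset = ground truth center - anchor center
    x_offset = gt_cx - a_cx
    y_offset = gt_cy - a_cy
    return np.array([1.0, x_offset, y_offset, gt_h / a_h, gt_w / a_w])


gt_box = [200, 300, 400, 500]
anchor = [176, 272, 432, 528]          # a 256 x 256 anchor that fully contains gt_box
iou_value = (200 * 200) / (256 * 256)  # gt_box lies inside the anchor, so IoU = area ratio
print(encode_target(anchor, gt_box, iou_value))  # positive anchor
print(encode_target(anchor, gt_box, 0.3))        # negative anchor
```

Practical detectors such as Faster RCNN additionally normalise the offsets by the anchor size and take the logarithm of the size ratios, which keeps the regression targets in a similar range for small and large anchors.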

### 1.7 What do we do when we have multiple classes?
Suppose you want to find the location not only of cows but also of humans. Extending the above framework is simple. For a single class we have a 5-dim vector [cow_present_or_not, x, y, h, w] (where x, y, h, w are offsets with respect to the anchor box). We extend this 5-dim vector to 7 dims:
- object present or not (1 output)
- which object is present (2 outputs here; N outputs if there are N classes)
- location of the object (4 outputs).


So if there are N classes, we have an output vector of N+1+4 values. In the above case there are 2 classes (cow and human), so we have a 2+1+4 = 7 dim vector. This is how object detection was performed in the early days of deep learning. Next we will discuss the problems with this approach and then modify our pipeline to address them.
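A minimal sketch of the N+1+4 target for the two-class case. The one-hot encoding of the class, the helper name `build_target` and the example offsets are assumptions for illustration.

```python
import numpy as np

classes = ["cow", "human"]  # N = 2


def build_target(class_name, offsets):
    """Layout: [object_present, one-hot class (N values), 4 location values]."""
    target = np.zeros(1 + len(classes) + 4)
    if class_name is None:  # negative anchor: everything stays zero
        return target
    target[0] = 1.0
    target[1 + classes.index(class_name)] = 1.0
    target[1 + len(classes):] = offsets
    return target


# Hypothetical positive anchor that matches a human
print(build_target("human", [-4.0, 0.0, 0.78, 0.78]))
# Negative anchor
print(build_target(None, None))
```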

## Task
Using the params below, answer the following question.

- image_size = (800, 800)
- anchor_ratios = [0.5, 1, 2]
- anchor_scales = [4, 8, 16, 32, 64]
- base_size = 16
- sub_sample = 32


1) How many valid anchors can be generated using the above params?

Instructions:
1) The feature size is image_size / sub_sample
2) Generate the centers using sub_sample and the feature size
3) For height use base_size * anchor_scales * np.sqrt(anchor_ratios)
4) For width use base_size * anchor_scales * np.sqrt(1./ anchor_ratios)


Ans) (3844, 4)

Solution
```python
import numpy as np

image_size = (800, 800)
anchor_ratios = [0.5, 1, 2]
anchor_scales = [4, 8, 16, 32, 64]
base_size = 16
sub_sample = 32

## Total number of anchors per location
num_anchors = len(anchor_ratios) * len(anchor_scales)
fe_size = image_size[0] // sub_sample
anchors = np.zeros(((fe_size * fe_size * num_anchors), 4))

## Right edge (x_max) and bottom edge (y_max) of each base anchor box
ctr_x = np.arange(sub_sample, (fe_size + 1) * sub_sample, sub_sample)
ctr_y = np.arange(sub_sample, (fe_size + 1) * sub_sample, sub_sample)

## Center (y_ctr, x_ctr) of each anchor box
ctr = np.zeros((fe_size * fe_size, 2))
index = 0
for x in range(len(ctr_x)):
    for y in range(len(ctr_y)):
        ctr[index, 1] = ctr_x[x] - sub_sample // 2
        ctr[index, 0] = ctr_y[y] - sub_sample // 2
        index += 1

## Generating 15 anchors (y1, x1, y2, x2) at each center
index = 0
for c in ctr:
    y_c, x_c = c
    for i in range(len(anchor_ratios)):
        for j in range(len(anchor_scales)):
            h = base_size * anchor_scales[j] * np.sqrt(anchor_ratios[i])
            w = base_size * anchor_scales[j] * np.sqrt(1. / anchor_ratios[i])

            anchors[index, 0] = y_c - h / 2.
            anchors[index, 1] = x_c - w / 2.
            anchors[index, 2] = y_c + h / 2.
            anchors[index, 3] = x_c + w / 2.
            index += 1
print(anchors.shape)  # (9375, 4)

## Valid anchors which are within the image
valid_anchors = anchors[np.where((anchors[:, 0] >= 0) &
        (anchors[:, 1] >= 0) &
        (anchors[:, 2] <= image_size[0]) &
        (anchors[:, 3] <= image_size[1]))[0]]
print(valid_anchors.shape)  # (3844, 4)
```

## 2. Problems with the naive approach and its solution
As seen in the last section, there are two fundamental problems with the approach.
- Sending ~2,200 proposals to the network is like predicting on ~2,200 images. Even on state-of-the-art infrastructure this would take seconds per image. This is a problem for use cases like self-driving cars, where frames can arrive at ~60 frames per second or more.
- Extreme class imbalance. As we saw in the last section, out of the 2257 valid anchor boxes only 19 are positive and all the rest are negative. In a later section we will see how this problem is solved when we discuss two-stage object detection.

We will see how to resolve the first problem in this section. Take a ResNet-like architecture, as shown below, as an example. The spatial size is reduced after every residual block, so for an 800 x 800 image the feature map sizes are (200, 200), (100, 100), (50, 50) and (25, 25), as shown. Since we have used a sub_sample of 32, we stop here. Every 1 x 1 pixel on this feature map corresponds to a (32, 32) pixel region in the image, so its (512, 1, 1) feature vector is responsible for predicting the object at that location. This vector is passed through a few more layers that produce an output of shape (N+1+4, 9), where the 9 comes from the 9 anchor boxes at each location. Since the backbone network runs only once per image and the rest of the computation acts on the feature map, the inference and training time drop sharply, by around 100x in many cases.
![Architecuture-2](../images/arch_2.png)

In total, when this head is slid over the entire feature map, it generates (625x9, N+1+4) outputs. This is how object detection is performed using deep learning. Though I have called it naive object detection, we have already used most of the techniques that are used in the deep learning community. In the next section we will look at two-stage and one-stage object detection techniques.  
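The shapes in this section can be checked with a small NumPy sketch. The random feature map stands in for the backbone output, and modelling the prediction head as a single 1 x 1 convolution with an anchor-major channel layout is an assumption made to keep the example short.

```python
import numpy as np

image_size = 800
strides = [4, 8, 16, 32]  # assumed down-sampling factor after each stage
feature_sizes = [image_size // s for s in strides]
print(feature_sizes)

num_classes = 2   # N, e.g. cow and human
num_anchors = 9   # 3 ratios x 3 scales
out_dim = num_classes + 1 + 4
fe_size = feature_sizes[-1]

rng = np.random.default_rng(0)
# Stand-in for the backbone output: 512 channels on a 25 x 25 grid
feature_map = rng.standard_normal((512, fe_size, fe_size))
# Weights of a 1 x 1 convolution with 9 * (N + 1 + 4) output channels
weights = rng.standard_normal((num_anchors * out_dim, 512)) * 0.01

# A 1 x 1 convolution is a matrix product applied at every feature-map location
head_out = np.einsum("oc,chw->ohw", weights, feature_map)

# Rearrange to one row per anchor: (25 * 25 * 9, N + 1 + 4)
predictions = head_out.reshape(num_anchors, out_dim, fe_size, fe_size)
predictions = predictions.transpose(2, 3, 0, 1).reshape(-1, out_dim)
print(predictions.shape)
```

The backbone runs once, and the head only touches the 25 x 25 feature map, which is where the saving over classifying thousands of image patches comes from.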

## 3. Detection frameworks
So far we have seen how object detection frameworks have evolved with time. The last approach we discussed brought the inference time down drastically, from the order of seconds to milliseconds. Using it as a base, researchers have proposed several architectures to counter the extreme class imbalance and train these networks efficiently. Object detection frameworks can be broadly classified into two types.
- Two stage object detection framework
- One stage object detection framework

We will discuss two-stage object detection frameworks first and one-stage object detection after that.

### 3.1 Two stage object detection Framework
As discussed previously, the object detection pipelines we have seen so far suffer from extreme class imbalance. In the example above there are 2257 valid anchor boxes, of which only 19 are positive. To solve this problem, researchers proposed a two-stage object detection framework. In a two-stage framework, the first stage suggests the possible locations that might contain objects, and the second stage takes these proposed boxes and regresses and classifies the objects present in them. The [RCNN](https://arxiv.org/abs/1311.2524), [Fast RCNN](https://arxiv.org/abs/1504.08083) and [Faster RCNN](https://arxiv.org/abs/1506.01497) frameworks use this approach. RCNN and Fast RCNN rely on an external proposal method (selective search), while Faster RCNN, which extends both, generates the proposals with a network. We will discuss the workings of Faster RCNN in this section.

Faster RCNN has two networks: a region proposal network and a Fast RCNN network. The region proposal network (RPN), given all the anchor boxes, predicts the possible locations where objects might be present. These proposals are passed to the Fast RCNN network, where regression and classification happen for each object. Let's look in depth at how this is done.

Let's say we have an image of size 800 x 800 and we are using a base_size of 16 (one feature map pixel per 16 x 16 image pixels), so the feature map size is 50 x 50 and we have 2500 anchor locations. To handle the scale and aspect ratio problem, we select 3 different aspect ratios and 3 different scales, which gives 2500 x 9 = 22500 anchor boxes in total. The RPN takes all 22500 anchor boxes and predicts whether each anchor box is positive (contains an object) or negative (doesn't contain an object), along with the location of the object with respect to the anchor box. We can then rank the boxes by score and select the top N. In Faster RCNN the researchers keep about 2000 boxes during training. This reduces the number of possible locations by ~10x (from ~22500 to ~2000). Since the RPN also predicts the location of the object with respect to each anchor box, these boxes are much more accurate than plain anchor boxes (remember that the best anchor box in the last section had a max IoU of only 0.61 with the ground truth). These top 2000 boxes are then sent to Fast RCNN.
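The selection step can be sketched as below. The random `objectness` scores are a stand-in for real RPN outputs, and the sketch only ranks by score; it leaves out the box decoding and any overlap filtering a full implementation would also perform.

```python
import numpy as np

rng = np.random.default_rng(0)

fe_size = 800 // 16                   # 50 x 50 feature map
num_anchors = fe_size * fe_size * 9   # 3 ratios x 3 scales per location
objectness = rng.random(num_anchors)  # stand-in for the RPN's "contains object" scores

top_n = 2000
order = np.argsort(objectness)[::-1]  # best score first
keep = order[:top_n]                  # indices of the proposals passed to Fast RCNN

print(num_anchors, "->", keep.size)
```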



![faster rcnn rpn -1](../images/faster_rcnn_rpn1.png)


Fast RCNN is very similar to the sliding window approach described earlier, but instead of taking boxes at every location, it extracts only the regions proposed by the RPN (2000 of them). The Fast RCNN network predicts both the class of the object and the location of the object with respect to the box proposed by the RPN. In the diagram below, the RPN has proposed a window that contains an object with some probability; we extract the features of that window from the feature map and send them to a regressor and a classifier.

![faster_rcnn_2](../images/faster_rcnn_2.png)


If you look closely, both networks share a similar backbone (image to feature map). So instead of keeping two separate networks, the authors combined the backbones and built a single unified network. When training on proposals, the backbone and the RPN are updated; when training on detections, the backbone and the Fast RCNN head are updated. This saves compute time and memory, and according to the authors of the paper, Faster RCNN runs at 5 frames per second. We can see the network in the following image.

![two-stage](../images/two_stage_1.png)


In the above diagram the top green box is the RPN, which proposes ~2000 boxes inside the image that might contain objects. The features of these proposed boxes are then extracted from the feature map and sent to the Fast RCNN network, where we regress and classify each proposal. Since different proposal boxes have different sizes and the fully connected (FC) layers of a neural network only accept fixed-size inputs, we add a layer called the ROI pooling layer. The ROI pooling layer gives a constant-size output regardless of the size of the proposal box. Suppose we want a fixed output of 2 x 2 x C (channels): we divide the proposal box using a 2 x 2 grid and then apply max pooling or average pooling to each grid cell. When a side of the proposal box is odd, as in the diagram, which shows a 16 x 17 proposal on a 50 x 50 feature map, we split the box from left to right and top to bottom, and the last grid cell takes the extra row or column.  

![ROI pooling](../images/roi_pooling.jpg)
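A minimal NumPy sketch of ROI max pooling as described above. The function name `roi_max_pool`, the ROI coordinates and the toy feature map are assumptions for illustration, and the ROI is assumed to be at least `output_size` pixels on each side.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=2):
    """Max-pool one region of a (C, H, W) feature map to (C, output_size, output_size).

    roi: (x1, y1, x2, y2) in feature-map coordinates, with x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[:, y1:y2, x1:x2]
    channels, height, width = region.shape
    # Bin edges use floor division, so the last bin absorbs any remainder
    y_edges = [i * (height // output_size) for i in range(output_size)] + [height]
    x_edges = [j * (width // output_size) for j in range(output_size)] + [width]
    pooled = np.zeros((channels, output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            cell = region[:, y_edges[i]:y_edges[i + 1], x_edges[j]:x_edges[j + 1]]
            pooled[:, i, j] = cell.max(axis=(1, 2))
    return pooled


# Toy 2-channel 50 x 50 feature map whose values grow left to right, top to bottom
feature_map = np.arange(2 * 50 * 50, dtype=float).reshape(2, 50, 50)
roi = (10, 5, 26, 22)  # 16 wide, 17 tall
pooled = roi_max_pool(feature_map, roi)
print(pooled.shape)
print(pooled[0])
```

With a 17-row proposal and a 2 x 2 grid, the top cells cover 8 rows and the bottom cells cover the remaining 9, yet the output is always 2 x 2 per channel.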


This is how a two-stage object detector works. The training methodology, testing methodology and loss functions used in these networks can be studied from the [Faster RCNN](https://arxiv.org/abs/1506.01497) paper. 



### 3.2 One (Single) stage object detection

It is very similar to what we discussed in the naive object detection framework. Researchers have proposed several frameworks that counter the class imbalance. Some of the notable ones are:
1) Yolo and its versions (Yolov1, Yolov2, Yolov3)
2) SSD and DSSD
3) RetinaNet
4) M2Det

In one stage, we divide the image into multiple parts using anchor boxes. Instead of being cropped at the image level, the features of these anchor boxes are taken from the feature map (as shown in the diagram below). Each feature map location is then regressed and classified directly, without an RPN. This direct regression and classification suffers from class imbalance, and we will see how it is tackled in the next section when we talk in depth about RetinaNet.


![one-stage-network](../images/one_stage_1.png)


### 3.3 Advantages and disadvantages of two stage and one stage frameworks

#### Advantages of two-stage object detectors over single-stage detectors
- Better localization of objects, since a two-stage detector like Faster RCNN first proposes a box using the RPN, and Fast RCNN then refines that box to find the location of the object.
- Higher precision, since each box is validated by two networks sequentially.

#### Advantages of one-stage object detectors over two-stage object detectors
- A single-stage object detector sees the entire image during training and testing, so it implicitly encodes contextual information about classes as well as their appearance. A two-stage framework like Faster RCNN can mistake background patches in an image for objects because it can't see the larger context (the Fast RCNN network inside Faster RCNN only looks at the proposals generated by the RPN). The Yolo paper reports less than half the number of background errors compared to Fast RCNN.
- Single-stage object detectors are comparatively faster to train.
- The inference time of a single-stage object detector is much lower than that of a two-stage object detector. Faster RCNN runs at 5 fps while Yolo runs at 45 fps.


The graph below shows some notable differences.

![one-stage-two-stage comparison](../images/one_stage_two_stage.png)

- This comparison chart is taken from https://arxiv.org/pdf/1811.04533.pdf
- Two-stage pipelines are clearly slower than one-stage pipelines.
- If we set aside M2Det (its results are the authors' claims, and no code had been released to verify them at the time of writing), two-stage pipelines are comparatively more accurate.


I will explain RetinaNet in this course. Apart from small changes in the network and a few other details, most single-stage detectors are more or less the same. Towards the end of the course I will add a few papers and briefly explain how they differ from RetinaNet. If you understand RetinaNet well, you will be able to understand the other frameworks easily.

## Quiz
1) Which of the following is a two stage detector?   
A) RetinaNet, B) Yolo, C) Faster RCNN D) SSD   

Ans) C 

2) Which of the following statements is correct?  
A) Single stage detectors are faster than two stage detectors.  
B) Single stage detectors make fewer background errors compared to two stage detectors.  
C) Both of the above statements are correct.  
D) Neither of the above statements is correct.  

Ans) C

## References 

- http://www.telesens.co/2018/03/11/object-detection-and-classification-using-r-cnns/