# R-CNN

## Object detection

Classical machine learning approaches (hand-crafted features):

1) Viola–Jones object detection framework based on Haar features<br>
2) Scale-invariant feature transform (SIFT)<br>
3) Histogram of oriented gradients (HOG) features<br>

### Classification

<img src="./data/image-classification-pipeline.jpg" width="500">

### Image pyramid and sliding window

<table>
    <tr>
        <th><img src="./data/pyramid_example.png" width="300"></th>
        <th><img src="./data/sliding_window_example.gif" width="250"></th>
    </tr>
</table>
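
A minimal sketch of the classic detector loop, assuming the image is a NumPy array. The nearest-neighbour downscaling and the `scale`, `step` and window-size values are illustrative choices, not values taken from the references. A real detector would classify every window of every pyramid layer (for example with HOG features and a linear SVM).

```python
import numpy as np

def image_pyramid(image, scale=1.5, min_size=(32, 32)):
    """Yield the image, then progressively smaller copies, until it gets too small."""
    yield image
    while True:
        h, w = int(image.shape[0] / scale), int(image.shape[1] / scale)
        if h < min_size[0] or w < min_size[1]:
            break
        # Nearest-neighbour downscaling keeps the sketch dependency-free.
        rows = (np.arange(h) * image.shape[0] / h).astype(int)
        cols = (np.arange(w) * image.shape[1] / w).astype(int)
        image = image[rows][:, cols]
        yield image

def sliding_window(image, step, window):
    """Yield (x, y, patch) for every window position, moving `step` pixels at a time."""
    win_h, win_w = window
    for y in range(0, image.shape[0] - win_h + 1, step):
        for x in range(0, image.shape[1] - win_w + 1, step):
            yield x, y, image[y:y + win_h, x:x + win_w]

# Usage: count the windows a classifier would have to evaluate.
image = np.zeros((128, 96))
n_windows = sum(1 for layer in image_pyramid(image, scale=2.0)
                for _ in sliding_window(layer, step=16, window=(32, 32)))
print(n_windows)  # 41 (35 windows at 128 x 96, 6 at 64 x 48)
```

The window count grows quickly with smaller steps and more pyramid layers, which is why this exhaustive search is slow.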

### Non-maximum suppression (NMS)

<img src="./data/hog_object_detection_nms.jpg" width="500">
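
A minimal greedy NMS sketch, assuming boxes are given as (x1, y1, x2, y2) rows and using intersection over union (IoU) as the overlap measure. The 0.5 threshold is an illustrative default; implementations differ in the exact overlap definition.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = np.argsort(scores)[::-1]  # indices by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Only boxes that overlap the kept box at most `iou_threshold` survive.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 with IoU ~0.82
```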

## R-CNN

R-CNN consists of three modules:

1) Category-independent region proposal generator <br>
2) CNN feature extractor <br>
3) Set of class-specific linear SVMs <br>

<img src="./data/rcnn-pipeline.png" width="700">
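
A minimal sketch of modules 2 and 3 at test time, assuming the proposals are already given as integer (x1, y1, x2, y2) boxes. Each proposal is cropped and warped to 227 × 227 (the input size used in the R-CNN paper; its context padding is omitted here), turned into a feature vector, and scored by every class-specific linear SVM at once as a matrix product. `toy_features` is a hypothetical stand-in for the CNN, and the SVM weights are random placeholders.

```python
import numpy as np

def warp(image, box, size=227):
    """Crop box (x1, y1, x2, y2) and resize it to size x size (nearest neighbour)."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    rows = (np.arange(size) * crop.shape[0] / size).astype(int)
    cols = (np.arange(size) * crop.shape[1] / size).astype(int)
    return crop[rows][:, cols]

def rcnn_detect(image, proposals, extract_features, svm_weights, svm_bias):
    """Return a (num_proposals, num_classes) matrix of linear SVM scores.

    svm_weights: (num_classes, feature_dim); svm_bias: (num_classes,).
    """
    # The feature extractor runs once per proposal: the main cost of R-CNN.
    features = np.stack([extract_features(warp(image, box)) for box in proposals])
    return features @ svm_weights.T + svm_bias

def toy_features(crop):
    """Hypothetical stand-in for the CNN: a 2-D feature vector."""
    return np.array([crop.mean(), crop.std()])

rng = np.random.default_rng(0)
image = rng.random((100, 100))
proposals = [(0, 0, 50, 50), (10, 20, 60, 90)]
scores = rcnn_detect(image, proposals, toy_features, rng.standard_normal((3, 2)), np.zeros(3))
print(scores.shape)  # (2, 3): one score per proposal and class
```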

### Selective search

Selective Search is a region proposal algorithm used in object detection. <br>
Starting from an initial over-segmentation, it hierarchically groups similar neighbouring regions based on color, texture, size and shape compatibility. <br><br>


<table>
    <tr>
        <th><img src="./data/selective-search-0.png" width="300"></th>
        <th><img src="./data/selective-search-1.png" width="300"></th>
        <th><img src="./data/selective-search-2.png" width="300"></th>
    </tr>
</table>
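
A heavily simplified sketch of the hierarchical grouping step, assuming a grayscale image in [0, 1] and an initial segmentation given as an integer label map. The original method also uses a texture similarity, several colour spaces and its own initial segmentation, all omitted here. Similarity is the sum of a colour-histogram intersection, a size term and a fill (shape compatibility) term; `selective_search`, `union_box` and `bins` are illustrative names.

```python
import numpy as np

def union_box(a, b):
    """Smallest (x1, y1, x2, y2) box containing boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def selective_search(image, labels, bins=8):
    """Return boxes of the initial regions and of every region created by merging."""
    total = labels.size
    regions = {}
    for r in np.unique(labels):
        ys, xs = np.nonzero(labels == r)
        hist, _ = np.histogram(image[ys, xs], bins=bins, range=(0, 1))
        regions[int(r)] = {"size": len(ys), "hist": hist / len(ys),
                           "box": (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)}

    def similarity(i, j):
        ri, rj = regions[i], regions[j]
        colour = np.minimum(ri["hist"], rj["hist"]).sum()      # histogram intersection
        size = 1 - (ri["size"] + rj["size"]) / total            # favour merging small regions
        x1, y1, x2, y2 = union_box(ri["box"], rj["box"])
        fill = 1 - ((x2 - x1) * (y2 - y1) - ri["size"] - rj["size"]) / total  # favour compact merges
        return colour + size + fill

    # Neighbouring pairs: labels that touch horizontally or vertically.
    pairs = set()
    for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
        mask = a != b
        pairs.update((min(int(p), int(q)), max(int(p), int(q))) for p, q in zip(a[mask], b[mask]))
    sims = {pair: similarity(*pair) for pair in pairs}

    proposals = [r["box"] for r in regions.values()]
    next_id = max(regions) + 1
    while sims:
        i, j = max(sims, key=sims.get)                          # most similar neighbours
        ri, rj = regions[i], regions[j]
        n = ri["size"] + rj["size"]
        regions[next_id] = {"size": n,
                            "hist": (ri["hist"] * ri["size"] + rj["hist"] * rj["size"]) / n,
                            "box": union_box(ri["box"], rj["box"])}
        proposals.append(regions[next_id]["box"])
        # Neighbours of i or j become neighbours of the merged region.
        neighbours = {k for pair in sims if i in pair or j in pair for k in pair} - {i, j}
        sims = {pair: s for pair, s in sims.items() if i not in pair and j not in pair}
        for k in neighbours:
            sims[(k, next_id)] = similarity(k, next_id)
        next_id += 1
    return proposals

labels = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 3, 3], [2, 2, 3, 3]])
boxes = selective_search(np.zeros((4, 4)), labels)
print(len(boxes), boxes[-1])  # 7 (0, 0, 4, 4): 4 initial regions + 3 merges
```

Every region produced along the way is kept as a proposal, so the output covers objects at all scales without any learning.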

### Problems

1) Region proposals are extracted with selective search, a fixed algorithm with no learning <br>
2) The CNN is run separately on every region proposal (about 2000 per image), so no computation is shared <br>
3) The pipeline consists of several separately trained models (CNN feature extractor, class-specific SVMs and bounding-box regressors) rather than a single end-to-end network <br>
4) It cannot run in real time: inference takes around 47 seconds per test image <br>

## Fast R-CNN

1) The network takes as input an entire image and a set of object proposals <br>
2) The CNN produces a convolutional feature map for the whole image <br>
3) For each object proposal, a RoI pooling layer extracts a fixed-length feature vector from the feature map <br>
4) Each feature vector is fed into an MLP (fully connected layers) that finally branches into two sibling output layers: one produces softmax probabilities over $K + 1$ classes ($K$ object classes plus background), the other outputs four real-valued numbers for each of the $K$ object classes, encoding refined bounding-box positions <br>
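
A minimal sketch of how the two sibling outputs could be turned into detections, assuming column 0 of `class_probs` is the background class and boxes are (x1, y1, x2, y2). The four numbers per class are decoded with the R-CNN parameterization that Fast R-CNN reuses: a scale-invariant shift of the box center and a log-space change of width and height, relative to the proposal.

```python
import numpy as np

def decode_boxes(proposals, deltas):
    """Apply (tx, ty, tw, th) offsets to proposals given as (x1, y1, x2, y2)."""
    pw = proposals[:, 2] - proposals[:, 0]
    ph = proposals[:, 3] - proposals[:, 1]
    px = proposals[:, 0] + 0.5 * pw
    py = proposals[:, 1] + 0.5 * ph
    tx, ty, tw, th = deltas.T
    cx, cy = px + pw * tx, py + ph * ty        # center shift, relative to proposal size
    w, h = pw * np.exp(tw), ph * np.exp(th)    # log-space width/height change
    return np.stack([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h], axis=1)

def fast_rcnn_outputs(proposals, class_probs, bbox_deltas):
    """Pick the best object class per RoI and decode that class's box.

    class_probs: (N, K + 1), column 0 assumed to be background.
    bbox_deltas: (N, 4K), four values per object class.
    """
    n = len(proposals)
    labels = class_probs[:, 1:].argmax(axis=1) + 1           # best non-background class
    per_class = bbox_deltas.reshape(n, -1, 4)                # (N, K, 4)
    deltas = per_class[np.arange(n), labels - 1]             # offsets of the chosen class
    scores = class_probs[np.arange(n), labels]
    return decode_boxes(proposals, deltas), labels, scores

proposals = np.array([[0.0, 0.0, 10.0, 10.0]])
class_probs = np.array([[0.1, 0.2, 0.7]])                  # K = 2 object classes
bbox_deltas = np.array([[0, 0, 0, 0, 0.1, 0, 0, 0]])       # shift the class-2 box right by 10%
boxes, labels, scores = fast_rcnn_outputs(proposals, class_probs, bbox_deltas)
# expected: box [1, 0, 11, 10], label 2, score 0.7
```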

<img src="./data/fast-rcnn-pipeline.png" width="700">

### RoI pooling

<img src="./data/roi_pooling.gif" width="500">
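
A minimal single-RoI sketch, assuming a (C, H, W) feature map and a RoI already projected to integer feature-map coordinates. The RoI is split into an `out_h` × `out_w` grid of roughly equal sub-windows and each one is max-pooled, so proposals of any size yield the same output shape. Rounding the bin edges with floor/ceil (bins may overlap by one cell) is one common choice; implementations differ here.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one RoI (x1, y1, x2, y2, end-exclusive) into a (C, out_h, out_w) grid."""
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    region = feature_map[:, y1:y2, x1:x2]
    h, w = region.shape[1:]
    # Fractional bin edges splitting the RoI into out_h x out_w sub-windows.
    ys = np.linspace(0, h, out_h + 1)
    xs = np.linspace(0, w, out_w + 1)
    out = np.empty((feature_map.shape[0], out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r0, r1 = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
            c0, c1 = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
            out[:, i, j] = region[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out

feature_map = np.arange(16, dtype=float).reshape(1, 4, 4)   # one channel, 4 x 4
pooled = roi_pool(feature_map, (0, 0, 4, 4))  # maxima of the four 2 x 2 quadrants: 5, 7, 13, 15
```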

### Multi-task Loss

<img src="./data/multitask_loss.png" width="400">
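
The Fast R-CNN loss for one RoI is $L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \ge 1] L_{loc}(t^u, v)$, where $L_{cls}(p, u) = -\log p_u$ and $L_{loc}$ sums a smooth $L_1$ loss over the four box offsets. A minimal NumPy sketch, assuming class 0 is background and $\lambda = 1$:

```python
import numpy as np

def smooth_l1(x):
    """0.5 * x^2 where |x| < 1, |x| - 0.5 otherwise (less sensitive to outliers than L2)."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def multitask_loss(class_probs, u, t_u, v, lam=1.0):
    """Loss for one RoI.

    class_probs: softmax output p over K + 1 classes; u: true class (0 = background);
    t_u: predicted offsets for class u; v: ground-truth regression targets.
    """
    l_cls = -np.log(class_probs[u])                           # log loss of the true class
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum()  # box regression loss
    return l_cls + lam * (u >= 1) * l_loc                     # background RoIs: no box loss

p = np.array([0.5, 0.25, 0.25])
background_loss = multitask_loss(p, u=0, t_u=np.zeros(4), v=np.ones(4))  # only -log(0.5)
```

The Iverson bracket $[u \ge 1]$ is what lets background RoIs contribute to classification without being penalized for their (meaningless) box outputs.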

### Comparison

<img src="./data/fast-rcnn-comparison.png" width="600">

### Problems

1) Region proposals are still extracted with selective search, which becomes the test-time bottleneck <br>
2) Much faster than R-CNN, but still not a real-time model <br>

## Faster R-CNN

1) Selective search is replaced by a Region Proposal Network (RPN) <br>
2) The detection network is Fast R-CNN <br>

<img src="./data/faster-rcnn-pipeline.png" width="300">

### RPN

1) Extract a feature map from the input image <br>
2) A small network is slid over every location of the feature map (an $n \times n$ window, with $n = 3$ in the paper) <br>
3) At each location, $k$ ($k = 9$) anchor boxes are used for generating region proposals: 3 scales with box areas of $128^2$, $256^2$ and $512^2$ pixels, and 3 aspect ratios of 1:1, 1:2 and 2:1 <br>
4) A cls layer outputs $2k$ scores estimating the probability of object vs. not object for each of the $k$ boxes <br>
5) A reg layer outputs $4k$ values encoding the coordinates (box center, width and height) of the $k$ boxes relative to their anchors <br>
6) For a $W \times H$ feature map, there are $WHk$ anchors in total <br>
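
A minimal anchor-generation sketch, assuming a feature-map stride of 16 pixels (the VGG-16 setting) and anchors centred on each feature-map cell. For an area of $s^2$ and a ratio $r = h / w$, choosing $w = s / \sqrt{r}$ and $h = s\sqrt{r}$ keeps $wh = s^2$. `generate_anchors` and its parameters are illustrative names.

```python
import numpy as np

def generate_anchors(feat_w, feat_h, stride=16, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchors as (x1, y1, x2, y2) in image pixels."""
    base = []
    for r in ratios:                      # r = height / width
        for s in scales:                  # box area = s^2
            w, h = s / np.sqrt(r), s * np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                                   # (k, 4), centred at the origin
    # Centre of every feature-map cell, mapped back to image coordinates.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)                            # each (feat_h, feat_w)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (shifts + base[None]).reshape(-1, 4)             # (W * H * k, 4)

anchors = generate_anchors(feat_w=4, feat_h=3)
print(anchors.shape)  # (108, 4), i.e. W * H * k = 4 * 3 * 9
```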

### Anchors

<img src="./data/anchors.png" width="500">

The RPN pre-checks which locations are likely to contain an object. The corresponding locations and bounding boxes are then passed to the detection network, which predicts the object class and refines the bounding box. <br><br>
NMS reduces the number of proposals from about 6000 to N (N = 300) <br>
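
A minimal sketch of this filtering, assuming the RPN output has already been decoded into (x1, y1, x2, y2) boxes with objectness scores. The pre/post-NMS counts mirror the numbers above and the 0.7 IoU threshold is the value reported in the Faster R-CNN paper; `select_proposals` is an illustrative name.

```python
import numpy as np

def select_proposals(boxes, scores, pre_nms_top_n=6000, post_nms_top_n=300, iou_threshold=0.7):
    """Keep the top-scoring boxes, apply greedy NMS, then keep the top N survivors."""
    order = np.argsort(scores)[::-1][:pre_nms_top_n]
    boxes, scores = boxes[order], scores[order]
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    keep, idx = [], np.arange(len(boxes))      # already sorted by descending score
    while idx.size > 0 and len(keep) < post_nms_top_n:
        i, rest = idx[0], idx[1:]
        keep.append(i)
        w = np.clip(np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]), 0, None)
        h = np.clip(np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]), 0, None)
        inter = w * h
        iou = inter / (areas[i] + areas[rest] - inter)
        idx = rest[iou <= iou_threshold]       # suppress heavily overlapping boxes
    return boxes[keep], scores[keep]

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept_boxes, kept_scores = select_proposals(boxes, scores)  # box 1 (IoU ~0.82 with box 0) is suppressed
```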

### 4-Step Alternating Training

1) Train (fine-tune) the RPN, initialized with an ImageNet-pre-trained model <br>
2) Train (fine-tune) a separate detection network, also initialized with an ImageNet-pre-trained model (conv layers not yet shared) <br>
3) Use the detector network to initialize RPN training; fix the shared conv layers and fine-tune only the layers unique to the RPN <br>
4) Keep the conv layers fixed and fine-tune the layers unique to the detector network <br>
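
A toy sketch of the schedule that only tracks which parts are updated in each step and when the conv layers become shared. `Net`, `train` and the dict-based "weights" are hypothetical stand-ins; no actual optimization happens.

```python
from dataclasses import dataclass, field

@dataclass
class Net:
    conv: dict                                  # conv layers (can be shared between networks)
    head: dict = field(default_factory=dict)    # layers unique to the RPN or the detector

def train(net, step, tune_conv):
    """Stand-in for an SGD run: only records which parts would be updated."""
    if tune_conv:
        net.conv.setdefault("updated_in", []).append(step)
    net.head.setdefault("updated_in", []).append(step)

def four_step_training():
    rpn = Net(conv={"init": "imagenet"})
    train(rpn, step=1, tune_conv=True)      # 1) fine-tune the whole RPN
    det = Net(conv={"init": "imagenet"})    # 2) separate detector with its own conv copy
    train(det, step=2, tune_conv=True)      #    (the paper trains it on step-1 RPN proposals)
    rpn.conv = det.conv                     # 3) RPN now uses the detector's conv layers...
    train(rpn, step=3, tune_conv=False)     #    ...kept fixed; only RPN-specific layers tuned
    train(det, step=4, tune_conv=False)     # 4) conv still fixed; tune detector-specific layers
    return rpn, det

rpn, det = four_step_training()
print(rpn.conv is det.conv)  # True: both networks share one set of conv layers
```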

### Comparison

<img src="./data/faster-rcnn-comparison.png" width="600">

## References

1. Simple HOG+SVM object detector https://www.pyimagesearch.com/2014/11/10/histogram-oriented-gradients-object-detection/ <br>
2. R-CNN https://arxiv.org/pdf/1311.2524.pdf <br>
3. Fast R-CNN https://arxiv.org/pdf/1504.08083.pdf <br>
4. Faster R-CNN https://arxiv.org/pdf/1506.01497.pdf <br>
5. Introduction to the basic detection algorithms https://www.analyticsvidhya.com/blog/2018/10/a-step-by-step-introduction-to-the-basic-object-detection-algorithms-part-1/ <br>
6. R-CNN review https://medium.com/coinmonks/review-r-cnn-object-detection-b476aba290d1 <br>
7. Fast R-CNN review https://medium.com/coinmonks/review-fast-r-cnn-object-detection-a82e172e87ba <br>
8. Faster R-CNN review https://towardsdatascience.com/review-faster-r-cnn-object-detection-f5685cb30202 <br>