# C4_Notes_W3 - Convolutional Neural Networks

This is the fourth course of the deep learning specialization at [Coursera](https://www.coursera.org/specializations/deep-learning) which is moderated by [DeepLearning.ai](http://deeplearning.ai/). The course is taught by Andrew Ng.

**Week 3: Object Detection**

Learn how to apply your knowledge of CNNs to one of the toughest but hottest fields of computer vision: object detection.

  * [1. Object Localization](#object-localization)
  * [2. Landmark Detection](#landmark-detection)
  * [3. Object Detection](#object-detection-1)
  * [4. Convolutional Implementation of Sliding Windows](#convolutional-implementation-of-sliding-windows)
  * [5. Bounding Box Predictions](#bounding-box-predictions)
  * [6. Intersection Over Union](#intersection-over-union)
  * [7. Non-max Suppression](#non-max-suppression)
  * [8. Anchor Boxes](#anchor-boxes)
  * [9. YOLO Algorithm](#yolo-algorithm)
  * [10. Region Proposals (R-CNN)](#region-proposals-r-cnn)

# Object Localization 
<a id='object-localization'></a>

- Object detection is one of the areas in which deep learning has done very well over the past two years.
- What are localization and detection?

## Image Classification 
- Classify an image into a specific class. The whole image represents one class, and we don't need to know exactly where the object is. Usually only one object is present.

![](Images/Classification.jpg)


## Classification with localization
- Given an image, we want to learn the class of the image and where the object of that class is located. We need to predict a class and a rectangle around the object. Usually only one object is present.

![](Images/ClassificationLoc.jpg)


## Object detection
- Given an image, we want to detect all the objects in the image that belong to a set of specific classes and give their locations. An image can contain more than one object, and the objects can belong to different classes.

![](Images/ObjectDetection.png)

## Semantic Segmentation
- We want to label each pixel in the image with a category label. Semantic segmentation doesn't differentiate instances; it only cares about pixels. It detects no objects, just pixels.
- If two objects of the same class overlap, we won't be able to separate them.

![](Images/SemanticSegmentation.png)


## Instance Segmentation
- This is like the full problem. Rather than simply predicting the bounding box, we want to know which pixels belong to which label.

![](Images/InstanceSegmentation.png)

## Localization output (y/target value)
- For image classification we use a Conv net with a softmax attached to the end of it.
- For classification with localization we use a Conv net with a softmax attached to the end of it plus four numbers `bx`, `by`, `bh`, and `bw` that give the location of the object in the image. The dataset should contain these four numbers along with the class.
- Defining the target label Y in the classification with localization problem:
```
Y = [
    Pc  # Probability that an object is present
    bx  # Bounding box midpoint x
    by  # Bounding box midpoint y
    bh  # Bounding box height
    bw  # Bounding box width
    c1  # The classes (binary variables telling us which class is identified)
    c2
    ...
    ]
```


- Example (Object is present):
```
Y = [
    1     # Pc: object is present (a prediction would be a value between 0 and 1, e.g. 0.97)
    float # bx
    float # by
    float # bh
    float # bw
    0     # c1
    1     # c2 (exactly one of the classes is 1)
    0     # c3
]
```


- Example (when no object is present):
```
Y = [
    0  # No object is present
    ?  # ? means we don't care about the other values
    ?
    ?
    ?
    ?
    ?
    ?
]
```
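
The two examples above can be sketched in code. This is a minimal illustration, not part of the course material: the helper name `make_target` is hypothetical, and `None` is used here as a stand-in for the "don't care" entries shown as `?`.

```python
def make_target(num_classes, box=None, class_index=None):
    """Build the target vector [Pc, bx, by, bh, bw, c1, ..., cn].

    box is (bx, by, bh, bw), or None when no object is present.
    None entries stand for the "don't care" values written as ? above.
    """
    if box is None:
        return [0.0] + [None] * (4 + num_classes)
    # One-hot encoding: exactly one class entry is 1.
    one_hot = [1.0 if i == class_index else 0.0 for i in range(num_classes)]
    return [1.0] + list(box) + one_hot


# Hypothetical example values: an object of the second class, and an empty image.
y_object = make_target(3, box=(0.5, 0.7, 0.3, 0.4), class_index=1)
y_empty = make_target(3)
```

With 3 classes both vectors have 8 entries, matching the layout of `Y` above.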


- The loss function for the Y we have created (using squared error as an example):
- Note: `n` here refers to the number of components in the target variable (y).
- So if you have 3 classes, `n` would be 8: `(Pc, bx, by, bh, bw, c1, c2, c3)`
```
L(y',y) = {(y1'-y1)^2 + (y2'-y2)^2 + ... + (yn'-yn)^2           if y1 = 1
             (y1'-y1)^2                                           if y1 = 0
            }
```

- Note, **in practice we use:**
     - logistic regression loss for `Pc`,
     - log-likelihood loss for the classes,
     - squared error for the bounding box.
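
As a rough sketch of the piecewise squared-error loss defined above (not the mixed loss used in practice), the function below sums over every component when an object is present and looks only at the first component otherwise. The function name and the sample vectors are made up for illustration.

```python
def localization_loss(y_pred, y_true):
    """Squared-error loss for a target of the form [Pc, bx, by, bh, bw, c1, ...]."""
    if y_true[0] == 1:
        # Object present: every component contributes to the loss.
        return sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    # No object: only the Pc component matters, the rest are "don't care".
    return (y_pred[0] - y_true[0]) ** 2


y_true_object = [1, 0.5, 0.5, 0.2, 0.2, 0, 1, 0]
y_true_empty = [0, None, None, None, None, None, None, None]
y_pred = [0.5, 0.5, 0.5, 0.2, 0.2, 0, 1, 0]

loss_object = localization_loss(y_pred, y_true_object)
loss_empty = localization_loss(y_pred, y_true_empty)
```

Because the "don't care" entries are never read when `y_true[0]` is 0, they can hold any placeholder value.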


# Landmark Detection
<a id='landmark-detection'></a>

- In some computer vision problems you will need to output some points. That is called **landmark detection**.
- For example, if you are working on a face recognition problem you might want some points on the face, such as the corners of the eyes, the corners of the mouth, the corners of the nose, and so on. This can help in a lot of applications, like detecting the pose of the face.
- For example, the Y shape for a face recognition problem that needs to output 64 landmarks has 128 landmark values (an x and a y value for each landmark) plus one value that says whether a face is present:

  - ```
    Y = [ThereIsAFace, # Whether a face is present: 0 or 1
        l1x,
        l1y,
        ....,
        l64x,
        l64y
    ]
    ```

- Another application is getting the skeleton of a person using different landmarks/points on the body, which helps in some applications. Note that the labels must be consistent: if `l1x, l1y` is the left corner of the left eye in one example, `l1x, l1y` has to be the left corner of the left eye in every other example too.
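
A small sketch of how the landmark target above could be assembled. The helper name `landmark_target` and the sample coordinates are hypothetical; the only point is the ordering `[face, l1x, l1y, ..., l64x, l64y]`, which must be the same for every example.

```python
def landmark_target(landmarks, face_present=True):
    """Flatten [(x, y), ...] into [face, l1x, l1y, ..., lnx, lny]."""
    flat = [coord for point in landmarks for coord in point]
    return [1.0 if face_present else 0.0] + flat


# 64 made-up landmark points, always listed in the same order.
points = [(0.01 * i, 0.5) for i in range(64)]
y_face = landmark_target(points)
```

For 64 landmarks the vector has 129 entries: the face flag followed by 128 coordinates.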


# Object Detection
<a id=object-detection-1></a>

- We will use a Conv net to solve the object detection problem using a technique called the sliding windows detection algorithm.
- For example, let's say we are working on car detection.
- First, we train a Conv net on cropped car images and non-car images.
  - ![](Images/18.png)
- After we finish training this Conv net, we use it with the sliding windows technique.
- Sliding windows detection algorithm:
  1. Decide on a rectangle size.
  2. Split your image into rectangles of the size you picked. Every region should be covered. You can use a stride to control how far the window moves at each step.
  3. Feed each rectangle into the Conv net and decide whether it's a car or not.
  4. Pick larger/smaller rectangles and repeat steps 2 and 3.
  5. Store the rectangles that contain cars.
  6. If two or more rectangles intersect, choose the rectangle with the highest confidence.
- **A HUGE disadvantage of sliding windows is the computation time: you are cropping out so many images and running each of them through the Conv net.**
- In the era of machine learning before deep learning, people used hand-crafted linear classifiers to classify the object and then applied the sliding window technique. The linear classifier made this a cheap computation. In the deep learning era it is much more computationally expensive because of the complexity of the deep learning model.
- To solve this problem, we can implement the sliding windows with a **convolutional approach**.
- Another idea is to compress your deep learning model.
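
The sliding windows steps above can be sketched as follows. This is only an illustration under stated assumptions: `classify` is a hypothetical stand-in for the trained Conv net (it takes a window and returns a score between 0 and 1), and the 0.5 cut-off is an arbitrary choice.

```python
def sliding_windows(image_h, image_w, window, stride):
    """Yield (top, left, bottom, right) for each square window position."""
    for top in range(0, image_h - window + 1, stride):
        for left in range(0, image_w - window + 1, stride):
            yield top, left, top + window, left + window


def detect(image_h, image_w, window_sizes, stride, classify, threshold=0.5):
    """Run the classifier on every window of every size and keep the hits."""
    hits = []
    for window in window_sizes:
        for box in sliding_windows(image_h, image_w, window, stride):
            score = classify(box)  # one classifier call per crop: the expensive part
            if score >= threshold:
                hits.append((score, box))
    return hits


# Toy classifier that only "sees a car" in one particular window.
toy_classifier = lambda box: 0.9 if box == (2, 2, 16, 16) else 0.1
found = detect(16, 16, [14], 2, toy_classifier)
```

The nested loops make the cost visible: the classifier runs once per window position and per window size.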

# Convolutional Implementation of Sliding Windows
<a id=convolutional-implementation-of-sliding-windows></a>

- Turning FC layers into convolutional layers (predicting the image class from four classes). Let's start with a 14 x 14 x 3 image and look for 4 classes (pedestrian, car, tree, or background):
  ![](Images/19.png)
  - As you can see in the above image, we turned the FC layer into a Conv layer by using a convolution whose filter has the same width and height as the input.

- **Convolutional implementation of sliding windows**:
  - First, let's say that the Conv net you trained looks like this (no FC layers, all layers are convolutional):
    - ![](Images/20.png)
  - Say we now have a 16 x 16 x 3 image that we want to apply sliding windows to. With the normal implementation from the previous section, we would run this Conv net four times, once for each 14 x 14 rectangle.
  - The convolution implementation will be as follows:
    - ![](Images/21.png)
  - We simply feed the whole image into the same Conv net we have trained.
  - The left cell of the result (the blue one) represents the first sliding window of the normal implementation. The other cells represent the other windows.
  - It's more efficient because the four passes now share their computation.
  - Another example would be:
![](Images/22.png)
  - This example has a total of 16 sliding windows that share their computation.
  - [[Sermanet et al., 2014, OverFeat: Integrated recognition, localization and detection using convolutional networks]](https://arxiv.org/abs/1312.6229)
- The weakness of the algorithm is that the position of the rectangle won't be very accurate. Maybe none of the rectangles lies exactly on the object you want to recognize.
![](Images/23.png)
  - In red, the rectangle we want and in blue is the required car rectangle.
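
To make the "four windows" count from this section concrete, here is a small arithmetic sketch. It assumes a 14 x 14 window and an effective stride of 2 pixels; the stride is an assumption (it would correspond to a single 2 x 2 max-pooling layer in the network), not something stated in the notes.

```python
def window_positions(input_size, window_size, stride):
    """Number of window positions along one axis of a square image."""
    return (input_size - window_size) // stride + 1


# 16 x 16 image, 14 x 14 window, assumed effective stride of 2.
per_axis = window_positions(16, 14, 2)
total_windows = per_axis * per_axis
```

The convolutional implementation produces all of these results in a single forward pass, as a `per_axis` x `per_axis` grid of class predictions.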


# Bounding Box Predictions
<a id=bounding-box-predictions></a>

- A better algorithm than the one described in the last section is the [YOLO algorithm](https://arxiv.org/abs/1506.02640).
- YOLO stands for *you only look once* and was developed back in 2015.
- YOLO algorithm:
![](Images/24.png)

    1. Let's say we have an image of 100 x 100.
    2. Place a 3 x 3 grid on the image (though in an actual implementation you'd use something like a 19 x 19 grid for the 100 x 100 image).
    3. Apply the classification and localization algorithm we discussed a few videos back to each of the cells in the grid.
    4. So for each grid cell, your `y` value will have those 8 elements (same as the example above):
        - `bx` and `by` represent the centre point of the object and are relative to the grid cell, so their range is between 0 and 1 (the centre point of the object determines which grid cell it gets assigned to).
        - `bh` and `bw` represent the height and width of the object relative to the grid cell, and they can be **greater than 1.0** (this is necessary when the object extends beyond the grid cell that contains its centre point).
        - Note that each grid cell's coordinates are scaled so that (0,0) is the top left corner and (1,1) is the bottom right corner.
    5. Do everything at once with the convolutional sliding window. If the Y shape is 1 x 8 as we discussed before, then the output for the 100 x 100 image should be 3 x 3 x 8, which corresponds to 9 cell results. Note that if an object spans multiple grid cells, it is assigned to the grid cell where the object's midpoint lies.
![](images/dope_pic5.png)
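
The grid encoding described in the steps above can be sketched like this. It is a minimal illustration, assuming a square image, a box given by its centre, height and width in pixels, and a hypothetical helper name `encode_box`.

```python
def encode_box(cx, cy, h, w, image_size, grid_size):
    """Map a pixel-space box to (row, col, bx, by, bh, bw) for its grid cell."""
    cell = image_size / grid_size
    # The centre point decides which grid cell owns the object.
    col = min(int(cx // cell), grid_size - 1)
    row = min(int(cy // cell), grid_size - 1)
    # Centre relative to the cell's top-left corner, so between 0 and 1.
    bx = cx / cell - col
    by = cy / cell - row
    # Size in units of one cell, so it can exceed 1.
    bh = h / cell
    bw = w / cell
    return row, col, bx, by, bh, bw


# Made-up object on the 100 x 100 image with a 3 x 3 grid.
row, col, bx, by, bh, bw = encode_box(cx=50, cy=20, h=45, w=30, image_size=100, grid_size=3)
```

In this made-up example the object is 45 pixels tall while a cell is about 33 pixels, so `bh` comes out greater than 1 even though `bx` and `by` stay between 0 and 1.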


- This is VERY similar to the classification/localization algorithm that we discussed a few sections ago (we just run it over each grid cell instead of the whole image at once). **The bounding boxes aren't constrained by the stride sizes like in the sliding window algorithm**, which pretty much just loops through each window and checks whether the window has an object inside (1 or 0). The YOLO algorithm gives us much more precise bounding box coordinates that aren't constrained by the window sizes and locations.
- Note that we borrow from the above approach and use a **convolutional implementation**: if we have 9 grid cells, we run the algorithm only once (we don't have to run it 9 times!).
- One of the biggest advantages that makes the YOLO algorithm popular is that it is very fast and has a Conv net implementation.
- However, we do have a problem if more than one object falls in the same grid cell.
- How is YOLO different from other object detectors? YOLO uses a single CNN
  network for both classification and localizing the object using bounding boxes.
- In the next sections we will see some ideas that can make the YOLO algorithm better.




# Intersection Over Union
<a id=intersection-over-union></a>

- Intersection over Union is a function used to evaluate an object detection algorithm.
- It computes the size of the intersection and divides it by the size of the union. More generally, *IoU* *is a measure of the overlap between two bounding boxes*.
- For example:

![](Images/25.png)

- The red is the labeled output and the purple is the predicted output.
  - To compute Intersection over Union we first compute the union area of the two rectangles, which is "the first rectangle + the second rectangle - the overlap".
  - Then we compute the intersection area between the two rectangles, which is simply the overlap.
  - Finally, `IoU = intersection area / union area`.
  - If the boxes overlap perfectly, then the IoU will be 1.
- If `IoU >= 0.5`, the prediction is considered good. The best possible value is 1.
- The higher the IoU, the more accurate the bounding box.
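
The computation above can be written as a short function. This sketch assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates, which is a convention chosen here for illustration and differs from the `(bx, by, bh, bw)` midpoint format used in the labels.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at 0 when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    # Union = first rectangle + second rectangle - overlap.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


perfect = iou((0, 0, 2, 2), (0, 0, 2, 2))
partial = iou((0, 0, 2, 2), (1, 1, 3, 3))
disjoint = iou((0, 0, 1, 1), (2, 2, 3, 3))
```

In the partial case the overlap is 1 and the union is 4 + 4 - 1 = 7, so the IoU is 1/7, well below the 0.5 mark.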



# Non-max Suppression
<a id=non-max-suppression></a>

- One of the problems with YOLO as described so far is that it can detect the same object multiple times.
- Non-max Suppression is a way to make sure that YOLO detects the object just once.
- For example:
  - ![](Images/26.png)
  - Each car has two or more detections with different probabilities. This happens because several grid cells each think they contain the centre point of the object.
- Non-max suppression algorithm:
  1. Let's assume that we are targeting one class as the output class.
  2. The Y shape should be `[Pc, bx, by, bh, bw]`, where `Pc` is the probability that the object is present.
  3. Discard all boxes with `Pc < 0.6`.
  4. While there are any remaining boxes:
     1. Pick the box with the largest `Pc` and output it as a prediction.
     2. Discard any remaining box with `IoU > 0.5` with the box output in the previous step, i.e. any box with high overlap (greater than the overlap threshold of 0.5).
- If there are multiple classes/object types `c` you want to detect, you should run the Non-max suppression `c` times, once for every output class.
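
A minimal single-class sketch of the non-max suppression steps above, using the 0.6 and 0.5 thresholds from the list. The box format `(x1, y1, x2, y2)` and the function names are assumptions made for this example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, score_threshold=0.6, iou_threshold=0.5):
    """detections: list of (pc, box). Returns the detections that survive."""
    # Step 3: drop low-probability boxes, then sort by Pc, highest first.
    remaining = sorted((d for d in detections if d[0] >= score_threshold),
                       key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # Step 4.1: box with the largest Pc
        kept.append(best)
        # Step 4.2: discard boxes that overlap the chosen box too much.
        remaining = [d for d in remaining if iou(best[1], d[1]) <= iou_threshold]
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),    # overlaps the first box heavily
    (0.7, (20, 20, 30, 30)),  # a separate object
    (0.3, (0, 0, 10, 10)),    # below the Pc threshold
]
final = non_max_suppression(detections)
```

For several classes, the same function would be called once per class on that class's detections.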


# Anchor Boxes
<a id=anchor-boxes></a>

- In YOLO as described so far, a grid cell detects only one object. What if a grid cell needs to detect multiple objects?

![](Images/27.png)
- Here the car and the person fall in the same grid cell. In practice this happens rarely.
- Still, if it does occur, anchor boxes can help us solve this issue.
- If Y = `[Pc, bx, by, bh, bw, c1, c2, c3]`, then with two anchor boxes it becomes:
  - Y = `[Pc, bx, by, bh, bw, c1, c2, c3, Pc, bx, by, bh, bw, c1, c2, c3]`. We have simply repeated the one-anchor Y.
  - The two anchor boxes you choose should have known shapes:
![](Images/28.png)
- Previously, each object in a training image was assigned to the grid cell that contains that object's midpoint.
- With two anchor boxes, each object in a training image is assigned to the grid cell that contains the object's midpoint **AND to the anchor box for that grid cell with the <u>highest IoU</u>.** You decide where to place the object based on which anchor box its rectangle is closest to.
![](Images/dope_pic6.png)
- Example of data:
![](Images/29.png)
  - Where the car was closer to anchor 2 than anchor 1.
- You may have two or more anchor boxes, but you should know their shapes.
  - How do you choose the anchor boxes? People used to just choose them by hand: maybe five or ten anchor box shapes that span a variety of shapes and cover the types of objects you seem to detect frequently.
  - You may also use a k-means algorithm on your dataset to choose them.
- Anchor boxes allow your algorithm to specialize, which can help it more easily detect wider objects or taller ones.
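
Choosing the anchor box with the highest IoU can be sketched as below. The sketch compares shapes only, by treating the box and the anchor as if they shared the same centre; that simplification, the helper names, and the two anchor shapes are assumptions for illustration.

```python
def shape_iou(h1, w1, h2, w2):
    """IoU of two boxes that share the same centre, so only their shapes matter."""
    intersection = min(h1, h2) * min(w1, w2)
    union = h1 * w1 + h2 * w2 - intersection
    return intersection / union


def best_anchor(box_h, box_w, anchors):
    """Index of the anchor (h, w) whose shape has the highest IoU with the box."""
    scores = [shape_iou(box_h, box_w, ah, aw) for ah, aw in anchors]
    return scores.index(max(scores))


anchors = [(2.0, 1.0), (1.0, 2.0)]  # anchor 0 is tall, anchor 1 is wide
pedestrian_anchor = best_anchor(1.8, 0.9, anchors)  # a tall, narrow box
car_anchor = best_anchor(0.8, 1.9, anchors)         # a short, wide box
```

The tall box lands on the tall anchor and the wide box on the wide anchor, which is what lets each anchor slot specialize.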


# YOLO Algorithm
<a id=yolo-algorithm></a>

- YOLO is a state-of-the-art object detection model that is fast and accurate.
- Let's sum up and walk through the whole YOLO algorithm with an example.
- Suppose we need to do object detection for our autonomous driving system. It needs to identify three classes:
  1. Pedestrians.
  2. Cars.
  3. Motorcycles.
- We decided to choose two anchor boxes, a tall one and a wide one **(in practice we'd use five or more anchor boxes, hand made or generated using k-means. If we used 5 anchor boxes, we'd have an array of size 3x3x5x8)**.
- Our label Y shape will be `[Ny, HeightOfGrid, WidthOfGrid, 16]`, where Ny is the number of instances and each row (of size 16) is as follows:
  - `[Pc, bx, by, bh, bw, c1, c2, c3, Pc, bx, by, bh, bw, c1, c2, c3]`
- Your dataset could be images with multiple labels and a rectangle for each label. We structure the target Y values as follows:
    1. With a grid size of 3x3, 2 anchor boxes, and 3 classes, our Y value will be 3x3x2x8.
    2. We go through each of the grid cells and see if there is any object in it. For the first grid cell, nothing is associated with either anchor box, so our target value should be 0s (strictly, only `Pc` has to be 0; we don't care about the other numbers).
    3. When we get to a grid cell with an object, our target value should have the bounding box `(bx, by, bh, bw)` in the position associated with the correct anchor box. **Make sure you put your values in the right places.** Say you have 2 anchor boxes, a square one and a skinny rectangle, each owning one half of the Y vector. If the square owns the second half of the Y vector and the square better matches a car object (based on IoU), the car's Y data should be in the second half of the vector. In the other half of the vector (the other anchor box), `Pc` is 0 and the remaining numbers don't matter!

![](Images/dope_pic9.jpg)
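
A sketch of how one training label could be filled in for the 3x3 grid, 2 anchor boxes and 3 classes used in this example. The label is stored as nested lists of shape 3 x 3 x 16; the helper names and the sample values are hypothetical, and the "don't care" entries are simply left at 0.

```python
def empty_label(grid=3, anchors=2, classes=3):
    """A grid x grid x (anchors * (5 + classes)) label filled with zeros."""
    depth = anchors * (5 + classes)
    return [[[0.0] * depth for _ in range(grid)] for _ in range(grid)]


def place_object(label, row, col, anchor_index, box, class_index, classes=3):
    """Write [Pc, bx, by, bh, bw, c1..cn] into the slot owned by anchor_index."""
    slot = 5 + classes
    start = anchor_index * slot
    one_hot = [1.0 if i == class_index else 0.0 for i in range(classes)]
    label[row][col][start:start + slot] = [1.0, *box] + one_hot
    return label


y = empty_label()
# A car (class index 1) whose midpoint falls in the bottom-middle cell and
# whose shape best matches the second anchor (index 1).
place_object(y, row=2, col=1, anchor_index=1, box=(0.5, 0.4, 0.3, 0.9), class_index=1)
```

Only the second half of the vector for cell `(2, 1)` is filled in; the first half keeps `Pc = 0`, which tells the network that the first anchor has no object there.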
    

**Once you've created the dataset:**
- Train a Conv net on the labeled images.
- You should receive an output of `[HeightOfGrid, WidthOfGrid, 16]` in our case.

**Making predictions:**
- To make predictions, run the Conv net on an image and then run the non-max suppression algorithm for each class: 3 times in our example.
- You could get something like this:

![](Images/31.png)
- The total number of generated boxes is grid_width * grid_height * no_of_anchors = 3 x 3 x 2. Note that the network can't output `question marks`, so each bounding box will have numbers, but hopefully the grid regions with no objects will have very low `Pc` values.
- Also note that some of the bounding box heights/widths can extend outside their respective grid cells (only the centre point is within the grid cell).
- By removing the low probability predictions you should have:

![](Images/32.png)
- Then, for each of the 3 classes, independently run non-max suppression to generate the final predictions:

![](Images/33.png)
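
The first prediction step, unpacking the grid output and removing low probability boxes, could look like the sketch below. It assumes the 3 x 3 x 16 layout from this example stored as nested lists, a `Pc` threshold of 0.6 borrowed from the non-max suppression section, and box centres expressed in grid units; the function name `decode` is made up.

```python
def decode(output, anchors=2, classes=3, threshold=0.6):
    """Turn a grid x grid x (anchors * (5 + classes)) output into detections."""
    slot = 5 + classes
    detections = []
    for row, cells in enumerate(output):
        for col, vector in enumerate(cells):
            for a in range(anchors):
                pc, bx, by, bh, bw, *class_probs = vector[a * slot:(a + 1) * slot]
                if pc < threshold:
                    continue  # low probability prediction, drop it
                class_index = class_probs.index(max(class_probs))
                # Centre in grid units: cell offset plus position inside the cell.
                detections.append((class_index, pc, (col + bx, row + by, bh, bw)))
    return detections


# Made-up network output: all zeros except one confident box.
output = [[[0.0] * 16 for _ in range(3)] for _ in range(3)]
output[1][2][8:16] = [0.9, 0.5, 0.25, 1.5, 0.5, 0.1, 0.8, 0.1]
boxes = decode(output)
```

The surviving detections would then be grouped by `class_index`, and non-max suppression run separately on each group.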

- Note that the YOLO algo is not good at detecting smaller objects.
- [YOLO9000 Better, faster, stronger](https://arxiv.org/abs/1612.08242)

  - Here is the summary of our model:

```
    ________________________________________________________________________________________
    Layer (type)                     Output Shape          Param #     Connected to                
    ========================================================================================
    input_1 (InputLayer)             (None, 608, 608, 3)   0                                 
    ________________________________________________________________________________________
    conv2d_1 (Conv2D)                (None, 608, 608, 32)  864         input_1[0][0]         
    ________________________________________________________________________________________
    batch_normalization_1 (BatchNorm (None, 608, 608, 32)  128         conv2d_1[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_1 (LeakyReLU)        (None, 608, 608, 32)  0     batch_normalization_1[0][0] 
    ________________________________________________________________________________________
    max_pooling2d_1 (MaxPooling2D)   (None, 304, 304, 32)  0           leaky_re_lu_1[0][0]   
    ________________________________________________________________________________________
    conv2d_2 (Conv2D)                (None, 304, 304, 64)  18432       max_pooling2d_1[0][0] 
    ________________________________________________________________________________________
    batch_normalization_2 (BatchNorm (None, 304, 304, 64)  256         conv2d_2[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_2 (LeakyReLU)        (None, 304, 304, 64)  0     batch_normalization_2[0][0] 
    _______________________________________________________________________________________
    max_pooling2d_2 (MaxPooling2D)   (None, 152, 152, 64)  0           leaky_re_lu_2[0][0]   
    ________________________________________________________________________________________
    conv2d_3 (Conv2D)                (None, 152, 152, 128) 73728       max_pooling2d_2[0][0] 
    ________________________________________________________________________________________
    batch_normalization_3 (BatchNorm (None, 152, 152, 128) 512         conv2d_3[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_3 (LeakyReLU)        (None, 152, 152, 128) 0     batch_normalization_3[0][0] 
    ________________________________________________________________________________________
    conv2d_4 (Conv2D)                (None, 152, 152, 64)  8192        leaky_re_lu_3[0][0]   
    ________________________________________________________________________________________
    batch_normalization_4 (BatchNorm (None, 152, 152, 64)  256         conv2d_4[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_4 (LeakyReLU)        (None, 152, 152, 64)  0     batch_normalization_4[0][0] 
    ________________________________________________________________________________________
    conv2d_5 (Conv2D)                (None, 152, 152, 128) 73728       leaky_re_lu_4[0][0]   
    ________________________________________________________________________________________
    batch_normalization_5 (BatchNorm (None, 152, 152, 128) 512         conv2d_5[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_5 (LeakyReLU)        (None, 152, 152, 128) 0     batch_normalization_5[0][0] 
    ________________________________________________________________________________________
    max_pooling2d_3 (MaxPooling2D)   (None, 76, 76, 128)   0           leaky_re_lu_5[0][0]   
    ________________________________________________________________________________________
    conv2d_6 (Conv2D)                (None, 76, 76, 256)   294912      max_pooling2d_3[0][0] 
    _______________________________________________________________________________________
    batch_normalization_6 (BatchNorm (None, 76, 76, 256)   1024        conv2d_6[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_6 (LeakyReLU)        (None, 76, 76, 256)   0     batch_normalization_6[0][0] 
    _______________________________________________________________________________________
    conv2d_7 (Conv2D)                (None, 76, 76, 128)   32768       leaky_re_lu_6[0][0]   
    ________________________________________________________________________________________
    batch_normalization_7 (BatchNorm (None, 76, 76, 128)   512         conv2d_7[0][0]       
    _______________________________________________________________________________________
    leaky_re_lu_7 (LeakyReLU)        (None, 76, 76, 128)   0     batch_normalization_7[0][0] 
    ________________________________________________________________________________________
    conv2d_8 (Conv2D)                (None, 76, 76, 256)   294912      leaky_re_lu_7[0][0]   
    ________________________________________________________________________________________
    batch_normalization_8 (BatchNorm (None, 76, 76, 256)   1024        conv2d_8[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_8 (LeakyReLU)        (None, 76, 76, 256)   0     batch_normalization_8[0][0] 
    ________________________________________________________________________________________
    max_pooling2d_4 (MaxPooling2D)   (None, 38, 38, 256)   0           leaky_re_lu_8[0][0]   
    ________________________________________________________________________________________
    conv2d_9 (Conv2D)                (None, 38, 38, 512)   1179648     max_pooling2d_4[0][0] 
    ________________________________________________________________________________________
    batch_normalization_9 (BatchNorm (None, 38, 38, 512)   2048        conv2d_9[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_9 (LeakyReLU)        (None, 38, 38, 512)   0     batch_normalization_9[0][0] 
    ________________________________________________________________________________________
    conv2d_10 (Conv2D)               (None, 38, 38, 256)   131072      leaky_re_lu_9[0][0]   
    ________________________________________________________________________________________
    batch_normalization_10 (BatchNor (None, 38, 38, 256)   1024        conv2d_10[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_10 (LeakyReLU)       (None, 38, 38, 256)   0    batch_normalization_10[0][0]
    ________________________________________________________________________________________
    conv2d_11 (Conv2D)               (None, 38, 38, 512)   1179648    leaky_re_lu_10[0][0]   
    ________________________________________________________________________________________
    batch_normalization_11 (BatchNor (None, 38, 38, 512)   2048        conv2d_11[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_11 (LeakyReLU)       (None, 38, 38, 512)   0    batch_normalization_11[0][0]
    _______________________________________________________________________________________
    conv2d_12 (Conv2D)               (None, 38, 38, 256)   131072      leaky_re_lu_11[0][0] 
    ________________________________________________________________________________________
    batch_normalization_12 (BatchNor (None, 38, 38, 256)   1024        conv2d_12[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_12 (LeakyReLU)       (None, 38, 38, 256)   0   batch_normalization_12[0][0]
    ________________________________________________________________________________________
    conv2d_13 (Conv2D)               (None, 38, 38, 512)   1179648     leaky_re_lu_12[0][0] 
    ________________________________________________________________________________________
    batch_normalization_13 (BatchNor (None, 38, 38, 512)   2048        conv2d_13[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_13 (LeakyReLU)       (None, 38, 38, 512)   0    batch_normalization_13[0][0]
    ________________________________________________________________________________________
    max_pooling2d_5 (MaxPooling2D)   (None, 19, 19, 512)   0           leaky_re_lu_13[0][0] 
    _______________________________________________________________________________________
    conv2d_14 (Conv2D)               (None, 19, 19, 1024)  4718592     max_pooling2d_5[0][0] 
    ________________________________________________________________________________________
    batch_normalization_14 (BatchNor (None, 19, 19, 1024)  4096        conv2d_14[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_14 (LeakyReLU)       (None, 19, 19, 1024)  0    batch_normalization_14[0][0]
    ________________________________________________________________________________________
    conv2d_15 (Conv2D)               (None, 19, 19, 512)   524288      leaky_re_lu_14[0][0] 
    ________________________________________________________________________________________
    batch_normalization_15 (BatchNor (None, 19, 19, 512)   2048        conv2d_15[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_15 (LeakyReLU)       (None, 19, 19, 512)   0    batch_normalization_15[0][0]
    ________________________________________________________________________________________
    conv2d_16 (Conv2D)               (None, 19, 19, 1024)  4718592     leaky_re_lu_15[0][0] 
    ________________________________________________________________________________________
    batch_normalization_16 (BatchNor (None, 19, 19, 1024)  4096        conv2d_16[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_16 (LeakyReLU)       (None, 19, 19, 1024)  0    batch_normalization_16[0][0]
    ________________________________________________________________________________________
    conv2d_17 (Conv2D)               (None, 19, 19, 512)   524288      leaky_re_lu_16[0][0] 
    ________________________________________________________________________________________
    batch_normalization_17 (BatchNor (None, 19, 19, 512)   2048        conv2d_17[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_17 (LeakyReLU)       (None, 19, 19, 512)   0    batch_normalization_17[0][0]
    _______________________________________________________________________________________
    conv2d_18 (Conv2D)               (None, 19, 19, 1024)  4718592     leaky_re_lu_17[0][0] 
    ________________________________________________________________________________________
    batch_normalization_18 (BatchNor (None, 19, 19, 1024)  4096        conv2d_18[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_18 (LeakyReLU)       (None, 19, 19, 1024)  0    batch_normalization_18[0][0]
    ________________________________________________________________________________________
    conv2d_19 (Conv2D)               (None, 19, 19, 1024)  9437184     leaky_re_lu_18[0][0] 
    ________________________________________________________________________________________
    batch_normalization_19 (BatchNor (None, 19, 19, 1024)  4096        conv2d_19[0][0]       
    ________________________________________________________________________________________
    conv2d_21 (Conv2D)               (None, 38, 38, 64)    32768       leaky_re_lu_13[0][0]
    ________________________________________________________________________________________
    leaky_re_lu_19 (LeakyReLU)       (None, 19, 19, 1024)  0    batch_normalization_19[0][0]
    ________________________________________________________________________________________
    batch_normalization_21 (BatchNor (None, 38, 38, 64)    256         conv2d_21[0][0]       
    ________________________________________________________________________________________
    conv2d_20 (Conv2D)               (None, 19, 19, 1024)  9437184     leaky_re_lu_19[0][0]
    ________________________________________________________________________________________
    leaky_re_lu_21 (LeakyReLU)       (None, 38, 38, 64)    0    batch_normalization_21[0][0]
    ________________________________________________________________________________________
    batch_normalization_20 (BatchNor (None, 19, 19, 1024)  4096        conv2d_20[0][0]       
    ________________________________________________________________________________________
    space_to_depth_x2 (Lambda)       (None, 19, 19, 256)   0           leaky_re_lu_21[0][0] 
    ________________________________________________________________________________________
    leaky_re_lu_20 (LeakyReLU)       (None, 19, 19, 1024)  0    batch_normalization_20[0][0]
    ________________________________________________________________________________________
    concatenate_1 (Concatenate)      (None, 19, 19, 1280)  0         space_to_depth_x2[0][0] 
                                                                      leaky_re_lu_20[0][0] 
    ________________________________________________________________________________________
    conv2d_22 (Conv2D)               (None, 19, 19, 1024)  11796480    concatenate_1[0][0]   
    ________________________________________________________________________________________
    batch_normalization_22 (BatchNor (None, 19, 19, 1024)  4096        conv2d_22[0][0]       
    ________________________________________________________________________________________
    leaky_re_lu_22 (LeakyReLU)       (None, 19, 19, 1024)  0    batch_normalization_22[0][0]
    ________________________________________________________________________________________
    conv2d_23 (Conv2D)               (None, 19, 19, 425)   435625      leaky_re_lu_22[0][0] 
    ===============================================================================================
    Total params: 50,983,561
    Trainable params: 50,962,889
    Non-trainable params: 20,672
    _______________________________________________________________________________________________
    ```
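
As a sanity check on the summary above, the parameter counts of the last few layers can be reproduced by hand. The sketch below rests on assumptions inferred from the counts rather than stated in the summary: convolutions followed by batch normalization use no bias, `conv2d_20` and `conv2d_22` use 3x3 kernels, `conv2d_21` and `conv2d_23` use 1x1 kernels, and `conv2d_21` receives 512 input channels. The helper names are hypothetical, and the 5 anchors / 80 classes split of the 425 output channels is also an assumption.

```python
def conv2d_params(kernel, in_ch, out_ch, bias):
    """Weights of a square-kernel conv layer, plus one bias per filter if used."""
    return kernel * kernel * in_ch * out_ch + (out_ch if bias else 0)


def batchnorm_params(channels):
    """gamma and beta (trainable) plus moving mean and variance (non-trainable)."""
    return 4 * channels


# Assumed kernel sizes, input channels and bias settings (inferred, see above).
conv2d_20 = conv2d_params(3, 1024, 1024, bias=False)
conv2d_21 = conv2d_params(1, 512, 64, bias=False)
bn_21 = batchnorm_params(64)
bn_22 = batchnorm_params(1024)

# space_to_depth_x2 folds each 2x2 spatial block into channels:
# (38, 38, 64) -> (19, 19, 64 * 4), which is then concatenated with 1024 channels.
space_to_depth_channels = 64 * 2 * 2
concat_channels = space_to_depth_channels + 1024

conv2d_22 = conv2d_params(3, concat_channels, 1024, bias=False)
conv2d_23 = conv2d_params(1, 1024, 425, bias=True)

# Assumption: 5 anchor boxes, each predicting pc, bx, by, bh, bw and 80 class scores.
output_channels = 5 * (5 + 80)

print(conv2d_20, conv2d_21, bn_21, bn_22)
print(concat_channels, conv2d_22, conv2d_23, output_channels)
```

Under these assumptions every value matches the corresponding row of the summary, which is a quick way to confirm how the layers are wired together.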

- You can find implementations for YOLO here:

  - https://github.com/allanzelener/YAD2K
  - https://github.com/thtrieu/darkflow
  - https://pjreddie.com/darknet/yolo/



# Region Proposals (R-CNN)
<a id='region-proposals-r-cnn'></a>

- R-CNN is another algorithm for object detection.

- The YOLO authors claim that YOLO is faster:

  - > Our model has several advantages over classifier-based systems. It looks at the whole image at test time so its predictions are informed by global context in the image. It also makes predictions with a single network evaluation unlike systems like R-CNN which require thousands for a single image. This makes it extremely fast, more than 1000x faster than R-CNN and 100x faster than Fast R-CNN. See our paper for more details on the full system.

- One downside of YOLO, however, is that it processes many areas of the image where no objects are present.

- **R-CNN** stands for Regions with CNNs (regions with ConvNets).

- R-CNN picks just a few candidate windows and runs a ConvNet classifier only on those.

- To pick the windows, R-CNN uses a segmentation algorithm, which outputs something like this:

  - ![](Images/34.png)

- If, for example, the segmentation algorithm produces 2000 blobs, then we run our classifier/CNN on each of these blobs.

- A lot of follow-up work has tried to make R-CNN faster:

  - R-CNN:
    - Propose regions. Classify proposed regions one at a time. Output label + bounding box.
    - The downside is that it's slow.
    - [[Girshick et al., 2013. Rich feature hierarchies for accurate object detection and semantic segmentation]](https://arxiv.org/abs/1311.2524)
  - Fast R-CNN:
    - Propose regions. Use the convolutional implementation of sliding windows to classify all the proposed regions.
    - [[Girshick, 2015. Fast R-CNN]](https://arxiv.org/abs/1504.08083)
  - Faster R-CNN:
    - Use a convolutional network to propose regions.
    - [[Ren et al., 2016. Faster R-CNN: Towards real-time object detection with region proposal networks]](https://arxiv.org/abs/1506.01497)
  - Mask R-CNN:
    - [[He et al., 2017. Mask R-CNN]](https://arxiv.org/abs/1703.06870)

- Most implementations of Faster R-CNN are still slower than YOLO.

- Andrew Ng thinks that the idea behind YOLO is better than R-CNN because it does everything in a single pass instead of two separate steps (propose regions, then classify them).

- Other algorithms that produce the output in one shot include **SSD** and **MultiBox**.

  - [[Liu et al., 2015. SSD: Single Shot MultiBox Detector]](https://arxiv.org/abs/1512.02325)

- **R-FCN** is similar to Faster R-CNN but more efficient.

  - [[Dai et al., 2016. R-FCN: Object Detection via Region-based Fully Convolutional Networks]](https://arxiv.org/abs/1605.06409)
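
The basic R-CNN recipe described above (propose regions, then classify them one at a time) can be sketched as a two-stage loop. This is a toy illustration under loud assumptions, not the authors' implementation: `propose_regions` is a hypothetical stand-in that returns a coarse grid of windows instead of running a segmentation algorithm, and `classify_region` is a hypothetical stand-in for the ConvNet that simply thresholds the mean pixel intensity.

```python
import numpy as np


def propose_regions(image, size=4, stride=4):
    """Hypothetical proposal step: a coarse grid of (x1, y1, x2, y2) windows.

    Real R-CNN derives its proposals from a segmentation algorithm instead.
    """
    h, w = image.shape[:2]
    return [(x, y, x + size, y + size)
            for y in range(0, h - size + 1, stride)
            for x in range(0, w - size + 1, stride)]


def classify_region(crop):
    """Hypothetical classifier: mean intensity plays the role of a ConvNet score."""
    score = float(crop.mean())
    label = "object" if score > 0.5 else "background"
    return label, score


def rcnn_style_detect(image):
    """Stage 1: propose regions. Stage 2: classify each region one at a time."""
    detections = []
    for (x1, y1, x2, y2) in propose_regions(image):
        label, score = classify_region(image[y1:y2, x1:x2])
        if label != "background":
            detections.append(((x1, y1, x2, y2), label, score))
    return detections


# Toy 8x8 image with one bright 4x4 patch acting as the object.
image = np.zeros((8, 8))
image[4:8, 0:4] = 1.0

detections = rcnn_style_detect(image)
print(detections)
```

The per-region loop in `rcnn_style_detect` is the part that makes the original R-CNN slow with around 2000 proposals. Fast R-CNN and Faster R-CNN attack that loop and the proposal step respectively, while YOLO avoids the two-stage split altogether.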