<a href="https://colab.research.google.com/github/lalitpandey02/PythonNotebooks/blob/main/1_Object_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

**<center><h1>Object Detection</h1></center>**

---
# **Table of Contents**
---

**1.** [**Categorization of Object Detection Tasks**](#section1)<br>
  - **1.1** [**Image Classification**](#section101)
  - **1.2** [**Object Classification and Localization**](#section102)
  - **1.3** [**Multiple Objects Detection and Localization**](#section103)
  
**2.** [**YOLO: You Only Look Once**](#section2)<br>
**3.** [**SSD: Single Shot MultiBox Detection**](#section3)<br>
**4.** [**Conclusion**](#section4)

---
<a name = Section1></a>
# **1. Categorization of Object Detection Tasks**
---

<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/cv_tasks.png" width="900" height="450"/></center>

<a id=section101></a>
### **1.1 Image Classification**

- This is the most **common** computer vision problem where an algorithm looks at an image and **classifies** the **object** in it.

- Image classification has a **wide variety** of applications, ranging from **face detection** on social networks to **cancer detection** in medicine.

- Such problems are typically modeled using **Convolutional Neural Nets** (CNNs).

<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/img_classification.png" width="900" height="500"/></center>

<a id=section102></a>
### **1.2 Object Classification and Localization**

- Let’s say we not only want to know whether there is a **cat** in the image, but also **where** exactly the cat is.

- Object **localization** algorithms not only **label** the **class** of an object, but also **draw** a **bounding box** around the **position** of the object in the image.

<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/object_localization.png" width="900" height="500"/></center>

- Now, to make our model **draw** the **bounding boxes** of an object, we just change the output labels from the previous algorithm, so as to make our **model learn** the **class** of object and also the **position** of the object in the image. 

- We **add 4** more **numbers** to the **output layer**: the **centroid position** (x, y) of the object, and the **width** and **height** of the **bounding box** expressed as proportions of the image size.

<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/simple_right.jpg" width="400" height="300"/></center>

- Just **add** a bunch of **output units** to spit out the **x, y coordinates** of different positions you want to **recognize**.

- These different **positions**, or landmarks, would be **consistent** for a particular object across all the images we have.

- For example, for a car, the **height** would be **smaller than the width**, and the **centroid** would have some specific **pixel density** compared to other points in the image.
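
As a minimal sketch of the label layout described above, the snippet below builds a target vector made of a one-hot class part followed by the four box numbers (centroid x, centroid y, width, height), each expressed as a fraction of the image size. The function name `make_target`, the class list and the pixel-corner input format are assumptions made for illustration only.

```python
def make_target(class_names, label, box, image_w, image_h):
    """Build [one-hot class..., cx, cy, w, h] with box values in [0, 1].

    `box` is (x_min, y_min, x_max, y_max) in pixels (an assumed input format).
    """
    x_min, y_min, x_max, y_max = box
    one_hot = [1.0 if name == label else 0.0 for name in class_names]
    cx = (x_min + x_max) / 2 / image_w   # centroid x as a fraction of image width
    cy = (y_min + y_max) / 2 / image_h   # centroid y as a fraction of image height
    w = (x_max - x_min) / image_w        # box width as a fraction of image width
    h = (y_max - y_min) / image_h        # box height as a fraction of image height
    return one_hot + [cx, cy, w, h]


classes = ["cat", "dog"]   # hypothetical class list
target = make_target(classes, "cat", (100, 50, 300, 250), image_w=400, image_h=400)
print(target)   # [1.0, 0.0, 0.5, 0.375, 0.5, 0.5]
```

Normalizing by the image size keeps the four box outputs in the same range regardless of the image resolution.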

<a id=section103></a>
### **1.3 Multiple Objects Detection and Localization**

- What if there are **multiple objects** in the image (3 dogs and 2 cats, as in the figure above) and we want to **detect** them **all**?

- That would be an **Object Detection and Localization** problem. 

- A well known application of this is in **self-driving cars** where the algorithm not only needs to detect the cars, but also pedestrians, motorcycles, trees and other **objects** in the frame.

- These kinds of problems need to **leverage** the ideas or concepts learnt from **image classification** as well as from **object localization**.

<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/multiple_object_localization.png" width="900" height="500"/></center>

- To **detect all** kinds of **objects** in an image, we can directly use what we learnt so far from object localization.

- The difference is that we want our algorithm to be able to **classify** and **localize** all the **objects** in an image, not just one.

- So the idea is to **crop** the **image** into multiple smaller images and **run a CNN** on each cropped image to **detect** an object.

- This solution is known as **Object Detection** with Sliding Windows. 
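
The cropping step can be sketched as follows. This is only an illustration under assumed values: the helper name `sliding_windows`, the window size and the stride are hypothetical, and the CNN itself is left out.

```python
def sliding_windows(image_w, image_h, window, stride):
    """Yield (x_min, y_min, x_max, y_max) for every square window position."""
    for y in range(0, image_h - window + 1, stride):
        for x in range(0, image_w - window + 1, stride):
            yield (x, y, x + window, y + window)


crops = list(sliding_windows(image_w=8, image_h=6, window=4, stride=2))
print(len(crops))             # 6: each crop would need its own CNN forward pass
print(crops[0], crops[-1])    # (0, 0, 4, 4) (4, 2, 8, 6)
```

Even this tiny 8x6 example produces 6 crops for a single window size, which hints at how quickly the number of forward passes grows on real images.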

<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/sliding_window.gif" width="500" height="350"/></center>

- It is a very basic solution with several caveats:


- **Computationally expensive**: Cropping multiple images and passing them through a ConvNet one by one is computationally very expensive.

  - **Solution**: There is a simple trick to **improve** the computational efficiency of the sliding window method.
  
  - It is to **replace** the **fully connected layers** in the ConvNet with equivalent **convolution layers** (such as **1x1 convolutions**) and, for a given window size, pass the input image through the network only once.
  
  - So, in the actual implementation we do not pass the cropped images one at a time; we **pass** the **complete image** at once.


- **Inaccurate bounding boxes**: We slide square windows all over the image, but the object may be **rectangular**, or none of the **squares** may match the **actual size** of the object. 

  - Although this algorithm is able to find and **localize** multiple objects in an image, the **accuracy** of its **bounding boxes** is still **poor**.

- So, how can we make our **algorithm better** and **faster**?

- It turns out that we have **YOLO (You Only Look Once)** which is much more **accurate** and **faster** than the sliding window algorithm.

---
<a name = Section2></a>
# **2. YOLO: You Only Look Once**
---

- It is based on only a **minor tweak** on top of the algorithms that we already know.

- The idea is to **divide** the **image into a grid of multiple cells**. 

- Then we **change** the **labels** of our data such that we apply both the **localization** and the **classification** algorithm to **each grid cell**.

<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/yolo.png" width="900" height="500"/></center>

#### **YOLO, in easy steps:**

1. **Divide the image into multiple grids**. 

  - For illustration, we have drawn a 4x4 grid in the figure above, but actual implementations of **YOLO** use a different number of grid cells (**7x7** for **training** YOLO on the PASCAL VOC dataset).

2. **Label the training data** as shown in the above figure. 

  - If $C$ is the number of **unique** object classes in our data and $S*S$ is the number of grid cells into which we split our image, then our **output vector** will be of **length** $S*S*(C+5)$. 
  
  - For example, in the case above our target vector has length $4*4*(3+5) = 128$, as we divided our images into a 4x4 grid and are **training** for 3 unique object classes: Car, Light and Pedestrian.

3. Train **one deep convolutional** neural net whose **loss function** is the **error between the output activations** and the label vector. 

  - Basically, the model predicts the **output** for all the grid cells in just **one forward pass** of the input image through the **ConvNet**.

4. Keep in mind that the **label** for an object being present in a grid cell (**`P.Object`**) is determined by the presence of the **object’s centroid** in that cell. 

  - This is **important** so that one object is not **counted multiple times** in different grid cells.
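
The labelling steps above can be sketched in plain Python. This is a simplified illustration, not the reference YOLO encoding: the function name `encode_yolo_target`, the per-cell layout `[p_object, cx, cy, w, h, classes...]` and the choice to keep box values relative to the whole image are all assumptions.

```python
def encode_yolo_target(objects, S, C):
    """Return an S x S grid of label vectors, each of length C + 5.

    Each object is (class_index, cx, cy, w, h) with values relative to the
    whole image, in [0, 1). Assumed per-cell layout:
    [p_object, cx, cy, w, h, class_0, ..., class_{C-1}].
    """
    grid = [[[0.0] * (C + 5) for _ in range(S)] for _ in range(S)]
    for class_index, cx, cy, w, h in objects:
        col = int(cx * S)   # the cell containing the centroid owns the object
        row = int(cy * S)
        cell = grid[row][col]
        cell[0] = 1.0                 # p_object
        cell[1:5] = [cx, cy, w, h]    # box numbers
        cell[5 + class_index] = 1.0   # one-hot class
    return grid


S, C = 4, 3   # 4x4 grid; classes: 0 = car, 1 = light, 2 = pedestrian
target = encode_yolo_target([(0, 0.30, 0.80, 0.40, 0.25)], S, C)
flat = [value for row in target for cell in row for value in cell]
print(len(flat))      # 128, i.e. S * S * (C + 5)
print(target[3][1])   # the only cell with p_object = 1
```

Note that a second object whose centroid falls in the same cell would overwrite the first one in this sketch, which is exactly the limitation discussed in the caveats.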

#### **Caveats of YOLO and their solutions:**

1. **Can’t detect multiple objects in same grid**.

  - This issue can be solved by choosing **smaller grid size**. 
  
  - But even with a **smaller** grid size, the algorithm can still fail in cases where **objects** are very close to each other, such as an image of a **flock** of birds.

- **Solution**: **Anchor boxes**. 
  
  - Instead of having $5+C$ labels for each grid cell (where $C$ is the number of distinct object classes), the idea of anchor boxes is to have $(5+C)*A$ labels for each grid cell, where $A$ is the number of anchor boxes **required**. 
  
  - If one **object** is assigned to one anchor box in **one grid cell**, another object can be assigned to the other **anchor box** of the same cell.
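
To make the $(5+C)*A$ layout concrete, here is a small sketch of how one grid cell's label vector could be split into per-anchor slots. The helper names `label_length` and `anchor_slice`, the slot ordering and the example values are hypothetical.

```python
def label_length(S, C, A):
    """Total number of labels for an S x S grid with A anchor boxes per cell."""
    return S * S * A * (5 + C)


def anchor_slice(C, anchor_index):
    """Slice of one cell's label vector that belongs to the given anchor box."""
    size = 5 + C
    start = anchor_index * size
    return slice(start, start + size)


C, A = 3, 2
cell = [0.0] * ((5 + C) * A)
# Two objects whose centroids share a cell go to different anchor slots.
cell[anchor_slice(C, 0)] = [1.0, 0.5, 0.5, 0.2, 0.6, 0.0, 0.0, 1.0]   # tall box
cell[anchor_slice(C, 1)] = [1.0, 0.5, 0.5, 0.7, 0.3, 1.0, 0.0, 0.0]   # wide box
print(len(cell), label_length(S=4, C=C, A=A))   # 16 256
```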

<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/yolo_anchor_boxes.png" width="850" height="550"/></center>

**2.** **Possibility to detect one object multiple times**.

- **Solution**: **Non-max suppression**. 
  
  - Non-max suppression **removes** the **low probability bounding boxes** that overlap heavily with a higher probability bounding box.
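
A minimal, class-agnostic sketch of non-max suppression is shown below. Overlap is measured here with intersection over union (IoU); the 0.5 threshold, the corner box format and the example boxes are assumptions, not values taken from any particular YOLO release.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.6, 0.8]
print(non_max_suppression(boxes, scores))   # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82 and has a lower score, so it is dropped, while the distant third box is kept.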

---
<a name = Section3></a>
# **3. SSD: Single Shot MultiBox Detection**
---

- **Reasons behind the origin of SSD algorithm:**

1. **How do you know the size of the sliding window so that it always contains the object?** 

  - Different types of **objects** (a palm tree and a swimming pool), and even objects of the same type (e.g. a small building and a large building), can be of **varying** sizes.

2. **Aspect ratio** (the ratio of height to width of a bounding box). 

  - Objects can come in **various shapes**: for example, a building footprint will have a different **aspect ratio** than a palm tree.

- To solve these problems, we would have to try out **different sizes/shapes** of **sliding window**, which is very **computationally intensive**, especially with a deep neural network.

#### **Single-Shot Detector (SSD)**

- SSD has two components: a **backbone model** and an **SSD head**.

- The **backbone** model is usually a **pre-trained image classification network** used as a feature extractor. 

- This is typically a network like **ResNet** trained on ImageNet from which the final fully connected classification layer has been removed. 

- We are thus left with a deep neural network that is able to **extract semantic meaning** from the input image while **preserving** the **spatial structure** of the image albeit at a lower resolution.

- For ResNet34, the backbone produces **256 feature maps of size 7x7** for an input image. 

- The **SSD head** is just one or more **convolutional layers** added to this backbone, and its outputs are interpreted as the **bounding boxes** and classes of objects in the spatial location of the final layer's activations.

- In the figure below, the first few layers (**white boxes**) are the **backbone**, the last few layers (**blue boxes**) represent the **SSD head**.

<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/ssd.png" width="850" height="250"/></center>

#### **Grid Cell**

- Instead of using a **sliding window**, SSD divides the image using a **grid** and has each grid cell be responsible for detecting objects in that **region** of the image.

- **Detecting** objects simply means **predicting** the **class** and **location** of an object within that region.

- If **no object** is present, we consider it as the **background** class and the **location** is **ignored**.

- For instance, we could use a **4x4 grid** in the example below. 

- Each grid cell is able to output the **position** and shape of the object it contains.

<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/grid_cell.png" width="400" height="350"/></center>

- Now what if there are **multiple objects** in one grid cell, or we need to detect multiple objects of different shapes? 

- This is where anchor **boxes** and the **receptive** field come into play.

#### **Anchor Box**

- Each grid cell in SSD can be assigned **multiple anchor/prior** boxes.

- These anchor boxes are **pre-defined** and each one is responsible for a **size** and **shape** within a grid cell.

- For example, the swimming pool in the image below corresponds to the taller **anchor box** while the building corresponds to the **wider box**.

<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/anchor_box.png" width="600" height="350"/></center>

- SSD uses a matching phase while training, to **match** the **appropriate anchor box** with the bounding boxes of each ground truth object within an image.

- Essentially, the anchor box with the **highest degree** of **overlap** with an object is responsible for **predicting** that object’s class and its location.

- This property is used for **training** the network and for **predicting** the detected objects and their **locations** once the network has been trained.

- In practice, each **anchor box** is specified by an **aspect ratio** and a **zoom level**.
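
The matching phase described above can be sketched as follows. This covers only the "highest overlap wins" rule stated here, with overlap measured as intersection over union (IoU); the function name `match_anchors` and the example boxes are hypothetical, and a full SSD training pipeline may apply additional matching rules.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, ground_truth):
    """For each ground-truth box, return the index of the best-overlapping anchor."""
    return [max(range(len(anchors)), key=lambda i: iou(anchors[i], gt))
            for gt in ground_truth]


# Two anchors centred on the same cell: one tall, one wide (normalized coords).
anchors = [(0.4, 0.2, 0.6, 0.8), (0.2, 0.4, 0.8, 0.6)]
ground_truth = [(0.42, 0.25, 0.58, 0.75),   # tall object
                (0.25, 0.42, 0.80, 0.60)]   # wide object
print(match_anchors(anchors, ground_truth))   # [0, 1]
```

The tall object is matched to the tall anchor and the wide object to the wide anchor, so each anchor becomes responsible for predicting one of them.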

#### **Aspect Ratio**

- Not all objects are **square** in shape.

- Some are **longer** and some are **wider**, by varying degrees.

- The SSD architecture allows **pre-defined aspect ratios** of the anchor boxes to account for this.

- The **ratios parameter** can be used to specify the **different aspect ratios** of the anchor boxes **associated** with each grid cell at each **zoom/scale level**.

<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/aspect_ratio.png" width="400" height="350"/></center>
<br> 
<center><strong>The bounding box of building 1 is taller, while the bounding box for building 2 is wider</strong></center>

#### **Zoom Level**

- It is not necessary for the **anchor boxes** to have the **same size** as the grid cell.

- We might be interested in finding **smaller** or **larger** objects within a grid cell.

- The zooms parameter is used to **specify** how much the anchor boxes need to be **scaled up** or **down** with respect to each grid cell.

- As we have seen in the anchor box example, a **building** is generally larger than a swimming pool.
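
Putting aspect ratios and zoom levels together, a set of anchor boxes for a grid could be generated roughly like this. It is a sketch under assumptions: the function name `anchor_boxes`, the `(height_factor, width_factor)` encoding of ratios and the example values are hypothetical and not taken from a specific SSD implementation.

```python
def anchor_boxes(grid_size, ratios, zooms):
    """Return (cx, cy, w, h) anchors, normalized to the image, for every cell.

    `ratios` holds (height_factor, width_factor) pairs and `zooms` holds
    scale factors relative to one grid cell (both hypothetical parameters).
    """
    cell = 1.0 / grid_size
    anchors = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) * cell   # every anchor is centred on its cell
            cy = (row + 0.5) * cell
            for zoom in zooms:
                for h_factor, w_factor in ratios:
                    width = cell * zoom * w_factor
                    height = cell * zoom * h_factor
                    anchors.append((cx, cy, width, height))
    return anchors


anchors = anchor_boxes(grid_size=4,
                       ratios=[(1.0, 1.0), (1.0, 0.5), (0.5, 1.0)],
                       zooms=[0.7, 1.0, 1.3])
print(len(anchors))   # 144 = 4 * 4 cells * 3 zooms * 3 ratios
```

Each cell therefore carries one anchor per (zoom, ratio) combination, all sharing the cell's centre.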

#### **Receptive Field**

- The receptive field is defined as the **region** in the input space that a particular **CNN** feature is looking at (i.e. is affected by).

- Because of the convolution operation, **features** at different layers represent **different sizes** of region in the **input image**.

- As it goes deeper, the **size** represented by a feature gets **larger**.

- In the example below, we start with the **bottom layer** (5x5) and apply a convolution that results in the **middle layer** (3x3), where one feature (green pixel) represents a 3x3 region of the **input layer** (bottom layer).

- We then apply the **convolution** to the middle layer and get the top layer (2x2), where each feature corresponds to a **7x7 region** on the input image.
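
The 5x5 → 3x3 → 2x2 example above can be checked with a little receptive field arithmetic. The kernel size, stride and padding below are assumptions chosen because they reproduce those layer sizes; the notebook does not state them.

```python
def conv_output_size(n, kernel, stride, padding):
    """Spatial size of a convolution output for an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1


def receptive_fields(num_layers, kernel, stride):
    """Receptive field size (in input pixels) after each identical conv layer."""
    field, jump, sizes = 1, 1, []
    for _ in range(num_layers):
        field += (kernel - 1) * jump   # each layer widens the field by (k - 1) * jump
        jump *= stride                 # spacing of adjacent features, in input pixels
        sizes.append(field)
    return sizes


# Assumed settings that reproduce the 5x5 -> 3x3 -> 2x2 example.
kernel, stride, padding = 3, 2, 1
middle = conv_output_size(5, kernel, stride, padding)
top = conv_output_size(middle, kernel, stride, padding)
print(middle, top)                           # 3 2
print(receptive_fields(2, kernel, stride))   # [3, 7]
```

Under these assumptions the 7x7 receptive field of a top-layer feature is larger than the 5x5 input because it also counts the padded border.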

<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/receptive_field.png" width="760" height="450"/></center>
<br> 
<center><strong>Visualizing CNN feature maps and receptive field</strong></center>

- These green and orange 2D arrays are also called **feature maps**, which refer to a set of features created by applying the same **feature extractor** at **different locations** of the input map in a sliding window fashion.

- Features in the same feature map have the **same receptive field** and look for the **same pattern** but at **different locations**.

- This creates the **spatial invariance** of ConvNet.

- Receptive field is the central premise of the **SSD architecture** as it enables us to detect objects at **different scales** and output a **tighter bounding box**.

<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/why.jpg" width="500" height="300"/></center>

- As you might remember, the **ResNet34** backbone outputs 256 feature maps of size **7x7** for an **input** image.

- If we specify a **4x4** grid, the simplest approach is just to **apply** a convolution to this **feature map** and convert it to 4x4.

- This approach can actually work to some **extent** and is exactly the idea of **YOLO** (You Only Look Once).

- The extra step taken by SSD is that it applies more **convolutional** layers to the backbone **feature map** and has each of these convolution layers **output** its own object detection **results**.

- As earlier layers have a **smaller receptive field** and can therefore represent smaller objects, predictions from earlier layers **help** in dealing with **smaller sized objects**.

- Because of this, SSD allows us to define a **hierarchy** of **grid cells** at different layers.

- For example, we could use a **4x4 grid** to find smaller objects, a **2x2 grid** to find mid **sized** objects and a **1x1 grid** to find **objects** that cover the entire image.
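
As a quick illustration of what such a hierarchy implies, the snippet below counts how many anchor boxes are predicted in a single pass across the 4x4, 2x2 and 1x1 grids. The helper name `total_predictions` and the choice of 9 anchors per cell are hypothetical.

```python
def total_predictions(grid_sizes, anchors_per_cell):
    """Number of anchor boxes predicted across all grids in the hierarchy."""
    return sum(size * size * anchors_per_cell for size in grid_sizes)


grid_sizes = [4, 2, 1]   # small, mid-sized and image-sized objects
print(total_predictions(grid_sizes, anchors_per_cell=1))   # 21
print(total_predictions(grid_sizes, anchors_per_cell=9))   # 189
```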

#### **SSD Advantages**

- SSD is a **single-shot detector**. 

- It has no dedicated **region proposal** network and predicts the bounding boxes and the classes directly from feature maps in **one single pass**.

- To **improve accuracy**, SSD introduces:

  - **Small convolutional filters** to predict object classes and offsets to default bounding boxes.
  
  - **Separate filters** for default boxes to handle the difference in aspect ratios.
  
  - **Multi-scale feature maps** for object detection.

- SSD can be trained **end-to-end** for better accuracy.

- It makes more **predictions** and has a **better coverage** on location, scale and aspect ratios.

- With the **improvements** above, SSD can **lower** the **input image resolution** to 300 × 300 while keeping comparable accuracy.

- By removing the **dedicated** region proposal step and using lower **resolution** images, the model can **run at real-time speed** and, as reported when SSD was introduced, still beat the accuracy of the then **state-of-the-art** Faster R-CNN.

---
<a name = Section4></a>
# **4. Conclusion**
---

- **Performance Comparison on COCO Object Detection Dataset**

<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/yolo_v3_performance.png" width="900" height="550"/></center>

- **YOLOv3** is an **updated** version of the YOLO architecture with some massive improvements.

  - It is extremely **fast** and **accurate**.

  - In mAP measured at 0.5 IOU, **YOLOv3** is on par with **Focal Loss** (RetinaNet) but about **4x faster**.

  - Moreover, you can easily **trade off** between speed and **accuracy** simply by **changing** the **size** of the model, no retraining required!

- From the plot above, we can see that **YOLOv3 outperforms SSD** in both speed and **accuracy**.

- We will study both these **architectures** **in depth** next.