## Chapter: Introduction to Object Detection
- Topics
    - Object Detection vs Classification
    - Use cases
    - Object Detection using computer vision

## 1. Object Detection

Object detection is the process of identifying and locating instances of objects in an image. The fundamental difference from image classification is this: in image classification we only say whether a particular object (single-label) or any of a list of objects (multi-label) is present in the image, whereas in object detection we locate every instance of each object and predict both its name (class) and its location, given as rectangular coordinates (x1, y1, x2, y2). The following diagram makes this clearer.

![Image classification vs Object detection](../images/cows.png "Title")
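
To make the difference concrete, here is a minimal sketch of what the two kinds of output might look like as data. The `Detection` class, the scores and the coordinates are illustrative assumptions, not the output of any particular model.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # predicted class name
    score: float  # confidence in [0, 1]
    box: tuple    # (x1, y1, x2, y2): top-left and bottom-right corners in pixels


# Image classification: one answer for the whole image (made-up score)
classification_output = {"cow": 0.97}

# Object detection: one entry per object instance (made-up boxes)
detection_output = [
    Detection("cow", 0.95, (34, 120, 210, 300)),
    Detection("cow", 0.91, (250, 110, 430, 310)),
]


def box_area(box):
    """Area in pixels of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


for det in detection_output:
    print(det.label, det.score, det.box, box_area(det.box))
```

The classifier only reports that a cow is somewhere in the image, while the detector returns one box per cow.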

## 2. Why object detection?
Object detection has many use cases and is widely used across industries.

### Use case 1
In China, about 200 million surveillance cameras are reportedly already scattered around the country. They are used to track big spenders in luxury retail stores, catch identity thieves, prevent violent crime, find fugitives, catch sleeping students in the classroom and even snag jaywalkers. In fact, nearly every one of its 1.4 billion citizens is in China’s facial recognition database.

The first step in facial recognition is to accurately localize every face, and this is where object detection comes into the picture. The diagram below shows the output of an object detection model trained to identify human faces in an image. (Later these faces are vectorized and matched against existing databases.)

Companies like [YITU](https://www.yitutech.com/en), [SenseTime](https://en.wikipedia.org/wiki/SenseTime) and [Face++](https://www.faceplusplus.com/) are all multi-billion-dollar companies that have invested heavily in these technologies.
![Face Detection](../images/faces.png)
Source: [link](https://arxiv.org/pdf/1711.07246.pdf)


### Use case 2
Self-driving car technology relies heavily on object detection to identify other vehicles, pedestrians, traffic lights, etc., and so make sense of the surroundings.

[Waymo](https://waymo.com/) (an Alphabet company) and [Tesla Autopilot](https://www.tesla.com/autopilot) use object detection modules to navigate roads safely.

Click on this [link](https://www.youtube.com/watch?v=VF8JuQwKQmU) to see how object detection works on roads.


### Other use cases
- In insurance, object detection is used to localize and identify damage to vehicles, buildings, etc.
- In retail stores, object detection is used to track shoppers' behaviour across the store.
- The army, navy and other government institutions use it to track and monitor intruders.
- In manufacturing, object detection is used to locate defects in a product.

In short, whatever industry you look at, object detection is used in some form, and it is a major research topic in the **Deep Learning** community.

## 3. Object Detection using computer vision
The object detection workflow looks complex, yet it is simple once you understand what is happening under the hood. Before jumping into how deep-learning-based object detection frameworks work, let's look at how object detection was performed with traditional computer vision before the deep learning era.

In simple terms, object detection means searching for objects inside an image. What is the simplest way of searching? We will use a technique called the sliding window approach.

### 3.1 What is a sliding window?
In the context of computer vision, a sliding window is a rectangular region of fixed width and height that slides across an image, as in the following figure.

![SlidingWindow](gifs/sliding_window.gif "positive")

Let's look at how to generate the sliding windows shown above. Borrowing deep learning terminology, we have
- kernel size: the size of the window, given as (width, height)
- stride: the number of pixels to move before taking the next window

The following code generates windows given an image, a stride and a kernel_size.
```python
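def sliding_window(image, stride, kernel_size):
    """Yield (x, y, window) tuples; kernel_size is (width, height)."""
    # slide a window across the image, row by row
    for y in range(0, image.shape[0], stride):
        for x in range(0, image.shape[1], stride):
            # yield the current window; windows at the right and bottom
            # borders may be smaller than kernel_size
            yield (x, y, image[y:y + kernel_size[1], x:x + kernel_size[0]])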
```

To visualize the windows that a classifier would later inspect for objects, we can use OpenCV's rectangle drawing function. We will use a kernel_size of (128, 128) and a stride of 32 in this example.

```python
import cv2
import imageio  # for saving the frames as a GIF

# Read the image (OpenCV loads it in BGR channel order)
img = cv2.imread("../images/cow.jpg")

# Convert BGR to RGB by reversing the channel axis
img = img[:, :, ::-1]

# kernel_size is (width, height); stride is in pixels
kernel_size = (128, 128)
stride = 32

imgs = []
for (x, y, window) in sliding_window(img, stride=stride, kernel_size=kernel_size):
    # skip windows that run past the image border and are therefore too small
    # (window.shape is (height, width, channels))
    if window.shape[0] != kernel_size[1] or window.shape[1] != kernel_size[0]:
        continue
    # since we do not have a classifier yet, we just draw the window
    clone = img.copy()
    cv2.rectangle(clone, (x, y), (x + kernel_size[0], y + kernel_size[1]), (0, 255, 0), 2)
    imgs.append(clone)

print("total_sliding_windows: {}".format(len(imgs)))
# 170 sliding windows for this image

# Save all frames as an animated GIF
imageio.mimsave("movie.gif", imgs)
```

In total there are 170 sliding window images. In the next section we will see how to move from sliding windows to object detection.

Note: [Imageio](https://imageio.github.io/) is a Python library that provides an easy interface to read and write a wide range of image data, including animated images, video, volumetric data, and scientific formats. It is cross-platform and easy to install. Interested readers can read more about the library at the link given above.

## Task
How many valid windows are there in an image of size (800, 800) when kernel_size is (64, 64), stride is 64 and there is no padding?

Note: All windows must have shape (64, 64). Since there is no padding, windows that would extend past the image border are discarded.

Answer: 144

Solution:
```python
import numpy as np

image = np.zeros((800, 800, 3))
kernel_size = (64, 64)  # (width, height)
stride = 64

windows = []
for y in range(0, image.shape[0], stride):
    for x in range(0, image.shape[1], stride):
        # skip windows that would extend past the right or bottom edge
        if x + kernel_size[0] > image.shape[1] or y + kernel_size[1] > image.shape[0]:
            continue
        window = (x, y, image[y:y + kernel_size[1], x:x + kernel_size[0]])
        windows.append(window)
print(len(windows))  # 144
```
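
The same count can be cross-checked without looping over the image. With no padding, each axis holds `(size - kernel) // stride + 1` window positions. The sketch below applies that formula; `count_windows` is a helper name introduced here for illustration.

```python
def count_windows(image_shape, kernel_size, stride):
    """Number of full-size windows on an image without padding.

    image_shape is (height, width, ...); kernel_size is (width, height).
    """
    height, width = image_shape[:2]
    k_w, k_h = kernel_size
    if k_w > width or k_h > height:
        return 0  # the window does not fit at all
    n_x = (width - k_w) // stride + 1   # positions along the x axis
    n_y = (height - k_h) // stride + 1  # positions along the y axis
    return n_x * n_y


print(count_windows((800, 800, 3), (64, 64), 64))  # 144
```

Here each axis has (800 - 64) // 64 + 1 = 12 positions, so there are 12 * 12 = 144 windows.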

### 3.2 From sliding windows to object detection
Once the sliding windows are generated, detection becomes a simple classification problem. We vectorize every sliding window using traditional computer vision techniques such as [HOG descriptors](https://gurus.pyimagesearch.com/lesson-sample-histogram-of-oriented-gradients-and-car-logo-recognition/), then pass the feature vector to a classifier such as an SVM, which tells us whether a cow is present (1) or not (0) in the window.

![object detection using sliding window](../images/ob_sw_process.png)
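
The inference loop in the diagram can be sketched as follows. This is a minimal illustration: `score_fn` is a hypothetical stand-in for the "HOG descriptor + SVM" step, and the toy windows and the brightness-based scorer exist only to make the example runnable.

```python
def detect(windows, score_fn, kernel_size, threshold=0.5):
    """Score every window and keep those the classifier accepts.

    windows     : iterable of (x, y, window) tuples, e.g. from sliding_window
    score_fn    : hypothetical callable returning a confidence in [0, 1]
    kernel_size : (width, height) of each window
    """
    detections = []
    for x, y, window in windows:
        score = score_fn(window)
        if score >= threshold:
            # store the box as (x1, y1, x2, y2, score)
            detections.append((x, y, x + kernel_size[0], y + kernel_size[1], score))
    return detections


def mean_brightness(window):
    """Toy scorer (NOT HOG + SVM): average pixel value of the window."""
    pixels = [p for row in window for p in row]
    return sum(pixels) / len(pixels)


# Two toy 2x2 windows: a dark one and a bright one
toy_windows = [
    (0, 0, [[0.1, 0.2], [0.0, 0.1]]),
    (2, 0, [[0.9, 0.8], [1.0, 0.9]]),
]
detections = detect(toy_windows, mean_brightness, kernel_size=(2, 2), threshold=0.5)
print(detections)
```

In a real pipeline, `score_fn` would compute the feature vector of the window and return the trained classifier's confidence for the cow class.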

The process described above is for inference. Now let's look at how to develop something like this from scratch. Say we want to build a cow detector.
- The first step is to collect a dataset that contains cows. These cows appear in different sizes, in different groups, at different locations, etc.
- Using any [image annotation tool](https://www.quora.com/What-is-the-best-image-labeling-tool-for-object-detection), we annotate the cows in the dataset.
- Next we generate sliding windows on each image to produce background images and cow images for training the classifier. A window is labelled using an IoU (intersection over union) threshold, say 0.3: windows that overlap an annotated cow by at least the threshold are positive, the rest are background. In the diagram below, the green boxes are background and the red boxes are positive.
![SlidingWindow2](gifs/sliding_window_with_label.gif "label")
- Once the classifier is trained and the accuracy metrics are satisfactory, we ship the model.
- During inference, multiple sliding windows may predict the same cow. In these cases we use **non-maxima suppression (NMS)** to remove duplicate boxes. [Non-maxima suppression](https://www.pyimagesearch.com/2015/02/16/faster-non-maximum-suppression-python/) is a simple technique: first we select a class (cow here) and sort all positive boxes by probability score. Then, if two positive bounding boxes have an IoU (intersection over union) greater than the desired threshold, we remove the box with the lower confidence score. In this way we keep the best bounding boxes and remove duplicates.

Input image | Output image (after non-maxima suppression)
:---------:|:------:
![img1](gifs/x.png)  |  ![img_2](gifs/y.png) 
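
The IoU-based labelling rule and the NMS step described above can be sketched in plain Python. This is a minimal illustration rather than an optimized implementation; boxes are assumed to be (x1, y1, x2, y2) tuples, and the function names and the 0.5 NMS threshold are choices made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def label_window(window_box, gt_boxes, iou_threshold=0.3):
    """Return 1 (object) if the window overlaps any annotated box enough, else 0."""
    return int(any(iou(window_box, gt) >= iou_threshold for gt in gt_boxes))


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept after non-maxima suppression."""
    # visit boxes from highest to lowest score
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # keep a box only if it does not overlap a better box too much
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 100, 100), (10, 10, 110, 110), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The first two boxes overlap with an IoU of about 0.68, so the lower-scoring one is dropped, while the third box overlaps nothing and is kept.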

## Problems with the sliding window approach
The sliding window approach is simple and quick to implement, but this naive approach has many problems. First, objects come in different sizes and shapes, and the images in a dataset also differ in shape and size. Detecting an object at any scale, shape and size is challenging with a single sliding window. Below are some images from the COCO dataset (which we will discuss in the next section).

img1       |  img2 |  img3 |  img4 |  img5
:---------:|:------:|:-----:|:-------:|:------:
![img1](../images/0_coco1.jpg)  |  ![img2](../images/0_coco2.jpg) | ![img3](../images/0_coco3.jpg)  | ![img4](../images/0_coco4.jpg)| ![img5](../images/0_coco5.jpg) 

Computer vision researchers found some engineering ways to deal with this. Instead of using just one kernel_size and stride, we can use multiple kernel_sizes and strides (largely derived from the dataset). This generates more windows from each image, which increases both the amount of training data (and thus training time) and the inference time. For the research community, this gave slight improvements in accuracy. To widen the search further, they started using **image pyramids**.


### 3.3 What are image pyramids?
An image pyramid is a multi-scale representation of an image. As shown below, an image of width and height (x, y) is resized to (x/2, y/2), (x/4, y/4) and (x/8, y/8). Instead of generating windows on a single image, we generate windows on each of these scales.

![image_pyramids](../images/image_pyramids.png)


At the bottom of the pyramid we have the original image at its original size (in terms of width and height). At each subsequent layer, the image is resized (subsampled) and optionally smoothed (usually via Gaussian blurring). More about Gaussian blurring [here](https://en.wikipedia.org/wiki/Gaussian_blur). The scikit-image transform module has a pyramid_gaussian function that generates the pyramid levels for a given downscale value. The following code generates the pyramid.


```python
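from skimage.transform import pyramid_gaussian

# channel_axis=-1 marks the last axis as color channels (scikit-image >= 0.19;
# older releases used multichannel=True instead)
for (i, resized) in enumerate(pyramid_gaussian(img, downscale=1.2, channel_axis=-1)):
    # if the image is too small, break from the loop
    if resized.shape[0] < 128 or resized.shape[1] < 128:
        break
    print(resized.shape)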
```

![SlidingWindow3](gifs/image_pyramids.gif "image pyramids")


Altogether, an image pyramid lets us find objects at different scales, and when combined with a sliding window it lets us find objects of different sizes and shapes at various locations in the image.


## Quiz 

Q1) How many pyramid levels are generated for an image of size (800, 800) using a downscale of 1.5, when no level may be smaller than (64, 64)?

- A) 7
- B) 10
- C) 6
- D) 5

Sol) A


Q2) Which of the following image sizes does not occur in the pyramid generated for an image of size (800, 800) using a downscale of 1.5, when no level may be smaller than (64, 64)?

- A) (356, 356, 3)
- B) (238, 238, 3)
- C) (64, 64, 3)
- D) (106, 106, 3)

Sol) C


Code for Q1 and Q2:
```python
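import numpy as np
from skimage.transform import pyramid_gaussian
img = np.zeros((800, 800, 3))
for (i, resized) in enumerate(pyramid_gaussian(img, downscale=1.5, channel_axis=-1)):
    # stop once a level is smaller than the minimum size
    if resized.shape[0] < 64 or resized.shape[1] < 64:
        break
    print(i, resized.shape)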
```
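
The quiz answers can also be cross-checked by hand. The sketch below assumes that each pyramid level has a side length of `ceil(previous / downscale)`, which matches the sizes listed in Q2; `pyramid_sizes` is a helper name introduced here, not a scikit-image function.

```python
import math


def pyramid_sizes(size, downscale, min_size):
    """Side lengths of the pyramid levels of a square image.

    Assumes each level is ceil(previous / downscale) pixels wide and stops
    before the first level smaller than min_size.
    """
    sizes = []
    while size >= min_size:
        sizes.append(size)
        next_size = math.ceil(size / downscale)
        if next_size == size:  # no further shrinking possible
            break
        size = next_size
    return sizes


levels = pyramid_sizes(800, 1.5, 64)
print(len(levels), levels)  # 7 [800, 534, 356, 238, 159, 106, 71]
```

The next level would be 48 pixels wide, which is below the minimum of 64, so the pyramid has 7 levels and never contains a (64, 64) image.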

## Task
How many windows are there on an image of size (800, 800) with kernel_size = (64, 64), stride = 64 and no padding, when using downscale = 1.2 and min_image_size = 128?

Instructions:
- The smallest image in the pyramid can be (128, 128), so all smaller pyramid levels are discarded.
- There is no padding, so every window must satisfy x + kernel_size[0] <= image.shape[1] and y + kernel_size[1] <= image.shape[0].

Ans) 455

Solution:

```python
import numpy as np
from skimage.transform import pyramid_gaussian

img = np.zeros((800, 800, 3))
kernel_size = (64, 64)  # (width, height)
stride = 64
min_image_size = 128

windows = []
for (i, resized) in enumerate(pyramid_gaussian(img, downscale=1.2, channel_axis=-1)):
    # stop once a pyramid level is smaller than the minimum size
    if resized.shape[0] < min_image_size or resized.shape[1] < min_image_size:
        break
    print(resized.shape)
    # slide along the y axis
    for y in range(0, resized.shape[0], stride):
        # slide along the x axis
        for x in range(0, resized.shape[1], stride):
            # no padding: skip windows that extend past the image border
            if x + kernel_size[0] > resized.shape[1] or y + kernel_size[1] > resized.shape[0]:
                continue
            # capture the window
            window = (x, y, resized[y:y + kernel_size[1], x:x + kernel_size[0]])
            windows.append(window)
print(len(windows))  # 455
```
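
The total can be cross-checked without scikit-image by combining the per-level window count with the level sizes. This is a sketch for square images and square windows only, and it assumes each level is `ceil(previous / downscale)` pixels wide; `total_pyramid_windows` is a name introduced here for illustration.

```python
import math


def total_pyramid_windows(size, kernel, stride, downscale, min_size):
    """Count full-size square windows over all pyramid levels of a square image."""
    total = 0
    while size >= min_size:
        # window positions per axis on this level (no padding)
        per_axis = (size - kernel) // stride + 1 if size >= kernel else 0
        total += per_axis * per_axis
        next_size = math.ceil(size / downscale)  # assumed rounding rule
        if next_size == size:
            break
        size = next_size
    return total


print(total_pyramid_windows(800, 64, 64, 1.2, 128))  # 455
```

With downscale = 1.2 the levels are 800, 667, 556, 464, 387, 323, 270, 225, 188, 157 and 131 pixels wide, contributing 144 + 100 + 64 + 49 + 36 + 25 + 16 + 9 + 4 + 4 + 4 = 455 windows.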


## Final notes
In this first section we have seen
- the difference between object detection and image classification.
- various applications where object detection is used.
- how object detection works using traditional computer vision techniques. We defined sliding windows, kernel_size, stride, image pyramids and NMS (non-maxima suppression). This terminology is important because it is used extensively in the coming sections.

In the next section we will look at the datasets available for object detection.