
In [None]:
import tensorflow as tf

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

# Milestone 1. What is YOLO?


The “You Only Look Once,” or YOLO, family is a series of end-to-end deep learning models designed for fast object detection, developed by Joseph Redmon et al. and first proposed in the 2015 paper “[You Only Look Once: Unified, Real-Time Object Detection](https://arxiv.org/abs/1506.02640).” The model has been updated several times since then. Today, we'll focus on YOLOv3, which is described in this very interesting [technical report](https://pjreddie.com/media/files/papers/YOLOv3.pdf), and walk through the basic idea of the algorithm. If you'd like to know more details, definitely check out the papers!

The approach involves a single deep convolutional neural network (DarkNet, which is based on the VGG model we used before) that splits the input into a grid of cells, where each cell directly predicts bounding boxes and object classes. The result is a large number of candidate bounding boxes, which a post-processing step consolidates into a final prediction.

For example, an image may be divided into a 7×7 grid, and each cell in the grid may predict 2 bounding boxes, resulting in 98 proposed bounding boxes. The class probability map and the bounding boxes with their confidences are then combined into a final set of bounding boxes and class labels. The image below, taken from the paper, summarizes the two outputs of the model.

In summary, running object detection on one input image takes two steps: first, a forward pass through DarkNet; second, post-processing of the DarkNet output to get the final bounding box predictions.

![](https://drive.google.com/uc?export=view&id=1-IuBrnrTZPOb4zZXGLl6Kn4OKsX9NnXR)
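To make the count in the 7×7 example concrete, here is a minimal sketch of the arithmetic. The variable names are illustrative only and are not part of the notebook's helper code.

```python
# Count the candidate boxes for the 7x7 grid example from the text.
grid_size = 7        # cells along each side of the grid
boxes_per_cell = 2   # bounding boxes predicted by each cell

num_cells = grid_size * grid_size
num_candidates = num_cells * boxes_per_cell

print("cells:", num_cells)
print("candidate boxes:", num_candidates)  # 98, as stated above
```

Most of these candidates are discarded later, which is why the post-processing step matters as much as the forward pass.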











# Milestone 2. How does YOLO work?

Before we proceed to build the YOLO model, let's first define the **anchor boxes**, which are several pre-defined bounding boxes with useful shapes and sizes that are tailored based on the object shapes in the training dataset.

There are 9 anchor boxes in total. As we'll talk about later, the detection is performed on 3 scales. Therefore, the anchor boxes are divided into 3 groups, each corresponding to 1 scale.


In [None]:
anchors = [[[116,90], [156,198], [373,326]], [[30,61], [62,45], [59,119]], [[10,13], [16,30], [33,23]]]


The 9 anchor boxes are plotted below. As you can see, they cover a variety of shapes and sizes.

<img src="http://www.programmersought.com/images/401/891354390c3aab3f1ab1fd0db3110bf9.png" width="400"/>
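If you'd like to inspect the anchors numerically as well, the sketch below prints the area and aspect ratio of each box. It assumes each pair is `[width, height]` in pixels, which matches the usual YOLOv3 convention but is not stated explicitly in this notebook.

```python
# Anchor boxes copied from the cell above: 3 groups (one per detection scale),
# each holding 3 pairs that we assume are [width, height] in pixels.
anchors = [[[116, 90], [156, 198], [373, 326]],
           [[30, 61], [62, 45], [59, 119]],
           [[10, 13], [16, 30], [33, 23]]]

summary = []  # (group index, width, height, area)
for group_index, group in enumerate(anchors):
    for width, height in group:
        area = width * height
        aspect = width / height
        summary.append((group_index, width, height, area))
        print(f"group {group_index}: {width:>3} x {height:<3} area={area:>6} aspect={aspect:.2f}")
```

Under that assumption, every anchor in the first group is larger than every anchor in the second, and likewise for the second and third, so each group targets a different object size.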

Now, let's load the image that we'll apply object detection on. To load the image, we'll use the `Image` module in the package `PIL`, which is commonly used for image processing. The image is saved as a `PIL image` in the variable `image_pil`. We can get the width and the height of the image by accessing the `size` attribute of the image.


In [None]:
from PIL import Image
from matplotlib import pyplot as plt

image_path = '/content/data/image.jpg'

image_pil = Image.open(image_path)
image_w, image_h = image_pil.size
print("The type of the saved image is {}".format(type(image_pil)))
plt.imshow(image_pil)
plt.show()

### Exercise (Coding) | Image Preprocessing

The input size of DarkNet is `(416, 416)`, so we need to preprocess our image into the required size by resizing it while keeping the aspect ratio unchanged, then padding the leftover areas with grey, which is `(128,128,128)` in RGB. We have implemented the preprocessing for you in the `preprocess_input(image, net_h, net_w)` function, which takes the original image and the target height and width `net_h, net_w` as input and returns the new image in the required size.
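To see what "resize while keeping the aspect ratio, then pad" means in numbers, here is a small sketch that computes only the geometry. The function `letterbox_geometry` is hypothetical and not part of the provided code, and it assumes the padding is split evenly on both sides; the provided `preprocess_input` may handle these details differently.

```python
def letterbox_geometry(image_w, image_h, net_w=416, net_h=416):
    """Return (new_w, new_h, pad_x, pad_y) for an aspect-preserving resize.

    Hypothetical helper for illustration. Integer arithmetic is used so the
    limiting side lands exactly on the network size.
    """
    if image_w * net_h >= image_h * net_w:
        # Width is the limiting side: fill the full network width.
        new_w = net_w
        new_h = image_h * net_w // image_w
    else:
        # Height is the limiting side: fill the full network height.
        new_h = net_h
        new_w = image_w * net_h // image_h
    # Assumed: leftover space is padded equally on both sides with grey.
    pad_x = (net_w - new_w) // 2
    pad_y = (net_h - new_h) // 2
    return new_w, new_h, pad_x, pad_y

print(letterbox_geometry(640, 480))  # (416, 312, 0, 52)
```

For a 640 x 480 image, the resized content is 416 x 312, leaving a 52-pixel grey band above and below it.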

In the cell below, do the preprocessing yourself! Plot the new image to check your result.


In [None]:
### YOUR CODE HERE

# new_image =

### END CODE

In [None]:
#@title Run this to check the new image { display-mode: "form" }
plt.imshow(new_image[0])
plt.show()

### Exercise (Discussion) | DarkNet Architecture

The main part of the YOLO algorithm is the DarkNet model, which is basically a Convolutional Neural Network with some special designs, like upsampling layers and detection layers.

Here is what the architecture of DarkNet looks like:


<img src="https://miro.medium.com/max/2000/1*d4Eg17IVJ0L41e7CTWLLSg.png" width="1000"/>

**The residual blocks** in the picture contain layers that are similar to those in the CNN models we built before, e.g. convolutional layers `Conv2D`, max pooling layers `MaxPooling2D`, and activation layers `Activation('relu')`. The network just stacks many more layers than the models we built before.





**How to make detections at 3 different scales?**

Beyond the layers we are already familiar with, the most salient feature of the YOLOv3 DarkNet is that it makes detections at three different scales, obtained by downsampling the input image by factors of 32, 16 and 8 respectively.

The first detection is made by the 82nd layer. Over the first 81 layers, the network downsamples the image so that the 81st layer has a total stride of 32. For a 416 x 416 input image, the resulting feature map is 13 x 13.

The feature maps at layers 94 and 106 are larger because of the upsampling layers. Each upsampling step doubles the spatial dimensions, giving feature maps of 26 x 26 and 52 x 52 respectively.
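The three feature map sizes follow directly from dividing the input size by each stride. A minimal sketch of that arithmetic, with illustrative variable names:

```python
# Feature map size at each detection scale = input size / stride.
net_size = 416
strides = [32, 16, 8]   # strides at detection layers 82, 94 and 106

grid_sizes = [net_size // stride for stride in strides]
print(grid_sizes)  # [13, 26, 52]
```

The coarse 13 x 13 grid is better suited to large objects, while the fine 52 x 52 grid helps with small ones.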

**What exactly are the DarkNet outputs?**

The eventual detection output is generated by applying detection kernels on feature maps at the three different places in the network.

For each grid cell, we'll consider several possible bounding boxes that are centered at the given grid cell. Then for each considered bounding box, the model predicts t<sub>x</sub>, t<sub>y</sub>, t<sub>w</sub>, t<sub>h</sub>, an objectness score, and class scores.
- t<sub>x</sub>, t<sub>y</sub>, t<sub>w</sub>, t<sub>h</sub> are related to predicting the exact position and shape of the considered bounding box.
- The objectness score is the model's prediction of how likely it is that the considered bounding box contains a complete object.
- Class scores are the predicted probability over all the object classes.

Therefore, the shape of the detection kernel is 1 x 1 x (B x (4 + 1 + C)). Here, 1 x 1 means the kernel only looks at one grid cell at one time. B is the number of bounding boxes a cell on the feature map can predict, "4" is for the 4 bounding box attributes (t<sub>x</sub>, t<sub>y</sub>, t<sub>w</sub>, t<sub>h</sub>) and "1" for the object confidence. C is the number of object classes.

At each scale, the model considers bounding boxes based on the 3 anchor boxes assigned to that scale, so B = 3. As YOLO is trained on COCO (a large-scale object detection dataset), which contains 80 object categories, C = 80. Therefore, the kernel size is 1 x 1 x 255, since 3 x (4 + 1 + 80) = 255. The feature map produced by this kernel has the same height and width as the previous feature map, and holds the detection attributes described above along its depth.

The following picture illustrates how the detection kernel works.
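The numbers above can be checked with a few lines of arithmetic. The sketch below also splits one cell's 255 values into 3 chunks of 85, assuming the values for each box are stored contiguously (box 0 first, then box 1, then box 2); that layout is an assumption made here for illustration.

```python
boxes_per_cell = 3    # B: anchor boxes per scale
box_attributes = 4    # t_x, t_y, t_w, t_h
objectness = 1        # one objectness score per box
num_classes = 80      # C: COCO categories

values_per_box = box_attributes + objectness + num_classes
depth = boxes_per_cell * values_per_box
print(values_per_box, depth)  # 85 255

# Total number of candidate boxes over the three detection scales.
grid_sizes = [13, 26, 52]
total_boxes = sum(g * g * boxes_per_cell for g in grid_sizes)
print(total_boxes)  # 10647

# Split one cell's 255 values into per-box chunks (assumed box-major layout).
cell_vector = list(range(depth))  # placeholder values 0..254
per_box = [cell_vector[i * values_per_box:(i + 1) * values_per_box]
           for i in range(boxes_per_cell)]
print([len(chunk) for chunk in per_box])  # [85, 85, 85]
```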

<img src="https://miro.medium.com/max/1200/0*3A8U0Hm5IKmRa6hu.png" width="500"/>
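To give a feel for how t<sub>x</sub>, t<sub>y</sub>, t<sub>w</sub>, t<sub>h</sub> become an actual box, here is a sketch based on the formulas in the YOLOv3 technical report: the center is the cell position plus a sigmoid offset, and the size is the anchor size scaled by an exponential. The function `decode_box` and its argument names are hypothetical; the provided `decode_netout` may differ in its details.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t_x, t_y, t_w, t_h, col, row, grid_w, grid_h,
               anchor_w, anchor_h, net_w=416, net_h=416):
    """Turn raw predictions into (center_x, center_y, width, height).

    All four outputs are fractions of the network input size.
    """
    # The sigmoid keeps the predicted center inside its own grid cell.
    center_x = (col + sigmoid(t_x)) / grid_w
    center_y = (row + sigmoid(t_y)) / grid_h
    # The anchor gives a prior size; exp(t) rescales it.
    width = anchor_w * math.exp(t_w) / net_w
    height = anchor_h * math.exp(t_h) / net_h
    return center_x, center_y, width, height

# All-zero predictions in the middle cell of the 13 x 13 grid:
box = decode_box(0.0, 0.0, 0.0, 0.0, col=6, row=6, grid_w=13, grid_h=13,
                 anchor_w=116, anchor_h=90)
print(box)
```

With all-zero predictions, the box sits at the center of its cell and has exactly the anchor's size, which is why well-chosen anchors make the model's job easier.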



### Exercise (Coding and Discussion) | Forward Pass

Now, let's load a fully trained DarkNet model!

In [None]:
import tensorflow as tf

# Load the pretrained model (model_path is expected to be defined by an earlier setup cell)
darknet = tf.keras.models.load_model(model_path)

Just as we did to get classification predictions from the perceptron, CNN, and VGG models, call the `model.predict(input_data)` function to do a forward pass on our preprocessed image `new_image`!

After you get the output, check its structure and discuss what the dimensions mean with your classmates!

In [None]:
### YOUR CODE HERE
# yolo_outputs =

### END CODE

Answer the following questions:


*   How many elements are there in `yolo_outputs`? Why?
*   What does each dimension of `yolo_outputs[0]` mean?
*   Why is the last dimension 255?

If you are clear about the questions above, you can now explain how DarkNet works to your classmates! (At each detection scale, ... For each grid cell, ... For each bounding box, ...)

# Milestone 3. Bounding Box Prediction

We now have DarkNet's detection predictions for all the possible bounding boxes centered at each grid cell position, but to get the final detection results, which are the bounding boxes that the model is confident of, we need to apply a threshold to filter the results.

Besides, as you can imagine, there might be multiple bounding boxes that are detecting the same object. We need to remove the overlapping bounding boxes and only leave the best ones.

The following helper functions implement the post-processing steps:



*   `decode_netout(yolo_outputs, obj_thresh, anchors, image_h, image_w, net_h, net_w)` takes the DarkNet output feature maps `yolo_outputs` as input, and returns all the predicted bounding boxes whose objectness is higher than the objectness threshold `obj_thresh`.
*   `do_nms(boxes, nms_thresh, obj_thresh)` performs Non-Maximal Suppression (NMS), a commonly used post-processing step for object detection. It removes every bounding box whose overlap with a better bounding box is higher than `nms_thresh`.
*   `draw_boxes(image_pil, boxes, labels)` draws the final bounding boxes on the input image and returns the detection image as a `PIL image`.
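To illustrate the idea behind Non-Maximal Suppression, here is a generic greedy sketch in plain Python. It is not the provided `do_nms`: the functions `iou` and `greedy_nms`, the `(x_min, y_min, x_max, y_max)` box format, and the example boxes are all assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, nms_thresh):
    """Return the indices of the boxes to keep, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= nms_thresh for j in kept):
            kept.append(i)
    return kept

# Two heavily overlapping boxes and one separate box (made-up values).
boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores, nms_thresh=0.45)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, which is above the threshold of 0.45, so only the higher-scoring one survives.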


### Exercise (Coding) | Post-processing for bounding box prediction

First, let's define the thresholds mentioned above:

In [None]:
obj_thresh = 0.4
nms_thresh = 0.45

Make use of the functions above to get our final detection bounding boxes and plot the result you get!


In [None]:
### YOUR CODE HERE

### END CODE

### Exercise (Coding) | Non-Maximal Suppression

Good job! Are you curious about what each post-processing step is doing? You can explore this by yourself!

As a hint, you can...

*   Check the number of boxes after each step
*   Call the `draw_boxes(image_pil, boxes, labels)` function to visualize the bounding boxes after each step

In [None]:
### YOUR CODE HERE

### END CODE

### Exercise (Coding) | Image Detection Function

Our final goal is to detect objects in a video, which contains multiple frames (images). For better reusability and modularity, let's wrap all the code we wrote before in a function called `detect_image`, which takes the raw `PIL image` (without preprocessing) and other parameters as input, and returns the `PIL image` with detected bounding boxes and labels. Complete this function yourself and test it.

In [None]:
def detect_image(image_pil, obj_thresh = 0.4, nms_thresh = 0.45, darknet=darknet, net_h=416, net_w=416, anchors=anchors, labels=labels):
  ### YOUR CODE HERE
  pass
  ### END CODE

In [None]:
#@title Run this to check your function definition { display-mode: "form" }
plt.figure(figsize=(12,12))
plt.imshow(detect_image(image_pil))
plt.show()

### Exercise (Discussion) | Thresholds

Up to now, we have used the default values for the 2 thresholds, `obj_thresh` and `nms_thresh`. Do you understand what these 2 thresholds control? Using the `detect_image` function we defined above, try different values between 0 and 1 for the 2 thresholds and see how the results change. Then discuss this with your classmates!

# Milestone 4. Detection on Videos

A video is just a sequence of frames (images). Therefore, once we can use YOLO to detect objects in images, it's easy to extend this to videos. To work with videos, we'll use the OpenCV package, which is called `cv2` in Python. If you'd like to know more, here is a [tutorial](https://docs.opencv.org/4.5.2/d0/de3/tutorial_py_intro.html).

The code below will open one video, create a new video file, read the input video frame-by-frame, and write each frame into the new video.

Now modify the code by yourself to get the object detection result on the input video!

Remember that the image input for the `detect_image` function is a `PIL image`, but here we are loading the input video using `OpenCV`. These 2 image formats are different, so we need to convert each frame from `OpenCV` to `PIL` for detection, and convert the result back to `OpenCV` to write the frame into the new video.

The conversion can be done as follows:
```
import cv2
import numpy as np
from PIL import Image

# OpenCV -> PIL
image_pil = Image.fromarray(cv2.cvtColor(image_cv2, cv2.COLOR_BGR2RGB))

# PIL -> OpenCV
image_cv2 = cv2.cvtColor(np.asarray(image_pil), cv2.COLOR_RGB2BGR)
```
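The reason a conversion is needed is that OpenCV stores each pixel's channels in blue, green, red (BGR) order, while PIL uses red, green, blue (RGB). The sketch below shows the swap on a single made-up pixel in plain Python, without either library.

```python
# One pure-blue pixel, written in OpenCV's B, G, R channel order.
bgr_pixel = (255, 0, 0)

# Reversing the channel order gives the same color in R, G, B order.
rgb_pixel = bgr_pixel[::-1]
print(rgb_pixel)  # (0, 0, 255)

# Reversing again restores the original, so the conversion is lossless.
back_to_bgr = rgb_pixel[::-1]
print(back_to_bgr == bgr_pixel)  # True
```

Skipping the conversion would not cause an error, but the colors would be swapped, so red objects would look blue to the model.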

In [None]:
import cv2
import numpy as np  # needed for the PIL -> OpenCV conversion

def detect_video(video_path, output_path, obj_thresh=0.4, nms_thresh=0.45, darknet=darknet, net_h=416, net_w=416, anchors=anchors, labels=labels):
    vid = cv2.VideoCapture(video_path)
    if not vid.isOpened():
        raise IOError("Couldn't open webcam or video")
    # Write the output with the mp4v codec, keeping the input's frame rate and size
    video_FourCC = cv2.VideoWriter_fourcc(*'mp4v')
    video_fps = vid.get(cv2.CAP_PROP_FPS)
    video_size = (int(vid.get(cv2.CAP_PROP_FRAME_WIDTH)),
                  int(vid.get(cv2.CAP_PROP_FRAME_HEIGHT)))

    out = cv2.VideoWriter(output_path, video_FourCC, video_fps, video_size)

    num_frame = 0
    while vid.isOpened():
        ret, frame = vid.read()
        if not ret:
            # No more frames to read
            break
        num_frame += 1
        print("=== Frame {} ===".format(num_frame))
        ### YOUR CODE HERE
        new_frame = frame

        ### END CODE
        out.write(new_frame)
    vid.release()
    out.release()
    print("New video saved!")

Now test your code! You can find the videos in the Files panel on the left.

In [None]:
video_path = '/content/data/video1.mp4'
output_path = '/content/data/video1_detected.mp4'
detect_video(video_path, output_path)