

# Exercise 1: Non-Maximum Suppression in object detection


<p align="center">
  <img src="doc/sayit.jpg" />
</p>

### Before we start with today's exercise, please read the following materials.


## What is YOLO

YOLO, short for "You Only Look Once," is a groundbreaking object detection algorithm renowned for its real-time processing speed and high accuracy. Unlike traditional methods, YOLO approaches object detection as a single regression problem, directly predicting bounding boxes and class probabilities from entire images in one evaluation. Specifically, the YOLO algorithm takes an image as input and then uses a simple deep convolutional neural network to detect objects in the image. Following a fundamentally different approach to object detection, YOLO achieved state-of-the-art results, beating other real-time object detection algorithms by a large margin.

## Grid Cell

YOLO divides an input image into an S × S **grid**. If the center of an object bounding box falls into a **grid cell**, that **grid cell** is responsible for detecting that object. Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and how accurate it thinks the predicted box is.
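
As a minimal sketch of the rule above, the snippet below finds the grid cell that is responsible for an object from the center of its bounding box. The helper name `responsible_cell` and the pixel-coordinate convention (origin at the top-left corner) are assumptions made for illustration; they are not taken from the YOLO code base.

```python
def responsible_cell(cx, cy, img_w, img_h, S):
    """Return (row, col) of the S x S grid cell containing the box center (cx, cy).

    Hypothetical helper: coordinates are in pixels, origin at the top-left corner.
    """
    col = min(int(cx / img_w * S), S - 1)  # clamp centers lying on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp centers lying on the bottom edge
    return row, col


# A 640 x 640 image split into a 7 x 7 grid; the object center is at (320, 100).
print(responsible_cell(320, 100, 640, 640, S=7))  # (1, 3)
```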

<p align="center" width="300" height="200">
  <img src="doc/yolo.png" width="600" height="400" alt="Yolo Pipeline">
</p>

YOLO predicts multiple bounding boxes per **grid cell** based on the **anchor boxes**. At training time, we only want one bounding box predictor to be responsible for each object. YOLO assigns one predictor to be **responsible** for predicting an object based on which prediction has the highest current IoU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at forecasting certain sizes, aspect ratios, or classes of objects, improving the overall recall score.

The concept of **anchor boxes** is introduced in the next section.


## Anchor boxes

<p align="center" width="30" height="200">
  <img src="doc/anchor.png" width="600" height="200" alt="Description"/>
</p>

Anchor boxes, also known as anchor priors or default boxes, are pre-defined bounding boxes with specific sizes, aspect ratios, and positions that are used as reference templates during object detection. These anchor boxes are placed at each grid cell, to capture objects of different scales and shapes. During training and inference, anchor boxes are used to predict the locations and shapes of objects relative to these reference boxes.

During training, the ground truth bounding boxes are assigned to the anchor boxes based on their **IoU (Intersection over Union)** overlap. Each anchor box is responsible for predicting the object whose ground truth box has the highest IoU with the anchor.

IoU measures the overlap between two bounding boxes by dividing the area of their intersection by the area of their union (as in the following illustration):

<p align="center" >
  <img src="doc/IOU.png" width="250" height="250" alt="Description"/>
</p>
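
The IoU definition above can be sketched in a few lines of plain Python. The box format `(x1, y1, x2, y2)` (top-left and bottom-right corners) is an assumption of this sketch.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero if the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2 x 2 boxes sharing a 1 x 1 corner: intersection 1, union 4 + 4 - 1 = 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7, about 0.1429
```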

Anchor boxes help stabilize the training process by providing a consistent set of reference bounding boxes for prediction. Without anchor boxes, the model might struggle to learn meaningful bounding box predictions, especially when objects vary significantly in size and aspect ratio.

**In a word, grid cells support the localization of predictions, and anchor boxes determine their shape. We estimate the shift of each prediction with respect to its grid cell and its resizing with respect to its anchor box.**

Specifically, here is the illustration:

<p align="center" width="200" height="150">
  <img src="doc/estimate.png" width="500" height="400"/>
</p>

where σ is the sigmoid function, which limits the shift ratio `σ(t)` to between 0 and 1 so that the center point can only lie inside the grid cell. The terms `e^(t_w)` and `e^(t_h)` are the ratios between the predicted bounding box sizes `(b_w, b_h)` and the anchor box sizes `(p_w, p_h)`. The network therefore estimates `(t_x, t_y, t_w, t_h)`.
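
A minimal sketch of this decoding step, following the description above, might look like the following. The names `c_x` and `c_y` for the top-left corner of the grid cell and the helper `decode_box` are assumptions of this sketch, and all values are expressed in grid-cell units. Actual YOLO versions may use a different parameterization.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Decode network estimates into a box (b_x, b_y, b_w, b_h) in grid-cell units.

    c_x, c_y: top-left corner of the grid cell (assumed names for the cell offset).
    p_w, p_h: anchor box width and height.
    """
    b_x = c_x + sigmoid(t_x)  # the center stays inside the cell because 0 < sigmoid < 1
    b_y = c_y + sigmoid(t_y)
    b_w = p_w * math.exp(t_w)  # exp(t_w) is the size ratio relative to the anchor
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# Zero estimates place the center in the middle of the cell and keep the anchor size.
print(decode_box(0.0, 0.0, 0.0, 0.0, c_x=3, c_y=1, p_w=2.0, p_h=4.0))  # (3.5, 1.5, 2.0, 4.0)
```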

Raw Output Shape of YOLO:

The output of the network gives tensors with the shape of:

```(batch_size, num_anchor_box_per_cell, grid_cell_num_w, grid_cell_num_h,  data)```

The `data` consists of the following terms:

1. The shape information `(t_x, t_y, t_w, t_h)`,

2. Objectness, namely the probability that an object exists in the box, `Pr(there_is_an_object)`, and

3. The conditional probability of each class, `Pr(c_i|there_is_an_object)`.

**So in total, the network outputs `num_anchor_box_per_cell * grid_cell_num_w * grid_cell_num_h` bounding boxes.**

Obviously, most of them are redundant (see the upper middle image of the YOLO pipeline) because of:

1. Low probability: there is no object in the box.

2. Duplicated boxes that refer to the same object and overlap with each other.

In the following pictures, the left image visualizes all the bounding boxes. Filtering out only the low-probability boxes (middle) is not sufficient, because overlapping boxes remain.

<div style="display: flex; justify-content: space-between;">
  <img src="doc/no_nms.jpg" width="450" height="600" alt="No NMS">
  <img src="doc/filter_low.jpg" width="450" height="600" alt="Description 1">
  <img src="doc/nms.jpg" width="450" height="600" alt="Description 3">
</div>

So a post-processing step is necessary to filter out these boxes and finally get clean predictions (right).


## Non-Maximum Suppression (NMS)


One key technique used in the YOLO models is **non-maximum suppression (NMS)**. NMS is a post-processing step that is used to improve the accuracy and efficiency of object detection. In object detection, it is common for multiple bounding boxes to be generated for a single object in an image. These bounding boxes may overlap or be located at different positions, but they all represent the same object. NMS is used to identify and remove redundant or incorrect bounding boxes and to output a single bounding box for each object in the image.

NMS includes the following steps:

1. NMS begins by setting a threshold for the confidence scores (objectness). Bounding boxes with confidence scores below this threshold are discarded as they are considered not significant or reliable.

2. For the remaining bounding boxes, NMS identifies pairs of boxes that have a significant overlap, typically measured using IoU. 

3. Among the overlapping bounding boxes, NMS retains the one with the highest confidence score and suppresses (removes) the others. This process ensures that each detected object is represented by only one bounding box with the highest confidence score.

The NMS algorithm is described in the following pseudocode:

<p align="center" width="30" height="200">
  <img src="doc/nms_alg.png" width="650" height="400"/>
</p>
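
The steps above can be sketched as a greedy loop in plain Python. This is an illustrative sketch, not the implementation used later in this exercise: the function names, the `(x1, y1, x2, y2)` box format, and the default thresholds are assumptions.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, conf_thres=0.25, iou_thres=0.45):
    """Greedy NMS; returns the indices of the kept boxes, best score first."""
    # Step 1: discard low-confidence boxes, then sort the rest by descending score
    order = sorted((i for i, s in enumerate(scores) if s > conf_thres),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # Step 3: the highest-scoring remaining box is kept
        keep.append(best)
        # Step 2: suppress every remaining box that overlaps the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thres]
    return keep


boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300), (0, 0, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.1]
print(nms(boxes, scores))  # [0, 2]
```

Box 3 is dropped by the confidence threshold, box 1 is suppressed because it overlaps the higher-scoring box 0, and box 2 survives because it overlaps nothing.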

Here is a video from [DeepLearningAI](https://www.youtube.com/watch?v=VAo84c1hQX8) to give you a better understanding of how NMS works.


## YOLOv5
In this exercise, we use YOLOv5 from [ultralytics](https://github.com/ultralytics/yolov5), a general object detection toolbox for applying the YOLO model series. The network architecture of YOLOv5 looks as follows:

<p align="center">
  <img src="doc/YOLOv5.png"/>
</p>

Like other advanced object detectors from YOLOv3 onward, YOLOv5 has 3 prediction heads, each with a different grid resolution. The higher the number of grid cells in a head (that is, the smaller each cell), the smaller the objects that head is responsible for detecting, and vice versa. In other words, the network detects objects at 3 scales, from small to large.
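
To get a feeling for how many raw boxes three heads produce, the sketch below sums `num_anchor_box_per_cell * grid_cell_num_w * grid_cell_num_h` over all heads. The strides `(8, 16, 32)` and the 3 anchor boxes per cell are assumptions of this sketch; check `model.stride` and the model configuration for the actual values.

```python
def num_predictions(img_h, img_w, strides=(8, 16, 32), anchors_per_cell=3):
    """Count raw boxes produced for one image across all prediction heads.

    The strides and anchors_per_cell defaults are assumptions for this sketch.
    """
    total = 0
    for s in strides:
        grid_h, grid_w = img_h // s, img_w // s  # grid resolution of this head
        total += anchors_per_cell * grid_h * grid_w
    return total


print(num_predictions(640, 640))  # 3 * (80*80 + 40*40 + 20*20) = 25200
```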

### In this exercise, our task is to construct the inference pipeline of YOLOv5

First, install and import the relevant modules.

In [1]:
%pip install -r requirements.txt
import random
import torch, torchvision
import numpy as np
from pathlib import Path
import glob, os
import matplotlib.pyplot as plt
from PIL import Image
import cv2
import utils
from utils.general import scale_boxes, xywh2xyxy
from ultralytics.utils.plotting import Annotator, Colors
display = utils.notebook_init() 


YOLOv5  2024-4-22 Python-3.11.9 torch-2.6.0+cpu CPU


Setup complete  (12 CPUs, 31.7 GB RAM, 277.6/463.9 GB disk)


Now we need to specify the parameters for initializing our pipeline.

In [2]:
weights="./yolov5n.pt"  # model path
source="./data/images/"  # source of images
save_dir = "./data/preds/" # save images with prediction results
imgsz=(640, 640)  # inference size (height, width)
device="cpu"  # cuda device, i.e. 0 or 0,1,2,3 or cpu
hide_labels=False  # hide labels
hide_conf=False  # hide confidences
bs=1 # batch size
colors = Colors()

Task 1: Finish the pre-processing step before the image is put into the network.

Rather than resizing a non-square image directly to the desired square size, which would distort it, we first resize it while keeping its original width-to-height ratio, so that the larger dimension equals the desired size. We then pad the smaller dimension with black borders (see the following example).

<p align="center">
  <img src="doc/pad.png" alt="Yolo Pipeline">
</p>

For example, given an image of 1280 (width) by 640 (height) and a desired shape of 480 by 480, we first resize it to 480 by 240 and then add 120 pixels of padding to the top and to the bottom.
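
The arithmetic of this example can be checked with a short sketch. The helper name `letterbox_params` is hypothetical, and the sketch pads to the full desired shape exactly as in the example; it only computes sizes and does not touch any image.

```python
def letterbox_params(h, w, new_h, new_w):
    """Return the resized (height, width) and the (top, bottom, left, right) padding."""
    r = min(new_h / h, new_w / w)  # one scale factor keeps the aspect ratio
    rh, rw = round(h * r), round(w * r)  # size after resizing
    pad_h, pad_w = new_h - rh, new_w - rw  # total padding per dimension
    top, left = pad_h // 2, pad_w // 2
    return (rh, rw), (top, pad_h - top, left, pad_w - left)


# 1280 (width) x 640 (height) image, desired shape 480 x 480
print(letterbox_params(640, 1280, 480, 480))  # ((240, 480), (120, 120, 0, 0))
```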

In [3]:
def pre_process(im, new_shape):
    """Preprocess image

    Args:
        im (np.array): image with shape (height, width, 3)
        new_shape (list): desired output shape [height, width]

    Returns:
        torch.tensor: image tensor with (3, height, width) normalized to 0-1
    """
    # Compute padding
    shape = im.shape[:2]  # current shape [height, width]
    r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    new_unpad = int(round(shape[1] * r)), int(round(shape[0] * r))
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # wh padding (new_shape is [height, width])
    dw, dh = np.mod(dw, 32), np.mod(dh, 32)  # minimum-rectangle padding: pad only up to a multiple of 32
    dw /= 2  # divide padding into 2 sides
    dh /= 2
    
    im = cv2.resize(im, new_unpad, interpolation=cv2.INTER_LINEAR)
    
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    im = cv2.copyMakeBorder(im, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(0,0,0))  # add border
    
    im = im.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
    im = np.ascontiguousarray(im)  # contiguous
    
    im = torch.from_numpy(im).to(device).float()  # uint8 to float32
    im /= 255  # 0 - 255 to 0.0 - 1.0
    if len(im.shape) == 3:
        im = im[None]  # expand for batch dim
    return im

Task 2: Finish the following Non-Maximum Suppression function

In [4]:
def non_max_suppression(
    prediction,
    conf_thres=0.25,
    iou_thres=0.45,
    max_det=1000,
):
    """
    Non-Maximum Suppression (NMS) on inference results to reject overlapping detections.

    Parameters:
    - prediction : Tensor of shape [batch_size, num_predictions, 5+num_classes]
                    Contains the bbox coords (x, y, w, h), confidence score and class scores.
    - conf_thres : float, threshold for confidence score
    - iou_thres  : float, Intersection Over Union threshold for deciding overlap
    - max_det    : int, maximum number of detections per image

    Returns:
    - output : List of tensors, each tensor is (n,6) for each image [xyxy, conf, cls]
    """

    bs = prediction.shape[0]  # batch size
    nc = prediction.shape[2] - 5  # number of classes
    xc = prediction[..., 4] > conf_thres  # candidates

    # Settings
    max_wh = 7680  # (pixels) maximum box width and height
    max_nms = 30000  # maximum number of boxes into torchvision.ops.nms()

    output = [torch.zeros((0, 6), device=prediction.device)] * bs
    for xi, x in enumerate(prediction):  # image index, image inference
        x = x[xc[xi]]  # objectness score > conf_thres

        # If none remain process next image
        if not x.shape[0]:
            continue

        # Compute conf
        x[:, 5:] *= x[:, 4:5]  # conf = obj_conf * cls_conf

        # Box/Mask
        box = xywh2xyxy(x[:, :4])  # (center_x, center_y, width, height) to (x1, y1, x2, y2)

        # Detections matrix nx6 (xyxy, conf, cls)
        conf, j = x[:, 5:].max(1, keepdim=True)
        x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > conf_thres]

        # Check shape
        n = x.shape[0]  # number of boxes
        if not n:  # no boxes
            continue
        x = x[x[:, 4].argsort(descending=True)[:max_nms]]  # sort by confidence and remove excess boxes

        # Batched NMS
        c = x[:, 5:6] * max_wh  # classes
        boxes, scores = x[:, :4] + c, x[:, 4]  # boxes (offset by class), scores
        i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS
        i = i[:max_det]  # limit detections

        output[xi] = x[i]

    return output
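
The `# Batched NMS` step in the function above shifts every box by `class_index * max_wh` before calling NMS, so that boxes of different classes can never overlap and are therefore never suppressed by each other. The plain-Python sketch below illustrates the idea; the helper names and the example boxes are assumptions made for illustration.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


MAX_WH = 7680  # same value as max_wh in the function above


def offset_by_class(box, cls):
    """Shift a box by cls * MAX_WH so boxes of different classes never overlap."""
    return tuple(v + cls * MAX_WH for v in box)


box_a = (100, 100, 200, 300)  # detection of class 0
box_b = (105, 110, 195, 290)  # detection of class 1 at almost the same location

print(iou(box_a, box_b) > 0.45)  # True: without the offset, one box would be suppressed
print(iou(offset_by_class(box_a, 0), offset_by_class(box_b, 1)))  # 0.0: both boxes survive
```

Boxes of the same class receive the same offset, so their IoU is unchanged and they still suppress each other as usual.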

Now let's try out your pipeline. You can store any image (jpg or png format) under `data/images`; the pipeline randomly selects one image from this path and shows you the detection results.

Have fun with it :)

In [5]:
# Load model
model = torch.load(weights, map_location="cpu", weights_only=False)["model"].to(device).float() 
# Model compatibility updates
model= model.eval()  # model in eval mode
stride = model.stride
names = model.names
new_shape = [640, 640]  # check image size

paths = glob.glob(source + "*")
random.shuffle(paths)
path = paths[0]
im0 = cv2.imread(path)

# Run inference
path = Path(path)
im = im0.copy()
im = pre_process(im, new_shape)
# Inference
visualize = Path(save_dir) / path.stem
pred = model(im, visualize=False)
# NMS
pred = non_max_suppression(pred[0])
s = ""

for i, det in enumerate(pred):  # per image
    im0 = im0.copy()
    os.makedirs(save_dir, exist_ok=True)  # cv2.imwrite fails silently if the directory is missing
    save_path = str(Path(save_dir) / path.name)  # im.jpg
    
    annotator = Annotator(im0, line_width=3, example=str(names))
    if len(det):
        # Rescale boxes from img_size to im0 size
        det[:, :4] = scale_boxes(im.shape[2:], det[:, :4], im0.shape).round()

        # Print results
        for c in det[:, 5].unique():
            n = (det[:, 5] == c).sum()  # detections per class
            s += f"{n} {names[int(c)]}{'s' * (n > 1)}, "  # add to string

        # Write results
        for *xyxy, conf, cls in reversed(det):
            # Add bbox to image
            c = int(cls)  # integer class
            label = None if hide_labels else (names[c] if hide_conf else f"{names[c]} {conf:.2f}")
            annotator.box_label(xyxy, label, color=colors(c, True))
    im0 = annotator.result()
    cv2.imwrite(save_path, im0)

print(s)
display.Image(save_path)
