# Tutorial 6-2: You Only Look Once – "Understanding Object Detection"

**Course:** CSEN 342: Deep Learning  
**Topic:** Object Detection, YOLO, Intersection over Union (IoU), and Non-Max Suppression (NMS)

## Objective
Object detection models like YOLO (You Only Look Once) don't just output a class label; they output a **dense grid of bounding boxes**. A single forward pass of the original YOLO produces $7 \times 7 \times 2 = 98$ candidate boxes (later detectors produce thousands), most of which are overlapping or low-confidence.

In this tutorial, we will peel back the layers of the detection pipeline. We won't train a model (training can take days); instead, we will implement the **post-processing logic** that turns raw network output into clean detections.

We will:
1.  **Decode the Tensor:** Understand the famous $7\times7\times30$ YOLO output tensor.
2.  **Calculate IoU:** Implement Intersection over Union to measure how much two boxes overlap.
3.  **Implement NMS:** Write the Non-Max Suppression algorithm from scratch to remove duplicate detections.

---

## Part 1: The YOLO Output Tensor

As described in the lecture (Slide 94), the original YOLO model divides the image into a $7\times7$ grid. Each cell predicts:
1.  **2 Bounding Boxes** ($x, y, w, h, \text{confidence}$ for each).
2.  **20 Class Probabilities** (for PASCAL VOC).

Total depth = $2 \times 5 + 20 = 30$.  
Final Tensor Shape: $(7, 7, 30)$.
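
Before decoding a single cell, it can help to see how the whole tensor splits apart. The following is a minimal sketch, assuming the channel layout `[x1, y1, w1, h1, c1, x2, y2, w2, h2, c2, class_probs(20)]` and using random numbers as a stand-in for real network output. The names `S`, `B`, `C`, and `output` are illustrative and are not part of any library.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

# Random stand-in for a network output; a real tensor would come from the model.
rng = np.random.default_rng(0)
output = rng.random((S, S, B * 5 + C))

# The first B*5 channels hold the boxes, laid out as (x, y, w, h, conf) per box.
boxes = output[..., : B * 5].reshape(S, S, B, 5)
coords = boxes[..., :4]      # shape (7, 7, 2, 4)
confidence = boxes[..., 4]   # shape (7, 7, 2)

# The remaining C channels are class probabilities shared by both boxes of a cell.
class_probs = output[..., B * 5 :]  # shape (7, 7, 20)

# Class-specific score for every box: box confidence times class probability.
scores = confidence[..., :, None] * class_probs[..., None, :]  # shape (7, 7, 2, 20)

print(coords.shape, confidence.shape, class_probs.shape, scores.shape)
print("candidate boxes:", S * S * B)
```

Broadcasting the confidences against the class probabilities gives one score per (cell, box, class) triple, which is the quantity that post-processing thresholds and sorts.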

Let's write a function to decode a prediction vector from one of these cells.

In [None]:
import torch
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

# Define the class names for PASCAL VOC (20 classes)
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", 
    "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"
]

def decode_yolo_cell(cell_tensor, grid_x, grid_y, img_width=448, img_height=448):
    """
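    Converts raw YOLO output for a single cell into absolute bounding boxes.
    
    Args:
        cell_tensor: tensor of shape (30,) -> [x1, y1, w1, h1, c1, x2, y2, w2, h2, c2, class_probs(20)]
        grid_x, grid_y: column and row index of the grid cell (0-6)
        img_width, img_height: image size in pixels (448x448 in the original YOLO)
    Returns:
        List of two boxes, each [x1, y1, x2, y2, score, class_name] in pixel coordinates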
    """
    # Split the tensor
    # First 5: Box 1 (x, y, w, h, conf)
    # Next 5: Box 2 (x, y, w, h, conf)
    # Last 20: Class probabilities
    box1_data = cell_tensor[0:5]
    box2_data = cell_tensor[5:10]
    class_probs = cell_tensor[10:]
    
    # Find best class (convert the 0-dim index tensor to a plain int for list lookup)
    class_score, class_idx = torch.max(class_probs, 0)
    class_name = VOC_CLASSES[class_idx.item()]
    
    decoded_boxes = []
    
    # Process both boxes
    for data in [box1_data, box2_data]:
        x, y, w, h, conf = data
        
        # YOLO mechanics (simplified):
        # x, y are offsets relative to the grid cell bounds (0 to 1)
        # w, h are normalized relative to image size
        
        # Convert to absolute pixel coords
        cell_width = img_width / 7
        cell_height = img_height / 7
        
        center_x = (grid_x + x) * cell_width
        center_y = (grid_y + y) * cell_height
        abs_w = w * img_width
        abs_h = h * img_height
        
        # Convert Center-Width-Height to Top-Left-Bottom-Right (x1, y1, x2, y2)
        x1 = center_x - abs_w / 2
        y1 = center_y - abs_h / 2
        x2 = center_x + abs_w / 2
        y2 = center_y + abs_h / 2
        
        # Final Score = Box Confidence * Class Probability
        final_score = conf * class_score
        
        decoded_boxes.append([x1.item(), y1.item(), x2.item(), y2.item(), final_score.item(), class_name])
        
    return decoded_boxes

print("Decoder function defined.")

---

## Part 2: Intersection over Union (IoU)

To remove duplicates, we need to know if two boxes overlap significantly. **IoU** is the standard metric.

$$ \text{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}} $$
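
As a quick worked check, assume two $100 \times 100$ boxes where the second is shifted by 10 pixels in both $x$ and $y$ (the same configuration the test in the next cell uses). The overlap is then a $90 \times 90$ square:

$$ \text{Intersection} = 90 \times 90 = 8100, \qquad \text{Union} = 10000 + 10000 - 8100 = 11900, \qquad \text{IoU} = \frac{8100}{11900} \approx 0.68 $$

Note that the union subtracts the intersection once, because summing the two box areas counts the overlap twice.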

In [None]:
def calculate_iou(boxA, boxB):
    # boxA = [x1, y1, x2, y2]
    # 1. Determine the coordinates of the intersection rectangle
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2])
    yB = min(boxA[3], boxB[3])

    # 2. Compute the area of intersection rectangle
    interWidth = max(0, xB - xA)
    interHeight = max(0, yB - yA)
    interArea = interWidth * interHeight

    # 3. Compute the area of each box
    boxAArea = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    boxBArea = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])

    # 4. Compute the intersection over union
    iou = interArea / float(boxAArea + boxBArea - interArea + 1e-6)
    return iou

# Test it
box1 = [50, 50, 150, 150] # 100x100 box
box2 = [60, 60, 160, 160] # Shifted by 10
box3 = [200, 200, 300, 300] # Non-overlapping

print(f"IoU (Overlap): {calculate_iou(box1, box2):.2f}")
print(f"IoU (No Overlap): {calculate_iou(box1, box3):.2f}")
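
When many boxes have to be compared at once, calling a scalar IoU function in a double loop gets slow. The sketch below shows one possible vectorized variant using NumPy broadcasting. The function name `pairwise_iou` is hypothetical, and the `1e-6` term mirrors the division guard used in `calculate_iou`.

```python
import numpy as np

def pairwise_iou(boxes):
    """Return an (N, N) matrix of IoU values for boxes given as [x1, y1, x2, y2]."""
    boxes = np.asarray(boxes, dtype=float)
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)

    # Broadcasting (N, 1) against (1, N) yields the intersection corners of every pair.
    left = np.maximum(x1[:, None], x1[None, :])
    top = np.maximum(y1[:, None], y1[None, :])
    right = np.minimum(x2[:, None], x2[None, :])
    bottom = np.minimum(y2[:, None], y2[None, :])

    # Negative widths or heights mean the boxes do not overlap, so clip them to zero.
    inter = np.clip(right - left, 0, None) * np.clip(bottom - top, 0, None)
    union = areas[:, None] + areas[None, :] - inter
    return inter / (union + 1e-6)

iou_matrix = pairwise_iou([
    [50, 50, 150, 150],    # 100x100 box
    [60, 60, 160, 160],    # shifted by 10
    [200, 200, 300, 300],  # non-overlapping
])
print(np.round(iou_matrix, 2))
```

Entry `[i, j]` of the result is the IoU between box `i` and box `j`, so the matrix is symmetric and its diagonal is approximately 1.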

---

## Part 3: Non-Max Suppression (NMS)

Detectors often output multiple boxes for the same object (e.g., one slightly to the left, one slightly to the right). **NMS** cleans this up.

**The Algorithm:**
1.  Discard all boxes whose confidence score is below a score threshold.
2.  Sort the remaining boxes by confidence (descending).
3.  Pick the highest-confidence box $B$ as a valid detection.
4.  Discard any other box of the same class whose IoU with $B$ exceeds the IoU threshold (a likely duplicate).
5.  Repeat steps 3 and 4 on the boxes that are left until none remain.

In [None]:
def non_max_suppression(boxes, iou_threshold=0.5, score_threshold=0.4):
    """
    Args:
        boxes: List of [x1, y1, x2, y2, score, class_name]
    Returns:
        List of kept boxes
    """
    # 1. Filter by score threshold
    boxes = [b for b in boxes if b[4] > score_threshold]
    
    # 2. Sort by confidence (highest first)
    boxes = sorted(boxes, key=lambda x: x[4], reverse=True)
    
    kept_boxes = []
    
    while len(boxes) > 0:
        # Pick the best box
        current_box = boxes.pop(0)
        kept_boxes.append(current_box)
        
        # Compare with rest
        remaining_boxes = []
        for other_box in boxes:
            iou = calculate_iou(current_box[:4], other_box[:4])
            
            # If they are different objects (low IoU) OR different classes, keep it
            if iou < iou_threshold or current_box[5] != other_box[5]:
                remaining_boxes.append(other_box)
        
        boxes = remaining_boxes
        
    return kept_boxes

print("NMS Defined.")
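
The class check in the suppression condition matters when two different objects overlap heavily. The following self-contained sketch illustrates the difference. It reimplements the same greedy loop in compact form and adds a hypothetical `class_aware` flag purely for the comparison. The two detections are made-up values, not model output.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-6)

def nms(boxes, iou_threshold=0.5, class_aware=True):
    """Greedy NMS over [x1, y1, x2, y2, score, class_name] entries.

    `class_aware` is a hypothetical flag added only for this comparison.
    """
    remaining = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    while remaining:
        best, rest = remaining[0], remaining[1:]
        kept.append(best)
        # Keep a box if it overlaps little with `best`, or (when class-aware)
        # if it belongs to a different class.
        remaining = [
            b for b in rest
            if iou(best[:4], b[:4]) < iou_threshold
            or (class_aware and b[5] != best[5])
        ]
    return kept

# Made-up scene: a person standing right behind a bicycle, so the boxes overlap heavily.
detections = [
    [100, 100, 200, 300, 0.90, "person"],
    [105, 110, 205, 300, 0.80, "bicycle"],
]

print(len(nms(detections, class_aware=True)))   # both boxes survive
print(len(nms(detections, class_aware=False)))  # the bicycle box is suppressed
```

Class-agnostic suppression would drop the bicycle here even though it is a separate object, which is why the tutorial's `non_max_suppression` compares class names before discarding a box.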

---

## Part 4: Putting it Together (Simulation)

We will generate a noisy set of predictions (simulating a YOLO raw output) and see if NMS can clean it up.

**Scenario:** A "Dog" in the center, and a "Person" to the right. The model outputs many overlapping boxes for each.

In [None]:
# Generate Synthetic "Raw" Detections
# Format: [x1, y1, x2, y2, score, class]
raw_detections = []

# Cluster 1: The Dog (Good box + 2 duplicates)
raw_detections.append([100, 100, 200, 300, 0.9, "dog"])       # Perfect
raw_detections.append([105, 102, 198, 298, 0.75, "dog"])      # Slightly off
raw_detections.append([90, 90, 210, 310, 0.6, "dog"])         # Too big

# Cluster 2: The Person (Good box + 1 duplicate)
raw_detections.append([300, 150, 400, 400, 0.85, "person"])   # Perfect
raw_detections.append([310, 160, 410, 410, 0.82, "person"])   # Shifted

# Cluster 3: Noise (Low confidence background)
raw_detections.append([50, 50, 80, 80, 0.1, "bird"])          # Noise

# Run NMS
cleaned_detections = non_max_suppression(raw_detections, iou_threshold=0.5, score_threshold=0.5)

# Visualization Helper
def draw_boxes(ax, boxes, title):
    ax.set_title(title)
    ax.set_xlim(0, 500); ax.set_ylim(500, 0)
    # Draw pseudo-image
    ax.add_patch(patches.Rectangle((0,0), 500, 500, color='#f0f0f0'))
    
    for b in boxes:
        x1, y1, x2, y2, score, label = b
        w = x2 - x1
        h = y2 - y1
        
        # Color by class
        color = 'blue' if label == 'dog' else 'red' if label == 'person' else 'green'
        
        rect = patches.Rectangle((x1, y1), w, h, linewidth=2, edgecolor=color, facecolor='none')
        ax.add_patch(rect)
        ax.text(x1, y1-5, f"{label} {score:.2f}", color=color, fontsize=10, weight='bold')

fig, axs = plt.subplots(1, 2, figsize=(12, 6))
draw_boxes(axs[0], raw_detections, f"Raw Output ({len(raw_detections)} boxes)")
draw_boxes(axs[1], cleaned_detections, f"After NMS ({len(cleaned_detections)} boxes)")
plt.show()

### Conclusion
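The right-hand plot shows the effect of NMS: of the six raw boxes, two remain. The highest-confidence "dog" box (0.9) is kept and its overlapping duplicates (0.75, 0.6) are suppressed. Likewise, the "person" box at 0.85 is kept and its shifted duplicate (0.82) is suppressed. The low-confidence "bird" box (0.1) never reaches the suppression loop, because the score threshold removes it first.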

The same post-processing runs on every frame in real-time detection systems, such as those used in self-driving cars and security cameras.
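
The suppression decisions in the simulation can be checked by hand. This short sketch copies the box coordinates from the simulation and computes the IoU between each suppressed box and the box that was kept in its place. The helper `iou` and the dictionary labels are illustrative names only.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-6)

# Each entry pairs the kept box with one box that NMS suppressed.
pairs = {
    "dog 0.75 vs dog 0.90": ([100, 100, 200, 300], [105, 102, 198, 298]),
    "dog 0.60 vs dog 0.90": ([100, 100, 200, 300], [90, 90, 210, 310]),
    "person 0.82 vs person 0.85": ([300, 150, 400, 400], [310, 160, 410, 410]),
}

overlaps = {name: iou(kept, other) for name, (kept, other) in pairs.items()}
for name, value in overlaps.items():
    print(f"{name}: IoU = {value:.2f}")
```

All three overlaps are above the IoU threshold of 0.5 used in the simulation, which is why each of those boxes is treated as a duplicate.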