## I. Introduction to Object Detection and YOLO

Object detection is a computational task that takes an input image and provides **output bounding boxes** and **class labels**. For example, in an image of a person riding a horse, the output includes bounding boxes around the person and the horse, along with their respective class labels.

### A. Drawbacks of Previous Architectures

Models like R-CNN, Fast R-CNN, and Faster R-CNN (excluding single shot detectors) are considered **multi-stage networks**. They rely heavily on region proposals, which are generated either through methods like selective search or using a Region Proposal Network (RPN).

The significant drawbacks of these multi-stage models include:
1.  **Complexity and Pipeline:** They use a multi-stage pipeline where each component must be trained separately.
2.  **Redundant Processing:** The same image is processed many times. R-CNN, for example, runs the CNN separately on every region proposal extracted from the image.
3.  **Real-Time Limitations:** They are not suitable for real-time applications; even Faster R-CNN is not fast enough when compared to later models like YOLO.
4.  **Generalizability:** These architectures tend to generalize less well to new domains (e.g., medical imaging), whereas YOLO learns representations that transfer better to unseen domains.
5.  **Structure Example:** A typical multi-stage network involves shared convolution layers, region proposal networks, pooling, and classification steps.

### B. The Core Idea of YOLO (You Only Look Once)

YOLO aims to solve the limitations of previous models by functioning as a single, unified architecture.

*   **Single Pass:** The input image is passed **only once** through the architecture.
*   **Direct Output:** The architecture directly outputs the bounding boxes and class probabilities.
*   **Regression Problem:** YOLO models object detection as a **single-stage regression problem**. This is reflected in the loss calculation, which heavily uses squared errors.

## II. YOLO V1 Architecture and Process

<img src="../../images/yolo.png">
<img src="../../images/yolo2.png">

YOLO V1 is designed as a single network, with an architecture inspired by GoogLeNet.

### A. Input Preprocessing and Grid Division

1.  **Resizing:** The input image (e.g., 480x640) must be resized to a fixed shape, typically **448 x 448** pixels, before being passed to the CNN architecture.
2.  **Grid Division:** The image is divided into an $S \times S$ grid, where $\text{S} = 7$. This results in **49 grid cells**. Each cell in the $7 \times 7$ grid is $64 \times 64$ pixels.
3.  **Cell Responsibility:** A grid cell is responsible for detecting an object if the **center** of the ground truth bounding box falls within that cell.
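
The cell assignment above can be sketched in a few lines. This is a minimal illustration, assuming a square 448 x 448 input and box centers given in pixels; the function name `responsible_cell` and the edge clamping are choices made for this example, not part of the original YOLO code.

```python
def responsible_cell(cx, cy, image_size=448, S=7):
    """Return (row, col) of the grid cell that contains the box center (cx, cy)."""
    cell_size = image_size / S  # 64 pixels for a 448x448 image with S = 7
    # Clamp so that a center lying exactly on the right/bottom edge stays in the grid.
    col = min(int(cx // cell_size), S - 1)
    row = min(int(cy // cell_size), S - 1)
    return row, col


# A ground-truth center at x=200, y=31 falls in row 0, column 3.
print(responsible_cell(200, 31))  # (0, 3)
```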

### B. Bounding Box Normalization

Raw coordinates (X, Y, W, H) are typically large numbers (e.g., 200, 31). To make the model easier to train and learn, these coordinates are normalized to be within the range **0 to 1**.

*   **X and Y (Center Coordinates):** These are normalized relative to the coordinates of the **responsible grid cell**. The distance of the ground truth center from the grid cell's top-left corner is calculated (delta X, delta Y) and then divided by the cell size (64 pixels).
*   **Width (W) and Height (H):** These are normalized with respect to the **whole image width and height**.
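
The normalization described above could look like the following sketch. It assumes a square 448 x 448 image, pixel-space inputs in center format (cx, cy, w, h), and a hypothetical helper name `encode_box`; the example box size (112 x 224) is made up for illustration.

```python
def encode_box(cx, cy, w, h, image_size=448, S=7):
    """Convert a pixel-space box (center format) into YOLO V1 style targets.

    Returns (row, col, x, y, w_norm, h_norm) with x, y, w_norm, h_norm in [0, 1].
    """
    cell_size = image_size / S
    col = min(int(cx // cell_size), S - 1)
    row = min(int(cy // cell_size), S - 1)
    # Offset of the center from the cell's top-left corner, divided by the cell size.
    x = (cx - col * cell_size) / cell_size
    y = (cy - row * cell_size) / cell_size
    # Width and height are normalized by the whole image.
    return row, col, x, y, w / image_size, h / image_size


# Center (200, 31) lies 8 px right of the cell corner at x=192, so x = 8 / 64.
print(encode_box(200, 31, 112, 224))  # (0, 3, 0.125, 0.484375, 0.25, 0.5)
```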

### C. Target Vector and Prediction

1.  **Predictions Per Cell:** Every grid cell predicts $B=2$ bounding boxes.
2.  **Output Tensor:** The final prediction tensor is $S \times S \times 30$ ($7 \times 7 \times 30$).
3.  **Vector Structure (30-length vector):** The output vector for each cell consists of:
    *   **Box 1:** X, Y, W, H, Confidence Score ($C_1$).
    *   **Box 2:** X, Y, W, H, Confidence Score ($C_2$).
    *   **Class Probabilities:** 20 class probabilities ($P_1...P_{20}$), assuming training is done on the 20-class Pascal VOC dataset.
4.  **Scores:**
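    *   **Object Score (C):** Represents the model's confidence that an object is present in that bounding box. The training target is 0 when no object is present; when one is present, the original paper uses the IoU between the predicted and ground-truth box, which is often simplified to C=1.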
    *   **Class Probability (P):** Tells *what* class the object belongs to (e.g., person, horse).
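
As an illustration of the vector layout, the sketch below splits one cell's 30-length vector into its two boxes and its class probabilities. It assumes the ordering listed above (Box 1, Box 2, then 20 classes); actual implementations may order the entries differently, and `split_cell_vector` and the sample values are hypothetical.

```python
def split_cell_vector(vec, B=2, num_classes=20):
    """Split a per-cell output vector into B boxes (x, y, w, h, C) and class probabilities."""
    if len(vec) != B * 5 + num_classes:
        raise ValueError("unexpected vector length")
    boxes = [tuple(vec[5 * b: 5 * b + 5]) for b in range(B)]
    class_probs = list(vec[5 * B:])
    return boxes, class_probs


# Made-up example: two boxes followed by a one-hot class vector at index 14.
vec = [0.1, 0.2, 0.3, 0.4, 0.9] + [0.5, 0.5, 0.2, 0.2, 0.1] + [0.0] * 20
vec[10 + 14] = 1.0

boxes, class_probs = split_cell_vector(vec)
best_class = max(range(len(class_probs)), key=class_probs.__getitem__)
print(boxes[0], best_class)
```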

### D. YOLO V1 Architecture Details

The network has 24 convolutional layers interleaved with max-pooling layers, followed by two fully connected layers.

*   It uses basic **convolution and max pool** layers.
*   It does **not** include batch normalization. Dropout is used only after the first fully connected layer.
*   The network ends with **two fully connected layers** to convert the output feature maps into the final $7 \times 7 \times 30$ output tensor.

### E. Loss Function (Squared Errors)

<img src="../../images/yoloLoss.png">

YOLO V1 utilizes a complex loss function based on **squared errors** across all 49 grid cells ($S^2$).
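
For reference, the loss can be written out as below. This is a transcription of the sum-squared-error loss from the original YOLO paper into this section's notation ($C$ for the object score, $P$ for class probabilities), so the index conventions should be read as an assumption. Hatted and unhatted symbols form a prediction/target pair, $\mathbb{1}_{ij}^{\text{obj}}$ is 1 when box $j$ of cell $i$ is responsible for an object, $\mathbb{1}_{ij}^{\text{noobj}}$ is its complement, and $\mathbb{1}_{i}^{\text{obj}}$ is 1 when cell $i$ contains an object.

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=1}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( P_i(c) - \hat{P}_i(c) \right)^2
\end{aligned}
$$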

1.  **Handling Imbalance:** Since most cells (around 47 out of 49) do not contain an object, the "no object" loss tends to dominate. To prevent the model from being biased:
    *   **Lambda Parameters:** Two tunable weights ($\lambda$) are introduced.
    *   $\lambda_{\text{coord}}$ is set to 5, so the bounding box coordinate loss of cells that *do* contain objects counts five times more strongly.
    *   $\lambda_{\text{noobj}}$ is set to 0.5 to reduce the weight of the confidence loss from boxes that *do not* contain objects.
2.  **Bounding Box Loss Tweak:** Square roots are applied to the **width (W) and height (H)** before the squared error is computed, so the loss compares $\sqrt{W}$ and $\sqrt{H}$ rather than W and H directly. This is done because, when boxes are very small, even a small error in terms of pixels can represent a large percentage of error (e.g., 50%). Taking the square root ensures that the same absolute error is penalized more heavily on a small box than on a large one.
3.  **Total Loss Components:** The total loss includes:
    *   Loss corresponding to the bounding box (X, Y, W, H).
    *   Loss corresponding to the objectness score (C).
    *   Loss corresponding to the class predictions (P).
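
A small numeric check makes the effect of the square root concrete. The sketch below compares the same absolute width error (0.05 in normalized units) on a small and a large box; the function name `wh_error` and the sample sizes are made up for illustration.

```python
import math


def wh_error(true_size, pred_size, use_sqrt=True):
    """Squared error on a width or height, optionally taken on square roots."""
    if use_sqrt:
        return (math.sqrt(true_size) - math.sqrt(pred_size)) ** 2
    return (true_size - pred_size) ** 2


# Same absolute error of 0.05 on a small box and on a large box.
small = wh_error(0.05, 0.10)
large = wh_error(0.80, 0.85)
plain_small = wh_error(0.05, 0.10, use_sqrt=False)
plain_large = wh_error(0.80, 0.85, use_sqrt=False)

print(f"sqrt version:  small={small:.5f}  large={large:.5f}")
print(f"plain version: small={plain_small:.5f}  large={plain_large:.5f}")
```

With the square root, the small box is penalized roughly an order of magnitude more than the large one; without it, both errors are treated the same.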

### F. Prediction Finalization

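YOLO predicts two boxes per cell, and neighboring cells can fire on the same object, so a step is needed to select the best bounding box. This involves using **Non-Max Suppression (NMS)**.

1.  For each predicted box, the model calculates a combined confidence score: the **multiplication of the Object Score (C) and the maximum Class Probability (P)**.
2.  The box with the highest combined confidence score is kept, and any remaining box that overlaps it by more than an IoU threshold is discarded. This repeats on the remaining boxes until none are left.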
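
A greedy Non-Max Suppression pass could be sketched as follows. It assumes boxes in corner format `(x1, y1, x2, y2)`, scores already computed as C times the maximum class probability, and an IoU threshold of 0.5 chosen for illustration; in practice NMS is commonly run per class, which this sketch leaves out.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9 * 0.8, 0.6 * 0.7, 0.5 * 0.9]  # objectness C times max class probability
print(non_max_suppression(boxes, scores))  # [0, 2]
```

Box 1 overlaps box 0 with an IoU of about 0.68 and has a lower score, so it is suppressed.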

### G. Limitations of YOLO V1

Despite being much faster (e.g., 45 FPS vs. 7 FPS for Faster R-CNN), YOLO V1 has limitations:

*   **Detection Cap:** Although it predicts 98 boxes ($7 \times 7 \times 2$), each cell shares a single set of class probabilities, so it can detect at most **49 objects** per image due to the $7 \times 7$ grid constraint.
*   **Object Size/Groups:** It has difficulty predicting small objects, particularly those appearing in groups.
*   **Localization Error:** It suffers from high localization error, meaning boxes are often imprecisely placed.
*   **Low Recall:** Compared with region-proposal-based detectors, it misses more objects (objects incorrectly treated as background).

## III. YOLO V2 Improvements (YOLO9000)

<img src="https://www.maskaravivek.com/post/yolov2/featured.png">

YOLO V2 introduced several new techniques aimed at making the model faster, improving mean Average Precision (mAP), and allowing it to predict more objects.

### A. Architectural Changes

1.  **Fully Convolutional Architecture:** The last two fully connected layers used in V1 were **removed**. This drastically reduces the number of parameters. Because the network is then fully convolutional, it can also accept images of **varying resolution** (as long as each side is a multiple of 32).
2.  **Batch Normalization (BN):** BN was added to **every convolution layer**. This regularizes the model, which allows dropout to be removed without overfitting, and increases mean Average Precision (mAP) by about 2 points.
3.  **High Resolution Classifier:** V1 trained its classifier on $224 \times 224$ images and then jumped to $448 \times 448$ for object detection fine-tuning. V2 adds an intermediate step: *after* initial classification training, the classifier is fine-tuned at $448 \times 448$ for several epochs (10 in the paper), allowing it to better learn features at the detection resolution.
4.  **Darknet-19:** V2 introduced a new, lightweight backbone called Darknet-19, which has 19 convolutional layers and 5 max-pooling layers. Together with the other changes in this section, V2 gains more than 10 mAP points over V1 on Pascal VOC.

### B. Grid and Information Flow

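1.  **Grid Size Change:** The grid size was changed from $7 \times 7$ to **$13 \times 13$** (a $416 \times 416$ input downsampled by a factor of 32). An odd grid size was chosen so that the feature map has a single center cell: large objects tend to occupy the center of the image, and one central cell can take responsibility for them instead of four neighboring ones.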
2.  **Pass-Through Layer:** This is a type of **skip connection** introduced to carry forward high-resolution information. It takes feature maps from earlier layers (e.g., $26 \times 26$ feature map) and reshapes them to match the final tensor size (e.g., $13 \times 13$ with more channels) before concatenation. This allows the network to use finer-grained features, addressing localization issues.
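
The reshaping step in the pass-through layer is a space-to-depth operation: each 2 x 2 spatial block is stacked into the channel dimension, halving height and width and multiplying channels by 4. The sketch below shows the idea on a tiny 4 x 4 x 1 feature map using nested lists. The function name `space_to_depth` and the channel ordering are assumptions for illustration; the ordering used by the actual Darknet layer may differ.

```python
def space_to_depth(fmap, block=2):
    """Rearrange an [H][W][C] nested list into [H/block][W/block][C*block*block]."""
    height, width = len(fmap), len(fmap[0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    out = []
    for r in range(0, height, block):
        out_row = []
        for c in range(0, width, block):
            channels = []
            # Stack the pixels of one block x block patch along the channel axis.
            for dr in range(block):
                for dc in range(block):
                    channels.extend(fmap[r + dr][c + dc])
            out_row.append(channels)
        out.append(out_row)
    return out


# 4x4 map with one channel whose value encodes its position (0..15).
fmap = [[[r * 4 + c] for c in range(4)] for r in range(4)]
out = space_to_depth(fmap)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 4
print(out[0][0])  # [0, 1, 4, 5]
```

Applied to a $26 \times 26$ map, the same operation yields a $13 \times 13$ map with four times as many channels, which can then be concatenated with the final feature map.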

### C. Anchor Boxes and Predictions

1.  **Anchor Boxes:** YOLO V1 predicted at most 98 boxes per image ($7 \times 7 \times 2$), each regressed directly without a shape prior. V2 introduces the concept of anchor boxes: predefined box shapes that predictions are expressed relative to.
2.  **Data-Specific Anchor Boxes:** Instead of using hand-picked shapes, V2 uses **K-means clustering** on the ground truth bounding boxes of the training data set to determine the anchor shapes.
3.  **Prediction Count:** Based on clustering on Pascal VOC, V2 uses **five anchor boxes** per grid cell ($\text{B}=5$).
4.  **Vector Length:** Since $S=13$ and $B=5$, and each box carries 25 values, the output tensor is $13 \times 13 \times 125$ ($5 \times 25 = 125$).
5.  **Per-Box Classification:** In V2, **every bounding box** predicts its own class probabilities, rather than sharing a single class vector per cell as in V1. Each box prediction is a 25-length vector (X, Y, W, H, C + 20 class probabilities).
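6.  **Coordinate Calculation:** Box width and height (W, H) are now predicted **with respect to the anchor boxes**, as scale factors applied to the anchor's dimensions. The center (X, Y) is still predicted as an offset within the responsible grid cell, constrained to the range 0 to 1 by a sigmoid.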
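
A minimal sketch of how one V2 box prediction can be decoded, following the parameterization given in the YOLO9000 paper: the center offsets pass through a sigmoid and are added to the cell index, while width and height scale the anchor through an exponential. The function name `decode_box`, the convention that anchors are expressed in grid-cell units, and the sample anchor (2.0, 3.0) are assumptions for illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t, cell_col, cell_row, anchor_w, anchor_h, S=13):
    """Decode raw outputs (tx, ty, tw, th, to) into a box normalized to [0, 1]."""
    tx, ty, tw, th, to = t
    bx = (cell_col + sigmoid(tx)) / S   # center x: offset inside the cell
    by = (cell_row + sigmoid(ty)) / S   # center y: offset inside the cell
    bw = anchor_w * math.exp(tw) / S    # width: anchor width scaled by exp(tw)
    bh = anchor_h * math.exp(th) / S    # height: anchor height scaled by exp(th)
    confidence = sigmoid(to)            # objectness score
    return bx, by, bw, bh, confidence


# All-zero raw outputs in the center cell reproduce the anchor, centered in the image.
box = decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell_col=6, cell_row=6,
                 anchor_w=2.0, anchor_h=3.0)
print(box)

# Output depth per cell: 5 anchors x (4 coordinates + 1 objectness + 20 classes).
depth = 5 * (5 + 20)
print(depth)  # 125
```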

## IV. YOLO V3 Mention

YOLO V3 is characterized as an **incremental update** compared to V2. It introduced a new, deeper backbone called **Darknet-53**, which has 53 convolutional layers and provides better mAP than Darknet-19.