## Object Detection and R-CNN Architecture

### I. Introduction and Course Outline

The session focuses on advanced computer vision topics.

#### A. Course Content and Timeline
1.  **Object Detection Series:** The series will cover R-CNN, Fast R-CNN, and Faster R-CNN.
    *   **R-CNN:** Described in the lecture as a "very old world architecture" and an "old technique", dating from an earlier period of deep learning (2014).
    *   **Advanced Models:** Fast R-CNN and Faster R-CNN are considered slightly more advanced models compared to R-CNN.
2.  **State-of-the-Art (SOTA):** YOLO (You Only Look Once) is the current state-of-the-art technique. The course will cover the basic understanding of YOLO, the ideas behind its modifications, and possibly versions V5 or V10.
3.  **Historical Timeline (Object Detection):** Object detection models began appearing around 2013-2014, once CNN-based image classification (e.g., VGG16) had matured.
    *   1998: LeNet (Image Classification foundation).
    *   2014: R-CNN released.
    *   2015: Fast R-CNN.
    *   2018: YOLO V3.
    *   Current: YOLO V10.

#### B. Projects and Deployment
1.  **YOLO Project:** A multi-class object detection project will be undertaken using YOLO.
2.  **Deployment:** A dedicated session will cover deploying a cassava leaf detection model behind a cloud API, primarily on Google Cloud Platform (GCP).
3.  **Final Project (Self-Driving Car):** This project will be the last but one topic.
    *   **Scope:** The project will incorporate only one or two techniques.
    *   **Comparison to Industry Models:** Complex commercial systems like Tesla's use a "very strong hybrid model" combining segmentation, lane detection, classification, object detection, and signal finding to predict acceleration, braking, and turning.

### II. Foundational Concepts: Classification, Localization, and Detection

#### A. Object Detection Definition
Object detection is a part of computer vision where the goal is not only to classify the image but also to classify the objects present inside the image and build a bounding box for every object.

#### B. Classification vs. Localization vs. Detection
1.  **Classification (Normal CNN):** Classifies what kind of image is present (e.g., house, dog, person). It provides a class prediction/confidence score (e.g., probability 0.9) but does not specify location or count of objects.
2.  **Object Localization:** Classification plus location: the model outputs a single class label together with one bounding box. It is used when there is **only one object** in focus in the image.
3.  **Object Detection (Goal):** Requires detecting **all objects and all classes** within an image, irrespective of the object's shape or size. Output includes the label, confidence score, and bounding box for every prediction.

#### C. Segmentation vs. Object Detection
1.  **Object Detection:** Works on the **object level** using a bounding box. The bounding box often covers background pixels that are not part of the object.
2.  **Segmentation:** An advanced topic that works on the **pixel level**, giving the output as a mask.
3.  **Hybrid Models:** Segmentation and object detection models can be mixed to create a hybrid model.

### III. Bounding Box Details

#### A. Shape and Alignment
1.  **Shape:** Bounding boxes are generally **rectangular or square** because distances and overlaps are easier to calculate for such shapes.
2.  **Axis Alignment:** The focus is on bounding boxes that are **aligned to the X and Y axes** (axis-aligned bounding boxes), meaning they are parallel to the axes.

#### B. Goal and Perfection
The goal is to have the bounding box as **close to the object's border as possible** to minimize covering background data, creating the "best bounding box". The box must perfectly cover the whole content of the object.

#### C. Representations (Formats)
Any representation must allow derivation of all four coordinates.
1.  **Corner Coordinates:** $X_1, Y_1$ (top-left) and $X_2, Y_2$ (bottom-right).
    *   Center $X$ is calculated as $(X_1 + X_2)/2$.
    *   Center $Y$ is calculated as $(Y_1 + Y_2)/2$.
2.  **Center Coordinates & Dimensions (YOLO Format):** Center $X$, Center $Y$, Width $W$, and Height $H$.
3.  **Format Conversions:** Multiple formats exist (e.g., COCO format, YOLO format), requiring transformations between them.
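
The two representations above carry the same information, so converting between them is plain arithmetic. The following is a minimal sketch; the function names `corners_to_center` and `center_to_corners` are illustrative and not taken from the lecture.

```python
def corners_to_center(x1, y1, x2, y2):
    """Convert (top-left, bottom-right) corners to (center_x, center_y, width, height)."""
    width = x2 - x1
    height = y2 - y1
    return (x1 + x2) / 2, (y1 + y2) / 2, width, height


def center_to_corners(cx, cy, width, height):
    """Convert (center_x, center_y, width, height) back to corner coordinates."""
    return cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2


box = (50, 100, 150, 300)              # x1, y1, x2, y2 in pixels
center = corners_to_center(*box)       # center plus dimensions
restored = center_to_corners(*center)  # round trip back to corners
```

Because either format can be derived from the other, the choice of format is mostly a matter of what the labeling tool or the model expects.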

### IV. Early Detection Models and Limitations

#### A. CNN Modification for Single Object Detection
This modification is suitable only if the image contains a single object.
1.  **Architecture:** The standard CNN is followed by a flatten layer.
2.  **Branch 1 (Classification):** Predicts the confidence/objectness score (e.g., 0.9 probability) using a dense layer and **cross-entropy loss**.
3.  **Branch 2 (Regression):** A separate fully connected layer is added with **four output units** to predict the four coordinates (X, Y, W, H). This uses **Mean Square Error (MSE) loss**.
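
To make the two-branch idea concrete, here is a small numeric sketch of how the two losses could be combined for one training image. The numbers, the three-class setup, and the `box_weight` factor are assumptions for illustration; the lecture only states which loss each branch uses.

```python
import math


def cross_entropy(class_probs, true_index):
    """Cross-entropy for one sample, given softmax probabilities."""
    return -math.log(class_probs[true_index])


def mean_squared_error(predicted, target):
    """MSE over the four box coordinates."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Branch 1 output: probabilities over three hypothetical classes.
class_probs = [0.9, 0.05, 0.05]
true_class = 0

# Branch 2 output: (x, y, w, h), here assumed to be normalized to [0, 1].
predicted_box = [0.52, 0.48, 0.30, 0.40]
true_box = [0.50, 0.50, 0.30, 0.40]

box_weight = 1.0  # hypothetical weighting between the two losses
cls_loss = cross_entropy(class_probs, true_class)
box_loss = mean_squared_error(predicted_box, true_box)
total_loss = cls_loss + box_weight * box_loss
```

Both branches share the same CNN features, so minimizing the summed loss trains the backbone for classification and localization at once.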

#### B. Sliding Window Technique (SWT)
SWT was an attempt to solve the multi-object problem by passing subsets (windows) of the image to the model.
1.  **Class Modification:** Requires adding a **background class** to the focused classes (total classes $C+1$), as many windows will not contain a focused object.
2.  **Feasibility Issue:** SWT is not feasible because possible combinations are in the millions.
    *   Example: A 200x200 image could generate around **400 million boxes**.
    *   This requires "insane amounts of training" time and GPU resources.
3.  **Solution Needed:** Techniques were required to decrease the number of inputs/regions sent to the model for predictions.
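
The "400 million" figure can be checked with a short calculation. The sketch below assumes that a box is any axis-aligned rectangle whose edges lie on pixel boundaries, so a box is fixed by choosing two of the 201 vertical grid lines and two of the 201 horizontal ones.

```python
from math import comb


def count_axis_aligned_boxes(width, height):
    """Number of distinct axis-aligned boxes with edges on pixel boundaries."""
    return comb(width + 1, 2) * comb(height + 1, 2)


total_boxes = count_axis_aligned_boxes(200, 200)
# Fraction of candidates removed if only 2,000 regions are kept.
reduction = 1 - 2000 / total_boxes
```

For a 200x200 image this gives 404,010,000 candidate boxes, which is why exhaustive search is impractical.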

# V. R-CNN Architecture (Region-based CNN)

R-CNN (Region-based CNN, short for "Regions with CNN features") was developed to drastically reduce the number of inputs, from millions of potential regions to around **2,000 proposals per image**.
It consists of **three main modules**, each responsible for a different stage of object detection.

---

## A. Module 1: Selective Search (Region Proposals)

1. **Purpose:** Generates *category-independent* region proposals that might contain objects.
2. **Reduction:** Reduces the number of boxes from millions to approximately **2,000 proposals per image**, achieving a drastic **99.9% reduction**.
3. **Process:**

   * The image is segmented into many small regions.
   * Similar regions are merged based on various **similarity factors**:

     * Color similarity
     * Texture similarity
     * Size similarity
     * Shape similarity
     * A linear combination of the above factors
   * **Output:** Selective Search produces candidate regions but **does not assign any class labels**.

---

## B. Module 2: Feature Extraction and Classification

R-CNN uses a **pretrained CNN** (e.g., AlexNet or VGG) for feature extraction and **SVMs** for final classification.

### 1. Preprocessing and Resizing

* **Labeling issue:** The ~2,000 region proposals are initially **unlabeled**.
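* **Fixed input size:** Since the CNN requires fixed-size inputs (e.g., 227×227), every proposal is warped to that size. In the original paper the box is first *dilated* to include a few pixels of surrounding context; the warp itself does not preserve the aspect ratio.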

### 2. Labeling Region Proposals using IOU

To train the model, each region proposal is assigned a label based on **IOU (Intersection over Union)** with ground-truth boxes.

* **IOU > 0.5:** The proposal is labeled as the object's class (e.g., Person, Car).
* **IOU < 0.3:** Labeled as **background**.
* **0.3 ≤ IOU ≤ 0.5:** Ignored during training.
* **Multiple overlaps:** If a proposal overlaps multiple objects, it takes the label of the one with the highest IOU.
* **Sampling:** Because most regions are background (~90%), only a subset of background samples are used to balance the dataset.

🔹 **Purpose:** This IOU labeling step *does not perform classification*; it only creates **training data** (features + labels) for the classifier.
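
The labeling rules above can be sketched in a few lines. This is an illustrative implementation using the thresholds from these notes; boxes are assumed to be in `(x1, y1, x2, y2)` corner format, and the names `iou` and `label_proposal` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposal(proposal, ground_truths, pos_thr=0.5, neg_thr=0.3):
    """Return a class name, "background", or None (ignored during training).

    ground_truths is a list of (class_name, box) pairs.
    """
    best_label, best_iou = None, 0.0
    for class_name, gt_box in ground_truths:
        overlap = iou(proposal, gt_box)
        if overlap > best_iou:  # keep the ground truth with the highest IOU
            best_label, best_iou = class_name, overlap
    if best_iou > pos_thr:
        return best_label
    if best_iou < neg_thr:
        return "background"
    return None


ground_truths = [("person", (0, 0, 100, 100)), ("car", (200, 200, 300, 300))]
label_a = label_proposal((10, 10, 100, 100), ground_truths)      # IOU 0.81
label_b = label_proposal((400, 400, 450, 450), ground_truths)    # no overlap
label_c = label_proposal((0, 0, 100, 40), ground_truths)         # IOU 0.40
```

The three proposals come out as `"person"`, `"background"`, and `None` respectively, matching the three IOU bands.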

### 3. Training with CNN Features and SVMs

Once proposals are labeled:

1. Each region is passed through the CNN to extract a **4096-dimensional feature vector** (from fully connected layers).
2. These features, along with their IOU-based labels, are used to **train a set of Support Vector Machines (SVMs)**.

* **Why SVMs?**

  * CNNs were pretrained only for *image-level classification* (e.g., ImageNet).
  * R-CNN uses SVMs as separate classifiers because:

    * They perform well on **high-dimensional feature spaces**.
    * They are effective for **binary classification with limited data**.
    * Training was easier in separate stages (CNN → SVM → Regressor) given 2014's hardware and software limits.

* **Architecture:** One **SVM per object class** (e.g., 20 classes = 20 SVMs).
  Each SVM is trained as:
  `object class` vs. `background`.

* **SVM Training Threshold:**

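  * Positive samples: the ground-truth boxes of the class. In the original paper, the SVM stage uses only these as positives, which is stricter than the IOU > 0.5 labeling rule described earlier.
  * Negative samples (background): proposals with IOU < 0.3 against every ground-truth box of the class.
  * Proposals in between are ignored.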

💡 **Remember:** IOU labeling creates the *supervised dataset*, while SVMs **learn to classify** those proposals for unseen images.

---

## C. Module 3: Bounding Box Regressor (Refinement)

A separate **bounding box regression model** refines the coordinates predicted by Selective Search.

### 1. Goal

Convert an approximate region proposal (P) into a more accurate box (G) that tightly encloses the object.

### 2. Inputs and Targets

* Proposal box: $P = (x_p, y_p, w_p, h_p)$
* Ground truth box: $G = (x_g, y_g, w_g, h_g)$

**Target offset values (regression targets) for training:**

$$
t_x^* = \frac{x_g - x_p}{w_p}, \quad
t_y^* = \frac{y_g - y_p}{h_p}, \quad
t_w^* = \log\left(\frac{w_g}{w_p}\right), \quad
t_h^* = \log\left(\frac{h_g}{h_p}\right)
$$

These represent the relative corrections the model should predict to align the proposal with the ground truth.

### 3. Mechanism

* The CNN extracts features from each proposal.
* A fully connected layer predicts four values: $(t_x', t_y', t_w', t_h')$.
* During inference, these predicted offsets are applied to the proposal box:

  $$
  x' = t_x' \cdot w_p + x_p, \quad
  y' = t_y' \cdot h_p + y_p, \quad
  w' = w_p \cdot e^{t_w'}, \quad
  h' = h_p \cdot e^{t_h'}
  $$

  This gives the refined bounding box.
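
The target and inference formulas are inverses of each other, which a short round trip makes visible. In this sketch boxes are `(x, y, w, h)` tuples with `(x, y)` the box center, and the numeric values are made up for illustration.

```python
import math


def encode_offsets(proposal, ground_truth):
    """Regression targets (t_x*, t_y*, t_w*, t_h*) from proposal P and ground truth G."""
    x_p, y_p, w_p, h_p = proposal
    x_g, y_g, w_g, h_g = ground_truth
    return (
        (x_g - x_p) / w_p,
        (y_g - y_p) / h_p,
        math.log(w_g / w_p),
        math.log(h_g / h_p),
    )


def apply_offsets(proposal, offsets):
    """Refined box (x', y', w', h') from a proposal and predicted offsets."""
    x_p, y_p, w_p, h_p = proposal
    t_x, t_y, t_w, t_h = offsets
    return (
        t_x * w_p + x_p,
        t_y * h_p + y_p,
        w_p * math.exp(t_w),
        h_p * math.exp(t_h),
    )


proposal = (100.0, 100.0, 50.0, 80.0)
ground_truth = (110.0, 96.0, 60.0, 72.0)

targets = encode_offsets(proposal, ground_truth)
# A perfect regressor would predict exactly the targets, recovering G.
recovered = apply_offsets(proposal, targets)
```

Dividing by the proposal's width and height and taking logarithms of the size ratios keeps the targets scale-independent, so the same regressor works for small and large proposals.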

### 4. Training Condition

The regressor is trained only on proposals with **IOU ≥ 0.6** with ground truth boxes.

### 5. Summary Table for Regression

| Symbol                         | Meaning                     | Phase                | Formula                 |
| :----------------------------: | --------------------------- | -------------------- | ----------------------- |
| $t_x^*, t_y^*, t_w^*, t_h^*$   | Ground truth target offsets | Training             | From P, G               |
| $t_x', t_y', t_w', t_h'$       | Predicted offsets           | Training & Inference | From CNN                |
| $x', y', w', h'$               | Refined box                 | Inference            | Apply predicted offsets |

---

### 🧭 Summary of R-CNN Pipeline

| Stage | Component              | Purpose                               | Output                        |
| ----- | ---------------------- | ------------------------------------- | ----------------------------- |
| 1     | Selective Search       | Generate ~2,000 region proposals      | Candidate regions (no labels) |
| 2     | CNN                    | Extract high-level features           | 4096-D feature vectors        |
| 3     | SVM                    | Classify features into object classes | Object labels                 |
| 4     | Bounding Box Regressor | Refine box coordinates                | Final bounding boxes          |

---

### ⚠️ Limitations of R-CNN

* **Very slow inference:** Each of 2,000 regions is processed by the CNN independently.
* **Large storage needs:** All region features are stored on disk before SVM training.
* **Multi-stage training:** CNN, SVMs, and Regressor are trained separately (not end-to-end).

These issues were later solved by **Fast R-CNN** and **Faster R-CNN**.


# Lecture 2

## 1. Introduction

Object detection is the task of identifying objects within an image along with their bounding boxes. Unlike image classification (which predicts a single label for an image) or object localization (which predicts a bounding box for a single object), object detection identifies multiple objects and their locations.

Key concepts:

* **Classification**: Predicting a single label for the whole image.
* **Localization**: Predicting bounding box of one object in the image.
* **Object Detection**: Multiple objects with bounding boxes.

---

## 2. Bounding Boxes

Bounding boxes can be represented in multiple ways:

1. **Corner Points Representation**: `(x_min, y_min, x_max, y_max)`
2. **Center + Width & Height**: `(x_center, y_center, width, height)`

**Coordinate Systems**:

* Pixel values or normalized coordinates.
* Axis-aligned boxes are used for simplicity (no rotation).

**Labeling Formats**:

* **COCO format**: `[x_min, y_min, width, height]` (pixel values)
* **YOLO format**: `[x_center, y_center, width, height]` (normalized)
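
Converting between the two labeling formats needs the image size, because YOLO coordinates are normalized. The sketch below assumes COCO boxes are `[x_min, y_min, width, height]` in pixels; the helper names are illustrative.

```python
def coco_to_yolo(bbox, image_width, image_height):
    """[x_min, y_min, w, h] in pixels -> (x_center, y_center, w, h) normalized to [0, 1]."""
    x_min, y_min, width, height = bbox
    return (
        (x_min + width / 2) / image_width,
        (y_min + height / 2) / image_height,
        width / image_width,
        height / image_height,
    )


def yolo_to_coco(bbox, image_width, image_height):
    """Normalized (x_center, y_center, w, h) -> [x_min, y_min, w, h] in pixels."""
    x_center, y_center, width, height = bbox
    box_w = width * image_width
    box_h = height * image_height
    return (
        x_center * image_width - box_w / 2,
        y_center * image_height - box_h / 2,
        box_w,
        box_h,
    )


yolo_box = coco_to_yolo((100, 50, 200, 100), image_width=400, image_height=200)
coco_box = yolo_to_coco(yolo_box, image_width=400, image_height=200)
```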

---

## 3. Object Detection Methods

### 3.1 Sliding Window

* Slide multiple windows of different aspect ratios across the image.
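* Each window is classified as one of the object classes or as **background** using a CNN.
* Works well for single-object images.
* **Limitations**: Computationally expensive, because covering objects of different sizes and positions means classifying a very large number of windows.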
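
A rough sketch of window generation shows how quickly the count grows. The window sizes and the stride below are arbitrary choices for illustration, not values from the lecture.

```python
def sliding_windows(image_width, image_height, window_width, window_height, stride):
    """Yield (x1, y1, x2, y2) windows that fit entirely inside the image."""
    for y in range(0, image_height - window_height + 1, stride):
        for x in range(0, image_width - window_width + 1, stride):
            yield (x, y, x + window_width, y + window_height)


# Three hypothetical window shapes: square, wide, tall.
window_sizes = [(64, 64), (128, 64), (64, 128)]

square_windows = list(sliding_windows(200, 200, 64, 64, stride=32))
all_windows = [
    window
    for width, height in window_sizes
    for window in sliding_windows(200, 200, width, height, stride=32)
]
```

Even this coarse setting yields 55 CNN evaluations for a small 200x200 image; finer strides and more scales multiply that number rapidly.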

---

### 3.2 Region-Based CNN (RCNN)

RCNN improves efficiency by using region proposals instead of sliding windows over the entire image.

**Pipeline**:

1. **Region Proposals**:

   * Generated using **Selective Search**.
   * Divides image into multiple regions based on similarity (color, texture, size).
   * Produces candidate bounding boxes likely to contain objects.

   ```
   Example ASCII:
   Original Image
   +---------------------+
   |       Object 1      |
   |   +-------+         |
   |   | Box A |         |
   |   +-------+         |
   |   +-------+         |
   |   | Box B |         |
   |   +-------+         |
   +---------------------+
   ```

2. **Feature Extraction**:

   * Resize region proposals to fixed size.
   * Pass through pre-trained CNN (e.g., VGG16) to extract features.

3. **Object Classification**:

   * Use **class-specific SVMs** to classify proposals into object classes.
   * **Hard negative mining** used to improve training.

4. **Bounding Box Regression**:

   * A regressor on top of the CNN features predicts offsets to refine bounding boxes.
   * Learnable parameters adjust box to tightly fit object.

---

### 3.3 Non-Maximum Suppression (NMS)

Multiple proposals often predict overlapping boxes for the same object. **NMS** filters out redundant boxes.

**Steps**:

1. Sort predicted boxes by object confidence score (descending).
2. Select the highest-scoring box.
3. Compute **Intersection over Union (IoU)** with remaining boxes.
4. Remove boxes with IoU > threshold (e.g., 0.5).
5. Repeat for the next highest box.

**Diagram**:

```
Boxes: [B1, B2, B3, B4]
Scores: [0.9, 0.85, 0.34, 0.1]

Step 1: Select B1 (score 0.9)
Step 2: Remove overlapping boxes B2, B3 if IoU > 0.5
Step 3: Select next highest remaining box B4
```
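
The steps above can be sketched as a greedy loop. This is a minimal single-class version; the box coordinates are invented so that B2 and B3 overlap B1 heavily while B4 sits elsewhere, mirroring the diagram.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)  # highest-scoring remaining box
        kept.append(best)
        # Drop every remaining box that overlaps the selected one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept


boxes = [
    (10, 10, 110, 110),    # B1
    (15, 15, 115, 115),    # B2, overlaps B1
    (20, 20, 120, 120),    # B3, overlaps B1
    (200, 200, 260, 260),  # B4, separate object
]
scores = [0.9, 0.85, 0.34, 0.1]
kept = non_max_suppression(boxes, scores)
```

In a multi-class detector this routine would typically be run once per class, so that overlapping boxes of different classes do not suppress each other.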

---

## 4. Evaluation Metrics

### 4.1 Precision & Recall

* **Precision**: Fraction of predicted boxes that are correct.
  $$\text{Precision} = \frac{TP}{TP + FP}$$
* **Recall**: Fraction of ground truth boxes correctly detected.
  $$\text{Recall} = \frac{TP}{TP + FN}$$

**Example**:

* Ground truth boxes: 3
* Predicted boxes: 3, of which 2 are correct (TP = 2, FP = 1, FN = 1)
* Precision = 2/3, Recall = 2/3
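
To see where TP, FP, and FN come from, the sketch below matches predictions to ground truth boxes greedily by IoU for a single class. The matching rule (highest score first, each ground truth used at most once, IoU of at least 0.5) is a common convention assumed here, and the boxes are invented.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_matches(predictions, ground_truths, iou_threshold=0.5):
    """predictions: list of (score, box). Returns (tp, fp, fn)."""
    matched = set()
    tp = fp = 0
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_index, best_iou = -1, 0.0
        for index, gt_box in enumerate(ground_truths):
            if index in matched:
                continue  # each ground truth can be claimed only once
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_index, best_iou = index, overlap
        if best_index >= 0 and best_iou >= iou_threshold:
            matched.add(best_index)
            tp += 1
        else:
            fp += 1
    fn = len(ground_truths) - len(matched)
    return tp, fp, fn


ground_truths = [(0, 0, 100, 100), (200, 0, 300, 100), (0, 200, 100, 300)]
predictions = [
    (0.9, (5, 5, 100, 100)),       # matches the first ground truth
    (0.8, (210, 0, 300, 100)),     # matches the second ground truth
    (0.6, (400, 400, 450, 450)),   # matches nothing
]
tp, fp, fn = count_matches(predictions, ground_truths)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```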

### 4.2 Average Precision (AP) & Mean Average Precision (mAP)

* **AP**: Area under the precision-recall curve for one class.
* **mAP**: Mean of APs across all classes.
* IoU thresholds: Usually evaluated at 0.5 to 0.95 with step 0.05.
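
A small sketch of AP for one class, computed as the area under an interpolated precision-recall curve. Evaluation toolkits differ in interpolation details, so treat this as one reasonable variant rather than a reference implementation; the detections are invented.

```python
def average_precision(detections, num_ground_truths):
    """detections: list of (score, is_true_positive). Returns AP for one class."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_true_positive in ranked:
        if is_true_positive:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truths)
        precisions.append(tp / (tp + fp))
    # Interpolate: precision at each point becomes the best precision to its right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas over the recall steps.
    area, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


def mean_average_precision(ap_per_class):
    """mAP is the plain mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Three detections for one class with 3 ground truth boxes: hit, miss, hit.
ap_fish = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truths=3)
map_value = mean_average_precision({"fish": ap_fish, "shark": 1.0})
```

For the IoU range mentioned above, the same computation would be repeated at each threshold (which changes which detections count as true positives) and the results averaged.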

---

## 5. Fast RCNN

**Limitations of RCNN**:

* Multi-stage training (CNN → SVM → Bounding Box Regression).
* Slow at test time (47s per image with VGG16).
* High storage requirements for the cached features of region proposals.

**Fast RCNN Improvements**:

1. **Single-stage training**: CNN, classification, and bounding box regression in one network.
2. **ROI Pooling**: Converts region proposals of varying sizes into fixed-size feature maps.
3. **Faster feature extraction**: Pass full image once through CNN, then extract features for all proposals.

**Pipeline**:

1. Pass full image through CNN → get convolutional feature map.
2. ROI pooling → fixed-size features for each proposal.
3. Fully connected layers → two branches:

   * Classification
   * Bounding box regression

**Advantages**:

* Faster training and inference.
* Less storage required.
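* Single-stage training with end-to-end learning of the classifier and regressor (region proposals still come from a separate Selective Search step).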

---

## 6. Summary of RCNN Variants

| Model     | Feature Extraction | Bounding Box | SVM | Training Time | Test Time |
| --------- | ------------------ | ------------ | --- | ------------- | --------- |
| RCNN      | CNN per proposal   | Yes          | Yes | 2.5 GPU-days  | 47s/img   |
| Fast RCNN | CNN once           | Yes          | No  | 9.5 hrs       | 1.8s/img  |

---


### References

1. Girshick, R., et al. “Rich feature hierarchies for accurate object detection and semantic segmentation.” CVPR 2014.
2. Girshick, R. “Fast R-CNN.” ICCV 2015.
3. Ren, S., et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” NIPS 2015.

---

# Lecture 3

## Detailed Lecture Notes: Faster R-CNN: Theory, Data Visualisation, and Evolution

### I. Revision of R-CNN Fundamentals

#### A. Non-Max Suppression (NMS)
NMS is used to **suppress bounding boxes that are not relevant**. The goal is to choose only one accurate bounding box per object when multiple boxes are predicted.

1.  **Process:** All bounding boxes are sorted based on their **objectness score** in descending order.
2.  The box with the highest score is selected.
3.  The **Intersection Over Union (IOU)** is calculated between the highest scoring box and all other bounding boxes.
4.  Bounding boxes overlapping significantly (e.g., IOU greater than 0.5) are removed, as they are likely detecting the same object.
5.  This process is repeated for the next highest remaining bounding box until all relevant boxes are selected.
6.  NMS is applied independently for every different class of objects present in the image.

#### B. Bounding Box Evaluation Metrics
The primary metric used to evaluate bounding box accuracy is **Mean Average Precision (mAP)**.

*   **Precision:** Measures how many of the predicted boxes are accurately correct, relative to all predicted boxes.
*   **Recall:** Measures how many of the ground truths are accurately predicted, relative to all ground truths.
*   **Average Precision (AP):** Defined as the **area under the precision-recall curve**.
*   **mAP:** Calculated by finding the average precision for every class of objects in the dataset.
*   **mAP at varying IOU:** Object detection models are often evaluated by calculating mAP across a range of IOU threshold values (e.g., starting at 0.5, stepping up by 0.05, until 0.95), and then calculating the mean of all those average precisions. This is referred to as mAP@[0.5:0.95].

### II. Evolution to Fast R-CNN

#### A. Motivation
The original R-CNN was a complex, multi-stage network that was very slow, taking approximately **48.5 seconds** per image prediction (almost 1 minute). The network had multiple complex components, including CNNs and SVMs.

#### B. Fast R-CNN Architecture
Fast R-CNN aimed to reduce complexity and time:

*   **Change in Order:** Instead of cropping each proposal and running the CNN on it separately, the whole image is passed through the CNN once to generate a **feature map**. Selective Search still runs on the input image, and each proposal is then projected onto this downsized feature map.
*   **ROI Pooling:** This concept was introduced because Selective Search generates proposals of many different shapes, preventing the use of a unified fully connected layer. ROI (Region of Interest) pooling takes these input proposals and transforms them into a **single, fixed shape** using max pooling.
*   **Architecture Structure:** CNN $\rightarrow$ ROI Pooling $\rightarrow$ Flatten $\rightarrow$ Fully Connected Layers $\rightarrow$ Classification/Bounding Box branches.
*   **Speed Improvement:** Fast R-CNN drastically reduced the network's prediction time to **0.3 seconds**, not counting the time spent generating region proposals.
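
The core idea of ROI pooling, turning regions of different sizes into one fixed output shape with max pooling, can be sketched on a tiny single-channel feature map. The bin-boundary rounding used here (floor for the start, ceiling for the end) is an assumption; real implementations differ in such details and operate on many channels at once.

```python
def ceil_div(numerator, denominator):
    """Integer ceiling division."""
    return -(-numerator // denominator)


def roi_max_pool(feature_map, roi, out_height, out_width):
    """Max-pool one region of a 2D feature map into an out_height x out_width grid.

    roi is (x1, y1, x2, y2) in feature-map cells, with x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    roi_height, roi_width = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_height):
        row_start = y1 + (i * roi_height) // out_height
        row_end = y1 + ceil_div((i + 1) * roi_height, out_height)
        pooled_row = []
        for j in range(out_width):
            col_start = x1 + (j * roi_width) // out_width
            col_end = x1 + ceil_div((j + 1) * roi_width, out_width)
            pooled_row.append(
                max(
                    feature_map[r][c]
                    for r in range(row_start, row_end)
                    for c in range(col_start, col_end)
                )
            )
        pooled.append(pooled_row)
    return pooled


# A 4x6 feature map whose values are 0..23 in row-major order.
feature_map = [[row * 6 + col for col in range(6)] for row in range(4)]

pooled_large = roi_max_pool(feature_map, (0, 0, 6, 4), out_height=2, out_width=2)
pooled_small = roi_max_pool(feature_map, (1, 1, 4, 4), out_height=2, out_width=2)
```

Both regions, one 6x4 and one 3x3, come out as 2x2 grids, which is what allows a single fully connected layer to follow.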

#### C. The Bottleneck
Even in Fast R-CNN, the **Selective Search** algorithm remained the primary bottleneck, still taking **1.5 seconds** to extract region proposals. This resulted in a total prediction time of about 1.8 seconds (0.5 FPS), which is **not real-time**.

### III. Faster R-CNN Architecture

<img src="../images/faterrc.png">

#### A. Motivation and Key Improvement
Faster R-CNN sought to solve the Selective Search bottleneck to achieve a total detection time of less than 1 second, necessary for real-time applications like video object detection.

The core architectural change was the introduction of the **Region Proposal Network (RPN)** to replace Selective Search.

#### B. RPN and Shared Architecture

<img src="../images/rpn.png">

Faster R-CNN utilizes a **shared CNN architecture** (e.g., VGG16, ResNet, MobileNet).

*   The image is sent to the CNN once.
*   The resulting feature map is branched out to two components: the RPN and the downstream ROI Pooling layer.
*   The RPN loss and the classification loss are calculated separately, but the overall loss is used to **update the shared weights** of the backbone CNN architecture.
*   **RPN vs. Selective Search:** Selective Search is based on color, texture, and shape similarity and is **not a learnable algorithm**. RPN, in contrast, consists of convolution layers, making it rely on **learnable parameters** that adapt based on the data and objects.

#### C. Anchor Boxes
Anchor boxes are central to how RPN generates proposals.

1.  **Intuition:** The premise is that a single pixel (or cell) in the feature map, derived from passing through multiple convolutions, could be responsible for detecting an object.
2.  **Generation:** At every sliding window position (pixel) in the feature map, multiple anchor boxes are generated.
    *   This is typically done using **three different scales** and **three different aspect ratios** (e.g., square, horizontal rectangle, vertical rectangle).
    *   This yields $K=9$ anchor boxes at each sliding position.
3.  **Output:** For a feature map of size $W \times H$, the total number of anchor boxes generated is $W \times H \times K$ (e.g., over 20,000 potential boxes for a 60x40 map).
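
Anchor generation can be sketched as three nested loops. The scales (128, 256, 512 pixels), the aspect ratios (0.5, 1, 2), and the stride of 16 are assumptions taken from commonly used Faster R-CNN defaults, not values stated in these notes.

```python
import math


def make_anchors(feature_width, feature_height, stride,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors in image pixels for every feature-map cell."""
    anchors = []
    for row in range(feature_height):
        for col in range(feature_width):
            # Center of this feature-map cell, mapped back to image coordinates.
            center_x = (col + 0.5) * stride
            center_y = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # ratio = height / width, area = scale ** 2
                    width = scale / math.sqrt(ratio)
                    height = scale * math.sqrt(ratio)
                    anchors.append((
                        center_x - width / 2,
                        center_y - height / 2,
                        center_x + width / 2,
                        center_y + height / 2,
                    ))
    return anchors


anchors = make_anchors(feature_width=60, feature_height=40, stride=16)
anchors_per_cell = 3 * 3
```

For the 60x40 map this produces 60 × 40 × 9 = 21,600 anchors, consistent with the "over 20,000" figure above.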

#### D. RPN Output and Training
The RPN applies a convolution (e.g., 3x3) across the feature map to generate predictions for these anchor boxes.

RPN produces two outputs for the $K$ anchor boxes:
1.  **Objectness Scores (2K channels):** This determines the probability that the anchor box contains *any* object (foreground) or is just background. (Although $K$ channels with a sigmoid would suffice, the original design used 2K channels with a two-class softmax).
2.  **Bounding Box Regression (4K outputs):** Four coordinates (e.g., X, Y, width, height) are predicted for each of the $K$ anchor boxes.

To train the RPN, the generated proposals must be labeled:
*   **Foreground/Positive Label:** Anchor boxes with an IOU greater than 0.7 with any ground truth box are labeled as foreground (object). The RPN is class-agnostic, so no specific object class is assigned at this stage.
*   **Background/Negative Label:** Anchor boxes with an IOU less than 0.3 with *all* ground truth boxes are labeled as background.
*   Boxes with IOU between 0.3 and 0.7 are usually discarded.

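A mini-batch (e.g., 256 anchors, split into up to 128 positive and 128 negative examples) is sampled from these labeled anchors and used to compute the RPN loss. The proposals output by the RPN are then passed to the ROI pooling layer and the subsequent fully connected layers.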

#### E. Performance Summary (R-CNN Family)
| Architecture | Proposal Method | Total Time (Latency) | Note |
| :--- | :--- | :--- | :--- |
| R-CNN | Selective Search | 48.5 seconds | Multi-stage, slowest. |
| Fast R-CNN | Selective Search | 1.8 seconds | About 27 times faster than R-CNN (48.5 s vs. 1.8 s); bottlenecked by SS. |
| Faster R-CNN | RPN (Learnable) | $\approx$ 200 milliseconds (0.2 seconds) | Almost real-time; RPN replaces SS. |

All three R-CNN architectures (R-CNN, Fast R-CNN, Faster R-CNN) are considered **two-stage networks**.

### IV. Data Visualization and Preparation

#### A. Data Set and Annotation
The practical session focused on creating and utilizing data for multi-class object detection, specifically using an aquarium dataset.

*   **Tools:** Roboflow was used to download images, perform **manual labeling**, and output data files.
*   **Classes:** The example dataset included 7 classes: fish, stingray, jellyfish, penguin, shark, puffin, and starfish.
*   **Annotation:** Bounding boxes must be manually drawn and labeled for each object (e.g., turtle, goldfish).

#### B. COCO Data Format
The session utilized the **COCO JSON format** for annotations. The JSON file is stored as a dictionary containing fields for licenses, categories, images, and annotations.

*   **Categories:** Used to map the Category ID (an integer) to the Class Name (e.g., 'fish').
*   **Images:** Used to map the Image ID to the actual Image File Name.
*   **Annotations:** Contain the bounding box information, linked via Image ID and Category ID.
*   **Bounding Box Format:** COCO defines bounding boxes using the coordinates: $x_{min}, y_{min}, \text{width}, \text{height}$.

#### C. Data Visualization
To visualize the data, the JSON annotations are loaded, and the image, category, and annotation information is mapped into dictionaries for fast lookup (O(1) average lookup time).

Custom functions were written to:
1.  Access the bounding box coordinates ($x_{min}, y_{min}, \text{width}, \text{height}$).
2.  Calculate the coordinates needed to draw a rectangle (e.g., $x_{min}$ and $y_{min}$ for the top-left corner, and $x_{min} + \text{width}$ and $y_{min} + \text{height}$ for the bottom-right).
3.  Draw the bounding box and the corresponding class name on the image.
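
The lookup-and-convert part of those functions might look like the sketch below. The tiny inline JSON, the file name, and the function name `boxes_for_image` are all made up for illustration; only the field names (`categories`, `images`, `annotations`, `bbox`, `image_id`, `category_id`, `file_name`) follow the COCO layout described above. The actual drawing step is left out.

```python
import json

# A tiny stand-in for a COCO annotation file (hypothetical contents).
coco_json = """
{
  "categories": [{"id": 1, "name": "fish"}, {"id": 2, "name": "shark"}],
  "images": [{"id": 1, "file_name": "aquarium_001.jpg"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 100, 60]},
    {"id": 11, "image_id": 1, "category_id": 2, "bbox": [200, 50, 300, 200]}
  ]
}
"""
data = json.loads(coco_json)

# Dictionaries for constant-time lookup.
category_names = {category["id"]: category["name"] for category in data["categories"]}
image_files = {image["id"]: image["file_name"] for image in data["images"]}
annotations_by_image = {}
for annotation in data["annotations"]:
    annotations_by_image.setdefault(annotation["image_id"], []).append(annotation)


def boxes_for_image(image_id):
    """Return (class_name, (x1, y1, x2, y2)) pairs ready to be drawn as rectangles."""
    results = []
    for annotation in annotations_by_image.get(image_id, []):
        x_min, y_min, width, height = annotation["bbox"]
        corners = (x_min, y_min, x_min + width, y_min + height)
        results.append((category_names[annotation["category_id"]], corners))
    return results


drawable = boxes_for_image(1)
```

Each returned pair gives the label text and the top-left and bottom-right corners, which is the form most drawing routines expect for a rectangle.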

#### D. Tools for Handling Data
It was noted that while writing custom code to load and manipulate JSON data from scratch is possible, packages like **pycocotools** exist that allow users to directly load annotation files in the COCO format and skip the manual parsing and mapping.

### V. Future Topics
The next sessions are planned to cover:
*   Training a Faster R-CNN model on the custom dataset.
*   Starting the theory and architecture of **YOLO (You Only Look Once)**.
*   YOLO is a **single-stage network** designed to reduce the complexity and latency of two-stage detection architectures.