# Week 3 Lab 1 - Object Detection using R-CNN

# Problem Definition
**Goal:** Detect wheat heads in images of wheat plants, using wheat datasets collected from around the globe.

<img src="https://storage.googleapis.com/kaggle-media/competitions/UofS-Wheat/descriptionimage.png">

The Global Wheat Head Dataset is led by nine research institutes from seven countries: the University of Tokyo, Institut national de recherche pour l’agriculture, l’alimentation et l’environnement, Arvalis, ETHZ, University of Saskatchewan, University of Queensland, Nanjing Agricultural University, and Rothamsted Research. These institutions are joined by many in their pursuit of accurate wheat head detection, including the Global Institute for Food Security, DigitAg, Kubota, and Hiphen.

More details on the data acquisition and processes are available at https://arxiv.org/abs/2005.02162


## So, what is this type of problem called?

Our problem can be categorized as a binary classification and object localization task.

Specifically:
* Classification: We are dealing with two classes—images that contain wheat and images that do not. Most images are expected to contain wheat, with only a few exceptions.

* Localization: In addition to identifying whether an image contains wheat, it's essential to specify the exact locations of the wheat heads within the image. Simply stating the presence of wheat is insufficient; we must also perform localization to pinpoint where the wheat heads appear in the image.

Recap of the main computer vision tasks:
<img src="https://miro.medium.com/v2/resize:fit:1400/1*z89KwWbF59XXrsXXQCECPA.jpeg">

Therefore, we need a solution that can:

* Determine whether the input image contains wheat heads.
* If wheat heads are present, identify their precise locations.
* Draw bounding boxes around the detected wheat heads.
* Provide a confidence score indicating how certain the algorithm is that the detected object is a wheat head.
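
The requirements above correspond to the kind of output a detector typically returns: a set of boxes, each with a confidence score. The following is a minimal sketch only. The `detections` values are invented for illustration (they are not model output) and `keep_confident` is a hypothetical helper, not part of any library.

```python
# Hypothetical detections for one image: (x_min, y_min, x_max, y_max) plus a confidence score.
# These numbers are invented for illustration only.
detections = [
    ((34.0, 50.0, 98.0, 120.0), 0.93),
    ((210.0, 15.0, 260.0, 70.0), 0.41),
    ((400.0, 300.0, 470.0, 380.0), 0.78),
]

def keep_confident(dets, score_threshold=0.5):
    """Return only the detections whose score reaches the threshold."""
    return [(box, score) for box, score in dets if score >= score_threshold]

confident = keep_confident(detections)
contains_wheat = len(confident) > 0  # Answers "does the image contain wheat heads?"

for box, score in confident:
    print(f"wheat head at {box} with confidence {score:.2f}")
```

With the default threshold of 0.5, the second detection (score 0.41) is dropped and the other two are kept.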

## What is Faster R-CNN?
Faster R-CNN is a powerful technology that can address the questions we've posed. How does it achieve this?

Before diving into the details, I will first provide a brief overview of its predecessors. This background will help clarify why Faster R-CNN has become such a popular and effective approach for object detection.

**Background:**
Faster R-CNN belongs to the family of region-based object detection methods. Its development followed this trajectory:

* **2014:** Ross Girshick et al. introduced Regions with CNN features (R-CNN) in their groundbreaking paper.
* **2015:** Girshick improved upon R-CNN and proposed Fast R-CNN, making the process more efficient.
* **2015:** Building on Fast R-CNN, Shaoqing Ren et al. introduced Faster R-CNN, further enhancing speed and accuracy by incorporating a region proposal network (RPN).

Let's briefly review each of these methods!

### R-CNN

https://arxiv.org/abs/1311.2524

The architecture of R-CNN:

<img src="https://learnopencv.com/wp-content/uploads/2019/06/rcnn.png">

**Step 1:** They used an algorithm called Selective Search to generate around 2,000 region proposals. These proposals represent areas in the image that could potentially contain objects.

**Step 2:** For each of these 2,000 bounding boxes, they applied a CNN followed by an SVM classifier to determine whether the region contained an object.

While the accuracy of R-CNN was state-of-the-art at the time, the approach had significant speed limitations:
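* Inference: Processing a single image was slow even on a GPU; the R-CNN and Fast R-CNN papers report roughly 13 to 47 seconds per image, depending on the backbone network.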
* Training: The training process was extremely slow, taking around 84 hours and requiring a substantial amount of disk space.
* The method involved ad-hoc training objectives for region proposals and the SVM classifier, leading to inefficiencies.

### Fast R-CNN

https://arxiv.org/abs/1504.08083

The architecture of Fast R-CNN:
<img src="https://learnopencv.com/wp-content/uploads/2019/06/frcnn.png">

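In Fast R-CNN, instead of running a CNN separately on each of the roughly 2,000 region proposals, the whole image is passed through the CNN once to produce a single feature map. Each region proposal (still generated by Selective Search) is then projected onto this shared feature map, and an RoI pooling layer extracts a fixed-size feature vector for it. These vectors are fed into fully connected layers that output both the class scores and the refined bounding box coordinates.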

Fast R-CNN delivered a significant performance improvement by streamlining the computation, making the object detection process much faster.
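
The RoI pooling idea can be illustrated with a simplified sketch. It assumes a single-channel feature map stored as a list of lists and RoIs given in integer feature-map cells (inclusive). The function name `roi_max_pool` and the toy values are hypothetical; real implementations such as `torchvision.ops.roi_pool` also handle channels, batches and the scaling from image to feature-map coordinates.

```python
def roi_max_pool(feature_map, roi, output_size=2):
    """Max-pool the region roi = (x0, y0, x1, y1), given in inclusive feature-map
    cells, into an output_size x output_size grid."""
    x0, y0, x1, y1 = roi
    height = y1 - y0 + 1
    width = x1 - x0 + 1
    pooled = []
    for i in range(output_size):
        # Row range covered by bin i (each bin covers at least one row)
        r_start = y0 + (i * height) // output_size
        r_end = y0 + max(((i + 1) * height) // output_size, (i * height) // output_size + 1)
        row = []
        for j in range(output_size):
            # Column range covered by bin j (each bin covers at least one column)
            c_start = x0 + (j * width) // output_size
            c_end = x0 + max(((j + 1) * width) // output_size, (j * width) // output_size + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled

# Toy 4 x 6 feature map, computed once for the whole image
feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]

# Two proposals of different sizes both yield a 2 x 2 output
print(roi_max_pool(feature_map, (0, 0, 3, 3)))  # [[8, 10], [20, 22]]
print(roi_max_pool(feature_map, (2, 1, 5, 3)))  # [[10, 12], [22, 24]]
```

The point of the sketch is that proposals of any size are reduced to the same fixed shape, so a single set of fully connected layers can process all of them while the expensive convolutional pass runs only once per image.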

### Faster R-CNN

https://arxiv.org/abs/1506.01497

<img src="https://doimages.nyc3.cdn.digitaloceanspaces.com/010AI-ML/content/images/2020/09/Fig05-2.jpg">

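A Convolutional Neural Network (CNN) generates a feature map of the image, and this single feature map is shared by two components that are trained together: a Region Proposal Network (RPN), which replaces Selective Search by predicting candidate regions directly from the features, and the Fast R-CNN detection head, which classifies each proposed region and refines its box. Sharing the computation greatly improves detection speed because region proposals no longer require a separate, slow processing stage.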

Faster R-CNN revolutionized object detection, enabling inference to be completed in less than a second. However, the training time is longer compared to Fast R-CNN.
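
The RPN scores a fixed set of reference boxes ("anchors") at every feature-map location. As a rough sketch, assuming the configuration described in the Faster R-CNN paper (3 scales and 3 aspect ratios, i.e. 9 anchors per location, with a feature stride of 16), anchors could be generated as below. The function name `generate_anchors` is hypothetical, and the FPN-based torchvision model used later in this lab uses a different anchor layout.

```python
import math

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchor boxes (x_min, y_min, x_max, y_max) in image coordinates,
    centred on every cell of a feat_h x feat_w feature map."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Centre of this feature-map cell, mapped back to image pixels
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # ratio = height / width
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = generate_anchors(feat_h=2, feat_w=3)
print(len(anchors))  # 2 * 3 locations * 9 anchors each = 54
```

For each anchor, the RPN predicts an objectness score and four box offsets; the highest-scoring refined anchors become the region proposals passed to the detection head.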

## Overall

<img src="https://dzone.com/storage/temp/9814919-screen-shot-2018-07-23-at-114334-am.png">

# Download Dataset

In [1]:
# Download dataset https://drive.google.com/file/d/1NaqtzF3GuSW_iOiwHn-I1BGKOpNUFWEa/view?usp=sharing
!gdown 1NaqtzF3GuSW_iOiwHn-I1BGKOpNUFWEa

Downloading...
From (original): https://drive.google.com/uc?id=1NaqtzF3GuSW_iOiwHn-I1BGKOpNUFWEa
From (redirected): https://drive.google.com/uc?id=1NaqtzF3GuSW_iOiwHn-I1BGKOpNUFWEa&confirm=t&uuid=583e8a85-4e6b-4445-9ce3-72d173d3f866
To: /content/global-wheat-detection.zip
100% 637M/637M [00:20<00:00, 31.0MB/s]


In [None]:
# unzip dataset
!unzip -q global-wheat-detection.zip

In [None]:
# List the extracted files
!ls

# Setup

In [None]:
import warnings
warnings.filterwarnings('ignore')

import os
import ast
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from PIL import Image

import albumentations as A
from albumentations.pytorch import ToTensorV2

import torch
import torchvision
import torch.optim as optim
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torch.utils.data import DataLoader, Dataset

In [None]:
# Set the device to GPU if available, otherwise use the CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# Load & Prepare Dataset

In [None]:
df = pd.read_csv('train.csv')
image_dir = 'train'

In [None]:
df

## Augmentation

In [None]:
def get_train_transform():
    return A.Compose([
        A.Flip(p=0.5),              # With probability 0.5, flip horizontally, vertically, or both
        A.RandomRotate90(p=0.5),    # With probability 0.5, rotate by a random multiple of 90 degrees
        ToTensorV2(p=1.0)           # Convert the image to a PyTorch tensor (C, H, W); bounding boxes are not converted here
    ], bbox_params=A.BboxParams(format='pascal_voc', label_fields=['labels']))  # Boxes use Pascal VOC format (x_min, y_min, x_max, y_max); 'labels' is kept in sync with the boxes

def get_valid_transform():
    return A.Compose([
        ToTensorV2(p=1.0)           # Convert the image to a PyTorch tensor (C, H, W); bounding boxes are not converted here
    ], bbox_params=A.BboxParams(format='pascal_voc', label_fields=['labels']))  # Boxes use Pascal VOC format (x_min, y_min, x_max, y_max); 'labels' is kept in sync with the boxes
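
Geometric augmentations must move the boxes together with the pixels, which is why `bbox_params` is passed to `A.Compose`. The sketch below shows conceptually what a flip does to a Pascal VOC box. The helpers `hflip_box` and `vflip_box` are hypothetical and written only for illustration (Albumentations performs the equivalent update internally), and the 1024-pixel image size is an assumption.

```python
def hflip_box(box, image_width):
    """Horizontally flip a pascal_voc box (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    # The old right edge becomes the new left edge, measured from the other side
    return (image_width - x_max, y_min, image_width - x_min, y_max)

def vflip_box(box, image_height):
    """Vertically flip a pascal_voc box (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return (x_min, image_height - y_max, x_max, image_height - y_min)

box = (100, 200, 180, 260)     # Example box, values chosen for illustration
print(hflip_box(box, 1024))    # (844, 200, 924, 260)
print(vflip_box(box, 1024))    # (100, 764, 180, 824)
```

The width and height of the box are unchanged by either flip; only its position moves.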

## Data Loader

In [None]:
class WheatDataset(Dataset):
    def __init__(self, dataframe, image_dir, transforms=None):
        self.dataframe = dataframe      # The dataframe containing image IDs and bounding box information
        self.image_dir = image_dir      # Directory where images are stored
        self.transforms = transforms    # Any transformations to be applied (e.g., augmentations)
        self.image_ids = dataframe['image_id'].unique()  # Extract unique image IDs from the dataframe

    def __len__(self):
        return len(self.image_ids)      # Return the total number of unique images in the dataset

    def __getitem__(self, idx):
        image_id = self.image_ids[idx]  # Get the image ID at the given index

        # Construct the full path to the image file
        image_path = os.path.join(self.image_dir, f"{image_id}.jpg")

        # Load the image and convert it from BGR to RGB
        image = cv2.imread(image_path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
        image /= 255.0  # Normalize the image to the range [0, 1]

        # Retrieve the bounding box data for the image
        records = self.dataframe[self.dataframe['image_id'] == image_id]
        boxes = []
        for i in range(len(records)):
            box = ast.literal_eval(records.iloc[i]['bbox'])     # Convert the bounding box string to a list
            boxes.append([box[0], box[1], box[0] + box[2], box[1] + box[3]])  # Convert (x, y, w, h) to (x_min, y_min, x_max, y_max)

        boxes = np.array(boxes, dtype=np.float32)               # Convert the list of boxes to a NumPy array
        labels = np.ones((records.shape[0],), dtype=np.int64)   # Assign a label of 1 to all bounding boxes (since there's only one class: wheat)

        target = {'boxes': boxes, 'labels': labels}             # Create the target dictionary with bounding boxes and labels

        # Apply transformations if provided
        if self.transforms:
            transformed = self.transforms(image=image, bboxes=target['boxes'], labels=target['labels'].tolist())
            image = transformed['image']  # Transformed image as a (C, H, W) tensor
            # Convert transformed boxes to a tensor; reshape keeps the (N, 4) shape even if no boxes remain
            target['boxes'] = torch.tensor(transformed['bboxes'], dtype=torch.float32).reshape(-1, 4)
            target['labels'] = torch.tensor(transformed['labels'], dtype=torch.int64)   # Convert transformed labels to PyTorch tensors

        return image, target  # Return the image and the corresponding target (bounding boxes and labels)
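
The box conversion inside `__getitem__` can be checked in isolation. In `train.csv` each `bbox` cell is a string of the form `[x, y, w, h]`, while torchvision's detection models expect `(x_min, y_min, x_max, y_max)`. A minimal sketch, where `parse_bbox` is a hypothetical helper and the sample string is an illustrative value:

```python
import ast

def parse_bbox(bbox_str):
    """Convert a '[x, y, w, h]' string to [x_min, y_min, x_max, y_max]."""
    x, y, w, h = ast.literal_eval(bbox_str)  # Safely parse the string into a list of numbers
    return [x, y, x + w, y + h]

print(parse_bbox('[834.0, 222.0, 56.0, 36.0]'))  # [834.0, 222.0, 890.0, 258.0]
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing values read from a file.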

In [None]:
# Select a random subset of 150 unique image IDs from the dataframe
subset_image_ids = np.random.choice(df['image_id'].unique(), size=150, replace=False)

# Filter the dataframe to include only the selected 150 image IDs
subset_df = df[df['image_id'].isin(subset_image_ids)]

# Split the subset into training and validation sets (80% / 20%, i.e. 120 images for training and 30 for validation)
train_ids, val_ids = train_test_split(subset_image_ids, test_size=0.2, random_state=42)

# Create a dataframe by selecting rows with image IDs
train_df = subset_df[subset_df['image_id'].isin(train_ids)]
val_df = subset_df[subset_df['image_id'].isin(val_ids)]

# Initialize the dataset using WheatDataset class and applying transformations
train_dataset = WheatDataset(dataframe=train_df, image_dir=image_dir, transforms=get_train_transform())
val_dataset = WheatDataset(dataframe=val_df, image_dir=image_dir, transforms=get_valid_transform())

# Create DataLoader
train_data_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=lambda x: tuple(zip(*x)))
val_data_loader = DataLoader(val_dataset, batch_size=4, shuffle=False, collate_fn=lambda x: tuple(zip(*x)))
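
The custom `collate_fn` is needed because each image has a different number of boxes, so the targets cannot be stacked into one tensor. `tuple(zip(*x))` simply regroups a list of `(image, target)` pairs into one tuple of images and one tuple of targets. A small sketch with plain Python stand-ins (strings and lists instead of tensors, values invented for illustration):

```python
# Stand-ins for (image, target) pairs; real samples hold tensors with differing box counts
batch = [
    ("image_0", {"boxes": [[0, 0, 10, 10]]}),
    ("image_1", {"boxes": [[5, 5, 20, 20], [30, 30, 40, 40]]}),
]

def collate_fn(samples):
    """Regroup [(img, tgt), (img, tgt), ...] into ((img, img, ...), (tgt, tgt, ...))."""
    return tuple(zip(*samples))

images, targets = collate_fn(batch)
print(images)                     # ('image_0', 'image_1')
print(len(targets[1]["boxes"]))   # 2
```

Defining the function by name rather than as a `lambda` also makes it picklable, which can matter if `num_workers > 0` is used on platforms that start worker processes with spawn.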

## Sample Image with Bounding Box

In [None]:
def visualize_sample_with_bboxes(dataset, idx):
    # Retrieve an image and its corresponding target (bounding boxes) from the dataset at the given index
    image, target = dataset[idx]

    # Convert the image tensor (C, H, W) to a NumPy array (H, W, C) for visualization
    image_np = image.permute(1, 2, 0).cpu().numpy()

    # Extract the bounding boxes from the target and convert them to a NumPy array
    boxes = target['boxes'].cpu().numpy()

    # Create a matplotlib figure to display the image
    fig, ax = plt.subplots(1, figsize=(12, 9))
    ax.imshow(image_np)  # Display the image

    # Loop through each bounding box and draw a rectangle around it
    for i, box in enumerate(boxes):
        x_min, y_min, x_max, y_max = box  # Extract the coordinates of the bounding box
        rect = patches.Rectangle((x_min, y_min), x_max - x_min, y_max - y_min,
                                 linewidth=2, edgecolor='red', facecolor='none')  # Create a red rectangle
        ax.add_patch(rect)  # Add the rectangle to the image

    plt.title(f'Image Index: {idx}')
    plt.show()

In [None]:
visualize_sample_with_bboxes(train_dataset, idx=0)

# Load the Model

In [None]:
def get_model(num_classes):
    # Load the Faster R-CNN model pre-trained on the COCO dataset
    # (torchvision >= 0.13 uses `weights`; older versions used the now-deprecated `pretrained=True`)
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights='DEFAULT')

    # Replace the model's head (box predictor) to match the number of classes
    # Get the number of input features for the classifier (for the head)
    in_features = model.roi_heads.box_predictor.cls_score.in_features

    # Replace the box predictor with a new one that has the correct number of output classes
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    return model

# Instantiate the model for 2 classes: 1 class (wheat) + 1 background class
model = get_model(num_classes=2)
model.to(device)

# Training

In [None]:
def train(model, optimizer, data_loader, device, epoch):
    model.train()
    running_loss = 0.0

    for images, targets in tqdm(data_loader):
        images = [image.to(device) for image in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        # Forward pass
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        # Backward pass
        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

        running_loss += losses.item()

    avg_loss = running_loss / len(data_loader)
    return avg_loss

# Optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
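
In training mode, torchvision's Faster R-CNN returns a dictionary of losses rather than predictions, and `train` sums them into one scalar before calling `backward()`. The sketch below uses the key names from torchvision's implementation, but the values are made-up floats chosen for illustration; in the real loop they are scalar tensors.

```python
# Stand-in for the dictionary returned by model(images, targets) in training mode.
# Values are invented; in practice each one is a scalar tensor.
loss_dict = {
    "loss_classifier": 0.50,    # Detection head: classification loss
    "loss_box_reg": 0.25,       # Detection head: box regression loss
    "loss_objectness": 0.125,   # RPN: object vs. background loss
    "loss_rpn_box_reg": 0.125,  # RPN: proposal box regression loss
}

total_loss = sum(loss for loss in loss_dict.values())
print(total_loss)  # 1.0
```

Summing the four terms with equal weight means the RPN and the detection head are optimised jointly in a single backward pass.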

<img src="https://storage.googleapis.com/kaggle-media/competitions/rsna/IoU.jpg">

In [None]:
def iou(boxA, boxB):
    # Calculate the (x, y) coordinates of the intersection rectangle
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2])
    yB = min(boxA[3], boxB[3])

    # Compute the area of intersection rectangle
    interArea = max(0, xB - xA + 1) * max(0, yB - yA + 1)

    # Compute the area of both bounding boxes
    boxAArea = (boxA[2] - boxA[0] + 1) * (boxA[3] - boxA[1] + 1)
    boxBArea = (boxB[2] - boxB[0] + 1) * (boxB[3] - boxB[1] + 1)

    # Compute the IoU by dividing the intersection area by the union of both areas
    iou = interArea / float(boxAArea + boxBArea - interArea)
    return iou
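
A worked example helps to sanity-check the IoU computation. The `+ 1` terms in the function above treat coordinates as inclusive pixel indices; for continuous (float) coordinates the offset is often omitted. The sketch below, with a hypothetical function `iou_with_offset` and invented boxes, shows both conventions side by side.

```python
def iou_with_offset(box_a, box_b, offset=1.0):
    """IoU of two (x_min, y_min, x_max, y_max) boxes.
    offset=1 treats coordinates as inclusive pixel indices;
    offset=0 treats them as continuous coordinates."""
    x_a = max(box_a[0], box_b[0])
    y_a = max(box_a[1], box_b[1])
    x_b = min(box_a[2], box_b[2])
    y_b = min(box_a[3], box_b[3])

    inter = max(0, x_b - x_a + offset) * max(0, y_b - y_a + offset)
    area_a = (box_a[2] - box_a[0] + offset) * (box_a[3] - box_a[1] + offset)
    area_b = (box_b[2] - box_b[0] + offset) * (box_b[3] - box_b[1] + offset)
    return inter / (area_a + area_b - inter)

a = (0, 0, 10, 10)
b = (5, 5, 15, 15)
print(round(iou_with_offset(a, b, offset=1.0), 4))  # 0.1748  (36 / 206)
print(round(iou_with_offset(a, b, offset=0.0), 4))  # 0.1429  (25 / 175)
print(iou_with_offset(a, (20, 20, 30, 30)))         # 0.0 (no overlap)
```

The two conventions give slightly different values, so the same convention should be used consistently when comparing IoU against a fixed threshold.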

In [None]:
def calculate_metrics(model, data_loader, device, iou_threshold=0.5):
    model.eval()  # Set model to evaluation mode

    # Initialize accumulators for metrics
    all_true_positives = 0
    all_false_positives = 0
    all_false_negatives = 0
    total_iou = 0
    total_detections = 0

    with torch.no_grad():  # No gradients needed during evaluation
        for images, targets in data_loader:
            images = [image.to(device) for image in images]  # Move images to the device
            outputs = model(images)  # Get model predictions

            for i, output in enumerate(outputs):
                # Get predicted boxes, scores, and labels for each image
                pred_boxes = output['boxes'].detach().cpu().numpy()
                pred_scores = output['scores'].detach().cpu().numpy()
                pred_labels = output['labels'].detach().cpu().numpy()

                # Filter predictions by a score threshold (e.g., keep only boxes with score > 0.5)
                pred_filtered = [pred_boxes[j] for j in range(len(pred_boxes)) if pred_scores[j] > 0.5]

                # Get ground truth boxes for the current image
                true_boxes = targets[i]['boxes'].detach().cpu().numpy()

                # Initialize counts for the current image
                true_positives = 0
                false_positives = 0
                false_negatives = 0
                iou_sum = 0

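                # Compare each predicted box with the ground truth boxes.
                # Each ground truth box can be matched at most once, so duplicate
                # detections of the same wheat head count as false positives.
                # Predictions are processed in the order returned by the model.
                matched_gt = set()
                for pred_box in pred_filtered:
                    max_iou = 0    # Highest IoU with a not-yet-matched ground truth box
                    best_gt = -1   # Index of that ground truth box
                    for gt_idx, gt_box in enumerate(true_boxes):
                        if gt_idx in matched_gt:
                            continue
                        current_iou = iou(pred_box, gt_box)
                        if current_iou > max_iou:
                            max_iou = current_iou
                            best_gt = gt_idx

                    # If the highest IoU is above the threshold, count it as a true positive
                    if best_gt >= 0 and max_iou >= iou_threshold:
                        true_positives += 1
                        matched_gt.add(best_gt)
                        iou_sum += max_iou  # Add the IoU to the sum
                    else:
                        false_positives += 1

                # Any ground truth box that was not matched with a prediction is a false negative
                false_negatives = len(true_boxes) - true_positives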

                # Accumulate the results across all images
                all_true_positives += true_positives
                all_false_positives += false_positives
                all_false_negatives += false_negatives
                total_iou += iou_sum
                total_detections += len(pred_filtered)

    # Calculate precision, recall, F1-score, and average IoU
    precision = all_true_positives / (all_true_positives + all_false_positives) if (all_true_positives + all_false_positives) > 0 else 0
    recall = all_true_positives / (all_true_positives + all_false_negatives) if (all_true_positives + all_false_negatives) > 0 else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    avg_iou = total_iou / total_detections if total_detections > 0 else 0

    return precision, recall, f1_score, avg_iou
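
The final lines of `calculate_metrics` turn the accumulated counts into precision, recall and F1. A small worked example with invented counts makes the formulas concrete; `detection_metrics` is a hypothetical helper that mirrors those lines.

```python
def detection_metrics(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1

# Invented counts: 8 correct detections, 2 spurious detections, 4 missed wheat heads
p, r, f1 = detection_metrics(tp=8, fp=2, fn=4)
print(f"{p:.3f} {r:.3f} {f1:.3f}")  # 0.800 0.667 0.727
```

Precision answers "how many of the predicted boxes were correct?", while recall answers "how many of the real wheat heads were found?"; F1 is their harmonic mean.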

In [None]:
num_epochs = 2

train_losses = []
train_precisions = []
train_recalls = []
train_f1_scores = []
train_avg_ious = []

val_precisions = []
val_recalls = []
val_f1_scores = []
val_avg_ious = []

# Loop over the number of epochs
for epoch in range(num_epochs):
    # Train the model for one epoch and calculate the loss
    train_loss = train(model, optimizer, train_data_loader, device, epoch)
    train_losses.append(train_loss)  # Append the training loss

    # Calculate metrics (Precision, Recall, F1-score, and Avg IoU) on the training data
    train_precision, train_recall, train_f1_score, train_avg_iou = calculate_metrics(model, train_data_loader, device)
    train_precisions.append(train_precision)
    train_recalls.append(train_recall)
    train_f1_scores.append(train_f1_score)
    train_avg_ious.append(train_avg_iou)

    # Calculate metrics (Precision, Recall, F1-score, and Avg IoU) on the validation data
    val_precision, val_recall, val_f1_score, val_avg_iou = calculate_metrics(model, val_data_loader, device)
    val_precisions.append(val_precision)
    val_recalls.append(val_recall)
    val_f1_scores.append(val_f1_score)
    val_avg_ious.append(val_avg_iou)

    # Print out the metrics for this epoch
    print(f"Epoch {epoch+1}")
    print(f"Train - Loss: {train_loss}, Precision: {train_precision}, Recall: {train_recall}, F1-Score: {train_f1_score}, Average IoU: {train_avg_iou}")
    print(f"Val   - Precision: {val_precision}, Recall: {val_recall}, F1-Score: {val_f1_score}, Average IoU: {val_avg_iou}")

In [None]:
def plot_training_loss(train_losses):
    plt.figure(figsize=(8, 6))
    plt.plot(train_losses, label='Training Loss', linestyle='-', color='blue', marker='o')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Training Loss per Epoch')
    plt.legend()
    plt.show()

plot_training_loss(train_losses)

In [None]:
def plot_training_validation_metrics(train_metrics, val_metrics, metric_name, ylabel, color_train='blue', color_val='green'):
    plt.plot(train_metrics, label=f'Training {metric_name}', linestyle='-', color=color_train)
    plt.plot(val_metrics, label=f'Validation {metric_name}', linestyle='--', color=color_val)
    plt.xlabel('Epoch')
    plt.ylabel(ylabel)
    plt.title(f'{metric_name} per Epoch')
    plt.legend()

def plot_all_metrics():
    plt.figure(figsize=(14, 12))

    plt.subplot(2, 2, 1) # Plot Precision
    plot_training_validation_metrics(train_precisions, val_precisions, 'Precision', 'Precision', 'blue', 'green')

    plt.subplot(2, 2, 2) # Plot Recall
    plot_training_validation_metrics(train_recalls, val_recalls, 'Recall', 'Recall', 'blue', 'orange')

    plt.subplot(2, 2, 3) # Plot F1-Score
    plot_training_validation_metrics(train_f1_scores, val_f1_scores, 'F1-Score', 'F1-Score', 'blue', 'red')

    plt.subplot(2, 2, 4) # Plot Average IoU
    plot_training_validation_metrics(train_avg_ious, val_avg_ious, 'Average IoU', 'Average IoU', 'blue', 'purple')

    plt.tight_layout()
    plt.show()

plot_all_metrics()

# Inference

In [None]:
def get_test_transform():
    return A.Compose([
        ToTensorV2(p=1.0)
    ])

def load_image_test(image_path, transform=None):
    image = cv2.imread(image_path)

    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
    image /= 255.0

    if transform:
        transformed = transform(image=image)
        image = transformed['image']

    return image

In [None]:
def detect_on_image(model, image_path, device):
    image = load_image_test(image_path, transform=get_test_transform())

    # Add a batch dimension (for a single image) and move the image to the specified device (GPU/CPU)
    image = image.unsqueeze(0).to(device)

    # Switch the model to evaluation mode
    model.eval()

    # Perform inference/prediction without tracking gradients
    with torch.no_grad():
        outputs = model(image)

    # Extract predicted bounding boxes, confidence scores, and class labels from the output
    pred_boxes = outputs[0]['boxes'].cpu().numpy()      # Bounding boxes for each detected object
    pred_scores = outputs[0]['scores'].cpu().numpy()    # Confidence scores for the detected objects
    pred_labels = outputs[0]['labels'].cpu().numpy()    # Class labels for each object

    return pred_boxes, pred_scores, pred_labels

In [None]:
def visualize_detection(image_path, boxes, scores, score_threshold=0.5):
    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    fig, ax = plt.subplots(1, figsize=(12, 9))
    ax.imshow(image)  # Display the image on the plot

    # Loop over each bounding box and plot it if the confidence score is above the threshold
    for i, box in enumerate(boxes):
        if scores[i] >= score_threshold:  # Filter by confidence score
            x_min, y_min, x_max, y_max = box  # Extract bounding box coordinates

            # Create a rectangle for the bounding box
            rect = patches.Rectangle((x_min, y_min), x_max - x_min, y_max - y_min,
                                     linewidth=2, edgecolor='red', facecolor='none')
            ax.add_patch(rect)  # Add the rectangle to the plot

            # Add the confidence score near the top-left corner of the bounding box
            score = scores[i]
            ax.text(x_min, y_min - 5, f'{score:.2f}', color='yellow', fontsize=12,
                    bbox=dict(facecolor='black', alpha=0.5))  # Add text box with score

    plt.title(f'Detection Results for {os.path.basename(image_path)}')

    # Show the plot with bounding boxes and scores
    plt.show()

In [None]:
# Ensure the trained model is in memory and on the selected device (GPU if available, otherwise CPU)
model.to(device)

# Path to the folder containing test images
test_image_dir = 'test'

# Loop to perform detection on multiple images from the test folder
for image_name in os.listdir(test_image_dir):
    # Create the full path to the test image
    image_path = os.path.join(test_image_dir, image_name)

    # Perform object detection on the image
    pred_boxes, pred_scores, _ = detect_on_image(model, image_path, device)

    # Visualize the detection results (bounding boxes and scores)
    visualize_detection(image_path, pred_boxes, pred_scores, score_threshold=0.5)