# Tutorial 2.5: **Y**ou **O**nly **L**ook **O**nce (YOLO)

Author: [René Larisch](mailto:rene.larisch@informatik.tu-chemnitz.de). This implementation is based on the implementation by [Jeffrey Tan](https://github.com/tanjeffreyz/yolo-v1), with modifications.


Introduced by [Redmon et al. (2015)](https://arxiv.org/pdf/1506.02640), YOLO (**Y**ou **O**nly **L**ook **O**nce) treats object detection as a regression problem rather than a classification problem.

Previous object detection methods, like region-based convolutional neural networks (R-CNN), generate multiple region proposals with the [selective search](https://ivi.fnwi.uva.nl/isis/publications/2013/UijlingsIJCV2013/UijlingsIJCV2013.pdf) algorithm, feed each of them into a CNN to extract features, and use a support vector machine (SVM) to decide whether there is an object in the region or not. The proposed regions are additionally refined to fit the ground truth bounding boxes.
The drawback of this approach is that each region proposal has to be processed by the CNN separately, slowing down the detection.

YOLO solves that problem by predicting multiple bounding boxes at once and assigning each bounding box to a small part of the input image.


Let's start with the usual initialization:

In [None]:
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'PyTorch version: {torch.__version__} running on {device}')

import numpy as np
import matplotlib.pyplot as plt

import os, sys
notebook_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(notebook_dir, ".."))
if root_path not in sys.path:
    sys.path.append(root_path)
    print(f"Added {root_path} to sys.path")

from Utils.dataloaders import PascalVocDataset
from Utils.plotting import plot_boxes
from Utils.little_helpers import set_seed, get_parameters, get_overlap, get_bboxes

set_seed(42)





YOLO follows the following algorithm:
1. The input image is divided into an $S \times S$ grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object. In the end, each grid cell predicts only one class. 
2. Each grid cell predicts $B$ bounding boxes. A bounding box is defined by four parameters: 
$(x,y)$ defines the center of the bounding box, relative to the corresponding grid cell. 
The parameters $(h,w)$ define the height and width of the box and are relative to the size of the image. Thus, a bounding box can span over multiple grid cells.
3. Each bounding box also predicts a **confidence score**, which is formally defined as $P(Object) * IoU^{truth}_{pred}$, with $P(Object)$ the probability that there is an object in the grid cell and $IoU^{truth}_{pred}$ the **Intersection over Union** between the predicted bounding box and the ground truth. In other words: if there is no object in the grid cell, the **confidence score** is zero; otherwise, it is equal to $IoU^{truth}_{pred}$.
4. Each grid cell predicts conditional class probabilities ($P(Class_i|Object)$) for each of the $C$ many classes. Despite multiple bounding boxes, only one conditional probability per class is calculated for each grid cell.
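
To make the coordinate convention from step 2 concrete, the following minimal sketch converts one cell-relative box into pixel corner coordinates. It assumes that $(x, y)$ lie in $[0, 1]$ and are measured from the top-left corner of the grid cell, and that $(w, h)$ are fractions of the image size. The helper name `decode_box` and its argument order are made up for this illustration and are not part of the notebook's utilities.

```python
def decode_box(row, col, x, y, w, h, S=7, image_size=(448, 448)):
    """Convert a cell-relative box to pixel corners (x1, y1, x2, y2).

    Assumed convention (hypothetical, for illustration only):
    x, y in [0, 1] relative to the grid cell (row, col),
    w, h in [0, 1] relative to the whole image.
    """
    img_h, img_w = image_size
    cell_w, cell_h = img_w / S, img_h / S
    # Center of the box in pixels
    cx = (col + x) * cell_w
    cy = (row + y) * cell_h
    # Box size in pixels; it may span several grid cells
    bw, bh = w * img_w, h * img_h
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)

# A box centered in the middle cell (row 3, col 3) that covers half the image
print(decode_box(3, 3, 0.5, 0.5, 0.5, 0.5))
```

The example box is anchored in a single grid cell but stretches over several neighboring cells, which is exactly why the width and height are expressed relative to the image rather than the cell.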

<div align="center">
    <img src="figures/Yolo.gif"/>
    <p><i>Figure 1: Schematic view of the functionality of YOLO. From Redmon et al. (2015)</i></p>
</div>

In summary, the number of values to be predicted depends on the number of grid cells ($S \times S$), the number of classes ($C$), and the five values (4 coordinates and 1 confidence score) for each of the bounding boxes ($B * 5$).
In total:

$$
S \times S \times (B *5 + C)
$$
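
As a quick sanity check of this formula, the short sketch below evaluates it for two settings: $S=7$, $B=2$, $C=20$ as in the original paper, and $S=7$, $B=3$, $C=20$ as used later in this notebook. The function name `yolo_output_size` is only chosen for this illustration.

```python
def yolo_output_size(S, B, C):
    """Number of values predicted per image: S x S x (B * 5 + C)."""
    return S * S * (B * 5 + C)

print(yolo_output_size(S=7, B=2, C=20))  # setting from the paper
print(yolo_output_size(S=7, B=3, C=20))  # setting used in this notebook
```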



## Intersection over Union

The Intersection over Union (IoU) (sometimes called the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index)) is calculated as the proportion between the overlapping area (or intersection) between two bounding boxes and the area of the union between the boxes. It calculates how well two bounding boxes match each other.

<div align="center">
    <img src="figures/IOU.png" width="350"/>
    <p><i>Figure 2: Visualization of Intersection over Union. From https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/</i></p>
</div>

If the area of overlap is small, the area of union will be bigger than the area of overlap, leading to a higher denominator and an IoU value closer to zero. In contrast to that, when the area of overlap increases, the size of both areas approaches each other so that the IoU gets closer to one. 

<div align="center">
    <img src="figures/boxes.gif" width="450"/>
    <p><i>Figure 3: Calculating the intersection of two boxes.</i></p>
</div>


To calculate the overlapping area between two rectangular boxes $A$ and $B$, we need the corner with the smaller coordinates $\left(x_{1}^A, y_{1}^A\right)$ (the top-left corner in image coordinates) and the corner with the larger coordinates $\left(x_{2}^A, y_{2}^A\right)$ (the bottom-right corner) of box $A$, and the corresponding corners of box $B$ ($\left(x_{1}^B, y_{1}^B\right), \left(x_{2}^B, y_{2}^B\right)$).
Next, we define the top-left corner of the intersection ($tl$) as the larger of the two top-left corners, computed by $x_{tl} = \max(x_{1}^A, x_{1}^B)$ and $y_{tl} = \max(y_{1}^A, y_{1}^B)$. Additionally, we compute the bottom-right corner of the intersection ($br$) as the smaller of the two bottom-right corners, with $x_{br} = \min\left(x_{2}^A, x_{2}^B\right)$ and $y_{br} = \min\left(y_{2}^A, y_{2}^B\right)$. With the points $\left(x_{tl}, y_{tl}\right)$ and $\left(x_{br}, y_{br}\right)$ we can define the intersection and calculate its area. If the boxes do not overlap, one of the side lengths becomes negative and the intersection area is set to zero.

To calculate the union, we simply add the areas of the two boxes and subtract the area of the intersection.
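
Before moving to the batched tensor version, the computation can be sketched for a single pair of boxes in plain Python. This is a minimal illustration that assumes boxes in corner format `(x1, y1, x2, y2)`; the name `iou_xyxy` is not part of the notebook.

```python
def iou_xyxy(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle
    x_tl = max(box_a[0], box_b[0])
    y_tl = max(box_a[1], box_b[1])
    x_br = min(box_a[2], box_b[2])
    y_br = min(box_a[3], box_b[3])
    # Clamp at zero so that non-overlapping boxes give an empty intersection
    intersection = max(x_br - x_tl, 0.0) * max(y_br - y_tl, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # Union: sum of both areas minus the doubly counted intersection
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Two 2 x 2 boxes shifted by one unit: intersection 1, union 4 + 4 - 1 = 7
print(iou_xyxy((0, 0, 2, 2), (1, 1, 3, 3)))
```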

We implement an extra function to calculate the $IoU$ between the predicted bounding box (_p_) and the ground truth (_a_).
The function _get_iou_ also takes the number of boxes (_B_) and the number of classes (_C_) as parameters, together with a small value (_epsilon_) that replaces a union of zero, which occurs when there is no object in the box, to avoid a division by zero.


In [None]:
def bbox_attr(data, i, C):
    """Returns the Ith attribute of each bounding box in data."""

    attr_start = C + i
    return data[..., attr_start::5]
    
def bbox_to_coords(t, C):
    """Changes format of bounding boxes from [x, y, width, height] to ([x1, y1], [x2, y2])."""
    
    width = bbox_attr(t, 2, C)
    x = bbox_attr(t, 0, C)
    x1 = x - width / 2.0
    x2 = x + width / 2.0

    height = bbox_attr(t, 3, C)
    y = bbox_attr(t, 1, C)
    y1 = y - height / 2.0
    y2 = y + height / 2.0

    return torch.stack((x1, y1), dim=4), torch.stack((x2, y2), dim=4)

def get_iou(p, a, B, C, epsilon):

    p_tl, p_br = bbox_to_coords(p, C)          # (batch, S, S, B, 2)
    a_tl, a_br = bbox_to_coords(a, C)

    # Largest top-left corner and smallest bottom-right corner give the intersection
    coords_join_size = (-1, -1, -1, B, B, 2)
    tl = torch.max(
        p_tl.unsqueeze(4).expand(coords_join_size),         # (batch, S, S, B, 1, 2) -> (batch, S, S, B, B, 2)
        a_tl.unsqueeze(3).expand(coords_join_size)          # (batch, S, S, 1, B, 2) -> (batch, S, S, B, B, 2)
    )
    br = torch.min(
        p_br.unsqueeze(4).expand(coords_join_size),
        a_br.unsqueeze(3).expand(coords_join_size)
    )

    intersection_sides = torch.clamp(br - tl, min=0.0)
    intersection = intersection_sides[..., 0] \
                   * intersection_sides[..., 1]       # (batch, S, S, B, B)

    p_area = bbox_attr(p, 2, C) * bbox_attr(p, 3, C)                  # (batch, S, S, B)
    p_area = p_area.unsqueeze(4).expand_as(intersection)        # (batch, S, S, B, 1) -> (batch, S, S, B, B)

    a_area = bbox_attr(a, 2, C) * bbox_attr(a, 3, C)                  # (batch, S, S, B)
    a_area = a_area.unsqueeze(3).expand_as(intersection)        # (batch, S, S, 1, B) -> (batch, S, S, B, B)

    union = p_area + a_area - intersection

    # Catch division-by-zero
    zero_unions = (union == 0.0)
    union[zero_unions] = epsilon
    intersection[zero_unions] = 0.0

    return intersection / union
    

## YOLO - Loss function

In order to optimize our network to perform the prediction, the loss function consists of three terms.

### 1. Localization loss

The first loss we need is the localization loss, which is calculated as the MSE between the coordinates of the ground truth box $\left(\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i\right)$ and the coordinates of the predicted box $\left(x_i, y_i, w_i, h_i\right)$. It ensures that the spatial parameters of the bounding box are trained:

$$
 \mathcal{L}_{loc}(\theta) = \lambda_{coord}\;\left(\; \sum^{S^2}_{i=0} \sum^B_{j=0}1^{obj}_{i,j} \left[(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2\right] \; + \;
  \sum^{S^2}_{i=0} \sum^B_{j=0}1^{obj}_{i,j} \left[(\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2\right] \; \right)
$$

The term $1^{obj}_{i,j}$ is one if bounding box $j$ of grid cell $i$ is responsible for an object, and zero otherwise. Therefore, $\mathcal{L}_{loc}$ is zero if there is no object in the bounding box. For images with a lot of background, most boxes contribute nothing to this term, which can lead to a lower average loss and a network that does not predict the bounding boxes correctly. To avoid this, we use the weight $\lambda_{coord} = 5$ to increase the loss if there is an object in the box, forcing the network to predict the bounding boxes correctly.
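
The square roots on the width and height make the same absolute error count more for small boxes than for large ones. The following minimal sketch illustrates this with two made-up pairs of widths that both differ by 0.1; the helper name `sqrt_sq_error` is only used for this illustration.

```python
import math

def sqrt_sq_error(pred, target):
    """Squared error between square roots, as used for w and h in the loss."""
    return (math.sqrt(pred) - math.sqrt(target)) ** 2

# Same absolute error of 0.1, once for a small and once for a large box
small = sqrt_sq_error(0.2, 0.1)
large = sqrt_sq_error(0.9, 0.8)
print(f"small box: {small:.4f}, large box: {large:.4f}")
```

Without the square root, both cases would contribute the same value of $0.1^2 = 0.01$.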

### 2. Confidence loss

The second loss trains the confidence score of each bounding box $j$ by minimizing the MSE between the predicted confidence score ($C_{i,j}$) in grid cell $i$ and the target ($\hat{C}_{i,j}$), which is the IoU between the ground truth bounding box and the predicted bounding box.

$$
    \mathcal{L}_{conf}(\theta) = \sum^{S^2}_{i=0} \sum^B_{j=0}1^{obj}_{i,j}(C_{i,j} - \hat{C}_{i,j})^2 + \lambda_{noobj} \sum^{S^2}_{i=0} \sum^B_{j=0}1^{noobj}_{i,j}(C_{i,j} - \hat{C}_{i,j})^2 
$$

To distinguish between the object and the background, the confidence should be fully updated if there is an actual object in the grid cell, i.e. $1^{obj}_{i,j} = 1$. If there is no object in the grid cell (i.e. $1^{noobj}_{i,j}$ = 1), the confidence should be updated only moderately (thus, $\lambda_{noobj} = 0.5$). Since in many images the number of grid cells without an object is higher than the number of cells with an object, the value for $\lambda_{noobj}$ should be low to avoid instability during training.
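
The imbalance between boxes with and without an object can be made tangible with a small calculation. The sketch below is an illustration under strong assumptions: a single object in the image, a perfect IoU of 1 as the target for the responsible box, and the same predicted confidence of 0.5 for every box.

```python
S, B = 7, 3
lambda_noobj = 0.5

n_boxes = S * S * B          # all predicted boxes in one image
n_obj = 1                    # assumed: exactly one responsible box
n_noobj = n_boxes - n_obj    # all remaining boxes see only background

pred_conf = 0.5              # assumed: same predicted confidence everywhere
obj_term = n_obj * (pred_conf - 1.0) ** 2      # target confidence 1
noobj_term = n_noobj * (pred_conf - 0.0) ** 2  # target confidence 0

print(f"object term: {obj_term}")
print(f"no-object term (unweighted): {noobj_term}")
print(f"no-object term (weighted): {lambda_noobj * noobj_term}")
```

Even after weighting, the background boxes contribute far more to the loss than the single object box in this toy setting, which shows why their contribution is scaled down.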
 
### 3. Classification loss

The last loss function is the classification loss. 
It is calculated as the mean square error (MSE) between the one-hot encoded class labels ($\hat{p}_i(c)$) and the predicted class probabilities ($p_i(c)$), related to a grid cell $i$:

$$
    \mathcal{L}_{class}(\theta) = \sum^{S^2}_{i=0}1^{obj}_{i} \sum_{c \in classes}(p_i(c) - \hat{p}_i(c))^2 
$$

where $1^{obj}_{i}$ is 1 when there actually is an object in grid cell $i$, 0 otherwise. 


### The final loss function

The complete loss function can be written as:

$$
 \mathcal{L}(\theta) = \mathcal{L}_{loc}(\theta) + \mathcal{L}_{conf}(\theta) + \mathcal{L}_{class}(\theta) 
$$


In [None]:
def mse_loss(a, b):
    flattened_a = torch.flatten(a, end_dim=-2)
    flattened_b = torch.flatten(b, end_dim=-2).expand_as(flattened_a)
    return nn.functional.mse_loss(
        flattened_a,
        flattened_b,
        reduction='sum'
    )

class SumSquaredErrorLoss(nn.Module):
    def __init__(self, epsilon = 1e-6, B=3, C = 20, batch_size = 32):
        super().__init__()
        self.l_coord = 5
        self.l_noobj = 0.5
        self.epsilon = epsilon
        self.B = B
        self.C = C
        self.batch_size = batch_size
    
    def forward(self, p, a):
        # Calculate IoU of each predicted bbox against the ground truth bbox
        iou = get_iou(p, a, self.B, self.C, self.epsilon)                     # (batch, S, S, B, B)
        max_iou = torch.max(iou, dim=-1)[0]     # (batch, S, S, B)

        # Get masks
        bbox_mask = bbox_attr(a, 4, self.C) > 0.0
        p_template = bbox_attr(p, 4, self.C) > 0.0
        
        # Use masks to see if there is an object in the grid or not
        obj_i = bbox_mask[..., 0:1]         # 1 if grid I has any object at all
        responsible = torch.zeros_like(p_template).scatter_(       # (batch, S, S, B)
            -1,
            torch.argmax(max_iou, dim=-1, keepdim=True),                # (batch, S, S, 1)
            value=1                         # 1 if bounding box is "responsible" for predicting the object
        )
        obj_ij = obj_i * responsible        # 1 if object exists AND bbox is responsible
        noobj_ij = ~obj_ij                  # Otherwise, confidence should be 0

        
        ####
        # Localization losses
        ####
        # XY position losses
        x_losses = mse_loss(
            obj_ij * bbox_attr(p, 0, self.C),
            obj_ij * bbox_attr(a, 0, self.C)
        )
        y_losses = mse_loss(
            obj_ij * bbox_attr(p, 1, self.C),
            obj_ij * bbox_attr(a, 1, self.C)
        )
        pos_losses = x_losses + y_losses
        

        # Bbox dimension losses
        p_width = bbox_attr(p, 2, self.C)
        a_width = bbox_attr(a, 2, self.C)
        width_losses = mse_loss(
            obj_ij * torch.sign(p_width) * torch.sqrt(torch.abs(p_width) + self.epsilon),
            obj_ij * torch.sqrt(a_width)
        )
        p_height = bbox_attr(p, 3, self.C)
        a_height = bbox_attr(a, 3, self.C)
        height_losses = mse_loss(
            obj_ij * torch.sign(p_height) * torch.sqrt(torch.abs(p_height) + self.epsilon),
            obj_ij * torch.sqrt(a_height)
        )
        dim_losses = width_losses + height_losses
        # print('dim_losses', dim_losses.item())

        ####
        # Confidence losses (this implementation uses 1 as target for the responsible box instead of the IoU)
        ####
        # there is an object in the cell
        obj_confidence_losses = mse_loss(
            obj_ij * bbox_attr(p, 4, self.C),
            obj_ij * torch.ones_like(max_iou)
        )
        
        # there is no object in the cell
        noobj_confidence_losses = mse_loss(
            noobj_ij * bbox_attr(p, 4, self.C),
            torch.zeros_like(max_iou)
        )

        ####
        # Classification losses
        ####
        class_losses = mse_loss(
            obj_i * p[..., :self.C],
            obj_i * a[..., :self.C]
        )
        # print('class_losses', class_losses.item())

        total = self.l_coord * (pos_losses + dim_losses) \
                + obj_confidence_losses \
                + self.l_noobj * noobj_confidence_losses \
                + class_losses
        return total / self.batch_size

## Network

Once the loss function is written, assembling the network is straightforward. While it is possible to perform end-to-end training with the entire network, we will use a pre-trained backbone network (e.g. a ResNet trained on ImageNet or a subset of it), with the YOLO detection head placed on top.
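
The detection head used in this notebook expects a $14 \times 14$ feature map from the backbone and reduces it to $7 \times 7$ before the fully connected layers. The following sketch checks these sizes with the standard formula for the output size of a convolution. It assumes a $448 \times 448$ input and a backbone with a total stride of 32, as is the case for ResNet50; the helper `conv_out` is only defined for this illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

# ResNet50 reduces the spatial resolution by a factor of 32
backbone_out = 448 // 32
# First head convolution: 3 x 3, stride 1, padding 1 keeps the size
after_conv1 = conv_out(backbone_out, kernel=3, stride=1, padding=1)
# Second head convolution: 3 x 3, stride 2, padding 1 halves the size
after_conv2 = conv_out(after_conv1, kernel=3, stride=2, padding=1)

print(backbone_out, after_conv1, after_conv2)
```

If you change the input size or the backbone, these numbers change as well, and the hard-coded sizes in the detection head have to be adapted.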


In [None]:
class YOLO(nn.Module):
    def __init__(self, backbone, back_channels, num_classes=20, num_anchors=3, grid_size=7, pre_trained = True):
        super(YOLO, self).__init__()
        self.num_classes = num_classes #(C)
        self.num_anchors = num_anchors # number of  anchor boxes (B)
        self.grid_size = grid_size # (S)
        self.backbone = backbone # backbone network
        self.back_channels = back_channels
        if pre_trained:
            self.backbone.requires_grad_(False) # Freeze backbone weights

        ## Replace the last layers with Identity layers and attach detection layers
        self.backbone.avgpool = nn.Identity()
        self.backbone.fc = nn.Identity()
        self.backbone.classifier = nn.Identity()
                
        ## detection Head
        self.detector = nn.Sequential(
            nn.Conv2d(self.back_channels ,1024, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(1024),
            nn.ReLU(),
            nn.Conv2d(1024,1024, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(1024),            
            nn.ReLU(),        
            nn.Flatten(),
            nn.Linear(7*7*1024, 4096),
            nn.ReLU(),
            nn.Linear(4096, grid_size*grid_size*(num_anchors * 5 + num_classes))
        )
        
    def forward(self, x):
        features = self.backbone(x)
        features = torch.reshape(features, (-1, self.back_channels , 14, 14))
        predictions = self.detector(features)
        # reshape the predictions to match the label dimension (batch_size, S, S, B*5 + C)
        return predictions.view(-1, self.grid_size, self.grid_size, self.num_anchors * 5 + self.num_classes)

## Dataset

To fine-tune our YOLO network, we use the [Pascal VOC](http://host.robots.ox.ac.uk/pascal/VOC/) dataset in its version from 2012. It consists of 20 classes, providing multiple bounding boxes for multiple objects in a scene. It also contains scenes with objects of multiple classes.

<div align="center">
    <img src="figures/pascalvoc_samples.png" width="650"/>
    <p><i>Figure 4: Samples from the first four classes of the PascalVOC dataset. Image taken from Everingham et al. (2010)</i></p>
</div>


In [None]:
class_list = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 
              'dog', 'horse', 'motorbike','person','pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']
n_classes = len(class_list)
C = n_classes # number of classes
B = 3 # number of anchor boxes
S = 7 # number of grid cells
image_size = (448,448)

train_set = PascalVocDataset('train', class_list, num_anchors = B, grid_size=S, image_size=image_size, normalize=True, augment=True)
val_set = PascalVocDataset('test', class_list, num_anchors = B, grid_size=S, image_size=image_size, normalize=False, augment=False)

batch_size = 64
train_loader = torch.utils.data.DataLoader(train_set, batch_size = batch_size, 
                                           shuffle=True, num_workers=4, persistent_workers=True, drop_last=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size = batch_size, 
                                           shuffle=False, num_workers=4, persistent_workers=True, drop_last=True)


Let's have a look at an image from the test set:

In [None]:
img, label, _ = val_set[3]
plot_boxes(img, label, class_list, max_overlap=float('inf'),image_size = image_size, min_confidence=0.2, file=None)

## Training

First, we create the YOLO model including the backbone model. As the backbone, we use a ResNet50 from PyTorch that was pre-trained on ImageNet. Of course, you can use any sensible network from any platform, even the models you implemented yourself in the previous tutorials. The better the backbone model is trained, the better the YOLO model will perform. Additionally, it makes sense to pre-train the backbone on a dataset similar to the one used for the detection task.

As we need the number of feature maps (or out_channels in PyTorch) of the pre-trained ResNet, we first print its architecture to find the layer we want. 

In [None]:
from torchvision import models

## backbone model
# Pre-trained ResNet50 from PyTorch
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
print(backbone)

We want the number of feature maps from the last convolutional layer. To access it, we take the last sequential block (called _layer4_), its third Bottleneck module, and there the third (and last) convolutional layer. The corresponding Python expression is
```
ResNet.layer4[2].conv3.out_channels
```
In our example it is:

In [None]:
back_outchannels = backbone.layer4[2].conv3.out_channels
print(back_outchannels)

Keep in mind that if you change the backbone model, you also have to change how the out_channels are accessed. 

Now we can initialize the YOLO network. 

In [None]:
## YOLO
model = YOLO(backbone, back_outchannels, num_classes = C, num_anchors=B, grid_size=S, pre_trained=True)

print('Trainable Parameters in YoloNetwork: %.3fM' % get_parameters(model))

Second, we define the optimizer and the loss function.

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr = 0.0001,)
loss_function = SumSquaredErrorLoss(epsilon = 1e-6, B=B, C = C, batch_size = batch_size)

And finally, we start the training.

In [None]:
from tqdm.notebook import tqdm

train_losses = []

model.to(device)
num_epochs = 10#50
for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    train_pbar = tqdm(train_loader, desc=f'Epoch {epoch + 1}/{num_epochs} [Train]')
    
    for data, labels, _ in train_pbar:
        data = data.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        predictions = model(data)
        loss = loss_function(predictions, labels)
        loss.backward()
        optimizer.step()

        train_loss += loss.item() / len(train_loader)
        # Update progress bar
        train_pbar.set_postfix({'loss': loss.item()})
    
    train_losses.append(train_loss)
    print('Loss after epoch: ', train_loss)


plt.figure()
plt.plot(train_losses)
plt.xlabel('Epoch')
plt.ylabel('Mean loss')
plt.show()


# save model
results_folder = 'yolo_model/'
os.makedirs(results_folder, exist_ok=True)
torch.save(model.state_dict(), f'{results_folder}model_dict.pt')

    

## Evaluation

In [None]:
# rebuild the model and load the trained weights
model = YOLO(backbone, back_outchannels, num_classes = C, num_anchors=B, grid_size=S, pre_trained=True)
model.load_state_dict(torch.load('yolo_model/model_dict.pt', weights_only=True))
model = model.to(device)
model.eval()  # use the running BatchNorm statistics instead of batch statistics

# use no_grad as we do not want to train further

with torch.no_grad():
    for data, labels, _ in val_loader:
        data = data.to(device)
        outputs = model(data)
        outputs = outputs.to('cpu')
        plot_boxes(data[5].to('cpu'), outputs[5], class_list, 
                   max_overlap=float('inf'),image_size = image_size, min_confidence=0.8, file=None)
        
        break

### Mean average precision

Besides inspecting images to see how well objects have been detected, a very common method to evaluate object detection and object segmentation models is to calculate the mean average precision (mAP). 

To calculate the mAP, we first introduce how to measure the correctness of a binary classifier. 

<div align="center">
    <img src="figures/conf_mat.png" width="500"/>
    <p><i>Figure 5: Confusion matrix</i></p>
</div>

As depicted in Figure 5, for a binary classifier there are four combinations of model prediction and ground truth value:
1. **True positive (TP)**: The model correctly predicts an object that exists.
2. **False positive (FP)**: The model falsely predicts an object that does not exist.
3. **False negative (FN)**: The model falsely does not predict an object that exists.
4. **True negative (TN)**: The model correctly does not predict an object that does not exist.

These combinations can be used to quantify how well the model predicts an object correctly in relation to all the predicted objects (called **precision**) 

$$
 P = \frac{TP}{TP + FP}
$$


and how well the model predicts an object correctly in relation to all ground truth objects (called **recall**)

$$
 R = \frac{TP}{TP + FN}
$$
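
As a small worked example with made-up counts, the following sketch computes both values for a detector that found 8 objects correctly, produced 2 false detections, and missed 4 objects. The function name `precision_recall` is only chosen for this illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from counts; returns 0.0 if a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

p, r = precision_recall(tp=8, fp=2, fn=4)
print(f"precision: {p:.3f}, recall: {r:.3f}")
```

Note that true negatives do not appear in either formula, which is convenient for object detection, where the number of "correctly not predicted" boxes is not well defined.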

To calculate the mAP, we now have to define, if a model predicts an object or not with the help of the $IoU$.

<div align="center">
    <img src="figures/mAP-basic.png" width="750"/>
    <p><i>Figure 6: The IoU is used to determine if a bounding box is a falsely predicted box (FP) or a correctly predicted box (TP) </i></p>
</div>

To do so, our algorithm has to do the following steps:
1. Measure for each predicted bounding box, if it is a correct one (TP) or a false one (FP)
2. For each class:
   - Sort all predicted bounding boxes by their confidence scores in descending order
   - Calculate the precision and recall value for each predicted bounding box, starting with the first box (the one with the highest confidence score), and accumulate the detected TP and FP boxes along the way.
   - Calculate the average precision score with: $AP = \sum_n (R_n - R_{n-1})P_n$
3. After calculating the $AP$ for each class, calculate the mean average precision (mAP):

$$
 mAP = \frac{1}{N_{classes}} \sum_{i=1}^{N_{classes}} AP_i
$$


More information about the (mean) average precision value can be found [here](https://github.com/rafaelpadilla/Object-Detection-Metrics) or [here](https://www.v7labs.com/blog/mean-average-precision).
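
The steps above can be sketched on toy data before applying them to the real predictions. The example assumes that the TP/FP flags of each class are already sorted by descending confidence, that $R_0 = 0$, and that every class has at least one ground truth box; the function name `average_precision` and the toy flags are made up for this illustration.

```python
def average_precision(tp_flags, n_ground_truth):
    """AP = sum_n (R_n - R_{n-1}) * P_n, with R_0 = 0.

    tp_flags: 1 (TP) or 0 (FP) per prediction, sorted by descending confidence.
    n_ground_truth: number of ground truth boxes of the class (assumed > 0).
    """
    acc_tp, prev_recall, ap = 0, 0.0, 0.0
    for n, flag in enumerate(tp_flags, start=1):
        acc_tp += flag
        precision = acc_tp / n               # TP / (TP + FP) so far
        recall = acc_tp / n_ground_truth     # TP / (TP + FN)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

# Two toy classes with two ground truth boxes each
ap_per_class = [
    average_precision([1, 0, 1], n_ground_truth=2),
    average_precision([1, 1], n_ground_truth=2),
]
mean_ap = sum(ap_per_class) / len(ap_per_class)
print(ap_per_class, mean_ap)
```

In the first class, the false positive in the middle lowers the precision at the moment the second object is found, which reduces the AP below one.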

#### Step 1: Get the measurement for each bounding box

A predicted bounding box is considered true positive (TP) if the $IoU$ is above a minimum threshold and if it predicts the correct class.
It is considered a false positive (FP) if the $IoU$ is below the threshold or a wrong class is predicted.

We use a minimum threshold for the confidence value to sort out bounding boxes with very low confidence, which will create a lot of false positive boxes.
This means that the minimum confidence value is an additional hyperparameter that needs to be tuned after training.
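
The decision rule can be sketched on a few hand-made predictions. This simplified illustration assumes that each prediction has already been matched to a ground truth box, so that its IoU and the ground truth class are known; the tuple layout and the name `label_predictions` are hypothetical and differ from the data structures used in the function below.

```python
def label_predictions(predictions, min_confidence, min_iou):
    """Label each kept prediction as TP (1) or FP (0).

    predictions: list of (confidence, iou_with_matched_gt, pred_class, gt_class)
    """
    labels = []
    for conf, iou, pred_class, gt_class in predictions:
        if conf < min_confidence:
            continue  # discarded before the evaluation
        is_tp = pred_class == gt_class and iou >= min_iou
        labels.append((pred_class, conf, int(is_tp)))
    return labels

preds = [
    (0.9, 0.7, 'dog', 'dog'),   # correct class, high IoU -> TP
    (0.8, 0.3, 'dog', 'dog'),   # correct class, low IoU  -> FP
    (0.6, 0.8, 'cat', 'dog'),   # wrong class             -> FP
    (0.1, 0.9, 'dog', 'dog'),   # below min_confidence    -> dropped
]
print(label_predictions(preds, min_confidence=0.15, min_iou=0.5))
```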

In [None]:
def sortout_redudancy(bboxes):
    bboxes = [ee for n,ee in enumerate(bboxes) if ee not in bboxes[:n]]
    return bboxes
    
def measure_prediction(data, gt, p, class_list, min_confidence, min_iou, pred_list, gt_dict):
    ## data -> image input data
    ## gt -> ground truth boxes
    ## p -> predictions    
    ## class_list -> list of all classes
    ## min_confidence -> minimum confidence score a box must have to be considered
    ## min_iou -> IoU threshold that we use to distinguish between TP(>= min_iou) and FP(< min_iou) boxes
    ## pred_list -> list of already evaluated predicted boxes, and where we add more evaluated boxes to it
    ## gt_dict -> dictionary to count the total number of ground truth boxes for each class  
    ## return -> updated pred_list and updated gt_dict

    
    images_size = (data.size(dim=2),data.size(dim=3))
    n_samples, s_x, s_y, n_points = np.shape(p)
    num_classes = len(class_list)

    grid_size_x = data.size(dim=3) / s_x
    grid_size_y = data.size(dim=2) / s_y

    ## iterate over all images that we have
    for d in range(len(data)):
        ## get the ground truth boxes
        gt_box = get_bboxes(s_x, s_y, gt[d], grid_size_x, grid_size_y, num_classes, min_confidence=0.5, image_size = images_size)
        gt_box = sortout_redudancy(gt_box)    
        num_boxes_gt = len(gt_box)

        ## update gt_dict
        for b in gt_box:
            if b[-1] in gt_dict:
                gt_dict[b[-1]] = gt_dict[b[-1]] +1
            else:
                gt_dict[b[-1]] = 1
        
        ## get only the predicted bboxes with a confidence score above min_confidence
        p_box = get_bboxes(s_x, s_y, p[d], grid_size_x, grid_size_y, num_classes, min_confidence=min_confidence, image_size = images_size)
        # Sort by highest to lowest confidence
        p_box = sorted(p_box, key=lambda x: x[3], reverse=True)
    
        num_boxes_p = len(p_box)

        ## initialize the array to save the IoU values
        iou_p_gt = [[0 for _ in range(num_boxes_gt)] for _ in range(num_boxes_p)]

        ## for each predicted bounding box, we use the highest IoU to link it to one
        ## ground truth box and to check, if both boxes predicting the same class
        
        ## additional list to track, if a ground truth box is already linked with
        ## a bounding box with a higher confidence score
        list_max_gt_arg = []

        ## iterate over all predicted boxes in the current sample
        for pb in range(num_boxes_p):
            
            # calculate the IoU of the predicted bbox to all ground truth bboxes
            for j in range(num_boxes_gt):
                iou_p_gt[pb][j] = get_overlap(p_box[pb], gt_box[j])
            ## take the ground truth bbox with the highest IoU and check, if its the same class
            max_gt_arg = np.argmax(iou_p_gt[pb])
            ## if a p_bbox with a higher confidence score is already linked to same gt_bbox
            ## ignore the actual p_bbox
            if max_gt_arg in list_max_gt_arg:
                continue
            else:
                list_max_gt_arg.append(max_gt_arg)   
                ## check if the predicted classes are the same
                if p_box[pb][-1] == gt_box[max_gt_arg][-1]:
                    ##check if iou < or > than min_iou
                    if iou_p_gt[pb][max_gt_arg] >= min_iou:
                        # TP
                        class_i = p_box[pb][-1]
                        conf_score = p_box[pb][-2]
                        pred_list.append([class_i, conf_score,1])
                    else:
                        # FP
                        class_i = p_box[pb][-1]
                        conf_score = p_box[pb][-2]
                        pred_list.append([class_i, conf_score,0])
                #if classes are different
                else:
                    # FP
                    class_i = p_box[pb][-1]
                    conf_score = p_box[pb][-2]
                    pred_list.append([class_i, conf_score,0])
            
    ## return a list for predicted bounding boxes, 
    ## the class, the confidence score, and is it TP or FP
    ## TP are indicated with 1 and FP with 0
    ## and also return the number of ground truth boxes per class
    return(pred_list, gt_dict)

Use the trained network and iterate over the entire validation set to gather the necessary information for each predicted bounding box.
To save computation time, we iterate over the validation set only once. As we compute the AP for each class, we also store the predicted class index for each bounding box, along with its confidence score and whether it is a TP or FP.

In [None]:
import torch
from tqdm.notebook import tqdm

model = YOLO(backbone, back_outchannels, num_classes = C, num_anchors=B, grid_size=S, pre_trained=True)
model.load_state_dict(torch.load('yolo_model/model_dict.pt', weights_only=True))
model = model.to(device)
model.eval()  # use the running BatchNorm statistics instead of batch statistics

## we go through the test set
## save for every predicted bounding box > min_confidence
## the class, confidence value, and if its TP or FP
## TP = 1 / FP = 0
## we also save the total number of ground truth bboxes per class

## list to save the each p_bbox, 
## its class, its confidence score,
## and if TP(1) or FP(0)

pred_list = []

## list to save number of ground truth
## bboxes per class
gt_dict = {}

val_pbar =  tqdm(val_loader)

with torch.no_grad():
    for data, labels, _ in val_pbar:
        data = data.to(device)
        outputs = model(data)
        outputs = outputs.to('cpu')
        pred_list, gt_dict = measure_prediction(data = data, p = outputs, 
                                                gt = labels, class_list = class_list, 
                                                min_confidence=0.15, min_iou = 0.2, pred_list = pred_list, gt_dict = gt_dict)

#### Step 2: Calculate the average precision for each class
After we have collected all the necessary data, we create a dataframe from the data to have easier data management.
Then we iterate over all the classes and calculate the AP score.

In [None]:
import pandas as pd 

# create a pandas data frame to have better access to the values
df = pd.DataFrame(pred_list, columns = ['class', 'confScore', 'TP/FP'])

## iterate over all classes and calculate the average precision AP
all_ap = np.zeros(len(gt_dict))

for c in range(len(gt_dict)):
    # number of all ground truth bboxes
    gt_total = gt_dict[c]

    # get entries only for the current class
    df_c = df.loc[df['class'] == c]

    # sort them over the confidence score
    df_c = df_c.sort_values(by = 'confScore', ascending = False)
    ## now go through all predictions and calculate precision and recall
    list_precision = np.zeros(len(df_c))
    list_recall = np.zeros(len(df_c))
    acc_tp = 0
    acc_fp = 0
    ap = 0
    for d in range(len(df_c)):
        row = df_c.iloc[d]
        if row['TP/FP'] == 1:
            acc_tp +=1
        if row['TP/FP'] == 0:
            acc_fp +=1
        list_precision[d] = acc_tp/(acc_tp + acc_fp)
        list_recall[d] = acc_tp/(gt_dict[c]) # gt_dict[c] == TP + FN
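        # R_{n-1} is zero for the first prediction, so the first term also counts
        prev_recall = list_recall[d-1] if d > 0 else 0.0
        ap += (list_recall[d] - prev_recall) * list_precision[d]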
    
    all_ap[c] = ap

#### Step 3: Calculate the mean average precision
By simply calculating the average over all AP values, we can calculate the mAP score.
Since the $IoU$ threshold is crucial to discriminate between TP and FP samples, and thus can strongly influence the mAP score, it is conventional to state the threshold, in percent, as a subscript.
For example, if we use $min\_iou = 0.2$, we write the mAP value as $mAP_{20}$.

In [None]:
print('mAP_{20} = ', np.mean(all_ap))

## Exercises

### 1. Parameters for the mAP
As mentioned above, the mAP value depends on two parameters (in addition to network performance): The lower threshold for the confidence score and the threshold for the $IoU$ parameter.
Change both of them and see how the mAP value changes. Find out which parameter configuration gives you the highest mAP value, and plot some samples to see if the predictions make sense.

In [None]:
# Your code here

### 2. Change backbone model

Switch from the pre-trained ResNet50 from PyTorch to another backbone model. You can list all available models with `torchvision.models.list_models()`. Note that the pre-trained classification models from PyTorch are trained on ImageNet. For more information, see [here](https://docs.pytorch.org/vision/main/models.html).
Of course, you can use other resources, e.g. `timm`.
Do not forget to make certain changes at the YOLO model class, depending on your chosen backbone model.

In [None]:
## Search for available pre-trained models
from torchvision import models
torchvision_models = models.list_models()
print(f"torchvision provides {len(torchvision_models)} pre-trained models.")
# filter for specific model type, e.g. EfficientNet
torchvision_models_effNet = models.list_models(include='efficientnet*')
print(f"Available pretrained EfficientNet models ({len(torchvision_models_effNet)}):")
# loop variable is called `name` so that the trained `model` is not overwritten
for name in torchvision_models_effNet:
    print(f"  - {name}")

import timm
timm_models = timm.list_models(pretrained=True)
print(f"\ntimm provides {len(timm_models)} pretrained models.")
# filter for specific model type, e.g. EfficientNet
timm_models_effNet = timm.list_models(filter='efficientnet*', pretrained=True)
print(f"Available pretrained EfficientNet models ({len(timm_models_effNet)}):")
for name in timm_models_effNet:
    print(f"  - {name}")
    

# Your code here

### 3. Training YOLO end-to-end

As seen above, YOLO can be trained with a pre-trained ResNet50 and frozen layers. However, it is also possible to train the entire network (end-to-end training), as end-to-end training can lead to better results than using a pre-trained network.

You just have to define a backbone network (like a CNN from one of the previous notebooks) and make sure that the weights are trainable.
Note that training the entire network will use more VRAM on the GPU and will take longer to train the same number of epochs.

In [None]:
# Your code here

### 4. Change from YOLOv1 to YOLOv2

In this notebook, we demonstrated how to implement the first version of the YOLO network (called YOLOv1). Not long after its release, the authors introduced YOLOv2, with many improvements. A good description of the changes and why they were made can be found [here](https://medium.com/@sachinsoni600517/yolo-v2-comprehensive-tutorial-building-on-yolo-v1-mistakes-aa7912292c1a) and in the original [paper](https://arxiv.org/pdf/1612.08242v1.pdf).

Task: Implement YOLOv2, based on our YOLOv1 variant.

In [None]:
# Your code here