In [None]:
# default_exp yolov1

# YOLO V1 implementation

> YOLO v1.

In [None]:
#hide
from nbdev.showdoc import *

In this notebook, YOLOv1 will be implemented based on the original [paper](https://arxiv.org/pdf/1506.02640v5.pdf).


TODO: 
* [ ] Need to rewrite the plot function so that it gives the name and the predicted probability for each bounding box
* [ ] Get more metrics from the training function (e.g. training and validation losses)
* [ ] Write a function that will plot the training and validation loss as well as the training and validation accuracy
* [ ] Use the model on the videos that I have from towing tank to see how well the algorithm performs
* [ ] Use images/videos with darker light conditions to train and test the model.

# How Yolo works
YOLO is an object detection algorithm that uses features learned by a convolutional neural network (CNN) to detect objects. When performing object detection, we want to correctly identify and locate the objects in a given image. Most classic object detection approaches use a sliding window, where a classifier is run at evenly spaced locations over the entire image; Deformable Parts Models (DPM) work this way. R-CNN instead uses region proposal methods to generate candidate bounding boxes in the image and then runs the classifier on the proposed boxes. These approaches, particularly DPM, are slow and not well suited to real-time use; the improved versions of R-CNN gain some speed by strategically selecting regions of interest and running the classifier only on them.

YOLO, on the other hand, is based on the idea of splitting the image into a grid. For example, a given image can be split into a 3 by 3 grid (**_SxS = 3x3_**), which gives us 9 cells. As the image below shows, the image is divided into a 3 by 3 grid with 9 cells, and each cell has 2 bounding boxes (**_B_**), from which the final predicted bounding box for the object in the image is obtained.

<img alt="Bounding Boxes" width="500" caption="Bounding Boxes" src="images/image.png" id="bboxs"/>

[//]:![image](../notes/images/image.png)

Figure 1

Generally, the YOLO algorithm has the following steps:

1. Divide the image into cells with an **_SxS_** grid
2. Each cell predicts **_B_** bounding boxes (_a cell is responsible for detecting an object if the midpoint of the object's bounding box falls within the cell_)
3. Return bounding boxes above a given confidence threshold. _The algorithm keeps only the bounding boxes whose confidence is above the threshold (e.g. 0.90) and rejects all boxes with lower values_.
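
The thresholding in step 3 can be sketched in a few lines of plain Python. This is only an illustration: the box layout `(class_id, confidence, x, y, w, h)` and the function name `filter_by_confidence` are assumptions made for this example and are not taken from the notebook's `utils.py`.

```python
from typing import List, Tuple

# Hypothetical box layout for this sketch: (class_id, confidence, x, y, w, h)
Box = Tuple[int, float, float, float, float, float]


def filter_by_confidence(boxes: List[Box], threshold: float = 0.5) -> List[Box]:
    """Keep only the boxes whose confidence is strictly above the threshold."""
    return [box for box in boxes if box[1] > threshold]


# Three made-up predictions; the second one has a low confidence
predictions = [
    (0, 0.95, 0.50, 0.50, 0.20, 0.30),
    (0, 0.40, 0.52, 0.48, 0.22, 0.28),
    (2, 0.91, 0.10, 0.80, 0.15, 0.10),
]

kept = filter_by_confidence(predictions, threshold=0.90)
print(len(kept), "boxes kept")
```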

**Note:** In practice we would like to use larger values of $S$ and $B$, such as $S = 19$ and $B = 5$, to identify more objects; each cell then outputs a prediction with a corresponding bounding box for a given image.

The image below shows the YOLO algorithm's result, which returns the bounding boxes for the detected objects. To perform well, the algorithm needs to be trained sufficiently, since the detection accuracy generally improves as training progresses over more iterations (epochs). Also, a bounding box can span more than one cell without any issue; the detection is assigned to the cell that contains the midpoint of the bounding box.

<img alt="Bounding Boxes2" width="500" caption="Bounding Boxes2" src="images/image2.png" id="bboxs2"/>

[//]: ![image](../notes/images/image2.png)

Figure 2

The YOLO object detection algorithm is a faster architecture because it uses a single Convolutional Neural Network (CNN) to process the whole image at once. This is in contrast with the naive sliding window approach, where for each image the algorithm (DPM, R-CNN, etc.) needs to scan it step by step to find the regions of interest, i.e. the detected objects. R-CNN, for example, needs to classify around 2000 regions per image, which makes the algorithm very time consuming and not ideal for real-time applications.

The figure below shows how the YOLO model creates an $S \times S$ grid on the input image, then predicts multiple bounding boxes as well as a class probability map for each grid cell, and at the end gives the final predictions of the objects in the image.

<img alt="Bounding Boxesy" width="500" caption="Bounding Boxesy" src="images/yolo_paper.png" id="bboxs"/>

[//]: ![image](../notes/images/yolo_paper.png)

Figure 3

## How are the bounding boxes encoded in YOLO?
Two of the most important aspects of this algorithm are the way it builds and specifies the bounding boxes, and the loss function. The algorithm uses five components to predict an output:

1. The centre of a bounding box $(b_x, b_y)$ relative to the bounds of the grid cell
2. The width $(b_w)$
3. The height $(b_h)$. In the original paper the width and the height are expressed relative to the entire image.
4. The class of the object $(c)$
5. The prediction confidence $(p_c)$, which is the probability of the existence of an object within the bounding box.

Thus, we ideally want one bounding box for each object in the given image. We can make sure that only one bounding box is predicted for each object by assigning the object to the single cell that contains its midpoint; that cell is responsible for outputting the object.

So, instead of corner coordinates $[x_1, y_1, x_2, y_2]$, each bounding box for each cell in the YOLO algorithm is encoded as $[x, y, w, h]$

* $x$ and $y$ are the coordinates of the object midpoint within the cell -> these will be between $0$ and $1$
* $w$ and $h$ are the width and the height of the object relative to the cell -> $w$ can be _greater_ than 1 if the object is wider than the cell, and $h$ can also be _greater_ than 1 if the object is taller than the cell
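
As a rough illustration of this encoding, the sketch below converts a box given relative to the whole image into cell-relative values. The function name `encode_box` and the assumption that the input box is already normalised to the image (all values in $[0, 1]$) are choices made for this example only.

```python
def encode_box(x_img, y_img, w_img, h_img, S=7):
    """Convert an image-relative box (midpoint x, y and size w, h, all in [0, 1])
    into (row, col, x_cell, y_cell, w_cell, h_cell) for an S x S grid."""
    # The cell that contains the midpoint is responsible for the object
    col = min(int(S * x_img), S - 1)  # clamp so that x_img == 1.0 stays in the grid
    row = min(int(S * y_img), S - 1)

    # Midpoint relative to the top-left corner of that cell (between 0 and 1)
    x_cell = S * x_img - col
    y_cell = S * y_img - row

    # Width and height measured in cell units (can be greater than 1)
    w_cell = w_img * S
    h_cell = h_img * S
    return row, col, x_cell, y_cell, w_cell, h_cell


# A box centred in the image that is 40% of the image wide and 20% tall, on a 3 x 3 grid
row, col, x_cell, y_cell, w_cell, h_cell = encode_box(0.5, 0.5, 0.4, 0.2, S=3)
print(row, col, x_cell, y_cell, w_cell, h_cell)
```

In this example the midpoint lands in the middle cell, and the width in cell units is larger than 1 because the object is wider than a single cell.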

The labels will look like the following:

$label_{cell} = [c_1, c_2, ..., c_5, p_c, x, y, w,h]$

where:

* $c_1$ to $c_5$ will be the dataset classes
* $p_c$ probability that there is an object (1 or 0)
* $x, y, w,h$ are the coordinates of the bounding boxes


Predictions will look very similar, but each cell will output two bounding boxes, which will specialise in different bounding box shapes (e.g. wide vs. tall).

$pred_{cell} = [c_1, c_2, ..., c_5, p_{c_1}, x_1, y_1, w_1, h_1, p_{c_2}, x_2, y_2, w_2, h_2]$

**Note:** A cell can only detect one object, which is also one of the YOLO limitations (we can use a finer grid to achieve multiple detections, as mentioned above).

This is for every cell and the **target** shape for one image will be $(S, S, 10)$

where:

* $S \times S$ is the grid size
* $5$ is for the class predictions, $1$ is for the probability score, and $4$ is for the bounding box coordinates

The **predictions** shape will be $(S, S, 15)$, where there is an additional probability score and four extra bounding box values for the second box.
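
To make the label layout concrete, here is a minimal sketch that builds the target for one image as nested Python lists. The helper names `make_cell_label` and `empty_cell_label` are hypothetical, and the sketch assumes the 5-class layout $[c_1, ..., c_5, p_c, x, y, w, h]$ described above.

```python
def make_cell_label(class_id, x, y, w, h, num_classes=5):
    """Label for a cell that contains an object: one-hot classes, p_c = 1, box."""
    one_hot = [0.0] * num_classes
    one_hot[class_id] = 1.0
    return one_hot + [1.0, x, y, w, h]


def empty_cell_label(num_classes=5):
    """Label for a cell without an object: everything is zero, including p_c."""
    return [0.0] * (num_classes + 5)


S = 3  # small grid, as in the 3 x 3 example above
target = [[empty_cell_label() for _ in range(S)] for _ in range(S)]

# One object of class index 2 whose midpoint falls in the middle cell
target[1][1] = make_cell_label(class_id=2, x=0.5, y=0.5, w=1.2, h=0.6)

print(len(target), len(target[0]), len(target[0][0]))  # grid rows, grid columns, values per cell
```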

## The model architecture

<img alt="YOLO Architecture" width="800" caption="YOLO Architecture" src="images/model.png" id="yolov1"/>

[//]: ![image](../notes/images/model.png)

The original YOLO model consists of 24 convolutional layers followed by 2 fully connected layers.
The model accepts 448x448 images. The first layer has a 7x7 kernel with 64 output filters and a stride of 2 (**it also needs a padding of 3 to match the dimensions**), and it is followed by a 2x2 max-pool layer with a stride of 2. Similarly, the rest of the model consists of convolutional layers and max-pool layers, except for the last two layers, which are fully connected. The first fully connected layer takes the convolutional output as input and maps it to a feature vector of size 4096. The second fully connected layer produces an output that is reshaped to 7 by 7 by 30, where $S = 7$ is the final split size of the image (a $7$ x $7$ grid) and 30 is the length of the output vector for each cell (in my case this will be 15).

To help with building the architecture, it is useful to define the architecture configuration in advance:

```python
architecture_config = [
    # Tuple: (kernel_size, num_filters, stride, padding)
    (7, 64, 2, 3),
    "M",    # M stands for the max-pooling layer, with a 2x2 kernel and a stride of 2
    (3, 192, 1, 1),
    "M",
    (1, 128, 1, 0),
    (3, 256, 1, 1), 
    (1, 256, 1, 0), 
    (3, 512, 1, 1), 
    "M",
    # List of tuples: (kernel_size, num_filters, stride, padding), num_of_repeats
    [(1, 256, 1, 0), (3, 512, 1, 1), 4],
    (1, 512, 1, 0), 
    (3, 1024, 1, 1),
    "M", 
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2], 
    (3, 1024, 1, 1),
    (3, 1024, 2, 1),
    (3, 1024, 1, 1), 
    (3, 1024, 1, 1),
]
```
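
As a sanity check on this configuration, the sketch below walks through it with the standard convolution output-size formula, $\lfloor (n + 2p - k) / s \rfloor + 1$, and reports the final spatial size, the final number of channels and the number of convolutional layers. The helper names `conv_out` and `trace_architecture` are made up for this example; the configuration is copied from above so that the block runs on its own.

```python
architecture_config = [
    (7, 64, 2, 3),
    "M",
    (3, 192, 1, 1),
    "M",
    (1, 128, 1, 0),
    (3, 256, 1, 1),
    (1, 256, 1, 0),
    (3, 512, 1, 1),
    "M",
    [(1, 256, 1, 0), (3, 512, 1, 1), 4],
    (1, 512, 1, 0),
    (3, 1024, 1, 1),
    "M",
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2],
    (3, 1024, 1, 1),
    (3, 1024, 2, 1),
    (3, 1024, 1, 1),
    (3, 1024, 1, 1),
]


def conv_out(size, kernel, stride, padding):
    """Output size of a convolution or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_architecture(config, size=448, channels=3):
    """Return (final spatial size, final channels, number of conv layers)."""
    num_convs = 0
    for layer in config:
        if isinstance(layer, tuple):
            blocks = [layer]
        elif isinstance(layer, list):
            conv1, conv2, repeats = layer
            blocks = [conv1, conv2] * repeats
        else:  # "M": 2x2 max pooling with stride 2 and no padding
            size = conv_out(size, 2, 2, 0)
            continue
        for kernel, filters, stride, padding in blocks:
            size = conv_out(size, kernel, stride, padding)
            channels = filters
            num_convs += 1
    return size, channels, num_convs


print(trace_architecture(architecture_config))  # (7, 1024, 24)
```

The result agrees with the description above: 24 convolutional layers, and a 448x448 input ends up as a 7 x 7 feature map with 1024 channels.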

## The Loss Function

The YOLO loss function is the second most important aspect of the algorithm. The basic concept behind all of its terms is that they are sum-squared errors. The first part of the loss function is the loss for the box midpoint coordinates (the predicted $\hat{x}$ is subtracted from the target $x$ and the difference is squared, and likewise for $y$). The $\mathbb{1}_{ij}^{obj}$ is an indicator function, which is non-zero only when there is an object in the cell. In summary:

* $\mathbb{1}_{i}^{obj}$ is 1 when there is an object in cell $i$, otherwise it is 0.
* $\mathbb{1}_{ij}^{obj}$ is 1 when the $j^{th}$ bounding box prediction of cell $i$ is responsible for the object, otherwise it is 0.
* $\mathbb{1}_{ij}^{noobj}$ follows the same concept as the previous one, except that it is 1 when there is no object and 0 when there is an object.

To decide which predicted bounding box is responsible for an object, we look at the cell and check which of its predicted bounding boxes has the highest Intersection over Union (IoU) value with the target bounding box. The one with the highest IoU is the responsible bounding box for the prediction and is sent to the loss function.
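
The selection of the responsible box can be sketched as follows. This is a simplified, pure-Python stand-in and not the `intersection_over_union` function imported from `utils.py` later in the notebook; the names `iou_midpoint` and `responsible_box` are assumptions for this example, and boxes are taken to be in the $[x, y, w, h]$ midpoint format.

```python
def iou_midpoint(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h), where (x, y) is the box midpoint."""
    # Convert midpoint format to corner coordinates
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Overlap along each axis (zero when the boxes do not intersect)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0


def responsible_box(target, predictions):
    """Return (index, IoU) of the predicted box with the highest IoU with the target."""
    ious = [iou_midpoint(target, pred) for pred in predictions]
    best = max(range(len(ious)), key=ious.__getitem__)
    return best, ious[best]


target_box = (0.5, 0.5, 1.0, 1.0)
predicted = [(0.5, 0.5, 2.0, 0.5),   # wide box
             (0.5, 0.5, 1.0, 0.8)]   # box close to the target
best_index, best_iou = responsible_box(target_box, predicted)
print(best_index, best_iou)
```

Here the second prediction overlaps the target more, so it would be the one passed to the loss function.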

\begin{align}
&\lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{obj}[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2 ] \\&+ \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{obj}[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2 +(\sqrt{h_i}-\sqrt{\hat{h}_i})^2 ]\\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{obj}(C_i - \hat{C}_i)^2 + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{noobj}(C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj}\sum_{c \in classes}(p_i(c) - \hat{p}_i(c))^2 \\
\end{align}
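
To see how the terms combine, the sketch below evaluates the loss for a single cell in plain Python. It rests on several assumptions that are not stated in this notebook: $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$ are the values from the original paper, the target confidence of the responsible box is taken to be 1 (matching the 1 or 0 labels above), every non-responsible box receives the no-object term, and predicted widths and heights are clamped at zero before the square root. The function name `cell_loss` and its argument layout are hypothetical.

```python
import math


def cell_loss(t_classes, t_obj, t_box, p_classes, p_boxes, responsible,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared-error loss for one grid cell.

    t_classes: one-hot target classes, t_obj: True if the cell contains an object,
    t_box: target (x, y, w, h), p_classes: predicted class scores,
    p_boxes: list of predicted (confidence, x, y, w, h),
    responsible: index of the predicted box with the highest IoU with the target.
    """
    loss = 0.0
    for j, (conf, x, y, w, h) in enumerate(p_boxes):
        if t_obj and j == responsible:
            tx, ty, tw, th = t_box
            # Midpoint term
            loss += lambda_coord * ((tx - x) ** 2 + (ty - y) ** 2)
            # Width/height term on square roots (predictions clamped at 0)
            loss += lambda_coord * (
                (math.sqrt(tw) - math.sqrt(max(w, 0.0))) ** 2
                + (math.sqrt(th) - math.sqrt(max(h, 0.0))) ** 2
            )
            # Object confidence term, with a target confidence of 1
            loss += (1.0 - conf) ** 2
        else:
            # No-object confidence term, with a target confidence of 0
            loss += lambda_noobj * (0.0 - conf) ** 2
    if t_obj:
        # Class probability term, only for cells that contain an object
        loss += sum((t - p) ** 2 for t, p in zip(t_classes, p_classes))
    return loss


# An empty cell: only the confidences of the two boxes are penalised
empty = cell_loss([0.0] * 5, False, (0.0, 0.0, 0.0, 0.0), [0.1] * 5,
                  [(0.2, 0.5, 0.5, 0.1, 0.1), (0.4, 0.5, 0.5, 0.1, 0.1)],
                  responsible=0)
print(empty)
```

The square roots in the width and height term reduce the weight of a given absolute error on large boxes relative to small ones.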




# Algorithm Implementation


In [None]:
#export

# imports
import os
import pandas as pd
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torch.optim as optim

from PIL import Image
from tqdm import tqdm
from torch.utils.data import DataLoader

In [None]:
#export

# Get the correct path for utils.py script
if os.getcwd() == '/media/ioannis/DATA/Documents/Machine_learning/Project/src/yolo_v1':
    print(f"The working directory is: {os.getcwd()}")
else:    
    os.chdir("/media/ioannis/DATA/Documents/Machine_learning/Project/src/yolo_v1")
    print(f"Change to yolo dir: {os.getcwd()}")

Change to yolo dir: /media/ioannis/DATA/Documents/Machine_learning/Project/src/yolo_v1


In [None]:
#export

from utils import (
    intersection_over_union,
    non_max_suppression,
    mean_average_precision,
    cellboxes_to_boxes,
    get_bboxes,
    plot_image,
    save_checkpoint,
    load_checkpoint
)

### YOLO model architecture

#### Architecture configuration based on YOLO paper

In [None]:
#export

architecture_config = [
    (7, 64, 2, 3),
    "M",    # M stands for the max-pooling layer, with a 2x2 kernel and a stride of 2
    (3, 192, 1, 1),
    "M",
    (1, 128, 1, 0),
    (3, 256, 1, 1), 
    (1, 256, 1, 0), 
    (3, 512, 1, 1), 
    "M", 
    [(1, 256, 1, 0), (3, 512, 1, 1), 4],
    (1, 512, 1, 0), 
    (3, 1024, 1, 1),
    "M", 
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2], 
    (3, 1024, 1, 1),
    (3, 1024, 2, 1),
    (3, 1024, 1, 1), 
    (3, 1024, 1, 1),
]

In [None]:
#export

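class CNNBlock(nn.Module):
    """
    Convolutional building block for the YoloV1 model: Conv2d -> BatchNorm2d -> LeakyReLU(0.1).
    The model needs convolutional layers many times, so CNNBlock is used for ease of use.

    Args:
        in_channels (int): number of input channels
        out_channels (int): number of output channels (filters)
        **kwargs: forwarded to nn.Conv2d (e.g. kernel_size, stride, padding)
    """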
    def __init__(self, in_channels, out_channels, **kwargs):
        super(CNNBlock, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.batchnorm = nn.BatchNorm2d(out_channels)
        self.leakyrelu = nn.LeakyReLU(0.1)

    def forward(self, x):
        x = self.leakyrelu(self.batchnorm(self.conv(x)))
        return x
    
class YoloV1(nn.Module):
    def __init__(self, in_channels=3, **kwargs):
        super(YoloV1, self).__init__()
        self.architecture = architecture_config
        self.in_channels = in_channels
        self.darknet = self._create_conv_layers(self.architecture)
        self.fcs = self._create_fcs(**kwargs)

    def forward(self, x):
        x = self.darknet(x)
        return self.fcs(torch.flatten(x, start_dim=1))

    def _create_conv_layers(self, architecture):
        layers = []
        in_channels = self.in_channels

        for x in architecture:
            if isinstance(x, tuple):
                layers += [
                    CNNBlock(
                        in_channels, x[1], kernel_size=x[0], stride=x[2], padding=x[3],
                    )
                ]
                in_channels = x[1]

            elif isinstance(x, str):
                layers += [nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))]

            elif isinstance(x, list):
                conv1 = x[0]
                conv2 = x[1]
                num_repeats = x[2]

                for _ in range(num_repeats):
                    layers += [
                        CNNBlock(
                            in_channels,
                            conv1[1],
                            kernel_size=conv1[0],
                            stride=conv1[2],
                            padding=conv1[3],
                        )
                    ]
                    layers += [
                        CNNBlock(
                            conv1[1],
                            conv2[1],
                            kernel_size=conv2[0],
                            stride=conv2[2],
                            padding=conv2[3],
                        )
                    ]
                    in_channels = conv2[1]

        return nn.Sequential(*layers)

    def _create_fcs(self, split_size, num_boxes, num_classes):
        S, B, C = split_size, num_boxes, num_classes

        return nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024 * S * S, 496),  # 496 here; the original paper uses 4096
            nn.Dropout(0.0),
            nn.LeakyReLU(0.1),
            nn.Linear(496, S * S * (C + B * 5)),  # C + B * 5 is 30 in the paper (C=20, B=2) and 15 here (C=5, B=2)
        )

In [None]:
def test(S=7, B=2, C=5):
    """
    A function to test YoloV1 model
    """
    model = YoloV1(split_size=S, num_boxes=B, num_classes=C)
    x = torch.randn((2, 3, 448, 448))
    print(model(x).shape)


test()

torch.Size([2, 735])
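
The 735 values per image are the flattened $(S, S, C + B \cdot 5) = (7, 7, 15)$ prediction. The sketch below shows how a single cell could be read out of such a flat vector. It assumes a row-major layout, i.e. that the vector is later reshaped to `(S, S, C + B * 5)`; this reshape is not shown in this part of the notebook, and the helper name `cell_slice` is made up for the example.

```python
S, B, C = 7, 2, 5
depth = C + B * 5           # values per cell: 5 classes + 2 boxes * (confidence + 4 coordinates)
flat_size = S * S * depth   # length of the model output per image


def cell_slice(flat, row, col):
    """Return the `depth` values that belong to grid cell (row, col),
    assuming the flat vector is laid out row by row, cell by cell."""
    start = (row * S + col) * depth
    return flat[start:start + depth]


# Stand-in for one row of the model output: the position of each value
flat_prediction = list(range(flat_size))

cell = cell_slice(flat_prediction, row=0, col=1)
class_scores = cell[:C]          # c_1 ... c_5
box_1 = cell[C:C + 5]            # p_c1, x_1, y_1, w_1, h_1
box_2 = cell[C + 5:C + 10]       # p_c2, x_2, y_2, w_2, h_2
print(flat_size, class_scores, box_1, box_2)
```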
