# **Implementing YOLO v1 in PyTorch**

<div>
     <center><img src="./img/predictions.jpg" width="750"/> </center>
</div>

## **1. What is YOLO?**

<div>
     <center><img src="./img/yolo.png" width="400"/> </center>
</div>

Developed in 2015 by Redmon et al., YOLO (You Only Look Once) is a deep learning architecture that greatly improved the efficiency of object detection networks. The goal is simple: to detect the objects appearing in an image, along with their corresponding bounding boxes (see the image above). Previous networks would train separately for category detection (e.g., dog, bicycle, laptop) and bounding box detection (where an object appears in an image), going back and forth between the two to refine predictions. Redmon et al. proposed a novel method in which bounding boxes and categories are predicted together in one pass through a single model, hence the name *You Only Look Once.*

#### **Why is it an important problem? (SAM)**


#### **Main contributions of YOLO (MATT)**

YOLO's main contribution to the study of object recognition is its speed while maintaining accuracy. As touched on above, by calculating both the bounding boxes and the categorization in a single pass, YOLO ran far faster than the then state-of-the-art detector (Fast R-CNN). YOLO's original code is implemented in C, which also contributed to the speedup. Additionally, compared to Fast R-CNN, YOLO made fewer background mistakes, and combining it with Fast R-CNN produced an improved mAP score. Overall, YOLO was the first object detector that could run in real time with mostly correct predictions. Running at 45 frames per second with a corresponding mAP of 63.4, YOLO more than doubles the mAP of other real-time detectors, and it gives up only about 10 mAP relative to non-real-time detectors.

## **2. YOLO v1 at a High Level**


an intro

#### **Data and Importing (SAM)**

For our training data, we resize each image to 112x112 pixels and convert it to greyscale. We then scale the pixel values of the resulting 112x112 matrix to lie between -1 and 1, and feed these scaled image matrices into our dataloader.
<div>
     <center><img src="./img/imageinput.png" width="750"/> </center>
</div>
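The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the code used in this project: the function name `preprocess`, the luminance weights used for the greyscale conversion, the bilinear resizing, and the random stand-in images are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def preprocess(image_uint8, size=112):
    """Map a (3, H, W) uint8 RGB tensor to a (1, size, size) float tensor in [-1, 1]."""
    img = image_uint8.float() / 255.0                       # scale to [0, 1]
    # Assumed greyscale conversion: standard luminance weights for R, G, B.
    weights = torch.tensor([0.299, 0.587, 0.114]).view(3, 1, 1)
    gray = (img * weights).sum(dim=0, keepdim=True)         # (1, H, W)
    # interpolate expects a batch dimension, so add and then remove one.
    gray = F.interpolate(gray.unsqueeze(0), size=(size, size),
                         mode="bilinear", align_corners=False).squeeze(0)
    return gray * 2.0 - 1.0                                 # scale to [-1, 1]

# Random stand-in images (hypothetical data, only to show the shapes).
images = [torch.randint(0, 256, (3, 200, 300), dtype=torch.uint8) for _ in range(4)]
batch = torch.stack([preprocess(im) for im in images])      # (4, 1, 112, 112)
loader = DataLoader(TensorDataset(batch), batch_size=2)
first_batch = next(iter(loader))[0]                         # (2, 1, 112, 112)
```

In a real pipeline the images would come from files and be paired with their box labels; only the resize, greyscale, and scaling steps are the point here.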

#### **Architecture (MATT)**

<div>
     <center><img src="./img/arch.png" width="750"/> </center>
</div>

YOLO's architecture consists of 24 convolutional layers followed by 2 fully connected layers, as detailed above. The 1x1 convolutional layers are in place to reduce the feature space from preceding layers. This may be confusing at first, since a 1x1 convolution with a stride of 1 looks like it just copies the image. However, each 1x1 kernel computes a learned weighted combination of all input channels at every pixel, so a layer with fewer 1x1 kernels than input channels reduces the number of feature maps while leaving the spatial size unchanged.
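As a small illustration of the channel reduction performed by a 1x1 convolution, consider the sketch below. The channel counts (256 in, 64 out) and the 14x14 spatial size are arbitrary values chosen for the example, not taken from the YOLO architecture.

```python
import torch

# A feature map with 256 channels on a 14x14 grid (illustrative sizes).
x = torch.randn(1, 256, 14, 14)

# 64 kernels of size 1x1: each output channel is a learned weighted
# combination of the 256 input channels at the same pixel.
reduce = torch.nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1, stride=1)
y = reduce(x)

print(tuple(x.shape), "->", tuple(y.shape))
```

The height and width are unchanged, while the channel dimension shrinks from 256 to 64.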

Object detection is a hard task, and at this point it had not been achieved in real time. To simplify the task, YOLO was first pre-trained to classify images. For pre-training, only the first 20 convolutional layers were used, followed by an average-pooling layer and a fully connected layer; these additional layers were removed in the actual YOLO model. Using the ImageNet 2012 data, the model was trained to assign each image to one of 1000 categories. After a week of training, the model achieved a top-5 accuracy of 88% (meaning the correct answer was among the top 5 categories the network predicted 88% of the time).
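To make the top-5 metric concrete, here is a minimal sketch of how it could be computed in torch. The helper name `top5_accuracy` and the toy scores are made up for the example.

```python
import torch

def top5_accuracy(logits, labels):
    """Fraction of rows whose true label is among the 5 highest-scoring classes."""
    top5 = torch.topk(logits, k=5, dim=1).indices        # (N, 5) class indices
    hits = (top5 == labels.unsqueeze(1)).any(dim=1)      # (N,) booleans
    return hits.float().mean().item()

# Two toy predictions over 10 classes, both with true class 7.
logits = torch.tensor([
    [float(i) for i in range(10)],        # class 9 scores highest, 7 is in the top 5
    [float(9 - i) for i in range(10)],    # class 0 scores highest, 7 is not in the top 5
])
labels = torch.tensor([7, 7])
accuracy = top5_accuracy(logits, labels)
```

Here one of the two predictions counts as correct, so the top-5 accuracy is 0.5.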

The intuitive reasoning for YOLO's pre-training can be explained with a simple example. Say I gave you an image and asked you to find every fish in the image and draw a box around it (this is essentially the YOLO task). You could probably do it. But if you didn't know what a fish actually looked like, the task would become significantly harder. The same is true for YOLO: teaching it to recognize images from a well-defined dataset such as ImageNet makes predicting bounding boxes feasible.

Up until the last layer, YOLO is just a standard CNN. Each layer of the CNN learns something about a pixel and its surrounding pixels in order to make the final prediction. The final prediction is where the novelty of the model comes from. The output of the last dense layer is reshaped into an $S \times S \times (C + B \cdot 5)$ tensor, where $S$ is the chosen grid size, $C$ is the number of categories, and $B$ is the number of boxes predicted in each grid square (each box contributes 5 values: $x$, $y$, $w$, $h$, and a confidence score). YOLO uses a grid size of 7x7 and 2 boxes per square to predict across 20 categories, resulting in an output of size 7x7x30. An example is seen in the image below. This approach allows both the category and the bounding boxes to be predicted simultaneously for fast, unified detection. Working with this output is less straightforward than something like a series of probability weights or a single value; however, YOLO provides a unique loss function to handle this.


<div>
     <center><img src="./img/grid.png" width="750"/> </center>
</div>
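The output size described above can be checked with a short sketch. The batch size of 4 and the random tensor standing in for the last dense layer's output are assumptions for illustration. The channel ordering used when slicing (class scores first, then the boxes) is also an assumption; it is one common layout, not something fixed by the description above.

```python
import torch

S, B, C = 7, 2, 20
depth = C + B * 5                       # 20 class scores + 2 boxes * 5 values = 30

# Stand-in for the flat output of the last dense layer for a batch of 4 images.
flat = torch.randn(4, S * S * depth)    # (4, 1470)
grid = flat.reshape(-1, S, S, depth)    # (4, 7, 7, 30)

# Assumed layout of the last dimension: [class scores | box 1 | box 2].
class_scores = grid[..., :C]            # (4, 7, 7, 20)
box1 = grid[..., C:C + 5]               # (4, 7, 7, 5)
box2 = grid[..., C + 5:C + 10]          # (4, 7, 7, 5)
```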

#### **Loss (SAM)**

YOLO v1 implements a custom loss function, described in their paper in the following equation:

<div>
     <center><img src="./img/lossfunction.png" width="750"/> </center>
</div>

This may look overwhelming at first, but it's not overly complicated. The loss function takes the sum of squared errors (SSE) across the predicted $(x,y)$ box midpoint, the square roots of the $(h,w)$ box dimensions, the box confidence scores, and the set of $c_i$ category probabilities. Some other important symbols:

$𝟙_{i}^{\textrm{obj}}$ indicates whether an object appears in cell $i$

$𝟙_{ij}^{\textrm{obj}}$ indicates whether the $j^{\textrm{th}}$ bounding box predictor in cell $i$ is "responsible" for that prediction (highest intersection over union, or IOU)

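$\lambda_{\textrm{coord}}$ is a weighting hyperparameter set to 5, which increases the contribution of the box coordinate errors

$\lambda_{\textrm{noobj}}$ is a weighting hyperparameter set to 0.5, which decreases the contribution of confidence errors in cells that contain no object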
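Since the "responsible" predictor is chosen by intersection over union, a small sketch of that computation may help. It assumes boxes are given as `(x, y, w, h)` with `(x, y)` the box midpoint; the function name `iou` and the example boxes are illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_mid, y_mid, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Width and height of the overlap, clamped at 0 when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

# The predictor with the highest IOU against the true box is "responsible".
truth = (1.0, 1.0, 2.0, 2.0)
predictions = [(2.0, 1.0, 2.0, 2.0), (1.1, 1.0, 2.0, 2.0)]
responsible = max(range(len(predictions)), key=lambda j: iou(predictions[j], truth))
```

In this toy case the second prediction overlaps the true box far more than the first, so index 1 is the responsible predictor.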

To implement this loss function, we define a custom loss in torch as a new `torch.nn.Module`, which we will call `YOLOLoss`.

We then define a `forward` function for computing the loss. The full loss function, including the extensive `forward` function, can be found in `loss.py`. We borrow most of this code from https://github.com/aladdinpersson, because while the loss function is not conceptually difficult, implementing it with torch functions becomes a bit unwieldy.
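To give a feel for what one term of the `forward` computation involves, here is a simplified sketch of only the coordinate part of the loss for a single box per cell. It is not the code in `loss.py`: the function name `coord_loss`, the tensor layout `(x, y, w, h)`, and the small epsilon added before the square root are assumptions made for this example.

```python
import torch

def coord_loss(pred_box, true_box, obj_mask, l_coord=5.0):
    """Coordinate term only. Boxes: (..., 4) as (x, y, w, h); obj_mask: (..., 1)."""
    sse = torch.nn.MSELoss(reduction="sum")
    # Zero out cells that contain no object so they add nothing to this term.
    pred = obj_mask * pred_box
    true = obj_mask * true_box
    # Square roots of width and height; sign and abs keep the operation
    # defined if the network predicts a negative size (assumed workaround).
    pred_wh = torch.sign(pred[..., 2:4]) * torch.sqrt(torch.abs(pred[..., 2:4]) + 1e-6)
    true_wh = torch.sqrt(true[..., 2:4])
    return l_coord * (sse(pred[..., 0:2], true[..., 0:2]) + sse(pred_wh, true_wh))

# A 2x2 grid with one object in cell (0, 0); other cells are ignored by the mask.
true = torch.zeros(2, 2, 4)
true[0, 0] = torch.tensor([0.5, 0.5, 0.25, 0.25])
pred = torch.rand(2, 2, 4)
pred[0, 0] = torch.tensor([0.6, 0.5, 0.25, 0.25])   # x is off by 0.1
obj_mask = torch.zeros(2, 2, 1)
obj_mask[0, 0, 0] = 1.0

loss = coord_loss(pred, true, obj_mask)   # about 5 * 0.1**2 = 0.05
```

The full loss adds the confidence and class probability terms in the same masked, summed style.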

#### **Evaluation (MATT)**

text

## **3. Our Simplified Implementation**

an intro (MATT)

what we did differently
- from 20 image categories to 5 shape categories
    - changing loss function
- removed bounding_box_2 in loss
- simplified network
    - removed some layers
    - reduced number of kernels per layer (20 vs 200)
- overall compressing output


#### **Data and Importing (MATT)**

text

#### **Architecture (MATT)**

text

#### **Loss (SAM)**

text

#### **Evaluation (MATT)**

text

import torch
import torch.nn

class YOLOLoss(torch.nn.Module):
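    """
    Computes the loss function according to YOLO v1.

    S: grid size, B: boxes predicted per cell, C: number of categories,
    l_coord / l_noobj: weights on the coordinate and no-object loss terms.
    """
    def __init__(self, S=7, B=2, C=20, l_coord=5, l_noobj=0.5):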
        super().__init__()
        self.S = S
        self.B = B
        self.C = C
        self.l_coord = l_coord
        self.l_noobj = l_noobj
        self.mse = torch.nn.MSELoss(reduction="sum") # SSE instead of MSE