# What is YOLO?

YOLO stands for You Only Look Once. It is an object detection algorithm that uses features learned by a deep convolutional neural network to detect objects.
The biggest advantage of YOLO over other popular architectures is speed. Models in the YOLO family are much faster than R-CNN and similar detectors, which makes real-time object detection achievable.

YOLO reframes object detection as a single regression problem, going directly from image pixels to bounding box coordinates and class probabilities. A single convolutional network therefore predicts multiple bounding boxes and the class probabilities for those boxes.

# Theory behind YOLO

Because YOLO looks at the image only once, a sliding-window approach is not suitable. Instead, the entire image is split into a grid of S×S cells, and each cell is responsible for predicting a few different things.

First, each cell is responsible for predicting some number of [bounding boxes](http://datahacker.rs/deep-learning-bounding-boxes/). Each cell also predicts a confidence value for each of those boxes, that is, the probability that the box contains an object. If a grid cell contains no object, it is important that the confidence value is low.
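To make the grid concrete, here is a minimal sketch of how an object can be matched to a cell. In the original YOLO paper, the cell that contains the center of an object is the one responsible for detecting it. The function name `responsible_cell`, the 416×416 image size, and S = 13 are assumptions chosen for illustration.

```python
def responsible_cell(x_center, y_center, img_w, img_h, S):
    """Return the (row, col) of the grid cell that contains the box center.

    Coordinates are in pixels; the image is divided into an S x S grid.
    """
    # Scale the center into grid units and take the integer part.
    # min(...) keeps a center lying exactly on the right/bottom edge inside the grid.
    col = min(int(x_center / img_w * S), S - 1)
    row = min(int(y_center / img_h * S), S - 1)
    return row, col


# Hypothetical example: a 416x416 image with a 13x13 grid (each cell is 32x32 pixels).
print(responsible_cell(208, 104, img_w=416, img_h=416, S=13))  # (3, 6)
```

The box centered at pixel (208, 104) lands in row 3, column 6, so that cell would be the one expected to predict it.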

When we visualize all of these predictions, we get a map of all the objects and a bunch of boxes ranked by their confidence values.

Second, each cell is responsible for predicting class probabilities. This does not say that a grid cell contains an object; it is a conditional probability. So, if a grid cell predicts car, it is not saying that there is a car; it is saying that if there is an object, then that object is a car.

Let us describe in more detail what the output looks like.

In YOLO, anchor boxes are used to predict bounding boxes. The main idea is to predefine a set of box shapes, for example two different shapes, called anchor boxes or anchor box shapes. In this way, we can associate two predictions with the two anchor boxes. In general, we might use more anchor boxes (five or even more). The anchors were calculated on the COCO dataset using k-means clustering.
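The clustering step can be sketched roughly as follows. This is not the code used to produce the published anchors; it is a small illustration that assumes the 1 − IoU distance on (width, height) pairs described in the YOLOv2 paper, a naive initialization from the first k boxes, and a handful of made-up box sizes. The names `wh_iou` and `kmeans_anchors` are hypothetical.

```python
def wh_iou(box, anchor):
    """IoU of two boxes given only (width, height), as if they shared a corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=50):
    """Cluster (width, height) pairs into k anchor shapes.

    Assumption: a box is assigned to the anchor with the highest IoU,
    which is the same as the smallest 1 - IoU distance.
    """
    anchors = [boxes[i] for i in range(k)]  # naive init: the first k boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda j: wh_iou(b, anchors[j]))
            clusters[best].append(b)
        new_anchors = []
        for j, members in enumerate(clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_anchors.append((w, h))
            else:
                new_anchors.append(anchors[j])  # keep an empty cluster's anchor
        if new_anchors == anchors:  # assignments stopped changing
            break
        anchors = new_anchors
    return anchors


# Made-up box sizes: three small boxes and three large ones.
boxes = [(10, 12), (100, 90), (12, 10), (110, 95), (9, 11), (95, 100)]
anchors = kmeans_anchors(boxes, k=2)
print([(round(w, 2), round(h, 2)) for w, h in anchors])
# [(10.33, 11.0), (101.67, 95.0)]
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering, since the error is measured relative to box size.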

Now that we have a grid, each cell will predict the following:

* For each bounding box:
  1. 4 coordinates (tx, ty, tw, th)
  2. 1 objectness score, which is the confidence that the box contains an object
* Some number of class probabilities

If the grid cell is offset from the top-left corner of the image by (cx, cy), then the predictions correspond to:

$$
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w e^{t_w} \\
b_h &= p_h e^{t_h}
\end{aligned}
$$
  
bx, by, bw, bh are the x and y center coordinates, width, and height of our prediction. tx, ty, tw, th are what the network outputs. cx and cy are the top-left coordinates of the grid cell. pw and ph are the anchor dimensions for the box.
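The decoding step above can be sketched in plain Python. The function name `decode_box` and the sample numbers are assumptions for illustration, and all quantities are expressed in grid-cell units.

```python
import math


def sigmoid(x):
    """Logistic function; squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw network outputs into a box center, width, and height."""
    bx = sigmoid(tx) + cx   # center x stays inside the cell that starts at cx
    by = sigmoid(ty) + cy   # center y stays inside the cell that starts at cy
    bw = pw * math.exp(tw)  # width is a positive rescaling of the anchor width
    bh = ph * math.exp(th)  # height is a positive rescaling of the anchor height
    return bx, by, bw, bh


# Hypothetical values: all-zero outputs in the cell at (6, 3) with a 3.5 x 2.0 anchor.
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=6, cy=3, pw=3.5, ph=2.0))
# (6.5, 3.5, 3.5, 2.0)
```

With all-zero outputs the box sits in the middle of its cell and has exactly the anchor's shape, which shows why the network only needs to learn small corrections to the anchors.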

The output has shape S × S × (B × (5 + C)), so depth-wise each cell holds B × (5 + C) values. B represents the number of bounding boxes each cell can predict. According to the paper, each of these B bounding boxes may specialize in detecting a certain kind of object. Each bounding box has 5 + C attributes, which describe the center coordinates, the dimensions, the objectness score, and C class confidences. YOLOv3 predicts 3 bounding boxes for every cell.
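As a quick check of the arithmetic, the snippet below computes the output shape for one grid. The helper name `yolo_output_shape` is hypothetical, and S = 13 and C = 80 (the number of COCO classes) are assumed values for illustration; only B = 3 comes from the text.

```python
def yolo_output_shape(S, B, C):
    """Shape of the prediction tensor for an S x S grid.

    Each of the B boxes per cell carries 4 coordinates, 1 objectness
    score, and C class scores, hence B * (5 + C) values per cell.
    """
    return (S, S, B * (5 + C))


shape = yolo_output_shape(S=13, B=3, C=80)
print(shape)                            # (13, 13, 255)
print(shape[0] * shape[1] * shape[2])   # 43095 values in total
```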

Now, if we take the class probabilities and multiply them by the confidence values, we get class-specific scores for all of the bounding boxes, that is, each box weighted by its probability of containing an object of that class.
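A minimal sketch of this multiplication, followed by a simple score threshold, might look like this. The function names, the three-class setup, and the 0.5 threshold are assumptions made for the example.

```python
def class_scores(objectness, class_probs):
    """Multiply the box confidence by each conditional class probability."""
    return [objectness * p for p in class_probs]


def best_class(objectness, class_probs):
    """Return (class index, score) for the highest-scoring class of a box."""
    scores = class_scores(objectness, class_probs)
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]


# Two hypothetical boxes: (objectness, class probabilities over 3 classes).
predictions = [
    (0.9, [0.1, 0.8, 0.1]),  # confident box, most likely class 1
    (0.2, [0.5, 0.3, 0.2]),  # weak box, unlikely to contain an object
]

threshold = 0.5  # assumed value; in practice this is tuned
kept = []
for objectness, probs in predictions:
    idx, score = best_class(objectness, probs)
    if score >= threshold:
        kept.append((idx, score))

print(len(kept))  # 1, only the confident box survives
```

The second box is dropped even though its top class probability is 0.5, because its low objectness pulls the combined score down to about 0.1.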

Simple thresholding gets rid of all predictions with a low confidence value. For the next step, it is important to define [Intersection over Union (IoU)](http://datahacker.rs/deep-learning-intersection-over-union/). The intersection over union of two boxes is the area of their intersection divided by the area of their union.
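The IoU definition translates directly into code. The sketch below assumes boxes are given as corner coordinates `(x1, y1, x2, y2)`, which is one common convention and not something the text above specifies.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
print(round(iou((0, 0, 2, 2), (1, 1, 3, 3)), 4))  # 0.1429
```

IoU is 1.0 for identical boxes and 0.0 for boxes that do not overlap at all, which makes it a convenient measure of how much two predictions agree.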

After this, duplicates can still be present, and to get rid of them we apply non-maximum suppression. Non-maximum suppression takes the bounding box with the highest probability, then looks at the other bounding boxes close to it and suppresses the ones that overlap it the most (highest IoU). The process repeats with the next highest-scoring box that remains.
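A greedy version of this procedure can be sketched in a few lines. This is an illustration under stated assumptions and not the implementation built later: boxes use `(x1, y1, x2, y2)` corners, the IoU threshold of 0.5 is an assumed default, and all boxes are treated as belonging to the same class.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns the indices of the kept boxes."""
    # Visit boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the chosen one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Two near-duplicate boxes and one separate box (made-up values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box is kept. In practice this step is usually run separately for each class, and in PyTorch projects `torchvision.ops.nms` offers an optimized version of the same idea.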

Because everything is done in just one pass, detection is nearly as fast as classification. Also, all detections are predicted simultaneously, which means that the model implicitly incorporates global context. In simpler words, it can learn which objects tend to occur together, the relative size and location of objects, and so on.

# Implementation in PyTorch

After going over the theory and the fundamentals behind YOLOv3, we can start building it in PyTorch.