# 1. Object Localization

Object detection is one of the fastest-moving areas of computer vision, and it works far better than it did just a couple of years ago. To build up to object detection, we first learn about object localization. Let’s start by defining what that means.

We have already said that the image classification task is to look at a picture and say whether there is a car or not. Classification with localization means that we not only label an object as a car, but also draw a bounding box (a rectangle) around the position of the car in the image. In the term “classification with localization”, localization refers to figuring out where in the picture the detected car is.

Later we’ll learn about the detection problem, where there might be multiple objects in the picture and we have to detect and localize them all. If we’re doing this for an autonomous driving application, we might need to detect not just other cars but also pedestrians, motorcycles, or even other objects. The classification and the classification with localization problems usually have one big object in the middle of the image that we’re trying to recognize, or recognize and localize. In the detection problem, in contrast, there can be multiple objects, possibly of different categories, within a single image. The ideas we learn for image classification will be useful for classification with localization, and the ideas we learn for localization will in turn be useful for detection.

## 1.1. What are localization and detection?

<img src="figures/local_detection.png" style="width:700px">

Let’s start by talking about classification with localization. We’re already familiar with the image classification problem, in which we input a picture into a 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 with multiple layers. This results in a vector of features that is fed to, say, a 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 unit that outputs the predicted class.

<img src="figures/convnet_softmax.png" style="width:700px">


If we’re building a self-driving car, maybe our object categories are a pedestrian, a car, a motorcycle and a background (meaning none of the above). So, if there’s no pedestrian, no car and no motorcycle, the output is background. These are our four classes, so we need a 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 with 4 possible outputs.

What if we want to localize the car in the image as well? To do that we can change our neural network to have a few more output units that output a bounding box. In particular, we can have the neural network output 4 more numbers: $b_x$, $b_y$, $b_h$, $b_w$. These 4 numbers parameterize the bounding box of the detected object.


<img src="figures/localization.png" style="width:700px">

## 1.2. Classification with localization

Here we use the convention that the upper-left corner of the image is (0,0) and the lower-right corner is (1,1).

<img src="figures/bounding_box.png" style="width:300px">

Specifying the bounding box (the red rectangle) requires specifying its midpoint, that is, the point ($b_x$, $b_y$), as well as its height $b_h$ and its width $b_w$. If our training set contains not just the object class label that the neural network is trying to predict, but also 4 additional numbers giving the bounding box, then we can use supervised learning to make our algorithm output not just a class label but also the 4 parameters that tell us where the bounding box of the detected object is. In this example $b_x$ might be about 0.5 because the midpoint is about halfway to the right of the image, $b_y$ might be about 0.7 since it is about 70 % of the way down the image, $b_h$ might be about 0.3 because the height of the red box is about 30 % of the overall height of the image, and $b_w$ might be about 0.4 because the width of the red box is about 40 % of the overall width of the image.
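To make the parameterization concrete, here is a minimal sketch that converts a normalized box ($b_x$, $b_y$, $b_h$, $b_w$) into pixel corner coordinates. The function name `box_to_corners` and the 100×100 image size are assumptions made only for illustration.

```python
def box_to_corners(b_x, b_y, b_h, b_w, img_h, img_w):
    """Convert a normalized midpoint/size box to pixel corners.

    Assumes (0, 0) is the upper-left corner of the image and (1, 1) the
    lower-right corner, as in the convention above.
    """
    x_min = (b_x - b_w / 2) * img_w
    x_max = (b_x + b_w / 2) * img_w
    y_min = (b_y - b_h / 2) * img_h
    y_max = (b_y + b_h / 2) * img_h
    return x_min, y_min, x_max, y_max


# The example box from the text on a hypothetical 100x100 image.
corners = box_to_corners(b_x=0.5, b_y=0.7, b_h=0.3, b_w=0.4, img_h=100, img_w=100)
print(corners)  # approximately (30, 55, 70, 85)
```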

## 1.3. Defining the target label 𝑦

Let’s formalize this a bit more by defining the target label 𝑦 for this supervised learning task. It’s going to be a vector whose first component $p_c$ indicates whether there is an object. If the object is a pedestrian, a car or a motorcycle, $p_c$ will be equal to 1, and if it is the background class (none of the objects we’re trying to detect), then $p_c$ will be 0. We can think of $p_c$ as the probability that there’s an object, that is, the probability that one of the classes we’re trying to detect is present rather than the background class.

Our vector 𝑦 would be as follows:

$$y=\begin{bmatrix} p_{c}\\ b_{x}\\b_{y}\\b_{h}\\b_{w}\\c_{1}\\c_{2}\\c_{3}\end{bmatrix}$$

Next, if there is an object, we want to output $b_x$, $b_y$, $b_h$ and $b_w$, the bounding box of the detected object. Finally, if there is an object, so if $p_c=1$, we also want to output $c_1$, $c_2$ and $c_3$, which tell us whether it is class 1, class 2 or class 3, in other words whether it is a pedestrian, a car or a motorcycle. In this classification with localization problem we assume that at most one of these objects appears in the picture. Let’s go through a couple of examples.

If 𝑥 is a training set image that contains a car, then 𝑦 will have the first component $p_c=1$ because there is an object, and $b_x$, $b_y$, $b_h$ and $b_w$ will specify the bounding box. So, to label our training set we need bounding boxes in the labels.

Finally, this is a car, so it’s class 2: $c_1=0$ because it’s not a pedestrian, $c_2=1$ because it is a car, and $c_3=0$ since it’s not a motorcycle. Among $c_1$, $c_2$, $c_3$ at most one should be equal to 1.

$$y= \begin{bmatrix}1\\b_{x}\\b_{y}\\b_{h}\\b_{w}\\0\\1\\0 \end{bmatrix}$$

What if there’s no object in the image? In this case $p_c=0$ and the rest of the elements of this vector can be any number: if there is no object in the image, we don’t care what bounding box the neural network outputs, nor which of the three classes $c_1$, $c_2$, $c_3$ it picks.

$$y=\begin{bmatrix}0\\?\\?\\?\\?\\?\\?\\? \end{bmatrix}$$


<img src="figures/examples_training.png" style="width:600px">

This is how we construct a set of labeled training examples: each example pairs an input image 𝑥 with its label 𝑦, both for images where there is an object and for images where there is no object. The set of these pairs then defines our training set.
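The label construction above can be sketched in a few lines of Python. The helper `make_label`, the `CLASSES` list and the use of `None` for the "don’t care" entries are illustrative choices, not part of the original material.

```python
CLASSES = ["pedestrian", "car", "motorcycle"]  # c_1, c_2, c_3


def make_label(class_name=None, box=None):
    """Build the 8-element target [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3].

    With no object, only p_c = 0 matters; the "don't care" entries are
    stored as None here (an illustrative choice).
    """
    if class_name is None:
        return [0] + [None] * 7
    one_hot = [1 if name == class_name else 0 for name in CLASSES]
    return [1] + list(box) + one_hot


print(make_label("car", (0.5, 0.7, 0.3, 0.4)))  # [1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0]
print(make_label())  # [0, None, None, None, None, None, None, None]
```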

 
## 1.4. Loss function

Finally, let’s describe the loss function we use to train the neural network. If the ground truth label is 𝑦 and the neural network outputs some $\hat{y}$, what should the loss be?

$$L\left ( \hat{y},y \right )=\left ( \hat{y}_{1}-y_{1} \right )^{2}+\left ( \hat{y}_{2}-y_{2} \right )^{2}+\dots+\left ( \hat{y}_{8}-y_{8} \right )^{2}, \quad \text{if } y_{1}=1$$

$$L\left ( \hat{y},y \right )= \left ( \hat{y}_{1}-y_{1} \right )^{2}, \quad \text{if } y_{1}=0$$


Notice that 𝑦 has 8 components, so the first line of the loss function is the sum of the squared differences over all 8 elements. That is the loss if $y_1=1$, the case where there is an object (recall that $y_1=p_c$). So, if there is an object in the image, the loss can be the sum of squares over all the different elements.

The other case is $y_1=0$, that is, $p_c=0$. In that case the loss can be just $(\hat{y}_1 - y_1)^2$, because the rest of the components are not important. All we care about is how accurately the neural network outputs $p_c$.
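The two cases of the loss translate directly into code. This is a minimal sketch for a single example; the function name `localization_loss` is an assumption, and labels are plain Python sequences ordered as $[p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]$.

```python
def localization_loss(y_hat, y):
    """Squared-error loss for one example.

    y and y_hat are 8-element sequences
    [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3].
    """
    if y[0] == 1:
        # Object present: every component contributes.
        return sum((p - t) ** 2 for p, t in zip(y_hat, y))
    # No object: only the p_c prediction is penalized, so the
    # "don't care" entries of y are never read.
    return (y_hat[0] - y[0]) ** 2


y_car = [1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0]
print(localization_loss(y_car, y_car))                        # 0
print(localization_loss([0.5] + [9] * 7, [0] + [None] * 7))   # 0.25
```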


# 2. Landmark Detection

In the previous section we saw how we can get a neural network to output 4 numbers, $b_x$, $b_y$, $b_h$ and $b_w$, to specify the bounding box of an object we want the neural network to localize. More generally, we can have a neural network output just the 𝑥 and 𝑦 coordinates of important points in the image, sometimes called landmarks.

Let’s see a few examples. Let’s say we’re building a face recognition application, and for some reason we want the algorithm to tell us where the corner of someone’s eye is.

<img src="figures/landmark_detection.png" style="width:600px">


Every point has an 𝑥 and a 𝑦 coordinate, so we can have a neural network whose final layer outputs two more numbers, which we will call $l_x$ and $l_y$, to specify the coordinates of a point (for example, the corner of the person’s eye).

Now, what if we wanted the neural network to tell us the four corners of both eyes? If we call the points the first, second, third and fourth point, going from left to right, then we can modify the neural network to output $l_{1x}$, $l_{1y}$ for the first point, $l_{2x}$, $l_{2y}$ for the second point, and so on. The neural network can then output the estimated position of all four points on the person’s face. What if we don’t want just those four points? What if we want to output many points, for example different positions along the eye or the shape of the mouth, to see whether the person is smiling or not? We could define some number of landmarks on the face, for the sake of argument let’s say 64, perhaps including some points that define the edge of the face, such as the jawline. By selecting a number of landmarks and generating a labeled training set that contains all of them, we can have the neural network tell us where all the key positions, or key landmarks, on a face are.

<img src="figures/landmark_detection_convnet.png" style="width:700px">

So, what we do is take this image of a person’s face as input, pass it through a 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 to obtain some set of features, and have it output 0 or 1 (is there a face in the image or not) together with $l_{1x}$, $l_{1y}$ and so on down to $l_{64x}$, $l_{64y}$. We use $l$ to stand for a landmark.

This example would have 129 output units: 1 for whether there is a face or not, plus 64×2 = 128 for the 64 landmarks. The output can tell us whether there’s a face as well as where all the key landmarks on the face are. Of course, in order to train a network like this we need a labeled training set: a set of images together with labels 𝑦, where someone has had to go through and laboriously annotate all of these landmarks.
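As a small check of the arithmetic, the following sketch flattens a face flag and a list of landmark coordinates into a single target vector. The helper `landmark_target` and the placeholder coordinates are hypothetical.

```python
def landmark_target(face_present, landmarks):
    """Flatten [(l_1x, l_1y), ..., (l_nx, l_ny)] into one target vector.

    The first entry flags whether a face is present; the rest are the
    landmark coordinates in a fixed order.
    """
    target = [1 if face_present else 0]
    for l_x, l_y in landmarks:
        target.extend([l_x, l_y])
    return target


num_landmarks = 64
dummy_landmarks = [(0.0, 0.0)] * num_landmarks  # placeholder coordinates
print(len(landmark_target(True, dummy_landmarks)))  # 129
```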

## Pose detection

If we are interested in detecting a person’s pose, we can also define a few key positions (as we can see in the picture below), such as the midpoint of the chest, the left shoulder, the left elbow, the wrist and so on. By having a neural network output all of those annotated points, we can have it output the pose of the person.

<img src="figures/pose_detection.png" style="width:300px">

To do that we also need to specify these key landmarks, which may run from $l_{1x}$, $l_{1y}$ for the midpoint of the chest down to, say, $l_{32x}$, $l_{32y}$, if we use 32 landmark points to specify the pose of the person.

This idea of just adding a bunch of output units for the (𝑥,𝑦) coordinates of the landmarks we want to recognize might seem quite simple. To be clear, the identity of each landmark must be consistent across different images: for example, landmark 1 is always one corner of the eye, landmark 2 is always the other corner of the same eye, and so on. The labels have to be consistent across different images.

# 3. Object Detection

We have learned about object localization as well as landmark detection. Now let’s build an object detection algorithm. In this section we’ll learn how to use a 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 to perform object detection using the sliding windows detection algorithm.

Let’s say we want to build a car detection algorithm.

<img src="figures/object_detection_training.png" style="width:700px">

We can first create a labeled training set (𝑥,𝑦) with closely cropped examples of cars, plus some other pictures that aren’t pictures of cars. To make the training dataset we can take a picture and crop it, cutting out anything that is not part of the car, so that we end up with the car centered and filling pretty much the entire image. Given this labeled training set, we can then train a 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 that takes an image as input, like one of the closely cropped images above, and outputs 𝑦 (1 if it is a car, 0 if it is not).

Once we have trained this 𝑐𝑜𝑛𝑣𝑛𝑒𝑡, we can use it in sliding windows detection. Given a test image like the following one, we start by picking a certain window size, shown below, and then input that small rectangular region into the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡.

<img src="figures/sliding_window.png" style="width:300px">


Take just this little red square, as drawn in the picture above, feed it into the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡, and have the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 make a prediction. Presumably, for that little region, it will say that the red square does not contain a car. In the sliding windows detection algorithm, we then input a second region, bounded by the red square shifted a little bit over, and run the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 again on just that region. Then we do the same with a third region, and so on, until we have slid the window across every position in the image. We basically go through every region of this size at some stride, pass lots of little cropped images into the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡, and have it classify each one as 0 or 1. This is one pass of a sliding window through the image. We then repeat the procedure with a larger window, and then with an even larger one (as we can see in the following image).

<img src="figures/sliding_window_sizes.png" style="width:600px">

So, we take a slightly larger region, feed it to the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 and have it output 0 or 1. Then we slide the window over again using some stride, and we keep going across the entire image until we get to the end. Then we might do it a third time using even larger windows, and so on.

The hope is that if there’s a car somewhere in the image, there will be a window for which the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 outputs 1, meaning that there is a car in that input region. This algorithm is called sliding windows detection because we take these windows, these square red boxes, slide them across the entire image with some stride, and classify every square region as containing a car or not.
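The procedure can be sketched as three nested loops over window sizes and positions. This is a toy illustration, not a real detector: `classify` is a stand-in for the trained convnet, and the image is a plain list of pixel rows.

```python
def sliding_window_detect(image, classify, window_sizes, stride):
    """Return (top, left, size) for every square window classified as 1.

    `image` is a list of pixel rows; `classify` is a stand-in for the
    trained convnet and must return 0 or 1 for a cropped patch.
    """
    height, width = len(image), len(image[0])
    detections = []
    for size in window_sizes:
        for top in range(0, height - size + 1, stride):
            for left in range(0, width - size + 1, stride):
                patch = [row[left:left + size] for row in image[top:top + size]]
                if classify(patch) == 1:
                    detections.append((top, left, size))
    return detections


# Toy usage: a 6x6 "image" of zeros with one bright 2x2 blob, and a dummy
# classifier that fires when every pixel in the patch is bright.
image = [[0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (4, 5):
        image[r][c] = 1
all_bright = lambda patch: int(all(all(px == 1 for px in row) for row in patch))
print(sliding_window_detect(image, all_bright, window_sizes=[2, 3], stride=2))
# [(2, 4, 2)]
```

Note that every window triggers a separate call to `classify`, which is exactly the cost discussed next.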

## Disadvantages of sliding window detection and how to overcome them

There’s a huge disadvantage of sliding windows detection, which is the **computational cost**: we’re cropping out many different square regions of the image and running each of them independently through a 𝑐𝑜𝑛𝑣𝑛𝑒𝑡. If we use a very coarse stride (a very big step size), that reduces the number of windows we need to pass through the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡, but the coarser granularity may hurt performance. If we use a very fine granularity (a very small stride), the huge number of little regions we pass through the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 means a very high computational cost.

Before the rise of neural networks, people used much simpler classifiers, such as a linear classifier over hand-engineered features, to perform object detection. In that era each classifier was relatively cheap to compute, since it was just a linear function, so sliding windows detection ran acceptably and was not a bad method. With 𝑐𝑜𝑛𝑣𝑛𝑒𝑡𝑠, running a single classification is much more expensive, and sliding windows in this form is infeasibly slow. And unless we use a very fine granularity or a very small stride, we also end up unable to localize the objects accurately within the image.

Fortunately, this problem of computational cost has a pretty good solution. In particular, the sliding windows object detector can be implemented convolutionally, which is much more efficient.
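A quick count shows how strongly the stride drives the number of classifier calls. The 100×100 image and the 14×14 window below are assumed values chosen only for illustration.

```python
def count_windows(img_h, img_w, size, stride):
    """Number of size x size windows that fit at the given stride."""
    rows = (img_h - size) // stride + 1
    cols = (img_w - size) // stride + 1
    return rows * cols


# Hypothetical 100x100 image with a 14x14 window.
for stride in (1, 2, 8):
    print(stride, count_windows(100, 100, 14, stride))
# 1 7569
# 2 1936
# 8 121
```

Each of these counts is multiplied again by the number of window sizes we try, and every window costs one full forward pass.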


# 4. Convolutional operation of sliding windows


In the previous section we learned about the sliding windows object detection algorithm using a 𝑐𝑜𝑛𝑣𝑛𝑒𝑡, but we saw that it was too slow. In this section we will see how to implement that algorithm convolutionally. Let’s see what that means.

To build up to the convolutional implementation of sliding windows, let’s first see how we can turn the 𝐹𝑢𝑙𝑙𝑦 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑒𝑑 layers of our neural network into 𝐶𝑜𝑛𝑣𝑜𝑙𝑢𝑡𝑖𝑜𝑛𝑎𝑙 layers. Let’s say that our object detection algorithm takes 14×14×3 images as input (this is quite small, but we use it just for illustrative purposes). It then uses 16 filters of size 5×5 to map the input from 14×14×3 to 10×10×16, and applies a 2×2 𝑀𝑎𝑥 𝑝𝑜𝑜𝑙𝑖𝑛𝑔 layer to reduce the volume to 5×5×16. Then we have a 𝐹𝑢𝑙𝑙𝑦 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑒𝑑 layer with 400 units, another 𝐹𝑢𝑙𝑙𝑦 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑒𝑑 layer (also with 400 units), and finally the network outputs 𝑦 using a 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 unit.


<img src="figures/cnn_example.png" style="width:700px">


## 4.1. How to turn 𝐹𝑢𝑙𝑙𝑦 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑒𝑑 layers into 𝐶𝑜𝑛𝑣𝑜𝑙𝑢𝑡𝑖𝑜𝑛𝑎𝑙 layers?

The 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 is the same as before for the first few layers. One way of implementing the first 𝐹𝑢𝑙𝑙𝑦 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑒𝑑 layer is to use 400 filters of size 5×5 (see the picture below). So, we take the 5×5×16 volume and convolve it with 5×5 filters. Remember that a 5×5 filter is implemented as a 5×5×16 filter, because our convention is that the filter looks across all 16 channels. With 400 of these 5×5×16 filters, the output dimension is going to be 1×1×400. Rather than viewing these 400 values as just a set of nodes (units), we view them as a 1×1×400 volume. Mathematically this is the same as a 𝐹𝑢𝑙𝑙𝑦 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑒𝑑 layer, because each of the 400 nodes has a filter of dimension 5×5×16, so each of the 400 values is some arbitrary linear function of the 5×5×16 activations from the previous layer.

<img src="figures/fully_connected_to_conv.png" style="width:700px">

Next, to implement the second 𝐹𝑢𝑙𝑙𝑦 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑒𝑑 layer, we use a 1×1 convolution. With 400 filters of size 1×1, the next layer is again 1×1×400. Finally, we have another 1×1 convolution, this time with 4 filters, followed by a 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 activation, which gives a 1×1×4 volume that takes the place of the four numbers the network was outputting. This shows how we can take these 𝐹𝑢𝑙𝑙𝑦 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑒𝑑 layers and implement them using 𝐶𝑜𝑛𝑣𝑜𝑙𝑢𝑡𝑖𝑜𝑛𝑎𝑙 layers. The sets of units are now implemented as 1×1×400 and 1×1×4 volumes.
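The equivalence can be checked numerically on a tiny example. The sketch below uses a 2×2×3 volume and 4 units as stand-ins for 5×5×16 and 400, with random weights and no bias or activation; all names are illustrative.

```python
import random

random.seed(0)
H, W, C, UNITS = 2, 2, 3, 4  # tiny stand-ins for 5, 5, 16, 400

# Input volume indexed as volume[row][col][channel].
volume = [[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
# One H x W x C filter per output unit.
filters = [[[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
           for _ in range(UNITS)]


def conv_full(volume, filters):
    """Convolve with filters as large as the input: one value per filter."""
    return [sum(f[r][c][k] * volume[r][c][k]
                for r in range(H) for c in range(W) for k in range(C))
            for f in filters]


def fully_connected(volume, filters):
    """Same weights, viewed as a UNITS x (H*W*C) matrix times the flattened input."""
    x = [volume[r][c][k] for r in range(H) for c in range(W) for k in range(C)]
    weights = [[f[r][c][k] for r in range(H) for c in range(W) for k in range(C)]
               for f in filters]
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]


conv_out = conv_full(volume, filters)
fc_out = fully_connected(volume, filters)
print(all(abs(a - b) < 1e-12 for a, b in zip(conv_out, fc_out)))  # True
```

Because the filter covers the whole input, there is only one position to evaluate, so the convolution output is a 1×1×UNITS volume holding the same numbers as the fully connected layer.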

## 4.2. A convolutional implementation of sliding windows object detection

Let’s say that our sliding windows 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 inputs 14×14×3 images. As before we have a neural network as follows that eventually outputs a 1×1×4 volume which is the output of our 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 unit. We can see the implementation of this neural network in the following picture. 

<img src="figures/fully_connected_to_conv_2.png" style="width:700px">

Let’s say that our 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 takes 14×14×3 images as input and our test set image is 16×16×3, so we add the yellow stripe to the border of the image, as we can see in the picture below.

<img src="figures/fully_connected_to_conv_3.png" style="width:700px">

In the original sliding windows algorithm we would input the blue region into the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 and run it once to generate a classification (to output 0 or 1), and then slide the window over a bit. Using a stride of 2 pixels, we might slide the window to the right to input the green rectangle into the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡, rerun the whole 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 and get another label, 0 or 1. Then we might input the orange region into the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 and run it one more time to get another label, and then do it a fourth and final time with the lower-right purple square. So, to run sliding windows on this pretty small 16×16×3 image, we run the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 from above 4 times in order to get 4 labels.

It turns out that a lot of the computation done by these 4 runs of the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 is highly duplicated, so the convolutional implementation of sliding windows allows these 4 forward passes to share a lot of computation. Specifically, here’s what we can do. We take the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 and run it with the same parameters on the whole image: the same 16 filters of size 5×5 now give a 12×12×16 output volume, the same max pooling gives 6×6×16, and the same 400 filters of size 5×5 give a 2×2×400 volume instead of a 1×1×400 volume. Running that through the 1×1 convolution gives another 2×2×400 volume instead of 1×1×400, and doing it one more time gives a 2×2×4 output volume instead of 1×1×4.

It turns out that the blue 1×1×4 subset of this output gives us the result of running the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 on the upper-left 14×14 region of the image, the upper-right 1×1×4 subset gives us the upper-right result, the lower-left subset gives us the result of running the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 on the lower-left 14×14 region, and the lower-right 1×1×4 subset gives us the same result as running the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 on the lower-right 14×14 region.

To step through the calculation, let’s look at the green example. If we had cropped out just this region and passed it through the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 on top, then the first layer’s activations would have been exactly this region, the activations after max pooling would have been exactly this region, and similarly for each following layer. So, instead of forcing us to run forward propagation on 4 subsets of the input image independently, the convolutional implementation combines all 4 into one forward computation and shares a lot of the computation in the regions of the image that are common to the four 14×14 patches we saw here.

Let’s go through a bigger example. Let’s say we now want to run sliding windows on a 28×28×3 image. If we run forward propagation the same way, we end up with an 8×8×4 output. This corresponds to running sliding windows with the 14×14 region: first on the upper-left region, which gives the output in the upper-left corner, then using a stride of 2 to shift one window over, then one more, and so on. There are 8 positions, which gives us the first row. Going down the image in the same way gives us all of the 8×8×4 outputs. Because of the max pooling of 2, this corresponds to running our neural network with a stride of 2 on the original image.
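The shape arithmetic for the three input sizes, and the mapping from an output position back to its 14×14 input window, can be sketched as follows. Both helper names are hypothetical, and the layer sizes are the ones used in this example network.

```python
def output_grid_size(n):
    """Trace the spatial size through the example network for an n x n input."""
    n = n - 5 + 1   # 5x5 conv, 16 filters
    n = n // 2      # 2x2 max pooling, stride 2
    n = n - 5 + 1   # 5x5 conv standing in for the first fully connected layer
    return n        # the 1x1 convolutions leave the spatial size unchanged


def window_for_cell(i, j, window=14, stride=2):
    """Input region (top, left, bottom, right) seen by output position (i, j).

    bottom and right are exclusive pixel indices.
    """
    top, left = i * stride, j * stride
    return top, left, top + window, left + window


for n in (14, 16, 28):
    print(n, output_grid_size(n))  # 14 -> 1, 16 -> 2, 28 -> 8
print(window_for_cell(1, 1))       # (2, 2, 16, 16): lower-right window of a 16x16 image
print(window_for_cell(7, 7))       # (14, 14, 28, 28): last window of a 28x28 image
```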

<img src="figures/fully_connected_to_conv_bigger.png" style="width:800px">


To recap, to implement sliding windows we previously cropped out a region, say 14×14, ran it through our 𝑐𝑜𝑛𝑣𝑛𝑒𝑡, did the same for the next 14×14 region over, then the next one, and so on, until hopefully one of them recognized the car. Now, instead of doing it sequentially, with the convolutional implementation we saw above we can input the entire image, maybe 28×28, and convolutionally make all the predictions at the same time with one forward pass through this big 𝑐𝑜𝑛𝑣𝑛𝑒𝑡, in the hope that it recognizes the position of the car.


<img src="figures/full_sliding_window_example.png" style="width:800px">

That’s how we implement sliding windows convolutionally, and it makes the whole thing much more efficient. This algorithm still has one weakness: the position of the bounding boxes is not going to be very accurate. In the next section we will see how we can detect objects more accurately.

# 5. YOLO: Bounding Box Predictions 

In the last section, we learned how to use a convolutional implementation of sliding windows. That’s more computationally efficient, but it still has the problem of not outputting the most accurate bounding boxes.
In this section, we will see how we can obtain more accurate predictions of bounding boxes.

## 5.1. Output accurate bounding boxes

With sliding windows, we move a window of fixed size across the image and obtain a discrete set of window positions (such as the purple box). We then apply a classifier to see whether there is a car in each particular window.

This is not the most accurate way of getting bounding boxes, since none of the window positions may line up exactly with the object. Let’s see what we can do.

A good way to get more accurate output bounding boxes is the 𝑌𝑂𝐿𝑂 algorithm. 𝑌𝑂𝐿𝑂 stands for “You Only Look Once”.

## 5.2. 𝑌𝑂𝐿𝑂 algorithm

<img src="figures/yolo_algo.png" style="width:400px">


Let’s say we have a 100×100 input image. We’re going to place a grid on this image, and for the purpose of illustration we are going to use a 3×3 grid. In an actual implementation we would use a finer one, for example a 19×19 grid.

The basic idea of the 𝑌𝑂𝐿𝑂 algorithm is to apply the image classification and localization algorithm to each of the nine grid cells.

### How do we define labels 𝑦?

In the following picture, we can see the output vectors 𝑦 for the three grid cells marked by the purple, green and orange rectangles.


<img src="figures/yolo_label.png" style="width:500px">

Our first output $p_c$ is either 0 or 1 depending on whether or not there is an object in that grid cell. Then we have $b_x$, $b_y$, $b_h$, $b_w$ to specify the bounding box of an object (in case there is an object associated with that grid cell). Finally, $c_1$, $c_2$, $c_3$ are the labels for the pedestrian, car and motorcycle classes; the background class is covered by $p_c=0$.

In this image we have nine grid cells, so for each grid cell we can define a vector like the one we saw in the picture above. Let’s start with the upper-left grid cell. For this grid cell, we see that there is no object present. So, the label vector 𝑦 for the upper-left grid cell will have $p_c=0$, and we don’t care what the remaining values in this vector are. The output label 𝑦 is the same for the first three grid cells, because none of these three grid cells has an interesting object in it.

The image has two objects, which are located in the remaining six grid cells. The 𝑌𝑂𝐿𝑂 algorithm takes the midpoint of each of the two objects and assigns the object to the grid cell that contains the midpoint. So, the left car is assigned to the green grid cell, whereas the car on the right is assigned to the orange grid cell.
Even though four grid cells (bottom right) contain some part of the right car, the object is assigned to just one grid cell. So, for the central grid cell, the vector 𝑦 also looks like a vector with no object: the first component $p_c$ is equal to 0, and the rest of the values in this vector can be anything, since we don’t care about them. Hence, for grid cells like these, with no object assigned, we have the following vector 𝑦:

$$y = \begin{bmatrix} 0 \\ ? \\ ? \\ ? \\ ? \\ ? \\ ? \\ ?  \end{bmatrix}$$

On the other hand, for the cell circled in green on the left, the target label 𝑦 is defined in the following way. First, there is an object, so $p_c=1$, and then we write $b_x$, $b_y$, $b_h$, $b_w$ to specify the position of its bounding box. Class 1 is a pedestrian, so $c_1=0$; class 2 is a car, so $c_2=1$; and class 3 is a motorcycle, so $c_3=0$. Similarly, for the grid cell on the right there is an object in it, and its vector has the same structure as the previous one.

Finally, for each of these nine grid cells we end up with an eight-dimensional output vector. Because we have 3×3 grid cells, the total volume of the output is going to be 3×3×8.

<img src="figures/yolo_output.png" style="width:200px">

The target output volume is 3×3×8. For example, the 1×1×8 volume in the upper left corresponds to the target output vector for the upper-left grid cell. For each of the 3×3 positions we have an eight-dimensional target vector 𝑦 that we want to output, and some of these vectors correspond to cells without an object.

### Let’s now see in more detail how we define the output vector 𝑦
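Assigning each object to the cell that contains its midpoint can be sketched as below. The function `build_target`, the object tuples and the box values are made-up placeholders, and the "don’t care" entries are stored as 0.0 as an arbitrary choice.

```python
def build_target(objects, grid=3, num_classes=3):
    """Build a grid x grid x (5 + num_classes) target volume.

    Each object is (mid_x, mid_y, box, class_index): the midpoint is in
    image coordinates in [0, 1), `box` is the already encoded
    (b_x, b_y, b_h, b_w), and class_index is 0-based.
    Cells without an object get p_c = 0; their "don't care" entries are
    stored as 0.0 here.
    """
    depth = 5 + num_classes
    target = [[[0.0] * depth for _ in range(grid)] for _ in range(grid)]
    for mid_x, mid_y, box, class_index in objects:
        row, col = int(mid_y * grid), int(mid_x * grid)  # cell containing the midpoint
        one_hot = [1.0 if k == class_index else 0.0 for k in range(num_classes)]
        target[row][col] = [1.0] + list(box) + one_hot
    return target


# Two cars (class index 1) whose midpoints fall in the middle-left and
# middle-right cells; the box values are made-up placeholders.
objects = [(0.2, 0.5, (0.6, 0.5, 0.4, 0.8), 1),
           (0.8, 0.55, (0.4, 0.65, 0.5, 0.9), 1)]
y = build_target(objects)
print(len(y), len(y[0]), len(y[0][0]))  # 3 3 8
print(y[1][0])                          # [1.0, 0.6, 0.5, 0.4, 0.8, 0.0, 1.0, 0.0]
```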

First, to train our neural network, the input is 100×100×3. Then we have a usual convolutional neural network with 𝑐𝑜𝑛𝑣𝑜𝑙𝑢𝑡𝑖𝑜𝑛𝑎𝑙 layers, 𝑀𝑎𝑥𝑝𝑜𝑜𝑙 layers, and so on, so that the network maps an input image to a 3×3×8 output volume.
We have an input 𝑥, which is an image like the one in the picture above, and target labels 𝑦, which are 3×3×8. We use backpropagation to train the neural network to map an input 𝑥 to this type of output volume 𝑦.

The advantage of this algorithm is that the neural network outputs precise bounding boxes. At test time, we feed in an input image 𝑥 and run forward propagation until we get the output 𝑦. Then, for each of the nine output positions, we can read off 1 or 0 to see whether there is an object associated with that position, and if there is, the rest of the vector tells us which object it is and where its bounding box is.
As long as we don’t have more than one object in each grid cell, this algorithm should work properly. The problem of having multiple objects within one grid cell is something we’ll talk about later.
Here we have used a relatively coarse 3×3 grid. In practice we might use a much finer grid, maybe 19×19, in which case we end up with a 19×19×8 output. A finer grid reduces the probability that multiple objects are assigned to the same grid cell.

Let’s notice two things:

- This algorithm resembles the image classification and localization algorithm that we explained in the earlier sections, and it outputs the bounding box coordinates explicitly. This allows the network to output bounding boxes of any aspect ratio, providing more precise coordinates than the sliding windows classifier.
- This is a convolutional implementation: we’re not running the algorithm nine times on the 3×3 grid, or 361 times if we are using the 19×19 grid. Instead, it is one single convolutional evaluation, and that’s why this algorithm is very efficient.

The 𝑌𝑂𝐿𝑂 algorithm gained a lot of popularity because its convolutional implementation is fast enough to detect objects even in real-time scenarios.

Last but not least, before wrapping up, there’s one more detail: how do we encode the bounding boxes $b_x$, $b_y$, $b_h$, $b_w$?

Let’s take the example of the car in the picture.

<img src="figures/yolo_bounding_box.png" style="width:700px">

In this grid cell there is an object, so the target label 𝑦 will have $p_c$ equal to one. Then we have some values for $b_x$, $b_y$, $b_h$, $b_w$, and the last three values in this output vector are 0, 1, 0 because in this cell we have recognized a car, so class 2, or $c_2$, is equal to 1.

So, how do we specify the bounding box? In the 𝑌𝑂𝐿𝑂 algorithm we take the convention that the upper-left point of the grid cell is (0,0) and its lower-right point is (1,1). To specify the position of the midpoint, the orange dot in the picture above, $b_x$ might be 0.4 (along the x-axis) because it’s about 0.4 of the way to the right, and $b_y$ might be 0.3 (along the y-axis). Next, the height and width of the bounding box are specified as fractions of the height and width of the grid cell.

The width of the red box in the picture above is maybe 90% of the width of the grid cell, so $b_w$ is 0.9, and the height of the bounding box is maybe one half of the overall height of the grid cell, so $b_h$ would be 0.5. In other words, $b_x$, $b_y$, $b_h$ and $b_w$ are all specified relative to the grid cell. $b_x$ and $b_y$ have to be between 0 and 1, because by definition the orange dot is within the bounds of the grid cell to which it is assigned. If it weren’t between 0 and 1, it would be outside the square, which would mean the object is assigned to another grid cell.
$b_h$ and $b_w$, on the other hand, could be greater than 1, for example when a car spans two grid cells.

Although there are multiple ways of specifying the bounding boxes, this convention is a reasonable one.
The 𝑌𝑂𝐿𝑂 research papers describe other parameterizations that work even a little bit better, but we hope this gives one reasonable convention that should work properly.
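The encoding convention described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `encode_box`, the pixel-based inputs and the 300×300 image size are hypothetical and are not taken from the 𝑌𝑂𝐿𝑂 papers.

```python
def encode_box(cx, cy, w, h, img_w, img_h, grid=3):
    """Convert an absolute box (midpoint cx, cy and size w, h, in pixels)
    into the grid-cell-relative values described in the text.

    Returns (row, col, b_x, b_y, b_h, b_w), where (row, col) is the grid
    cell that contains the midpoint.
    """
    cell_w = img_w / grid
    cell_h = img_h / grid

    # Grid cell that owns the object: the one containing its midpoint.
    # The clamp keeps a midpoint on the right/bottom image edge in the last cell.
    col = min(int(cx // cell_w), grid - 1)
    row = min(int(cy // cell_h), grid - 1)

    # Midpoint relative to the cell: (0, 0) is its upper left corner,
    # (1, 1) its lower right corner.
    b_x = cx / cell_w - col
    b_y = cy / cell_h - row

    # Size as a fraction of the cell size; may exceed 1 for large objects.
    b_w = w / cell_w
    b_h = h / cell_h
    return row, col, b_x, b_y, b_h, b_w


# Hypothetical 300x300 image with a 3x3 grid, so each cell is 100x100 pixels.
print(encode_box(cx=240, cy=130, w=90, h=50, img_w=300, img_h=300))
```

With these made-up numbers the midpoint falls in the cell at row 1, column 2, and the relative values come out close to the ones used in the text ($𝑏_𝑥$ ≈ 0.4, $𝑏_𝑦$ ≈ 0.3, $𝑏_𝑤$ = 0.9, $𝑏_ℎ$ = 0.5).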

# 6. Intersection over Union

In object detection our task is to localize the object in the best possible way. In the picture below there are two bounding boxes: a red one, which is the ground truth bounding box (where the car is in the image), and a purple one, which is the output of our algorithm. They don’t overlap perfectly, so we need to measure how good (or how bad) the outcome is. To do that we compute the intersection over union.

<img src="figures/IoU.png" style="width:500px">

The union of these two bounding boxes is the blue area, that is, the area contained in at least one of the two bounding boxes, whereas the intersection of the boxes is the smaller yellow region contained in both. The intersection over union computes the size of the intersection and divides it by the size of the union. If the predicted bounding box and the ground truth bounding box overlapped perfectly, the 𝐼𝑜𝑈 would be 1 because the intersection would be equal to the union. By convention, 0.5 is used as a threshold: the predicted bounding box is judged correct if the 𝐼𝑜𝑈 is greater than or equal to 0.5.

This is just a convention used in practice. If we want to be more strict, we can judge an answer as correct only if the 𝐼𝑜𝑈 is greater than or equal to 0.6 or some other number. The higher the 𝐼𝑜𝑈 is, the more accurate the bounding box is. We defined 𝐼𝑜𝑈 as a way to evaluate whether our object localization algorithm is accurate, but more generally 𝐼𝑜𝑈 is a measure of the overlap between two bounding boxes: given two boxes, we compute their intersection and union and take the ratio of these two areas. This is also a way of measuring how similar two boxes are to each other.
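As a minimal sketch, the 𝐼𝑜𝑈 of two axis-aligned boxes can be computed as follows. The corner format `(x1, y1, x2, y2)` and the function name `iou` are assumptions made for this example; the text itself does not prescribe a box format.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2),
    with (x1, y1) the upper left and (x2, y2) the lower right corner."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes
    # do not overlap at all).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union = sum of both areas minus the part that was counted twice.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 2, 2)
prediction = (1, 1, 3, 3)
# Judge the prediction with the conventional 0.5 threshold.
print(iou(ground_truth, prediction) >= 0.5)
```

Here the two boxes share an area of 1 out of a union of 7, so the prediction would not count as correct under the 0.5 convention.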

# 7. Non-Max Suppression

In this section, we will learn how the non-max suppression algorithm allows us to overcome multiple detections of the same object in an image. Let’s go through an example! Let’s say we want to detect pedestrians, cars, and motorcycles in this image.

<img src="figures/non_max_suppression.png" style="width:300px">

If we look at the picture above we can see that there are two cars. Each of these two cars has one midpoint, so it should be assigned to just one grid cell, and only that cell should predict that there is a car. In practice, we’re running an object classification and localization algorithm for every one of these grid cells, so it’s quite possible that several neighboring cells each think that the center of a car lies inside them.

> 𝑁𝑜𝑛−𝑀𝑎𝑥 𝑠𝑢𝑝𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 cleans up these multiple bounding boxes

Let’s see an example of how 𝑁𝑜𝑛−𝑀𝑎𝑥 𝑠𝑢𝑝𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 works. Since we are running the image classification and localization algorithm on every grid cell, it is possible that many of them report a large probability $𝑝_𝑐$ that there is an object in that cell. When we run the algorithm we would therefore end up with multiple detections of each object. What 𝑁𝑜𝑛−𝑀𝑎𝑥 𝑠𝑢𝑝𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 does is clean up these detections so that we end up with just one detection per car.

<img src="figures/non_max_suppression_2.png" style="width:400px">

More specifically, the algorithm first looks at the probability $𝑝_𝑐$ associated with each of these detections and takes the largest one. In the picture above, that is the rectangle associated with 0.9 (the car on the right), which means the algorithm is most confident that it has detected a car there. After this, 𝑁𝑜𝑛−𝑀𝑎𝑥 𝑠𝑢𝑝𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 looks at the other rectangles, and the ones with a high overlap (high 𝐼𝑜𝑈) with this one are suppressed. In the picture above, the two rectangles marked 0.6 and 0.7 overlap heavily with the 0.9 box, so they are going to be suppressed.

Similarly, the same procedure is applied for the car on the left: we go through the remaining rectangles and find the one with the highest $𝑝_𝑐$ value. For the car on the left, it is the box with $𝑝_𝑐$=0.8. Next, 𝑁𝑜𝑛−𝑀𝑎𝑥 𝑠𝑢𝑝𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 discards the remaining detections that have a high 𝐼𝑜𝑈 with this box. In this case, two bounding boxes with lower $𝑝_𝑐$ also recognized the car on the left (the two red rectangles on the left in the picture above), and because both have a high 𝐼𝑜𝑈 with the 0.8 box, both of them are discarded.

Applying 𝑁𝑜𝑛−𝑀𝑎𝑥 𝑠𝑢𝑝𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 means that we output the detections with the maximal probabilities and suppress the other, non-maximal detections of the same object. That’s why we use the name 𝑁𝑜𝑛−𝑀𝑎𝑥 𝑠𝑢𝑝𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛.

Let’s now go through the details of the algorithm. With a 19×19 grid we’re going to get a 19×19×8 output volume. For each of these 361 cells we output the vector:

$$\begin{bmatrix} p_{c}\\b_{x}\\b_{y}\\b_{h}\\b_{w}\\c_{1}\\c_{2}\\c_{3} \end{bmatrix}$$

For this example, we will be detecting only cars. So, for every grid cell, we will obtain the following vector: 

$$\begin{bmatrix} p_{c}\\b_{x}\\b_{y}\\b_{h}\\b_{w} \end{bmatrix}$$

Let’s see how 𝑁𝑜𝑛−𝑀𝑎𝑥 𝑠𝑢𝑝𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 works for this example. For every grid cell we output a bounding box together with the probability $𝑝_𝑐$ of detecting a car within that bounding box. We first discard all the boxes with $𝑝_𝑐$ values less than some predefined threshold, for instance 0.6.

Next, while there are any remaining bounding boxes that we have not yet discarded or processed, we repeatedly select the box with the highest $𝑝_𝑐$ and output it as a prediction. Then we discard any remaining box that has a high overlap (high 𝐼𝑜𝑈) with the box we have just output. We keep doing this until every box has either been output as a prediction or discarded.

If we want to detect more objects, for instance pedestrians, cars, and motorcycles, the output vector will have three additional class components, 𝑐1, 𝑐2 and 𝑐3. In that case we run 𝑁𝑜𝑛−𝑀𝑎𝑥 𝑠𝑢𝑝𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 three times independently, once for every output class.
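The procedure described above can be sketched for a single class as follows. This is a minimal illustration, not a reference implementation: the detection format `(p_c, (x1, y1, x2, y2))`, the function names, and the 𝐼𝑜𝑈 cutoff of 0.5 used to decide what counts as "high overlap" are all assumptions made for the example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, score_threshold=0.6, iou_threshold=0.5):
    """detections: list of (p_c, box) pairs for ONE class.

    Returns the detections that survive, highest p_c first.
    """
    # Step 1: discard every box whose p_c is below the threshold.
    remaining = [d for d in detections if d[0] >= score_threshold]
    # Sort so that the most confident box always comes first.
    remaining.sort(key=lambda d: d[0], reverse=True)

    kept = []
    while remaining:
        # Step 2: output the remaining box with the highest p_c.
        best = remaining.pop(0)
        kept.append(best)
        # Step 3: discard the boxes that overlap heavily with it.
        remaining = [d for d in remaining if iou(best[1], d[1]) < iou_threshold]
    return kept


# Made-up detections: three boxes on one car, two on another, one weak box.
detections = [
    (0.9, (100, 100, 200, 200)),
    (0.7, (105, 105, 205, 205)),
    (0.6, (95, 98, 198, 202)),
    (0.8, (0, 0, 80, 80)),
    (0.7, (5, 5, 85, 85)),
    (0.3, (300, 300, 350, 350)),
]
for p_c, box in non_max_suppression(detections):
    print(p_c, box)
```

For several classes, one option consistent with the text is to call `non_max_suppression` once per class, each time passing only the detections of that class.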

# 9. Anchor Boxes

As we can see from the previous sections, object detection is quite challenging. There is one final challenge that we are going to explain. Then, we will develop a holistic YOLO algorithm.
One scenario that we may encounter in practice is that several objects of interest are present in the same grid cell, as shown in the figure below. We can use the idea of 𝐴𝑛𝑐ℎ𝑜𝑟 𝑏𝑜𝑥𝑒𝑠 to solve this problem. So, let’s start with an example.

<img src="figures/anchor_boxes.png" style="width:700px">

In the figure above, we use a 3×3 grid. The midpoints of both objects, the car and the pedestrian, are in almost the same place within the same grid cell. If we use the previously developed ideas, our output vector 𝑦 will have the following structure:


$$y = \begin{bmatrix} 1\\ b_{x}\\ b_{y}\\ b_{h}\\ b_{w}\\ 1\\  0\\ 0 \end{bmatrix}$$

The obvious challenge is that this vector can describe only one object per grid cell, even though the cell contains both a pedestrian and a car. That is, we can’t have two detections for a single cell and we have to choose only one.

The main idea of 𝑎𝑛𝑐ℎ𝑜𝑟 𝑏𝑜𝑥𝑒𝑠 is to predefine two different shapes, called anchor boxes or anchor box shapes. In this way, we are able to associate two predictions with the two anchor boxes. In general, we might use more anchor boxes (five or even more), but to make the description easier we will stick with only two shapes.

As we can see in the above picture, we defined anchor box 1 and anchor box 2. Each anchor box gets its own set of values: $𝑝_𝑐$,$𝑏_𝑥$,$𝑏_𝑦$,$𝑏_ℎ$,$𝑏_𝑤$,𝑐1,𝑐2,𝑐3. The shape of the pedestrian is more similar to the shape of anchor box 1 and the shape of the car is more similar to the shape of anchor box 2. Hence, the vector associated with the grid cell in the middle will be:

$$y =  \begin{bmatrix} 1\\ b_{x}\\ b_{y}\\ b_{h}\\ b_{w}\\ 1\\  0\\ 0\\ 1\\ b_{x}\\ b_{y}\\ b_{h}\\ b_{w}\\ 0\\ 1\\ 0 \end{bmatrix}$$

Previously, before we were using 𝑎𝑛𝑐ℎ𝑜𝑟 𝑏𝑜𝑥𝑒𝑠, we defined a grid for each training image and assigned each object to the grid cell that contains the object’s midpoint. The output was 3×3×8 dimensional because we were using a 3×3 grid and each grid cell has the values: $𝑝_𝑐$, $𝑏_𝑥$, $𝑏_𝑦$, $𝑏_ℎ$, $𝑏_𝑤$, 𝑐1, 𝑐2, 𝑐3.

## 9.1. How do we encode the objects in the target label?

Previously, each object in the training image was assigned to the grid cell that contains that object’s midpoint. Now, with two anchor boxes, each object is assigned to the grid cell that contains the object’s midpoint and, within that cell, to the anchor box that has the highest Intersection over Union (𝐼𝑜𝑈) with the object’s shape.

Now, the output 𝑦 is going to be 3×3×16, or 3×3×2×8, because we now use 2 anchor boxes and each of them has 8 values.

Let’s go through a concrete example. For this grid cell let’s specify what is 𝑦.

<img src="figures/anchor_boxes_2.png" style="width:700px">

Looking at this image, we see that the pedestrian is more similar to the shape of anchor box 1, so we assign the pedestrian to anchor box 1. The shape of the car is more similar to anchor box 2, so we assign the car to anchor box 2. For the car, the outputs 𝑐1 and 𝑐3 are 0 and 𝑐2 is 1.
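The assignment rule can be sketched as follows for a single grid cell. The anchor shapes, the object sizes and the helper names `shape_iou` and `build_cell_target` are hypothetical values chosen for illustration. The sketch compares shapes by the 𝐼𝑜𝑈 of two boxes placed on the same midpoint, which is one common way to read "highest 𝐼𝑜𝑈 with the anchor box".

```python
def shape_iou(h1, w1, h2, w2):
    """IoU of two boxes that share the same midpoint, so only shape matters."""
    inter = min(h1, h2) * min(w1, w2)
    union = h1 * w1 + h2 * w2 - inter
    return inter / union


def build_cell_target(objects, anchors, num_classes=3):
    """Build the target vector y for one grid cell.

    objects: list of (b_x, b_y, b_h, b_w, class_index) whose midpoints
             fall in this cell.
    anchors: list of (height, width) anchor box shapes.
    Returns a flat list of length len(anchors) * (5 + num_classes).
    """
    slot = 5 + num_classes                # p_c, b_x, b_y, b_h, b_w, classes
    y = [0.0] * (len(anchors) * slot)
    for b_x, b_y, b_h, b_w, cls in objects:
        # Pick the anchor whose shape overlaps most with the object's shape.
        best = max(range(len(anchors)),
                   key=lambda i: shape_iou(b_h, b_w, anchors[i][0], anchors[i][1]))
        one_hot = [1.0 if c == cls else 0.0 for c in range(num_classes)]
        start = best * slot
        # Note: a second object matched to the same anchor overwrites the
        # first one, which is one of the failure cases of this scheme.
        y[start:start + slot] = [1.0, b_x, b_y, b_h, b_w] + one_hot
    return y


# Hypothetical shapes: anchor 1 is tall and narrow, anchor 2 is low and wide.
anchors = [(2.0, 0.7), (0.8, 1.8)]
pedestrian = (0.5, 0.5, 1.8, 0.6, 0)      # class index 0 -> c1
car = (0.5, 0.6, 0.7, 1.6, 1)             # class index 1 -> c2
print(build_cell_target([pedestrian, car], anchors))
```

With these numbers the pedestrian lands in the first 8 entries and the car in the last 8, which mirrors the 16-dimensional vector shown earlier.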

Note that this algorithm will not work properly in two cases:

* When we have 2 anchor boxes and 3 objects in the same grid cell.
* When 2 objects in the same grid cell are both matched to the same anchor box.

These are special cases that don’t happen very frequently in practice, so they do not affect the performance of the algorithm that much. They are especially rare if we use a 19×19 grid, because the chance that two objects have their midpoints in the same one of the 361 cells is small.

## 9.2. How do we choose the anchor boxes?

A simple approach is to select the anchor boxes by hand, for example choosing 5 to 10 anchor box shapes that span the variety of shapes of the objects we wish to detect. A more advanced technique is to apply the 𝑘−𝑚𝑒𝑎𝑛𝑠 clustering algorithm to group together the object shapes that occur in the data, and to use one representative anchor box per group.
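A minimal sketch of the clustering idea is shown below. It runs plain 𝑘−𝑚𝑒𝑎𝑛𝑠 with Euclidean distance on (height, width) pairs, which is an assumption made to keep the example short; other distance measures, such as one based on 𝐼𝑜𝑈, are also possible. The function name `kmeans_anchors` and the sample shapes are hypothetical.

```python
import random


def kmeans_anchors(shapes, k, iterations=50, seed=0):
    """Cluster (height, width) pairs of ground truth boxes into k anchor shapes.

    Plain k-means with squared Euclidean distance; returns the k cluster
    centers sorted for reproducible output.
    """
    rng = random.Random(seed)
    centers = rng.sample(shapes, k)       # start from k distinct data points

    for _ in range(iterations):
        # Assignment step: attach every shape to its nearest center.
        clusters = [[] for _ in range(k)]
        for h, w in shapes:
            nearest = min(
                range(k),
                key=lambda i: (h - centers[i][0]) ** 2 + (w - centers[i][1]) ** 2,
            )
            clusters[nearest].append((h, w))

        # Update step: move each center to the mean of its members.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                mean_h = sum(m[0] for m in members) / len(members)
                mean_w = sum(m[1] for m in members) / len(members)
                new_centers.append((mean_h, mean_w))
            else:
                new_centers.append(centers[i])   # keep an empty cluster's center

        if new_centers == centers:        # converged
            break
        centers = new_centers

    return sorted(centers)


# Made-up box shapes: three tall (pedestrian-like) and three wide (car-like).
shapes = [(1.8, 0.6), (2.0, 0.7), (1.9, 0.65),
          (0.7, 1.6), (0.8, 1.8), (0.75, 1.7)]
print(kmeans_anchors(shapes, k=2))
```

On this toy data the two centers settle on one tall and one wide shape, which would then serve as anchor box 1 and anchor box 2.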