# Object Detection

## Problems

We can distinguish between:
* Classification
* Classification with localisation
* Detection.

The first two typically involve a single large object in the image, which we want to classify and, in the second case, locate.

With detection, we may well be dealing with multiple objects in one image, possibly of different classes.

## Structure and Outputs

To extend a classification problem into a localisation problem, we want four more numerical outputs in addition to the softmax.

We want to output $b_x, b_y, b_w, b_h$: the x and y coordinates of the box's midpoint, and the width and height of the box, respectively.

We also still need to output the class of the object.

So the target label might end up being:

$y = \begin{bmatrix} p_c \\ b_x \\ b_y \\ b_w \\ b_h \\ c_1 \\ c_2 \\ c_3 \\ \end{bmatrix}$

where

$p_c$ indicates whether we have detected an object, and $c_n$ indicates whether the object belongs to class $n$ (three classes in this example).
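As a rough illustration of the label layout, here is a minimal sketch that builds such a vector in plain Python. The function name `make_target`, the convention that coordinates are normalised to the range 0 to 1, and the use of zeros as placeholders when no object is present are all assumptions made for the example.

```python
def make_target(box=None, class_index=None, num_classes=3):
    """Build the label [p_c, b_x, b_y, b_w, b_h, c_1, ..., c_n].

    box is (b_x, b_y, b_w, b_h), or None when the image contains no object.
    """
    if box is None:
        # No object: p_c = 0 and the remaining entries are placeholders.
        return [0.0] * (5 + num_classes)
    # One-hot encoding of the class (class_index is zero-based here).
    one_hot = [1.0 if i == class_index else 0.0 for i in range(num_classes)]
    return [1.0, *box, *one_hot]


# An object of the second class, centred at (0.5, 0.7), 0.3 wide and 0.4 high.
y = make_target(box=(0.5, 0.7, 0.3, 0.4), class_index=1)
print(y)  # [1.0, 0.5, 0.7, 0.3, 0.4, 0.0, 1.0, 0.0]
```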

In this scenario, our loss function becomes:

$\mathcal{L}(\hat{y}, y) = (\hat{y}_1 - y_1)^2 + (\hat{y}_2 - y_2)^2 + \dots + (\hat{y}_8 - y_8)^2$

We can, of course, be more specific about the loss used for each group of outputs: log-likelihood loss for the class outputs, for example, and logistic loss for $p_c$.
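To make the two options concrete, here is a small sketch of both losses over the eight-element label. The function names are hypothetical, and so is the choice in `mixed_loss` to count the box and class terms only when an object is present; the notes do not specify that detail.

```python
import math


def squared_error_loss(y_hat, y):
    """Sum of squared differences over every component of the label."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y))


def mixed_loss(y_hat, y, eps=1e-12):
    """Logistic loss on p_c, squared error on the box, log loss on the classes.

    Assumes y_hat[0] is a probability and y_hat[5:] is a softmax output.
    Box and class terms are only counted when y[0] == 1 (an assumption
    of this sketch).
    """
    p_hat, p = y_hat[0], y[0]
    loss = -(p * math.log(p_hat + eps) + (1 - p) * math.log(1 - p_hat + eps))
    if p == 1:
        loss += sum((a - b) ** 2 for a, b in zip(y_hat[1:5], y[1:5]))
        loss += -sum(t * math.log(c + eps) for c, t in zip(y_hat[5:], y[5:]))
    return loss
```

The small `eps` is only there to avoid taking the logarithm of zero.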


## Differing approaches

While the above is a typical approach, there are some variations. Consider the [YOLO](https://arxiv.org/pdf/1506.02640.pdf) (You Only Look Once) approach.

For the same target label:

$y = \begin{bmatrix} p_c \\ b_x \\ b_y \\ b_w \\ b_h \\ c_1 \\ c_2 \\ c_3 \\ \end{bmatrix}$

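Here the image is divided into a grid of cells, each with its own target vector. $b_x$ and $b_y$ now refer to the midpoint of the bounding box, measured relative to the grid cell that contains it, and $b_w$ and $b_h$ are defined relative to the width and height of that cell, so they can be greater than 1. (The original paper normalises the width and height by the whole image rather than by the cell.)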
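A minimal sketch of the cell-relative encoding described here, assuming the box starts out in image-relative units (0 to 1) and the image is split into a square grid. The function name `encode_box` and the 3 by 3 default grid are illustrative choices, not something YOLO prescribes.

```python
def encode_box(cx, cy, w, h, grid_size=3):
    """Map a box in image-relative units (0..1) to its grid cell.

    Returns (row, col, b_x, b_y, b_w, b_h), where b_x and b_y are the
    midpoint's offsets inside the cell and b_w and b_h are measured in
    cell widths and heights.
    """
    # The cell that contains the midpoint (clamped so cx == 1.0 stays in range).
    col = min(int(cx * grid_size), grid_size - 1)
    row = min(int(cy * grid_size), grid_size - 1)
    b_x = cx * grid_size - col
    b_y = cy * grid_size - row
    b_w = w * grid_size
    b_h = h * grid_size
    return row, col, b_x, b_y, b_w, b_h


# A box centred in the image, half the image wide and a quarter high.
print(encode_box(0.5, 0.5, 0.5, 0.25))  # (1, 1, 0.5, 0.5, 1.5, 0.75)
```

Note that `b_w` comes out as 1.5: the box is wider than a single cell.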


## Non-max suppression

It's quite possible that we'll end up with multiple detections of the same object. Non-max suppression keeps only the most confident of them.

First, we throw out all boxes with a probability below some threshold (say, 0.6).

Then, we take the most confident remaining box and output it as a prediction.

Next, we discard any remaining box whose IoU (intersection over union) with the box just output is greater than 0.5.

We repeat the last two steps until we run out of boxes.
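The procedure can be sketched in a few lines of plain Python. Here boxes are assumed to be given as corner coordinates `(x1, y1, x2, y2)`, and the function names and default thresholds are illustrative; this is a sketch of the steps described in this section, not a reference implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the boxes that survive, most confident first."""
    # Step 1: drop low-confidence boxes, then sort the rest by score.
    candidates = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    while candidates:
        # Step 2: commit to the most confident remaining box.
        best = candidates.pop(0)
        kept.append(best)
        # Step 3: discard boxes that overlap it too much.
        candidates = [
            i for i in candidates if iou(boxes[best], boxes[i]) <= iou_threshold
        ]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.3]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

In the usage example, the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the fourth is dropped by the score threshold. With several classes, this would typically be run once per class.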

## Anchor Boxes

These models may struggle with overlapping objects, since two objects whose midpoints fall in the same grid cell compete for a single output vector.

In the basic algorithm, each object in a training image is assigned to the grid cell that contains the object's midpoint.

When we use multiple anchor boxes, each object in training is assigned to the grid cell that contains the object's midpoint, *and* to the anchor box for that grid cell with the highest IoU with the object's bounding box.
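One way to picture the anchor assignment is to compare shapes only, treating the object's box and each anchor as if they shared a midpoint. The sketch below does this; the function names and the two anchor shapes are hypothetical.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share a midpoint, each given as (width, height)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def assign_anchor(box_wh, anchors):
    """Index of the anchor whose shape overlaps the object's box the most."""
    return max(range(len(anchors)), key=lambda k: shape_iou(box_wh, anchors[k]))


# Two hypothetical anchors: one tall and narrow, one short and wide.
anchors = [(1.0, 2.0), (2.0, 1.0)]
print(assign_anchor((0.9, 1.8), anchors))  # 0 (the tall anchor)
print(assign_anchor((1.8, 0.9), anchors))  # 1 (the wide anchor)
```

With two anchors and the eight-element label used earlier, each grid cell's target would hold 2 × 8 = 16 values, one label per anchor.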

## Region Proposals

To avoid running the conv net over every window, including many that almost certainly contain nothing of interest, we can generate region proposals first.

First, we run a segmentation algorithm to split the image into regions.

Then, having split the image into 'blobs', we run a classifier on each 'blob'.

A good explanation of one such method, Faster R-CNN, can be found [here](https://medium.com/@smallfishbigsea/faster-r-cnn-explained-864d4fb7e3f8).