# Week 3 Notes

## Finding Objects 

In computer vision, there are three major tasks related to finding objects:

* Object classification: Given a well-centered, cropped image with a single dominant object, the task is to classify the object as one of a pre-defined set of possibilities.

* Object localization: Classify and localize (with bounding boxes or similar tools) the single instance of an object in an image.

* Object detection: Classify and localize (with bounding boxes or similar tools) multiple instances of objects in an image.

There is also landmark detection and the related concept of pose estimation. The objective is to estimate the locations of related points, such as the eyes and hairline of a face, or the joints of a person who is standing or crouching. This is the technology that powers Snapchat filters and related products.


The ideas in these tasks are related and generalize to build up to object detection.

### Object Classification

Object classification is the task of looking at an image and deciding which object, from a pre-defined set of classes, is present in it (or whether none is).

### Object Localization

Object localization is the task of looking at an image, classifying an object as one of a finite set of predetermined classes, and finding its location or co-ordinates in the image. One popular way to locate an object is to give the co-ordinates of a bounding box or rectangle around it.

We tend to use the convention that (0,0) is the top left corner of an image and (1,1) is the bottom right corner. Any points in between are proportionally set.
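As a small illustration of this convention, the sketch below converts between pixel positions and normalized co-ordinates. The function names and the image size are hypothetical, and pixel positions are treated as continuous values for simplicity.

```python
def to_normalized(x_px, y_px, img_width, img_height):
    """Map a pixel position to the (0,0) top-left, (1,1) bottom-right convention."""
    return x_px / img_width, y_px / img_height


def to_pixels(x, y, img_width, img_height):
    """Map a normalized position back to pixel units."""
    return x * img_width, y * img_height


# A point halfway across and a quarter of the way down a 640 x 480 image.
print(to_normalized(320, 120, 640, 480))  # (0.5, 0.25)
```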

Specifically, an image is passed through a neural network (usually convolutional) and the head or final output is a longer vector than seen before. For a task that is to find the presence of, class of and co-ordinates of an object, we need:

* the presence of an object, p (binary or probability response)

* the co-ordinates of the object, given as:
    - midpoint of rectangle, x
    - midpoint of rectangle, y
    - height of rectangle
    - width of rectangle

* the classification vector - with one hot encoding as 1 for the class that is predicted, and zero for all others. These are denoted by $c_i$. Alternatively, the classes can be modeled as probabilities.

If $p=0$, then the remaining vector numbers are ignored, particularly when using a loss function to train the model.
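The output vector described above can be sketched as a plain Python list. The ordering `[p, x, y, height, width, c_1, ..., c_n]`, the function name `make_target` and the choice of three classes are assumptions made for illustration only.

```python
def make_target(box=None, class_index=None, num_classes=3):
    """Build a label vector [p, x, y, height, width, c_1, ..., c_n].

    box is (x, y, height, width) in normalized co-ordinates, or None
    when the image contains no object.
    """
    if box is None:
        # p = 0: the remaining entries are placeholders that the loss ignores.
        return [0.0] * (5 + num_classes)
    one_hot = [1.0 if i == class_index else 0.0 for i in range(num_classes)]
    return [1.0, *box, *one_hot]


print(make_target((0.5, 0.6, 0.3, 0.4), class_index=1))
# [1.0, 0.5, 0.6, 0.3, 0.4, 0.0, 1.0, 0.0]
```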

#### Loss Function For Object Localization

A crude way to model the loss would be to treat the presence indicator as a binary response and apply squared error to it.

Squared error can be used in the same way for the class probabilities, as well as for the co-ordinates of the midpoint and the height and width of the bounding box.

Summing these terms together gives a loss function that can be used for training object localization networks.

In practice, however, one would use a different loss function for the classification outputs. For the presence indicator and the class label, cross-entropy loss is generally preferred: logistic loss for the presence indicator and multinomial log-likelihood for the class label.
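A minimal sketch of such a combined loss for a single example is shown below. It assumes the vector layout `[p, x, y, height, width, c_1, ..., c_n]`, that the predicted `p` and `c_i` are already probabilities, that squared error is kept for the box co-ordinates, and that all terms are weighted equally. These are illustrative choices, not something fixed by these notes.

```python
import math


def localization_loss(y_true, y_pred, eps=1e-12):
    """Loss for one example with layout [p, x, y, height, width, c_1..c_n]."""
    p_true = y_true[0]
    p_pred = min(max(y_pred[0], eps), 1.0 - eps)  # clamp to avoid log(0)

    # Logistic (binary cross-entropy) loss on the presence indicator.
    loss = -(p_true * math.log(p_pred) + (1.0 - p_true) * math.log(1.0 - p_pred))
    if p_true == 0:
        return loss  # no object: box and class terms are ignored

    # Squared error on the four box co-ordinates.
    loss += sum((t - q) ** 2 for t, q in zip(y_true[1:5], y_pred[1:5]))

    # Multinomial log-likelihood (cross-entropy) on the class probabilities.
    loss -= sum(t * math.log(max(q, eps)) for t, q in zip(y_true[5:], y_pred[5:]))
    return loss


target = [1.0, 0.5, 0.5, 0.2, 0.2, 0.0, 1.0]
good = [0.9, 0.5, 0.5, 0.2, 0.2, 0.1, 0.9]
bad = [0.6, 0.3, 0.7, 0.4, 0.1, 0.6, 0.4]
print(localization_loss(target, good) < localization_loss(target, bad))  # True
```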

### Landmark Detection

The idea of outputting regression vectors from a neural network applied to an image is powerful. It can be used to detect landmarks in images. Of course, significant labeling of training images is needed to train a useful model.

#### Face Mapping

To map the core features of a face (the mouth, eyes, nose and lips), we need to locate around 64 points, each with an x and a y co-ordinate.

#### Pose Estimation

To map the core features of a pose, we usually want to know the locations of 32 points corresponding to joints, each with an x and a y co-ordinate.
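For both face mapping and pose estimation, the regression output is just the landmark co-ordinates laid end to end, so $N$ landmarks need $2N$ outputs. The helper names below are hypothetical, and the interleaved `x, y` ordering is an assumption.

```python
def flatten_landmarks(points):
    """[(x1, y1), (x2, y2), ...] -> [x1, y1, x2, y2, ...]"""
    return [coord for point in points for coord in point]


def unflatten_landmarks(vector):
    """[x1, y1, x2, y2, ...] -> [(x1, y1), (x2, y2), ...]"""
    return list(zip(vector[0::2], vector[1::2]))


face_points = [(0.1, 0.2)] * 64       # placeholder landmark positions
print(len(flatten_landmarks(face_points)))  # 128
```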

## Object Detection

Object detection is like object localization, only here the task is to classify and locate (using bounding boxes) possibly many objects in a single image.

By sliding windows of different sizes and strides across an image, we can apply the same localization method to each window throughout an image.
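The sketch below enumerates the windows for one window size and stride, to make the number of classifier evaluations concrete. Square windows and pixel units are assumptions, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(img_width, img_height, window, stride):
    """Yield (left, top, right, bottom) pixel boxes for a square window."""
    for top in range(0, img_height - window + 1, stride):
        for left in range(0, img_width - window + 1, stride):
            yield (left, top, left + window, top + window)


windows = list(sliding_windows(8, 6, window=4, stride=2))
print(len(windows))  # 6
print(windows[0], windows[-1])  # (0, 0, 4, 4) (4, 2, 8, 6)
```

Each of these windows would be cropped and passed to the localization network, and the whole loop would be repeated for every window size.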

One problem with this approach is that the computational cost of working through all these windows at prediction time is very high. This is a bigger problem now, since neural networks have a high computational overhead themselves.

In the past it was acceptable to repeatedly apply a classifier to every window, because linear classifiers were dominant and are computationally cheap to use.

Although it is possible to use wider windows or larger strides to reduce the number of windows, there is a danger of missing the object or having a bounding box that is too large.

One way to solve this is to use a convolutional implementation of sliding windows, which evaluates the classifier on all windows simultaneously.

### Convolutional Implementation Of Sliding Windows

By replacing the fully connected layers with equivalent convolutional layers (ending in 1 by 1 convolutions), we can pass all the sliding windows of the same size through a ConvNet in one parallel computation. This significantly speeds up sliding windows detection.

Because neighboring windows overlap, inputting the larger region lets their shared computation be done once rather than once per window, which is where the material speed-up comes from.
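The idea can be checked on a toy one-dimensional example. Below, a tiny "network" (one convolution, a ReLU, then a fully connected layer) is applied to every length-6 window separately, and then once to the whole signal with the fully connected weights reused as a convolution kernel. All weights and the signal are made-up values, and the 1-D setting is a simplification of the 2-D case.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation, as used in ConvNets."""
    k = len(kernel)
    return [sum(s * w for s, w in zip(signal[i:i + k], kernel))
            for i in range(len(signal) - k + 1)]


def relu(values):
    return [max(0.0, v) for v in values]


conv_kernel = [1.0, 0.0, -1.0]       # hypothetical first-layer filter
fc_weights = [0.5, -1.0, 2.0, 1.0]   # hypothetical fully connected weights


def classify_window(window):
    """Per-window network: conv (length 6 -> 4), ReLU, fully connected."""
    features = relu(conv1d(window, conv_kernel))
    return sum(f * w for f, w in zip(features, fc_weights))


signal = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]

# Naive sliding windows: one forward pass per length-6 window.
naive = [classify_window(signal[i:i + 6]) for i in range(len(signal) - 5)]

# Convolutional version: a single pass, with the FC layer applied as a convolution.
shared = conv1d(relu(conv1d(signal, conv_kernel)), fc_weights)

print(naive == shared)  # True
```

The single pass computes the first-layer features once for the whole signal, whereas the naive loop recomputes them for every overlapping window.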

### YOLO Algorithm

The YOLO algorithm is a fast detection algorithm. It is a one-stage detector, in that it detects and classifies objects in one go. It does suffer from a few issues, including:

* lower accuracy rates as compared to two-stage detectors such as Faster R-CNN
* weaker detection of smaller objects in an image.

However, it is still very popular. We list some of the components needed to implement YOLO:

* Bounding Box Predictions

* Intersection Over Union

* NonMax Suppression

* Anchor Boxes

These are discussed in more detail, before YOLO itself is outlined. Note that many of these components can be used by other object detectors too.

#### Bounding Box Predictions

The YOLO paper doesn't use sliding windows to detect objects. Instead, it applies a grid over the image.

At training time, the grid is chosen so that inside each grid cell there is at most one object, where an object belongs to the cell that contains its midpoint. The co-ordinates and classification of each object are then recorded in the label data for that cell.
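A minimal sketch of this labeling step is given below. It assumes normalized image co-ordinates and that an object is assigned to the cell containing its midpoint, which is the convention in the YOLO paper. The function name is hypothetical.

```python
def assign_to_cell(x, y, grid_size):
    """Return the (row, col) of the cell containing the normalized midpoint
    (x, y), plus the midpoint's offset inside that cell (in cell units)."""
    col = min(int(x * grid_size), grid_size - 1)  # clamp x = 1.0 into the last cell
    row = min(int(y * grid_size), grid_size - 1)
    return (row, col), (x * grid_size - col, y * grid_size - row)


print(assign_to_cell(0.5, 0.25, grid_size=4))  # ((1, 2), (0.0, 0.0))
```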

A ConvNet with a detection head is then applied, producing one output vector per grid cell. This can be done over different grid sizes. Note how the learning happens in one ConvNet, avoiding the need to train and run multiple networks.

Several further components are needed to make the algorithm work in practice. These are discussed below.

#### Intersection Over Union

When many overlapping bounding boxes indicate the presence of the same object, we need a way to measure how much they overlap so that the object can be assigned to just one of them. Intersection over Union (IoU) is one way.

When comparing two boxes A and B, we can measure how similar they are with the following formula:

$IoU(A, B) = \frac{Intersection(A, B)}{Union(A, B)}$

where $Intersection(\cdot, \cdot)$ and $Union(\cdot, \cdot)$ are the areas of the intersection and union of the boxes.

If $IoU(A, B) > \tau$ for some threshold $\tau \geq 0.5$ (most commonly $\tau = 0.5$), then we can say that A and B are similar.

IoU is also used when deciding if a prediction A is close enough to the ground truth B to be a good fit.
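The formula can be written directly for axis-aligned boxes. The corner representation `(x_min, y_min, x_max, y_max)` used here is an assumption for convenience. The midpoint, height and width form used earlier in these notes would need converting first.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two 2 x 2 boxes overlapping in a 1 x 1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```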

#### NonMax Suppression

One problem with this grid approach to detecting objects is that the same object may be detected multiple times by different grid cells. NonMax suppression is one way to associate each object with a single, tightly fitting bounding box.

The strategy is as follows. 

1. Select the bounding box with the highest probability of containing an object.

2. Suppress all other bounding boxes whose IoU with the selected box is over a threshold $\tau$, say $0.5$.

3. Go back to step 1 and pick the box with the next highest probability among those remaining, repeating until no boxes are left.

In this way, all the non-maximal boxes are suppressed.
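The three steps above can be sketched as follows. Boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, and the example boxes and scores are made up.

```python
def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that are kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # step 1: highest remaining probability
        keep.append(best)
        # Step 2: suppress boxes that overlap the selected box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
        # Step 3: loop back with whatever is left.
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box has an IoU of about 0.68 with the first, so it is suppressed. The third does not overlap and is kept.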

#### Anchor Boxes

Anchor boxes are a way to work with overlapping objects. They change the head of the neural network to be larger, so that multiple objects can be localized in the same grid cell.

This works well for detecting two objects in the same region, one of which is wide and the other tall. However, it doesn't work so well for three or more objects, or when both objects have the same shape.
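One common way to decide which anchor box an object belongs to is to pick the anchor whose shape has the highest IoU with the object's bounding box. The sketch below assumes this rule and compares shapes as if the boxes shared a midpoint. The anchor shapes are hypothetical.

```python
def shape_iou(shape_a, shape_b):
    """IoU of two (height, width) boxes that share the same midpoint."""
    inter = min(shape_a[0], shape_b[0]) * min(shape_a[1], shape_b[1])
    union = shape_a[0] * shape_a[1] + shape_b[0] * shape_b[1] - inter
    return inter / union


def best_anchor(box_shape, anchors):
    """Index of the anchor whose shape best matches box_shape."""
    return max(range(len(anchors)), key=lambda i: shape_iou(box_shape, anchors[i]))


anchors = [(0.2, 0.6), (0.6, 0.2)]   # (height, width): one wide, one tall
print(best_anchor((0.5, 0.15), anchors))  # 1, the tall anchor
print(best_anchor((0.1, 0.5), anchors))   # 0, the wide anchor
```

Two objects with the same shape would both map to the same anchor, which is the failure case noted above.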

#### YOLO: Bringing It All Together

Let's now go through the YOLO algorithm. Firstly, compose the final head of the neural network so that its output has the following dimensions:

grid_size by grid_size by anchor_boxes by (5 + num_classes)

The 5 comes from 4 co-ordinates for the bounding box and 1 indicator for the presence or absence of an object. Then run the ConvNet learning algorithm.
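As a worked example of this shape, the sketch below uses illustrative values (a 19 by 19 grid, 5 anchor boxes and 80 classes). These numbers are assumptions chosen for the example, not values fixed by these notes.

```python
import math


def yolo_head_shape(grid_size, anchor_boxes, num_classes):
    """Output shape: grid_size x grid_size x anchor_boxes x (5 + num_classes)."""
    return (grid_size, grid_size, anchor_boxes, 5 + num_classes)


shape = yolo_head_shape(grid_size=19, anchor_boxes=5, num_classes=80)
print(shape)             # (19, 19, 5, 85)
print(math.prod(shape))  # 153425 output values per image
```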

For each cell you get anchor_boxes predictions. First, suppress all low probability bounding boxes. Then cycle through each class and apply non-max suppression to the boxes predicted for that class. What is left should be the final set of predictions.
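This post-processing step can be sketched as below. It assumes the network output has already been decoded into a flat list of `(score, class_id, box)` tuples with corner-format boxes. The thresholds and example detections are made up.

```python
def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(detections, score_threshold=0.6, iou_threshold=0.5):
    """detections: list of (score, class_id, box) tuples."""
    # 1. Drop low-probability boxes.
    candidates = [d for d in detections if d[0] >= score_threshold]

    final = []
    # 2. Apply non-max suppression separately for each class.
    for class_id in sorted({d[1] for d in candidates}):
        remaining = sorted((d for d in candidates if d[1] == class_id),
                           key=lambda d: d[0], reverse=True)
        while remaining:
            best = remaining.pop(0)
            final.append(best)
            remaining = [d for d in remaining
                         if iou(best[2], d[2]) <= iou_threshold]
    return final


detections = [
    (0.90, 0, (0, 0, 10, 10)),
    (0.80, 0, (1, 1, 11, 11)),     # same class, overlaps the first: suppressed
    (0.85, 1, (1, 1, 11, 11)),     # different class: kept despite the overlap
    (0.30, 1, (50, 50, 60, 60)),   # low probability: dropped
]
for detection in postprocess(detections):
    print(detection)
```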

## Region Proposals

Another competing family of detectors is the two-stage detectors, in which a region proposal step is followed by a CNN. They are called two-stage because they first propose regions, and then classify among those regions.

Region proposal algorithms are slower than YOLO and other one-stage detectors; however, they generally achieve higher accuracy.

### R-CNN

The original algorithm uses the Selective Search algorithm (powered by a greedy search) to propose around 2000 regions to classify. Each proposed region is then warped to a fixed size and processed by a ConvNet. The output of this is given to SVM classifiers and a bounding box regression to detect objects.

This algorithm is slow, since the ConvNet must process around 2000 regions per image. In addition, the selective search algorithm does not do any learning, so it can propose bad regions.

### Fast R-CNN

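Because of the speed bottleneck, a different algorithm was proposed by the lead author of R-CNN. First, Fast R-CNN passes the whole image through a ConvNet once to compute convolutional feature maps, rather than processing each region separately. The selective search algorithm is still used to propose regions. Each proposed region is then RoI pooled from the feature maps (a type of max pooling over irregularly sized regions that gives a fixed-size output).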

These RoI features are then passed to fully connected layers, to output classification and regressions.
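A simplified sketch of RoI max pooling on a single 2-D feature map is shown below. It assumes integer RoI bounds on the feature map and splits the RoI into roughly equal bins. Real implementations also handle the scaling from image to feature-map co-ordinates, which is omitted here.

```python
def bin_edges(start, length, num_bins):
    """Split [start, start + length) into num_bins non-empty integer bins."""
    edges = []
    for i in range(num_bins):
        lo = (i * length) // num_bins
        hi = max(((i + 1) * length) // num_bins, lo + 1)
        edges.append((start + lo, start + hi))
    return edges


def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool an RoI (row_start, col_start, row_end, col_end), end exclusive,
    into a fixed out_size x out_size grid."""
    r0, c0, r1, c1 = roi
    row_bins = bin_edges(r0, r1 - r0, out_size)
    col_bins = bin_edges(c0, c1 - c0, out_size)
    return [[max(feature_map[r][c]
                 for r in range(rs, re) for c in range(cs, ce))
             for cs, ce in col_bins]
            for rs, re in row_bins]


feature_map = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
]
# A 3 x 5 region is pooled to 2 x 2 even though it does not divide evenly.
print(roi_max_pool(feature_map, (0, 0, 3, 5)))  # [[2, 5], [12, 15]]
```

Whatever the size of the proposed region, the output has the same fixed size, which is what allows it to be fed to fully connected layers.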

It runs much faster than R-CNN at test time.

### Faster R-CNN

Similar to Fast R-CNN, the image is provided as an input to a convolutional network, which produces a convolutional feature map. Instead of using the selective search algorithm to identify the region proposals, a separate network (the region proposal network) is used to predict them from the feature map. The predicted region proposals are then reshaped using a RoI pooling layer, and the pooled features are used to classify the object within each proposed region and to predict the offset values for its bounding box.


### Focal Loss Detector

The focal loss detector (RetinaNet) is a one-stage detector which uses a novel loss function, the focal loss, to achieve higher accuracy while maintaining speed. It does this by reducing the loss incurred on easy examples, so that training focuses on the hard ones. It uses a multi-scale convolutional network with anchor boxes, which outputs bounding boxes and classification probabilities.
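The focal loss for a single binary prediction can be sketched as $FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$, where $p_t$ is the probability assigned to the true class. The defaults $\gamma = 2$ and $\alpha = 0.25$ below are the values reported as working well in the focal loss paper. The function itself is an illustrative sketch, not a reference implementation.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one prediction.

    p is the predicted probability of the positive class, y is 0 or 1.
    """
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)             # clamp to avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p, y):
    return -math.log(p if y == 1 else 1.0 - p)


# Ratio of focal loss to cross-entropy for an easy and a hard positive example.
easy = focal_loss(0.95, 1) / cross_entropy(0.95, 1)
hard = focal_loss(0.30, 1) / cross_entropy(0.30, 1)
print(easy < hard)  # True: the easy example is down-weighted far more
```

With `gamma=0` the modulating factor disappears and the loss reduces to an `alpha`-weighted cross-entropy.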