# Anchor boxes

Typically, most objects of a given class have a similar shape. For example, in most
cases, a bounding box around a person will have a greater height than width, and a
bounding box around a truck will have a greater width than height. Thus, we will
have a decent idea of the height and width of the objects present in an image even
before training the model (by inspecting the ground truth bounding boxes of objects
of various classes).
Furthermore, in some images, the objects of interest might be scaled, resulting in a
much smaller or much greater height and width than average, while still
maintaining the aspect ratio (that is, height/width).
Once we have a decent idea of the aspect ratio and the height and width of objects
(which can be obtained from ground truth values in the dataset) present in our
images, we define the anchor boxes with heights and widths representing the
majority of objects' bounding boxes within our dataset.

Typically, these heights and widths are obtained by running K-means clustering on
the ground truth bounding boxes of the objects present in the images.
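
To make the clustering step concrete, here is a minimal sketch in plain Python. It assumes each ground truth box is reduced to a `(width, height)` pair and uses Euclidean distance between pairs; some detectors cluster with an IoU-based distance instead. The function name `kmeans_anchors` and the sample sizes are hypothetical and only serve as illustration.

```python
import random

def kmeans_anchors(box_sizes, k, num_iters=50, seed=0):
    """Cluster (width, height) pairs into k representative anchor sizes."""
    rng = random.Random(seed)
    centroids = rng.sample(box_sizes, k)  # start from k boxes picked at random
    for _ in range(num_iters):
        # Assignment step: attach each box to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for w, h in box_sizes:
            distances = [(w - cw) ** 2 + (h - ch) ** 2 for cw, ch in centroids]
            clusters[distances.index(min(distances))].append((w, h))
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for centroid, members in zip(centroids, clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                new_centroids.append(centroid)  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return sorted(centroids)

# Made-up ground truth sizes: tall "person" boxes and wide "truck" boxes.
gt_sizes = [(40, 100), (42, 110), (38, 95), (120, 60), (130, 64), (125, 58)]
anchors = kmeans_anchors(gt_sizes, k=2)
```

With this toy data, one resulting anchor is taller than it is wide and the other is wider than it is tall, mirroring the person and truck example above.
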
Now that we understand how anchor boxes' heights and widths are obtained, we will
learn how to leverage them in the object detection process:

1. Slide each anchor box over an image from top left to bottom right.

2. An anchor box that has a high intersection over union (IoU) with the
object is labeled 1 (it contains an object), and the others are labeled 0:
    - We can refine this by using two IoU thresholds: if the IoU is greater
    than an upper threshold, the object class is 1; if it is less than a
    lower threshold, the object class is 0; otherwise, the label is unknown.
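
The IoU used in the steps above can be computed as in the following sketch. It assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples in pixel coordinates; the function name `iou` and the sample boxes are illustrative, not taken from a particular library.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Two 10 x 10 boxes that overlap in a 5 x 5 square: IoU = 25 / 175.
overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))
```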

Once we obtain the ground truths as defined here, we can build a model that can
predict the location of an object and also the offsets that must be applied to the
anchor box to match it with the ground truth.

![imgs](./imgs/o5.png)

In the preceding image, we have two anchor boxes, one that has a greater height than
width and the other with a greater width than height, to correspond to the objects
(classes) in the image – a person and a car.

We slide the two anchor boxes over the image and note the locations where the IoU of
the anchor box with the ground truth is the highest. We mark these locations as
containing an object and the rest of the locations as not containing an object.

In addition to the preceding two anchor boxes, we would also create anchor boxes
with varying scales so that we accommodate the differing scales at which an object
can appear within an image. An example of how the different scales of anchor
boxes look follows:

![imgs](./imgs/o6.png)

Note that all the anchor boxes have the same center but different aspect ratios or
scales.
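
A set of anchors that share one center can be generated as in the sketch below. It assumes the aspect ratio is height/width, as defined earlier, and that each scale multiplies a base side length while keeping the box area fixed across aspect ratios. The name `make_anchors`, the base size of 32, and the chosen scales and ratios are assumptions for illustration.

```python
import math

def make_anchors(cx, cy, base_size, scales, aspect_ratios):
    """Return (x_min, y_min, x_max, y_max) anchors centered at (cx, cy)."""
    anchors = []
    for scale in scales:
        side = base_size * scale  # side of the square anchor at this scale
        for ratio in aspect_ratios:  # ratio = height / width
            w = side / math.sqrt(ratio)
            h = side * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# 3 scales x 3 aspect ratios = 9 anchors around the center of a 224 x 224 image.
anchors = make_anchors(112, 112, base_size=32,
                       scales=[1, 2, 4], aspect_ratios=[0.5, 1.0, 2.0])
```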

# Region Proposal Network

Imagine a scenario where we have a 224 x 224 x 3 image. Furthermore, let's say that
the anchor box is of shape 8 x 8 for this example. If we have a stride of 8 pixels, we are
fetching 224/8 = 28 crops of a picture for every row, which gives 28*28 = 784 crops
from a picture. We then pass each of these crops through a Region Proposal
Network (RPN) model that indicates whether the crop contains an object. Essentially,
an RPN suggests the likelihood of a crop containing an object.
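
The crop count can be checked with a short sketch that enumerates the top-left corner of every anchor position, moving from top left to bottom right. The helper name `sliding_positions` is hypothetical.

```python
def sliding_positions(image_size, box_size, stride):
    """Top-left corners of every box position that fits inside a square image."""
    last = image_size - box_size  # largest valid top-left coordinate
    return [(x, y)
            for y in range(0, last + 1, stride)   # rows, top to bottom
            for x in range(0, last + 1, stride)]  # columns, left to right

positions = sliding_positions(image_size=224, box_size=8, stride=8)
crops_per_row = 224 // 8           # 28
total_crops = len(positions)       # 28 * 28 = 784
```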

Let's compare the output of selectivesearch and the output of an RPN.

selectivesearch gives region candidates based on a set of computations on top
of pixel values. An RPN, however, generates region candidates based on the anchor
boxes and the strides with which the anchor boxes are slid over the image. Once we
obtain the region candidates using either of these two methods, we identify the
candidates that are most likely to contain an object.

While region proposal generation based on selectivesearch is done outside of the
neural network, we can build an RPN that is a part of the object detection network.
Using an RPN, we no longer have to perform separate computations to calculate
region proposals outside of the network. This way, we have a single model to
identify regions, identify the classes of objects in the image, and identify
their corresponding bounding box locations.

Next, we will learn how an RPN identifies whether a region candidate (a crop
obtained after sliding an anchor box) contains an object or not. In our training data,
we have the ground truth bounding boxes corresponding to objects. We take each
region candidate and compute its IoU with the ground truth bounding boxes of
objects in the image. If the IoU is greater than a certain threshold (say, 0.5), the
region candidate is treated as containing an object. If the IoU is less than a lower
threshold (say, 0.1), the region candidate is treated as not containing an object.
All the candidates that have an IoU between the two thresholds (0.1 to 0.5) are
ignored while training.
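
The labeling rule just described can be sketched as follows, using the example thresholds of 0.5 and 0.1. The code assumes boxes are `(x_min, y_min, x_max, y_max)` tuples and uses -1 to mark ignored candidates; that marker value, the function names, and the sample boxes are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def label_candidates(candidates, gt_boxes, pos_threshold=0.5, neg_threshold=0.1):
    """Label each candidate 1 (object), 0 (no object), or -1 (ignored)."""
    labels = []
    for candidate in candidates:
        # Compare against every ground truth box and keep the best overlap.
        best_iou = max((iou(candidate, gt) for gt in gt_boxes), default=0.0)
        if best_iou > pos_threshold:
            labels.append(1)
        elif best_iou < neg_threshold:
            labels.append(0)
        else:
            labels.append(-1)  # between the thresholds: skipped during training
    return labels

gt_boxes = [(50, 50, 150, 150)]
candidates = [(55, 55, 150, 150),    # IoU about 0.90 -> object
              (100, 50, 200, 150),   # IoU about 0.33 -> ignored
              (200, 200, 260, 260)]  # IoU 0        -> no object
labels = label_candidates(candidates, gt_boxes)
```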

Once we train a model to predict if the region candidate contains an object, we then
perform non-max suppression, as multiple overlapping regions can contain an object.

An RPN is trained to identify region proposals with a
high likelihood of containing an object by performing the following steps:
1. Slide anchor boxes of different aspect ratios and sizes across the image to
fetch crops of an image.
2. Calculate the IoU between the ground truth bounding boxes of objects in
the image and the crops obtained in the previous step.
3. Prepare the training dataset in such a way that crops with an IoU greater
than an upper threshold are labeled as containing an object and crops with
an IoU less than a lower threshold are labeled as not containing an object.
4. Train the model to identify regions that contain an object.
5. Perform non-max suppression to identify the region candidate that has the
highest probability of containing an object and eliminate other region
candidates that have a high overlap with it.
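
The non-max suppression step at the end of this list can be sketched as a greedy loop: keep the highest-scoring candidate, drop every remaining candidate that overlaps it too much, and repeat. The overlap threshold of 0.5, the function names, and the sample boxes and scores are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, in order of decreasing score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring candidate still in play
        keep.append(best)
        # Discard candidates that overlap the kept box too heavily.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

In this example, the second box overlaps the first almost entirely, so only the first and third boxes are kept.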

# Classification and regression

So far, we have learned about the following steps toward identifying objects and
adjusting their bounding boxes:

1. Identify the regions that contain objects.

2. Ensure that all the feature maps of regions, irrespective of the regions'
shape, have exactly the same size using region of interest (RoI) pooling
(which we learned about in the previous chapter).
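
As a reminder of how RoI pooling produces a fixed-size output, here is a simplified sketch on a single-channel feature map stored as a nested list. It assumes the region is given in feature-map coordinates as `(row_start, col_start, row_end, col_end)` with exclusive ends, and it splits the region into a grid of bins and takes the maximum of each bin. The function names are hypothetical, and real implementations work on multi-channel tensors.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)

def roi_max_pool(feature_map, roi, output_size):
    """Max-pool a region of a 2D feature map to output_size x output_size."""
    r0, c0, r1, c1 = roi
    height, width = r1 - r0, c1 - c0
    pooled = []
    for i in range(output_size):
        # Row range of bin i (bins may overlap when sizes do not divide evenly).
        row_start = r0 + (i * height) // output_size
        row_end = r0 + ceil_div((i + 1) * height, output_size)
        pooled_row = []
        for j in range(output_size):
            col_start = c0 + (j * width) // output_size
            col_end = c0 + ceil_div((j + 1) * width, output_size)
            pooled_row.append(max(feature_map[r][c]
                                  for r in range(row_start, row_end)
                                  for c in range(col_start, col_end)))
        pooled.append(pooled_row)
    return pooled

# A 6 x 8 feature map whose value at (row, col) is row * 8 + col.
feature_map = [[row * 8 + col for col in range(8)] for row in range(6)]
small_region = roi_max_pool(feature_map, (0, 0, 4, 6), output_size=2)
large_region = roi_max_pool(feature_map, (1, 2, 6, 7), output_size=2)
```

Both regions have different shapes, yet each one is reduced to a 2 x 2 output.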

Two issues with these steps are as follows:

1. The region proposals do not fit tightly around the object (an IoU > 0.5 is
all that the threshold we had in the RPN requires).

2. We identified whether the region contains an object or not, but not the class
of the object located in the region.

We address these two issues in this section, where we take the uniformly shaped
feature map obtained previously and pass it through a network. We expect the
network to predict the class of the object contained within the region and also the
offsets corresponding to the region to ensure that the bounding box is as tight as
possible around the object in the image.

![imgs](./imgs/o7.png)

In the preceding diagram, we are taking the output of RoI pooling as input (the 7 x 7 x
512 shape), flattening it, and connecting it to a dense layer before predicting two
different aspects:

1. The class of the object in the region

2. The offsets to apply to the predicted bounding box of the region
to maximize the IoU with the ground truth

Hence, if there are 20 classes in the data, the output of the neural network contains a
total of 25 outputs – 21 classes (including the background class) and the 4 offsets to be
applied to the height, width, and two center coordinates of the bounding box.
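
The output count and the role of the four offsets can be illustrated with the sketch below. It assumes boxes are stored as `(cx, cy, w, h)` and uses a common parameterization in which the center offsets are scaled by the box size and the size offsets are log ratios. The text above does not specify the exact parameterization, so treat this choice, the function names, and the sample boxes as assumptions.

```python
import math

def encode_offsets(proposal, gt_box):
    """Offsets that map a proposal (cx, cy, w, h) onto a ground truth box."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt_box
    return ((gx - px) / pw,       # shift of the center along x, in box widths
            (gy - py) / ph,       # shift of the center along y, in box heights
            math.log(gw / pw),    # log ratio of widths
            math.log(gh / ph))    # log ratio of heights

def decode_offsets(proposal, offsets):
    """Apply predicted offsets to a proposal to get the refined box."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = offsets
    return (px + dx * pw, py + dy * ph, pw * math.exp(dw), ph * math.exp(dh))

num_classes = 20
num_outputs = (num_classes + 1) + 4  # 21 class scores (with background) + 4 offsets

proposal = (100.0, 100.0, 50.0, 80.0)
gt_box = (110.0, 95.0, 60.0, 100.0)
targets = encode_offsets(proposal, gt_box)
refined = decode_offsets(proposal, targets)  # recovers gt_box up to rounding
```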

Now that we have learned the different components of an object detection pipeline,
let's summarize it with the following diagram:

![imgs](./imgs/o8.png)