# Anchor boxes

Most objects of a given class have a similar shape. For example, a bounding box
around a person is usually taller than it is wide, while a bounding box around a
truck is usually wider than it is tall. We can therefore form a decent idea of the
heights and widths of the objects present in an image even before training the model,
by inspecting the ground truth bounding boxes of objects of the various classes.
Furthermore, in some images, the objects of interest might be scaled, resulting in a
much smaller or much greater height and width than average while still
maintaining the aspect ratio (that is, height/width).

Once we have a decent idea of the aspect ratios, heights, and widths of the objects
present in our images (all of which can be obtained from the ground truth values in
the dataset), we define anchor boxes whose heights and widths represent the
majority of the objects' bounding boxes within our dataset.

Typically, these heights and widths are obtained by applying K-means clustering to
the ground truth bounding boxes of the objects present in the images.
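
The clustering step can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `kmeans_anchor_shapes`, the use of Euclidean distance on (width, height) pairs, and the sample box sizes are all assumptions made for this example. Some detectors cluster with an IoU-based distance instead.

```python
import numpy as np

def kmeans_anchor_shapes(wh, k, n_iter=50, seed=0):
    """Cluster (width, height) pairs into k representative anchor shapes."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct ground-truth boxes chosen at random.
    centers = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every box to its nearest center (Euclidean distance, assumed here).
        dists = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Move each center to the mean of its boxes; keep the old center if empty.
        new_centers = np.array([
            wh[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

# Hypothetical ground-truth sizes: tall "person" boxes and wide "truck" boxes.
wh = np.array([[30, 90], [32, 96], [28, 84], [120, 60], [126, 63], [114, 57]])
anchors = kmeans_anchor_shapes(wh, k=2)
print(anchors)  # each row is one (width, height) anchor shape
```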

Now that we understand how anchor boxes' heights and widths are obtained, we will
learn how to use them to build the ground truth for training:

1. Slide each anchor box over an image from top left to bottom right.

2. An anchor box that has a high intersection over union (IoU) with the
object is labeled as containing an object (1), and the others are
labeled 0:
    - We can refine this with two IoU thresholds: if the IoU is greater than
    the upper threshold, the object class is 1; if it is less than the lower
    threshold, the object class is 0; otherwise, the label is unknown.
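
The two steps above can be sketched in NumPy as follows. The helper names (`iou`, `slide_anchor`, `label_anchors`), the stride, the image and box sizes, and the thresholds of 0.7 and 0.3 are assumptions chosen for illustration. The value -1 stands in for the "unknown" label.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=float)
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Clip at zero so boxes that do not overlap get an intersection of 0.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def slide_anchor(img_w, img_h, anchor_w, anchor_h, stride):
    """Place one anchor shape at every stride step, top left to bottom right."""
    boxes = []
    for y in range(0, img_h - anchor_h + 1, stride):
        for x in range(0, img_w - anchor_w + 1, stride):
            boxes.append((x, y, x + anchor_w, y + anchor_h))
    return np.array(boxes, dtype=float)

def label_anchors(anchors, gt_box, pos_thr=0.7, neg_thr=0.3):
    """Return labels (1 = object, 0 = no object, -1 = unknown) and the IoUs."""
    overlaps = iou(gt_box, anchors)
    labels = np.full(len(anchors), -1, dtype=int)
    labels[overlaps < neg_thr] = 0
    labels[overlaps > pos_thr] = 1
    return labels, overlaps

# Hypothetical 100x100 image with one tall ground-truth box.
anchors = slide_anchor(img_w=100, img_h=100, anchor_w=40, anchor_h=80, stride=10)
gt_box = (30, 10, 70, 90)
labels, overlaps = label_anchors(anchors, gt_box)
```

In this toy setup, the anchor placed exactly on the ground truth box gets label 1, anchors far from it get label 0, and partially overlapping anchors whose IoU falls between the two thresholds are left as unknown.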

Once we obtain the ground truths as defined here, we can build a model that
predicts the location of an object, along with the offset to apply to the
corresponding anchor box so that it matches the ground truth.
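
The text does not specify how the offset is parameterized. As an illustration only, the sketch below assumes one widely used scheme: center shifts normalized by the anchor size, plus log-scale width and height ratios. The function names `encode_offsets` and `decode_offsets` and the sample boxes are hypothetical.

```python
import numpy as np

def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    w, h = box[2] - box[0], box[3] - box[1]
    return box[0] + w / 2, box[1] + h / 2, w, h

def encode_offsets(anchor, gt):
    """Offsets a model would be trained to predict for this anchor (assumed scheme)."""
    ax, ay, aw, ah = to_center(anchor)
    gx, gy, gw, gh = to_center(gt)
    return np.array([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)])

def decode_offsets(anchor, t):
    """Apply offsets to an anchor to recover a box as (x1, y1, x2, y2)."""
    ax, ay, aw, ah = to_center(anchor)
    cx, cy = ax + t[0] * aw, ay + t[1] * ah
    w, h = aw * np.exp(t[2]), ah * np.exp(t[3])
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

# Hypothetical anchor that sits 10 pixels above the ground-truth box.
anchor = (30.0, 0.0, 70.0, 80.0)
gt = (30.0, 10.0, 70.0, 90.0)
t = encode_offsets(anchor, gt)
recovered = decode_offsets(anchor, t)
```

Encoding and then decoding returns the ground truth box, which is what lets the predicted offsets be turned back into a final bounding box at inference time.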

# Region Proposal Network

A region proposal network (RPN) is a model trained to identify region proposals
with a high likelihood of containing an object, by performing the following steps:
1. Slide anchor boxes of different aspect ratios and sizes across the image to
fetch crops of an image.
2. Calculate the IoU between the ground truth bounding boxes of objects in
the image and the crops obtained in the previous step.
3. Prepare the training dataset in such a way that crops with an IoU greater
than a threshold are labeled as containing an object and crops with an IoU
less than a threshold are labeled as not containing an object.
4. Train the model to identify regions that contain an object.
5. Perform non-max suppression to identify the region candidate that has the
highest probability of containing an object and eliminate other region
candidates that have a high overlap with it.
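
The non-max suppression in step 5 can be sketched as a greedy loop. This is a minimal NumPy version under assumed inputs: the function name `nms`, the IoU threshold of 0.5, and the sample boxes and scores are illustrative choices, not values from the text.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-max suppression; boxes are (x1, y1, x2, y2). Returns kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # IoU between the best remaining candidate and all the others.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        overlaps = inter / (areas[best] + areas[rest] - inter)
        # Drop candidates that overlap the best one too much; keep the rest.
        order = rest[overlaps <= iou_thr]
    return keep

# Hypothetical candidates: two near-duplicates and one separate region.
boxes = [[10, 10, 50, 50], [12, 12, 52, 52], [60, 60, 100, 100]]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

Here the second box overlaps the first with an IoU of about 0.82, so it is eliminated, while the third box does not overlap the first at all and is kept.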