**Object Detection Architectures**

In this notebook, we'll examine neural network architectures for object detection. We'll work through the examples in [`torchvision/models/detection`](https://github.com/pytorch/vision/tree/master/torchvision/models/detection).

First up is Faster R-CNN, defined in `faster_rcnn.py`. The `FasterRCNN` class inherits from `GeneralizedRCNN`, defined in `generalized_rcnn.py`. Let's examine this file (it's short! 51 sloc). First let's do our imports:

In [1]:
from collections import OrderedDict
import torch
from torch import nn

Now let's examine our `GeneralizedRCNN` class, which inherits from `nn.Module`. First we have our constructor:

In [2]:
class GeneralizedRCNN(nn.Module):
    def __init__(self, backbone, rpn, roi_heads, transform):
        # nn.Module must be initialized before submodules can be assigned
        super().__init__()
        self.transform = transform
        self.backbone = backbone
        self.rpn = rpn
        self.roi_heads = roi_heads

Then we have a `forward` method that defines the forward pass. In training mode it returns the losses; otherwise it returns the detections:

In [3]:
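def forward(self, images, targets=None):
    if self.training and targets is None:
        raise ValueError("In training mode, targets should be passed")
    # remember the input sizes so detections can be mapped back later
    original_image_sizes = [img.shape[-2:] for img in images]
    # prepare the images (and targets) for the backbone
    images, targets = self.transform(images, targets)
    # compute feature maps from the batched image tensor
    features = self.backbone(images.tensors)
    if isinstance(features, torch.Tensor):
        # wrap a single feature map so later stages always receive a dict
        features = OrderedDict([(0, features)])
    # propose candidate regions from the features
    proposals, proposal_losses = self.rpn(images, features, targets)
    # classify the proposals and refine their boxes
    detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
    # map the detections back to the original image sizes
    detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)
    losses = {}
    losses.update(detector_losses)
    losses.update(proposal_losses)
    if self.training:
        return losses
    return detections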

For more background, this tutorial is also worth examining: https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection

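`backbone` is the network that computes feature maps from the transformed images. Both the RPN and the RoI heads work from these features.

`rpn` is a region proposal network. It takes the backbone's features and proposes candidate regions that are likely to contain an object.

`roi_heads` takes the features and the proposals, classifies each proposed region, and refines its box to produce the final detections (and, in Mask R-CNN, masks).

`transform` prepares the images and targets before they reach the backbone. Its `postprocess` method maps the detections back to the original image sizes.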

Let's see if we can now make sense of what's going on with `FasterRCNN`, starting with the class constructor:

In [4]:
def __init__(self, backbone, num_classes=None, 
             min_size=800, max_size=1333,
             image_mean=None, image_std=None,
             rpn_anchor_generator=None, rpn_head=None,
             rpn_pre_nms_top_n_train=2000, rpn_pre_nms_top_n_test=1000,
             rpn_post_nms_top_n_train=2000, rpn_post_nms_top_n_test=1000,
             rpn_nms_thresh=0.7,
             rpn_fg_iou_thresh=0.7, rpn_bg_iou_thresh=0.3,
             rpn_batch_size_per_image=256, rpn_positive_fraction=0.5,
             box_roi_pool=None, box_head=None, box_predictor=None,
             box_score_thresh=0.05, box_nms_thresh=0.5, box_detections_per_img=100,
             box_fg_iou_thresh=0.5, box_bg_iou_thresh=0.5,
             box_batch_size_per_image=512, box_positive_fraction=0.25,
             bbox_reg_weights=None):
    pass

That's a ton of arguments. They're explained in the docstring of `class FasterRCNN`:

`backbone`: the network used to compute the features for the model

`num_classes`: number of output classes of the model (including the background class)

`min_size`: minimum size of the image to be rescaled before feeding it to the backbone

`max_size`: max size of the image to be rescaled before feeding it to the backbone

We can also normalize our inputs using `image_mean` and `image_std`.
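To see how `min_size` and `max_size` interact, here is a minimal sketch of the rescaling rule, assuming the image is scaled so that its shorter side reaches `min_size` unless that would push the longer side past `max_size`. The helper names `resize_scale` and `resized_shape` are ours, and the rounding is only for illustration; the library's own rounding may differ.

```python
def resize_scale(height, width, min_size=800, max_size=1333):
    """Scale factor for an image of the given size (illustrative helper)."""
    short_side = min(height, width)
    long_side = max(height, width)
    # Scale the shorter side up to min_size, but never let the
    # longer side exceed max_size.
    return min(min_size / short_side, max_size / long_side)


def resized_shape(height, width, min_size=800, max_size=1333):
    """Approximate (height, width) after rescaling."""
    scale = resize_scale(height, width, min_size, max_size)
    return round(height * scale), round(width * scale)


print(resized_shape(480, 640))   # the shorter side is the limiting one
print(resized_shape(400, 1600))  # the longer side is the limiting one
```

For a 480 x 640 image the shorter side ends up at 800, while for a very wide 400 x 1600 image the longer side is capped at 1333 and the shorter side ends up well below 800.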

The next set of arguments are for the region proposal network.

The Faster R-CNN paper explains the concept of "anchors":

"At each sliding window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as $k$ ... The $k$ proposals are parameterized relative to $k$ reference boxes, which we call *anchors* ... For a convolutional feature map of size $W \cdot H$, there are $W \cdot H \cdot k$ anchors in total."

`rpn_anchor_generator` generates anchors for a set of feature maps
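To make the anchor count concrete, here is a minimal pure-Python sketch that places $k$ anchors at every location of a feature map. The scales and aspect ratios are the three-by-three setting described in the paper; the stride of 16, the way anchor centres are placed, and the helper names `anchors_at` and `all_anchors` are assumptions for illustration, not the `rpn_anchor_generator` implementation.

```python
import math


def anchors_at(cx, cy, scales=(128, 256, 512), aspect_ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(aspect_ratios) boxes (x1, y1, x2, y2)
    centred on (cx, cy). Each box has area scale**2."""
    boxes = []
    for scale in scales:
        for ratio in aspect_ratios:  # ratio = height / width
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def all_anchors(feat_h, feat_w, stride=16):
    """Anchors for every location of a feat_h x feat_w feature map.
    The stride maps feature-map coordinates back to image coordinates
    (assumed to be 16 here)."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            anchors.extend(anchors_at(x * stride, y * stride))
    return anchors


anchors = all_anchors(40, 60)
print(len(anchors))  # W * H * k = 60 * 40 * 9
```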

`rpn_head` computes the objectness and regression deltas from the RPN. What are those? Let's consider the definition of a Region Proposal Network from the paper:

"A Region Proposal Network (RPN) takes an image (of any size) as an input and outputs a set of rectangular object proposals, each with an objectness score ... 'Objectness measures membership to a set of object classes vs. background'" 

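In other words, objectness tells us how likely a region is to contain an object of any class rather than background. The regression deltas are the predicted offsets that shift and resize an anchor to better fit the object. The regression loss then penalizes the difference between the predicted deltas and the deltas that would map the anchor onto the ground truth bounding box.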
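To make "regression deltas" concrete, here is a minimal sketch of how four predicted deltas could turn an anchor into a proposal. It assumes the parameterization from the Faster R-CNN paper, where the centre offsets are expressed relative to the anchor's width and height and the size changes are log-scale factors. The function name `apply_deltas` is ours.

```python
import math


def apply_deltas(anchor, deltas):
    """Decode (tx, ty, tw, th) relative to an anchor (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = anchor
    wa, ha = x2 - x1, y2 - y1
    cxa, cya = x1 + wa / 2, y1 + ha / 2

    tx, ty, tw, th = deltas
    cx = cxa + tx * wa      # shift the centre by a fraction of the anchor size
    cy = cya + ty * ha
    w = wa * math.exp(tw)   # rescale width and height
    h = ha * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# All-zero deltas leave the anchor unchanged.
print(apply_deltas((0, 0, 10, 10), (0, 0, 0, 0)))
# Move the centre right by half a width and double the width.
print(apply_deltas((0, 0, 10, 10), (0.5, 0, math.log(2), 0)))
```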

We also need to specify how many proposals to keep before and after applying "non-maximum suppression" (NMS) during training and testing. The default values are set to the following:

`rpn_pre_nms_top_n_train = 2000` <br/>
`rpn_pre_nms_top_n_test = 1000` <br/>
`rpn_post_nms_top_n_train = 2000` <br/>
`rpn_post_nms_top_n_test = 1000`
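Here is a minimal pure-Python sketch of how the pre- and post-NMS limits could fit around a greedy NMS step: keep the `pre_nms_top_n` highest-scoring proposals, suppress any proposal that overlaps a higher-scoring kept one by more than the NMS threshold, then keep at most `post_nms_top_n` survivors. The helper names `iou` and `filter_proposals` are ours, and this illustrates the idea rather than the library's implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_proposals(boxes, scores, pre_nms_top_n, post_nms_top_n, nms_thresh=0.7):
    """Return the indices of the proposals that survive."""
    # 1. Keep only the top-scoring proposals before NMS.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    order = order[:pre_nms_top_n]
    # 2. Greedy NMS: visit boxes in score order and drop any box that
    #    overlaps an already-kept box by more than nms_thresh.
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= nms_thresh for j in keep):
            keep.append(i)
    # 3. Keep only the top survivors after NMS.
    return keep[:post_nms_top_n]


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30), (0, 0, 10, 9)]
scores = [0.9, 0.8, 0.7, 0.6]
print(filter_proposals(boxes, scores, pre_nms_top_n=4, post_nms_top_n=2))
```

In this toy input the second and fourth boxes overlap the first one heavily, so only the first and third boxes survive.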

We also have three thresholds:

`rpn_nms_thresh`: threshold for applying NMS to the RPN proposals (default: 0.7)

`rpn_fg_iou_thresh`: the overlap (IoU) between the anchor and a ground truth box at or above which the anchor is classified as "positive" during RPN training (default: 0.7)

`rpn_bg_iou_thresh`: the overlap (IoU) between the anchor and the ground truth boxes below which the anchor is classified as "negative" during RPN training (default: 0.3)
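The two IoU thresholds split the anchors into three groups. Here is a minimal sketch, assuming we already know each anchor's best IoU with any ground truth box: anchors at or above the foreground threshold are positive, anchors below the background threshold are negative, and anchors in between are ignored. The function name `label_anchors` and the label values are ours, and the sketch leaves out the paper's extra rule that the anchor with the highest IoU for each ground truth box is also positive.

```python
def label_anchors(max_ious, fg_iou_thresh=0.7, bg_iou_thresh=0.3):
    """Label each anchor from its best IoU with any ground truth box:
    1 = positive, 0 = negative, -1 = ignored."""
    labels = []
    for value in max_ious:
        if value >= fg_iou_thresh:
            labels.append(1)
        elif value < bg_iou_thresh:
            labels.append(0)
        else:
            labels.append(-1)  # between the thresholds: not used in the loss
    return labels


print(label_anchors([0.85, 0.5, 0.1, 0.7, 0.3]))
```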

and two more RPN arguments:

`rpn_batch_size_per_image`: "number of anchors sampled during training for computing the loss" (default: 256)

`rpn_positive_fraction`: "proportion of positive anchors in a mini-batch during training of the RPN" (default: 0.5)
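These two arguments work together. Here is a minimal sketch of the sampling step, assuming labels of 1 (positive), 0 (negative), and -1 (ignored): we take at most `batch_size_per_image * positive_fraction` positives and fill the rest of the mini-batch with negatives, which matches the paper's note about padding with negatives when there are too few positives. The function name `sample_anchors` and the `seed` argument are ours.

```python
import random


def sample_anchors(labels, batch_size_per_image=256, positive_fraction=0.5, seed=0):
    """Return (positive_indices, negative_indices) sampled for the loss."""
    rng = random.Random(seed)
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]

    # Cap the positives at the requested fraction of the mini-batch ...
    num_pos = min(len(positives), int(batch_size_per_image * positive_fraction))
    # ... and fill whatever is left with negatives.
    num_neg = min(len(negatives), batch_size_per_image - num_pos)
    return rng.sample(positives, num_pos), rng.sample(negatives, num_neg)


# Only 10 positives are available, so negatives fill the remaining slots.
labels = [1] * 10 + [0] * 1000 + [-1] * 50
pos, neg = sample_anchors(labels)
print(len(pos), len(neg))
```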

The remaining arguments pertain to the box heads, which classify the proposals and refine their (bounding) boxes:

`box_roi_pool`: module for cropping and resizing feature maps according to locations determined by bounding boxes

`box_head`: module that takes the cropped feature maps as input and passes its output to the predictor

`box_predictor`: module that takes the output of `box_head` and returns the classification logits and box regression deltas

`box_score_thresh`: classification score threshold above which to return proposals during inference (default: 0.05)

`box_nms_thresh`: NMS threshold for prediction head during inference (default: 0.5)

`box_detections_per_img`: maximum detections per image (default: 100)

`box_fg_iou_thresh`: minimum IoU between proposals and ground truth for positive classification (default: 0.5)

`box_bg_iou_thresh`: maximum IoU between proposals and ground truth for negative classification (default: 0.5)

`box_batch_size_per_image`: number of proposals to sample during training of classification head (default: 512)

`box_positive_fraction`: "proportion of positive proposals in a mini-batch during training of the classification head" (default: 0.25)

`bbox_reg_weights`: "weights for encoding/decoding of bounding boxes"
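To see what these weights do, here is a minimal sketch of box encoding, assuming the same delta parameterization as in the Faster R-CNN paper with each delta multiplied by its weight. The function name `encode_box` is ours. The default weights of `(10, 10, 5, 5)` are what we understand the box heads to fall back on when `bbox_reg_weights=None`; treat that as an assumption and check it against the source.

```python
import math


def encode_box(proposal, gt_box, weights=(10.0, 10.0, 5.0, 5.0)):
    """Regression targets that map a proposal onto a ground truth box.
    Both boxes are (x1, y1, x2, y2)."""
    wx, wy, ww, wh = weights
    pw, ph = proposal[2] - proposal[0], proposal[3] - proposal[1]
    pcx, pcy = proposal[0] + pw / 2, proposal[1] + ph / 2
    gw, gh = gt_box[2] - gt_box[0], gt_box[3] - gt_box[1]
    gcx, gcy = gt_box[0] + gw / 2, gt_box[1] + gh / 2

    # Larger weights make small offsets produce larger regression targets.
    dx = wx * (gcx - pcx) / pw
    dy = wy * (gcy - pcy) / ph
    dw = ww * math.log(gw / pw)
    dh = wh * math.log(gh / ph)
    return (dx, dy, dw, dh)


# A ground truth box shifted right by one tenth of the proposal's width.
print(encode_box((0, 0, 10, 10), (1, 0, 11, 10)))
print(encode_box((0, 0, 10, 10), (1, 0, 11, 10), weights=(1.0, 1.0, 1.0, 1.0)))
```

Decoding divides the predicted deltas by the same weights before applying them, so the weights only change the scale of the regression targets.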