The Mask R-CNN model studied here is the one built by the default `maskrcnn_resnet50_fpn` function in `torchvision.models.detection.mask_rcnn.py`. 
1. The backbone model is ResNet50;
2. A `Feature Pyramid Network` is employed between the backbone and the `Region Proposal Network`;
3. Class `FastRCNNPredictor` is used to compute the class logits and bounding box regression; class `MaskRCNNPredictor` is used to compute the mask logits on top of the fully convolutional mask head. 

<font color=red>The core of the forward function shared by Faster R-CNN and Mask R-CNN is as follows:</font>
```
images, targets = self.transform(images, targets)
features = self.backbone(images.tensors)
if isinstance(features, torch.Tensor):
    features = OrderedDict([('0', features)])
proposals, proposal_losses = self.rpn(images, features, targets)
detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)

losses = {}
losses.update(detector_losses)
losses.update(proposal_losses)
```

The difference between Faster R-CNN and Mask R-CNN is that Mask R-CNN adds three modules to the RoI heads: `mask_roi_pool`, `mask_head`, and `mask_predictor`. 

Channel numbers in ResNet50 (input image, stem, then the four residual stages): 3 -- 64 -- 256 -- 512 -- 1024 -- 2048. 

### Feature Pyramid Network
The structure in the default Mask R-CNN model is as follows:

<img src="../img/fpn_model.png" alt="drawing" width="550"/>

### Region Proposal Network
The Region Proposal Network consists of two parts: 
1. anchor box generation, which uses the default `AnchorGenerator` class in `torchvision.models.detection.rpn.py`. 
2. anchor box selection, which uses class `RPNHead` in `torchvision.models.detection.rpn.py` to predict an objectness score and box offsets for every anchor. 

<img src="../img/RPN.png" alt="drawing" width="570"/>

#### AnchorGenerator(): to generate anchor boxes for each image
1. default anchor sizes: ((32,), (64,), (128,), (256,), (512,))
2. default aspect ratios: ((0.5, 1.0, 2.0),)

The resulting cell anchors, centred at the origin and written as `(x1, y1, x2, y2)` with one row per anchor size, are:
```
(-23, -11, 23, 11), (-16, -16, 16, 16), (-11, -23, 11, 23)
(-45, -23, 45, 23), (-32, -32, 32, 32), (-23, -45, 23, 45)
(-91, -45, 91, 45), (-64, -64, 64, 64), (-45, -91, 45, 91)
(-181, -91, 181, 91), (-128, -128, 128, 128), (-91, -181, 91, 181)
(-362, -181, 362, 181), (-256, -256, 256, 256), (-181, -362, 181, 362)
```
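The cell anchors listed above can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming the aspect ratio is height / width and that the half-widths and half-heights are rounded to the nearest integer; the function name `cell_anchors` is made up for illustration and is not part of torchvision.

```python
import math

def cell_anchors(size, aspect_ratios=(0.5, 1.0, 2.0)):
    """Zero-centred (x1, y1, x2, y2) anchors for one anchor size (illustrative helper)."""
    anchors = []
    for ratio in aspect_ratios:
        h_ratio = math.sqrt(ratio)   # assumption: ratio = height / width
        w_ratio = 1.0 / h_ratio      # keeps the anchor area close to size * size
        half_w = size * w_ratio / 2.0
        half_h = size * h_ratio / 2.0
        anchors.append((round(-half_w), round(-half_h), round(half_w), round(half_h)))
    return anchors

for size in (32, 64, 128, 256, 512):
    print(cell_anchors(size))
```

Each printed row should match the corresponding row of the list above.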
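The anchor sizes are expressed in pixels of the input image, so for each feature map we need its stride relative to the input in order to place the cell anchors at every feature map location.<br>
Number of generated anchor boxes, with 3 anchors per location on each of the five feature maps: $3 \times (200 \times 272 + 100 \times 136 + 50 \times 68 + 25 \times 34 + 13 \times 17) = 217413.$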
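The count can be checked with a short calculation. The sketch below assumes the batched input is 800 x 1088 pixels and that the five feature maps have strides 4, 8, 16, 32 and 64 (both inferred from the feature map sizes in the formula, not stated in the code); `count_anchors` is a hypothetical helper.

```python
import math

def count_anchors(image_h, image_w, strides=(4, 8, 16, 32, 64), anchors_per_location=3):
    """Total number of anchors over all feature map levels (illustrative helper)."""
    total = 0
    for stride in strides:
        # Assumption: each feature map is the input size divided by its stride, rounded up.
        feat_h = math.ceil(image_h / stride)
        feat_w = math.ceil(image_w / stride)
        total += anchors_per_location * feat_h * feat_w
    return total

print(count_anchors(800, 1088))  # 217413
```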

#### RoI selection based on anchor boxes, objectness and the offset
1. based on the anchor boxes and predicted offset, get the corresponding RoIs;
2. based on the objectness, select top `n` boxes independently per feature map before applying nms; (default `n=2000`. If the number of anchor boxes in this level <= 2k, keep them all.)
3. apply non-maximum suppression independently per level, only keep top `k` scoring predictions.
<br>This selection process does not use the ground-truth bounding boxes. 
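Steps 2 and 3 can be sketched for a single feature map level in plain Python. This is only an illustration of top-`n` selection followed by greedy NMS, not the torchvision implementation; the function names, the parameter names and the NMS threshold of 0.7 are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def select_proposals(boxes, scores, pre_nms_top_n=2000, post_nms_top_n=2000, nms_thresh=0.7):
    """Return indices of the boxes kept for one level (illustrative helper)."""
    # Step 2: keep the pre_nms_top_n boxes with the highest objectness.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)[:pre_nms_top_n]
    # Step 3: greedy NMS, dropping boxes that overlap an already kept box too much.
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= nms_thresh for j in keep):
            keep.append(i)
    return keep[:post_nms_top_n]

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(select_proposals(boxes, scores, nms_thresh=0.5))  # [0, 2]
```

In the usage example the second box overlaps the first with IoU of about 0.68, so it is suppressed at a threshold of 0.5.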

**The ground-truth bounding boxes are used when identifying the anchor boxes that best match them**. 
By comparing the best-matched anchor boxes with the ground-truth bounding boxes, we can evaluate:
1. the mismatch of objectness;
2. the mismatch of offset;
<font color=red>They are computed as the loss and used to optimize the three convolutional layers in `RegionProposalNetwork`. </font>
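The matching step can be illustrated with a simplified labelling rule: an anchor is positive when its best IoU with any ground-truth box is high, negative when it is low, and ignored otherwise. The thresholds 0.7 and 0.3 are assumed defaults, the helper names are hypothetical, and the sketch leaves out any extra rule for ground-truth boxes that no anchor matches well.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, fg_thresh=0.7, bg_thresh=0.3):
    """One label per anchor: 1 = object, 0 = background, -1 = ignored in the loss."""
    labels = []
    for anchor in anchors:
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best >= fg_thresh:
            labels.append(1)
        elif best < bg_thresh:
            labels.append(0)
        else:
            labels.append(-1)
    return labels

anchors = [(0, 0, 10, 10), (0, 0, 10, 20), (100, 100, 110, 110)]
print(label_anchors(anchors, gt_boxes=[(0, 0, 10, 10)]))  # [1, -1, 0]
```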

<font color=red>The loss in `RegionProposalNetwork` is the only factor determining the optimization of the three convolutional layers in `RegionProposalNetwork`. The remaining losses, like the final bounding box and object class prediction of each proposal box, are not relevant to the three convolutional layers in `RegionProposalNetwork`.</font>

The reason is this line in the `forward` function of `RegionProposalNetwork`:
```
# note that we detach the deltas because Faster R-CNN do not backprop through the proposals
proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
```
A good explanation of the `detach` method of tensors: http://www.bnikolic.co.uk/blog/pytorch-detach.html
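For reference, the `decode` step in the line above turns an anchor and its predicted deltas into a proposal box. A minimal sketch for a single box, assuming the usual R-CNN centre/size parameterization, unit weights, and a clamp on the size deltas before the exponential (the clamp value `log(1000 / 16)` is an assumption about the default):

```python
import math

def decode(deltas, anchor, weights=(1.0, 1.0, 1.0, 1.0), clip=math.log(1000.0 / 16)):
    """Apply (dx, dy, dw, dh) to one (x1, y1, x2, y2) anchor (illustrative helper)."""
    wx, wy, ww, wh = weights
    w = anchor[2] - anchor[0]
    h = anchor[3] - anchor[1]
    cx = anchor[0] + 0.5 * w
    cy = anchor[1] + 0.5 * h
    dx, dy = deltas[0] / wx, deltas[1] / wy
    dw = min(deltas[2] / ww, clip)   # clamp so exp() cannot overflow
    dh = min(deltas[3] / wh, clip)
    pred_cx, pred_cy = dx * w + cx, dy * h + cy       # shift the centre
    pred_w, pred_h = math.exp(dw) * w, math.exp(dh) * h  # rescale the size
    return (pred_cx - 0.5 * pred_w, pred_cy - 0.5 * pred_h,
            pred_cx + 0.5 * pred_w, pred_cy + 0.5 * pred_h)

print(decode((0.0, 0.0, 0.0, 0.0), (-16, -16, 16, 16)))  # (-16.0, -16.0, 16.0, 16.0)
print(decode((0.5, 0.0, 0.0, 0.0), (-16, -16, 16, 16)))  # (0.0, -16.0, 32.0, 16.0)
```

Zero deltas return the anchor itself, and `dx = 0.5` shifts the box by half its width.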

<font color=red>The result from the RPN is the selected proposal boxes. In training mode, the losses between the anchor boxes and the ground-truth bounding boxes are computed and returned as well. </font>
```
proposals, proposal_losses = self.rpn(images, features, targets)
```

In the `RPNHead` class, the objectness and box offset predictions each come from a 1x1 convolution:
```
self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
self.bbox_pred = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1, stride=1)
```
The output of the Feature Pyramid Network is a list of tensors with the same number of channels but different heights and widths. The forward function of `RPNHead` applies the same layers to each of them in turn: 
```
def forward(self, x):
    # type: (List[Tensor])
    logits = []
    bbox_reg = []
    for feature in x:
        t = F.relu(self.conv(feature))
        logits.append(self.cls_logits(t))
        bbox_reg.append(self.bbox_pred(t))
    return logits, bbox_reg
```

### RoIHeads class
```
(roi_heads): RoIHeads(
(box_roi_pool): MultiScaleRoIAlign()
(box_head): TwoMLPHead(
  (fc6): Linear(in_features=12544, out_features=1024, bias=True)
  (fc7): Linear(in_features=1024, out_features=1024, bias=True)
)
(box_predictor): FastRCNNPredictor(
  (cls_score): Linear(in_features=1024, out_features=91, bias=True)
  (bbox_pred): Linear(in_features=1024, out_features=364, bias=True)
)
```
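The layer sizes in this printout follow from simple arithmetic. The sketch below assumes 256 FPN output channels, a 7 x 7 pooled box feature, and 91 COCO classes including background; these values are consistent with the printout but are not shown in it.

```python
fpn_channels = 256   # assumption: channels of every FPN output map
pooled_size = 7      # assumption: box_roi_pool produces 7 x 7 features per RoI
num_classes = 91     # assumption: COCO categories including background

fc6_in_features = fpn_channels * pooled_size * pooled_size  # flattened RoI feature
bbox_pred_out = num_classes * 4                             # one (dx, dy, dw, dh) per class

print(fc6_in_features, bbox_pred_out)  # 12544 364
```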
Part of the forward function in class `RoIHeads`:
```
box_features = self.box_roi_pool(features, proposals, image_shapes)
box_features = self.box_head(box_features)
class_logits, box_regression = self.box_predictor(box_features)
```
After getting the RoIs, each RoI is assigned to one FPN feature map level using the heuristic (Eq. 1) from the FPN paper.
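The level assignment can be sketched as follows. It assumes the heuristic $k = \lfloor k_0 + \log_2(\sqrt{wh} / 224) \rfloor$ with $k_0 = 4$, clamped to levels 2 to 5, plus a small epsilon for numerical stability; the function name `fpn_level` and the parameter names are hypothetical.

```python
import math

def fpn_level(box, k_min=2, k_max=5, canonical_scale=224, canonical_level=4, eps=1e-6):
    """FPN level for one (x1, y1, x2, y2) RoI with positive area (illustrative helper)."""
    w = box[2] - box[0]
    h = box[3] - box[1]
    scale = math.sqrt(w * h)
    level = math.floor(canonical_level + math.log2(scale / canonical_scale) + eps)
    return min(max(level, k_min), k_max)   # clamp to the available levels

for box in [(0, 0, 32, 32), (0, 0, 112, 112), (0, 0, 224, 224), (0, 0, 1000, 1000)]:
    print(box, fpn_level(box))
```

Small RoIs land on the high-resolution levels and large RoIs on the coarse ones.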

Then for the RoIs in the same level, convert them into features of the same size. 
```
result_idx_in_level = roi_align(
    per_level_feature, rois_per_level,
    output_size=self.output_size,
    spatial_scale=scale, sampling_ratio=self.sampling_ratio)
```

The underlying operator that gets called is:
```
torch.ops.torchvision.roi_align(input, rois, spatial_scale,
                                           output_size[0], output_size[1],
                                           sampling_ratio, aligned)
```
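What RoIAlign computes can be sketched in plain Python for a single-channel feature map: each output bin averages a small grid of bilinearly interpolated samples. This is a simplified illustration, not the torchvision kernel; it assumes no half-pixel shift and clamps samples to the feature map border, and the names `bilinear` and `roi_align_single` are made up.

```python
import math

def bilinear(feature, y, x):
    """Bilinear interpolation of a 2-D list at the real-valued point (y, x)."""
    height, width = len(feature), len(feature[0])
    y = min(max(y, 0.0), height - 1.0)   # simplification: clamp to the border
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    ly, lx = y - y0, x - x0
    top = feature[y0][x0] * (1 - lx) + feature[y0][x1] * lx
    bottom = feature[y1][x0] * (1 - lx) + feature[y1][x1] * lx
    return top * (1 - ly) + bottom * ly

def roi_align_single(feature, roi, output_size=(2, 2), spatial_scale=1.0, sampling_ratio=2):
    """Pool one (x1, y1, x2, y2) RoI into an output_size grid."""
    out_h, out_w = output_size
    x1, y1, x2, y2 = (c * spatial_scale for c in roi)   # image coords -> feature coords
    bin_h = max(y2 - y1, 1.0) / out_h
    bin_w = max(x2 - x1, 1.0) / out_w
    output = []
    for ph in range(out_h):
        row = []
        for pw in range(out_w):
            total = 0.0
            for iy in range(sampling_ratio):
                y = y1 + ph * bin_h + (iy + 0.5) * bin_h / sampling_ratio
                for ix in range(sampling_ratio):
                    x = x1 + pw * bin_w + (ix + 0.5) * bin_w / sampling_ratio
                    total += bilinear(feature, y, x)
            row.append(total / (sampling_ratio * sampling_ratio))
        output.append(row)
    return output

# Feature map whose value equals the x coordinate, so the pooled values are easy to check.
feature = [[float(c) for c in range(8)] for _ in range(8)]
print(roi_align_single(feature, (0, 0, 4, 4)))  # [[1.0, 3.0], [1.0, 3.0]]
```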

In the forward function, the offsets between the proposal boxes and the ground-truth bounding boxes are computed in the following way:
<img src="../img/ssd_loss.png" alt="drawing" width="550"/>
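Assuming the figure shows the standard R-CNN box parameterization, the offset computation for one proposal can be sketched as below. The function name `encode` is hypothetical, and the weights `(10, 10, 5, 5)` are an assumption about the defaults used in the RoI heads (the RPN would use unit weights).

```python
import math

def encode(gt_box, proposal, weights=(10.0, 10.0, 5.0, 5.0)):
    """Regression targets (dx, dy, dw, dh) that map `proposal` onto `gt_box`."""
    wx, wy, ww, wh = weights
    pw = proposal[2] - proposal[0]
    ph = proposal[3] - proposal[1]
    pcx = proposal[0] + 0.5 * pw
    pcy = proposal[1] + 0.5 * ph
    gw = gt_box[2] - gt_box[0]
    gh = gt_box[3] - gt_box[1]
    gcx = gt_box[0] + 0.5 * gw
    gcy = gt_box[1] + 0.5 * gh
    dx = wx * (gcx - pcx) / pw      # centre shift, relative to the proposal size
    dy = wy * (gcy - pcy) / ph
    dw = ww * math.log(gw / pw)     # log-scale change of width and height
    dh = wh * math.log(gh / ph)
    return (dx, dy, dw, dh)

print(encode((0, 0, 10, 10), (0, 0, 10, 10)))                             # (0.0, 0.0, 0.0, 0.0)
print(encode((5, 0, 15, 10), (0, 0, 10, 10), weights=(1.0, 1.0, 1.0, 1.0)))  # (0.5, 0.0, 0.0, 0.0)
```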
 