In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

plt.rcParams['figure.figsize'] = (12, 9)

# Object Detection

We covered some popular object detectors in the lectures. Due to limitations on computational resources, in this notebook we will not define or train our own object detector. Instead, we will take a pretrained Faster R-CNN from PyTorch's TorchVision library and use it as an example to illustrate how detectors work. Check out PyTorch's [online docs](https://pytorch.org/vision/stable/models.html) for more information on pretrained models provided by TorchVision.

## Running a Pretrained Object Detector

We will use the `fasterrcnn_resnet50_fpn_v2` model from TorchVision. It is a [Faster R-CNN](https://arxiv.org/abs/1506.01497) model with a 50-layer [ResNet](https://arxiv.org/abs/1512.03385) as its backbone. A [Feature Pyramid Network](https://arxiv.org/abs/1612.03144) is used to make the model better at detecting objects at multiple scales. The model is trained on the [COCO dataset](https://cocodataset.org/) and is able to detect objects from [80 categories](https://cocodataset.org/#explore).

In [None]:
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2, FasterRCNN_ResNet50_FPN_V2_Weights


# Create model and load the pretrained weights.
weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn_v2(weights=weights)
model.eval()

# Define the transformations used to preprocess input images.
preprocess = weights.transforms()

We will use the following image from COCO as our sample image. Don't worry: this image is from the validation set, so the model has not seen it during training. (You could argue that there is still a risk of overfitting, since the hyperparameters are tuned using this validation set. That's very true. We will still use this image for illustrative purposes, since its annotations are easy to obtain.)

In [None]:
img = plt.imread('data/000000570688.jpg')
plt.imshow(img)
plt.show()

Let's run the model on our sample image,

In [None]:
# Run inference on the sample image.
inp = torch.tensor(np.transpose(img, (2, 0, 1)))
batch = [preprocess(inp)]
with torch.no_grad():
    prediction = model(batch)[0]
category_map = weights.meta['categories']

... and display the results:

In [None]:
from matplotlib.patches import Rectangle

def visualize_detections(img, prediction, category_map, ground_truth=None,
    score_thresh=0.5, categories=None, display_text=False):
    
    plt.imshow(img)
    
    # Visualize detections.
    for i in range(len(prediction['boxes'])):
        x1, y1, x2, y2 = prediction['boxes'][i].detach().cpu().numpy() 
        label = prediction['labels'][i].item()
        score = prediction['scores'][i].item()
        category = category_map[label]
        if score > score_thresh:
            if categories and category not in categories:
                continue
            plt.gca().add_patch(Rectangle((x1, y1), x2-x1, y2-y1,
                facecolor='none', edgecolor=(1,0,0), linewidth=1))
            if display_text:
                plt.text(x1+4., y1-6., f'{category} {score:.3f}',
                    backgroundcolor=(1,0,0,0.5))
    
    # Visualize ground-truth.
    if ground_truth:
        for i in range(len(ground_truth['boxes'])):
            x1, y1, x2, y2 = ground_truth['boxes'][i]
            label = ground_truth['labels'][i]
            category = category_map[label]
            if categories and category not in categories:
                continue
            plt.gca().add_patch(Rectangle((x1, y1), x2-x1, y2-y1,
                    facecolor='none', edgecolor=(0,0,1), linewidth=1))
    
    plt.show()

visualize_detections(img, prediction, category_map, score_thresh=0.9, display_text=True)

## Evaluating the Detector's Performance

It looks like our detector is doing reasonably well on this image. Let's see how we can evaluate its performance. To make things simpler, we will focus on the "kite" category.

In [None]:
gt_kites = np.array([
    [ 58.00, 255.00,  70.00, 261.00],
    [176.80, 237.98, 257.07, 258.72],
    [331.47,  64.73, 389.83,  85.99],
    [281.05, 216.76, 422.40, 265.44],
    [222.76, 265.81, 395.02, 323.61],
    [240.93, 124.85, 264.28, 136.92],
    [431.24,  29.56, 450.01,  44.34],
    [116.24, 178.44, 136.90, 191.40],
    [232.72,  20.46, 242.42,  30.16],
    [221.00, 179.65, 453.09, 241.63],
    [461.84, 252.38, 480.25, 269.08],
    [506.64, 195.03, 523.44, 215.88],
    [323.96, 160.00, 465.92, 215.13],
    [280.32,  92.35, 326.35, 106.77],
])
ground_truth = {
    'boxes': gt_kites,
    'labels': [38 for _ in range(len(gt_kites))]
}
visualize_detections(img, prediction, category_map, ground_truth=ground_truth,
    categories=['kite'], score_thresh=0.9, display_text=False)

In this visualization, detected boxes are marked red, whereas ground-truth boxes are marked blue. You might first notice that some detected kites are not in the ground-truth annotations. This is because in COCO's annotation protocol, scenes like this are considered "crowd" and the annotations may not be exhaustive. "Crowd" areas are ignored when calculating the final metric so they do not negatively affect the evaluation. Here let's just assume the annotations are correct.

We picked a relatively high score threshold in this visualization (`score_thresh=0.9`). Generally speaking, a higher score threshold means your results will have higher *precision*, at the expense of lower *recall*. Precision is defined as the number of correct predictions divided by the total number of predictions. Recall is defined as the number of correct predictions divided by the total number of ground-truth objects. As a result of the high score threshold, some kites, like the one in the middle near the top of the image, are missed by our detector.
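To make the two definitions concrete, here is a minimal sketch that computes precision and recall from raw counts. The helper name `precision_recall` and the counts are hypothetical, chosen only for illustration; they are not measured from the sample image. Returning 0 when a denominator is zero is a convention assumed here.

```python
def precision_recall(num_true_positives, num_predictions, num_ground_truths):
    """Return (precision, recall) computed from raw counts."""
    # Precision: fraction of the predictions that are correct.
    precision = num_true_positives / num_predictions if num_predictions else 0.0
    # Recall: fraction of the ground-truth objects that were found.
    recall = num_true_positives / num_ground_truths if num_ground_truths else 0.0
    return precision, recall

# Hypothetical counts: 8 predictions above the threshold, 6 of them correct,
# and 14 ground-truth objects in total.
precision, recall = precision_recall(6, 8, 14)
print(f'precision={precision:.3f}, recall={recall:.3f}')
```

With these made-up counts, most predictions are correct (high precision) while fewer than half of the objects are found (low recall), which is the typical effect of a high score threshold.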

If we pick a lower score threshold, we could possibly recover the missed objects, resulting in higher recall. However, this may introduce more false positives. In the visualization below, the previously missed kite in the middle is now detected, but a false positive bounding box encompassing two kites also appears around it.

In [None]:
visualize_detections(img, prediction, category_map, ground_truth=ground_truth,
    categories=['kite'], score_thresh=0.7, display_text=False)

When evaluating the detector, we can choose to plot the *precision-recall* curve instead of picking one specific score threshold. To do this, we first need to sort the predictions based on their confidence scores.

In [None]:
dt_kites = []
dt_scores = []
for i in range(len(prediction['boxes'])):
    if prediction['labels'][i].item() == 38:
        dt_kites.append(prediction['boxes'][i].detach().cpu().numpy())
        dt_scores.append(prediction['scores'][i].item())
dt_kites = np.array(dt_kites)
dt_scores = np.array(dt_scores)

idxs = np.argsort(-dt_scores)
dt_kites = dt_kites[idxs]
dt_scores = dt_scores[idxs]

Whether a prediction is a true positive or a false positive depends on how well it overlaps with the ground-truth bounding boxes. Recall from the lecture that this is usually measured by calculating the IoU (intersection over union) between bounding boxes.
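As a reminder of the definition, the sketch below computes the IoU for a single pair of boxes in the `[left, top, right, bottom]` format. The helper name `iou_pair` is hypothetical, and this scalar version is only meant as a reference for checking results by hand; it is not the vectorized implementation asked for in the exercise below.

```python
def iou_pair(box_a, box_b):
    """IoU of two boxes given as [left, top, right, bottom]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union = sum of the two areas minus the doubly counted intersection.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 4 + 4 - 1 = 7.
print(iou_pair([0, 0, 2, 2], [1, 1, 3, 3]))
```

A pair-wise function like this can be handy for spot-checking a few entries of the full IoU matrix.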

(10 points) In the cell below, implement the function `calculate_ious()`. It receives two lists of bounding boxes and returns a matrix `ious` where `ious[i, j]` is the IoU between `boxes1[i]` and `boxes2[j]`. Try to implement this in vectorized form, without for-loops.

In [None]:
def calculate_ious(boxes1, boxes2):
    """Calculate IoU (intersection over union) between two bounding box lists.
    
    Args:
    - boxes1: M x 4 array representing M bounding boxes in the
        [left, top, right, bottom] format.
    - boxes2: N x 4 array representing N bounding boxes in the
        [left, top, right, bottom] format.
    
    Returns:
    - ious: M x N array where ious[i, j] is the IoU between boxes1[i] and
        boxes2[j].
    """
    # TODO
    return ious

# Calculate IoUs and visualize.
ious = calculate_ious(dt_kites, gt_kites)

from mpl_toolkits.axes_grid1 import make_axes_locatable
plt.matshow(ious)
plt.xlabel('ground-truths')
plt.ylabel('detections')
divider = make_axes_locatable(plt.gca())
cax = divider.append_axes("right", size="5%", pad=0.1)
plt.colorbar(cax=cax)
plt.show()

With the IoU values we can plot the precision-recall curve. First, we take the highest-scoring detection, compare it with all ground-truth bounding boxes, and check whether its highest IoU exceeds a threshold. If so, the detection is labeled a true positive and the matched ground-truth box is removed from further matching; otherwise it is labeled a false positive. We then move on to the second-highest-scoring detection and try to match it to the remaining ground-truth boxes. We repeat this process until there are no remaining detections or ground-truth boxes. After every detection is labeled, we can calculate the precision and recall values and plot the curve accordingly.

In [None]:
iou_threshold = 0.5
tp_labels = []
# Work on a copy so that re-running this cell does not see already-removed columns.
remaining = ious.copy()
# The detections are already sorted by confidence score.
for i in range(len(dt_kites)):
    # Find the remaining ground truth that overlaps the most.
    j = np.argmax(remaining[i, :])
    if remaining[i, j] > iou_threshold:
        # True positive if there is a match.
        tp_labels.append(True)
        # Remove the matched ground truth.
        remaining[:, j] = -np.inf
    else:
        # False positive if there is no match.
        tp_labels.append(False)

# Calculate precision and recall values.
pr = []
rc = []
tp, fp, n_gt = 0, 0, len(gt_kites)
for is_tp in tp_labels:
    if is_tp:
        tp += 1
    else:
        fp += 1
    pr.append(tp / (tp + fp))
    rc.append(tp / n_gt)

plt.figure(figsize=(6, 6))
plt.plot(rc, pr, '.-', color='b', clip_on=False)
plt.grid()
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.show()
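The greedy matching procedure is easier to follow on a tiny example. The sketch below wraps it in a hypothetical helper, `greedy_match`, and applies it to a made-up 3 x 2 IoU matrix (three detections sorted by score, two ground-truth boxes). It assumes at least one ground-truth box.

```python
import numpy as np

def greedy_match(ious, iou_threshold=0.5):
    """Label score-sorted detections as true/false positives.

    ious[i, j] is the IoU between detection i and ground truth j; rows are
    assumed to be sorted by descending confidence score.
    """
    ious = np.array(ious, dtype=float)  # copy, so the caller's matrix is untouched
    labels = []
    for i in range(ious.shape[0]):
        j = int(np.argmax(ious[i]))
        if ious[i, j] > iou_threshold:
            labels.append(True)
            ious[:, j] = -np.inf  # each ground truth can be matched only once
        else:
            labels.append(False)
    return labels

# Made-up IoUs: detections 0 and 1 both overlap ground truth 0.
toy_ious = [
    [0.8, 0.1],
    [0.7, 0.2],
    [0.0, 0.6],
]
n_gt = 2
labels = greedy_match(toy_ious)
tp = np.cumsum(labels)
precision = tp / np.arange(1, len(labels) + 1)
recall = tp / n_gt
print(labels)     # [True, False, True]
print(precision)  # approximately [1.0, 0.5, 0.667]
print(recall)     # [0.5, 0.5, 1.0]
```

Detection 1 is a false positive even though its IoU with ground truth 0 is 0.7, because that ground truth was already claimed by the higher-scoring detection 0.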

## Anchor Boxes and Region Proposals

Recall from the lecture that the R-CNN series of detectors convert the problem of detecting objects in images into sampling regions of interest (RoIs) and classifying (and adjusting) them. Faster R-CNN is usually described as a "two-stage" detector. In the first stage, the detector's region proposal network (RPN) samples a dense grid of boxes called "anchors", predicts the "objectness" scores for each anchor, and uses regression to adjust the locations of the anchors. Top scoring anchors are used as region proposals and passed to the second stage, where the classification head classifies their object category while the regression head further adjusts their locations.

In this section, we will take a look at what these anchors look like. PyTorch lets you fetch intermediate results in a model by registering hooks on its modules. In the cell below, we define some necessary hooks and run inference on the sample image again with these hooks.

In [None]:
outputs = {}

def get_output_hook(name):
    def hook_fn(module, inp, out):
        outputs[name] = out
    return hook_fn

hooked_modules = {
    'transform': model.transform,
    'anchors': model.rpn.anchor_generator,
    'rpn': model.rpn,
    'backbone': model.backbone,
}

handles = []
for name, module in hooked_modules.items():
    handles.append(module.register_forward_hook(get_output_hook(name)))
with torch.no_grad():
    prediction = model(batch)[0]
for handle in handles:
    handle.remove()

Before processing the image, the detector applies some transformations to the input image.

In [None]:
print(model.transform)
img_transformed = outputs['transform'][0].tensors.detach().cpu().numpy()[0]
print(img_transformed.shape)

The backbone of the detector contains a feature pyramid network that produces feature maps at multiple scales:

In [None]:
for name, feature_map in outputs['backbone'].items():
    print(f'{name:<4}: {feature_map.shape}')

Anchors are dense grids of box samples on these feature maps:

In [None]:
grid_sizes = []
for feature_map in outputs['backbone'].values():
    n, c, h, w = feature_map.shape
    grid_sizes.append([h, w])
print(grid_sizes)

When defining the detector, you need to specify the anchor sizes for each feature map and the aspect ratios to use. Our model generates three anchors at each grid point, with aspect ratios 0.5, 1, and 2:

In [None]:
print(model.rpn.anchor_generator.sizes)
print(model.rpn.anchor_generator.aspect_ratios)
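To see how one size and a set of aspect ratios turn into actual boxes, here is a minimal NumPy sketch. It assumes the convention that an aspect ratio is height divided by width and that every anchor keeps an area of roughly `size ** 2`; it skips details such as coordinate rounding, so it is an illustration rather than a reproduction of TorchVision's anchor generator. The helper names `base_anchors` and `grid_anchors`, and the grid size and stride in the example, are hypothetical.

```python
import numpy as np

def base_anchors(size, aspect_ratios=(0.5, 1.0, 2.0)):
    """Return one [x1, y1, x2, y2] anchor per aspect ratio, centred at the origin.

    Assumes aspect ratio = height / width and an area of size ** 2.
    """
    ratios = np.asarray(aspect_ratios, dtype=float)
    heights = size * np.sqrt(ratios)
    widths = size / np.sqrt(ratios)
    return np.stack([-widths, -heights, widths, heights], axis=1) / 2

def grid_anchors(grid_h, grid_w, stride, size, aspect_ratios=(0.5, 1.0, 2.0)):
    """Tile the base anchors over a grid_h x grid_w grid spaced by `stride` pixels."""
    shifts_x = np.arange(grid_w) * stride
    shifts_y = np.arange(grid_h) * stride
    shift_y, shift_x = np.meshgrid(shifts_y, shifts_x, indexing='ij')
    shifts = np.stack([shift_x, shift_y, shift_x, shift_y], axis=-1).reshape(-1, 1, 4)
    # Anchors belonging to the same grid point end up next to each other.
    return (shifts + base_anchors(size, aspect_ratios)[None]).reshape(-1, 4)

# Hypothetical 2 x 3 grid with stride 16 and anchor size 32.
toy_anchors = grid_anchors(grid_h=2, grid_w=3, stride=16, size=32)
print(toy_anchors.shape)  # (18, 4): 6 grid points x 3 aspect ratios
print(base_anchors(32))
```

In this layout the anchors are ordered grid point by grid point, with the three aspect ratios of one grid point stored consecutively.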

Let's visualize some anchors on the last grid:

In [None]:
anchors = outputs['anchors'][0].detach().cpu().numpy()

plt.imshow(img_transformed[0], cmap='gray')
# Show grid.
n = grid_sizes[-1][0]*grid_sizes[-1][1]
x1, y1, x2, y2 = np.split(anchors[-3*n::3], 4, axis=1)
xc = (x1 + x2) / 2
yc = (y1 + y2) / 2
plt.scatter(xc, yc, color=(0,0,1), s=10)
# Show bounding boxes of different aspect ratios on one grid point.
for x1, y1, x2, y2 in anchors[-3*(n//2):-3*(n//2)+3]:
    plt.scatter((x1 + x2) / 2, (y1 + y2) / 2, color=(1,0,0), s=10)
    plt.gca().add_patch(Rectangle((x1, y1), x2-x1, y2-y1,
        facecolor='none', edgecolor=(1,0,0), linewidth=1))
plt.show()

The RPN predicts objectness scores for each anchor and keeps the highest scoring ones. In our detector, 1000 are kept and passed to the RoI prediction heads as possible object regions. Some of them are visualized in the cell below:

In [None]:
rois = outputs['rpn'][0][0].detach().cpu().numpy()
print(rois.shape)

plt.imshow(img_transformed[0], cmap='gray')
for x1, y1, x2, y2 in rois[:200]:
    plt.gca().add_patch(Rectangle((x1, y1), x2-x1, y2-y1,
        facecolor='none', edgecolor=(1,0,0), linewidth=1))
plt.show()
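The "keep the highest scoring ones" step can be sketched in a few lines of NumPy. This is a simplified illustration with made-up boxes and scores: the actual RPN in TorchVision also applies the predicted box adjustments, clips boxes to the image, and runs non-maximum suppression, all of which are omitted here. The helper name `top_k_boxes` is hypothetical.

```python
import numpy as np

def top_k_boxes(boxes, scores, k):
    """Keep the k boxes with the highest scores, in descending score order."""
    order = np.argsort(-scores)[:k]
    return boxes[order], scores[order]

# Made-up candidate boxes and objectness scores.
toy_boxes = np.array([
    [0, 0, 10, 10],
    [5, 5, 20, 20],
    [30, 30, 40, 40],
    [1, 1, 9, 9],
], dtype=float)
toy_scores = np.array([0.2, 0.9, 0.6, 0.4])

kept_boxes, kept_scores = top_k_boxes(toy_boxes, toy_scores, k=2)
print(kept_scores)  # [0.9 0.6]
print(kept_boxes)
```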