# Developing the Region Proposal Network

## Intro, Setup

The Region proposal network was introduced in faster RCNN. Its role is to predict bounding boxes for objects in images. It works by sliding over the final feature map generated by a base convolutional network. At every point, it computes "objectness" scores and regression scores for a selection of anchor boxes. Here, we work on developing a region proposal network (RPN).

Import the necessary packages:

In [1]:
import torch
from torchvision import ops 
import einops
from itertools import product
from typing import List
from torch import nn
from einops.layers.torch import Rearrange
from torch.nn import functional as F 

The input to our network is a tensor of shape (b, 3, 1024, 1024). The final feature map has dimensions (b, 1024, 32, 32). Let us generate a dummy feature map tensor: 

In [2]:
batch_size = 1
input_size = 1024
feature_map_size = 32 
feature_dim = 1024

In [3]:
feature_map = torch.randn((batch_size, feature_dim, *(feature_map_size,) * 2))
feature_map.shape

torch.Size([1, 1024, 32, 32])

## Anchor Boxes

Our first job is to associate every point of the feature map with a number of "anchor boxes" upon which the RPN performs classification and bounding box regression. The anchor boxes are a fixed attribute of the RPN. Since the input image is downsampled by a scale of `d = input_size / feature_map_size`, the corresponding base anchor box for a point (i, j) on the feature map is:

In [4]:
d = input_size / feature_map_size

def base_image_(i, j):
    
    x = i * d
    y = j * d
    w = d 
    h = d 

    return x, y, h, w

Therefore, we can create the tensor of base images: 

In [5]:
indices = product(range(feature_map_size), range(feature_map_size))

base_boxes = torch.zeros((32, 32, 4))

for i, j in indices:
    base_boxes[i, j, :] = torch.tensor(base_image_(i, j))

In [6]:
# SANITY_CHECKS

test_input = torch.zeros((1024, 1024), dtype=bool)

indices = product(range(feature_map_size), range(feature_map_size))
for i, j in indices:
    
    x, y, w, h = base_boxes[i, j, :]
    x = x.long().item()
    y = y.long().item()
    w = w.long().item()
    h = h.long().item()
    
    test_input[ x : x + w, y : y + w ] = True
    
torch.all(test_input)

tensor(True)

Base the base image is a tensor of shape (feature_map_shape, feature_map_shape, 4). However, these are just the base anchor boxes. From these anchor boxes, we can create more anchor boxes by adjusting the scale and aspect ratio of these boxes. In the original paper, 3 scales and aspect ratios are specified. Let us give some sample scales:

In [7]:
scales = [2, 4, 8]
aspect_ratios = [1, 2, 0.5]

To generate the anchor boxes, we will work from the base anchor box. We first change the box format to specifying the center point, which will make the calculations much easier:

In [8]:
# flatten
base_boxes_converted = einops.rearrange(base_boxes, 'h w l -> ( h w ) l')
# convert 
base_boxes_converted = ops.box_convert(base_boxes_converted, 'xywh', 'cxcywh')
# unflatten
base_boxes_converted = einops.rearrange(base_boxes_converted, '( h w ) l -> h w l', h = feature_map_size, w = feature_map_size)
base_boxes_converted[0, 0, :]

tensor([16., 16., 32., 32.])

Now, the first two box coordinates specify the center of the box. We need only change the height and width!

Calculating the new height and width can be difficult. Assuming that the aspect ratio is 1, the scale simply multiplies the width and height of the base image. However, when the aspect ratio is not 1, the width and height must be calculated so that ` width/height = aspect ratio ` while the areas are the same as the square that you would get if the aspect ratio is 1, that is ` width * height = (base_width * scale) ^ 2 `. From this, we can derive the equations used in the implementation below.

In [9]:
def get_anchor_box_from_base(base_boxes, scale, aspect_ratio):
    
    x, y, w, h = base_boxes[0, 0, :]

    w_new = int( ( aspect_ratio **.5 ) * scale * w )
    h_new =  int( scale * w / ( aspect_ratio ** .5 ) )

    anchor_boxes = torch.zeros_like(base_boxes)
    
    indices = product(
        range(anchor_boxes.shape[0]), 
        range(anchor_boxes.shape[1])
    )
    
    for i, j in indices: 
        
        x, y, _, _ = base_boxes[i, j, :]
        anchor_boxes[i, j, :] = torch.tensor([x, y, w_new, h_new])

    return anchor_boxes
    
    

Let's perform some quick sanity checks:

In [10]:
# SANITY CHECKS
assert torch.all( base_boxes_converted == get_anchor_box_from_base( base_boxes_converted, 1, 1))


In [11]:
anchors = get_anchor_box_from_base( base_boxes_converted, 4, 2)
anchors[0, 0, :]

tensor([ 16.,  16., 181.,  90.])

Aspect ratio looks good, 

In [12]:
print( 181 * 90, 32 * 4 * 32 * 4)

16290 16384


Scale looks good too. 

Now, we can apply the function to each combination of scale and aspect ratio to create the desired anchor boxes. We will add an extra dimension and then concatenate them together so we end up with an anchor box tensor of shape ( num_features, num_features, k, 4 ), where k is the number of scale/aspect ratio combos.

In [13]:
anchor_boxes = []

for scale, aspect_ratio in product(scales, aspect_ratios):
    
    anchors = get_anchor_box_from_base(base_boxes_converted, scale, aspect_ratio)
    
    anchors = einops.repeat(
        anchors, 
        'n_features_1 n_features_2 four -> n_features_1 n_features_2 1 four', 
        n_features_1 = feature_map_size, 
        n_features_2 = feature_map_size, 
        four = 4, 
    )
    
    anchor_boxes.append(anchors)
    
anchor_boxes = torch.concat(anchor_boxes, dim = 2)    

Now we can convert all the boxes back to the desired format, which is 'xywh'.

In [14]:
n1, n2, k, four = anchor_boxes.shape

anchor_boxes = einops.rearrange(
    anchor_boxes, 
    'n1 n2 k four -> ( n1 n2 k ) four'
)

anchor_boxes = ops.box_convert( anchor_boxes, 'cxcywh', 'xywh')

anchor_boxes = einops.rearrange(
    anchor_boxes,
    '( n1 n2 k ) four -> n1 n2 k four', 
    n1=n1, n2=n2, k=k, four=four
)

In [15]:
anchor_boxes[0, 0, 0, :]

tensor([-16., -16.,  64.,  64.])

Unfortunately, what we end up with are some boxes which are out of the image bounds. This won't do - let's clip them. This requires conversion to 'xyxy' format again.

In [16]:
anchor_boxes = einops.rearrange(
    anchor_boxes, 
    'n1 n2 k four -> ( n1 n2 k ) four'
)

anchor_boxes = ops.box_convert( anchor_boxes, 'xywh', 'xyxy')

anchor_boxes = einops.rearrange(
    anchor_boxes,
    '( n1 n2 k ) four -> n1 n2 k four', 
    n1=n1, n2=n2, k=k, four=four
)

anchor_boxes = ops.clip_boxes_to_image(anchor_boxes, (input_size, input_size))

anchor_boxes = einops.rearrange(
    anchor_boxes, 
    'n1 n2 k four -> ( n1 n2 k ) four'
)

anchor_boxes = ops.box_convert( anchor_boxes, 'xyxy', 'xywh')

anchor_boxes = einops.rearrange(
    anchor_boxes,
    '( n1 n2 k ) four -> n1 n2 k four', 
    n1=n1, n2=n2, k=k, four=four
)

In [17]:
anchor_boxes[0, 0, 0, :]

tensor([ 0.,  0., 48., 48.])

This concludes the creation of our anchor box tensor.

### Anchor Box Algorithm

In [18]:
def create_anchor_boxes(input_size: int, feature_map_size: int,
                        scales: List[float], aspect_ratios: List[float]):
    
    d = input_size / feature_map_size

    # CREATE THE BASE IMAGE
    def base_image_(i, j):
        
        x = i * d
        y = j * d
        w = d 
        h = d 

        return x, y, h, w
    
    indices = product(range(feature_map_size), range(feature_map_size))

    base_boxes = torch.zeros((32, 32, 4))

    for i, j in indices:
        base_boxes[i, j, :] = torch.tensor(base_image_(i, j))
        
    # CONVERT TO CXCYWH
    base_boxes_converted = einops.rearrange(
        base_boxes, 'h w l -> ( h w ) l'
    )
    base_boxes_converted = ops.box_convert(
        base_boxes_converted, 'xywh', 'cxcywh'
    )
    base_boxes_converted = einops.rearrange(
        base_boxes_converted, '( h w ) l -> h w l', 
        h = feature_map_size, w = feature_map_size
    )

    # CREATE THE ANCHOR BOXES FROM THE BASE
    def get_anchor_box_from_base(base_boxes, scale, aspect_ratio):
    
        x, y, w, h = base_boxes[0, 0, :]

        w_new = int( ( aspect_ratio **.5 ) * scale * w )
        h_new =  int( scale * w / ( aspect_ratio ** .5 ) )

        anchor_boxes = torch.zeros_like(base_boxes)
        
        indices = product(
            range(anchor_boxes.shape[0]), 
            range(anchor_boxes.shape[1])
        )
        
        for i, j in indices: 
            
            x, y, _, _ = base_boxes[i, j, :]
            anchor_boxes[i, j, :] = torch.tensor([x, y, w_new, h_new])

        return anchor_boxes
    
    anchor_boxes = []

    for scale, aspect_ratio in product(scales, aspect_ratios):
        
        anchors = get_anchor_box_from_base(base_boxes_converted, scale, aspect_ratio)
        
        anchors = einops.repeat(
            anchors, 
            'n_features_1 n_features_2 four -> n_features_1 n_features_2 1 four', 
            n_features_1 = feature_map_size, 
            n_features_2 = feature_map_size, 
            four = 4, 
        )
        
        anchor_boxes.append(anchors)
        
    anchor_boxes = torch.concat(anchor_boxes, dim = 2)    
    
    # CLIP TO THE BOUNDS OF THE INPUT IMAGE - 
    # THIS REQUIRES A FEW FORMAT CONVERSIONS
    
    n1, n2, k, four = anchor_boxes.shape
    
    anchor_boxes = einops.rearrange(
        anchor_boxes, 
        'n1 n2 k four -> ( n1 n2 k ) four'
    )

    anchor_boxes = ops.box_convert( anchor_boxes, 'cxcywh', 'xyxy')

    anchor_boxes = einops.rearrange(
        anchor_boxes,
        '( n1 n2 k ) four -> n1 n2 k four', 
        n1=n1, n2=n2, k=k, four=four
    )

    anchor_boxes = ops.clip_boxes_to_image(anchor_boxes, (input_size, input_size))

    anchor_boxes = einops.rearrange(
        anchor_boxes, 
        'n1 n2 k four -> ( n1 n2 k ) four'
    )

    anchor_boxes = ops.box_convert( anchor_boxes, 'xyxy', 'xywh')

    anchor_boxes = einops.rearrange(
        anchor_boxes,
        '( n1 n2 k ) four -> n1 n2 k four', 
        n1=n1, n2=n2, k=k, four=four
    )
    
    return anchor_boxes
    

## Network Components

The region proposal network consists of three main pieces:
- a sliding window over the feature map, which also reduces its dimensionality
- a collection of `k` anchor boxes corresponding to each position of this feature map
- a classification head which attempts to classify each anchor box at each position as either background or object ('objectness score')
- a box regression head which learns a parameterized transform to move the anchor boxes closer to the true boxes when they contain objects.

Let's go over these components. Recall the network parameters:

In [19]:
image_input_size = 1024,
feature_map_size = 32
feature_dim = 1024

We also have available the anchor boxes:

In [20]:
anchor_boxes[1, 1, 0, :]

tensor([16., 16., 64., 64.])

We will start with the sliding window, which is implemented as a simple 3 x 3 convolution. We have to choose the output feature dimension of this sliding window, which is normally smaller than the feature dimensions. In the paper, ReLUs are applied to the output of this convolution. 

In [21]:
hidden_dim = 256

In [22]:
sliding_window = nn.Sequential(
    nn.Conv2d(
        in_channels = feature_dim, 
        out_channels=hidden_dim, 
        kernel_size=3, 
        padding=1
    ), 
    nn.ReLU()
)

try running it on some sample input: 

In [23]:
# SANITY CHECK
feature_map = torch.randn((batch_size, feature_dim, *(feature_map_size,) * 2))
sliding_window(feature_map).shape

torch.Size([1, 256, 32, 32])

Now, the bounding box regression component. A bounding box regressor learns parameters for a transform that maps a bounding box closer to the true bounding box. For each anchor box in each position, it will learn 4 transform parameters `t_x, t_y, t_w, t_h`. From these parameters, the anchor box `x_a, y_a, w_a, h_a` can be transformed to the proposed box `x, y, w, h` via:
- `x = w_a * t_x + x_a`
- `y = h_a * t_y + y_a `
- `w = w_a * exp( t_w )`
- `h = h_a * exp( t_h )`

The bounding box regressor is implemented using a 1x1 convolution layer followed by some simple rearranging to create the desired output, which is of the shape (b, 32, 32, k, 4). 

In [24]:
k = len(scales) * len(aspect_ratios)

bbox_regressor = nn.Sequential(
    nn.Conv2d(
        in_channels=hidden_dim, 
        out_channels=k * 4, 
        kernel_size=1
    ), 
    Rearrange(
        'b (k four) h w -> b h w k four', 
        k = k, 
        four = 4
    )
)

Make sure this produces the expected output:

In [25]:
# TEST
out_ = bbox_regressor( sliding_window(feature_map) ) 
out_.shape

torch.Size([1, 32, 32, 9, 4])

The classifier is similar, but it only outputs two numbers per anchor, representing the scores for the class 0 (background) and 1 (object)

In [26]:
classifier = nn.Sequential(
    nn.Conv2d(
        in_channels=hidden_dim, 
        out_channels=k * 2, 
        kernel_size=1
    ), 
    Rearrange(
        'b (k two) h w -> b h w k two', 
        k = k, 
        two = 2
    )
)

When we call the model, we are interested in not just the regression scores, but the proposed boxes as well. The proposed boxes are computed by applying the transform parameterized by the scores to the corresponding anchor boxes. To apply the transform, the boxes tensor has to be flattened into dimensions (N, 4) - in our case n will be ( H W K ).

In [27]:
def boxreg_transform(regression_scores, anchor_boxes, in_fmt='xywh'):
    """Apply the box regression transform along the last axis of the input.

    Args:
        regression_scores ([type]): a tensor of shape (N, 4), where the last dimension contains t_x, t_y, t_w, t_h
        anchor_boxes ([type]): a tensor of shape (N, 4) specifying the anchor boxes upon which the transofrm is being performed.
        in_fmt (str, optional): The format of the boxes. Defaults to 'xywh'.
    """
    
    # the transform requires 'cxcywh' format:
    anchor_boxes = ops.box_convert(anchor_boxes, in_fmt=in_fmt, out_fmt='cxcywh')
    
    x_a = anchor_boxes[:, 0]
    y_a = anchor_boxes[:, 1]
    w_a = anchor_boxes[:, 2]
    h_a = anchor_boxes[:, 3]
    
    t_x = regression_scores[:, 0]
    t_y = regression_scores[:, 1]
    t_w = regression_scores[:, 2]
    t_h = regression_scores[:, 3]
    
    x = ( t_x * w_a ) + x_a 
    y = ( t_y * h_a ) + y_a
    w = w_a * torch.exp(t_w)
    h = h_a * torch.exp(t_h)
    
    proposed_boxes = torch.stack( [x, y, w, h], dim=-1 )
    proposed_boxes = ops.box_convert(anchor_boxes, in_fmt='cxcywh', out_fmt=in_fmt)
    return proposed_boxes
    

We also need to be able to reverse the transform by obtaining the transform scores given anchor boxes and the proposed boxes created by the output:

In [28]:
def inverse_boxreg_transform(proposed_boxes, anchor_boxes, in_fmt='xywh'):
    """Obtain the transform parameters that would transform the given anchor boxes to the given 
    output ( proposed boxes )

    Args:
        proposed_boxes (Tensor): A tensor of shape (N, 4)
        anchor_boxes (Tensor): A tensor fo shape (N, 4)
        in_fmt (str, optional): The format of the boxes. Defaults to 'xywh'.
    """
    
    anchor_boxes = ops.box_convert(anchor_boxes, in_fmt=in_fmt, out_fmt='cxcywh')
    proposed_boxes = ops.box_convert(proposed_boxes, in_fmt=in_fmt, out_fmt='cxcywh')
    
    x_a = anchor_boxes[:, 0]
    y_a = anchor_boxes[:, 1]
    w_a = anchor_boxes[:, 2]
    h_a = anchor_boxes[:, 3]
    
    x = proposed_boxes[:, 0]
    y = proposed_boxes[:, 1]
    w = proposed_boxes[:, 2]
    h = proposed_boxes[:, 3]
    
    t_x = ( x - x_a ) / w_a
    t_y = ( y - y_a ) / h_a 
    t_w = torch.log( w / w_a )
    t_h = torch.log( h / h_a)
    
    transform_parameters = torch.stack( [t_x, t_y, t_w, t_h], dim=-1 )
    return transform_parameters
    

To obtain the list of proposed boxes, the RPN applies the transform scores obtained by its bounding box regressor layer to its anchor boxes. To estimate the "objectness" of each of these box proposals, it uses the output of its classifier layer. When using the proposals and the objectness scores, we do not care from which feature map position they came, so we will actually collapse their dimensions from ( h, w, k, ... ) to ( h * w * k, ...). Also, because the loss function for training is obtained from the regression scores directly and not from the proposed boxes, we do not keep track of gradients during the regression transform. 

We can summarize this procedure in the following function: 

In [29]:
def propose_boxes(in_feature_map, anchor_boxes):
    
    b, feature_dim, H, W = in_feature_map.shape
    
    sliding_window_output = sliding_window(in_feature_map)
    
    regression_scores = bbox_regressor(sliding_window_output)
    objectness_scores = classifier(sliding_window_output)
    
    # expand anchor boxes into batch dimension
    anchor_boxes = einops.repeat(
        anchor_boxes,
        ' h w k four -> b h w k four ', 
        b=b
    )
    
    # have to fold the outer dimensions together to apply the transform:
    b, h, w, k, four = regression_scores.shape
    
    regression_scores = einops.rearrange(
        regression_scores, 
        'b h w k four -> ( b h w k ) four'
    )
    
    anchor_boxes = einops.rearrange(
        anchor_boxes,  
        'b h w k four -> ( b h w k ) four'
    )
    
    objectness_scores = einops.rearrange(
        objectness_scores,  
        'b h w k two -> ( b h w k ) two'
    )
    
    proposed_boxes = boxreg_transform(regression_scores, anchor_boxes)
    
    proposed_boxes = einops.rearrange(
        proposed_boxes, 
        '( b h w k ) four -> b ( h w k ) four', 
        b=b, h=h, w=w, k=k, four=4
    )
    
    regression_scores = einops.rearrange(
        regression_scores, 
        '( b h w k ) four -> b ( h w k ) four', 
        b=b, h=h, w=w, k=k, four=4
    )
    
    objectness_scores = einops.rearrange(
        objectness_scores, 
        ' ( b h w k ) two -> b ( h w k ) two', 
        b=b, h=h, w=w, k=k, two=2
    )
    
    anchor_boxes = einops.rearrange(
        anchor_boxes, 
        ' ( b h w k ) four -> b ( h w k ) four', 
        b=b, h=h, w=w, k=k, four=4
    )
    
    return {
        'proposed_boxes': proposed_boxes, 
        'regression_scores': regression_scores,
        'objectness_scores': objectness_scores, 
        'anchor_boxes': anchor_boxes
    } 
    

In [30]:
# TEST
propose_boxes( feature_map, anchor_boxes )['proposed_boxes']

tensor([[[  0.0000,   0.0000,  48.0000,  48.0000],
         [  0.0000,   0.0000,  61.0000,  38.5000],
         [  0.0000,   0.0000,  38.5000,  61.0000],
         ...,
         [880.0000, 880.0000, 144.0000, 144.0000],
         [827.0000, 917.5000, 197.0000, 106.5000],
         [917.5000, 827.0000, 106.5000, 197.0000]]])

That is the base output of our region proposal network.

## Training the RPN

To train the RPN, the goal is to improve the objectness prediction and the regression scores generated by the network. To do so, we need to apply a loss function between these scores and some targets. We have objectness scores and regression scores for every single anchor box - however, we only have a small number of ground truth bounding boxes with each image. The way we generate positive and negative training examples is to look at the filtered boxes and mark them as positive (matching a ground truth box), negative (background) or unknown. This is done according to the IoU Filtering algorithm. We went through the implementation of this algorith in the "region proposal" notebook, and we copy it below for convenience: 

In [31]:
def match_proposed_boxes_to_true(
        true_boxes: torch.Tensor, 
        proposed_boxes: torch.Tensor, 
        min_num_positives: int, 
        in_format: str = 'xywh', 
        true_box_labels: torch.Tensor = None, 
        positivity_threshold: float = 0.7, 
        negativity_threshold: float = 0.3
    ):
    """Matches proposed bounding boxes to a tensor of ground truth bounding boxes
       and returns a tensor of labels indicating positive (1, object) or negative 
       (0, no object) or inconclusive (-1) for each match based on whether 
       a certain IoU threshold with a ground truth box is met. This labeling is done
       according to specified thresholds and also with a specified minimum number 
       of positives. If the positivity threshold does not generate enough positives,
       they will be generated by choosing the ones with the best overlap.

    Args:
        true_boxes (torch.Tensor): A tensor of boxes of shape (N, 4)
        
        proposed_boxes (torch.Tensor): A tensor of boxes of shape (M, 4)
        
        min_num_positives (int): minimum number of positives generated by the matching
        
        in_format (str, optional): string specifying the string format - 
        see torchvision ops documentation.Defaults to 'xywh'.
        
        box_labels (torch.Tensor, optiona): tensor of shape (N) giving the class labels
        corresponding with the ground truth boxes. 
        
        positivity_threshold (float, optional): Above this threshold a proposed box will 
        be considered to match with the ground truth. Defaults to 0.7.
        
        negativity_threshold (float, optional): below this threshold a box will be considered 
        to be background. Defaults to 0.3.

    Returns:
        [type]: [description]
    """
    assert len(true_boxes.shape) == 2
    assert len(proposed_boxes.shape) == 2
    
    num_true_boxes, _ = true_boxes.shape
    num_proposed_boxes, _ = proposed_boxes.shape

    ious = ops.box_iou(
        ops.box_convert(proposed_boxes, in_fmt=in_format, out_fmt='xyxy'),
        ops.box_convert(true_boxes, in_fmt=in_format, out_fmt='xyxy')
    )
    
    max_ious = torch.max(ious, dim=-1, )
    matching_true_boxes = true_boxes[max_ious.indices]
    if true_box_labels is not None:
        matching_true_box_labels = true_box_labels[max_ious.indices]
    
    labels = (torch.ones_like(max_ious.values) * -1).long()
    indices = torch.tensor(range(len(labels)))

    positive_indices = indices[max_ious.values >= positivity_threshold]
    if len(positive_indices) < min_num_positives:
        positive_indices = torch.sort(max_ious.values, dim=-1, descending=True).indices[:min_num_positives]

    if true_box_labels is not None: 
        labels[positive_indices] = matching_true_box_labels[positive_indices] 
    else:
        labels[positive_indices] = 1

    negative_indices = indices[max_ious.values < negativity_threshold]

    labels[negative_indices] = 0
    
    return {
        'matching_true_boxes': matching_true_boxes, 
        'proposed_boxes': proposed_boxes,
        'labels': labels
    }

When training the RPN, we would get a batch which consists of ground truth boxes and input feature maps. Instead of storing the ground truth bounding boxes as a tensor with the batch size in the 0'th dimension, we need to store them either as a list of tensors, as a tensor with an extra column indexing which batch it came from, or as a tensor together with an indexing tensor - otherwise we may be dealing with the issue of tensors with different lengths.

In [32]:
def random_boxes(num_boxes, format='xywh'):

    xy = torch.randint(0, 100, (num_boxes, 2))
    wh = torch.randint_like(xy, 200) + xy
    
    boxes = torch.concat([xy, wh], dim=-1)
    return ops.box_convert(boxes, in_fmt='xyxy', out_fmt=format)


This is the kind of thing we need to do to batch the ground truth boxes: 

In [33]:
batch_size = 2

gt_boxes_b1 = random_boxes(2)
gt_boxes_b2 = random_boxes(3)

true_boxes = [gt_boxes_b1, gt_boxes_b2]

indices = []
for i in range(len(true_boxes)):
    indices.extend( [i] * len(true_boxes[i]) )
indices = torch.tensor( indices )
true_boxes = torch.concat(true_boxes, axis=0)

in_feature_map = torch.randn( (2, 1024, 32, 32) )

batch = feature_map, true_boxes, indices 

The above will have to be implemented as part of a manual collate-fn in the dataloader for this network. For now, let's focus on taking computing the loss. This happens in 3 steps: 
- The network computes the proposed boxes, objectness and regression scores.
- The proposed boxes are filtered with reference to the ground truth boxes to label them as negative, positive, or unknown
- The loss is computed based on localization and classification losses between the proposed boxes and matched true boxes.


In [34]:
out_dict = propose_boxes(in_feature_map, anchor_boxes)
out_dict['proposed_boxes'].shape, out_dict['regression_scores'].shape, out_dict['objectness_scores'].shape, out_dict['anchor_boxes'].shape

(torch.Size([2, 9216, 4]),
 torch.Size([2, 9216, 4]),
 torch.Size([2, 9216, 2]),
 torch.Size([2, 9216, 4]))

Our IoU matching algorithm works on a single batch at a time. Let's go through it for the first batch:

In [35]:
torch.manual_seed(0)

<torch._C.Generator at 0x105812d70>

In [36]:
batch = 0

In [37]:
true_boxes_for_batch = true_boxes[indices == batch]
true_boxes_for_batch

tensor([[ 88,  26, 153,  18],
        [ 38,  97,   7,  21]])

In [38]:
match_box_output = match_proposed_boxes_to_true(
    true_boxes_for_batch, 
    out_dict['proposed_boxes'][batch], 
    min_num_positives=64
)

labels_for_batch = match_box_output['labels']
matching_true_boxes_for_batch = match_box_output['matching_true_boxes']


we also have to compute the regression scores that correspond to the matching ground truth boxes relative to the anchor boxes:

In [39]:
target_regression_scores = inverse_boxreg_transform(match_box_output['matching_true_boxes'], out_dict['anchor_boxes'][batch])


We have to select the subset of these scores upon which the loss will be computed:

In [40]:
num_examples_for_loss = 256

In [41]:
indices_for_loss = torch.tensor(range(len(labels_for_batch))).long()

indices_for_loss_positive = indices_for_loss[labels_for_batch == 1]
indices_for_loss_negative = indices_for_loss[labels_for_batch == 0]

indices_for_loss_positive = indices_for_loss_positive[torch.randperm(len(indices_for_loss_positive))]
if len(indices_for_loss_positive) >= num_examples_for_loss//2:
    indices_for_loss_positive = indices_for_loss_positive[:num_examples_for_loss//2]

indices_for_loss_negative = indices_for_loss_negative[torch.randperm(len(indices_for_loss_negative))]
indices_for_loss_negative = indices_for_loss_negative[:num_examples_for_loss - len(indices_for_loss_positive)]

indices_for_loss = torch.concat([indices_for_loss_negative, indices_for_loss_positive])


In [42]:
out_dict['regression_scores'].shape

torch.Size([2, 9216, 4])

In [43]:
reg_scores_for_loss = out_dict['regression_scores'][batch][indices_for_loss]
target_reg_scores_for_loss = target_regression_scores[indices_for_loss]

objectness_scores_for_loss = out_dict['objectness_scores'][batch][indices_for_loss]
labels_for_loss = labels_for_batch[indices_for_loss]

The overall algorithm for selecting training examples is given below: 


In [44]:
def select_training_examples(
    proposed_boxes: torch.Tensor, 
    anchor_boxes: torch.Tensor, 
    regression_scores: torch.Tensor, 
    objectness_scores: torch.Tensor, 
    true_boxes: torch.Tensor, 
    num_training_examples,
    min_num_positives, 
    positivity_threshold: float = 0.7, 
    negativity_threshold: float = 0.3, 
    true_box_labels = None
):

    box_matching_output = match_proposed_boxes_to_true(
        true_boxes, 
        proposed_boxes, 
        min_num_positives, 
        true_box_labels=true_box_labels,
        positivity_threshold=positivity_threshold, 
        negativity_threshold=negativity_threshold
    )

    labels = box_matching_output['labels']
    matching_true_boxes = box_matching_output['matching_true_boxes']
    
    target_regression_scores = inverse_boxreg_transform(
        matching_true_boxes, anchor_boxes
    )
    
    indices_for_loss = torch.tensor(range(len(labels))).long()

    indices_for_loss_positive = indices_for_loss[labels > 0]
    indices_for_loss_negative = indices_for_loss[labels == 0]

    indices_for_loss_positive = indices_for_loss_positive[torch.randperm(len(indices_for_loss_positive))]
    if len(indices_for_loss_positive) >= num_training_examples//2:
        indices_for_loss_positive = indices_for_loss_positive[:num_training_examples//2]

    indices_for_loss_negative = indices_for_loss_negative[torch.randperm(len(indices_for_loss_negative))]
    indices_for_loss_negative = indices_for_loss_negative[:num_training_examples - len(indices_for_loss_positive)]

    indices_for_loss = torch.concat([indices_for_loss_negative, indices_for_loss_positive])
    
    reg_scores_for_loss = regression_scores[indices_for_loss]
    target_reg_scores_for_loss = target_regression_scores[indices_for_loss]

    objectness_scores_for_loss = objectness_scores[indices_for_loss]
    labels_for_loss = labels[indices_for_loss]

    return {
        'regression_scores': reg_scores_for_loss, 
        'target_regression_scores': target_reg_scores_for_loss, 
        'objectness_scores': objectness_scores_for_loss, 
        'labels': labels_for_loss,
        'indices': indices_for_loss,
    }
    

In [61]:
torch.manual_seed(0)

training_examples = select_training_examples(
    out_dict['proposed_boxes'][batch], 
    out_dict['anchor_boxes'][batch],
    out_dict['regression_scores'][batch],
    out_dict['objectness_scores'][batch],
    true_boxes[indices == batch],
    num_training_examples=256,
    min_num_positives=64, 
)



Finally, we can compute the loss function for the batch using the loss function, which is a combination of cross-entropy loss between the objectness scores and targets,
as well as the smooth l1 score between the regression and target regression scores.

## Loss Function

The loss function is fairly straightforward, it is the sum of the localization and classification loss. The classification loss is the cross-entropy between the objectness scores and the target labels:

In [47]:
def classification_loss(class_scores, labels):
    return F.cross_entropy(class_scores, labels.long())

classification_loss(objectness_scores_for_loss, labels_for_loss)

tensor(0.7276, grad_fn=<NllLossBackward0>)

The regression loss is obtained using the "smooth L1" norm of the difference between the true regression scores and the target regression scores, but is only applied to those boxes which are not in the background class. Here is the smooth l1 norm:

In [48]:
def smooth_l1_norm(tensor, dim=-1):
    """Calculates the smooth l1 norm along the specified dimension of the tensor"""
    x = tensor
    x = torch.where(
        torch.abs(x) < 1, 
        .5 * x**2,
        torch.abs(x) - 0.5
    )
    
    return torch.sum(x, dim=dim)

Here is the regression loss (which requires the labels together with the regression scores)

In [49]:
def box_regression_loss(regression_scores, target_regression_scores, labels):
    
    # select non-background indices 
    non_background_indices = labels != 0
    
    regression_scores = regression_scores[non_background_indices]
    target_regression_scores = target_regression_scores[non_background_indices]
    
    norms = smooth_l1_norm(regression_scores - target_regression_scores)
    
    # get average l1 norm 
    return einops.reduce( norms, 'n -> ()', 'mean')
        

In [50]:
box_regression_loss(reg_scores_for_loss, target_reg_scores_for_loss, labels_for_loss)

tensor([0.8550], grad_fn=<ReshapeAliasBackward0>)

They overall loss is their sum, weighted by a parameter lambda which is specified as a hyperparameter of the training system. Let us write it out as a class:

In [51]:
class RCNNLoss(nn.Module):
    
    def __init__(self, lambda_=1):
        
        super().__init__()
        
        self.lambda_ = lambda_

    def forward(self, class_scores, regression_scores, 
                target_regression_scores, labels):
        
        cls_loss = classification_loss(class_scores, labels)
        loc_loss = box_regression_loss(regression_scores, target_regression_scores, labels)    
        
        return cls_loss + self.lambda_ * loc_loss
    

In [52]:
RCNNLoss()(objectness_scores_for_loss, reg_scores_for_loss,
           target_reg_scores_for_loss, labels_for_loss)

tensor([1.5826], grad_fn=<AddBackward0>)

In [63]:
RCNNLoss()(training_examples['objectness_scores'], training_examples['regression_scores'], 
           training_examples['target_regression_scores'], training_examples['labels']
           )

tensor([1.5826], grad_fn=<AddBackward0>)

## Using the RPN

Assume we have a fully-trained RPN that predicts accurate proposed boxes together with objectness scores. To use this box proposals, we must filter them in two ways:
- throwing away box proposals that overlap too much with other box proposals with greater objectness scores (non-max suppression)
- Throwing away boxes that do not have a high enough objectness score to be considered an object. 

Let's go through both of these steps on the box proposals of a single image.

We have the out-dict containing proposed boxes with objectness scores. Let's use the first image from the batch:

In [53]:
proposed_boxes = out_dict['proposed_boxes'][0]
objectness_scores = out_dict['objectness_scores'][0]

iou_threshold = 0.7

In [54]:
proposed_boxes.shape, objectness_scores.shape

(torch.Size([9216, 4]), torch.Size([9216, 2]))

convert the objectness scores to foreground probability:

In [55]:
object_prob = F.softmax(objectness_scores, dim=-1)[:, 1]

Apply nms based on the object probabilities. Torchvision expects the box format to be 'xyxy', so we have to convert first:

In [56]:
indices_to_keep = ops.nms(ops.box_convert(proposed_boxes, in_fmt='xywh', out_fmt='xyxy'), object_prob, iou_threshold=iou_threshold)

In [57]:
proposed_boxes = proposed_boxes[indices_to_keep]
objectness_scores = objectness_scores[indices_to_keep]
object_prob = object_prob[indices_to_keep]

In summary, we have:

In [58]:
def apply_nms_to_region_proposals(
    proposed_boxes: torch.Tensor, 
    objectness_scores: torch.Tensor,
    iou_threshold: float,
):
    """applies nms to remove overlapping boxes with a lower objectness score

    Args:
        proposes_boxes (torch.Tensor): (N, 4) tensor of boxes in format xywh
        objectness_scores (torch.Tensor): (N, 2) tensor of objectness scores
        iou_threshold: The iou threshold for proposed boxes to be considered overlapping. 
        
    Returns: 
        a dict containing the proposed boxes, the objectness scores, and the object probability
        after nms. 
    """
    
    object_prob = F.softmax(objectness_scores, dim=-1)[:, 1]
    
    indices_to_keep = ops.nms(
        ops.box_convert(proposed_boxes, in_fmt='xywh', out_fmt='xyxy'),
        object_prob,
        iou_threshold=iou_threshold
    )
    
    return {
        'proposed_boxes': proposed_boxes[indices_to_keep],
        'objectness_scores': objectness_scores[indices_to_keep],
        'object_probs': object_prob[indices_to_keep]
    }
    

In [81]:
nms_output = apply_nms_to_region_proposals(
    out_dict['proposed_boxes'][0],
    out_dict['objectness_scores'][0], 
    iou_threshold=0.7
)

In [82]:
nms_output

{'proposed_boxes': tensor([[707.0000, 121.5000,  90.0000,  45.0000],
         [  0.0000, 496.0000, 176.0000, 256.0000],
         [661.5000, 643.0000, 181.0000,  90.0000],
         ...,
         [688.0000, 400.0000,  64.0000,  64.0000],
         [176.0000, 176.0000,  64.0000,  64.0000],
         [ 99.0000,  85.5000,  90.0000, 181.0000]]),
 'objectness_scores': tensor([[-0.7585,  0.5623],
         [-0.8518,  0.4643],
         [-0.6227,  0.4560],
         ...,
         [ 0.4460, -0.5735],
         [ 0.5149, -0.5081],
         [ 0.6155, -0.6413]], grad_fn=<IndexBackward0>),
 'object_probs': tensor([0.7893, 0.7885, 0.7462,  ..., 0.2651, 0.2644, 0.2215],
        grad_fn=<IndexBackward0>)}

## Average Precision Calculation

In [65]:
from torchmetrics.detection.map import MeanAveragePrecision

metric = MeanAveragePrecision(box_format='xywh')


In [70]:
metric_preds = [{
    'boxes': nms_output['proposed_boxes'], 
    'scores': nms_output['object_probs'], 
    'labels': torch.IntTensor([0] * len(nms_output['proposed_boxes']))
}]

In [78]:
metric_targets = [
    {'boxes': true_boxes, 'labels': torch.IntTensor([0]* len(true_boxes))}
]

In [79]:
metric(metric_preds, metric_targets)['map']

tensor(0.)