https://medium.com/@fractaldle/guide-to-build-faster-rcnn-in-pytorch-95b10c273439
内容来自以上连接

另一个参考连接：https://medium.com/@fractaldle/brief-overview-on-object-detection-algorithms-ec516929be93

# Feature Extraction

We begin with an image and a set of bounding boxes along with its label as defined below.

In [1]:
import torch
iamge = torch.zeros((1, 3, 800, 800)).float()
bbox = torch.FloatTensor([[20, 30, 400, 500], [300, 400, 500, 600]])
# [y1, x1, y2, x2] format
labels = torch.LongTensor([6, 8]) # 0 represents background
sub_sample = 16



1.Create a dummy image and set the volatile to be False.

In [3]:
import torchvision
dummy_img = torch.zeros((1, 3, 800, 800)).float()
print(dummy_img.shape)

torch.Size([1, 3, 800, 800])


2. List all the layers of the VGG16.

In [5]:
model = torchvision.models.vgg16(pretrained=True)
fe = list(model.features)
print(fe)

[Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False), Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False), Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False), Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)), ReLU(inplace=True), Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1

3. Pass the image through the layers and check where you are getting this size.

In [6]:
req_features = []
k = dummy_img.clone()
for i in fe:
    k = i(k)
    if k.size()[2] < 800//16:
        break
    req_features.append(i)
    out_channels = k.size()[1]
    
print(len(req_features))
print(out_channels)

30
512


4. Convert this list into a Sequential module.

In [8]:
import torch.nn as nn
faster_rcnn_fe_extractor = nn.Sequential(*req_features)

Now this faster_rcnn_fe_extractor can be used as our backend. Lets compute the features

In [10]:
out_map = faster_rcnn_fe_extractor(dummy_img)
print(out_map.size())

torch.Size([1, 512, 50, 50])


# Anchor boxes

This is our first encounter with anchor boxes. A detailed understanding of anchor boxes will allow us to understand object detection very easily. So lets talk in detail on how this is done.

1,Generate Anchor at a feature map location
2,Generate Anchor at all the feature map location.
3,Assign the labels and location of objects (with respect to the anchor) to each and every anchor.
4,Generate Anchor at a feature map location

We will use anchor_scales of 8, 16, 32, ratio of 0.5, 1, 2 and sub sampling of 16 (Since we have pooled our image from 800 px to 50px). Now every pixel in the output feature map maps to corresponding 16 * 16 pixels in the image.

We need to generate anchor boxes on top of this 16 * 16 pixels first and similarly do along x-axis and y-axis to get all the anchor boxes. This is done in the step-2.

At each pixel location on the feature map, We need to generate 9 anchor boxes (number of anchor_scales and number of ratios) and each anchor box will have ‘y1’, ‘x1’, ‘y2’, ‘x2’. So at each location anchor will have a shape of (9, 4). Lets begin with a an empty array filled with zero values.


In [18]:
import numpy as np
ratios = [0.5, 1, 2]
anchor_scales = [8, 16, 32]
anchor_base = np.zeros((len(ratios) * len(anchor_scales), 4), dtype=np.float32)

print(anchor_base)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


Lets fill these values with corresponding y1, x1, y2, x2 at each anchor_scale and ratios. Our center for this base anchor will be at

In [19]:
ctr_y = sub_sample / 2
ctr_x = sub_sample / 2

print(ctr_y, ctr_x)

for i in range(len(ratios)):
    for j in range(len(anchor_scales)):
        h = sub_sample * anchor_scales[j] * np.sqrt(ratios[i])
        w = sub_sample * anchor_scales[j] * np.sqrt(1./ ratios[i])
        
        index = i * len(anchor_scales) + j
        
        anchor_base[index, 0] = ctr_y - h / 2
        anchor_base[index, 1] = ctr_x - w / 2
        anchor_base[index, 2] = ctr_y + h / 2
        anchor_base[index, 3] = ctr_x + w / 2
        
        
print(anchor_base)

8.0 8.0
[[ -37.254833  -82.50967    53.254833   98.50967 ]
 [ -82.50967  -173.01933    98.50967   189.01933 ]
 [-173.01933  -354.03867   189.01933   370.03867 ]
 [ -56.        -56.         72.         72.      ]
 [-120.       -120.        136.        136.      ]
 [-248.       -248.        264.        264.      ]
 [ -82.50967   -37.254833   98.50967    53.254833]
 [-173.01933   -82.50967   189.01933    98.50967 ]
 [-354.03867  -173.01933   370.03867   189.01933 ]]


These are the anchor locations at the first feature map pixel, we have to now generate these anchors at all the locations of feature map. Also note that negitive values mean that the anchor boxes are outside image dimension. In the later section we will label them with -1 and remove them when calculating the loss the functions and generating proposals for anchor boxes. Also Since we got 9 anchors at each location and there 50 * 50 such locations inside an image, We will get 17500 (50 * 50 * 9) anchors in total. Lets generate other anchors now,

2. Generate Anchor at all the feature map location.
In-order to do this, we need to first generate the centres for each and every feature map pixel.

In [42]:
fe_size = (800//16)
ctr_x = np.arange(16, (fe_size+1) * 16, 16)
ctr_y = np.arange(16, (fe_size+1) * 16, 16)

print(len(ctr_x))
print(len(ctr_y))

50
50


Looping through the ctr_x and ctr_y will give us the centers at each and every location. The sudo code is as a below

In [50]:
index = 0
#ctr = list((len(ctr_x), len(ctr_y)))
#print(size(ctr))
#ctr = np.zeros((len(ctr_x), len(ctr_y)))
ctr = []
for x in range(len(ctr_x)):
    for y in range(len(ctr_y)):
        #ctr[index,1] = ctr_x[x] - 8
        #ctr[index,0] = ctr_y[y] - 8
        ctr.append((ctr_y[y] - 8, ctr_x[x] - 8))
        index +=1
        
print(len(ctr))
print(ctr)

2500
[(8, 8), (24, 8), (40, 8), (56, 8), (72, 8), (88, 8), (104, 8), (120, 8), (136, 8), (152, 8), (168, 8), (184, 8), (200, 8), (216, 8), (232, 8), (248, 8), (264, 8), (280, 8), (296, 8), (312, 8), (328, 8), (344, 8), (360, 8), (376, 8), (392, 8), (408, 8), (424, 8), (440, 8), (456, 8), (472, 8), (488, 8), (504, 8), (520, 8), (536, 8), (552, 8), (568, 8), (584, 8), (600, 8), (616, 8), (632, 8), (648, 8), (664, 8), (680, 8), (696, 8), (712, 8), (728, 8), (744, 8), (760, 8), (776, 8), (792, 8), (8, 24), (24, 24), (40, 24), (56, 24), (72, 24), (88, 24), (104, 24), (120, 24), (136, 24), (152, 24), (168, 24), (184, 24), (200, 24), (216, 24), (232, 24), (248, 24), (264, 24), (280, 24), (296, 24), (312, 24), (328, 24), (344, 24), (360, 24), (376, 24), (392, 24), (408, 24), (424, 24), (440, 24), (456, 24), (472, 24), (488, 24), (504, 24), (520, 24), (536, 24), (552, 24), (568, 24), (584, 24), (600, 24), (616, 24), (632, 24), (648, 24), (664, 24), (680, 24), (696, 24), (712, 24), (728, 24), (7

The output will be the (x, y) value at each location as shown in the image above. Together we have 2500 anchor centers. Now at each center we need to generate the anchor boxes. This can be done using the code we have used for generating anchor at one location, adding an extract for loop for supplying centers of each anchor will do. Lets see how this is done

In [51]:
anchors = np.zeros((fe_size * fe_size * 9, 4), dtype=np.float32)
index = 0
for c in ctr:
    ctr_y, ctr_x = c
    for i in range(len(ratios)):
        for j in range(len(anchor_scales)):
            h = sub_sample * anchor_scales[j] * np.sqrt(ratios[i])
            w = sub_sample * anchor_scales[j] * np.sqrt(1./ ratios[i])
      
            anchors[index, 0] = ctr_y - h / 2
            anchors[index, 1] = ctr_x - w / 2
            anchors[index, 2] = ctr_y + h / 2
            anchors[index, 3] = ctr_x + w / 2
            index += 1
print(anchors.shape)

(22500, 4)


Assign the labels and location of objects (with respect to the anchor) to each and every anchor.

Now since we have generated all the anchor boxes, we need to look at the objects inside the image and assign them to the specific anchor boxes which contain them. Faster_R-CNN has some guidelines to assign labels to the anchor boxes
We assign a positive label to two kind of anchors a) The anchor/anchors with the highest Intersection-over-Union(IoU) overlap with a ground-truth-box or b) An anchor that has an IoU overlap higher than 0.7 with ground-truth box.
Note that single ground-truth object may assign positive labels to multiple anchors.
c) We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. d) Anchors that are neither positive nor negitive do not contribute to the training objective.
Lets see how this is done.

In [52]:
bbox = np.asarray([[20, 30, 400, 500], [300, 400, 500, 600]], dtype=np.float32) # [y1, x1, y2, x2] format
labels = np.asarray([6, 8], dtype=np.int8) # 0 represents background
print(labels)

[6 8]


We will assign the labels and locations for the anchor boxes in the following ways.
Find the indexes of valid anchor boxes and create an array with these indexes. create an label array with shape index array filled with -1.
check weather one of the above conditition a, b, c is statisfying or not and fill the label accordingly. Incase of positive anchor box (label is 1), Note which ground truth object has resulted in this.
calculate the locations (loc) of ground truth associated with the anchor box wrt to the anchor box.
Reorganize all anchor boxes by filling with -1 for all unvalid anchor boxes and values we have calculated for all valid anchor boxes.
Outputs should be labels with (N, 1) array and locs with (N, 4) array.

Find the index of all valid anchor boxes

In [53]:
index_inside = np.where(
        (anchors[:, 0] >= 0) &
        (anchors[:, 1] >= 0) &
        (anchors[:, 2] <= 800) &
        (anchors[:, 3] <= 800)
    )[0]
print(index_inside.shape)

(8940,)


create an empty label array with inside_index shape and fill with -1. Default is set to (d)

In [55]:
label = np.empty((len(index_inside),), dtype=np.int32)
label.fill(-1)
print(label.shape)
#print(label)

(8940,)
[-1 -1 -1 ... -1 -1 -1]


create an array with valid anchor boxes

In [56]:
valid_anchor_boxes = anchors[index_inside]
print(valid_anchor_boxes.shape)
#Out = (8940, 4)

(8940, 4)


For each valid anchor box calculate the iou with each ground truth object. 
Since we have 8940 anchor boxes and 2 ground truth objects, we should get an array with (8490, 2) as the output. The sudo code for calculating iou between two boxes will be

The python code for calculating the ious is as follows

In [58]:
ious = np.empty((len(valid_anchor_boxes), 2), dtype=np.float32)
ious.fill(0)
print(bbox)
for num1, i in enumerate(valid_anchor_boxes):
    ya1, xa1, ya2, xa2 = i  
    anchor_area = (ya2 - ya1) * (xa2 - xa1)
    for num2, j in enumerate(bbox):
        yb1, xb1, yb2, xb2 = j
        box_area = (yb2- yb1) * (xb2 - xb1)
        inter_x1 = max([xb1, xa1])
        inter_y1 = max([yb1, ya1])
        inter_x2 = min([xb2, xa2])
        inter_y2 = min([yb2, ya2])
        if (inter_x1 < inter_x2) and (inter_y1 < inter_y2):
            iter_area = (inter_y2 - inter_y1) * \
(inter_x2 - inter_x1)
            iou = iter_area / \
(anchor_area+ box_area - iter_area)            
        else:
            iou = 0.
        ious[num1, num2] = iou
print(ious.shape)

[[ 20.  30. 400. 500.]
 [300. 400. 500. 600.]]
(8940, 2)


Note: Using numpy arrays, these calculations can be done much more efficiently and with less verbose. However I try to keep here in this way so that people without strong Algebra can also understand.
Considering the scenarios of a and b, we need to find two things here

-- the highest iou for each gt_box and its corresponding anchor box
-- the highest iou for each anchor box and its corresponding ground truth box

In [61]:
gt_argmax_ious = ious.argmax(axis=0)
print(gt_argmax_ious)

gt_max_ious = ious[gt_argmax_ious, np.arange(ious.shape[1])]
print(gt_max_ious)
print(np.arange(ious.shape[1]))

[2262 5620]
[0.68130493 0.61035156]
[0 1]


case-2

In [63]:
argmax_ious = ious.argmax(axis=1)
print(argmax_ious.shape)
print(argmax_ious)
max_ious = ious[np.arange(len(index_inside)), argmax_ious]
print(max_ious)

(8940,)
[0 0 0 ... 0 0 0]
[0.06811669 0.07083762 0.07083762 ... 0.         0.         0.        ]


Find the anchor_boxes which have this max_ious (gt_max_ious)

In [64]:
gt_argmax_ious = np.where(ious == gt_max_ious)[0]
print(gt_argmax_ious)

[2262 2508 5620 5628 5636 5644 5866 5874 5882 5890 6112 6120 6128 6136
 6358 6366 6374 6382]


Now we have three arrays
argmax_ious — Tells which ground truth object has max iou with each anchor.
max_ious — Tells the max_iou with ground truth object with each anchor.
gt_argmax_ious — Tells the anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box.

Using argmax_ious and max_ious we can assign labels and locations to anchor boxes which satisify [b] and [c]. Using gt_argmax_ious we can assign labels and locations to anchor boxes which satisify [a].

Lets put thresholds to some variables

In [65]:
pos_iou_threshold  = 0.7
neg_iou_threshold = 0.3

Assign negitive label (0) to all the anchor boxes which have max_iou less than negitive threshold [c]

In [66]:
label[max_ious < neg_iou_threshold] = 0

Assign positive label (1) to all the anchor boxes which have highest IoU overlap with a ground-truth box [a]

In [67]:
label[gt_argmax_ious] = 1

Assign positive label (1) to all the anchor boxes which have max_iou greater than positive threshold [b]

In [68]:
label[max_ious >= pos_iou_threshold] = 1

Training RPN The Faster_R-CNN paper phrases as follows Each mini-batch arises from a single image that contains many positive and negitive example anchors, but this will bias towards negitive samples as they are dominate. Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negitive ones.. From this we can derive two variable as follows

In [69]:
pos_ratio = 0.5
n_sample = 256

Total positive samples

In [70]:
n_pos = pos_ratio * n_sample

Now we need to randomly sample n_pos samples from the positive labels and ignore (-1) the remaining ones. In some cases we get less than n_pos samples, in that we will randomly sample (n_sample — n_pos) negitive samples (0) and assign ignore label to the remaining anchor boxes. This is done using the following code.

positive samples

In [71]:
pos_index = np.where(label == 1)[0]
if len(pos_index) > n_pos:
    disable_index = np.random.choice(pos_index, size=(len(pos_index) - n_pos), replace=False)
    label[disable_index] = -1

negitive samples

In [73]:
n_neg = n_sample - np.sum(label == 1)
neg_index = np.where(label == 0)[0]
if len(neg_index) > n_neg:
    disable_index = np.random.choice(neg_index, size=(len(neg_index) - n_neg), replace = False)
    label[disable_index] = -1

Assigning locations to anchor boxes
Now lets assign the locations to each anchor box with the ground truth object which has maximum iou. Note, we will assign anchor locs to all the valid anchor boxes irrespective of its label, later when we are calculating the losses, we can remove them with simple filters.

We already know which ground truth object has high iou with each anchor box, Now we need to find the locations of ground truth with respect to the anchor box location. Faster_R-CNN uses the following parametrizion for this

t_{x} = (x - x_{a})/w_{a}
t_{y} = (y - y_{a})/h_{a}
t_{w} = log(w/ w_a)
t_{h} = log(h/ h_a)

x, y , w, h are the groud truth box center co-ordinates, width and height. x_a, y_a, h_a and w_a and anchor boxes center cooridinates, width and height.

For each anchor box, find the groundtruth object which has max_iou

In [74]:
max_iou_bbox = bbox[argmax_ious]
print(max_iou_bbox)
#Out
# [[ 20.,  30., 400., 500.],
#  [ 20.,  30., 400., 500.],
#  [ 20.,  30., 400., 500.],
#  ...,
#  [ 20.,  30., 400., 500.],
#  [ 20.,  30., 400., 500.],
#  [ 20.,  30., 400., 500.]]

[[ 20.  30. 400. 500.]
 [ 20.  30. 400. 500.]
 [ 20.  30. 400. 500.]
 ...
 [ 20.  30. 400. 500.]
 [ 20.  30. 400. 500.]
 [ 20.  30. 400. 500.]]


Inorder to find t_{x}, t_{y}, t_{w}, t_{h}, we need to convert the y1, x1, y2, x2 format of valid anchor boxes and associated ground truth boxes with max iou to ctr_y, ctr_x , h, w format.

In [76]:
height = valid_anchor_boxes[:, 2] - valid_anchor_boxes[:, 0]
width = valid_anchor_boxes[:, 3] - valid_anchor_boxes[:, 1]
ctr_y = valid_anchor_boxes[:, 0] + 0.5 * height
ctr_x = valid_anchor_boxes[:, 1] + 0.5 * width
base_height = max_iou_bbox[:, 2] - max_iou_bbox[:, 0]
base_width = max_iou_bbox[:, 3] - max_iou_bbox[:, 1]
base_ctr_y = max_iou_bbox[:, 0] + 0.5 * base_height
base_ctr_x = max_iou_bbox[:, 1] + 0.5 * base_width

Use the above formulas to find the loc

In [77]:
eps = np.finfo(height.dtype).eps
height = np.maximum(height, eps)
width = np.maximum(width, eps)
dy = (base_ctr_y - ctr_y) / height
dx = (base_ctr_x - ctr_x) / width
dh = np.log(base_height / height)
dw = np.log(base_width / width)
anchor_locs = np.vstack((dy, dx, dh, dw)).transpose()
print(anchor_locs)
#Out:
# [[ 0.5855727   2.3091455   0.7415673   1.647276  ]
#  [ 0.49718437  2.3091455   0.7415673   1.647276  ]
#  [ 0.40879607  2.3091455   0.7415673   1.647276  ]
#  ...
#  [-2.50802    -5.292254    0.7415677   1.6472763 ]
#  [-2.5964084  -5.292254    0.7415677   1.6472763 ]
#  [-2.6847968  -5.292254    0.7415677   1.6472763 ]]

[[ 0.5855727   2.3091455   0.7415673   1.647276  ]
 [ 0.49718437  2.3091455   0.7415673   1.647276  ]
 [ 0.40879607  2.3091455   0.7415673   1.647276  ]
 ...
 [-2.50802    -5.292254    0.7415677   1.6472763 ]
 [-2.5964084  -5.292254    0.7415677   1.6472763 ]
 [-2.6847968  -5.292254    0.7415677   1.6472763 ]]


Now we got anchor_locs and label associated with each and every valid anchor boxes
Lets map them to the original anchors using the inside_index variable. Fill the unvalid anchor boxes labels with -1 (ignore) and locations with 0.

In [80]:
anchor_labels = np.empty((len(anchors),), dtype=label.dtype)
anchor_labels.fill(-1)
anchor_labels[index_inside] = label

print(anchor_labels.shape)

anchor_locations = np.empty((len(anchors),) + anchors.shape[1:], dtype=anchor_locs.dtype)
anchor_locations.fill(0)
anchor_locations[index_inside, :] = anchor_locs

print(anchor_locations.shape)

(22500,)
(22500, 4)


The final two matrices are
anchor_locations [N, 4] — [22500, 4]
anchor_labels [N,] — [22500]
These are used as targets to the RPN network. We will see how this RPN network is designed in the next section.

# Region Proposal Network.

As we have discussed earlier, Prior to this work, region proposals for a network were generated using selective search, CPMC, MCG, Edgeboxes etc. Faster_R-CNN is the first work to demonstrate generating region proposals using deep learning.

The network contains a convolution module, on top of which there will be one regression layer, which predicts the location of the box inside the anchor
To generate region proposals, we slide a small network over the convolutional feature map output that we obtained in the feature extraction module. This small network takes as input an n x n spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature [512 features]. This feature is fed into two sibling fully connected layers
A box regrression layer
A box classification layer
we use n=3, as noted in Faster_R-CNN paper. We can implement this Architecture using n x n convolutional layer followed by two sibiling 1 x 1 convolutional layers

In [81]:
import torch.nn as nn

mid_channels = 512

in_channels = 512 # depends on the output feature map. in vgg 16 it is equal to 512

n_anchor = 9 # Number of anchors at each location

conv1 = nn.Conv2d(in_channels, mid_channels, 3, 1, 1) #padding 为1，可以保持特征图的分辨率

reg_layer = nn.Conv2d(mid_channels, n_anchor * 4, 1, 1, 0) #padding 为0
cls_layer = nn.Conv2d(mid_channels, n_anchor * 2, 1, 1, 0) #I will be going to use softmax here. you can equally use sigmoid if u replace 2 with 1.

The paper tells that they initialized these layers with zero mean and 0.01 standard deviation for weights and zeros for base. Lets do that

In [82]:
# conv sliding layer
conv1.weight.data.normal_(0, 0.01)
conv1.bias.data.zero_()

#Regression layer
reg_layer.weight.data.normal_(0, 0.01)
reg_layer.bias.data.zero_()

# classification layer
cls_layer.weight.data.normal_(0, 0.01)
cls_layer.bias.data.zero_()

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Now the outputs we got in the feature extraction state should be sent to this network to predict locations of objects with repect to the anchor and the objectness score assoiciated with it.

In [87]:
x = conv1(out_map) # out_map is obtained in section 1
pred_anchor_locs = reg_layer(x)
pred_cls_scores = cls_layer(x)
print(pred_cls_scores.shape, pred_anchor_locs.shape)

torch.Size([1, 18, 50, 50]) torch.Size([1, 36, 50, 50])


Lets reformat these a bit and make it align with our anchor targets we designed previously. We will also find the objectness scores for each anchor box, as this is used to for proposal layer which we will discuss in the next section

In [88]:
pred_anchor_locs = pred_anchor_locs.permute(0, 2, 3, 1).contiguous().view(1, -1, 4)
print(pred_anchor_locs.shape)
#Out: torch.Size([1, 22500, 4])
pred_cls_scores = pred_cls_scores.permute(0, 2, 3, 1).contiguous()
print(pred_cls_scores.shape)
#Out torch.Size([1, 50, 50, 18])
objectness_score = pred_cls_scores.view(1, 50, 50, 9, 2)[:, :, :, :, 1].contiguous().view(1, -1)
print(objectness_score.shape)
#Out torch.Size([1, 22500])
pred_cls_scores  = pred_cls_scores.view(1, -1, 2)
print(pred_cls_scores.shape)
# Out torch.size([1, 22500, 2])

torch.Size([1, 22500, 4])
torch.Size([1, 50, 50, 18])
torch.Size([1, 22500])
torch.Size([1, 22500, 2])


we are done with section
pred_cls_scores and pred_anchor_locs are the output the RPN network and the losses to updates the weights
pred_cls_scores and objectness_scores are used as inputs to the proposal layer, which generate a set of proposal which are further used by RoI network. We will see this in the next section.

# Generating proposals to feed Fast R-CNN network

The proposal function will take the following parameters
Weather training_mode or testing mode
nms_thresh
n_train_pre_nms — number of bboxes before nms during training
n_train_post_nms — number of bboxes after nms during training
n_test_pre_nms — number of bboxes before nms during testing
n_test_post_nms — number of bboxes after nms during testing
min_size — minimum height of the object required to create a proposal.

The Faster R_CNN says, RPN proposals highly overlap with each other. To reduced redundancy, we adopt non-maximum supression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2000 proposal regions per image. After an ablation study, the authors show that NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals. After NMS, we use the top-N ranked proposal regions for detection. In the following we training Fast R-CNN using 2000 RPN proposals. During testing they evaluate only 300 proposals, they have tested this with various numbers and obtained this.

In [89]:
nms_thresh = 0.7
n_train_pre_nms = 12000
n_train_post_nms = 2000
n_test_pre_nms = 6000
n_test_post_nms = 300
min_size = 16

We need to do the following things to generate region of interest proposals to the network.
convert the loc predictions from the rpn network to bbox [y1, x1, y2, x2] format.
clip the predicted boxes to the image
Remove predicted boxes with either height or width < threshold (min_size).
Sort all (proposal, score) pairs by score from highest to lowest.
Take top pre_nms_topN (e.g. 12000 while training and 300 while testing).
Apply nms threshold > 0.7
Take top pos_nms_topN (e.g. 2000 while training and 300 while testing)

We will look at each of the stages in the remainder of this section

In [114]:
#convert anchors format from y1, x1, y2, x2 to ctr_x, ctr_y, h, w
anc_height = anchors[:, 2] - anchors[:, 0]
anc_width = anchors[:, 3] - anchors[:, 1]
anc_ctr_y = anchors[:, 0] + 0.5 * anc_height
anc_ctr_x = anchors[:, 1] + 0.5 * anc_width
print(anc_ctr_x.shape)

pred_anchor_locs_numpy = pred_anchor_locs[0].data.numpy()
objectness_score_numpy = objectness_score[0].data.numpy()
print(pred_anchor_locs_numpy.shape)
print(objectness_score_numpy.shape)

(22500,)
(22500, 4)
(22500,)


In [115]:
pred_anchor_locs_numpy = pred_anchor_locs[0].data.numpy()
objectness_score_numpy = objectness_score[0].data.numpy()
dy = pred_anchor_locs_numpy[:, [0]] #每个anchor box的dy,numpy切片，如果带[]的话，可以保持维度
dx = pred_anchor_locs_numpy[:, [1]] #每个anchor box的dx
dh = pred_anchor_locs_numpy[:, [2]] # 每个anchor box的dh
dw = pred_anchor_locs_numpy[:, [3]] # dw
print(dy.shape)

ctr_y = dy * anc_height[:, np.newaxis] + anc_ctr_y[:, np.newaxis]
ctr_x = dx * anc_width[:, np.newaxis] + anc_ctr_x[:, np.newaxis]
h = np.exp(dh) * anc_height[:, np.newaxis]
w = np.exp(dw) * anc_width[:, np.newaxis]
print(w.shape)


(22500, 1)
(22500, 1)


In [116]:
# 用 labelled 的 anchor boxes 與 RPN 預測的 anchor boxes來計算 ROI = [y1, x1, y2, x2] 
roi = np.zeros(pred_anchor_locs_numpy.shape, dtype=anchor_locs.dtype)
roi[:, 0::4] = ctr_y - 0.5 * h
roi[:, 1::4] = ctr_x - 0.5 * w
roi[:, 2::4] = ctr_y + 0.5 * h
roi[:, 3::4] = ctr_x + 0.5 * w
#Out:
# [[ -36.897102,  -80.29519 ,   54.09939 ,  100.40507 ],
#  [ -83.12463 , -165.74298 ,   98.67854 ,  188.6116  ],
#  [-170.7821  , -378.22214 ,  196.20844 ,  349.81198 ],
#  ...,
#  [ 696.17816 ,  747.13306 ,  883.4582  ,  836.77747 ],
#  [ 621.42114 ,  703.0614  ,  973.04626 ,  885.31226 ],
#  [ 432.86267 ,  622.48926 , 1146.7059  ,  982.9209  ]]

#clip the predicted boxes to the image
img_size = (800, 800) #Image size
roi[:, slice(0, 4, 2)] = np.clip(
            roi[:, slice(0, 4, 2)], 0, img_size[0])
roi[:, slice(1, 4, 2)] = np.clip(
    roi[:, slice(1, 4, 2)], 0, img_size[1])
print(roi)
#Out:
# [[  0.     ,   0.     ,  54.09939, 100.40507],
#  [  0.     ,   0.     ,  98.67854, 188.6116 ],
#  [  0.     ,   0.     , 196.20844, 349.81198],
#  ...,
#  [696.17816, 747.13306, 800.     , 800.     ],
#  [621.42114, 703.0614 , 800.     , 800.     ],
#  [432.86267, 622.48926, 800.     , 800.     ]]

[[  0.         0.        52.917183  98.60744 ]
 [  0.         0.        97.90541  199.06795 ]
 [  0.         0.       198.67064  375.2675  ]
 ...
 [696.2994   745.73676  800.       800.      ]
 [602.5117   700.2239   800.       800.      ]
 [429.5132   608.8695   800.       800.      ]]


In [117]:
#remove predicted boxes with either height or width < threshold
hs = roi[:, 2] - roi[:, 0]
ws = roi[:, 3] - roi[:, 1]
keep = np.where((hs >= min_size) & (ws >= min_size))[0]
roi = roi[keep, :]
score = objectness_score_numpy[keep]
print(score.shape)
#Out:
##(22500, ) all the boxes have minimum size of 16

#Sort all (proposal, score) pairs by score from highest to lowest.
order = score.ravel().argsort()[::-1]
print(order)
#Out:
#[ 889,  929, 1316, ...,  462,  454,    4]

#Take top pre_nms_topN (e.g. 12000 while training and 300 while testing)
order = order[:n_train_pre_nms]
roi = roi[order, :]
print(order.shape, roi.shape)
#print(roi)



(22500,)
[  454   457     2 ...    18 22036   890]
(12000,) (12000, 4)


Apply non-maximum supression threshold > 0.7 First question, What is Non-maximum supression ? It is the process in which we remove/merge extremely highly overlapping bounding boxes. If we look at the below diagram, there are lot of overlapping bounding boxes and we want a few bounding boxes which are unique and doesn’t overlap much. We keep the threshold at 0.7. threshold defines the minimum overlapping area required to merge/remove overlapping bounding boxes

The sudo code for NMS works in the following way

- Take all the roi boxes [roi_array]
- Find the areas of all the boxes [roi_area]
- Take the indexes of order the probability score in descending order [order_array]
keep = []
while order_array.size > 0:
  - take the first element in order_array and append that to keep  
  - Find the area with all other boxes
  - Find the index of all the boxes which have high overlap with this box
  - Remove them from order array
  - Iterate this till we get the order_size to zero (while loop)
- Ouput the keep variable which tells what indexes to consider.

In [118]:
#Take top pos_nms_topN (e.g. 2000 while training and 300 while testing)
y1 = roi[:, 0]
x1 = roi[:, 1]
y2 = roi[:, 2]
x2 = roi[:, 3]
areas = (x2 - x1 + 1) * (y2 - y1 + 1)
order = order.argsort()[::-1]
keep = []
while order.size > 0:
    i = order[0]
    keep.append(i)
    xx1 = np.maximum(x1[i], x1[order[1:]])
    yy1 = np.maximum(y1[i], y1[order[1:]])
    xx2 = np.minimum(x2[i], x2[order[1:]])
    yy2 = np.minimum(y2[i], y2[order[1:]])
    w = np.maximum(0.0, xx2 - xx1 + 1)
    h = np.maximum(0.0, yy2 - yy1 + 1)
    inter = w * h
    ovr = inter / (areas[i] + areas[order[1:]] - inter)
    inds = np.where(ovr <= nms_thresh)[0]
    order = order[inds + 1]
keep = keep[:n_train_post_nms] # while training/testing , use accordingly
roi = roi[keep] # the final region proposals
print(len(keep), roi.shape)

1031 (1031, 4)


The final region proposals were obtained, This is used as the input to the Fast_R-CNN object which finally tries to predict the object locations (with respect to the proposed box) and class of the object (classifcation of each proposal). First we look into how to create targets for these proposals for training this network. After that we will look into how this fast r-cnn network is implemented and pass these proposals to the network to obtain the predicted outputs. Then, we will determine the losses, We will calculate both the rpn loss and fast r-cnn loss.

# Proposal targets

The Fast R-CNN network takes the region proposals (obtained from proposal layer in previous section), ground truth boxes and their respective labels as inputs. It will take the following parameters
n_sample: Number of samples to sample from roi, The default value is 128.
pos_ratio: the number of positive examples out of the n_samples. The default values is 0.25.
pos_iou_thesh: The minimum overlap of region proposal with any groundtruth object to consider it as positive label.
[neg_iou_threshold_lo, neg_iou_threshold_hi] : [0.0, 0.5], The overlap value bounding required to consider a region proposal as negitive [background object].

In [119]:
n_sample = 128
pos_ratio = 0.25
pos_iou_thresh = 0.5
neg_iou_thresh_hi = 0.5
neg_iou_thresh_lo = 0.0

In [120]:
ious = np.empty((len(roi), 2), dtype=np.float32)
ious.fill(0)
for num1, i in enumerate(roi):
    ya1, xa1, ya2, xa2 = i  
    anchor_area = (ya2 - ya1) * (xa2 - xa1)
    for num2, j in enumerate(bbox):
        yb1, xb1, yb2, xb2 = j
        box_area = (yb2- yb1) * (xb2 - xb1)
        inter_x1 = max([xb1, xa1])
        inter_y1 = max([yb1, ya1])
        inter_x2 = min([xb2, xa2])
        inter_y2 = min([yb2, ya2])
        if (inter_x1 < inter_x2) and (inter_y1 < inter_y2):
            iter_area = (inter_y2 - inter_y1) * \
(inter_x2 - inter_x1)
            iou = iter_area / (anchor_area+ \
box_area - iter_area)            
        else:
            iou = 0.
        ious[num1, num2] = iou
print(ious.shape)

(1031, 2)


In [122]:
gt_assignment = ious.argmax(axis=1)
max_iou = ious.max(axis=1)
print(gt_assignment)
print(max_iou)

[0 0 0 ... 0 0 0]
[0.         0.         0.         ... 0.03046864 0.03046921 0.00844959]


In [123]:
gt_roi_label = labels[gt_assignment]
print(gt_roi_label)

[6 6 6 ... 6 6 6]


In [125]:
# Select the foreground rois as per the pos_iou_thesh and 
# n_sample x pos_ratio (128 x 0.25 = 32) foreground samples.
pos_roi_per_image = 32 
pos_index = np.where(max_iou >= pos_iou_thresh)[0]
pos_roi_per_this_image = int(min(pos_roi_per_image, pos_index.size))
if pos_index.size > 0:
    pos_index = np.random.choice(
        pos_index, size=pos_roi_per_this_image, replace=False)
print(pos_roi_per_this_image)
print(pos_index)

23
[334 554 699 844 372 336 679 615 410 722 658 399 617 373 671 741 793 619
 398 579 533 675 411]


In [126]:
neg_index = np.where((max_iou < neg_iou_thresh_hi) &
                             (max_iou >= neg_iou_thresh_lo))[0]
neg_roi_per_this_image = n_sample - pos_roi_per_this_image
neg_roi_per_this_image = int(min(neg_roi_per_this_image,
                                 neg_index.size))
if  neg_index.size > 0 :
    neg_index = np.random.choice(
        neg_index, size=neg_roi_per_this_image, replace=False)
print(neg_roi_per_this_image)
print(neg_index)

105
[ 727  814   36  316   40   59  347  461  644  541  107  858  394  555
  925   21  567  811  976  958  424  972  903   88  156  273  566  395
  215  101  291  631   73  207  604  842  309  258  134  350  506  725
  342  709  889 1026  103  403  939  911  147  414  763  130  954  369
  383  468  327  710  270  225   96  655  478  835  851  986  172  255
  622  863  778  906  605  638  998  376    8  868  490  379  756  132
    6    7  413  960  380  484  116  109  516  286  448  636  562  794
   70  932  462  427  734  491  543]


In [127]:
keep_index = np.append(pos_index, neg_index)
gt_roi_labels = gt_roi_label[keep_index]
gt_roi_labels[pos_roi_per_this_image:] = 0  # negative labels --> 0
sample_roi = roi[keep_index]
print(sample_roi.shape)

(128, 4)


In [128]:
bbox_for_sampled_roi = bbox[gt_assignment[keep_index]]
print(bbox_for_sampled_roi.shape)
#Out
#(128, 4)
height = sample_roi[:, 2] - sample_roi[:, 0]
width = sample_roi[:, 3] - sample_roi[:, 1]
ctr_y = sample_roi[:, 0] + 0.5 * height
ctr_x = sample_roi[:, 1] + 0.5 * width
base_height = bbox_for_sampled_roi[:, 2] - bbox_for_sampled_roi[:, 0]
base_width = bbox_for_sampled_roi[:, 3] - bbox_for_sampled_roi[:, 1]
base_ctr_y = bbox_for_sampled_roi[:, 0] + 0.5 * base_height
base_ctr_x = bbox_for_sampled_roi[:, 1] + 0.5 * base_width

(128, 4)


In [130]:

eps = np.finfo(height.dtype).eps
height = np.maximum(height, eps)
width = np.maximum(width, eps)
dy = (base_ctr_y - ctr_y) / height
dx = (base_ctr_x - ctr_x) / width
dh = np.log(base_height / height)
dw = np.log(base_width / width)
gt_roi_locs = np.vstack((dy, dx, dh, dw)).transpose()
print(gt_roi_locs.shape)

(128, 4)


# Fast R-CNN
取出128 個 ROI samples 的 features, 用 max pooling 調成 same size, H=7, W=7 (ROI Pooling)

In [132]:
rois = torch.from_numpy(sample_roi).float()
roi_indices = 0 * np.ones((len(rois),), dtype=np.int32)
roi_indices = torch.from_numpy(roi_indices).float()
print(rois.shape, roi_indices.shape)

torch.Size([128, 4]) torch.Size([128])


In [133]:
indices_and_rois = torch.cat([roi_indices[:, None], rois], dim=1)
xy_indices_and_rois = indices_and_rois[:, [0, 2, 1, 4, 3]]
indices_and_rois = xy_indices_and_rois.contiguous()
print(xy_indices_and_rois.shape)

torch.Size([128, 5])


In [136]:
size = (7, 7)
adaptive_max_pool = nn.AdaptiveMaxPool2d(size[0], size[1])
output = []
rois = indices_and_rois.data.float()
rois[:, 1:].mul_(1/16.0) # Subsampling ratio
rois = rois.long()
num_rois = rois.size(0)
for i in range(num_rois):
    roi = rois[i]
    im_idx = roi[0]
    im = out_map.narrow(0, im_idx, 1)[..., roi[2]:(roi[4]+1), roi[1]:(roi[3]+1)]
    tmp = adaptive_max_pool(im)
    output.append(tmp[0])
output = torch.cat(output, 0)
print(output.size())
#Out:
# torch.Size([128, 512, 7, 7])
# Reshape the tensor so that we can pass it through the feed forward layer.
k = output.view(output.size(0), -1)
print(k.shape)
#Out:
# torch.Size([128, 25088])

torch.Size([128, 512, 7, 7])
torch.Size([128, 25088])


128 個 ROI samples 的 boxes + features (7x7x512) 送到 Detection network 預測輸入影像的物件 bounding box 與 class

In [137]:
roi_head_classifier = nn.Sequential(*[nn.Linear(25088, 4096),
                                      nn.Linear(4096, 4096)])
cls_loc = nn.Linear(4096, 21 * 4) # (VOC 20 classes + 1 background. Each will have 4 co-ordinates)
cls_loc.weight.data.normal_(0, 0.01)
cls_loc.bias.data.zero_()
score = nn.Linear(4096, 21) # (VOC 20 classes + 1 background)

In [138]:
k = roi_head_classifier(k)
roi_cls_loc = cls_loc(k)
roi_cls_score = score(k)
print(roi_cls_loc.shape, roi_cls_score.shape)

torch.Size([128, 84]) torch.Size([128, 21])


# section 7 Loss functions

## RPN Loss

In [139]:
print(pred_anchor_locs.shape)
print(pred_cls_scores.shape)
print(anchor_locations.shape)
print(anchor_labels.shape)

torch.Size([1, 22500, 4])
torch.Size([1, 22500, 2])
(22500, 4)
(22500,)


In [140]:
rpn_loc = pred_anchor_locs[0]
rpn_score = pred_cls_scores[0]
gt_rpn_loc = torch.from_numpy(anchor_locations)
gt_rpn_score = torch.from_numpy(anchor_labels)
print(rpn_loc.shape, rpn_score.shape, gt_rpn_loc.shape, gt_rpn_score.shape)

torch.Size([22500, 4]) torch.Size([22500, 2]) torch.Size([22500, 4]) torch.Size([22500])


In [141]:
import torch.nn.functional as F
rpn_cls_loss = F.cross_entropy(rpn_score, gt_rpn_score.long(), ignore_index = -1)
print(rpn_cls_loss)
#Out:

tensor(0.6931, grad_fn=<NllLossBackward>)


In [142]:
pos = gt_rpn_score > 0
mask = pos.unsqueeze(1).expand_as(rpn_loc)
print(mask.shape)

torch.Size([22500, 4])


In [143]:
mask_loc_preds = rpn_loc[mask].view(-1, 4)
mask_loc_targets = gt_rpn_loc[mask].view(-1, 4)
print(mask_loc_preds.shape, mask_loc_preds.shape)

torch.Size([18, 4]) torch.Size([18, 4])


In [144]:
x = torch.abs(mask_loc_targets - mask_loc_preds)
rpn_loc_loss = ((x < 1).float() * 0.5 * x**2) + ((x >= 1).float() * (x-0.5))
print(rpn_loc_loss.sum())

tensor(1.2075, grad_fn=<SumBackward0>)


In [145]:
rpn_lambda = 10.
N_reg = (gt_rpn_score >0).float().sum()
rpn_loc_loss = rpn_loc_loss.sum() / N_reg
rpn_loss = rpn_cls_loss + (rpn_lambda * rpn_loc_loss)
print(rpn_loss)

tensor(1.3640, grad_fn=<AddBackward0>)


##  Fast R-CNN loss

In [149]:
#predicted
print(roi_cls_loc.shape)
print(roi_cls_score.shape)

#actual
print(gt_roi_locs.shape)
print(gt_roi_labels.shape)

#Converting ground truth to torch variable
gt_roi_loc = torch.from_numpy(gt_roi_locs)
gt_roi_label = torch.from_numpy(np.float32(gt_roi_labels)).long()
print(gt_roi_loc.shape, gt_roi_label.shape)

#classification loss
roi_cls_loss = F.cross_entropy(roi_cls_score, gt_roi_label, ignore_index=-1)
print(roi_cls_loss.shape)

#Regression loss 
n_sample = roi_cls_loc.shape[0]
roi_loc = roi_cls_loc.view(n_sample, -1, 4)
print(roi_loc.shape)
#Out:
#torch.Size([128, 21, 4])
roi_loc = roi_loc[torch.arange(0, n_sample).long(), gt_roi_label]
print(roi_loc.shape)

# For Regression we use smooth L1 loss as defined in the Fast RCNN paper
pos = gt_roi_label > 0
mask = pos.unsqueeze(1).expand_as(roi_loc)
print(mask.shape)

# take those bounding boxes which have positve labels
mask_loc_preds = roi_loc[mask].view(-1, 4)
mask_loc_targets = gt_roi_loc[mask].view(-1, 4)
print(mask_loc_preds.shape, mask_loc_targets.shape)

x = torch.abs(mask_loc_targets.cpu() - mask_loc_preds.cpu())
roi_loc_loss = ((x < 1).float() * 0.5 * x**2) + ((x >= 1).float() * (x-0.5))
print(roi_loc_loss.sum())

#total roi loss
roi_lambda = 10.
roi_loss = roi_cls_loss + (roi_lambda * roi_loc_loss)
print(roi_loss)



torch.Size([128, 84])
torch.Size([128, 21])
(128, 4)
(128,)
torch.Size([128, 4]) torch.Size([128])
torch.Size([])
torch.Size([128, 21, 4])
torch.Size([128, 4])
torch.Size([128, 4])
torch.Size([23, 4]) torch.Size([23, 4])
tensor(2.1932, grad_fn=<SumBackward0>)
tensor([[3.1561, 3.1140, 3.5262, 3.4983],
        [3.0734, 3.3998, 3.0811, 3.7409],
        [3.0733, 3.1111, 3.1484, 3.0569],
        [3.3515, 3.0767, 3.0812, 3.0555],
        [3.0870, 3.0557, 3.5262, 3.4983],
        [3.0640, 3.1140, 3.5262, 3.4983],
        [3.0547, 3.1509, 3.0546, 3.5456],
        [3.1333, 3.3244, 3.4025, 3.0569],
        [3.0563, 3.1523, 3.5262, 3.4983],
        [3.2778, 3.0778, 3.5784, 3.0551],
        [3.0963, 3.2703, 3.1561, 3.4225],
        [3.1612, 3.1187, 5.2256, 3.0845],
        [3.0877, 3.4312, 3.2006, 3.5556],
        [3.1375, 3.0548, 3.4378, 3.4151],
        [3.3500, 3.1036, 3.4425, 3.0635],
        [3.0877, 3.0551, 3.2006, 3.5556],
        [3.0575, 3.1337, 3.0686, 3.5700],
        [3.1911, 3.3224, 3

Total loss

Now we need to combine the RPN loss and Fast-RCNN loss to compute the total loss for 1 iteration. this is a simple addition

In [151]:
total_loss = rpn_loss + roi_loss
print(total_loss)

tensor([[4.5201, 4.4779, 4.8902, 4.8623],
        [4.4373, 4.7637, 4.4451, 5.1049],
        [4.4372, 4.4750, 4.5124, 4.4208],
        [4.7155, 4.4407, 4.4451, 4.4195],
        [4.4510, 4.4197, 4.8902, 4.8623],
        [4.4280, 4.4779, 4.8902, 4.8623],
        [4.4187, 4.5149, 4.4186, 4.9096],
        [4.4973, 4.6883, 4.7665, 4.4208],
        [4.4202, 4.5163, 4.8902, 4.8623],
        [4.6417, 4.4418, 4.9423, 4.4191],
        [4.4603, 4.6343, 4.5201, 4.7864],
        [4.5251, 4.4827, 6.5896, 4.4484],
        [4.4517, 4.7952, 4.5646, 4.9195],
        [4.5014, 4.4187, 4.8017, 4.7790],
        [4.7140, 4.4676, 4.8065, 4.4275],
        [4.4517, 4.4191, 4.5646, 4.9195],
        [4.4215, 4.4977, 4.4326, 4.9340],
        [4.5551, 4.6864, 4.7412, 4.4212],
        [4.4730, 4.4676, 6.5163, 4.4346],
        [4.4186, 4.8545, 4.4211, 4.4208],
        [4.4191, 5.4536, 4.4186, 4.8986],
        [4.5836, 4.4795, 5.0846, 4.8497],
        [4.6005, 4.5045, 4.8017, 4.7790]], grad_fn=<AddBackward0>)
