## (1) What kinds of methods exist in the field of object detection?

- Region proposal methods
- Region-based convolutional neural networks (INTRODUCTION)


## (2) It says "Faster", but what mechanism was used to make it faster?

In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features. Using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5 fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available. (ABSTRACT)

## (3) How is the One-Stage method different from the Two-Stage method?

- One-stage: a class-specific detection pipeline.
- Two-stage: class-agnostic proposals followed by class-specific detections. (EXPERIMENTS)



Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. The entire system is a single, unified network for object detection. (FASTER R-CNN)

## (4) What is RPN?

- An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. (ABSTRACT)

- A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. We model this process with a fully convolutional network [7], which we describe in this section. (FASTER R-CNN)

## (5) What is RoI pooling?
- The RoI pooling layer [2] in Fast R-CNN accepts the convolutional features and also the predicted bounding boxes as input, so a theoretically valid backpropagation solver should also involve gradients w.r.t. the box coordinates. These gradients are ignored in the above approximate joint training. In a non-approximate joint training solution, we need an RoI pooling layer that is differentiable w.r.t. the box coordinates. This is a nontrivial problem and a solution can be given by an “RoI warping” layer as developed in [15], which is beyond the scope of this paper. (FASTER R-CNN)

- The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w). RoI max pooling works by dividing the h × w RoI window into an H × W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11]. (Fast R-CNN architecture and training, Fast R-CNN paper)
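
The sub-window division described above can be sketched in plain Python. This is only an illustration, not the Fast R-CNN implementation: the function name `roi_max_pool`, the single-channel nested-list feature map, and the floor/ceil rule for bin boundaries are assumptions made here, since the paper defers the exact sub-window calculation to [11].

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a single-channel feature map to out_h x out_w.

    feature_map: list of rows (nested lists of numbers).
    roi: (r, c, h, w), i.e. top-left corner (r, c) and size (h, w),
         measured in feature-map cells.
    """
    r, c, h, w = roi
    pooled = []
    for i in range(out_h):
        # Row range of sub-window i. The floor/ceil rounding is an assumed
        # scheme; it keeps every bin non-empty as long as h >= 1.
        r0 = r + math.floor(i * h / out_h)
        r1 = r + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = c + math.floor(j * w / out_w)
            c1 = c + math.ceil((j + 1) * w / out_w)
            # Max over all cells that fall inside this sub-window.
            row.append(max(feature_map[y][x]
                           for y in range(r0, r1)
                           for x in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 4 x 6 feature map holding the values 0..23 in row-major order.
fmap = [[y * 6 + x for x in range(6)] for y in range(4)]

# Pool the 4 x 4 window whose top-left corner is (row 0, column 1)
# into a fixed 2 x 2 output.
print(roi_max_pool(fmap, (0, 1, 4, 4), 2, 2))  # [[8, 10], [20, 22]]
```

For a multi-channel feature map the same function would be applied to each channel separately, matching the statement that pooling is independent per channel.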

## (6) What are the proper sizes for anchors?

For anchors, we use 3 scales with box areas of 128², 256², and 512² pixels, and 3 aspect ratios of 1:1, 1:2,
and 2:1. These hyper-parameters are not carefully chosen for a particular dataset, and we provide ablation
experiments on their effects in the next section. As discussed, our solution does not need an image pyramid
or filter pyramid to predict regions of multiple scales,
saving considerable running time. Figure 3 (right)
shows the capability of our method for a wide range
of scales and aspect ratios. Table 1 shows the learned
average proposal size for each anchor using the ZF
net. We note that our algorithm allows predictions
that are larger than the underlying receptive field.
Such predictions are not impossible—one may still
roughly infer the extent of an object if only the middle
of the object is visible.

The anchor boxes that cross image boundaries need
to be handled with care. During training, we ignore
all cross-boundary anchors so they do not contribute
to the loss. For a typical 1000 × 600 image, there
will be roughly 20000 (≈ 60 × 40 × 9) anchors in
total. With the cross-boundary anchors ignored, there
are about 6000 anchors per image for training. If the
boundary-crossing outliers are not ignored in training,
they introduce large, difficult to correct error terms in
the objective, and training does not converge. During
testing, however, we still apply the fully convolutional
RPN to the entire image. This may generate cross-boundary proposal boxes, which we clip to the image
boundary. (FASTER R-CNN)
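
The anchor layout and the boundary handling quoted above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: the helper names, the feature stride of 16 pixels, the placement of anchor centres at the middle of each stride cell, and reading each ratio as width/height are all assumptions introduced here. The 60 × 40 grid and the 1000 × 600 image size come from the quoted text.

```python
import math

SCALES = (128, 256, 512)   # side lengths whose squares are the anchor areas
RATIOS = (1.0, 0.5, 2.0)   # 1:1, 1:2, 2:1, read here as width / height (assumption)


def make_anchor_shapes(scales=SCALES, ratios=RATIOS):
    """Return (width, height) pairs with area scale**2 and width/height = ratio."""
    shapes = []
    for s in scales:
        for ratio in ratios:
            shapes.append((s * math.sqrt(ratio), s / math.sqrt(ratio)))
    return shapes


def anchors_on_grid(feat_w, feat_h, stride, shapes):
    """Centre every anchor shape on every feature-map position.

    Boxes are (x1, y1, x2, y2) in image pixels.
    """
    boxes = []
    for gy in range(feat_h):
        for gx in range(feat_w):
            cx = gx * stride + stride / 2   # assumed centre placement
            cy = gy * stride + stride / 2
            for w, h in shapes:
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def inside_image(box, img_w, img_h):
    """True if the box does not cross any image boundary (kept for training)."""
    x1, y1, x2, y2 = box
    return x1 >= 0 and y1 >= 0 and x2 <= img_w and y2 <= img_h


def clip_to_image(box, img_w, img_h):
    """Clip a box to the image boundary (applied to proposals at test time)."""
    x1, y1, x2, y2 = box
    return (max(0.0, x1), max(0.0, y1),
            min(float(img_w), x2), min(float(img_h), y2))


shapes = make_anchor_shapes()
# 60 x 40 is the approximate grid quoted in the text. With the assumed stride
# of 16 a few grid rows lie below a 600-pixel-high image; their anchors are
# simply removed together with the other cross-boundary anchors.
all_anchors = anchors_on_grid(feat_w=60, feat_h=40, stride=16, shapes=shapes)
train_anchors = [b for b in all_anchors if inside_image(b, 1000, 600)]

print(len(shapes))         # 9 anchors per position (3 scales x 3 ratios)
print(len(all_anchors))    # 21600, i.e. the "roughly 20000" of the text
print(len(train_anchors))  # anchors left after dropping cross-boundary ones
```

The exact number of surviving training anchors depends on the assumed stride and centre placement, so it need not reproduce the figure of about 6000 given in the paper.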

## (7) Which datasets are used, and what metric values are obtained compared with previous work?

- Datasets: PASCAL VOC 2007 and 2012, MS COCO, and ILSVRC (ABSTRACT)

Table 3 shows the results of VGG-16 for both proposal and detection. Using RPN+VGG, the result is 68.5% for unshared features, slightly higher than the SS baseline. As shown above, this is because the proposals generated by RPN+VGG are more accurate than SS. Unlike SS that is predefined, the RPN is actively trained and benefits from better networks. For the feature-shared variant, the result is 69.9%—better than the strong SS baseline, yet with nearly cost-free proposals. We further train the RPN and detection network on the union set of PASCAL VOC 2007 trainval and 2012 trainval. The mAP is 73.2%. Figure 5 shows some results on the PASCAL VOC 2007 test set. On the PASCAL VOC 2012 test set (Table 4), our method has an mAP of 70.4% trained on the union set of VOC 2007 trainval+test and VOC 2012 trainval. Table 6 and Table 7 show the detailed numbers.


In Table 5 we summarize the running time of the entire object detection system. SS takes 1-2 seconds depending on content (on average about 1.5s), and Fast R-CNN with VGG-16 takes 320ms on 2000 SS proposals (or 223ms if using SVD on fully-connected layers [2]). Our system with VGG-16 takes in total 198ms for both proposal and detection. With the convolutional features shared, the RPN alone only takes 10ms computing the additional layers. Our region-wise computation is also lower, thanks to fewer proposals (300 per image). Our system has a frame-rate of 17 fps with the ZF net.
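
The relation between these per-image latencies and the quoted frame rates is simple arithmetic, shown below as a quick check. The helper name `fps` and the variable names are introduced here for illustration; the millisecond figures are the ones quoted in the paragraph above, with 1.5 s used as the average Selective Search time.

```python
def fps(ms_per_image):
    """Frames per second for a given per-image latency in milliseconds."""
    return 1000.0 / ms_per_image


ss_fast_rcnn_ms = 1500 + 320   # average SS time + Fast R-CNN (VGG-16, 2000 proposals)
faster_rcnn_ms = 198           # proposal + detection with VGG-16

print(round(fps(ss_fast_rcnn_ms), 2))                # 0.55 fps
print(round(fps(faster_rcnn_ms), 2))                 # 5.05 fps, the "5 fps" of the abstract
print(round(ss_fast_rcnn_ms / faster_rcnn_ms, 1))    # 9.2x lower latency
print(round(1000.0 / 17, 1))                         # 58.8 ms per image at 17 fps (ZF net)
```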


In Table 11 we first report the results of the Fast R-CNN system [2] using the implementation in this paper. Our Fast R-CNN baseline has 39.3% mAP@0.5 on the test-dev set, higher than that reported in [2]. We conjecture that the reason for this gap is mainly due to the definition of the negative samples and also the changes of the mini-batch sizes. We also note that the mAP@[.5, .95] is just comparable.

Next we evaluate our Faster R-CNN system. Using the COCO training set to train, Faster R-CNN has 42.1% mAP@0.5 and 21.5% mAP@[.5, .95] on the COCO test-dev set. This is 2.8% higher for mAP@0.5 and 2.2% higher for mAP@[.5, .95] than the Fast R-CNN counterpart under the same protocol (Table 11). This indicates that RPN performs excellently for improving the localization accuracy at higher IoU thresholds. Using the COCO trainval set to train, Faster R-CNN has 42.7% mAP@0.5 and 21.9% mAP@[.5, .95] on the COCO test-dev set. Figure 6 shows some results on the MS COCO test-dev set. (EXPERIMENTS)
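
The two metrics differ in how strictly a detection must overlap the ground truth: mAP@0.5 accepts a detection at an intersection-over-union (IoU) of at least 0.5, while the COCO-style mAP@[.5, .95] averages AP over ten IoU thresholds from 0.5 to 0.95 in steps of 0.05. The sketch below only illustrates the IoU test at those thresholds; it does not compute AP, and the function names and toy boxes are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds 0.50, 0.55, ..., 0.95 used for mAP@[.5, .95].
IOU_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def thresholds_passed(pred, gt):
    """Thresholds at which `pred` would count as a match for `gt`."""
    overlap = iou(pred, gt)
    return [t for t in IOU_THRESHOLDS if overlap >= t]


gt = (0, 0, 100, 100)
loose = (20, 0, 120, 100)   # shifted by 20 px: IoU = 8000 / 12000, about 0.67
tight = (5, 0, 105, 100)    # shifted by 5 px:  IoU = 9500 / 10500, about 0.90

print(len(thresholds_passed(loose, gt)))  # 4 (0.50 to 0.65)
print(len(thresholds_passed(tight, gt)))  # 9 (0.50 to 0.90)
```

Both toy boxes count as correct under mAP@0.5, but only the tighter one keeps scoring at the stricter thresholds, which is why gains in mAP@[.5, .95] are read as better localization.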