# Detect to Track 

One drawback of many correlation trackers [1, 25] is that they only work on
single targets and do not account for changes in object scale
and aspect ratio.

Our approach builds on R-FCN, which is a simple and
efficient fully convolutional framework for object detection on region proposals.

R-FCN reduces the cost for region classification by pushing
the region-wise operations to the end of the network with
the introduction of a position-sensitive RoI pooling layer
which works on convolutional features that encode the spatially
subsampled class scores of input RoIs.
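
To make the idea concrete, here is a minimal pure-Python sketch of position-sensitive RoI pooling for a single RoI. It follows the general R-FCN recipe (a k×k grid of bins, each bin average-pooled from its own dedicated score map, then the bins averaged into one score per class). The function name `ps_roi_pool`, the channel ordering `(i * k + j) * num_classes + c`, and the use of inclusive integer feature-map coordinates are assumptions made for illustration, not the layout of any particular implementation.

```python
def ps_roi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive RoI pooling for one RoI (illustrative sketch).

    score_maps:  list of k*k*num_classes 2D maps (lists of rows).
                 Assumed ordering: map for bin (i, j) and class c is at
                 index (i * k + j) * num_classes + c.
    roi:         (x1, y1, x2, y2) in feature-map cells, inclusive (assumed).
    k:           the RoI is split into a k x k grid of bins.
    num_classes: number of classes, including background.
    Returns one score per class.
    """
    x1, y1, x2, y2 = roi
    h = y2 - y1 + 1
    w = x2 - x1 + 1
    scores = []
    for c in range(num_classes):
        bin_means = []
        for i in range(k):
            # Row range of bin i: floor for the start, ceil for the end.
            r0 = y1 + (i * h) // k
            r1 = y1 + ((i + 1) * h + k - 1) // k
            for j in range(k):
                c0 = x1 + (j * w) // k
                c1 = x1 + ((j + 1) * w + k - 1) // k
                # Each bin reads only from its own position-specific map.
                fmap = score_maps[(i * k + j) * num_classes + c]
                cells = [fmap[r][col]
                         for r in range(r0, r1)
                         for col in range(c0, c1)]
                bin_means.append(sum(cells) / len(cells))
        # Vote: average the k*k bin responses into one class score.
        scores.append(sum(bin_means) / len(bin_means))
    return scores
```

Because all learnable computation happens before this step, the per-RoI work is just pooling and averaging, which is what keeps region classification cheap.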

# RoI Pooling

At its core, RoIPool shares the forward pass of a CNN for an image across its subregions. Then, the features in each region are pooled (usually using max pooling). The image therefore needs only one forward pass through the CNN, as opposed to one pass per region proposal (~2000).

The convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals, which makes the proposals nearly cost-free.

Faster R-CNN generates these region proposals from CNN features. It adds a fully convolutional network on top of the CNN features, creating what is known as the Region Proposal Network.

The RoI pooling layer takes two inputs:

   1) A fixed-size feature map obtained from a deep convolutional network with several convolution and max-pooling layers.
   
   2) An N x 5 matrix representing a list of regions of interest, where N is the number of RoIs. The first column holds the image index and the remaining four hold the coordinates of the top-left and bottom-right corners of the region.


For every region of interest from the input list, it takes a section of the input feature map that corresponds to it and scales it to some pre-defined size (e.g., 7×7). The scaling is done by:

   -> Dividing the region proposal into equal-sized sections (the number of which is the same as the dimension of the output)
   
   -> Finding the largest value in each section
   
   -> Copying these max values to the output buffer
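
The steps above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `roi_max_pool` is hypothetical, each feature map is a single-channel 2D list, RoI coordinates are already expressed in feature-map cells and are inclusive, and bin edges use floor/ceil so that sections are only approximately equal-sized (and may overlap by one cell) when the RoI does not divide evenly.

```python
def roi_max_pool(feature_maps, rois, output_size=(7, 7)):
    """RoI max pooling (illustrative sketch).

    feature_maps: list of 2D maps (lists of rows), indexed by image index.
    rois:         N rows of [image_idx, x1, y1, x2, y2]; corners are
                  inclusive feature-map coordinates (assumed).
    output_size:  (out_h, out_w) of every pooled output.
    Returns a list of N pooled maps, each out_h x out_w.
    """
    out_h, out_w = output_size
    pooled = []
    for img_idx, x1, y1, x2, y2 in rois:
        fmap = feature_maps[img_idx]
        roi_h = y2 - y1 + 1
        roi_w = x2 - x1 + 1
        out = [[None] * out_w for _ in range(out_h)]
        for i in range(out_h):
            # Step 1: divide the region into sections (row range of bin i).
            r0 = y1 + (i * roi_h) // out_h
            r1 = y1 + ((i + 1) * roi_h + out_h - 1) // out_h
            for j in range(out_w):
                c0 = x1 + (j * roi_w) // out_w
                c1 = x1 + ((j + 1) * roi_w + out_w - 1) // out_w
                # Steps 2 and 3: take the max of the section and store it.
                out[i][j] = max(fmap[r][c]
                                for r in range(r0, r1)
                                for c in range(c0, c1))
        pooled.append(out)
    return pooled


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
rois = [[0, 0, 0, 3, 3],   # whole 4x4 map
        [0, 1, 1, 3, 3]]   # 3x3 region, does not divide evenly by 2
print(roi_max_pool([fmap], rois, output_size=(2, 2)))
# [[[6, 8], [14, 16]], [[11, 12], [15, 16]]]
```

Whatever the size of the input region, the output always has the same fixed shape, which is what lets differently sized proposals feed the same downstream layers.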


# Tracking

Tracking is also an extensively studied problem
in computer vision with most recent progress devoted
to trackers operating on deep ConvNet features. In [26]
a ConvNet is fine-tuned at test-time to track a target from
the same video via detection and bounding box regression.

Since this tracker predicts a bounding box
instead of just the position, it is able to model changes in
scale and aspect of the tracked template. The major drawback
of this approach is that it can only process a single target
template and it also has to rely on significant data augmentation
to learn all possible transformations of tracked
boxes.