# Object Tracking

- **Instructor**: Jongwoo Lim / Jiun Bae
- **Email**: [jlim@hanyang.ac.kr](mailto:jlim@hanyang.ac.kr) / [jiunbae.623@gmail.com](mailto:jiunbae.623@gmail.com)

## Object Tracking (MDNet)

Object tracking is the task of following a target object across consecutive frames of an image sequence.

## Problem in tracking

Traditional approaches to visual tracking rely on hand-crafted features.

![tracking-problem](../assets/tracking-problem.png)

Visual tracking also lacks training data, because each image sequence belongs to a different domain.

## MDNet

MDNet consists of shared layers and domain-specific layers, and each domain is trained separately.

![MDNet](../assets/MDNet.png)
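The split between shared and domain-specific layers can be sketched in PyTorch as follows. This is a toy illustration, not the real MDNet architecture: the class name `MultiDomainNet`, the layer sizes, and the patch size are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class MultiDomainNet(nn.Module):
    """Toy multi-domain network: shared layers plus one binary branch per domain."""

    def __init__(self, num_domains: int, feature_dim: int = 64):
        super().__init__()
        # Shared layers: used by every domain (training sequence).
        self.shared = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, feature_dim),
            nn.ReLU(inplace=True),
        )
        # Domain-specific layers: one target/background classifier per domain.
        self.branches = nn.ModuleList(
            [nn.Linear(feature_dim, 2) for _ in range(num_domains)]
        )

    def forward(self, x: torch.Tensor, domain: int) -> torch.Tensor:
        # Only the branch that belongs to the given domain is applied.
        return self.branches[domain](self.shared(x))


net = MultiDomainNet(num_domains=3)
patches = torch.randn(8, 3, 32, 32)  # 8 hypothetical candidate patches
scores = net(patches, domain=1)      # shape [8, 2]: background and target scores
```

In a training loop, each mini-batch would come from a single domain, so only that domain's branch and the shared layers receive gradients.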

MDNet tactics:

1. Bounding box regression
2. Hard negative mining
3. Consider long-term and short-term changes

![MDNet-Regression](../assets/MDNet-regression.png)
![MDNet-HardNegativeMining](../assets/MDNet-hardnegative.png)
![MDNet-Short-Long-Term](../assets/MDNet-long-short-term.png)

### Online tracking (at inference)

Drop all domain-specific layers and attach a new, randomly initialized branch.
The network is updated when the first frame is given.

![MDNet-online](../assets/MDNet-inference.png)
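A minimal sketch of this step is shown below, assuming stand-in layers rather than the pretrained MDNet weights. The feature size, the random patches, the labels, and the choice to update only the new branch are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)
feature_dim = 64  # hypothetical feature size

# Stand-ins for the pretrained network: shared layers and 3 offline branches.
shared = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, feature_dim), nn.ReLU())
offline_branches = nn.ModuleList([nn.Linear(feature_dim, 2) for _ in range(3)])

# Online tracking: drop every domain-specific branch and attach a fresh one.
del offline_branches
branch = nn.Linear(feature_dim, 2)  # nn.Linear is randomly initialized by default
initial_weight = branch.weight.detach().clone()

# Hypothetical first-frame samples: 16 target patches (1) and 48 background patches (0).
patches = torch.randn(64, 3, 32, 32)
labels = torch.cat([torch.ones(16), torch.zeros(48)]).long()

optimizer = optim.SGD(branch.parameters(), lr=0.01)
for _ in range(5):
    optimizer.zero_grad()
    with torch.no_grad():
        features = shared(patches)  # shared layers are kept fixed in this sketch
    loss = F.cross_entropy(branch(features), labels)
    loss.backward()
    optimizer.step()
```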

# Code

### Import packages

First, import the packages needed to use PyTorch.

- torch.nn: The **Network** of PyTorch basically starts with nn.Module.
- torch.nn.functional: for **Functions** such as *ReLU*, *MaxPool* (in this example)
- torch.optim: for **Optimizers**
- torchvision: for handling **Datasets**

NumPy is the basic scientific computing package that is customarily used.

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
from torchvision import datasets, transforms

import matplotlib.pyplot as plt

## Dataset

In [None]:
from PIL import Image
from IPython.display import display

In [None]:
test_image = '../assets/detection.jpg'

In [None]:
image = Image.open(test_image)

In [None]:
image

## Model SSD VGG16

Single-Shot Multibox Detector (SSD) was published at ECCV 2016. Like other detection algorithms, it predicts the class scores and positions of bounding boxes. For a single input image, bounding box regression and score prediction are performed using default boxes of different aspect ratios and scales on feature maps taken from several stages of the CNN.

![SSD](../assets/SSD.png)

Several feature extraction stages are added after the backbone, and each of them produces a fixed number of class scores and bounding box predictions. For a feature map of size $m \times n$ with $p$ channels, the predictions are made with a small convolution kernel over the $p$ channels. Each cell has $k$ default boxes, and for each box $c$ class scores and 4 offsets are calculated. Therefore each cell has $(c+4)\times k$ filters, and as a result the feature map has $(c + 4) \times k \times m \times n$ outputs.
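As a quick check of the output count, the formula can be evaluated for one feature map. The numbers used here ($38 \times 38$ cells, $k = 4$ default boxes, $c = 21$ classes) are assumed values for illustration, and `num_outputs` is a hypothetical helper.

```python
def num_outputs(m: int, n: int, k: int, c: int) -> int:
    """Number of predicted values for an m x n feature map.

    Each of the m * n cells has k default boxes, and each box gets
    c class scores plus 4 box offsets.
    """
    return (c + 4) * k * m * n


filters_per_cell = (21 + 4) * 4          # (c + 4) * k
outputs = num_outputs(m=38, n=38, k=4, c=21)
print(filters_per_cell, outputs)
```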

When training, apply the loss function and backpropagation as in other neural networks. During training, you can adjust the number and scale of the default boxes described below, and use hard negative mining and data augmentation to improve performance.

The total loss function is calculated as the weighted sum of the localization loss (loc) and the confidence loss (conf).

$$L(x, c, l, g) = \frac{1}{N} \left( L_{conf}(x, c) + \alpha L_{loc}(x, l, g) \right)$$

$x^p_{ij} \in \{1, 0\}$ is an indicator for matching the $i$-th default box to the $j$-th ground truth box of class $p$, and $N$ is the number of matched default boxes.

The localization loss is the smooth L1 loss between the predicted box $l$ and the ground truth bounding box $g$.

$$L_{loc}(x, l, g) = \sum^N_{i \in Pos} \sum_{m \in \{ cx, cy, w, h\} } x^k_{ij} smooth_{L1}(l^m_i - \hat{g}^m_j)$$
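The smooth L1 term can be written out by hand and compared with `F.smooth_l1_loss`, which the `Loss` class at the end of this notebook also uses. The box values are made-up numbers, and `smooth_l1` is a helper defined only for this sketch.

```python
import torch
import torch.nn.functional as F


def smooth_l1(diff: torch.Tensor) -> torch.Tensor:
    """Element-wise smooth L1: quadratic for |d| < 1, linear otherwise."""
    abs_diff = diff.abs()
    return torch.where(abs_diff < 1, 0.5 * diff ** 2, abs_diff - 0.5)


# One matched (positive) default box: predicted vs. encoded ground truth (cx, cy, w, h).
pred = torch.tensor([[0.2, -0.1, 1.5, 0.0]])
target = torch.tensor([[0.0, 0.0, 0.0, 2.0]])

manual = smooth_l1(pred - target).sum()
builtin = F.smooth_l1_loss(pred, target, reduction='sum')
print(manual.item(), builtin.item())
```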

The confidence loss is the softmax loss over the multiple class confidences $c$.

$$L_{conf}(x, c) = - \sum^N_{i \in Pos} x^p_{ij} \log(\hat{c}^p_i) - \sum_{i \in Neg} \log(\hat{c}^0_i) \quad \text{where} \quad \hat{c}^p_i = \frac{\exp(c^p_i)}{\sum_p \exp(c^p_i)}$$
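The softmax loss can be checked numerically. In the sketch below the logits and labels are made-up values; positives carry their matched class label and negatives carry the background label 0, so both sums in the formula reduce to one cross-entropy over the selected boxes.

```python
import torch
import torch.nn.functional as F

# Raw class scores c for three default boxes and three classes (0 = background).
logits = torch.tensor([[2.0, 0.5, -1.0],   # positive, matched to class 1
                       [0.1, 0.2, 3.0],    # positive, matched to class 2
                       [1.5, 0.0, 0.3]])   # negative, label is background
labels = torch.tensor([1, 2, 0])

# log of the softmax: c - log(sum(exp(c)))
log_probs = logits - torch.logsumexp(logits, dim=1, keepdim=True)
manual = -log_probs[torch.arange(3), labels].sum()

builtin = F.cross_entropy(logits, labels, reduction='sum')
print(manual.item(), builtin.item())
```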

## Codes

In [None]:
from SSD.model import VGG16

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
model = VGG16.new(21, 1).to(device)
model.eval()
model.load(torch.load('../data/vgg16-pretrained.pth', map_location=lambda s, l: s))

### Image transform

In [None]:
from SSD.lib.augmentation import Compose, ToPercentCoords, Resize, SubtractMeans, ConvertFromInts

In [None]:
transform = Compose([
    ConvertFromInts(),
    ToPercentCoords(),
    Resize((300, 300)),
    SubtractMeans((123, 117, 104)),
    lambda img, boxes=None, labels=None: (img / 1., boxes, labels),
])

In [None]:
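import cv2  # needed here; it was previously imported only in the Visualize section

inputs = np.array(image)
inputs = cv2.resize(inputs, (300, 300)).astype(np.float32)  # SSD300 input size
inputs -= (104, 117, 123)  # subtract per-channel means
inputs = inputs[:, :, ::-1].copy()  # reverse the channel order
inputs = torch.from_numpy(inputs).permute(2, 0, 1)  # HWC -> CHW
inputs = Variable(inputs.unsqueeze_(0), requires_grad=False)  # add batch dimension
inputs = inputs.to(device)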

In [None]:
outputs = model(inputs)

## Inference

In [None]:
num_classes = 21
class_names = ('BACKGROUND',
               'aeroplane', 'bicycle', 'bird', 'boat',
               'bottle', 'bus', 'car', 'cat', 'chair',
               'cow', 'diningtable', 'dog', 'horse',
               'motorbike', 'person', 'pottedplant',
               'sheep', 'sofa', 'train', 'tvmonitor')

In [None]:
detection = np.empty((0, 6), dtype=np.float32)

for klass, boxes in enumerate(outputs[0]):
    candidates = boxes[boxes[:, 0] >= .3]
    print(klass, candidates)

    if candidates.size(0) == 0:
        continue

    detection = np.concatenate((
        detection,
        np.hstack((
            np.full((np.size(candidates, 0), 1), klass, dtype=np.uint8),
            candidates.cpu().detach().numpy(),
        )),
    ))

## Visualize

In [None]:
import cv2

In [None]:
def show(ary):
    display(Image.fromarray(ary))

In [None]:
img = np.array(image)
h, w, c = img.shape

In [None]:
# one random color per class, converted to plain Python ints for cv2
colors = [tuple(int(v) for v in np.random.randint(0, 256, size=3)) for _ in range(21)]

In [None]:
for klass, conf, x, y, x2, y2 in detection:
    if conf < .3:
        continue
    try:
        cv2.rectangle(img, (int(x *w ), int(y *h)), (int(x2 * w), int(y2 * h)), colors[int(klass)], 2)
    except Exception as e:
        pass

In [None]:
show(img)

## Step into the model

In [None]:
model

### Forward pass in SSD(VGG16)

In [None]:
from functools import reduce
from typing import Tuple, Union

def forward(self, x: torch.Tensor) \
            -> Union[Tuple[torch.Tensor, torch.Tensor, torch.Tensor], torch.Tensor]:
    """Applies network layers and ops on input image(s) x.

    Args:
        x: input image or batch of images. Shape: [batch,3,300,300].

    Return:
        Depending on phase:
        test:
            Variable(tensor) of output class label predictions,
            confidence score, and corresponding location predictions for
            each object detected. Shape: [batch, topk, 7]

        train:
            list of concat outputs from:
                1: confidence layers, Shape: [batch*num_priors, num_classes]
                2: localization layers, Shape: [batch, num_priors*4]
                3: priorbox layers, Shape: [2, num_priors*4]
    """
    def _forward(tensor: torch.Tensor, module: nn.Module) \
            -> torch.Tensor:
        return module.forward(tensor)

    start, sources = 0, []

    # forward layers for extract sources
    for index, layer, *_ in self.appendix:
        x = reduce(_forward, [x, *self.features[start:index]])

        if isinstance(layer, GraphPath):
            x, y = layer(x, self.features[index])
            index += 1

        elif layer is not None:
            y = layer(x)

        else:
            y = x

        sources.append(y)
        start = index

    # forward remain parts
    x = reduce(_forward, [x, *self.features[start:]])

    for i, layer in enumerate(self.extras):
        x = _forward(x, layer)
        sources.append(x)

    def refine(source: torch.Tensor) \
            -> torch.Tensor:
        return source.permute(0, 2, 3, 1).contiguous()

    def reshape(tensor: torch.Tensor) \
            -> torch.Tensor:
        return torch.cat(tuple(map(lambda t: t.view(t.size(0), -1), tensor)), 1)

    locations, confidences = map(reshape, zip(*[(refine(loc(source)), refine(conf(source)))
                                                for source, loc, conf in zip(sources, self.loc, self.conf)]))

    locations = locations.view(self.batch_size, -1, 4)
    confidences = confidences.view(self.batch_size, -1, self.num_classes)

    output = (locations, confidences, self.priors.to(x.device))

    if not self.training:
        output = self.detect(*output).to(x.device)

    return output

## Step by Step forward

### First extract feature from inputs

In [None]:
from functools import reduce

from SSD.layers import GraphPath

In [None]:
def _forward(tensor: torch.Tensor, module: nn.Module) \
        -> torch.Tensor:
    return module.forward(tensor)

In [None]:
x = inputs

In [None]:
start, sources = 0, []

In [None]:
# forward layers for extract sources
for index, layer, *_ in model.appendix:
    x = reduce(_forward, [x, *model.features[start:index]])

    if isinstance(layer, GraphPath):
        x, y = layer(x, model.features[index])
        index += 1

    elif layer is not None:
        y = layer(x)

    else:
        y = x

    sources.append(y)
    start = index

# forward remain parts
x = reduce(_forward, [x, *model.features[start:]])

## Q1. Calculate output shape of x

The input has shape [1, 3, 300, 300] and passes through the following network:
```
Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU(inplace=True)
  (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): ReLU(inplace=True)
  (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (6): ReLU(inplace=True)
  (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (8): ReLU(inplace=True)
  (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): ReLU(inplace=True)
  (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (13): ReLU(inplace=True)
  (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (15): ReLU(inplace=True)
  (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=True)
  (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (18): ReLU(inplace=True)
  (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (20): ReLU(inplace=True)
  (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (22): ReLU(inplace=True)
  (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (25): ReLU(inplace=True)
  (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (27): ReLU(inplace=True)
  (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (29): ReLU(inplace=True)
  (30): MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
  (31): Conv2d(512, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(6, 6), dilation=(6, 6))
  (32): ReLU(inplace=True)
  (33): Conv2d(1024, 1024, kernel_size=(1, 1), stride=(1, 1))
  (34): ReLU(inplace=True)
)
```

What is the output shape?

In [None]:
x.shape
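One way to verify the answer by hand is to follow the spatial size through the layers that can change it. The sketch below uses the standard output-size formula for convolution and pooling; `out_size` is a helper written for this example, and it ignores PyTorch's extra `ceil_mode` boundary rule, which does not matter for these sizes.

```python
import math


def out_size(size, kernel, stride=1, padding=0, dilation=1, ceil_mode=False):
    """Spatial output size of a conv or pooling layer along one dimension."""
    value = (size + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1
    return math.ceil(value) if ceil_mode else math.floor(value)


size = 300
# 3x3 convolutions with padding=1 keep the size, so only the other layers are listed.
size = out_size(size, kernel=2, stride=2)                   # (4)  MaxPool2d
size = out_size(size, kernel=2, stride=2)                   # (9)  MaxPool2d
size = out_size(size, kernel=2, stride=2, ceil_mode=True)   # (16) MaxPool2d, ceil_mode
size = out_size(size, kernel=2, stride=2)                   # (23) MaxPool2d
size = out_size(size, kernel=3, stride=1, padding=1)        # (30) MaxPool2d
size = out_size(size, kernel=3, padding=6, dilation=6)      # (31) dilated Conv2d
size = out_size(size, kernel=1)                             # (33) 1x1 Conv2d

channels = 1024  # output channels of layer (33)
print([1, channels, size, size])
```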

## Step continue

In [None]:
for i, layer in enumerate(model.extras):
    x = _forward(x, layer)
    sources.append(x)

In [None]:
[v.shape for v in sources]

## Q2. What is in sources?

`sources` contains tensors of the following shapes:
```
[
    torch.Size([1, 512, 38, 38]),
    torch.Size([1, 1024, 19, 19]),
    torch.Size([1, 512, 10, 10]),
    torch.Size([1, 256, 5, 5]),
    torch.Size([1, 256, 3, 3]),
    torch.Size([1, 256, 1, 1])
]
```

What are these tensors, and what are they used for?

## Step continue

In [None]:
def refine(source: torch.Tensor) \
        -> torch.Tensor:
    return source.permute(0, 2, 3, 1).contiguous()

def reshape(tensor: torch.Tensor) \
        -> torch.Tensor:
    return torch.cat(tuple(map(lambda t: t.view(t.size(0), -1), tensor)), 1)

locations, confidences = map(reshape, zip(*[(refine(loc(source)), refine(conf(source)))
                                            for source, loc, conf in zip(sources, model.loc, model.conf)]))

locations = locations.view(model.batch_size, -1, 4)
confidences = confidences.view(model.batch_size, -1, model.num_classes)

output = (locations, confidences, model.priors.to(x.device))

In [None]:
locations.shape

In [None]:
confidences.shape
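The `refine` and `reshape` steps can be traced on dummy tensors. In this sketch the two source shapes are taken from the list above, while the localization heads and the 4 boxes per cell are assumptions, not the layers stored in `model.loc`.

```python
import torch
import torch.nn as nn

boxes_per_cell = 4  # assumed number of default boxes per cell

# Two dummy source feature maps and hypothetical localization heads.
sources = [torch.randn(1, 512, 38, 38), torch.randn(1, 1024, 19, 19)]
loc_heads = [
    nn.Conv2d(512, boxes_per_cell * 4, kernel_size=3, padding=1),
    nn.Conv2d(1024, boxes_per_cell * 4, kernel_size=3, padding=1),
]

flat = []
for source, head in zip(sources, loc_heads):
    out = head(source)                          # [1, k*4, H, W]
    out = out.permute(0, 2, 3, 1).contiguous()  # [1, H, W, k*4], channels last
    flat.append(out.view(out.size(0), -1))      # [1, H*W*k*4]

# Concatenate all feature maps, then group every 4 values into one box.
locations = torch.cat(flat, 1).view(1, -1, 4)
print(locations.shape)  # one row of 4 offsets per default box
```

Moving the channel dimension last before flattening keeps the 4 offsets of each box next to each other, so the final `view(..., -1, 4)` groups them correctly.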

## Q3. Meaning of numbers

locations.shape is [1, 8732, 4] and confidences.shape is [1, 8732, 21].

What does each number mean?
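As a hint, the number of default boxes can be recomputed from the feature map sizes in `sources`. The boxes-per-cell values below are an assumption (the common SSD300 configuration); the actual values are defined by the model.

```python
# Spatial sizes of the six source feature maps (from the shapes in Q2).
feature_map_sizes = [38, 19, 10, 5, 3, 1]
# Assumed number of default boxes per cell for each feature map.
boxes_per_cell = [4, 6, 6, 6, 4, 4]

num_priors = sum(size * size * k for size, k in zip(feature_map_sizes, boxes_per_cell))
print(num_priors)
```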

### Priors are the default anchor boxes

In [None]:
priors =  model.priors.to(x.device)

In [None]:
priors

In [None]:
ww, hh = 1000, 1000
img = np.zeros((ww, hh, 3), dtype=np.uint8)

In [None]:
for x, y, x2, y2 in priors:
    cv2.rectangle(img, (int(x * ww), int(y * hh)), (int(x2 * ww), int(y2 * hh)), (255, 0, 0), 1)

In [None]:
show(img)
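To see where such boxes come from, here is a simplified sketch of default box generation for one feature map. It is not the code behind `model.priors`: the function names, the scale, and the aspect ratios are assumptions, and the real SSD configuration adds further boxes per cell.

```python
from itertools import product
from math import sqrt


def default_boxes(fmap_size, scale, ratios=(1.0, 2.0, 0.5)):
    """Default boxes as (cx, cy, w, h) in relative [0, 1] coordinates."""
    boxes = []
    for i, j in product(range(fmap_size), repeat=2):
        # Each box is centered on its feature map cell.
        cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
        for ratio in ratios:
            # Same area for every ratio, different width/height.
            boxes.append((cx, cy, scale * sqrt(ratio), scale / sqrt(ratio)))
    return boxes


def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


boxes = default_boxes(fmap_size=3, scale=0.5)
print(len(boxes), to_corners(boxes[0]))
```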

## Step continue

### Loss calculate

In [None]:
from typing import Tuple

class Loss(nn.Module):
    """SSD Weighted Loss Function
    Compute Targets:
        1) Produce Confidence Target Indices by matching  ground truth boxes
           with (default) 'priorboxes' that have jaccard index > threshold parameter
           (default threshold: 0.5).
        2) Produce localization target by 'encoding' variance into offsets of ground
           truth boxes and their matched  'priorboxes'.
        3) Hard negative mining to filter the excessive number of negative examples
           that comes with using a large number of default bounding boxes.
           (default negative:positive ratio 3:1)
    Objective Loss:
        L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        Where, Lconf is the CrossEntropy Loss and Lloc is the SmoothL1 Loss
        weighted by α which is set to 1 by cross val.
        Args:
            c: class confidences,
            l: predicted boxes,
            g: ground truth boxes
            N: number of matched default boxes
        See: https://arxiv.org/pdf/1512.02325.pdf for more details.
    """

    def forward(self, predictions: Tuple[torch.Tensor, torch.Tensor, torch.Tensor], targets: torch.Tensor):
        """Multibox Loss
        Args:
            predictions (tuple): A tuple containing loc preds, conf preds,
            and prior boxes from SSD net.
                conf shape: torch.size(batch_size,num_priors,num_classes)
                loc shape: torch.size(batch_size,num_priors,4)
                priors shape: torch.size(num_priors,4)

            targets (tensor): Ground truth boxes and labels for a batch,
                shape: [batch_size,num_objs,5] (last idx is the label).
        """
        loc_data, conf_data, priors = predictions
        num = loc_data.size(0)
        priors = priors[:loc_data.size(1), :]
        num_priors = (priors.size(0))

        # match priors (default boxes) and ground truth boxes
        loc_t, conf_t = torch.Tensor(num, num_priors, 4), torch.LongTensor(num, num_priors)

        for idx in range(num):
            truths, labels = targets[idx][:, :-1].data, targets[idx][:, -1].data
            defaults = priors.data

            match(self.threshold, truths, defaults, self.variance, labels, loc_t, conf_t, idx)

        loc_t = Variable(loc_t.to(self.device), requires_grad=False)
        conf_t = Variable(conf_t.to(self.device), requires_grad=False)

        pos = conf_t > 0

        # Localization Loss (Smooth L1)
        # Shape: [batch,num_priors,4]
        pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
        loc_p, loc_t = loc_data[pos_idx].view(-1, 4), loc_t[pos_idx].view(-1, 4)
        loss_l = F.smooth_l1_loss(loc_p, loc_t, reduction='sum')

        # Compute max conf across batch for hard negative mining
        batch_conf = conf_data.view(-1, self.num_classes)
        loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))

        # Hard Negative Mining
        loss_c = loss_c.view(num, -1)
        loss_c[pos] = 0  # filter out pos boxes for now

        _, loss_idx = loss_c.sort(1, descending=True)
        _, idx_rank = loss_idx.sort(1)
        num_pos = pos.long().sum(1, keepdim=True)
        num_neg = torch.clamp(self.negpos_ratio*num_pos, max=pos.size(1)-1)
        neg = idx_rank < num_neg.expand_as(idx_rank)

        # Confidence Loss Including Positive and Negative Examples
        pos_idx = pos.unsqueeze(2).expand_as(conf_data)
        neg_idx = neg.unsqueeze(2).expand_as(conf_data)

        conf_p = conf_data[(pos_idx+neg_idx).gt(0)].view(-1, self.num_classes)
        targets_weighted = conf_t[(pos+neg).gt(0)]
        loss_c = F.cross_entropy(conf_p, targets_weighted, reduction='sum')

        N = num_pos.data.sum().double()

        loss_l, loss_c = loss_l.double() / N, loss_c.double() / N

        return loss_l, loss_c
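The hard negative mining step relies on sorting twice to obtain the rank of each prior. The toy example below repeats those lines from `forward` on made-up loss values for one image with six priors, one of which is positive.

```python
import torch

# Per-prior confidence loss for one image; the positive prior is already zeroed.
loss_c = torch.tensor([[0.2, 0.0, 1.5, 0.7, 0.1, 0.9]])
pos = torch.tensor([[False, True, False, False, False, False]])
negpos_ratio = 3  # negative:positive ratio

_, loss_idx = loss_c.sort(1, descending=True)  # prior indices ordered by loss
_, idx_rank = loss_idx.sort(1)                 # rank of each prior in that order
num_pos = pos.long().sum(1, keepdim=True)
num_neg = torch.clamp(negpos_ratio * num_pos, max=pos.size(1) - 1)
neg = idx_rank < num_neg.expand_as(idx_rank)   # keep the highest-loss negatives

print(idx_rank.tolist(), neg.tolist())
```

With one positive and a 3:1 ratio, the three priors with the largest loss (indices 2, 5 and 3) are kept as negatives.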