<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Training a Multi-Object Tracking Model

In this notebook, we give an introduction to training a multi-object tracking model using [torchvision](https://pytorch.org/docs/stable/torchvision/index.html). Using a small dataset, we demonstrate how to train and evaluate a one-shot multi-object tracking model, which jointly detects objects and learns their re-ID features. In particular, we use FairMOT, the one-shot tracking model developed by MSR Asia and others in this [repo](https://github.com/ifzhang/FairMOT).

To learn more about how multi-object tracking works, visit our [FAQ](./FAQ.md).

## Initialization

Import all the functions we need.

In [4]:
import sys

sys.path.append("../../")

import os
import time
import matplotlib.pyplot as plt
from typing import Iterator, Tuple
from pathlib import Path
from PIL import Image
from random import randrange
import torch
import torchvision
from torchvision import transforms
import scrapbook as sb

#KIP
from utils_cv.multi_object_tracking.file_format import convert_vott_MOTxywh
from utils_cv.multi_object_tracking.display_with_bb import convert_trackingbbox_video
from ipywidgets import Video
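from utils_cv.tracking.dataset import TrackingDataset
from utils_cv.tracking.model import TrackingLearner  # assumed module path; adjust to wherever TrackingLearner is defined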

#from utils_cv.classification.data import Urls as UrlsIC
#from utils_cv.common.data import unzip_url, data_path
#from utils_cv.detection.data import Urls
#from utils_cv.detection.dataset import DetectionDataset, get_transform
# from utils_cv.detection.plot import (
#     plot_grid,
#     plot_boxes,
#     plot_pr_curves,
#     PlotSettings,
#     plot_counts_curves,
#     plot_detections
# )
# from utils_cv.detection.model import (
#     DetectionLearner,
#     get_pretrained_fasterrcnn,
# )
from utils_cv.common.gpu import which_processor, is_windows

# Change matplotlib backend so that plots are shown for windows
if is_windows():
    plt.switch_backend("TkAgg")

print(f"TorchVision: {torchvision.__version__}")
which_processor()

TorchVision: 0.4.0a0+6b959ee
Torch is using GPU: Tesla K80


This shows your machine's GPUs (if it has any) and the computing device `torch/torchvision` is using.

In [5]:
# Ensure edits to libraries are loaded and plotting is shown in the notebook.
%reload_ext autoreload
%autoreload 2
%matplotlib inline

Next, set some model runtime parameters.

In [6]:
EPOCHS = 10
LEARNING_RATE = 0.0001
#IM_SIZE = 500 #KIP: checked, for OD, DetectionLearner uses it if model is not None, else set to 500
SAVE_MODEL = True

# train on the GPU or on the CPU, if a GPU is not available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using torch device: {device}")

Using torch device: cuda


---

# Prepare Datasets for training and object tracking evaluation

## Annotated images for training

For the training data, we use a toy dataset called *Fridge Objects*, which consists of 134 images of 4 classes of beverage containers `{can, carton, milk bottle, water bottle}`, photographed on different backgrounds. The same dataset is used in the object detection scenario. It serves as a simple illustration of how finetuning a pre-trained tracking model on a small dataset can greatly improve its performance.

Similar to the object detection [training introduction notebook](../detection/01_training_introduction.ipynb), we use a helper function that downloads and unzips the dataset to the `ComputerVision/data` directory. #TODO: correct if needed

Set that directory in the `path` variable for ease of use throughout the notebook.

In [7]:
DATA_PATH_TRAIN = "./data/odFridgeObjects_FairMOTformat/" #unzip_url(Urls.fridge_objects_path, exist_ok=True) #TODO
path = Path(DATA_PATH_TRAIN)
os.listdir(path)

['.ipynb_checkpoints', 'FridgeObjects.train', 'images', 'labels_with_ids']

You'll notice that the directory contains two folders and a file:
- `/images/`
- `/labels_with_ids/`
- `/FridgeObjects.train`

This format for object detection and object tracking is fairly common. Compared to object detection, the `labels_with_ids` files used for object tracking have an additional field for the id number.

```
/data
+-- images
|   +-- 00001.jpg
|   +-- 00002.jpg
|   +-- ...
+-- labels_with_ids
|   +-- 00001.txt
|   +-- 00002.txt
|   +-- ...
+-- ...
```

Each image has a corresponding txt file with the same base name. For example, the txt file '00128.txt' contains the detection and tracking information, i.e. the bounding boxes and object ids, for the image file '00128.jpg'. Our fridge object dataset is annotated in the format followed by the [FairMOT repo](https://github.com/ifzhang/FairMOT), which originates from the [Towards-Realtime-MOT repo](https://github.com/Zhongdao/Towards-Realtime-MOT/blob/master/DATASET_ZOO.md). For example, '00128.txt' contains the following:

```
0 3 0.35671 0.50450 0.17635 0.23724
0 2 0.67335 0.49399 0.36874 0.57057

```
This follows the FairMOT file format, where each line describes one bounding box, as documented in the [Towards-Realtime-MOT repo](https://github.com/Zhongdao/Towards-Realtime-MOT/blob/master/DATASET_ZOO.md):
```
[class] [identity] [x_center] [y_center] [width] [height]
```
- The field `class` is set to `0` for all boxes, as only single-class multi-object tracking is supported.
- The field `identity` is an integer from `0` to `num_identities - 1`. In this training dataset, we used this dictionary to convert the original class labels to ids: `{'milk_bottle': 0, 'water_bottle': 1, 'carton': 2, 'can': 3}`.
- The values of `[x_center] [y_center] [width] [height]` are normalized by the width/height of the image, and range from `0` to `1`.
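
To make the label format concrete, here is a minimal sketch that parses one label line and converts the normalized, center-based box to pixel corner coordinates. The names `LabelRow`, `parse_label_line` and `to_pixel_box` are illustrative helpers, not part of FairMOT or `utils_cv`, and the 1000x1000 image size is an assumption chosen only for the example.

```python
from typing import NamedTuple, Tuple


class LabelRow(NamedTuple):
    """One line of a labels_with_ids file (hypothetical helper type)."""
    class_id: int
    identity: int
    x_center: float
    y_center: float
    width: float
    height: float


def parse_label_line(line: str) -> LabelRow:
    """Parse '[class] [identity] [x_center] [y_center] [width] [height]'."""
    cls, ident, xc, yc, w, h = line.split()
    return LabelRow(int(cls), int(ident), float(xc), float(yc), float(w), float(h))


def to_pixel_box(row: LabelRow, img_w: int, img_h: int) -> Tuple[float, float, float, float]:
    """Convert a normalized center-format box to pixel (x_min, y_min, x_max, y_max)."""
    cx, cy = row.x_center * img_w, row.y_center * img_h
    half_w, half_h = row.width * img_w / 2, row.height * img_h / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# First line of the '00128.txt' example above; the image size is illustrative.
row = parse_label_line("0 3 0.35671 0.50450 0.17635 0.23724")
box = to_pixel_box(row, img_w=1000, img_h=1000)
print(row.identity, box)
```

Going the other way (pixel corners to normalized centers) is the same arithmetic in reverse, which is what an annotation converter for this format has to do.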

In addition to the image and label files in their respective folders, FairMOT also requires an info file that lists the path of every image frame used for training, including the root path of the 'images' folder. For instance, the first few lines of our info file, `FridgeObjects.train`, are:
```
./data/odFridgeObjects_FairMOTformat/images/00001.jpg
./data/odFridgeObjects_FairMOTformat/images/00002.jpg
./data/odFridgeObjects_FairMOTformat/images/00003.jpg
```
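
An info file like this can be generated with a few lines of Python. The sketch below is one possible way to do it, not the method used to build `FridgeObjects.train`; the function name `write_info_file` is hypothetical, and the demo runs on a temporary directory with empty placeholder files rather than the real dataset.

```python
import tempfile
from pathlib import Path
from typing import List


def write_info_file(images_dir: str, info_file: str) -> List[str]:
    """Write one image path per line, prefixed with images_dir, in sorted order."""
    # Build the lines as strings so a leading "./" in images_dir is preserved.
    root = images_dir.rstrip("/")
    names = sorted(p.name for p in Path(images_dir).glob("*.jpg"))
    lines = [f"{root}/{name}" for name in names]
    Path(info_file).write_text("\n".join(lines) + "\n")
    return lines


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "images"
    images.mkdir()
    for name in ("00002.jpg", "00001.jpg", "notes.txt"):
        (images / name).touch()
    info_path = Path(tmp) / "Demo.train"
    listed = write_info_file(f"{images.as_posix()}/", str(info_path))
    written = info_path.read_text().splitlines()
```

For the dataset in this notebook, the call would look like `write_info_file("./data/odFridgeObjects_FairMOTformat/images/", "FridgeObjects.train")`, assuming that directory layout.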

## Annotated video segment for tracking evaluation (*new)
In this section, we cover how to annotate the following video segment for a scenario in which we want to track the moving cans, which are similar to the objects in the FridgeObject dataset used for training.

In [2]:
video_file = "./data/carcans_1s.mp4"

video = Video.from_file(video_file)
video

Video(value=b'\x00\x00\x00\x18ftypmp42\x00\x00\x00\x00mp41isom\x00\x00\x00(uuid\\\xa7\x08\xfb2\x8eB\x05\xa8ae\…

We use an annotation tool, such as VoTT, to annotate the cans. Please see the [FAQ](./FAQ.md) for more details on MOT annotation. For example, in the video, we can draw bounding boxes around the 2 cans, and tag them as `can_1` and `can_2`:
<p align="center">
<img src="./media/carcans_vott_ui.png" width="600" align="center"/>
</p>

Before annotating, make sure to set the extraction rate to 30fps, to match the frame rate of the video. After annotating, export the results in csv format. You will end up with the extracted frames as well as a csv file containing the bounding box and id information: `[image] [xmin] [ymin] [xmax] [ymax] [label]`

In [5]:
annot_path = "./data/carcans_vott-csv-export/" #unzip_url(Urls.fridge_objects_path, exist_ok=True) #TODO
path = Path(annot_path)
os.listdir(path)

['Carcans_GT-export.csv',
 'carcans_1s.mp4#t=0.033333.jpg',
 'carcans_1s.mp4#t=0.066667.jpg',
 'carcans_1s.mp4#t=0.1.jpg',
 'carcans_1s.mp4#t=0.133333.jpg',
 'carcans_1s.mp4#t=0.166667.jpg',
 'carcans_1s.mp4#t=0.2.jpg',
 'carcans_1s.mp4#t=0.233333.jpg',
 'carcans_1s.mp4#t=0.266667.jpg',
 'carcans_1s.mp4#t=0.3.jpg',
 'carcans_1s.mp4#t=0.333333.jpg',
 'carcans_1s.mp4#t=0.366667.jpg',
 'carcans_1s.mp4#t=0.4.jpg',
 'carcans_1s.mp4#t=0.433333.jpg',
 'carcans_1s.mp4#t=0.466667.jpg',
 'carcans_1s.mp4#t=0.5.jpg',
 'carcans_1s.mp4#t=0.533333.jpg',
 'carcans_1s.mp4#t=0.566667.jpg',
 'carcans_1s.mp4#t=0.6.jpg',
 'carcans_1s.mp4#t=0.633333.jpg',
 'carcans_1s.mp4#t=0.666667.jpg',
 'carcans_1s.mp4#t=0.7.jpg',
 'carcans_1s.mp4#t=0.733333.jpg',
 'carcans_1s.mp4#t=0.766667.jpg',
 'carcans_1s.mp4#t=0.8.jpg',
 'carcans_1s.mp4#t=0.833333.jpg',
 'carcans_1s.mp4#t=0.866667.jpg',
 'carcans_1s.mp4#t=0.9.jpg',
 'carcans_1s.mp4#t=0.933333.jpg',
 'carcans_1s.mp4#t=0.966667.jpg',
 'carcans_1s.mp4#t=0.jpg',
 'carc

Evaluation of the tracking performance will be carried out using the [py-motmetrics](https://github.com/cheind/py-motmetrics) repository, which provides multiple metrics for benchmarking multi-object trackers. Its `motmetrics` package takes in the ground-truth data in a format similar to that used in the [MOT challenge](https://motchallenge.net/), i.e.: 
```
[frame number] [id number] [bbox left] [bbox top] [bbox width] [bbox height] [confidence score] [class] [visibility]
```
The last 3 columns can be set to -1 by default. To convert the VoTT annotation data to the MOT challenge format, you can use the following utility function, `convert_vott_MOTxywh()`:

In [None]:
DATA_PATH_EVAL_VOTT =  "./data/carcans_vott_format/"
DATA_PATH_EVAL = "./data/carcans_MOTformat/"
convert_vott_MOTxywh(DATA_PATH_EVAL_VOTT, DATA_PATH_EVAL) #TODO KIP

path = Path(DATA_PATH_EVAL)
os.listdir(path)
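
The core of such a conversion is turning corner coordinates into left/top/width/height and mapping each label to a numeric id. The sketch below illustrates that arithmetic only; it is not the implementation of `convert_vott_MOTxywh()`. The function `vott_rows_to_mot`, the `frame_of` mapping from image name to frame number, the CSV column names, and the demo rows are all assumptions made for the example.

```python
import csv
import io
from typing import Dict, List


def vott_rows_to_mot(csv_text: str, frame_of: Dict[str, int]) -> List[str]:
    """Turn CSV rows (image, xmin, ymin, xmax, ymax, label) into MOT-style lines."""
    ids: Dict[str, int] = {}  # label -> track id, assigned in order of first appearance
    lines = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        track_id = ids.setdefault(rec["label"], len(ids) + 1)
        left, top = float(rec["xmin"]), float(rec["ymin"])
        width = float(rec["xmax"]) - left
        height = float(rec["ymax"]) - top
        frame = frame_of[rec["image"]]
        # Confidence, class and visibility are set to -1, as described above.
        lines.append(
            f"{frame},{track_id},{left:.2f},{top:.2f},{width:.2f},{height:.2f},-1,-1,-1"
        )
    return lines


# Made-up rows for illustration; real exports use the extracted frame file names.
demo_csv = (
    '"image","xmin","ymin","xmax","ymax","label"\n'
    '"frame_a.jpg",10,20,110,220,"can_1"\n'
    '"frame_a.jpg",300,40,380,200,"can_2"\n'
    '"frame_b.jpg",12,22,112,222,"can_1"\n'
)
mot_lines = vott_rows_to_mot(demo_csv, {"frame_a.jpg": 1, "frame_b.jpg": 2})
print("\n".join(mot_lines))
```

Keeping the label-to-id mapping stable across frames is the important part: the same `can_1` must receive the same id in every frame for the identity metrics to be meaningful.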

# Load Training Dataset

To load the data, we create a dataset object that our `TrackingLearner` wrapper class accepts, in a form that can be recognized by the FairMOT repo code saved in the [multi_object_tracking/references folder](../../utils_cv/multi_object_tracking/references).



In [5]:
#TODO check Casey's Dataset class format
data_train = TrackingDataset(DATA_PATH_TRAIN, {"custom":"./data/FridgeObjects.train"}) #TODO: check with Casey why using a Dict


# Finetune a Pretrained Model

For the `TrackingLearner`, we use FairMOT's baseline tracking model, which uses PyTorch's Adam optimizer by default.

FairMOT's baseline tracking model is pre-trained on pedestrian datasets, like the [MOT challenge datasets](https://motchallenge.net/). Hence, it does not detect fridge objects at all, such as the cans in the evaluation video.

When we initialize the TrackingLearner, we can pass in the training dataset. By default, the object will set the model to FairMOT's baseline tracking model. #TODO: add option for load_model?

In [None]:
tracker = TrackingLearner(data_train) 
print(f"Model: {type(tracker.model)}")

To run the training, we call the `fit(...)` method of the tracker object. The fit parameters include `num_epochs`, `lr`, and `batch_size`. If they are not set, they default to `30`, `0.0001`, and `12`, respectively.

In [None]:
#TODO: hard-code the 10 epochs in code? coming from initialization of baseline checkpoint
tracker.fit(num_epochs=EPOCHS+10, lr=LEARNING_RATE, batch_size=2) #KIP: Other params include lr_step, 

We can now plot the losses over the training epochs, and see how the model improves with training. The losses include detection-specific losses (e.g. `hm_loss`, `wh_loss`, `off_loss`) and id-specific losses (`id_loss`). The overall loss (`loss`) is a weighted average of the detection-specific and id-specific losses. We want to choose `num_epochs` and `lr` (to be tuned by the user) such that the loss curve tails off by the end of training.

In [None]:
tracker.plot_training_losses() #TODO: add method to TrackingLearner class
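
One simple way to check whether a loss curve has tailed off is to compare the latest loss with the loss a few epochs earlier. The sketch below is only a heuristic illustration: `has_tailed_off`, its `window` and `rel_tol` defaults, and the loss values are all made up for the example and are not part of `TrackingLearner`.

```python
from typing import Sequence


def has_tailed_off(losses: Sequence[float], window: int = 3, rel_tol: float = 0.01) -> bool:
    """Return True if the loss improved by less than rel_tol over the last `window` epochs."""
    if len(losses) <= window:
        return False  # not enough history to judge
    earlier, latest = losses[-window - 1], losses[-1]
    if earlier == 0:
        return True  # already at zero loss, nothing left to gain
    return (earlier - latest) / abs(earlier) < rel_tol


# Illustrative per-epoch values of the overall loss.
still_improving = [2.0, 1.2, 0.8, 0.6]
flat = [2.0, 1.0, 0.600, 0.599, 0.598, 0.597]
print(has_tailed_off(still_improving), has_tailed_off(flat))
```

If the check stays false at the end of training, more epochs (or a different learning rate) may be worth trying; visual inspection of the plotted curve remains the more reliable guide.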

# Predict and evaluate tracking on a video
Using the trained tracking model automatically stored in our `tracker` object, we can run the `predict(...)` method on our evaluation video dataset that we previously annotated, stored in path `DATA_PATH_EVAL`. There are several parameters that can be tweaked to improve the tracking: 
- conf_thres, det_thres, nms_thres, min_box_area: ...
- track_buffer: ...

In [None]:
#input_w, input_h = get_image_size(DATA_PATH_EVAL, format="GT_FairMOT") removed bug in FairMOT code? #TODO: Check with Casey regarding image resolution that's hardcoded
conf_thres=0.6
det_thres=0.3
nms_thres=0.4
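frame_rate=30  # frames per second of the evaluation video; defined here because track_buffer depends on it
track_buffer=frame_rate*10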
min_box_area=200
track_results = tracker.predict(DATA_PATH_EVAL, conf_thres=conf_thres, det_thres=det_thres, nms_thres=nms_thres, track_buffer=track_buffer, min_box_area=min_box_area)

`track_results` is a dictionary, where each key is a frame number and each value is a list of `TrackingBbox` objects, which represent the tracking information of each detected object, i.e. bounding boxes and tracking ids.

In [None]:
print(track_results)

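# track_results is keyed by frame number, so look up the first and last frames by key
frames = sorted(track_results)
print("First frame...tracking result:", track_results[frames[0]])
print("Last frame...tracking result:", track_results[frames[-1]])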

We can simply pass our `track_results` to the `evaluate()` method of our tracker to evaluate the results. Additionally, we pass the path of the ground-truth data against which the evaluation is run. The output is a set of MOT metrics, as implemented in the [py-motmetrics repo](https://github.com/cheind/py-motmetrics), which measure different aspects of the tracking performance. Refer to the [FAQ](./FAQ.md) for more details. #TODO

In [None]:
metrics = tracker.evaluate(track_results, DATA_PATH_EVAL) #TODO: I defined evaluate API, check with Casey, about return form of predict (bboxes objects, vs csv)
metrics
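
As an illustration of what one of these metrics measures, MOTA (multi-object tracking accuracy) combines missed objects, false positives and identity switches into a single score relative to the number of ground-truth objects. The sketch below computes it from raw counts; the function name `mota` and the counts in the demo are made up, and in practice the `motmetrics` package derives these counts from the ground truth and the tracker output.

```python
def mota(false_negatives: int, false_positives: int, id_switches: int, num_gt_objects: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, summed over all frames."""
    if num_gt_objects <= 0:
        raise ValueError("num_gt_objects must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt_objects


# Made-up counts: 100 ground-truth boxes, 5 misses, 3 false alarms, 2 identity switches.
score = mota(false_negatives=5, false_positives=3, id_switches=2, num_gt_objects=100)
print(f"MOTA: {score:.2f}")
```

Note that MOTA can be negative when the tracker makes more errors than there are ground-truth objects, so a value near 1 is good and a value below 0 indicates a badly failing tracker.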

To visualize the results, we can use the following utility function to produce a video with the bounding boxes and tracking ids:

In [None]:
frame_rate=30
video_path = convert_trackingbbox_video(track_results, frame_rate) #TODO

video = Video.from_file(video_path)
video

## Save the model
The final step is to save the model to disk. The saved model is the baseline tracking model after it has been finetuned on the FridgeObject dataset.

In [None]:
if SAVE_MODEL:
    tracker.save("./models/finetuned_fridgeObjects")

In [25]:
# Preserve some of the notebook outputs
sb.glue("training_losses", tracker.losses)
sb.glue("training_average_precision", tracker.ap)

# Predict tracking on a new video

We can use the saved retrained tracking model to predict tracking in a new video, without outputting evaluation. Let us use the following video, which has not been annotated for ground-truth:

In [8]:
DATA_PATH_TEST = "./data/carcans_8s.mp4"

video = Video.from_file(DATA_PATH_TEST)
video

Video(value=b'\x00\x00\x00\x18ftypmp42\x00\x00\x00\x00mp41isom\x00\x00\x00(uuid\\\xa7\x08\xfb2\x8eB\x05\xa8ae\…

We initialize a new `TrackingLearner` object and load the model we want, such as the one we saved above.

In [None]:
tracker = TrackingLearner(load_model="./models/finetuned_fridgeObjects")


In [None]:
conf_thres=0.6
det_thres=0.3
nms_thres=0.4
track_buffer=frame_rate*10
min_box_area=200
# The tracker already holds the model loaded above, so no separate model argument is passed.
track_results = tracker.predict(DATA_PATH_TEST, conf_thres=conf_thres, det_thres=det_thres, nms_thres=nms_thres, track_buffer=track_buffer, min_box_area=min_box_area)

In [22]:
frame_rate=30
video_path = convert_trackingbbox_video(track_results, frame_rate) #TODO: ask about using ffmpeg as cmd_str or other means, currently cv2

video = Video.from_file(video_path)
video

As the loaded model has been finetuned on cans, the tracking performance on this video, which also contains cans, is good.

# Conclusion

Using the concepts introduced in this notebook, you can bring your own dataset and train an object tracker to track objects of interest in a given video or image sequence. 