<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Training a Multi-Object Tracking Model

In this notebook, we give an introduction to training a multi-object tracking model using [torchvision](https://pytorch.org/docs/stable/torchvision/index.html). Using a small dataset, we demonstrate how to train and evaluate a one-shot multi-object tracking model, which detects objects and learns their re-ID features. In particular, we will use FairMOT, the one-shot tracking model developed by MSR Asia and others in this [repo](https://github.com/ifzhang/FairMOT). We will train the model on a set of still images, then evaluate on a video. We also show how to save and load the trained model for inference on a second video. 

To learn more about how multi-object tracking works, visit our [FAQ](./FAQ.md).

## Initialization

Import all the functions we need.

In [2]:
# Ensure edits to libraries are loaded and plotting is shown in the notebook.
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [1]:
#Regular python libraries
import os
from pathlib import Path
import sys
import time

from PIL import Image
import matplotlib.pyplot as plt
import scrapbook as sb
from ipywidgets import Video

# Pytorch
import torch
import torchvision

# Computer Vision repository
sys.path.append("../../")
from utils_cv.multi_object_tracking.display_with_bb import convert_trackingbboxes_video
from utils_cv.multi_object_tracking.file_format import convert_vott_MOTxywh 
from utils_cv.tracking.dataset import TrackingDataset 
from utils_cv.tracking.model import TrackingLearner 
from utils_cv.common.gpu import which_processor, is_windows

# Change matplotlib backend so that plots are shown for windows
if is_windows():
    plt.switch_backend("TkAgg")

print(f"TorchVision: {torchvision.__version__}")
which_processor()

TorchVision: 0.4.0a0+6b959ee
Torch is using GPU: Tesla K80


This shows your machine's GPUs (if it has any) and the computing device `torch/torchvision` is using.

Next, set the data input and some model runtime parameters.

In [6]:
# Datasets 
DATA_PATH_TRAIN = "./data/odFridgeObjects_FairMOTformat/" #unzip_url(Urls.fridge_objects_path, exist_ok=True) #TODO
DATA_PATH_EVAL_VIDEO = "./data/carcans_1s.mp4" #unzip_url(Urls.fridge_objects_path, exist_ok=True) #TODO
DATA_PATH_EVAL_VOTT = "./data/carcans_vott-csv-export/" #unzip_url(Urls.fridge_objects_path, exist_ok=True) #TODO
DATA_PATH_EVAL = "./data/carcans_MOTformat/" #unzip_url(Urls.fridge_objects_path, exist_ok=True) #TODO
DATA_PATH_TEST = "./data/carcans_8s.mp4" #unzip_url(Urls.fridge_objects_path, exist_ok=True) #TODO
FRAME_RATE = 30  # frame rate for 1st inference video
FRAME_RATE_2 = 30  # frame rate for 2nd inference video

# Model parameters
EPOCHS = 10
LEARNING_RATE = 0.0001
#IM_SIZE = 500 #KIP: checked, for OD, DetectionLearner uses it if model is not None, else set to 500
SAVE_MODEL = True

# Inference parameters
CONF_THRES = 0.6
TRACK_BUFFER = FRAME_RATE * 10
INPUT_W = 1920
INPUT_H = 1080

# train on the GPU or on the CPU, if a GPU is not available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using torch device: {device}")

Using torch device: cuda


---

# Prepare Training Dataset

In this section, we use a toy training dataset called *Fridge Objects*, which consists of 134 photos of 4 classes of beverage containers `{can, carton, milk bottle, water bottle}` taken on different backgrounds, as used for the object detection scenario. It serves as a simple illustration of how finetuning a pre-trained tracking model on a small dataset can improve its tracking performance.

Similar to the object detection [training introduction notebook](../detection/01_training_introduction.ipynb), we use a helper function that downloads and unzips the dataset to the `ComputerVision/data` directory. #TODO: correct if needed

Set that directory in the `path` variable for ease of use throughout the notebook.

In [8]:
path = Path(DATA_PATH_TRAIN)
os.listdir(path)

['images', '.ipynb_checkpoints', 'FridgeObjects.train', 'labels_with_ids']

You'll notice that we have two different folders inside, and a file:
- `/images/`
- `/labels_with_ids/`
- `/FridgeObjects.train`

This layout, with one folder for images and one for labels, is fairly common for object detection and object tracking. Unlike object detection labels, the files in `labels_with_ids` include a field for the object id.

```
/data
+-- images
|   +-- 00001.jpg
|   +-- 00002.jpg
|   +-- ...
+-- labels_with_ids
|   +-- 00001.txt
|   +-- 00002.txt
|   +-- ...
+-- ...
```

Each image has a corresponding txt file with the same base name, e.g. '00128.txt' contains the detection and tracking information for the image file '00128.jpg', i.e. the bounding boxes and the object ids. In this example, our fridge object dataset is annotated in the format followed by the [FairMOT repo](https://github.com/ifzhang/FairMOT), originally from the [Towards-Realtime-MOT repo](https://github.com/Zhongdao/Towards-Realtime-MOT/blob/master/DATASET_ZOO.md). For example, '00128.txt' contains the following:

```
0 3 0.35671 0.50450 0.17635 0.23724
0 2 0.67335 0.49399 0.36874 0.57057
```
This follows the FairMOT file format described in the [Towards-Realtime-MOT repo](https://github.com/Zhongdao/Towards-Realtime-MOT/blob/master/DATASET_ZOO.md), where each line describes one bounding box:
```
[class] [identity] [x_center] [y_center] [width] [height]
```
The field `class` is set to 0 for all objects, as only single-class multi-object tracking is currently supported by the [FairMOT repo](https://github.com/ifzhang/FairMOT). The field `identity` is an integer from `0` to `num_identities - 1`. In this training dataset, we used this dictionary to convert the original class labels to ids: `{'milk_bottle': 0, 'water_bottle': 1, 'carton': 2, 'can': 3}`. The values of `[x_center] [y_center] [width] [height]` are normalized by the width/height of the image, and range from `0` to `1`.
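
To make the label format concrete, here is a minimal sketch that parses one label line and converts the normalized center box to pixel corner coordinates. The helper names (`LabelRow`, `parse_label_line`, `to_pixel_corners`) and the 1000x1000 image size are assumptions for illustration; they are not part of `utils_cv` or FairMOT.

```python
from typing import NamedTuple, Tuple


class LabelRow(NamedTuple):
    """One annotation line: class, identity, then a normalized center box."""
    cls: int
    identity: int
    x_center: float
    y_center: float
    width: float
    height: float


def parse_label_line(line: str) -> LabelRow:
    """Parse '[class] [identity] [x_center] [y_center] [width] [height]'."""
    cls, identity, xc, yc, w, h = line.split()
    return LabelRow(int(cls), int(identity), float(xc), float(yc), float(w), float(h))


def to_pixel_corners(row: LabelRow, img_w: int, img_h: int) -> Tuple[float, float, float, float]:
    """Convert a normalized center box to pixel (left, top, right, bottom)."""
    left = (row.x_center - row.width / 2) * img_w
    top = (row.y_center - row.height / 2) * img_h
    right = (row.x_center + row.width / 2) * img_w
    bottom = (row.y_center + row.height / 2) * img_h
    return left, top, right, bottom


# First line of '00128.txt' above; the image size is a made-up example value.
row = parse_label_line("0 3 0.35671 0.50450 0.17635 0.23724")
print(row.identity)
print(to_pixel_corners(row, img_w=1000, img_h=1000))
```

With the id dictionary above, identity `3` corresponds to the `can` label.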

In addition to the above images and labels files in their respective folder, FairMOT also requires an info file that lists the path (i.e. with the root path of the 'images' folder) of all image frames used for training. For instance, the first few lines of our info file, `FridgeObjects.train`, are:
```
./data/odFridgeObjects_FairMOTformat/images/00001.jpg
./data/odFridgeObjects_FairMOTformat/images/00002.jpg
./data/odFridgeObjects_FairMOTformat/images/00003.jpg
```
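
If you bring your own images, an info file of this kind can be generated with a few lines of Python. The sketch below is one possible way to do it; `write_info_file` is a hypothetical helper (not part of `utils_cv` or FairMOT), and the demo runs on empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path
from typing import List


def write_info_file(images_dir: Path, info_path: Path) -> List[str]:
    """Write the sorted paths of all .jpg files in images_dir, one per line."""
    image_paths = sorted(p.as_posix() for p in images_dir.glob("*.jpg"))
    info_path.write_text("\n".join(image_paths) + "\n")
    return image_paths


# Demo on a throwaway directory with two empty placeholder images.
with tempfile.TemporaryDirectory() as tmp:
    images_dir = Path(tmp) / "images"
    images_dir.mkdir()
    for name in ("00002.jpg", "00001.jpg"):
        (images_dir / name).touch()
    info_path = Path(tmp) / "demo.train"
    listed = write_info_file(images_dir, info_path)
    info_lines = info_path.read_text().splitlines()
    print(info_lines)
```

Sorting the paths keeps the file stable across runs, since directory listing order is not guaranteed.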

# Load Training Dataset

To load the data, we create a Torchvision dataset object using our `TrackingDataset` wrapper class, which also converts the dataset into the appropriate format.

In [5]:
data_train = TrackingDataset(DATA_PATH_TRAIN, {"custom":"./data/FridgeObjects.train"}) #TODO: Check with Casey, change input data format from FairMOT to xml, and use util funcs to convert


# Finetune a Pretrained Model

For the TrackingLearner, we use FairMOT's baseline tracking model, which is pre-trained on pedestrian datasets such as the [MOT challenge datasets](https://motchallenge.net/). Hence, out of the box it does not detect fridge objects such as those in the evaluation video.

When we initialize the TrackingLearner, we can pass in the training dataset. By default, the object will set the model to FairMOT's baseline tracking model. #TODO: add option for load_model?

In [None]:
tracker = TrackingLearner(data_train) 
print(f"Model: {type(tracker.model)}")

To run the training, we call the `fit(...)` method of the tracker object. The main fit parameters are `num_epochs`, `lr` and `batch_size`.

In [None]:
#TODO: hard-code the 10 epochs in code? coming from initialization of baseline checkpoint
tracker.fit(num_epochs=EPOCHS+10, lr=LEARNING_RATE, batch_size=2) #KIP: Other params include lr_step, 

We can now plot the losses over the training epochs and see how the model improves with training. We want to train with a `num_epochs` and `lr` (to be tuned by the user) that produce a loss curve that tails off. The loss curve for our training is as follows:

In [None]:
tracker.plot_training_losses()
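
Besides inspecting the plot, a rough numeric check can tell whether a loss curve has tailed off. The following is only a heuristic sketch: `has_plateaued`, its window size and its tolerance are assumptions chosen for illustration and are not part of `TrackingLearner`.

```python
from typing import Sequence


def has_plateaued(losses: Sequence[float], window: int = 3, rel_tol: float = 0.01) -> bool:
    """Return True if the mean loss of the last `window` epochs improved by
    less than `rel_tol` (relative) over the mean of the `window` epochs before."""
    if len(losses) < 2 * window:
        return False  # not enough epochs to compare two windows
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    if previous == 0:
        return True
    return (previous - recent) / abs(previous) < rel_tol


still_improving = has_plateaued([5.0, 4.0, 3.0, 2.0, 1.5, 1.2])
flattened = has_plateaued([5.0, 3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 0.999, 1.0])
print(still_improving, flattened)
```

Note that a loss that starts increasing also counts as "no longer improving" under this check, so it does not distinguish a plateau from divergence.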


# Prepare Video Dataset for Tracking Evaluation
In this section, we cover how to annotate a video segment with ground truth in order to evaluate the trained model's tracking performance. For training, the dataset can consist of still images of unique objects that are properly labeled; these can be image sequences without any temporal meaning. For tracking prediction and evaluation, on the other hand, the dataset needs to be a video, i.e. an image sequence with temporal continuity, for the re-identification (re-ID) component of tracking to make sense. Please see the [Readme](./readme.md) for more information on the components of a tracking algorithm.

For the chosen video (below), we want to track the moving cans. These objects are similar to those in the FridgeObject dataset used for training, so the finetuned model should be able to recognize them.

In [None]:
video = Video.from_file(DATA_PATH_EVAL_VIDEO)
video

We use an annotation tool, such as VoTT, to annotate the cans. Please see the [FAQ](./FAQ.md) for more details on how to do the annotation. After annotation, the exported files are as follows:

In [5]:
path = Path(DATA_PATH_EVAL_VOTT)
os.listdir(path)

['Carcans_GT-export.csv',
 'carcans_1s.mp4#t=0.033333.jpg',
 'carcans_1s.mp4#t=0.066667.jpg',
 'carcans_1s.mp4#t=0.1.jpg',
 'carcans_1s.mp4#t=0.133333.jpg',
 'carcans_1s.mp4#t=0.166667.jpg',
 'carcans_1s.mp4#t=0.2.jpg',
 'carcans_1s.mp4#t=0.233333.jpg',
 'carcans_1s.mp4#t=0.266667.jpg',
 'carcans_1s.mp4#t=0.3.jpg',
 'carcans_1s.mp4#t=0.333333.jpg',
 'carcans_1s.mp4#t=0.366667.jpg',
 'carcans_1s.mp4#t=0.4.jpg',
 'carcans_1s.mp4#t=0.433333.jpg',
 'carcans_1s.mp4#t=0.466667.jpg',
 'carcans_1s.mp4#t=0.5.jpg',
 'carcans_1s.mp4#t=0.533333.jpg',
 'carcans_1s.mp4#t=0.566667.jpg',
 'carcans_1s.mp4#t=0.6.jpg',
 'carcans_1s.mp4#t=0.633333.jpg',
 'carcans_1s.mp4#t=0.666667.jpg',
 'carcans_1s.mp4#t=0.7.jpg',
 'carcans_1s.mp4#t=0.733333.jpg',
 'carcans_1s.mp4#t=0.766667.jpg',
 'carcans_1s.mp4#t=0.8.jpg',
 'carcans_1s.mp4#t=0.833333.jpg',
 'carcans_1s.mp4#t=0.866667.jpg',
 'carcans_1s.mp4#t=0.9.jpg',
 'carcans_1s.mp4#t=0.933333.jpg',
 'carcans_1s.mp4#t=0.966667.jpg',
 'carcans_1s.mp4#t=0.jpg',
 'carc

Evaluation of the tracking performance is carried out using the [py-motmetrics](https://github.com/cheind/py-motmetrics) repository, which provides multiple metrics for benchmarking multi-object trackers. Its `motmetrics` package takes ground-truth data in a format similar to the MOT Challenge format (see [FAQ](./FAQ.md)). To convert the VoTT export to that format, use the utility function `convert_vott_MOTxywh()`:

In [None]:
convert_vott_MOTxywh(DATA_PATH_EVAL_VOTT, DATA_PATH_EVAL) #TODO: change if standardizing input to be xml files, KIP

path = Path(DATA_PATH_EVAL)
os.listdir(path)
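
For orientation, the MOT Challenge text format stores one box per line as comma-separated `frame, id, bb_left, bb_top, bb_width, bb_height`, followed by a few extra fields. The sketch below shows the core corner-to-`xywh` conversion on a single made-up box. `to_mot_line` is a hypothetical helper, the trailing `1,-1,-1,-1` fields are placeholder values, and the exact output of `convert_vott_MOTxywh()` is not shown in this notebook, so treat this as an illustration of the format rather than of that function.

```python
def to_mot_line(frame: int, track_id: int,
                xmin: float, ymin: float, xmax: float, ymax: float) -> str:
    """Format one corner-style box as a MOT-style 'xywh' text line."""
    width = xmax - xmin
    height = ymax - ymin
    # Trailing fields (confidence flag and unused coordinates) are placeholders.
    return (f"{frame},{track_id},{xmin:.2f},{ymin:.2f},"
            f"{width:.2f},{height:.2f},1,-1,-1,-1")


# A made-up box in frame 1 belonging to track id 2.
mot_line = to_mot_line(1, 2, 100.0, 50.0, 180.0, 200.0)
print(mot_line)
```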

# Predict and Evaluate Tracking

To validate the trained model, we want to run it on the evaluation dataset we loaded above, and compare the predicted tracking results with the dataset's ground-truth annotations. 

Using the trained tracking model automatically stored in our `tracker` object, we can run the `predict` function on the evaluation video dataset that we previously annotated, stored in `DATA_PATH_EVAL`. Several parameters can be tweaked to improve tracking performance and inference speed, including `conf_thres`, `track_buffer`, `input_w` and `input_h`; see the [FAQ](./FAQ.md) for more details.

In [None]:
#input_w, input_h, removed bug in FairMOT code? #TODO: Check with Casey regarding image resolution that's hardcoded
track_results = tracker.predict(DATA_PATH_EVAL,
                                conf_thres=CONF_THRES, track_buffer=TRACK_BUFFER,
                                input_w=INPUT_W, input_h=INPUT_H)

`track_results` is a dictionary where each key is a frame number and each value is a list of `TrackingBbox` objects, which hold the tracking information of each detected object, i.e. its bounding box and tracking id.

In [None]:
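print(track_results)

# Frame numbers are dictionary keys, so look up the first and last frame by key.
first_frame, last_frame = min(track_results), max(track_results)
print("First frame...tracking result:", track_results[first_frame])
print("Last frame...tracking result:", track_results[last_frame])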
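
A dictionary of per-frame box lists like this is easy to summarize. As a minimal sketch, the code below counts in how many frames each tracking id appears. The `Box` dataclass is a hypothetical stand-in for `TrackingBbox`, whose actual attribute names are not shown in this notebook, and the demo data is made up.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Box:
    """Hypothetical stand-in for TrackingBbox; field names are assumptions."""
    left: int
    top: int
    right: int
    bottom: int
    track_id: int


def frames_per_track(results: Dict[int, List[Box]]) -> Counter:
    """Count the number of frames in which each tracking id appears."""
    counts: Counter = Counter()
    for boxes in results.values():
        counts.update(box.track_id for box in boxes)
    return counts


# Made-up results: id 1 is seen in three frames, id 2 in two.
demo_results = {
    1: [Box(10, 10, 50, 80, 1), Box(100, 20, 140, 90, 2)],
    2: [Box(12, 10, 52, 80, 1), Box(104, 20, 144, 90, 2)],
    3: [Box(14, 10, 54, 80, 1)],
}
track_counts = frames_per_track(demo_results)
print(track_counts)
```

Ids that appear in only a handful of frames often point to fragmented tracks, which is one thing the `track_buffer` parameter is meant to influence.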
      

We can simply pass our `track_results` to the `evaluate()` method of our tracker, together with the path of the ground-truth data against which the evaluation is run. The output consists of MOT metrics, e.g. MOTA and IDF1, as calculated by the [py-motmetrics repo](https://github.com/cheind/py-motmetrics), each of which measures a different aspect of the tracking performance. Refer to the [FAQ](./FAQ.md) for more details. #TODO

In [None]:
metrics = tracker.evaluate(track_results, DATA_PATH_EVAL) #TODO: I defined evaluate API as shown in the cell below, check with Casey, about return form of predict (bboxes objects, vs csv)
metrics
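
To give a feel for the two headline metrics, the sketch below computes them from raw counts using their standard definitions: MOTA is one minus the ratio of all errors (misses, false positives and identity switches) to the number of ground-truth boxes, and IDF1 is the F1 score over identity-matched detections. The function names and the counts are made up for illustration; in practice `motmetrics` derives these counts from the matched sequences.

```python
def mota(num_misses: int, num_false_positives: int,
         num_switches: int, num_gt_boxes: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, summed over all frames."""
    errors = num_misses + num_false_positives + num_switches
    return 1.0 - errors / num_gt_boxes


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


# Made-up counts for a short sequence with 100 ground-truth boxes.
example_mota = mota(num_misses=10, num_false_positives=5,
                    num_switches=2, num_gt_boxes=100)
example_idf1 = idf1(idtp=80, idfp=10, idfn=20)
print(round(example_mota, 3), round(example_idf1, 3))
```

MOTA can become negative when the tracker makes more errors than there are ground-truth boxes, so a low or negative value is not necessarily a bug in the evaluation.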

In [None]:
# evaluate function for MOT scenario --> to put in TrackingLearner class
# Note: the relative import below only resolves inside the package, not in a notebook cell.
import os
from typing import Dict, Optional

import motmetrics as mm

from .references.fairmot.tracking_utils.evaluation import Evaluator
from utils_cv.multi_object_tracking.display_with_bb import convert_trackingbboxes_txt


def evaluate(self, prediction_dict: Optional[Dict], data_root_path: str) -> str:
    """ eval code that calls on 'motmetrics' package in referenced FairMOT script, to produce MOT metrics on inference, given ground-truth.
        Args:
            prediction_dict: dictionary of prediction results from predict() function, i.e. Dict[int, List[TrackingBbox]]
            data_root_path: path of dataset containing GT annotations in MOTchallenge format (xywh)
        Returns:
            strsummary: text table rendered by 'motmetrics', containing metrics scores
        Raises:
            Exception: if `prediction_dict` is None and `self.stored_predictions` is empty.
    """
    if prediction_dict is None:
        if not self.stored_predictions:  # TODO: add stored_predictions in predict() maybe
            raise Exception("No predict() function run on dataset for evaluation")
        prediction_dict = self.stored_predictions

    # Write the predictions to <data_root_path>/results/results.txt
    result_root = os.path.join(data_root_path, "results")
    os.makedirs(result_root, exist_ok=True)
    result_filename = os.path.join(result_root, "results.txt")
    convert_trackingbboxes_txt(prediction_dict, result_filename)  # TODO: KIP, or put in predict()

    # Implementation inspired by code found here: https://github.com/ifzhang/FairMOT/blob/master/src/track.py
    evaluator = Evaluator(data_root_path, "single_vid", "mot")
    accs = [evaluator.eval_file(result_filename)]

    # get summary
    metrics = mm.metrics.motchallenge_metrics
    mh = mm.metrics.create()
    summary = Evaluator.get_summary(accs, ["single_vid"], metrics)
    strsummary = mm.io.render_summary(
        summary,
        formatters=mh.formatters,
        namemap=mm.io.motchallenge_metric_names
    )
    print(strsummary)
    Evaluator.save_summary(summary, os.path.join(result_root, "summary_metrics.xlsx"))

    return strsummary


To visualize the results, we can use the following utility function to produce a video with the bounding boxes and tracking ids:

In [None]:
tracking_video_path = convert_trackingbboxes_video(track_results, video, frame_rate=FRAME_RATE)

video = Video.from_file(tracking_video_path)
video

## Save the Model
The final step is to save the model, i.e. the baseline tracking model finetuned on the FridgeObject dataset, to disk. Use the `save` function of `TrackingLearner` to save the model to the desired path, e.g.:

In [None]:
if SAVE_MODEL:
    tracker.save("./models/finetuned_fridgeObjects") #TODO: check w Casey format

In [25]:
# Preserve some of the notebook outputs
sb.glue("training_losses", tracker.losses)
sb.glue("training_average_precision", tracker.ap)

# Predict Tracking on a New Video

We can use the saved retrained tracking model to predict tracking in a new video, without outputting evaluation. Let us use the following video, which has not been annotated for ground-truth:

In [3]:
video_to_track = Video.from_file(DATA_PATH_TEST)
video_to_track

We initialize a new `TrackingLearner` object and load the model we want, such as the one we saved above.

In [None]:
tracker = TrackingLearner(load_model="./models/finetuned_fridgeObjects")


In [None]:
track_results = tracker.predict(DATA_PATH_TEST,
                                conf_thres=CONF_THRES, track_buffer=TRACK_BUFFER,
                                input_w=INPUT_W, input_h=INPUT_H)

In [22]:
video_path = convert_trackingbboxes_video(track_results, video_to_track, frame_rate=FRAME_RATE_2)
video = Video.from_file(video_path)
video

As the loaded model has been finetuned on cans, it should track the cans in this video well.

# Conclusion

Using the concepts introduced in this notebook, you can bring your own dataset and train an object tracker to track objects of interest in a given video or image sequence. 

# FAQ