<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Training a Multi-Object Tracking Model

This notebook shows how to train and evaluate a multi-object tracking model.

Specifically, this notebook uses [FairMOT](https://github.com/ifzhang/FairMOT), a state-of-the-art tracking model with high accuracy and fast inference speed. The model is trained on a set of still images, and is then evaluated on video footage. For more information regarding FairMOT and multi-object tracking, please visit the [FAQ](./FAQ.md).

## Initialization
Import all the functions we need.

In [3]:
!pip install -q condacolab
import condacolab
condacolab.install()

✨🍰✨ Everything looks OK!


In [4]:
!conda install -c conda-forge opencv yacs lap progress
!pip install cython_bbox motmetrics

Channels:
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done


    current version: 23.11.0
    latest version: 24.1.2

Please update conda by running

    $ conda update -n base -c conda-forge conda



# All requested packages already installed.


In [None]:
#%cd computervision-recipes/utils_cv/tracking/references/fairmot/models/networks/DCNv2
#!sh make.sh
!pip install git+https://github.com/microsoft/ComputerVision.git@master#egg=utils_cv


In [6]:
import os
import os.path as osp
import sys

from ipywidgets import Video
import matplotlib.pyplot as plt
import torch
import torchvision

# Computer Vision repository
sys.path.append("../../")
from utils_cv.common.data import data_path, unzip_url
from utils_cv.common.gpu import is_windows, which_processor
from utils_cv.tracking.data import Urls
from utils_cv.tracking.dataset import TrackingDataset
from utils_cv.tracking.model import TrackingLearner
from utils_cv.tracking.plot import plot_single_frame, play_video, write_video

# Change matplotlib backend so that plots are shown for windows
if is_windows():
    plt.switch_backend("TkAgg")

print(f"TorchVision: {torchvision.__version__}")
which_processor()

ModuleNotFoundError: No module named 'torch._six'

The cell above prints the installed torchvision version and then calls `which_processor()`, which displays your machine's GPUs (if it has any) and the compute device that `torch`/`torchvision` is using.

In [None]:
# Ensure edits to libraries are loaded and plotting is shown in the notebook.
%reload_ext autoreload
%autoreload 2
%matplotlib inline

Next, set some training and inference parameters, as well as the data input parameters. Better accuracy can typically be achieved by increasing the number of training epochs.

In [None]:
# Training parameters
EPOCHS = 10
LEARNING_RATE = 0.0005
BATCH_SIZE = 4
MODEL_PATH = "./models/all_dla34.pth"  # the path of the pretrained model to finetune/train

# Inference parameters
CONF_THRES = 0.3

# Data Location
TRAIN_DATA_PATH = unzip_url(Urls.cans_path, exist_ok=True)
EVAL_DATA_PATH = unzip_url(Urls.carcans_annotations_path, exist_ok=True)

# Train on the GPU or on the CPU, if a GPU is not available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using torch device: {device}")

## Setup

Please follow the setup instructions in the [README.md](https://github.com/microsoft/computervision-recipes/blob/master/scenarios/tracking/README.md) to make sure all required libraries are installed.

In addition, to be able to run this notebook, the baseline FairMOT model needs to be downloaded from [here](https://drive.google.com/file/d/1udpOPum8fJdoEQm6n0jsIgMMViOMFinu/view) and saved to the `./models` folder as `all_dla34.pth`, so that it matches `MODEL_PATH` above.
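
A missing weights file is an easy mistake to make at this step, so a quick check before training can save a confusing error later. The snippet below is a minimal sketch using only the standard library; `find_missing_files` is a hypothetical helper (not part of `utils_cv`), and the path is assumed to match the `MODEL_PATH` set earlier.

```python
from pathlib import Path


def find_missing_files(paths):
    """Return the subset of `paths` that do not exist as regular files."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


# Assumed location of the baseline weights; adjust to wherever the file was saved.
missing = find_missing_files(["./models/all_dla34.pth"])
if missing:
    print("Download the baseline model first; missing:", missing)
```

If nothing is printed, the file is in place and the rest of the notebook can load it.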

## Prepare Training Dataset

This section will show how to use a small training dataset to finetune a pre-trained model. The dataset consists of 12 images of cans across four classes `{coke, gingerale, espresso, coldbrew}` and with differing backgrounds.

Note that we use different cans for training, so that the re-id component in the FairMOT tracker can learn to distinguish different types of cans from one another. At inference time, this enables the tracker to distinguish between cans it has not seen during training.

In [None]:
os.listdir(TRAIN_DATA_PATH)

Within the data folder there are two different subfolders:
- `/images/`
- `/annotations/`

This format, one folder for images and one folder for annotations, is common for object detection and object tracking. In fact, the annotation format (Pascal VOC) is identical to the annotation format used for object detection - see the [01_training_introduction.ipynb](../detection/01_training_introduction.ipynb) notebook for more information.
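
To make the annotation format more concrete, here is a small sketch that reads Pascal VOC style XML with the standard library. The sample XML is made up for illustration and only contains the common VOC fields (`object`, `name`, `bndbox`); the real annotation files in this dataset may carry additional fields, and `parse_voc` is a hypothetical helper, not the loader used by `TrackingDataset`.

```python
import xml.etree.ElementTree as ET

# Made-up annotation for illustration; real files live in the annotations folder.
SAMPLE_XML = """
<annotation>
  <filename>1.jpg</filename>
  <object>
    <name>coke</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
  <object>
    <name>espresso</name>
    <bndbox><xmin>150</xmin><ymin>30</ymin><xmax>240</xmax><ymax>210</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        coords = tuple(int(bb.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.findtext("name"), coords))
    return boxes


for label, box in parse_voc(SAMPLE_XML):
    print(label, box)
```

Each object in a file yields one label and one box, which is the information both detection and tracking need per image.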

## Load Training Images

To load the data, we use the `TrackingDataset` class. This object knows how to read images and annotations consistent with the format specified above.

In [None]:
data_train = TrackingDataset(TRAIN_DATA_PATH, batch_size=BATCH_SIZE)
print("Found {} training images.".format(len(data_train.im_filenames)))

In [None]:
data_train.show_ims()

## Finetune a Pretrained Model

For the `TrackingLearner`, we use FairMOT's baseline tracking model, which is pre-trained on pedestrian datasets such as the [MOT challenge datasets](https://motchallenge.net/). Therefore, it does not yet know how to detect cans.

When we initialize the `TrackingLearner`, we pass in the training dataset and the path to the baseline model, which is set above as `MODEL_PATH` (`./models/all_dla34.pth`).

In [None]:
tracker = TrackingLearner(data_train, MODEL_PATH)

To run the training, we call the `fit` method of the tracker object. Note that we reduce the learning rate by a factor of 10 after 75% of the epochs to improve convergence to a good minimum of the loss function.

In [None]:
tracker.fit(num_epochs=EPOCHS, lr=LEARNING_RATE, lr_step=round(0.75 * EPOCHS))
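
To see what this schedule means in numbers, the sketch below prints the learning rate per epoch for the values used in this notebook. `stepped_lr` is a hypothetical helper written for illustration, and the exact epoch at which `fit` applies the drop is an assumption (here: epochs are 1-indexed and the lower rate applies to epochs after `lr_step`).

```python
def stepped_lr(epoch, base_lr, lr_step, factor=0.1):
    """Learning rate for a 1-indexed epoch under a single step decay.

    Assumption: the drop takes effect for epochs after `lr_step`.
    """
    return base_lr * factor if epoch > lr_step else base_lr


EPOCHS, LEARNING_RATE = 10, 0.0005
lr_step = round(0.75 * EPOCHS)  # Python rounds 7.5 to the nearest even integer, 8
for epoch in range(1, EPOCHS + 1):
    print(epoch, stepped_lr(epoch, LEARNING_RATE, lr_step))
```

With 10 epochs, `round(0.75 * EPOCHS)` evaluates to 8, so under this assumption only the last two epochs run at the reduced rate.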

The function below visualizes the training loss after each epoch and shows how the model improves over time. With appropriate values for `num_epochs` and `lr`, the loss curve should converge towards zero. The loss curve for our training is as follows:

In [None]:
tracker.plot_training_losses()

# Predict and Evaluate Tracking
To validate the trained model, we run it on an evaluation dataset and compare the predicted tracking results with the dataset's ground-truth annotations.

For that, we annotated each frame of a one-second-long video sequence called `car_cans_1s.mp4`. For more details on how to prepare the annotation and evaluation dataset, please see the [FAQ](./FAQ.md).

In [None]:
eval_video_path = osp.join(EVAL_DATA_PATH, "car_cans_1s.mp4")
#Video.from_file(eval_video_path)   # uncomment this line to play the video

This shows a single frame from around the middle of the evaluation video:

In [None]:
plot_single_frame(eval_video_path, 15)

## Predict

Now, we can run the `predict` function on our evaluation dataset. Note that several parameters can be tweaked to improve tracking performance and inference speed, such as `conf_thres` and `track_buffer`. Please see the [FAQ](./FAQ.md) for more details.

In [None]:
eval_results = tracker.predict(
    EVAL_DATA_PATH, conf_thres=CONF_THRES,
)
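
As a rough intuition for `conf_thres`, a confidence threshold discards low-scoring detections before they can start or extend a track. The sketch below illustrates the idea on made-up data; the `(score, box)` layout and the `filter_by_confidence` helper are hypothetical and do not reflect the tracker's internal data structures.

```python
def filter_by_confidence(detections, conf_thres):
    """Keep detections whose score exceeds `conf_thres`.

    `detections` is a list of (score, box) pairs; this layout is hypothetical.
    """
    return [(score, box) for score, box in detections if score > conf_thres]


detections = [
    (0.92, (10, 20, 110, 220)),
    (0.31, (150, 30, 240, 210)),
    (0.12, (5, 5, 40, 60)),
]
for thres in (0.1, 0.3, 0.5):
    print(thres, len(filter_by_confidence(detections, thres)))
```

A higher threshold tends to reduce false positives at the cost of missing faint or partially occluded objects, so it is worth tuning on the evaluation video.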

The call to `predict` returns the dictionary `eval_results` where each key is the frame number, and the value is a list of `TrackingBbox` objects that represent the tracking information of each object detected. For example, when we print out the tracking results from the last frame (frame 30), we can see two objects being tracked:

In [None]:
print("Last frame...tracking result:", eval_results[max(eval_results.keys())])
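
Because the results are keyed by frame, it is often handy to regroup them by track. The following sketch does this on toy data. `FakeTrackBox` is a hypothetical stand-in for `TrackingBbox`, and the `track_id` attribute is an assumption about that class; check the actual object before relying on it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeTrackBox:
    """Hypothetical stand-in for `TrackingBbox`; only `track_id` is assumed."""
    track_id: int


def frames_per_track(results):
    """Map each track id to the sorted list of frames in which it appears."""
    frames = defaultdict(list)
    for frame_id in sorted(results):
        for box in results[frame_id]:
            frames[box.track_id].append(frame_id)
    return dict(frames)


toy_results = {
    1: [FakeTrackBox(1), FakeTrackBox(2)],
    2: [FakeTrackBox(1)],
    3: [FakeTrackBox(1), FakeTrackBox(2)],
}
print(frames_per_track(toy_results))
```

A track that disappears for a few frames and then comes back, like track 2 in the toy data, shows up as a gap in its frame list.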

## Evaluate

To obtain quantitative evaluation metrics, we can simply pass our `eval_results` dictionary to the `evaluate` method of the tracker object. This outputs common MOT metrics such as IDF1 and MOTA. Please refer to the [FAQ](./FAQ.md) for more details on MOT metrics.

In [None]:
eval_metrics = tracker.evaluate(eval_results, EVAL_DATA_PATH)
print(eval_metrics)
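
For readers new to these metrics, the sketch below spells out the standard definitions of MOTA and IDF1 from their raw counts. It is meant only to clarify what the numbers mean; the counts are made up, the function names are hypothetical, and this is not necessarily how `evaluate` computes its output.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_boxes):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt_boxes


def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


# Made-up counts for illustration only.
print("MOTA:", mota(false_negatives=2, false_positives=1, id_switches=1, num_gt_boxes=20))
print("IDF1:", idf1(idtp=8, idfp=2, idfn=2))
```

MOTA mainly penalizes detection errors and identity switches, while IDF1 rewards keeping the same identity on the same object for as long as possible.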

## Visualize results

We can visualize the tracking results by overlaying the bounding boxes and ids of the tracked objects onto the video and writing it to the following file:

In [None]:
write_video(eval_results, eval_video_path, "results_eval.mp4")

The following cell extracts and displays certain frames from this video.

In [None]:
for frame_i in [1, int(len(eval_results) / 2), len(eval_results) - 3]:
    im = plot_single_frame(eval_video_path, frame_i, eval_results)

In addition, we can play the video here in the notebook:

In [None]:
play_video(eval_results, eval_video_path)

## Save the trained model
If satisfied with the results from evaluation, we can save this finetuned model to disk for later use.
```
tracker.save(TRAINED_MODEL_PATH)
```

To load the model and track objects in a new video, these commands can be used:
```
tracker = TrackingLearner(None, TRAINED_MODEL_PATH)
test_results = tracker.predict(
    path_to_video, conf_thres=CONF_THRES, track_buffer=TRACK_BUFFER,
)
```