# Multiple object tracking
2024-04-07 <br>
[Per Halvorsen](https://perhalvorsen.com) <br>
[GitHub](https://github.com/pmhalvor/ocean-species-identification) <br>
[LinkedIn](https://www.linkedin.com/in/pmhalvor/) <br>

This note is meant to serve as a compendium of resources for multiple object tracking.
It can be interpreted as a work-in-progress, with the intention to be expanded upon as more resources arise.

Previous notes on this topic are [Exploring the Monterey Bay Benthic Object Detector](https://perhalvorsen.com/media/notes/mbari_benthic_object_detector.html) and [Applying Super-resolution to MBARI Benthic Object Detector](https://perhalvorsen.com/media/notes/super_resolution_benthic_object_detection.html).

## Overview
- Background
    - Object detection
    - Object tracking
- Tools for tracking 
    - Classic CV techniques
        - Segmentation
            - Background subtraction
            - Contour detection
            - Mean-shift filtering 
        - Association
            - Hungarian algorithm
            - Kalman filter
            - Mahalanobis distance
    - Deep learning techniques
        - SORT 
        - DeepSORT
        - ByteTrack
        <!-- - OC_SORT -->

- Implementation
    <!-- - DeepSORT (deep-sort-realtime) -->
    - ByteTrack (Roboflow)


## Background
<!-- - Object detection vs. object tracking
    - Tracking is detection over multiple frames
    - Detection confidence can vary from frame to frame due to blur, pose, occlusion, etc.
    - One object may receive many tracked ids if a detection is not produced in a (set of) frame(s)
    - Tracking requires then both detections but also a way to associate them across frames
    - Tracking can be online or offline, depending on the computational resources available
    - Modern tracking techniques typically use a detector (YOLO, ResNet, etc.) together with a tracker (SORT, ByteTrack, etc.)
- Movement predictions require info about velocity, direction, etc.
    - Can be constant, determined beforehand or learned from data
    - Can be adaptive, as seen in Kalman filters 
- Some classic methods exploit derivative-like estimations of position from changes in detections between frames
    - Even simpler ones just look at IOU between objects found through filtering
- Some trackers that leverage deep learning incorporate visual information about each detected object to help performance
    - This can be done through embeddings, which are learned representations of the object's appearance
    - These embeddings can be used to associate objects across frames, even when they are occluded or change appearance
- Other important topics
    - Augmentation techniques
        - Augment images together with labels to improve tracking performance
        - Helpful when training on a small or unbalanced dataset
        - Can include flipping, rotation, scaling, etc., but label must be updated accordingly
    - Evaluation
        - The MOTChallenges (MOT20, MOT17, MOT16) are popular benchmarks for evaluating multiple object tracking algorithms
        - Some other benchmarks include DanceTrack, SportsMOT, and WildTrack
        - Papers with Code has a leader board for the current state-of-the-art trackers on popular benchmarks here: https://paperswithcode.com/task/multi-object-tracking  -->

### Object detection

<br>
<img src="../img/OD_dance.png" width="750">
<br>
<i>Figure 1: Example of object detection of a frame from a dance video (<a href="https://github.com/DanceTrack/DanceTrack">Dataset</a>, <a href="https://github.com/noahcao/OC_SORT">Detector</a>)</i>

In object detection, a single frame is considered in isolation, and the goal is to identify and classify particular objects in the frame by drawing bounding boxes around them.
This can be done with a focus on a single object or multiple objects, depending on model objective, complexity, and available compute resources.
Detections are usually thresholded by a confidence score, and [non-maximum suppression](https://learnopencv.com/non-maximum-suppression-theory-and-implementation-in-pytorch/) is applied to remove overlapping boxes.
To achieve state-of-the-art performance on object detection alone, new architectures will likely focus more on prediction correctness than on speed.
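
To make the thresholding and suppression steps concrete, here is a minimal pure-Python sketch. The function names (`iou`, `filter_detections`), the `(x1, y1, x2, y2)` box format, and both threshold values are assumptions chosen for illustration; real detectors ship their own optimized versions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(boxes, scores, conf_thresh=0.5, iou_thresh=0.5):
    """Return indices of boxes kept after score thresholding and greedy NMS."""
    # 1. Drop low-confidence detections.
    candidates = [i for i, s in enumerate(scores) if s >= conf_thresh]
    # 2. Visit the remaining boxes from most to least confident.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # 3. Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in kept):
            kept.append(i)
    return kept


# Hypothetical detections: two near-duplicates, one separate box, one weak box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.90, 0.80, 0.75, 0.20]
kept = filter_detections(boxes, scores)
print(kept)
```

In this toy input, the second box heavily overlaps the first and is suppressed, while the last box falls below the confidence threshold.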

Some of the most popular object detection models include YOLO (maintained by [Ultralytics](https://github.com/ultralytics/ultralytics)) and Faster R-CNN.
A state-of-the-art leader board for object detection models can be found on the Papers with Code website [here](https://paperswithcode.com/task/object-detection).
There, you can find the most recent models, along with their whitepapers and code implementations.

### Object tracking
<br>
<img src="https://github.com/noahcao/OC_SORT/raw/master/assets/dancetrack0088_slow.gif" width="750">
<br>
<i>Figure 2: Example of object tracking in a dance video. (Source: <a href="https://github.com/noahcao/OC_SORT">github.com/noahcao/OC_SORT</a>)</i>


Object tracking builds off object detection, but it considers multiple frames in a video sequence.
The goal is to associate detections across frames to track the same object over time.
In these cases, speed is often more important than initial correctness, as the model must process many frames in a video sequence.
The model must consider the object's movement, occlusion, and appearance changes, and update its detection and tracking accordingly.

Typically, it is useful for these models to assign a unique ID to each object in the video sequence.
A good tracker will be able to maintain the same ID for an object even when it is occluded or changes appearance.
For more rudimentary tracking methods, re-identifying tracked objects occluded by other objects can be a challenge. 
This may result in the same object receiving new IDs after reappearance, or the object being lost entirely.

Modern trackers exploit deep learning components to include more temporal or visual information about the detections to assist in this process.
These more complex trackers can be slower or require more compute resources, introducing a trade-off between speed and correctness.
Decisions around this trade-off will depend on the specific use case and available resources.


## Tools for tracking 

With a brief overview of object tracking as a task, we can now explore some techniques in more detail. 

### Classic
Traditional computer vision techniques can be applied to videos as light-weight trackers.
Below is a list of some of the most popular classic tracking algorithms, along with brief explanations and links to blog posts explaining further.
These simplified set-ups require two main steps: object segmentation and object association.



#### Segmentation



##### Background detection 
While not directly a tracking algorithm, background detection can be used in tandem with other traditional computer vision techniques to make up a light-weight tracker. 
The general idea is to determine a static background from a set of frames, to differentiate these pixels from moving objects.

A video's background can be determined as the [median pixel values](https://learnopencv.com/simple-background-estimation-in-videos-using-opencv-c-python/) for all frames, a process known as _temporal median filtering._
Such a technique assumes few moving objects and a static background, where moving objects are present in any given pixel in less than 50% of the frames.
This method is rudimentary but can be effective in simple tracking tasks, video surveillance, and other applications where the background is relatively static.

More details on background estimation can be found in this blogpost on [Simple Background Estimation in Videos using OpenCV](https://learnopencv.com/simple-background-estimation-in-videos-using-opencv-c-python/).
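
As a rough illustration of temporal median filtering, the sketch below estimates a background from a handful of tiny synthetic greyscale frames and then flags pixels that differ from it. The helper names, the nested-list frame format, and the difference threshold of 30 are all assumptions for this example, not part of the linked tutorial.

```python
from statistics import median


def estimate_background(frames):
    """Per-pixel temporal median over equally sized greyscale frames."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [median(frame[r][c] for frame in frames) for c in range(width)]
        for r in range(height)
    ]


def foreground_mask(frame, background, thresh=30):
    """Mark pixels whose intensity differs from the background by more than thresh."""
    return [
        [1 if abs(px - bg) > thresh else 0 for px, bg in zip(row, bg_row)]
        for row, bg_row in zip(frame, background)
    ]


# Synthetic clip: a bright pixel (200) moves one column per frame
# across a flat background (10), so it covers each pixel in 1 of 5 frames.
frames = []
for t in range(5):
    frame = [[10] * 5 for _ in range(3)]
    frame[1][t] = 200
    frames.append(frame)

background = estimate_background(frames)
mask = foreground_mask(frames[2], background)
print(background[1], mask[1])
```

With NumPy arrays, the same background estimate can be written as `np.median(frames, axis=0)`; the pure-Python version is used here only to keep the example dependency-free.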

##### Contour detection
Before the widespread use of neural networks, one way of detecting edges in an image was to use greyscale conversion and binary thresholding.
The input image is first converted to greyscale, removing any superfluous color information.
Then, a binary threshold is applied to the greyscale image, where pixel values above a certain threshold are set to 1, and those below are set to 0.
The result is a binary image, where edges are represented by the transitions between 0 and 1.
A simple contour detection algorithm can then be applied to trace the pixels along the edges of objects in the image.
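
The thresholding idea can be sketched in a few lines of plain Python. Note that this is a simplified stand-in: it only collects the foreground pixels that touch the background, whereas a real contour tracer such as OpenCV's `cv2.findContours` returns ordered contours. The function names and the threshold of 127 are assumptions for illustration.

```python
def binarize(grey, thresh=127):
    """Binary threshold: 1 above thresh, 0 otherwise."""
    return [[1 if px > thresh else 0 for px in row] for row in grey]


def boundary_pixels(binary):
    """Foreground pixels with a background (or out-of-image) 4-neighbour."""
    h, w = len(binary), len(binary[0])
    edges = []
    for r in range(h):
        for c in range(w):
            if binary[r][c] != 1:
                continue
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if any(
                not (0 <= nr < h and 0 <= nc < w) or binary[nr][nc] == 0
                for nr, nc in neighbours
            ):
                edges.append((r, c))
    return edges


# Synthetic 5x5 greyscale image with a bright 3x3 square in the centre.
grey = [[0] * 5 for _ in range(5)]
for r in range(1, 4):
    for c in range(1, 4):
        grey[r][c] = 255

binary = binarize(grey)
edges = boundary_pixels(binary)
print(edges)
```

Only the ring of pixels around the square's centre is reported, which is the "transition from 0 to 1" described above.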

<img src="https://learnopencv.com/wp-content/uploads/2024/01/contour_asset.gif" width="750" style="display: block; margin: auto;"><br>

_Example of contour detection in OpenCV. (Source: [learnopencv.com](https://learnopencv.com/moving-object-detection-with-opencv/))_

This simple yet useful approach can be applied to videos with both static and dynamic backgrounds.
The setup requires minimal computational resources and can therefore be implemented at various stages of a pipeline.
In preprocessing steps, contour detection can assist in object detection, telling a model where in a frame it should focus most, increasing detection and segmentation performance. 
Additionally, contour detection can help in post-detection stages to assist in association and tracking, when combined with IOU or other metrics.

A very simple tracker could be built by using contour detection to segment objects in a frame, labeling each segmented object with an ID.
The "detection" step can be quickly run for each frame, recalculating the detections and labels each time.
However, in videos where new objects appear or objects disappear, these labels will differ, thus requiring more complex logic around label association between frames.
This topic will be explored more thoroughly in the section titled [Association](#Association).

A detailed walk-through implementing contour detection can be found [here](https://learnopencv.com/contour-detection-using-opencv-python-c/). 


##### Mean-shift clustering/segmentation
Similar to K-means clustering, mean-shift clustering is a method for finding the modes (or segmentations) of a distribution of data points.

<img src="../img/mean-shift.png" width="750" style="display: block; margin: auto;"><br>

_Example of mean-shift clustering in a 2D feature space. (Source: [Stanford Vision and Learning Lab](http://vision.stanford.edu/teaching/cs131_fall1920/slides/09_kmeans_mean_shift.pdf))_



The algorithm starts by [tessellating](https://dictionary.cambridge.org/dictionary/english/tessellate) the feature space with a grid of windows, each centered on a pixel (i.e. odd numbered height and width window sizes).
The algorithm then shifts each window to the mean of the data points, based on the selected feature space. 
Eventually, the windows will converge around a single point, which is considered a mode or peak in the data distribution.
Windows that converge near the same peak are then merged, and the process is repeated until all windows have converged.

In a simplified 1D set-up (i.e. a greyscale image), the feature space could be the pixel intensity, between 0 and 255.
For colored image segmentation, one could instead use the RGB values of an input image. 
For an even more complex set-up, the feature space could be a combination of pixel color, texture, or other means of describing a subregion of an image.
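
For the 1D greyscale case, the window-shifting procedure might look roughly like the sketch below. It starts one flat (uniform) window at every data point rather than on a regular grid, and the bandwidth of 20 intensity levels is an arbitrary assumption; both choices are simplifications for illustration.

```python
def mean_shift_1d(values, bandwidth=20.0, max_iter=100, tol=1e-3):
    """Find modes of 1D data by shifting a flat window to the local mean."""
    modes = []
    for start in values:
        centre = float(start)
        for _ in range(max_iter):
            # All points inside the window around the current centre.
            window = [v for v in values if abs(v - centre) <= bandwidth]
            new_centre = sum(window) / len(window)
            if abs(new_centre - centre) < tol:
                break  # the window has converged on a mode
            centre = new_centre
        modes.append(centre)

    # Merge windows that converged near the same peak.
    merged = []
    for m in sorted(modes):
        if not merged or m - merged[-1] > bandwidth:
            merged.append(m)
    return merged


# Hypothetical pixel intensities: a dark group and a bright group.
intensities = [10, 12, 15, 11, 14, 200, 205, 198, 202, 210]
modes = mean_shift_1d(intensities)
print(modes)
```

Each pixel can then be assigned to its nearest mode, giving a two-segment image in this toy case.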

Similar to contour detection, mean-shift clustering can be used in both pre- and post-processing stages of a pipeline, helping extract the most important features of an input image.
A great resource for understanding both mean-shift and k-means clustering (skipped in this note) can be found in this lecture from [Stanford Vision and Learning Lab](http://vision.stanford.edu/teaching/cs131_fall1920/slides/09_kmeans_mean_shift.pdf). 

<!-- 
#### Particle filter
- Use a set of particles to represent the object's position
- Update the particles based on the object's movement
- Can be effective for simple tracking tasks -->




#### Association




##### Kalman filter (SORT)
While not exactly an association technique by itself, Kalman filters are often used together with association functions like the Hungarian algorithm to update estimations about an object's position.
The Kalman filter is a recursive algorithm that estimates the state of a linear dynamic system from a series of noisy measurements.
It is used in many fields, including robotics, computer vision, and economics, to predict the future state of a system based on its past states.

The Kalman filter is based on two main assumptions:
<ol type="i">
    <li>The state of the system can be represented as a linear dynamic system.</li>
    <li>The noise in the system is Gaussian, with a mean of 0 and a covariance that has to be specified or estimated.</li>
</ol>

By a linear dynamic system, we mean that the state of the system at the next time step can be predicted based on the previous state and the system's dynamics (velocity and direction).
The step-functions are considered constant, and the system's estimated state can be predicted based on these constants, similar to a derivative-based physics simulation using $\frac{\delta x}{\delta t}$. 



<!-- It can be helpful to think recall your high-school physics classes, where you learned about classical mechanics to predict an object's future position given an initial position, intial velocity, and an acceleration:
$$
x(t) = x_0 + v_0t + \frac{1}{2}at^2 \quad \text{and} \quad  v(t) = v_0 + at \tag{1}
$$

Instead of the fixed, simple equations above, the Kalman filters take into account uncertainty around the system's estimates and measurements, and update the state estimate based on new measurements. -->

Kalman filtering can be broken down into 3 main stages:
1. **Estimation:** Estimate the current state of the system based on the fixed velocity and (randomly initialized) direction $\rightarrow \hat x_{t+1}$
2. **Detection:** Make a (noisy) measurement of the system $\rightarrow y_{t+1}$
3. **Update:** Update the state estimate with information from the new measurement, giving a more accurate prediction of the system's state $\rightarrow x_{t+1}$
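
The three stages can be sketched for a single coordinate with a constant-velocity model. Everything here is an illustrative assumption: the state is just (position, velocity), only the position is measured, the 2x2 matrix algebra is written out by hand to avoid dependencies, and the noise values `q` and `r` and the initial covariance are picked arbitrarily.

```python
def kalman_1d(measurements, dt=1.0, q=0.01, r=1.0):
    """Track position and velocity along one axis from noisy position readings."""
    p, v = 0.0, 0.0                              # state estimate
    c00, c01, c10, c11 = 10.0, 0.0, 0.0, 10.0    # covariance (large = uncertain)
    history = []
    for z in measurements:
        # 1. Estimation: propagate state and covariance (x = F x, C = F C F^T + Q).
        p = p + dt * v
        n00 = c00 + dt * (c10 + c01) + dt * dt * c11 + q
        n01 = c01 + dt * c11
        n10 = c10 + dt * c11
        n11 = c11 + q
        # 2. Detection: z is the noisy position; compare it with the estimate.
        residual = z - p
        # 3. Update: the Kalman gain balances estimate and measurement uncertainty.
        s = n00 + r
        k0, k1 = n00 / s, n10 / s
        p += k0 * residual
        v += k1 * residual
        c00, c01 = (1 - k0) * n00, (1 - k0) * n01
        c10, c11 = n10 - k1 * n00, n11 - k1 * n01
        history.append((p, v))
    return history


# Object moving 2 units per frame, with a deterministic +/-0.5 wobble as "noise".
noisy = [2.0 * t + (0.5 if t % 2 else -0.5) for t in range(1, 21)]
track = kalman_1d(noisy)
print(track[-1])
```

Even though the filter starts with a velocity of zero, the velocity estimate is pulled towards the true value as more measurements arrive. A 2D bounding-box tracker applies the same idea to a larger state vector.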


<!-- The equations used for Kalman filters are as followed:

$$
x_{t+1} = Fx_t + u \quad \text{where} \quad  u \sim N(0, P) \tag{1}
$$

$$
y_{t+1} = Hx_t + v \quad \text{where} \quad  v \sim N(0, Q) \tag{2}
$$

Here, $F$ and $H$ are the constant state transition and observation matrices, respectively. 
The $u$ and $v$ terms are the process and measurement noise, respectively, with assumed means about 0 and variances $P$ and $Q$.

The true state of a system at time $t$ will always be unknown, but the most accuracte prediction is represented as $x_t$. 
$y_t$ is the measurement made at time $t$, which in our case is the outputs from our detection model.  
The approximation of the state at time $t$ given the transition matrix is $\hat x_t$. 
We can rewrite the equations above representing their respective approximations as:
$$
\hat x_{t+1} = F \hat x_t + u \quad \text{where} \quad u \sim N(0, P) \tag{3}
$$

$$
\hat y_{t+1} = H \hat x_t + v \quad \text{where} \quad  v \sim N(0, Q) \tag{4}
$$

The approximation $\hat x_{t+1}$ is then used to produce an estimated measurement $\hat y_{t+1}$, which is then used to produce the best prediction possible, $x_{t+1}$, starting the process over again.
The equaiton for updating the state estimate with the new measurement is as follows:
$$
x_{t+1} = \hat x_{t+1} + K(y_{t+1} - \hat y_{t+1}) \tag{5}
$$

Here $K$ is the Kalman ratio, which balances how much information from the the new measurement we want to update our state estimate with, based on the system's uncertianties. -->

One of the reasons Kalman filters are so useful is because they make use of _all_ the information available to the system.
The more information available to a system, the less uncertainty, and thus the more accurate the predictions.

Kalman filters are often used in tracking applications to predict the object's position based on its velocity and update the position based on the detection.
The Simple Online Realtime Tracking (SORT) algorithm is a popular tracking algorithm that uses Kalman filtering for predictions and the Hungarian algorithm for association.

A really great resource for understanding Kalman filters is this [YouTube video](https://www.youtube.com/watch?v=IFeCIbljreY) by [Visually Explained](https://www.youtube.com/@VisuallyExplained).
A hand-written note covering some of these topics can also be found in this repository [here](../img/Kalman-Filtering-handwritten.pdf).

##### Hungarian algorithm (SORT)
The [Hungarian algorithm](https://en.wikipedia.org/wiki/Hungarian_algorithm) is a combinatorial optimization algorithm that solves the assignment problem in polynomial time.
It uses a cost matrix to find the optimal assignment of objects across frames, minimizing the total cost of the assignment.
The metric used to calculate the cost can be the intersection over union (IOU) between the detections in two frames, the Euclidean distance between the detections, or another metric that captures the similarity between the detections.

<img src="../img/Hungarian-alg.png" width="800" style="display: block; margin-left: auto; margin-right: auto;"><br>

_Example of the Hungarian algorithm in action on a small example. (Source: [Wikipedia](https://en.wikipedia.org/wiki/Hungarian_algorithm))_

The steps needed to achieve this systematically are as follows:
1. Subtract the smallest element in each row from all elements in that row
2. Subtract the smallest element in each column from all elements in that column
3. Draw the minimum number of lines needed to cover all zeros in the matrix
4. If the number of lines drawn equals the number of rows, the optimal assignment is found
5. If not, find the smallest element not covered by a line and subtract it from all uncovered elements, and add it to all elements covered by two lines
6. Repeat steps 3-5 until the optimal assignment is found
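
The sketch below covers only steps 1 and 2 (row and column reduction) and uses a brute-force search over all permutations as a reference, to show that the reductions do not change which assignment is optimal. It is not a full implementation of the Hungarian algorithm, and the 3x3 cost matrix is made up for the example.

```python
from itertools import permutations


def reduce_rows_and_columns(cost):
    """Steps 1 and 2: subtract each row's minimum, then each column's minimum."""
    rows = [[c - min(row) for c in row] for row in cost]
    col_mins = [min(col) for col in zip(*rows)]
    return [[c - m for c, m in zip(row, col_mins)] for row in rows]


def best_assignment(cost):
    """Brute-force reference: try every row-to-column permutation."""
    n = len(cost)
    return min(
        permutations(range(n)),
        key=lambda cols: sum(cost[r][cols[r]] for r in range(n)),
    )


cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
reduced = reduce_rows_and_columns(cost)

print(reduced)
print(best_assignment(cost), best_assignment(reduced))
```

The zeros in the reduced matrix here can be covered with two lines (fewer than the three rows), so steps 3 to 5 would still be needed to finish the procedure by hand.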

The Hungarian algorithm is used in the SORT algorithm to find the optimal assignment of detections across frames, minimizing the total cost of the assignment.
Since the Hungarian algorithm is inherently designed to minimize a total cost, the cost matrix passed to it is filled with the negative of the IOU between each pair of detections in two frames.
This way, the Hungarian algorithm will find the optimal assignment of detections that maximizes the IOU between the detections.
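
A small example of the negative-IOU cost matrix is sketched below. For brevity, the assignment is solved by brute force over permutations, which only works for a handful of boxes; in practice a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be used instead. The function names and box coordinates are assumptions, and the sketch assumes there are at least as many detections as tracks.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections):
    """Pair each track with one detection by minimising the summed negative IOU."""
    cost = [[-iou(t, d) for d in detections] for t in tracks]
    n = len(tracks)
    best = min(
        permutations(range(len(detections)), n),
        key=lambda cols: sum(cost[r][cols[r]] for r in range(n)),
    )
    return list(enumerate(best))  # (track index, detection index) pairs


# Boxes from the previous frame and (shuffled) detections in the current frame.
tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11), (50, 50, 60, 60)]
matches = associate(tracks, detections)
print(matches)
```

The third detection is left unmatched, which is the situation where a tracker would consider starting a new track.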

[Think Autonomous](https://www.thinkautonomous.ai/) has a great [blog post](https://www.thinkautonomous.ai/blog/hungarian-algorithm/) on applying the Hungarian algorithm to object tracking. There you can find an even more detailed breakdown of each of the steps mentioned above, along with a clear video and examples of different cost metrics used in the algorithm.


##### Mahalanobis Distance

The [Mahalanobis distance](https://en.wikipedia.org/wiki/Mahalanobis_distance) is a measure of the distance between a point and a distribution of other data points, introduced by P. C. Mahalanobis in 1936. 
It is a multi-dimensional generalization of the idea of measuring how many standard deviations away the point is from the mean of the distribution. 
This distance is zero if the observed point is at the mean of distribution (think the center of a circle or centroid of an ellipse), and grows as the point moves away from the mean along each principal component axis.


<img src="../img/Mahalanobis-Distance.png" width="800" style="display: block; margin-left:auto; margin-right:auto;"/><br>

_Example of the Mahalanobis distance in a 2D feature space. (Source: [TileStats (YouTube)](https://www.youtube.com/watch?v=xXhLvheEF7o))_

The example above comes from a video comparing the Euclidean distance to the Mahalanobis distance in a 2D feature space.
In the image we see, the Euclidean distance between the distribution mean and the two data points is the same, while the Mahalanobis distance is different.
The Mahalanobis distance takes into account the covariance of the data, which provides information on the trend of the distribution (diagonally in this case).
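
The same effect can be reproduced numerically. The sketch below computes the 2D Mahalanobis distance $\sqrt{(x-\mu)^T \Sigma^{-1} (x-\mu)}$ with the 2x2 inverse written out by hand; the mean, covariance, and two test points are made-up values, not those from the video.

```python
import math


def mahalanobis_2d(point, mean, cov):
    """Mahalanobis distance of a 2D point from a distribution (mean, 2x2 cov)."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of a 2x2 matrix is (1 / det) * [[d, -b], [-c, a]].
    quad = (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det
    return math.sqrt(quad)


mean = (0.0, 0.0)
cov = [[1.0, 0.8], [0.8, 1.0]]   # data trends along the diagonal

along = (1.0, 1.0)     # lies on the diagonal trend
across = (1.0, -1.0)   # same Euclidean distance, but against the trend

d_along = mahalanobis_2d(along, mean, cov)
d_across = mahalanobis_2d(across, mean, cov)
print(math.hypot(*along), math.hypot(*across))  # identical Euclidean distances
print(d_along, d_across)                        # very different Mahalanobis distances
```

The point lying against the trend is roughly three times "further away" in Mahalanobis terms, even though both points are equally far from the mean in Euclidean terms.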

In a sense, the Mahalanobis distance mixes the concepts of the mean-shift algorithm mentioned above with information about the covariance of the data.
Taking into account the covariance of the data allows the metric to account for the correlations between the data, which can be useful in tracking applications where the location of an object can change in both the x and y directions.
An object moving along a path will more likely continue along that path, rather than suddenly change direction. 
The Mahalanobis distance can account for this expected path continuation, and thus provide a more accurate measurement of the distance between a detected object and its previous occurrences.

In modern object trackers, the Mahalanobis distance sometimes replaces or supplements the IOU metric in the association step between the predicted track positions and the newly detected bounding boxes, as in DeepSORT.
For more information on the Mahalanobis distance, check out the [video](https://www.youtube.com/watch?v=xXhLvheEF7o) by [TileStats](https://www.youtube.com/@tilestats) mentioned above.



### Modern

#### SORT
We've already mentioned the [Simple Online Realtime Tracking (SORT)](https://arxiv.org/abs/1602.00763) (A. Bewley et al. 2016) algorithm a few times now, but we can consolidate the information here.
This algorithm is a simple, yet effective, tracking algorithm that uses a Kalman filter for state estimation and the Hungarian algorithm for association.
SORT is an online tracking algorithm, meaning it processes frames one at a time and does not require the entire video to be loaded into memory.

For each frame, SORT performs the following steps:

1. **Detection:** Run an object detector on the current frame to get the detections
2. **Estimation:** Make a best estimate of the state of the object in the next frame using a Kalman filter
3. **Association:** Use the Hungarian algorithm to associate the detections across frames
4. **Create & Delete:** Create new tracks for unmatched detections and delete old tracks that have not been matched for a certain number of frames

The methodology section of the [original SORT paper](https://arxiv.org/pdf/1602.00763.pdf) actually does a great job at explaining the algorithm in detail, so I would recommend reading that if you want to learn more about SORT.
Implementations of SORT can be found in the original repository [github.com/abewley/sort](https://github.com/abewley/sort). You can also refer to my [handwritten note](../img/Kalman-Filtering-handwritten.pdf) diving into more of the math around SORT and Kalman filtering.

#### DeepSORT
[DeepSORT](https://arxiv.org/abs/1703.07402) (N. Wojke et al. 2017) is an extension of the SORT algorithm that uses deep learning to improve tracking performance.
The main difference between SORT and DeepSORT is that DeepSORT uses a deep neural network to extract embeddings from the detections, which are then used to associate the detections across frames.
This allows DeepSORT to associate objects across frames even when they are occluded or change appearance.

<img src="../img/DeepSORT.png" width="750" style="display: block; margin-left: auto; margin-right: auto;"><br>

_Overview of DeepSORT. (Source: [A. Parico et al. (2021)](https://www.researchgate.net/publication/353256407_Real_Time_Pear_Fruit_Detection_and_Counting_Using_YOLOv4_Models_and_Deep_SORT))_

The embeddings are learned representations of the object's appearance that are used to associate objects across frames.
They are learned by training a deep neural network on a large dataset of images with labeled objects.
For optimal performance, the embeddings should likely be fine-tuned to the domain on which the tracker will be used.

The DeepSORT algorithm can be broken down into the following steps:
1. **Detection:** Run an object detector on the current frame to get the bounding boxes and confidence scores for each detection.
2. **_Feature extraction:_** For each detected object, use a deep neural network to extract embeddings representing the object's appearance.
3. **Estimation:** Make a best estimate of the state of the object in the next frame using a Kalman filter (similar to in SORT).
4. **_Association:_** Use the Hungarian algorithm to associate the detections across frames based on the embeddings. Instead of an IOU metric, the values of the cost matrix combine the Mahalanobis distances between object locations with a cosine distance (derived from the [cosine similarity](https://www.geeksforgeeks.org/cosine-similarity/)) between the appearance embeddings.
5. **Create & Delete:** Create new tracks for unmatched detections and delete old tracks that have not been matched for a certain number of frames
6. **_Re-identification:_** Re-identify objects that have been occluded or changed appearance by comparing the embeddings of the detections.
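
The association cost in step 4 can be sketched as a weighted sum of a motion term and an appearance term, which is how I read the DeepSORT paper. The embeddings, the pretend Mahalanobis distances, and the weight `lam` below are all made-up values; the sketch also skips the gating thresholds that the paper applies before matching.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def combined_cost(motion_dist, appearance_dist, lam=0.5):
    """Weighted sum of motion (Mahalanobis) and appearance (cosine) distances."""
    return lam * motion_dist + (1.0 - lam) * appearance_dist


# One existing track and two candidate detections (hypothetical values).
track_embedding = [1.0, 0.0, 0.0]
det_embeddings = [[0.9, 0.1, 0.0],   # looks like the tracked object
                  [0.0, 1.0, 0.0]]   # looks completely different
motion_dists = [1.2, 1.0]            # the second detection is slightly closer

costs = [
    combined_cost(m, cosine_distance(track_embedding, e))
    for m, e in zip(motion_dists, det_embeddings)
]
best = min(range(len(costs)), key=costs.__getitem__)
print(costs, best)
```

Although the second detection is marginally closer in position, the appearance term tips the match towards the first one. These costs would then fill the matrix passed to the Hungarian algorithm.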

The (italicized) novel steps that DeepSORT introduces are the feature extraction step, the Mahalanobis distance in the association step, and the re-identification step.
The feature extraction and re-id steps are the introduction of deep learning components to the SORT algorithm, which allows DeepSORT to associate objects across frames even when they are occluded or change appearance.

Implementations of DeepSORT can be found in the original repository [github.com/nwojke/deep_sort](https://github.com/nwojke/deep_sort).

#### ByteTrack
[ByteTrack](https://arxiv.org/abs/2110.06864) (Y. Zhang et al. 2021) is a recent object tracking algorithm that builds off many of the concepts of SORT and DeepSORT, but introduces the concept of using both high and low-confidence detections to improve tracking performance.
The algorithm is designed to be lightweight for fast and efficient tracking, making it suitable for real-time applications.

The steps of the ByteTrack model can be outlined as follows:

1. **Detection:** Utilizes an object detector like YOLO to predict detection boxes and scores for each frame in a video sequence.
2. **_Confidence Categorization_:** Separates the detection boxes into high-confidence and low-confidence subsets based on a predefined score threshold.
3. **Estimation:** Applies the Kalman filter to estimate the new locations of tracks in the current frame, capturing the object's motion similar to SORT and DeepSORT.
4. **_High-Confidence Association:_** Associates high-confidence detection boxes with existing tracks using algorithms like the Hungarian algorithm for initial matching. Here, either the slightly modified IOU metric or the re-identification feature distances ([cosine similarities](https://www.geeksforgeeks.org/cosine-similarity/)) are used to calculate the cost matrix.
5. **_Low-Confidence Association:_** Incorporates low-confidence detections with existing tracks, improving the ability to maintain track continuity even with partially occluded objects. Here, only the IOU metric is used to calculate the cost matrix, as the re-id features may be obscured, hence producing a lower confidence around the detection.
6. **Create & Delete:** As before, create new tracks for unmatched high-confidence detections and delete tracks that are no longer visible or have insufficient evidence in the frames.
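
The two association passes (steps 2, 4, and 5, plus the "create" half of step 6) can be sketched as follows. To stay short, the sketch uses a greedy IOU matcher as a stand-in for the Hungarian algorithm and leaves out the Kalman prediction, so `track_boxes` are assumed to already be the predicted boxes. All names, thresholds, and boxes are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(track_boxes, det_boxes, iou_thresh=0.3):
    """Pair tracks and detections by descending IOU (stand-in for Hungarian)."""
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in track_boxes.items()
         for di, d in det_boxes.items()),
        reverse=True,
    )
    matches, used_dets = {}, set()
    for score, ti, di in pairs:
        if score >= iou_thresh and ti not in matches and di not in used_dets:
            matches[ti] = di
            used_dets.add(di)
    return matches


def byte_associate(track_boxes, detections, high_thresh=0.5):
    """detections maps id -> (box, score). Returns (matches, new-track candidates)."""
    high = {i: b for i, (b, s) in detections.items() if s >= high_thresh}
    low = {i: b for i, (b, s) in detections.items() if s < high_thresh}
    # First pass: high-confidence detections against all tracks.
    matches = greedy_match(track_boxes, high)
    # Second pass: leftover tracks against low-confidence detections.
    leftover = {t: b for t, b in track_boxes.items() if t not in matches}
    matches.update(greedy_match(leftover, low))
    # Only unmatched high-confidence detections may start new tracks.
    new_candidates = [i for i in high if i not in matches.values()]
    return matches, new_candidates


track_boxes = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = {
    0: ((1, 1, 11, 11), 0.9),     # confident view of track 1
    1: ((21, 21, 31, 31), 0.3),   # weak (e.g. occluded) view of track 2
    2: ((50, 50, 60, 60), 0.8),   # confident new object
    3: ((70, 70, 80, 80), 0.2),   # weak and unmatched: treated as background
}
matches, new_candidates = byte_associate(track_boxes, detections)
print(matches, new_candidates)
```

Track 2 survives thanks to its low-confidence detection, which a tracker that discards everything below the threshold would have missed.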

ByteTrack introduces the differing uses of the high and low-confidence detections, which significantly reduces missed detections and fragmented trajectories often caused by occlusions or poor visibility. 
Combining the concept of confidence handling together with the Kalman filter, the tracker has been able to improve benchmarked tracking performance without sacrificing speed or efficiency.

Implementations of ByteTrack can be found in the original repository [github.com/ifzhang/ByteTrack](https://github.com/ifzhang/ByteTrack) and the white paper on [arXiv](https://arxiv.org/abs/2110.06864).
Roboflow has a great [tutorial](https://roboflow.com/how-to-track/yolov5) on how to implement ByteTrack with YOLOv5, which we will use in the next section on implementation.

## Implementation

### ByteTrack (Roboflow)
In this section, we will walk through how to implement the ByteTrack algorithm using the YOLOv5 object detector.
Here, we can leverage the aforementioned [tutorial](https://roboflow.com/how-to-track/yolov5) from Roboflow to guide us through the process.

Since we do not currently have a lot of local computational resources, we will keep this implementation rather simple. 
With a working workflow in place, we can later expand on the concepts presented here by distributing the model through Dockerized containers. 


#### Imports
We will pip install the required libraries for our simple implementation of ByteTrack.
Since we still want to use the Benthic Object Detector from MBARI, we will also need to ensure that the model weights are downloaded locally. 

In [None]:
# # if you don't have them already, download supervision and torch  
# !pip install supervision torch -q 
# !pip install -q moviepy


In [85]:
from moviepy.editor import VideoFileClip

import numpy as np
import os
import supervision as sv
import torch


# path to the locally downloaded MBARI Benthic Object Detector weights
model_name = "../../models/fathomnet_benthic/mbari-mb-benthic-33k.pt"

model = torch.hub.load('ultralytics/yolov5', 'custom', path=model_name,  force_reload=True)

Downloading: "https://github.com/ultralytics/yolov5/zipball/master" to /Users/per.morten.halvorsen@schibsted.com/.cache/torch/hub/master.zip
YOLOv5 🚀 2024-4-7 Python-3.11.8 torch-2.2.2 CPU

Fusing layers... 
Model summary: 476 layers, 91841704 parameters, 0 gradients
Adding AutoShape... 


Using the `screen_capture.py` script found in `src/tools/`, we scraped some short clips of animals from different sources on YouTube. 
We can use these clips to manually evaluate the performance of the YOLOv5 + ByteTrack pipeline. 
They get stored locally in the `data/` directory.
 
Some of the sources for videos we scraped include:
- [Monterey Bay Aquarium](https://www.youtube.com/user/MontereyBayAquarium)
- [Deep Marine Scenes](https://www.youtube.com/@DeepMarineScenes)
- [EVNautilus](https://www.youtube.com/@EVNautilus)

In [122]:
# good detections from 8 upwards
i = 32  # 32 is good
byte_tracker = sv.ByteTrack(
    track_activation_threshold= 0.15,
    frame_rate=10
)

box_annotator = sv.BoundingBoxAnnotator()
label_annotator = sv.LabelAnnotator()

def callback(frame: np.ndarray, index: int) -> np.ndarray:
    # frame = super_resolution_model(frame)
    results = model(frame) #[0]
    detections = sv.Detections.from_yolov5(results)
    detections = byte_tracker.update_with_detections(detections)
    labels = [
        f"#{tracker_id} {model.model.names[class_id]} {confidence:0.2f}"
        for bbox, _, confidence, class_id, tracker_id, _
        in detections
    ]
    frame = box_annotator.annotate(scene=frame.copy(), detections=detections)
    frame = label_annotator.annotate(scene=frame.copy(), detections=detections, labels=labels)
    return frame



def mot(name, data_dir="../data/video", output_dir="../data/results", ext="mp4"):
    output_name = name.split("/")[-1] if "/" in name else name
    
    sv.process_video(source_path=f"{data_dir}/{name}.{ext}", target_path=f"{output_dir}/{output_name}_mot.{ext}", callback=callback)

    return f"{output_dir}/{output_name}_mot.{ext}"


def mot_show(name, data_dir="../data/video", output_dir="../data/results", ext="mp4"):
    output_path = mot(name, data_dir, output_dir, ext)
    # NOTE: `open` is macOS-specific; use `xdg-open` on Linux or `start` on Windows
    os.system("open " + output_path)

mot_show(f"aquarium_{i:03}/aquarium_{i:03}", ext="mp4")

In [123]:
# convert to gif
clip = VideoFileClip(f"../data/results/aquarium_{i:03}_mot.mp4")
clip.write_gif(f"../data/example/aquarium_{i:03}_mot.gif")

MoviePy - Building file ../data/example/aquarium_032_mot.gif with imageio.


                                                              

An example output from the tracker can be seen below:

<img src="../data/example/aquarium_032_mot_cropped.gif" width="750" style="display: block; margin-left: auto; margin-right: auto;"><br>

_Crinoidea example footage from the [Deep Marine Scenes](https://www.youtube.com/@DeepMarineScenes) YouTube channel._




We must admit, it took a few tries before getting footage that actually had decent tracking results.
The quality of the footage, the frame rate, the number of objects in the frame, and the object's movement all play a role in the tracking performance.

However, the main focus here was to get a reasonable tracking for an object obviously classifiable to the human eye.
With a working implementation in place, we can wrap up this blog-post and move on to the next step in this project, namely model distribution and deployment.
Follow along on that PR here


## Next steps 

In my next note, I will wrap this implementation into a Docker container, to be easily shared and deployed to other machines or cloud services.