## Object Tracking in FiftyOne: Adding Tracking Results to a FiftyOne Dataset

## Who this is for

This tutorial is designed for:
- Computer vision practitioners with [basic FiftyOne experience](https://beta-docs.voxel51.com/getting_started/) (can load datasets and use the App)
- Computer vision practitioners interested in implementing object tracking workflows
- Anyone looking to integrate tracking results into their FiftyOne datasets for visualization and analysis

## Assumed Knowledge

### Computer Vision Concepts
- Basic understanding of object detection and tracking
- Familiarity with bounding box coordinates and confidence scores
- Understanding of video sequences and frame-based processing

### Python Skills
- Intermediate Python programming
- Basic understanding of PyTorch
- Experience working with Jupyter notebooks

### FiftyOne Concepts
- [Dataset basics and samples](https://beta-docs.voxel51.com/getting_started/basic/datasets_samples_fields/)
- [Detection fields and labels](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Detection.html)
- [Dataset views and filtering](https://beta-docs.voxel51.com/how_do_i/cheat_sheets/filtering_cheat_sheet/)

## Time to Complete
Estimated time: 30-45 minutes

## Required Packages

We recommend using a virtual environment with [FiftyOne already installed](https://beta-docs.voxel51.com/getting_started/basic/install/). Additional required packages:

```bash
# Assuming you have fiftyone installed
pip install ultralytics torch
```

## Content Overview

This notebook contains several key sections:
1. Dataset Loading - Downloading and preparing the VisDrone-MOT dataset
2. Implementation Pattern - Detailed breakdown of the tracking integration approach
3. Tracking Implementation - Step-by-step code for [adding tracking results to FiftyOne](https://beta-docs.voxel51.com/how_do_i/recipes/adding_detections/)
4. Results Visualization - Exploring tracking results in the [FiftyOne App](https://beta-docs.voxel51.com/getting_started/basic/application_tour/)

#### Object Tracking with Ultralytics

This code demonstrates a pattern for integrating object tracking into a FiftyOne workflow using the [Ultralytics integration](https://beta-docs.voxel51.com/integrations/ultralytics/). This is for **illustration purposes only** and focuses on the implementation pattern rather than prediction quality. 

The example illustrates a common object tracking pipeline:

1. Scene-based Processing: Videos are processed as logical scenes, maintaining tracking continuity within each scene

2. Object Tracking: Using YOLO's built-in tracking capabilities to assign consistent IDs to objects across frames

3. FiftyOne Integration: Storing tracking results as [FiftyOne Detection](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Detection.html)

Let's begin by downloading the VisDrone-MOT dataset from the [Voxel51 Hugging Face org](https://huggingface.co/datasets/Voxel51/visdrone-mot). [Refer to these docs](https://beta-docs.voxel51.com/integrations/huggingface/) for more information about how FiftyOne integrates with Hugging Face.

```python
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

visdrone = fouh.load_from_hub(
    "Voxel51/visdrone-mot",
    name="visdrone-mot",
    persistent=True,
)
```


### Implementation pattern

We'll use the open-vocabulary detection model YOLO-World from Ultralytics due to its ease of implementation. Any other model follows the same FiftyOne-specific logic for iterating through the sequences and adding the [Detections](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Detection.html) as a [Field](https://beta-docs.voxel51.com/api/fiftyone.core.fields.Field.html). For each scene in the dataset:


#### 1. Initialize tracking model
- Grab object classes using FiftyOne's [`distinct()`](https://beta-docs.voxel51.com/fiftyone_concepts/using_aggregations/#distinct-values) method
- Create a fresh YOLO model for each scene to avoid cross-contamination
- Configure model to detect only the filtered classes

#### 2. Load all frames in sequence order
- Use FiftyOne's [`match()`](https://beta-docs.voxel51.com/how_do_i/cheat_sheets/filtering_cheat_sheet/#built-in-filter-and-match-functions) with [`ViewField`](https://beta-docs.voxel51.com/api/fiftyone.core.expressions.ViewField.html) to filter frames by scene ID
- Sort frames by frame number with [`sort_by()`](https://beta-docs.voxel51.com/tutorials/pandas_comparison/#sorting)
- Extract frame filepaths to create a properly sequenced input
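
To see why step 2 sorts on the integer frame number rather than on the file path, here is a minimal sketch. The records and file names are hypothetical stand-ins for samples with `filepath` and `frame_number` fields; real paths in the dataset may be zero-padded and sort correctly either way.

```python
# Hypothetical records standing in for samples of a single scene
frames = [
    {"filepath": "scene_a/frame_10.jpg", "frame_number": 10},
    {"filepath": "scene_a/frame_2.jpg", "frame_number": 2},
    {"filepath": "scene_a/frame_1.jpg", "frame_number": 1},
]

# Sorting on the path compares strings character by character
by_path = [f["frame_number"] for f in sorted(frames, key=lambda f: f["filepath"])]

# Sorting on the integer field gives the true temporal order
by_number = [f["frame_number"] for f in sorted(frames, key=lambda f: f["frame_number"])]

print(by_path)    # [1, 10, 2]
print(by_number)  # [1, 2, 10]
```

A tracker fed frames in the first order would see objects jump between distant frames, so the sequence order passed to the model matters as much as the frames themselves.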

#### 3. Run tracking across the entire sequence
- Pass all frames of the scene to YOLO's `track()` method in a single call
- Use the BoT-SORT tracker to maintain object identity across frames
- Get back detection boxes with consistent ID numbers for the same object

#### 4. Store tracking results with persistent IDs
- Create a fresh [`fo.Detections()`](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Detections.html) object for each frame
- Convert YOLO's detection format to FiftyOne's normalized coordinates
- Store track IDs in the [`index` field of each `fo.Detection`](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Detection.html) object
- Save changes to the database with [`video_frame.save()`](https://beta-docs.voxel51.com/faq/#why-didnt-changes-to-my-dataset-save)
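
The coordinate conversion in step 4 can be sketched on its own. This assumes the model returns absolute pixel corners `[x1, y1, x2, y2]` and that FiftyOne expects `[top-left-x, top-left-y, width, height]` as fractions of the image size; the helper name `xyxy_to_fiftyone` is hypothetical and only used for illustration.

```python
def xyxy_to_fiftyone(box, image_width, image_height):
    """Convert absolute [x1, y1, x2, y2] pixels to relative [x, y, w, h]."""
    x1, y1, x2, y2 = box
    return [
        x1 / image_width,          # left edge as a fraction of image width
        y1 / image_height,         # top edge as a fraction of image height
        (x2 - x1) / image_width,   # box width as a fraction of image width
        (y2 - y1) / image_height,  # box height as a fraction of image height
    ]


# A 100x50 pixel box with its top-left corner at (200, 100) in a 1000x500 image
print(xyxy_to_fiftyone([200, 100, 300, 150], 1000, 500))  # [0.2, 0.2, 0.1, 0.1]
```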

#### 5. Clean up resources before next scene
- Delete model and tracking results
- Clear CUDA memory cache
- Force garbage collection to prevent memory leaks

The key advantage of this approach is processing each scene as a complete sequence, allowing the tracker to maintain consistent object identities across all frames in a video.

```python
import gc

import torch
import fiftyone as fo
from fiftyone import ViewField as F
from ultralytics import YOLO

# Make sure image dimensions are available for box normalization
# (samples that already have metadata are skipped)
visdrone.compute_metadata()

# Get all classes
all_classes = visdrone.distinct("detections.detections.label")

# Remove unwanted classes
detection_classes = [c for c in all_classes if c not in ["ignored_region", "others"]]

# Use this to group samples by scene_id
scene_ids = visdrone.distinct("scene_id")

for scene_id in scene_ids:
    # Create new model instance for each scene
    tracker_model = YOLO("yolov8x-world.pt")
    tracker_model.set_classes(detection_classes)  # Classes to predict

    # Get all frames for this scene and sort by frame_number
    frames_in_scene = visdrone.match(F("scene_id") == scene_id).sort_by("frame_number")
    print(f"Processing scene {scene_id} with {len(frames_in_scene)} frames")

    # Get the image paths in sequence order
    frame_filepaths = [frame.filepath for frame in frames_in_scene]

    # Run tracking on the sequence of images
    tracking_results = tracker_model.track(
        source=frame_filepaths,
        show=False,
        persist=True,
        half=True,
        tracker="botsort.yaml",  # or use bytetrack.yaml
    )

    # Update the dataset with tracking results
    for video_frame, result in zip(frames_in_scene, tracking_results):
        # Get image dimensions for normalization
        image_width = video_frame.metadata.width
        image_height = video_frame.metadata.height

        # Always create a fresh Detections object for tracked_objects
        video_frame["tracked_objects"] = fo.Detections()

        # Get boxes from this result
        boxes = result.boxes

        # Skip if no boxes or if boxes is empty
        if boxes is None or len(boxes) == 0:
            video_frame.save()
            continue

        # Get all data at once to avoid repeated GPU->CPU transfers
        xyxy_coords = boxes.xyxy.cpu().numpy()
        class_indices = boxes.cls.cpu().numpy()
        confidences = boxes.conf.cpu().numpy()

        # Safely get track IDs if they exist
        track_ids = None
        if hasattr(boxes, "id") and boxes.id is not None:
            track_ids = boxes.id.cpu().numpy()

        # Get class name mapping
        class_mapping = result.names

        for j in range(len(boxes)):
            # Get box coordinates
            x1, y1, x2, y2 = xyxy_coords[j]

            # Convert to [x, y, width, height] and normalize
            normalized_bbox = [
                x1 / image_width,
                y1 / image_height,
                (x2 - x1) / image_width,
                (y2 - y1) / image_height,
            ]

            # Get class name, confidence, and track ID
            class_idx = int(class_indices[j])
            class_name = class_mapping[class_idx]
            confidence = float(confidences[j])
            track_id = int(track_ids[j]) if track_ids is not None else None

            # Create detection with track ID
            tracked_detection = fo.Detection(
                label=class_name,
                bounding_box=normalized_bbox,
                confidence=confidence,
                index=track_id,
            )

            video_frame.tracked_objects.detections.append(tracked_detection)

        # Save each frame after processing
        video_frame.save()

    print(f"Completed processing scene {scene_id}")

    # Clean up resources before next scene
    del tracker_model, tracking_results
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    gc.collect()
```

You can now launch the App and view the results:

```python
fo.launch_app(visdrone)
```

<img src="assets/object-tracking-yolo.gif">
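
Beyond the App, a quick numeric check on the stored track IDs can be useful. The sketch below counts how many frames each track ID appears in; the helper name `track_lengths` and the example IDs are hypothetical. It assumes the input is a per-frame list of IDs for one scene, for example what `values("tracked_objects.detections.index")` returns on a view matched to a single `scene_id`. Because a fresh model is created for each scene, IDs are only comparable within a scene.

```python
from collections import Counter


def track_lengths(per_frame_ids):
    """Count the number of frames in which each track ID appears."""
    counts = Counter()
    for ids in per_frame_ids:
        # A frame may have no detections (empty list or None)
        for track_id in set(ids or []):
            # Detections stored without a track ID are ignored
            if track_id is not None:
                counts[track_id] += 1
    return dict(counts)


# Hypothetical IDs for a four-frame scene; the last frame has no detections
example = [[1, 2], [1, 2], [1], None]
print(track_lengths(example))
```

Many tracks that span only one or two frames can be a hint that the tracker is losing and re-assigning identities.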

## Tracking on a Video Dataset Using SAM2

The next example shows how to perform tracking on a video dataset using the SAM2 model. Let's start by downloading a subset of the [WebUOT-1M dataset](https://voxel51.com/blog/webuot-1m-a-dataset-for-underwater-object-tracking/) from the Hugging Face Hub:

```python
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

webuot = load_from_hub(
    "Voxel51/WebUOT-238-Test",
    name="webuot",
)
```

For illustration purposes, we will randomly select 3 samples from the `webuot` dataset using FiftyOne's [`take()`](https://beta-docs.voxel51.com/api/fiftyone.core.collections.SampleCollection.html#take) method. This creates a lightweight [View](https://beta-docs.voxel51.com/api/fiftyone.core.view.html) without modifying the original dataset.

#### Select specific frames for modification
- We'll use [`match_frames()`](https://beta-docs.voxel51.com/api/fiftyone.core.collections.SampleCollection.html#match_frames) with a [`ViewField`](https://beta-docs.voxel51.com/api/fiftyone.core.expressions.ViewField.html) condition to target only frames with frame number greater than 1
- This selects all frames except the first frame in each video sample

#### Clear ground truth annotations
- The [`set_field("frames.gt", None)`](https://beta-docs.voxel51.com/api/fiftyone.core.collections.SampleCollection.html#set_field) operation removes all ground truth annotations from the matched frames
- Sets the `gt` field to None, effectively clearing any existing data

#### Persist changes
- The [`save()`](https://beta-docs.voxel51.com/faq/#why-didnt-changes-to-my-dataset-save) method commits these changes back to the underlying dataset. Without this call, the changes would exist only in the view and would not affect the stored data
- The [`clone()`](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html#clone) method creates a deep copy of the view's contents as a new dataset. The source media files are not copied

This will remove annotations from everything except the first frame, which is useful for creating partial annotation scenarios or preparing data for specific evaluation workflows.

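```python
from fiftyone import ViewField as F

smol_view = webuot.take(3, seed=51)

# Clear the ground truth on every frame after the first, then persist the change
smol_view.match_frames(F("frame_number") > 1).set_field("frames.gt", None).save()

# Clone the view into a new dataset (media files are not copied)
smol_view = smol_view.clone("smol_webout")
```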

Let's inspect the subset and verify that only the annotations from the first frame are kept:

<img src="assets/webuot-smol.gif">

We'll use [FiftyOne's integration with SAM2](https://voxel51.com/blog/sam-2-is-now-available-in-fiftyone/) via the [FiftyOne Model Zoo](https://beta-docs.voxel51.com/models/model_zoo/api/). Prior to using the model, you will need to install SAM2. Follow the [installation instructions from the SAM2 GitHub](https://github.com/facebookresearch/sam2).

[`load_zoo_model()`](https://beta-docs.voxel51.com/api/fiftyone.zoo.models.html#load_zoo_model) will download and initialize SAM2 (Segment Anything Model 2). In this example we are using the `hiera-tiny-video` variant optimized for video processing. Visit the [documentation for the Model Zoo](https://beta-docs.voxel51.com/models/model_zoo/models/) and search for "SAM2" to see all supported checkpoints.

We'll call [`apply_model`](https://beta-docs.voxel51.com/api/fiftyone.core.collections.SampleCollection.html#apply_model) and pass the existing ground truth boxes from the first frame (in `frames.gt`) as prompts to guide segmentation. The model then creates new boxes and [Segmentations](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Segmentation.html) in the `sam_predictions` field.


```python
import fiftyone.zoo as foz

sam2_model = foz.load_zoo_model("segment-anything-2-hiera-tiny-video-torch")

smol_view.apply_model(
    sam2_model,
    label_field="sam_predictions",
    prompt_field="frames.gt",  # Can be a detections or a keypoints field
)
```

We can launch the App and view the results:

<img src="assets/webuot-sam-tracking.gif">

## Summary

In this tutorial, you learned how to:

1. **Implement Object Tracking in FiftyOne**
   - Set up tracking workflows using both YOLO-World and SAM2
   - Process videos as logical scenes to maintain tracking continuity
   - Store tracking results with persistent object IDs in FiftyOne's format

2. **Handle Common Technical Challenges**
   - Manage GPU memory effectively across long sequences
   - Convert between different coordinate systems (YOLO to FiftyOne)
   - Use proper cleanup procedures to prevent memory leaks

3. **Leverage FiftyOne's Features**
   - Use `distinct()` to extract unique classes and scene IDs
   - Filter and sort frames using `match()` and `sort_by()`
   - Visualize tracking results in the FiftyOne App

The patterns demonstrated here can be adapted for:
- Different tracking models or algorithms
- Custom video datasets
- Various object tracking scenarios (multi-object, single-object)

For more information about working with videos in FiftyOne, check out the [video documentation](https://beta-docs.voxel51.com/api/fiftyone.core.video.html). 

### Next steps

* Check out the [documentation for evaluating detections](https://beta-docs.voxel51.com/tutorials/evaluate_detections/) in FiftyOne

* Check out [this blog](https://voxel51.com/blog/tracking-datasets-in-fiftyone/) for an end-to-end walkthrough of loading, predicting, and evaluating tracking results with the MOT17 dataset.

* Join the [Discord community](https://community.voxel51.com/)

* Follow us on [LinkedIn](https://www.linkedin.com/company/voxel51/)

