# Getting Started with Object Tracking

## Who this is for

This guide is designed for computer vision practitioners who are new to FiftyOne but have experience working with object tracking datasets. You should be comfortable with Python and basic computer vision concepts. We assume you're trying to:

- Load and organize multi-object tracking data in a structured way

- Visualize tracking results across video frames

- Prepare tracking data for model training or evaluation

## Assumed Knowledge

**CV Concepts:**

- Understanding of bounding boxes and object detection

- Familiarity with multi-object tracking concepts (object IDs, frame sequences)

- Basic knowledge of video data representation (frames, sequences)

**Data Formats:**

- Experience with common annotation formats

- Understanding of image file handling

- Familiarity with dataset directory structures

**Python Skills:**

- Intermediate Python programming

- Experience with pandas DataFrames

- Basic file system operations

**FiftyOne Concepts:**

- [Datasets and Samples](https://beta-docs.voxel51.com/getting_started/basic/datasets_samples_fields/)

- [Labels and Label Types](https://beta-docs.voxel51.com/api/fiftyone.core.labels.html)

- [Dataset Views](https://beta-docs.voxel51.com/how_do_i/recipes/creating_views/)

- [The FiftyOne App](https://beta-docs.voxel51.com/getting_started/basic/application_tour/)

## Time to complete

20-30 minutes (including dataset download time)

## Required packages

We recommend using a virtual environment with FiftyOne already installed. If you need to install FiftyOne, follow the [installation guide](https://beta-docs.voxel51.com/getting_started/basic/install/).

Additional required packages:
```bash
pip install gdown pandas pillow
```

## Content

1. **Loading Object Tracking Data**: Learn how to structure and load multi-object tracking data into FiftyOne, including handling frame sequences, annotations, and metadata.

2. **Working with Scene Attributes**: Understand how to incorporate scene-level attributes and natural language descriptions into your tracking dataset.

3. **Visualizing Tracking Data**: Explore how to use FiftyOne's grouping functionality to visualize tracking sequences as videos without actual video conversion.

# Loading Object Tracking Data

The directory structure of a multi-object tracking (MOT) dataset varies from dataset to dataset, but most follow a similar organization. A typical object tracking dataset includes:

- **Image Data:** This is typically sequential images corresponding to frames in a video

- **Annotations:** These come in various formats such as JSON, XML, or text files. Annotations typically include bounding box coordinates, object IDs, and sometimes additional metadata like occlusion or truncation levels.

- **Attributes Files:** While not universally included, some datasets provide additional attributes or metadata at the scene level.

- **Language:** As part of a shift towards tasks like Vision-Language Multi-Object Tracking, an emerging trend is to include natural language descriptions for each scene. These are useful for models designed to track objects based on human language commands or descriptions.

## Parsing the VisDrone dataset into FiftyOne

In this guide, we will work with the VisDrone dataset, which was introduced in the 2020 paper [*Detection and Tracking Meet Drones Challenge*](https://arxiv.org/abs/2001.06303). This dataset contains object detection and multi-object tracking data from drone-captured imagery. Refer to the dataset's [GitHub repo](https://github.com/VisDrone/VisDrone-Dataset) for more information.

Start by downloading the validation set of the VisDrone multi-object tracking data. The archive is hosted on Google Drive, and you can download it from [here](https://drive.google.com/file/d/1rqnKe9IgU_crMaxRoel9_nuUsMEBBVQu/view?usp=sharing).

Alternatively, you can download it with `gdown` and extract the archive:

```bash
pip install gdown
gdown 1rqnKe9IgU_crMaxRoel9_nuUsMEBBVQu
unzip VisDrone2019-MOT-val.zip
```

This dataset contains **sequences of frames and annotations for each frame**. It does not contain scene-level attributes.

To demonstrate how attributes and language descriptions can be parsed as part of an MOT dataset, we'll define dictionaries of attributes and descriptions for each scene in the validation set. In a real-world scenario, you might have these in `attributes` or `language` directories as part of the dataset. Either way, it's just a matter of writing some logic to parse those files.

What matters for this guide is how those values are parsed as part of a FiftyOne dataset.

```python
scene_attributes = {
    "uav0000086_00000_v": {
        "scene_type": "sporting event",
        "time_of_day": "daytime",
        "pedestrian_density": "high"
    },
    "uav0000117_02622_v": {
        "scene_type": "intersection",
        "time_of_day": "night",
        "pedestrian_density": "medium"
    },
    "uav0000137_00458_v": {
        "scene_type": "intersection",
        "time_of_day": "daytime",
        "pedestrian_density": "high"
    },
    "uav0000182_00000_v": {
        "scene_type": "road",
        "time_of_day": "daytime",
        "pedestrian_density": "low"
    },
    "uav0000268_05773_v": {
        "scene_type": "road",
        "time_of_day": "daytime",
        "pedestrian_density": "low"
    },
    "uav0000305_00000_v": {
        "scene_type": "intersection",
        "time_of_day": "daytime",
        "pedestrian_density": "low"
    },
    "uav0000339_00001_v": {
        "scene_type": "intersection",
        "time_of_day": "dusk",
        "pedestrian_density": "low"
    }
}

scene_language = {
    "uav0000086_00000_v": "A drone flies over a large crowd of people at a sporting complex where people are playing basketball.",
    "uav0000117_02622_v": "This scene shows a busy intersection at night with cars and pedestrians moving around. There seems to be a festival going on.",
    "uav0000137_00458_v": "This scene is a chaotic intersection with cars and pedestrians moving around. No one seems to be following the traffic rules.",
    "uav0000182_00000_v": "This scene shows a drone flying over a road with cars moving in both directions. The road is surrounded by trees.",
    "uav0000268_05773_v": "This scene depicts a highway with cars moving in both directions. The highway is surrounded by trees and buildings.",
    "uav0000305_00000_v": "This scene is a direct overhead shot of an intersection with cars and pedestrians moving around. Traffic seems to be orderly.",
    "uav0000339_00001_v": "This scene is a drone shot of an intersection at dusk with cars, motorcycles, and pedestrians moving around. The scene is well lit."
}
```
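
If such values shipped on disk instead, loading them could look roughly like the sketch below. It assumes a hypothetical layout with one JSON file per scene, named `<scene_id>.json`. VisDrone does not provide these files, and `load_scene_attributes` is an illustrative helper, not part of FiftyOne.

```python
import json
import tempfile
from pathlib import Path


def load_scene_attributes(attributes_dir):
    """Read every <scene_id>.json file into a {scene_id: attributes} dict.

    Hypothetical helper: the one-JSON-file-per-scene layout is an assumption.
    """
    attributes = {}
    for path in sorted(Path(attributes_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            attributes[path.stem] = json.load(f)
    return attributes


# Demo: a temporary directory stands in for a dataset's attributes folder
with tempfile.TemporaryDirectory() as tmp_dir:
    example = {
        "scene_type": "road",
        "time_of_day": "daytime",
        "pedestrian_density": "low",
    }
    scene_file = Path(tmp_dir) / "uav0000182_00000_v.json"
    scene_file.write_text(json.dumps(example), encoding="utf-8")
    loaded = load_scene_attributes(tmp_dir)

print(loaded["uav0000182_00000_v"]["scene_type"])  # road
```

The resulting dictionary has the same shape as `scene_attributes`, so the rest of the guide would work unchanged with either source.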



Let's inspect a few lines from one of the annotation files:

```bash
head -n 3 VisDrone2019-MOT-val/annotations/uav0000086_00000_v.txt
```

```text
102,0,38,666,71,88,1,1,1,0
103,0,45,662,71,91,1,1,1,0
104,0,52,658,72,95,1,1,1,0
```


Here's what each element in the annotation represents:

- `frame_index`: The index of the frame where the object is detected.

- `target_id`: A unique identifier assigned to each tracked object across frames.

- `bbox_left`: The x-coordinate of the top-left corner of the bounding box, in pixels.

- `bbox_top`: The y-coordinate of the top-left corner of the bounding box, in pixels.

- `bbox_width`: The width of the bounding box.

- `bbox_height`: The height of the bounding box.

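- `score`: In detection result files, the confidence score of the prediction. In ground-truth files such as this one, a 0/1 flag indicating whether the box is considered during evaluation.

- `object_category`: The category of the annotated object (e.g., pedestrian, car, bicycle).

- `truncation`: Indicates how much of the object lies outside the image frame (0 = no truncation, 1 = partial truncation).

- `occlusion`: Indicates how much of the object is occluded by another object (0 = no occlusion, 1 = partial occlusion, 2 = heavy occlusion).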
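
To make the column order concrete, here is a minimal sketch that parses one annotation line into named fields. It assumes every field is an integer, as in the excerpt above. `parse_annotation_line` is an illustrative helper, not part of VisDrone or FiftyOne.

```python
import csv
import io

# Column order of a VisDrone MOT annotation line, as listed above
COLUMNS = [
    "frame_index", "target_id", "bbox_left", "bbox_top", "bbox_width",
    "bbox_height", "score", "object_category", "truncation", "occlusion",
]


def parse_annotation_line(line):
    """Turn one comma-separated annotation line into a dict of ints."""
    values = next(csv.reader(io.StringIO(line)))
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return {name: int(value) for name, value in zip(COLUMNS, values)}


# First line of the excerpt shown above
record = parse_annotation_line("102,0,38,666,71,88,1,1,1,0")
print(record["frame_index"], record["target_id"])  # 102 0
```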

The mapping of object category from integer to a human readable format is as follows:

```python
class_names = {
    0: 'ignored_region',
    1: 'pedestrian',
    2: 'people',
    3: 'bicycle',
    4: 'car',
    5: 'van',
    6: 'truck',
    7: 'tricycle',
    8: 'awning-tricycle',
    9: 'bus',
    10: 'motor',
    11: 'others'
}
```


The code below processes the VisDrone MOT dataset into a [FiftyOne Dataset](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html). Here's what we're working with:

1. **Directory Structure**
   - Sequences directory: Contains image frames for each scene
   - Annotations directory: Contains tracking data in text files
   - Each scene has its own sequence folder and matching annotation file

2. **Data Organization**
   - Scene Level: Attributes (scene type, time of day, etc.) and language descriptions
   - Frame Level: Individual images from each sequence
   - Object Level: Bounding boxes with tracking IDs and classifications

3. **Processing Pipeline**
   - Reads each sequence directory
   - Loads corresponding annotation file as pandas DataFrame
   - For each frame:
     - Creates [FiftyOne Sample](https://beta-docs.voxel51.com/api/fiftyone.core.sample.Sample.html) with image path
     - Adds scene metadata and attributes
     - Converts annotations to [FiftyOne Detections](https://beta-docs.voxel51.com/api/fiftyone.core.labels.Detection.html)
     - Normalizes bounding box coordinates
     - Maps class IDs to readable names

4. **Key Features**
   - Maintains object identity across frames (tracking IDs)
   - Preserves scene context through attributes
   - Includes object properties (occlusion, visibility)
   - Normalizes coordinates for consistent representation

The result is a structured [FiftyOne Dataset](https://beta-docs.voxel51.com/api/fiftyone.core.dataset.Dataset.html) that maintains the hierarchical relationship between scenes, frames, and tracked objects while adding rich metadata and descriptions.
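
The coordinate normalization step can be sketched in isolation. FiftyOne stores bounding boxes as `[top-left-x, top-left-y, width, height]` in relative coordinates in `[0, 1]`, so each pixel value is divided by the image width or height. `to_relative_bbox` is an illustrative helper name, and the 1344 x 756 frame size is the one reported for this scene in the sample output in this guide.

```python
def to_relative_bbox(left, top, box_width, box_height, img_width, img_height):
    """Convert a pixel-space box to [x, y, w, h] with values relative to the image."""
    if img_width <= 0 or img_height <= 0:
        raise ValueError("image dimensions must be positive")
    return [
        left / img_width,         # x of the top-left corner
        top / img_height,         # y of the top-left corner
        box_width / img_width,    # box width
        box_height / img_height,  # box height
    ]


# First box from the annotation excerpt, in a 1344 x 756 frame
bbox = to_relative_bbox(38, 666, 71, 88, img_width=1344, img_height=756)
print(bbox)
```

Because the result is resolution-independent, the same labels stay valid if the images are later resized.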

```python
import os
import pandas as pd
import fiftyone as fo
from PIL import Image


# Create dataset
dataset = fo.Dataset(
    name="visdrone-mot",
    overwrite=True,
    persistent=True
)

# Base directories
sequences_dir = "VisDrone2019-MOT-val/sequences/"
annotations_dir = "VisDrone2019-MOT-val/annotations/"

# List to store all samples
samples = []

# Process each sequence
for sequence_name in os.listdir(sequences_dir):
    sequence_path = os.path.join(sequences_dir, sequence_name)
    if not os.path.isdir(sequence_path):
        continue

    # Get scene_id from sequence name
    scene_id = sequence_name  # e.g., "uav0000086_00000_v"

    # Load annotations
    anno_file = os.path.join(annotations_dir, f"{sequence_name}.txt")

    df = pd.read_csv(anno_file, names=[
        'frame_index', 'target_id', 'bbox_left', 'bbox_top', 'bbox_width',
        'bbox_height', 'score', 'object_category', 'truncation', 'occlusion'
    ])

    # Process each image
    for img_name in sorted(os.listdir(sequence_path)):
        img_path = os.path.join(sequence_path, img_name)
        frame_no = int(os.path.splitext(img_name)[0])

        # Get image dimensions
        with Image.open(img_path) as img:
            width, height = img.size

        # Create sample
        sample = fo.Sample(filepath=img_path)

        # Add scene-level information
        sample["scene_id"] = scene_id
        sample["language"] = scene_language[scene_id]
        sample["frame_number"] = frame_no

        # Add scene attributes as Classifications
        for attr_name, attr_value in scene_attributes[scene_id].items():
            sample[attr_name] = fo.Classification(label=attr_value)

        # Get detections for this frame
        frame_dets = df[df.frame_index == frame_no]

        # Create detections list
        dets = []
        for _, row in frame_dets.iterrows():
            # Convert pixel coordinates to relative [x, y, width, height]
            bbox = [
                row.bbox_left / width,
                row.bbox_top / height,
                row.bbox_width / width,
                row.bbox_height / height
            ]

            # Look up the human-readable class name
            class_name = class_names[int(row.object_category)]

            det = fo.Detection(
                bounding_box=bbox,            # relative bounding box
                index=int(row.target_id),     # track ID, shared across frames
                confidence=float(row.score),  # VisDrone score field
                label=class_name,             # class label
                visibility=1 if row.truncation == 0 else 0,  # 1=fully in frame, 0=truncated
                occlusion=0 if row.occlusion == 0 else 1     # 0=fully visible, 1=occluded
            )

            dets.append(det)

        sample["detections"] = fo.Detections(detections=dets)
        samples.append(sample)

# Add all samples at once
dataset.add_samples(samples)
dataset.compute_metadata()  # optional: computes per-sample metadata such as image size
dataset.save()
```

```text
 100% |███████████████| 2846/2846 [25.1s elapsed, 0s remaining, 318.4 samples/s]
Computing metadata...
 100% |███████████████| 2846/2846 [25.3s elapsed, 0s remaining, 112.4 samples/s]
```
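
The loading loop filters the full annotation table once per frame (`df[df.frame_index == frame_no]`). For long sequences, an alternative is to group the rows by frame once and then look each frame up directly. The sketch below illustrates the idea with the standard library only. The annotation rows are made-up values in the VisDrone column order, and `index_rows_by_frame` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows in VisDrone column order (made-up values, not from the dataset)
ANNOTATIONS = """\
1,0,100,200,50,80,1,1,0,0
1,1,300,220,40,70,1,4,0,1
2,0,104,203,50,80,1,1,0,0
"""


def index_rows_by_frame(text):
    """Group annotation rows by frame_index (the first column) in a single pass."""
    rows_by_frame = defaultdict(list)
    for fields in csv.reader(io.StringIO(text)):
        if not fields:  # skip blank lines
            continue
        row = [int(value) for value in fields]
        rows_by_frame[row[0]].append(row)
    return rows_by_frame


rows_by_frame = index_rows_by_frame(ANNOTATIONS)
for frame_no in sorted(rows_by_frame):
    target_ids = [row[1] for row in rows_by_frame[frame_no]]
    print(frame_no, target_ids)
```

When looking up a frame that may have no annotations, `rows_by_frame.get(frame_no, [])` avoids adding an empty entry to the index.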


We can call the Dataset and inspect the fields:

```python
dataset
```

```text
Name:        visdrone-mot
Media type:  image
Num samples: 2846
Persistent:  True
Tags:        []
Sample fields:
    id:                 fiftyone.core.fields.ObjectIdField
    filepath:           fiftyone.core.fields.StringField
    tags:               fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:           fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:         fiftyone.core.fields.DateTimeField
    last_modified_at:   fiftyone.core.fields.DateTimeField
    scene_id:           fiftyone.core.fields.StringField
    language:           fiftyone.core.fields.StringField
    frame_number:       fiftyone.core.fields.IntField
    scene_type:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    time_of_day:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    pedestrian_density: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.lab
```

And inspect the first Sample in the Dataset:

```python
dataset.first()
```

```text
<Sample: {
    'id': '67d074618cd2b7342de36ae4',
    'media_type': 'image',
    'filepath': '/home/harpreet/workspace/getting-started-fo-experiences/object-tracking/VisDrone2019-MOT-val/sequences/uav0000086_00000_v/0000001.jpg',
    'tags': [],
    'metadata': <ImageMetadata: {
        'size_bytes': 165707,
        'mime_type': 'image/jpeg',
        'width': 1344,
        'height': 756,
        'num_channels': 3,
    }>,
    'created_at': datetime.datetime(2025, 3, 11, 17, 35, 29, 56000),
    'last_modified_at': datetime.datetime(2025, 3, 11, 17, 35, 54, 484000),
    'scene_id': 'uav0000086_00000_v',
    'language': 'A drone flies over a large crowd of people at a sporting complex where people are playing basketball.',
    'frame_number': 1,
    'scene_type': <Classification: {
        'id': '67d0744b8cd2b7342de17c1b',
        'tags': [],
        'label': 'sporting event',
        'confidence': None,
        'logits': None,
    }>,
    'time_of_day': <Classification: {
        'id': '6
```

As mentioned earlier, each scene in this dataset is a sequence of frames, so the scenes could be treated as videos. However, converting frame sequences to MP4 videos is inefficient because:

1. The conversion process is time-consuming

2. High-resolution videos consume excessive storage space

3. Machine learning tasks typically process individual frames anyway, making video conversion unnecessary

Instead, you can use [`group_by()`](https://beta-docs.voxel51.com/fiftyone_concepts/using_views/#grouping) to create a view that groups the data by scene, ordered by frame number/timestamp. When you load a [dynamic](https://beta-docs.voxel51.com/fiftyone_concepts/using_datasets/#dynamic-attributes) grouped view in the App, you'll have the same experience as video datasets:

- You can hover over tiles in the grid to animate scenes' frame data

- When you click on a tile, you'll have familiar video player controls in the modal to navigate the scene

```python
view = dataset.group_by(
    "scene_id",
    order_by="frame_number"
)

# Save the view for easy loading in the App
dataset.save_view("scenes", view)
```
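
Conceptually, grouping by `scene_id` and ordering by `frame_number` buckets the frames by scene and sorts each bucket into playback order. The plain-Python sketch below is an analogy only, not how FiftyOne implements grouped views. The tuples are illustrative stand-ins for samples.

```python
from itertools import groupby
from operator import itemgetter

# Stand-ins for samples: (scene_id, frame_number, filename); values are illustrative
frames = [
    ("uav0000117_02622_v", 2, "0000002.jpg"),
    ("uav0000086_00000_v", 1, "0000001.jpg"),
    ("uav0000117_02622_v", 1, "0000001.jpg"),
    ("uav0000086_00000_v", 2, "0000002.jpg"),
]

# Sort by scene, then by frame number, so each group is contiguous and ordered
ordered = sorted(frames, key=itemgetter(0, 1))

# One entry per scene, holding that scene's frames in playback order
scenes = {
    scene_id: [filename for _, _, filename in group]
    for scene_id, group in groupby(ordered, key=itemgetter(0))
}

for scene_id, filenames in scenes.items():
    print(scene_id, filenames)
```

Each entry plays the role of one tile in the App grid: a single scene whose frames can be stepped through in order.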

You can now view the scenes in the App:

```python
fo.launch_app(dataset)
```

<img src="assets/visdrone-explore.gif" width="80%">

## Summary

In this guide, we explored how to work with multi-object tracking data in FiftyOne, using the VisDrone dataset as an example. Here's what we covered:

1. **Dataset Structure**: We learned how tracking datasets typically organize their data:
   - Frame sequences for each scene
   - Annotation files with bounding boxes and tracking IDs
   - Optional scene-level attributes and descriptions

2. **Data Loading**: We walked through a complete pipeline for:
   - Loading frame sequences and annotations
   - Converting coordinates to normalized format
   - Adding scene metadata and attributes
   - Creating structured FiftyOne samples with detections

3. **Efficient Visualization**: Instead of converting sequences to videos, we learned how to:
   - Use [`group_by()`](https://beta-docs.voxel51.com/api/fiftyone.core.collections.SampleCollection.html#group_by) to organize frames by scene
   - Create [dynamic](https://beta-docs.voxel51.com/fiftyone_concepts/using_datasets/#dynamic-attributes) views for video-like playback
   - Leverage FiftyOne's built-in visualization capabilities



### Next steps

* If your starting point is a native video, refer to the [docs for how to work with video data](https://beta-docs.voxel51.com/api/fiftyone.core.video.html).

* Check out [this blog](https://voxel51.com/blog/tracking-datasets-in-fiftyone/) for an end-to-end walk through of loading, predicting, and evaluating tracking results with the MOT17 dataset.

* Join the [Discord community](https://community.voxel51.com/)

* Follow us on [LinkedIn](https://www.linkedin.com/company/voxel51/)


