# Getting Started with Object Tracking: Loading Object Tracking Data

The directory structure of a multi-object tracking (MOT) dataset varies from dataset to dataset, but most follow a similar organization. A typical object tracking dataset includes:

- **Image Data:** This is typically a sequence of images corresponding to the frames of a video.

- **Annotations:** These come in various formats such as JSON, XML, or text files. Annotations typically include bounding box coordinates, object IDs, and sometimes additional metadata like occlusion or truncation levels.

- **Attributes Files:** While not universally included, some datasets might provide additional attributes or metadata at the scene level. 

- **Language:**  As part of a shift towards integrating tasks like Vision-Language Multi-Object Tracking, an emerging trend is including natural language descriptions for each scene in the datasets. This is useful for models that are designed to track objects based on human language commands or descriptions.

#### Parsing the VisDrone dataset into FiftyOne

In this guide, we will work with the VisDrone dataset, which was introduced in the 2020 paper [*Detection and Tracking Meet Drones Challenge*](https://arxiv.org/abs/2001.06303). This dataset contains object detection and multi-object tracking data from drone-captured imagery. Refer to the dataset's [GitHub repo](https://github.com/VisDrone/VisDrone-Dataset) for more information.

Start by downloading the VisDrone validation set for multi-object tracking. It is hosted on Google Drive, and you can download it from [here](https://drive.google.com/file/d/1rqnKe9IgU_crMaxRoel9_nuUsMEBBVQu/view?usp=sharing).

Alternatively, you can download using `gdown` and extract the folder:

```bash

pip install gdown
gdown 1rqnKe9IgU_crMaxRoel9_nuUsMEBBVQu
unzip VisDrone2019-MOT-val.zip
```

This dataset contains **sequences of frames and annotations for each frame**. It does not contain scene-level attributes.

To demonstrate how we can parse attributes or language as part of a MOT dataset, I'll generate dictionaries of attributes and language descriptions for each scene in the validation set. In a "real-world" scenario you might have these in `attributes` or `language` directories as part of the dataset. Whatever the case may be, it's just a matter of writing some logic to parse those files.

What matters for this guide is how those values are parsed as part of a FiftyOne dataset.

In [18]:
from pathlib import Path

scene_attributes = {
    "uav0000086_00000_v": {
        "scene_type": "sporting event",
        "time_of_day": "daytime",
        "pedestrian_density": "high"
    },
    "uav0000117_02622_v": {
        "scene_type": "intersection",
        "time_of_day": "night",
        "pedestrian_density": "medium"
    },
    "uav0000137_00458_v": {
        "scene_type": "intersection",
        "time_of_day": "daytime",
        "pedestrian_density": "high"
    },
    "uav0000182_00000_v": {
        "scene_type": "road",
        "time_of_day": "daytime",
        "pedestrian_density": "low"
    },
    "uav0000268_05773_v": {
        "scene_type": "road",
        "time_of_day": "daytime",
        "pedestrian_density": "low"
    },
    "uav0000305_00000_v": {
        "scene_type": "intersection",
        "time_of_day": "daytime",
        "pedestrian_density": "low"
    },
    "uav0000339_00001_v": {
        "scene_type": "intersection",
        "time_of_day": "dusk",
        "pedestrian_density": "low"
    }
}

scene_language = {
    "uav0000086_00000_v": "A drone flies over a large crowd of people at a sporting complex where people are playing basketball.",
    "uav0000117_02622_v": "This scene shows a busy intersection at night with cars and pedestrians moving around. There seems to be a festival going on.",
    "uav0000137_00458_v": "This scene is a chaotic intersection with cars and pedestrians moving around. No one seems to be following the traffic rules.",
    "uav0000182_00000_v": "This scene shows a drone flying over a road with cars moving in both directions. The road is surrounded by trees.",
    "uav0000268_05773_v": "This scene depicts a highway with cars moving in both directions. The highway is surrounded by trees and buildings.",
    "uav0000305_00000_v": "This scene is a direct overhead shot of an intersection with cars and pedestrians moving around. Traffic seems to be orderly.",
    "uav0000339_00001_v": "This scene is a drone shot of an intersection at dusk with cars, motorcycles, and pedestrians moving around. The scene is well lit."
}
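If the attributes and descriptions lived on disk instead of in hand-written dictionaries, a loader could look roughly like the sketch below. The layout (`attributes/<scene_id>.json` and `language/<scene_id>.txt`) and the `load_scene_metadata` helper are assumptions made up for illustration; VisDrone does not ship these folders.

```python
import json
import tempfile
from pathlib import Path


def load_scene_metadata(root):
    """Read per-scene attributes (JSON) and descriptions (text) from disk.

    Assumes a hypothetical layout: <root>/attributes/<scene_id>.json and
    <root>/language/<scene_id>.txt. VisDrone does not ship these folders.
    """
    root = Path(root)
    attributes = {
        p.stem: json.loads(p.read_text(encoding="utf-8"))
        for p in sorted((root / "attributes").glob("*.json"))
    }
    language = {
        p.stem: p.read_text(encoding="utf-8").strip()
        for p in sorted((root / "language").glob("*.txt"))
    }
    return attributes, language


# Build a tiny throwaway dataset on disk to exercise the loader
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "attributes").mkdir()
    (tmp / "language").mkdir()
    (tmp / "attributes" / "uav0000182_00000_v.json").write_text(
        json.dumps({"scene_type": "road", "time_of_day": "daytime"}),
        encoding="utf-8",
    )
    (tmp / "language" / "uav0000182_00000_v.txt").write_text(
        "A drone flies over a road.\n", encoding="utf-8"
    )
    demo_attributes, demo_language = load_scene_metadata(tmp)

print(demo_attributes)
print(demo_language)
```

The two returned dictionaries have the same shape as `scene_attributes` and `scene_language`, so the rest of the guide would not need to change.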



Let's inspect a few lines from one of the annotation files:

In [None]:
!head -n 3 VisDrone2019-MOT-val/annotations/uav0000086_00000_v.txt

Here's what each element in the annotation represents:

- `frame_index`: The index of the frame where the object is detected.

- `target_id`: A unique identifier assigned to each tracked object across frames.

- `bbox_left`: The x-coordinate of the top-left corner of the bounding box.

- `bbox_top`: The y-coordinate of the top-left corner of the bounding box.

- `bbox_width`: The width of the bounding box.

- `bbox_height`: The height of the bounding box.

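- `score`: In detection result files, the confidence score of the detection. In ground-truth annotation files such as these, the VisDrone format description treats it as a flag: 1 means the box is considered in evaluation and 0 means it is ignored.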

- `object_category`: The category of the detected object (e.g., person, car, bicycle).

- `truncation`: Indicates how much of the object lies outside the image frame (0 means no truncation).

- `occlusion`: Indicates how much of the object is hidden by other objects (0 means no occlusion; higher values mean heavier occlusion).
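To make the column layout concrete, here is a minimal sketch that parses annotation rows into dictionaries keyed by the field names above, using only the standard library. The sample rows are made up for illustration (not copied from the dataset), and the sketch assumes every column holds an integer.

```python
import csv
import io

FIELDS = [
    "frame_index", "target_id", "bbox_left", "bbox_top", "bbox_width",
    "bbox_height", "score", "object_category", "truncation", "occlusion",
]


def parse_annotations(text):
    """Turn comma-separated annotation rows into dicts of ints keyed by field name."""
    rows = []
    for values in csv.reader(io.StringIO(text)):
        if not values:
            continue  # skip blank lines
        rows.append(dict(zip(FIELDS, (int(v) for v in values))))
    return rows


# Made-up rows in the ten-column layout, not copied from the dataset
sample_text = "1,0,593,43,174,190,1,4,0,0\n1,1,10,20,30,40,1,1,0,1\n"
records = parse_annotations(sample_text)
print(records[0])
```

For a real file you would pass the file's contents instead of `sample_text`.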

The mapping of object categories from integers to human-readable names is as follows:

In [24]:
class_names = {
    0: 'ignored_region', 
    1: 'pedestrian', 
    2: 'people', 
    3: 'bicycle', 
    4: 'car', 
    5: 'van', 
    6: 'truck', 
    7: 'tricycle', 
    8: 'awning-tricycle', 
    9: 'bus', 
    10: 'motor', 
    11: 'others'
    }


The code below converts the VisDrone MOT dataset into a FiftyOne dataset. Here is an outline of what it does:

1. **Directory Structure**
   - Sequences directory: Contains image frames for each scene
   - Annotations directory: Contains tracking data in text files
   - Each scene has its own sequence folder and matching annotation file

2. **Data Organization**
   - Scene Level: Attributes (scene type, time of day, etc.) and language descriptions
   - Frame Level: Individual images from each sequence
   - Object Level: Bounding boxes with tracking IDs and classifications

3. **Processing Pipeline**
   - Reads each sequence directory
   - Loads corresponding annotation file as DataFrame
   - For each frame:
     - Creates FiftyOne Sample with image path
     - Adds scene metadata and attributes
     - Converts annotations to FiftyOne Detections
     - Normalizes bounding box coordinates
     - Maps class IDs to readable names

4. **Key Features**
   - Maintains object identity across frames (tracking IDs)
   - Preserves scene context through attributes
   - Includes object properties (occlusion, visibility)
   - Normalizes coordinates for consistent representation

The result is a structured FiftyOne dataset that maintains the hierarchical relationship between scenes, frames, and tracked objects while adding rich metadata and descriptions.
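The coordinate normalization step deserves a closer look. FiftyOne stores bounding boxes as relative `[x, y, width, height]` values in `[0, 1]`, where `(x, y)` is the top-left corner, while the annotation files use pixels. The sketch below shows the conversion in isolation; the helper names `to_relative_bbox` and `to_pixel_bbox` and the example box are hypothetical.

```python
def to_relative_bbox(left, top, box_width, box_height, img_width, img_height):
    """Convert a pixel-space box to relative [x, y, width, height] in [0, 1]."""
    return [
        left / img_width,
        top / img_height,
        box_width / img_width,
        box_height / img_height,
    ]


def to_pixel_bbox(bbox, img_width, img_height):
    """Invert the conversion: relative [x, y, width, height] back to pixels."""
    x, y, w, h = bbox
    return [x * img_width, y * img_height, w * img_width, h * img_height]


# Hypothetical box on a 1344x756 frame
relative = to_relative_bbox(336, 189, 672, 378, img_width=1344, img_height=756)
print(relative)
print(to_pixel_bbox(relative, 1344, 756))
```

Dividing x-values by the image width and y-values by the image height keeps the boxes valid if the images are later resized.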

In [42]:
import os
import pandas as pd
import fiftyone as fo
from PIL import Image


# Create dataset
dataset = fo.Dataset(
    name="visdrone-mot",
    overwrite=True,
    persistent=True
    )

# Base directories
sequences_dir = "VisDrone2019-MOT-val/sequences/"
annotations_dir = "VisDrone2019-MOT-val/annotations/"

# List to store all samples
samples = []

# Process each sequence
for sequence_name in os.listdir(sequences_dir):
    sequence_path = os.path.join(sequences_dir, sequence_name)
    if not os.path.isdir(sequence_path):
        continue
        
    # Get scene_id from sequence name
    scene_id = sequence_name  # e.g., "uav0000086_00000_v"
    
    # Load annotations
    anno_file = os.path.join(annotations_dir, f"{sequence_name}.txt")
    
    df = pd.read_csv(anno_file, names=[
        'frame_index', 'target_id', 'bbox_left', 'bbox_top', 'bbox_width', 
        'bbox_height', 'score', 'object_category', 'truncation', 'occlusion'
    ])
    
    # Process each image
    for img_name in sorted(os.listdir(sequence_path)):
        img_path = os.path.join(sequence_path, img_name)
        frame_no = int(os.path.splitext(img_name)[0])
        
        # Get image dimensions
        with Image.open(img_path) as img:
            width, height = img.size
        
        # Create sample
        sample = fo.Sample(filepath=img_path)
        
        # Add scene-level information
        sample["scene_id"] = scene_id
        sample["language"] = scene_language[scene_id]
        sample["frame_number"] = frame_no
        
        # Add scene attributes as Classifications
        for attr_name, attr_value in scene_attributes[scene_id].items():
            sample[attr_name] = fo.Classification(label=attr_value)
        
        # Get detections for this frame
        frame_dets = df[df.frame_index == frame_no]
        
        # Create detections list
        dets = []
        for _, row in frame_dets.iterrows():
            bbox = [
                row.bbox_left / width,
                row.bbox_top / height,
                row.bbox_width / width,
                row.bbox_height / height
            ]
            
            # Look up the class name for this detection
            class_name = class_names[int(row.object_category)]
            
            det = fo.Detection(
                bounding_box=bbox,  # relative [x, y, width, height]
                index=int(row.target_id),  # track ID shared by the same object across frames
                confidence=float(row.score),  # value of the annotation's score column
                label=class_name,  # human-readable class name
                visibility=1 if row.truncation == 0 else 0,  # 1=fully inside the frame, 0=truncated
                occlusion=0 if row.occlusion == 0 else 1     # 0=not occluded, 1=occluded
            )

            dets.append(det)
            
        sample["detections"] = fo.Detections(detections=dets)
        samples.append(sample)

# Add all samples at once
dataset.add_samples(samples)
dataset.compute_metadata() # compute dataset stats, you can comment this out if you don't want to compute metadata
dataset.save()

We can call the Dataset and inspect the fields:

In [43]:
dataset

Name:        visdrone-mot
Media type:  image
Num samples: 2846
Persistent:  True
Tags:        []
Sample fields:
    id:                 fiftyone.core.fields.ObjectIdField
    filepath:           fiftyone.core.fields.StringField
    tags:               fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:           fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:         fiftyone.core.fields.DateTimeField
    last_modified_at:   fiftyone.core.fields.DateTimeField
    scene_id:           fiftyone.core.fields.StringField
    language:           fiftyone.core.fields.StringField
    frame_number:       fiftyone.core.fields.IntField
    scene_type:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    time_of_day:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    pedestrian_density: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.lab

And inspect the first Sample in the Dataset:

In [48]:
dataset.first()

<Sample: {
    'id': '67cfaff5e0c485e4833dadf5',
    'media_type': 'image',
    'filepath': '/home/harpreet/workspace/getting-started-fo-experiences/object-tracking/VisDrone2019-MOT-val/sequences/uav0000086_00000_v/0000001.jpg',
    'tags': [],
    'metadata': <ImageMetadata: {
        'size_bytes': 165707,
        'mime_type': 'image/jpeg',
        'width': 1344,
        'height': 756,
        'num_channels': 3,
    }>,
    'created_at': datetime.datetime(2025, 3, 11, 3, 37, 25, 26000),
    'last_modified_at': datetime.datetime(2025, 3, 11, 3, 52, 58, 385000),
    'scene_id': 'uav0000086_00000_v',
    'language': 'A drone flies over a large crowd of people at a sporting complex where people are playing basketball.',
    'frame_number': 1,
    'scene_type': <Classification: {
        'id': '67cfafdde0c485e4833bbf2c',
        'tags': [],
        'label': 'sporting event',
        'confidence': None,
        'logits': None,
    }>,
    'time_of_day': <Classification: {
        'id': '67c

As mentioned earlier, each scene in this dataset is a sequence of frames, so the scenes could be parsed as videos. However, converting frame sequences to MP4 videos is inefficient because:

1. The conversion process is time-consuming

2. High-resolution videos consume excessive storage space

3. Machine learning tasks typically process individual frames anyway, making video conversion unnecessary

Instead, you can use `group_by()` to create a view that groups the data by scene, ordered by frame number/timestamp. When you load a dynamic grouped view in the App, you'll have the same experience as video datasets:

- You can hover over tiles in the grid to animate scenes' frame data

- When you click on a tile, you'll have familiar video player controls in the modal to navigate the scene

In [46]:
from fiftyone import ViewField as F

view = dataset.group_by(
    "scene_id",
    order_by="frame_number"
)

# Save the view for easy loading in the App 
dataset.save_view("scenes", view)

You can now view the scenes in the App:

```python
fo.launch_app(dataset)
```

<img src="assets/visdrone-explore.gif" width="80%">


#### Adding trajectories for objects

Computing trajectories in object detection datasets helps analyze motion patterns and object behaviors over time. This information is valuable for real-world applications like traffic analysis, urban planning, and security systems, while also helping validate dataset quality by detecting annotation inconsistencies and temporal gaps.

In the code below, the `index` field of our detections captures the trajectory identity. The `count_values()` aggregation counts how many detections in the view carry each trajectory index:

In [59]:
traj_counts = view.count_values('detections.detections.index')
num_trajs = len(traj_counts)
traj_lens = traj_counts.values()
print(f'There are {num_trajs} trajectories with min/max lengths of {min(traj_lens)}/{max(traj_lens)}')
print(traj_counts)

There are 126 trajectories with min/max lengths of 1/7
{60: 2, 58: 3, 68: 2, 134: 1, 138: 1, 63: 1, 23: 5, 110: 1, 153: 1, 84: 1, 109: 1, 65: 2, 147: 1, 28: 3, 44: 3, 170: 1, 47: 2, 24: 4, 59: 2, 46: 3, 29: 3, 34: 3, 57: 3, 116: 2, 156: 1, 7: 4, 21: 3, 1: 7, 61: 2, 62: 2, 66: 2, 5: 5, 3: 6, 18: 3, 49: 3, 150: 1, 168: 1, 30: 5, 0: 5, 115: 1, 19: 1, 40: 3, 6: 4, 32: 3, 77: 2, 51: 4, 103: 1, 113: 2, 31: 3, 99: 2, 82: 1, 8: 6, 37: 2, 108: 1, 13: 3, 25: 6, 69: 3, 112: 1, 160: 1, 100: 1, 169: 1, 48: 3, 74: 1, 50: 3, 38: 3, 157: 1, 22: 4, 26: 3, 17: 4, 36: 1, 140: 1, 98: 1, 54: 1, 81: 2, 139: 1, 111: 2, 114: 1, 101: 2, 35: 1, 154: 1, 159: 1, 76: 3, 79: 1, 67: 2, 2: 6, 11: 5, 20: 1, 127: 1, 149: 1, 85: 1, 41: 4, 164: 1, 55: 3, 9: 6, 15: 3, 16: 4, 151: 2, 78: 2, 107: 1, 73: 3, 182: 1, 148: 1, 75: 2, 43: 2, 27: 3, 12: 4, 152: 1, 70: 2, 4: 5, 33: 4, 39: 4, 52: 2, 45: 2, 102: 1, 86: 2, 72: 2, 155: 1, 53: 3, 80: 1, 97: 1, 56: 1, 14: 3, 64: 2, 71: 2, 42: 2, 10: 3}
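One caveat when reading these counts: if target IDs are only unique within a sequence (an assumption worth verifying against your data), the same index in two different scenes refers to two different objects. A minimal sketch of counting trajectory lengths per scene is to key the counts by `(scene_id, target_id)`. The observations below are made up, and `trajectory_lengths` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical (scene_id, frame_number, target_id) observations
observations = [
    ("scene_a", 1, 0), ("scene_a", 2, 0), ("scene_a", 3, 0),
    ("scene_a", 1, 1), ("scene_a", 2, 1),
    ("scene_b", 1, 0),
]


def trajectory_lengths(observations):
    """Count frames per trajectory, keyed by (scene_id, target_id)."""
    return Counter((scene, target) for scene, _frame, target in observations)


lengths = trajectory_lengths(observations)
print(lengths)
```

Here target `0` appears in both scenes but is counted as two separate trajectories. With the real dataset, the observations could presumably be assembled from `dataset.values("scene_id")` and `dataset.values("detections.detections.index")`.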


The following code tracks object movement and calculates speeds in an object detection dataset. Here's the breakdown:

1. Centroid Calculation

- Takes a bounding box [x, y, width, height]
- Returns center point by adding half width/height to top-left corner

2. Main Processing Loop

- For each frame/sample:
  - Converts bounding boxes to keypoints at their centroids
  - Keeps track of each object's last position using its ID/index
  - Calculates speed (in pixels per frame) for objects seen in consecutive frames:
    * Takes the difference between the current and last centroid positions
    * Converts the normalized differences to pixel distances using the image dimensions
    * Uses the hypotenuse (`np.hypot`) to get the total displacement

Note that a speed can only be computed when an object's previous position is available, so the frames of each sequence must be processed in order and the last-seen positions must persist from one frame to the next.
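To see the bookkeeping without FiftyOne in the way, here is a small standard-library sketch of the same idea. The `speeds_per_frame` helper, the frame data, and the 1024x512 image size are all hypothetical, chosen so the arithmetic is easy to check by hand.

```python
import math


def centroid(bb):
    """Center of a relative [x, y, width, height] box."""
    return bb[0] + bb[2] / 2.0, bb[1] + bb[3] / 2.0


def speeds_per_frame(frames, img_width, img_height):
    """Per-frame speeds in pixels per frame.

    frames: list of {target_id: [x, y, w, h]} dicts in temporal order.
    """
    last_seen = {}  # lives outside the frame loop so it persists across frames
    result = []
    for boxes in frames:
        current = {tid: centroid(bb) for tid, bb in boxes.items()}
        frame_speeds = {}
        for tid, (cx, cy) in current.items():
            if tid in last_seen:
                px, py = last_seen[tid]
                dx_px = img_width * (cx - px)
                dy_px = img_height * (cy - py)
                frame_speeds[tid] = math.hypot(dx_px, dy_px)
            else:
                frame_speeds[tid] = None  # first sighting, no speed yet
        result.append(frame_speeds)
        last_seen = current
    return result


# Hypothetical two-frame sequence: object 7 moves, object 8 appears in frame 2
frames = [
    {7: [0.25, 0.25, 0.5, 0.5]},
    {7: [0.25 + 3 / 1024, 0.25 + 1 / 128, 0.5, 0.5], 8: [0.0, 0.0, 0.25, 0.25]},
]
speeds = speeds_per_frame(frames, img_width=1024, img_height=512)
print(speeds)
```

Object 7 moves 3 pixels horizontally and 4 pixels vertically between the two frames, so its speed in the second frame is 5 pixels per frame, while object 8 has no speed because it was not seen before.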

In [120]:
import numpy as np

def centroid(bb):
    """Computes the centroid coordinates of a FiftyOne-style bounding box (in normalized image coordinates)
    Args:
        bb (list): Bounding box coordinates in normalized format [x, y, width, height]
            where x,y is the top-left corner.

    Returns:
        tuple: (x,y) coordinates of the centroid in normalized coordinates
    """
    # Compute the centroid of a bounding box
    x = bb[0] + bb[2]/2.0
    y = bb[1] + bb[3]/2.0
    return x, y

# Last known centroid of each object, kept outside the sample loop
# so that it persists from one frame to the next
last_seen = {}

for samp in dataset.iter_samples(autosave=True):
    imw = samp.metadata.width
    imh = samp.metadata.height
    
    # Get detections directly from the sample dictionary
    sample_dict = samp.to_dict()
    detections = sample_dict['detections']['detections']
    
    # Create keypoints from detections
    kps = [fo.Keypoint(
        points=[centroid(x['bounding_box'])],
        label=x['label'],
        index=x['index']
    ) for x in detections]
    
    # Calculate speeds
    for kp in kps:
        if samp.frame_number > 1 and kp.index in last_seen:
            pts0 = last_seen[kp.index]
            pts1 = kp.points[0]
            dx_px = imw * (pts1[0]-pts0[0])
            dy_px = imh * (pts1[1]-pts0[1])
            vel = np.hypot(dx_px, dy_px)
            kp['speed'] = vel
        else:
            kp['speed'] = None
            
    last_seen = {x.index: x.points[0] for x in kps}
    samp['keypoints'] = fo.Keypoints(keypoints=kps)
    samp.save()