<a href="https://colab.research.google.com/github/qubvel/transformers-notebooks/blob/main/notebooks/DFine_tracking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tracking Demo with D-FINE and DeepSORT


This notebook demonstrates how to detect and track objects in a video (motorcycles in this example) using:
1. A pre-trained object detection model ([D-FINE](https://huggingface.co/ustc-community))
2. The DeepSORT algorithm for object tracking
3. The Supervision library for visualization

The workflow:
- Load a video file
- Process each frame to detect objects of the chosen class
- Track the detected objects across frames
- Visualize the results with boxes, labels, and motion traces
- Save the processed video

In [None]:
!pip install -U -q pip uv
!uv pip install -U -q git+https://github.com/huggingface/transformers  # install the latest HF Transformers
!uv pip install -U -q git+https://github.com/roboflow/trackers         # install the latest Trackers by Roboflow
!uv pip install -U -q supervision numpy==1.*

# NOTE: you might need to restart your env to apply the changes (e.g. in case of `numpy` error)

### Import necessary libraries

In [1]:
import torch
import supervision as sv

from trackers import DeepSORTFeatureExtractor, DeepSORTTracker
from transformers import AutoModelForObjectDetection, AutoImageProcessor

### Defining constants

In [3]:
# Set up device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Create a color palette for visualization
# These hex color codes define different colors for tracking different objects
color = sv.ColorPalette.from_hex([
    "#ffff00", "#ff9b00", "#ff8080", "#ff66b2", "#ff66ff", "#b266ff",
    "#9999ff", "#3399ff", "#66ffff", "#33ff99", "#66ff66", "#99ff00"
])

# Set the color lookup mode to assign colors by track ID
# This means objects with the same track ID are annotated with the same color
color_lookup = sv.ColorLookup.TRACK

Using device: cuda
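
To make the track-ID color lookup concrete, here is a minimal plain-Python sketch. It is not Supervision's implementation: `pick_color` is a hypothetical helper, and the modulo wrap-around for IDs beyond the palette length is an assumption about how such a lookup typically behaves.

```python
# Hypothetical illustration: map a track ID to a color from a fixed palette.
PALETTE = [
    "#ffff00", "#ff9b00", "#ff8080", "#ff66b2", "#ff66ff", "#b266ff",
    "#9999ff", "#3399ff", "#66ffff", "#33ff99", "#66ff66", "#99ff00",
]


def pick_color(track_id: int, palette: list[str] = PALETTE) -> str:
    """Return the palette entry for a track ID, wrapping around (assumed behavior)."""
    if track_id < 0:
        raise ValueError("track_id must be non-negative")
    return palette[track_id % len(palette)]


# IDs 0 and 12 share a color because the palette has 12 entries.
for track_id in (0, 1, 12, 13):
    print(track_id, pick_color(track_id))
```

With only 12 colors, two simultaneously visible tracks can share a color once enough IDs have been issued; a longer palette reduces such collisions.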


### Download example data

In [4]:
!wget -q https://storage.googleapis.com/com-roboflow-marketing/supervision/video-examples/bikes-1280x720-1.mp4
!wget -q https://storage.googleapis.com/com-roboflow-marketing/supervision/video-examples/bikes-1280x720-2.mp4

In [7]:
# Define input and output video paths
source_video_path = "/content/bikes-1280x720-1.mp4"
save_video_path = "/content/bikes-1280x720-1-dfine-deepsort.mp4"

# Extract video information (width, height, fps) from the source
video_info = sv.VideoInfo.from_video_path(source_video_path)
print(video_info)

VideoInfo(width=1280, height=720, fps=59, total_frames=388)
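
As a quick sanity check on these numbers, the clip duration and the time between frames follow directly from `fps` and `total_frames`. The sketch below hard-codes the values printed above rather than reading them from `video_info`.

```python
# Values copied from the VideoInfo output above.
fps = 59
total_frames = 388

duration_s = total_frames / fps      # clip length in seconds
frame_interval_ms = 1000 / fps       # time between consecutive frames

print(f"duration: {duration_s:.2f} s, frame interval: {frame_interval_ms:.1f} ms")
```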


### STEP 1: Set up Object Detection Model

In [9]:
# D-FINE model trained on the Objects365 dataset
checkpoint = "ustc-community/dfine_l_obj365"
print(f"Loading object detection model: {checkpoint}")

image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForObjectDetection.from_pretrained(checkpoint).to(device)

# Lowercase the class names so they can be looked up case-insensitively
label2id = {k.lower(): v for k, v in model.config.label2id.items()}

Loading object detection model: ustc-community/dfine_l_obj365
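
The lowercased `label2id` mapping makes it easy to resolve class names to IDs without worrying about capitalization. Here is a minimal sketch of that lookup; the mapping and its IDs are made up for illustration (the real ones come from `model.config`), and `class_ids` is a hypothetical helper.

```python
# Made-up subset of a label-to-ID mapping; real IDs come from the model config.
raw_label2id = {"Person": 0, "Bicycle": 1, "Motorcycle": 2}

# Same normalization as in the notebook: lowercase every class name.
label2id = {name.lower(): idx for name, idx in raw_label2id.items()}


def class_ids(names, mapping):
    """Resolve class names to IDs case-insensitively; fail loudly on unknown names."""
    missing = [name for name in names if name.lower() not in mapping]
    if missing:
        raise KeyError(f"Unknown class names: {missing}")
    return [mapping[name.lower()] for name in names]


print(class_ids(["Person", "MOTORCYCLE"], label2id))
```

Raising on unknown names catches typos early, before any frame has been processed.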


### STEP 2: Set up Tracking Model

In [11]:
# Initialize the DeepSORT feature extractor with a MobileNetV4 backbone.
# This checkpoint is pretrained on ImageNet-1k, not for re-identification (ReID),
# so a ReID-specific model will likely give better appearance features.
feature_extractor = DeepSORTFeatureExtractor.from_timm("mobilenetv4_conv_small.e1200_r224_in1k")
tracker = DeepSORTTracker(feature_extractor, frame_rate=video_info.fps)
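
The feature extractor turns each detected crop into an embedding vector, and the tracker compares those embeddings to decide whether a new detection belongs to an existing track. The sketch below illustrates the comparison, assuming cosine distance as the appearance metric (the one described in the original DeepSORT paper). It is not the Trackers library's code, and the toy embeddings are made up.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 for identical direction, larger for less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


# Made-up 3-dimensional embeddings; real embeddings are much longer.
track_embedding = [0.9, 0.1, 0.4]
same_object = [0.8, 0.2, 0.5]
other_object = [0.1, 0.9, 0.2]

# The detection with the smaller distance is the better appearance match.
print(cosine_distance(track_embedding, same_object))
print(cosine_distance(track_embedding, other_object))
```

This is why the choice of backbone matters: embeddings from a ReID-trained model should separate different objects more clearly than generic ImageNet features.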

### STEP 3: Set up Visualization Tools

In [12]:
# Box annotator draws rectangles around detected objects
box_annotator = sv.BoxAnnotator(color, color_lookup=color_lookup)

# Trace annotator draws the path that objects have taken
trace_annotator = sv.TraceAnnotator(color, color_lookup=color_lookup, thickness=1, trace_length=100)

# Label annotator adds text labels to the detections: track id and class name
label_annotator = sv.LabelAnnotator(color, color_lookup=color_lookup, text_color=sv.Color.BLACK, text_scale=0.8)

### STEP 4: Set up a function to process one frame

In [33]:
def process_frame(frame, index):
    """
    Process a single video frame: detect objects, track them, and annotate the frame.

    Args:
        frame: The current video frame (numpy array)
        index: The frame number in the sequence

    Returns:
        Annotated frame with detection boxes, labels, and traces
    """

    inputs = image_processor(images=frame, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)

    # Convert raw model outputs to bounding boxes, labels, and scores
    h, w, _ = frame.shape
    detections = image_processor.post_process_object_detection(outputs, target_sizes=[(h, w)], threshold=0.4)
    detections = detections[0]  # Get first image results (we're processing one frame at a time)

    # [Optional] Filter predictions by class name; here we keep only `motorcycle`
    keep = detections["labels"] == label2id["motorcycle"]
    detections = {k: v[keep] for k, v in detections.items()}

    # Convert detections to Supervision format and update the tracker with new detections
    detections = sv.Detections.from_transformers(detections, id2label=model.config.id2label)
    detections = tracker.update(detections, frame=frame)

    # Create labels for each detection
    labels = [
        f"#{tracker_id} {model.config.id2label[class_id]}"
        for class_id, tracker_id
        in zip(detections.class_id, detections.tracker_id)
    ]

    frame = box_annotator.annotate(scene=frame, detections=detections)
    frame = trace_annotator.annotate(scene=frame, detections=detections)
    frame = label_annotator.annotate(scene=frame, detections=detections, labels=labels)

    return frame
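
The class filter inside `process_frame` builds one boolean mask from the labels and applies it to every field, so scores, labels, and boxes stay aligned. The sketch below shows the same idea with plain lists and extends it to several classes at once. `filter_by_labels` is a hypothetical helper and the detections are made up; in the notebook the values are tensors, and boolean indexing does the filtering.

```python
def filter_by_labels(detections, allowed_ids):
    """Keep only entries whose label is in `allowed_ids`, across all fields."""
    keep = [label in allowed_ids for label in detections["labels"]]
    return {
        key: [value for value, flag in zip(values, keep) if flag]
        for key, values in detections.items()
    }


# Made-up detections: IDs 0 and 2 stand in for two classes of interest.
detections = {
    "scores": [0.91, 0.55, 0.78],
    "labels": [2, 7, 0],
    "boxes": [[10, 20, 50, 80], [0, 0, 5, 5], [30, 40, 90, 120]],
}

filtered = filter_by_labels(detections, allowed_ids={0, 2})
print(filtered["labels"])
print(filtered["scores"])
```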

In [34]:
sv.process_video(
    source_path=source_video_path,
    target_path=save_video_path,
    callback=process_frame,  # Apply our processing function to each frame
    show_progress=True,      # Display a progress bar
)
print("Video processing complete!")

Processing video:   0%|          | 0/388 [00:00<?, ?it/s]

Video processing complete!


### View the result!

In [37]:
# Re-encode the video with the H.264 codec so it can be played in the browser
converted_video_path = save_video_path.replace(".mp4", "-h264.mp4")
!ffmpeg -y -loglevel error -i {save_video_path} -vcodec libx264 -acodec aac {converted_video_path}

In [38]:
from IPython.display import Video
Video(converted_video_path, embed=True, width=600)
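
Outside a notebook, where the `!ffmpeg` shell syntax is not available, the same re-encoding step can be run from Python. The sketch below only builds and prints the command. `build_ffmpeg_command` is a hypothetical helper, the file names are placeholders, and running the command requires `ffmpeg` to be installed and on `PATH`.

```python
import shlex


def build_ffmpeg_command(source: str, target: str) -> list[str]:
    """Build the same re-encode command as the notebook cell, as an argument list."""
    return [
        "ffmpeg", "-y", "-loglevel", "error",
        "-i", source,
        "-vcodec", "libx264", "-acodec", "aac",
        target,
    ]


# Placeholder file names for illustration.
command = build_ffmpeg_command("input.mp4", "input-h264.mp4")
print(shlex.join(command))

# To actually run it (requires ffmpeg on PATH):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.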