<a href="https://colab.research.google.com/github/deepmind/perception_test/blob/main/baselines/single_object_tracking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates how to load the object tracking annotations in the validation split of the Perception Test and run the evaluation for a simple baseline model which predicts static boxes for all tracks.

Copyright 2023 DeepMind Technologies Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at [https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0).
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


# Single Object Tracking Static Baseline

GitHub: https://github.com/deepmind/perception_test

## The Perception Test
[Perception Test: A Diagnostic Benchmark for Multimodal Video Models](https://arxiv.org/abs/2305.13786) is a multimodal benchmark designed to comprehensively evaluate the perception and reasoning skills of multimodal video models. The Perception Test dataset introduces real-world videos designed to show perceptually interesting situations and defines multiple computational tasks (object and point tracking, action and sound localisation, multiple-choice and grounded video question-answering). Here, we provide details and a simple baseline for the single object tracking task.


## Single Object Tracking
In the single object tracking task, the model receives a video and a bounding box representing an object, and it is required to track the object throughout the video sequence.

The image below shows examples of object tracking annotations. The original videos run at around 30fps, but the annotations are collected at 1fps, so only the annotated frames are shown here.

![image collage](https://storage.googleapis.com/dm-perception-test/img/collage.png)
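
To make the gap between the video frame rate and the annotation rate concrete, here is a minimal sketch of which frame indices a 1fps annotation pass would roughly cover. The helper `approx_annotated_frame_ids` is hypothetical and assumes evenly spaced sampling; in the dataset itself, the annotated frames are whatever each track lists under `frame_ids`, which need not be evenly spaced.

```python
from typing import List


def approx_annotated_frame_ids(num_frames: int, video_fps: float,
                               annot_fps: float = 1.0) -> List[int]:
  """Returns frame indices sampled at roughly `annot_fps`.

  Hypothetical helper for illustration only: it assumes evenly spaced
  annotations, which the real `frame_ids` lists may not follow.
  """
  step = max(1, int(round(video_fps / annot_fps)))
  return list(range(0, num_frames, step))


# A 3-second clip at 30fps has 90 frames, of which about 3 are annotated.
print(approx_annotated_frame_ids(num_frames=90, video_fps=30.0))
# prints [0, 30, 60]
```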


## Static baseline
This notebook demonstrates how to load the challenge subset of the annotations in the validation split of the Perception Test, and run the evaluation for a dummy baseline model. This model assumes static boxes for all object tracks in a video: the initial box of each track is repeated for every later frame.
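
As a rough illustration of the static assumption, the sketch below repeats a starting box for every frame from the start frame to the end of the video. The function name `static_track` and the toy box values are made up for this example; the notebook's own tracker class reaches the same result frame by frame.

```python
import numpy as np


def static_track(start_box, start_frame: int, num_frames: int):
  """Repeats `start_box` for every frame from `start_frame` onwards.

  Hypothetical helper: boxes are assumed to be [x1, y1, x2, y2] with
  coordinates normalised to [0, 1].
  """
  frame_ids = np.arange(start_frame, num_frames)
  boxes = np.tile(np.asarray(start_box, dtype=float), (len(frame_ids), 1))
  return boxes, frame_ids


boxes, frame_ids = static_track([0.1, 0.2, 0.4, 0.6],
                                start_frame=2, num_frames=5)
print(boxes.shape, frame_ids.tolist())
# prints (3, 4) [2, 3, 4]
```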

[![Perception Test Overview Presentation](https://img.youtube.com/vi/8BiajMOBWdk/maxresdefault.jpg)](https://youtu.be/8BiajMOBWdk?t=10)


In [None]:
# VOT toolkit required for evaluation
!pip install git+https://github.com/votchallenge/vot-toolkit-python

Collecting git+https://github.com/votchallenge/vot-toolkit-python
  Cloning https://github.com/votchallenge/vot-toolkit-python to /tmp/pip-req-build-d333lnr6
  Running command git clone --filter=blob:none --quiet https://github.com/votchallenge/vot-toolkit-python /tmp/pip-req-build-d333lnr6
  Resolved https://github.com/votchallenge/vot-toolkit-python to commit 41bd50230d270f52555c484c25def9ce3f650244
  Preparing metadata (setup.py) ... done


In [None]:
# @title Prerequisites
import colorsys
import json
import os
import random
from typing import Tuple, List, Dict, Any
import zipfile

import cv2
from google.colab.patches import cv2_imshow
import imageio
import moviepy.editor as mvp
import numpy as np
import requests
import vot
from vot.region import calculate_overlaps as calculate_region_overlaps
from vot.region import Polygon, Rectangle, Special

In [None]:
# @title Utility functions
def download_and_unzip(url: str, destination: str):
  """Downloads and unzips a .zip file to a destination.

  Downloads a file from the specified URL, saves it to the destination
  directory, and then extracts its contents.

  If the file is larger than 1GB, it will be downloaded in chunks,
  and the download progress will be displayed.

  Args:
    url (str): The URL of the file to download.
    destination (str): The destination directory to save the file and
      extract its contents.
  """
  if not os.path.exists(destination):
    os.makedirs(destination)

  filename = url.split('/')[-1]
  file_path = os.path.join(destination, filename)

  if os.path.exists(file_path):
    print(f'{filename} already exists. Skipping download.')
    return

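  response = requests.get(url, stream=True)
  response.raise_for_status()  # fail early instead of saving an error page
  total_size = int(response.headers.get('content-length', 0))
  gb = 1024*1024*1024

  if total_size / gb > 1:
    print(f'{filename} is larger than 1GB, downloading in chunks')
    chunk_flag = True
    chunk_size = int(total_size/100)
  else:
    chunk_flag = False
    # fall back to 1MB chunks if the server did not report a content length
    chunk_size = total_size or 1024*1024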

  with open(file_path, 'wb') as file:
    for chunk_idx, chunk in enumerate(
        response.iter_content(chunk_size=chunk_size)):
      if chunk:
        if chunk_flag:
          print(f"""{chunk_idx}% downloading
          {round((chunk_idx*chunk_size)/gb, 1)}GB
          / {round(total_size/gb, 1)}GB""")
        file.write(chunk)
  print(f"'{filename}' downloaded successfully.")

  with zipfile.ZipFile(file_path, 'r') as zip_ref:
    zip_ref.extractall(destination)
  print(f"'{filename}' extracted successfully.")

  os.remove(file_path)


def load_db_json(db_file: str) -> Dict[str, Any]:
  """Loads a JSON file as a dictionary.

  Args:
    db_file (str): Path to the JSON file.

  Returns:
    Dict: Loaded JSON data as a dictionary.

  Raises:
    FileNotFoundError: If the specified file doesn't exist.
    TypeError: If the JSON file is not formatted as a dictionary.
  """
  if not os.path.isfile(db_file):
    raise FileNotFoundError(f'No such file: {db_file}')

  with open(db_file, 'r') as f:
    db_file_dict = json.load(f)
    if not isinstance(db_file_dict, dict):
      raise TypeError('JSON file is not formatted as a dictionary.')
    return db_file_dict


def load_mp4_to_frames(filename: str) -> np.array:
  """Loads an MP4 video file and returns its frames as a NumPy array.

  Args:
    filename (str): Path to the MP4 video file.

  Returns:
    np.array: Frames of the video as a NumPy array.
  """
  assert os.path.exists(filename), f'File {filename} does not exist.'
  cap = cv2.VideoCapture(filename)

  num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
  height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
  width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))

  vid_frames = np.empty((num_frames, height, width, 3), dtype=np.uint8)

  idx = 0
  while True:
    ret, frame = cap.read()
    if not ret:
      break

    vid_frames[idx] = frame
    idx += 1

  cap.release()
  return vid_frames


def get_video_frames(data_item: Dict, video_path: str) -> np.array:
  """Loads frames of a video specified by an item dictionary.

  Assumes format of annotations used in the Perception Test Dataset.

  Args:
    data_item (Dict): Item from dataset containing metadata.
    video_path (str): Path to the directory containing videos.

  Returns:
    np.array: Frames of the video as a NumPy array.
  """
  video_file_path = os.path.join(video_path,
                                 data_item['metadata']['video_id']) + '.mp4'
  vid_frames = load_mp4_to_frames(video_file_path)
  assert data_item['metadata']['num_frames'] == vid_frames.shape[0]
  return vid_frames


def get_start_frame(track_arr: List[List[float]]) -> int:
  """Returns index of the first non-zero element in a track array.

  Args:
    track_arr (list): one-hot vector corresponding to annotations,
      showing which index to start tracking from.

  Returns:
    int: Index of the first non-zero element in the track array.

  Raises:
    ValueError: Raises error if the length of the array is 0
      or if there is no one-hot value.
  """
  if not track_arr or np.count_nonzero(track_arr) == 0:
    raise ValueError('Track is empty or has no non-zero elements')
  return np.nonzero(track_arr)[0][0]


def get_start_info(track: Dict[str, Any]) -> Dict[str, Any]:
  """Retrieve information about the start frame of a track.

  Args:
    track (Dict): A dictionary containing information about the track.

  Returns:
    Dict[str, Any]: A dictionary with the following keys:
      'start_id': The frame ID of the start frame.
      'start_bounding_box': The bounding box coordinates of the start
        frame.
      'start_idx': The index of the start frame in the
        'bounding_boxes' list.
  """
  track_start_idx = get_start_frame(track['initial_tracking_box'])
  track_start_id = track['frame_ids'][track_start_idx]
  track_start_bb = track['bounding_boxes'][track_start_idx]

  return {'start_id': track_start_id,
          'start_bounding_box': track_start_bb,
          'start_idx': track_start_idx}
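
The start-frame lookup above can be illustrated on a toy track. The sketch below uses the same field names as the annotations (`initial_tracking_box`, `frame_ids`, `bounding_boxes`), but every value in `toy_track` is invented for the example and does not come from the dataset.

```python
import numpy as np

# Toy track with four annotated frames. The one-hot vector marks the
# second annotated frame as the one where tracking should start.
toy_track = {
    'id': 0,
    'frame_ids': [0, 30, 60, 90],
    'initial_tracking_box': [0, 1, 0, 0],
    'bounding_boxes': [[0.10, 0.10, 0.30, 0.30],
                       [0.12, 0.10, 0.32, 0.30],
                       [0.15, 0.11, 0.35, 0.31],
                       [0.20, 0.12, 0.40, 0.32]],
}

# Index of the first non-zero entry in the one-hot vector.
start_idx = int(np.nonzero(toy_track['initial_tracking_box'])[0][0])
start_frame_id = toy_track['frame_ids'][start_idx]
start_box = toy_track['bounding_boxes'][start_idx]
print(start_idx, start_frame_id, start_box)
# prints 1 30 [0.12, 0.1, 0.32, 0.3]
```

Note that `start_idx` indexes into the list of annotated frames, while `start_frame_id` is the index of that frame in the full video.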

In [None]:
# @title Download data
data_path = './data/'

# This is the Eval.ai challenge subset of object tracking annotations
challenge_valid_annot_url = 'https://storage.googleapis.com/dm-perception-test/zip_data/sot_valid_annotations_challenge2023.zip'
download_and_unzip(challenge_valid_annot_url, data_path)

# Validation videos are not downloaded because they are too big (approx 70GB).
# They are not needed for this baseline: since we assume static boxes,
# performance can be calculated from the annotations alone.
# valid_videos_url =
# 'https://storage.googleapis.com/dm-perception-test/zip_data/sot_valid_videos_challenge2023.zip'
# download_and_unzip(valid_videos_url, data_path)

'sample_annotations.zip' downloaded successfully.
'sample_annotations.zip' extracted successfully.
'sample_videos.zip' downloaded successfully.
'sample_videos.zip' extracted successfully.
'sot_valid_annotations_challenge2023.zip' downloaded successfully.
'sot_valid_annotations_challenge2023.zip' extracted successfully.


In [None]:
# @title Dataset class
class PerceptionDataset():
  """Dataset class to store video items from dataset.

  Attributes:
    video_folder_path: Path to the folder containing the videos.
    task: Task type for annotations.
    split: Dataset split to load.
    task_db: List containing annotations for dataset according to
      split and task availability.
  """

  def __init__(self, db_path: str, video_folder_path: str,
               task: str, split: str) -> None:
    """Initializes the PerceptionDataset class.

    Args:
      db_path (str): Path to the annotation file.
      video_folder_path (str): Path to the folder containing the videos.
      task (str): Task type for annotations.
      split (str): Dataset split to load.
    """
    self.video_folder_path = video_folder_path
    self.task = task
    self.split = split
    self.task_db = self.load_dataset(db_path)

  def load_dataset(self, db_path: str) -> List:
    """Loads the dataset from the annotation file and processes.

    Dict is processed according to split and task.

    Args:
      db_path (str): Path to the annotation file.

    Returns:
      List: List of database items containing annotations.
    """
    db_dict = load_db_json(db_path)
    db_list = []
    for _, val in db_dict.items():
      if val['metadata']['split'] == self.split:
        if val[self.task]:  # If video has annotations for this task
          db_list.append(val)

    return db_list

  def __len__(self) -> int:
    """Returns the total number of videos in the dataset.

    Returns:
      int: Total number of videos.
    """
    return len(self.task_db)

  def __getitem__(self, idx: int) -> Dict[str, Any]:
    """Returns the video and annotations for a given index.

    Args:
      idx (int): Index of the video.

    Returns:
      Dict: Dictionary containing the video frames, metadata, annotations.
    """
    data_item = self.task_db[idx]
    annot = data_item[self.task]

    metadata = data_item['metadata']
    # here we are loading a placeholder as the frames
    # the commented out function below will actually load frames
    vid_frames = np.zeros((metadata['num_frames'], 1, 1, 1))
    # vid_frames = get_video_frames(data_item, self.video_folder_path)

    return {'metadata': metadata,
            self.task: annot,
            'frames': vid_frames}

In [None]:
# @title Object tracking model (static baseline)
class ObjectTracker():
  """Object tracker class that tracks a given object in a video.

  This model assumes static boxes, given the first bounding box
  which should be tracked in a sequence, for every frame in the
  remaining sequence it will return the same coordinates.

  """

  def __init__(self):
    """Initializes the ObjectTracker class."""
    pass

  def track_object_in_video(self, frames: np.array, start_info: Dict[str, Any]
                            ) -> Tuple[np.ndarray, np.ndarray]:
    """Tracks an object in a video.

    Tracks an object given a sequence of frames and initial information about
    the coordinates and frame ID of the object.

    Args:
      frames (np.array): Array of frames representing the video.
      start_info (Dict): Dictionary containing the start bounding box and
        frame ID.

    Returns:
      Tuple[np.ndarray, np.ndarray]: The tracked bounding boxes, stacked into
        one array, and the corresponding frame IDs.
    """
    # initially take starting bounding box for tracking
    prev_bb = start_info['start_bounding_box']
    output_bounding_boxes = []
    output_frame_ids = []

    for frame_id in range(start_info['start_id'], frames.shape[0]):
      frame = frames[frame_id]
      # here is where the per frame tracking is done by the model
      # we just return the starting coords in this dummy baseline
      bb = self.track_object_in_frame(frame, prev_bb)
      prev_bb = bb  # a real tracker would condition on the latest box
      output_bounding_boxes.append(bb)
      output_frame_ids.append(frame_id)

    output_bounding_boxes = np.stack(output_bounding_boxes, axis=0)
    output_frame_ids = np.array(output_frame_ids)
    return output_bounding_boxes, output_frame_ids

  # model inference would be inserted here!!
  def track_object_in_frame(self, frame: np.array,
                            prev_bb: List[float]) -> List[float]:
    """Tracks an object in a single frame.

    Tracks an object in a single frame based on the previous bounding box
    coordinates. Placeholder function that just returns the coords it is given,
    assumes a static object.

    Args:
      frame (np.array): The current frame.
      prev_bb (List): Previous bounding box coordinates (x1, y1, x2, y2).

    Returns:
      List: The tracked bounding box coordinates in the current frame.
    """
    del frame  # unused
    return prev_bb

In [None]:
# @title Evaluation functions
# https://github.com/votchallenge/toolkit/blob/master/vot/analysis/supervised.py
def bbox2region(bbox: np.array) -> Rectangle:
  """Convert bbox to Rectangle or Polygon Class object.

  Args:
    bbox (ndarray): the format of rectangle bbox is (x1, y1, w, h);
      the format of polygon is (x1, y1, x2, y2, ...).

  Returns:
    Rectangle or Polygon Class object.

  Raises:
    NotImplementedError: If the bbox has an unexpected number of
      coordinates.
  """

  if len(bbox) == 1:
    return Special(bbox[0])
  elif len(bbox) == 4:
    return Rectangle(bbox[0], bbox[1], bbox[2], bbox[3])
  elif len(bbox) % 2 == 0 and len(bbox) > 4:
    return Polygon([(x_, y_) for x_, y_ in zip(bbox[::2], bbox[1::2])])
  else:
    raise NotImplementedError(
        f'The length of bbox is {len(bbox)}, which is not supported')


def trajectory2region(trajectory: List) -> List:
  """Convert bbox trajectory to Rectangle or Polygon Class object trajectory.

  Args:
    trajectory (list[ndarray]): The outer list contains bbox of
      each frame in a video. The bbox is a ndarray.

  Returns:
    List: contains the Region Class object of each frame in a
      trajectory.
  """
  traj_region = []
  for bbox in trajectory:
    traj_region.append(bbox2region(bbox))
  return traj_region


def calc_accuracy(gt_trajectory: List, pred_trajectory: List) -> float:
  """Calculate accuracy over the sequence.

  Args:
    gt_trajectory (list[list]): list of bboxes
    pred_trajectory (list[ndarray]): The outer list contains the
      tracking results of each frame in one video. The ndarray has two cases:
      - bbox: denotes the normal tracking box in [x1, y1, w, h]
        format.
      - special tracking state: [0] denotes the unknown state,
        namely the skipping frame after failure, [1] denotes the
        initialized state, and [2] denotes the failed state.

  Returns:
    Float: accuracy over the sequence.
  """
  pred_traj_region = trajectory2region(pred_trajectory)
  gt_traj_region = trajectory2region(gt_trajectory)
  overlaps = np.array(calculate_region_overlaps(pred_traj_region,
                                                gt_traj_region))
  mask = np.ones(len(overlaps), dtype=bool)
  return np.mean(overlaps[mask]) if any(mask) else 0.


def filter_pred_boxes(pred_bb: np.ndarray, pred_fid: np.ndarray,
                      gt_fid: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
  """Filter bounding boxes and frame IDs based on ground truth frame IDs.

  Args:
    pred_bb (np.ndarray): Array of predicted bounding boxes.
    pred_fid (np.ndarray): Array of frame IDs for predicted bounding boxes.
    gt_fid (np.ndarray): Array of frame IDs for ground truth bounding boxes.

  Returns:
    Tuple[np.ndarray, np.ndarray]: Filtered predicted bounding boxes and
      their corresponding frame IDs.
  """
  pred_idx = np.isin(pred_fid, gt_fid).nonzero()[0]
  filter_pred_bb = pred_bb[pred_idx]
  filter_pred_fid = pred_fid[pred_idx]
  return filter_pred_bb, filter_pred_fid


def run_iou(boxes: Dict[str, Any], db_dict: Dict[str, Any],
            filter=True) -> Dict[str, Any]:
  """Calculate IoU per track and per video.

  Calculate Intersection over Union (IoU) for predicted and ground truth
  bounding boxes for all tracks in the provided outputs.

  Args:
    boxes (Dict): Dict mapping each video ID to its list of predicted
      tracks. Boxes must be in format [x1,y1,x2,y2].
    db_dict (Dict): Dict containing annotations.
    filter (bool): If True, keep only predictions whose frame IDs have
      ground truth annotations before computing IoU.

  Returns:
    Dict: A dictionary with video IDs as keys and, as values, a dictionary
      mapping each track ID to its IoU score.
  """
  all_vot_iou = {}
  for vid_id, pred_tracks in boxes.items():
    gt_tracks = db_dict[vid_id]['object_tracking']

    video_iou = {}
    for pred_track in pred_tracks:
      gt_track = gt_tracks[pred_track['id']]
      # check track IDs
      assert pred_track['id'] == gt_track['id']

      start_info = get_start_info(gt_track)
      start_idx = start_info['start_idx']
      # get bounding boxes from frame ID where tracking is supposed to start +1
      gt_bb = np.array(gt_track['bounding_boxes'])[start_idx+1:]
      gt_fid = gt_track['frame_ids'][start_idx+1:]
      # weird case where only one box is labelled
      if not gt_fid:
        continue

      pred_bb = np.array(pred_track['bounding_boxes'])
      pred_fid = np.array(pred_track['frame_ids'])
      # filter predicted trajectory for frame IDs where we have annotations
      if filter:
        pred_bb, pred_fid = filter_pred_boxes(pred_bb, pred_fid, gt_fid)

      # check for missing frame IDs in prediction
      missing_idx = np.where(np.isin(gt_fid, pred_fid, invert=True))[0]
      if missing_idx.size != 0:
        raise ValueError(f'Missing IDs from object trajectory: {missing_idx}')
      if len(gt_bb) != len(pred_bb):
        raise ValueError('Missing boxes in predictions')

      # convert normalised x1,y1,x2,y2 in [0,1] to x1,y1,w,h in pixel space
      [height, width] = db_dict[vid_id]['metadata']['resolution']
      pred_w = pred_bb[:, 2] - pred_bb[:, 0]
      pred_h = pred_bb[:, 3] - pred_bb[:, 1]
      pred_bb = np.stack([pred_bb[:, 0]*width, pred_bb[:, 1]*height,
                          pred_w*width, pred_h*height], axis=1)
      gt_w = gt_bb[:, 2] - gt_bb[:, 0]
      gt_h = gt_bb[:, 3] - gt_bb[:, 1]
      gt_bb = np.stack([gt_bb[:, 0]*width, gt_bb[:, 1]*height,
                        gt_w*width, gt_h*height], axis=1)

      # compute IoU per track
      iou = calc_accuracy(gt_bb, pred_bb)
      video_iou[pred_track['id']] = iou

    all_vot_iou[vid_id] = video_iou

  return all_vot_iou


def summarise_results(labels: Dict[str, Any], results: Dict[str, Any]):
  """Summarise the results according to camera movement.

  Summarise the results of a dataset by calculating average IoU scores
  across all videos, videos with a static camera and videos with a moving
  camera.

  Args:
    labels (Dict): A dictionary containing metadata and
      information about the dataset.
    results (Dict): A dictionary containing IoU scores
      for each video in the dataset.
  """
  all_ious = []
  # aggregate performance based on camera motion for analysis
  static_ious = []
  moving_ious = []

  for vid, iou_dict in results.items():
    ious = list(iou_dict.values())
    if not ious:
      continue

    all_ious.append(np.mean(ious))

    if labels[vid]['metadata']['is_camera_moving']:
      moving_ious.append(np.mean(ious))
    else:
      static_ious.append(np.mean(ious))

  if all_ious:
    print(f"""Average IoU across all videos in dataset:
          {np.array(all_ious).mean():.3f}""")

  if static_ious:
    print(f"""Average IoU across static camera videos in dataset:
          {np.array(static_ious).mean():.3f}""")

  if moving_ious:
    print(f"""Average IoU across moving camera videos in dataset:
          {np.array(moving_ious).mean():.3f}""")
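
The coordinate conversion and overlap computation used in `run_iou` can be checked by hand on a single pair of boxes. The sketch below is a plain NumPy approximation written for illustration: `to_pixel_xywh` and `iou_xywh` are hypothetical helpers, not part of the VOT toolkit, and the toolkit's own overlap routine may give slightly different values depending on how it handles pixel boundaries.

```python
import numpy as np


def to_pixel_xywh(boxes: np.ndarray, width: int, height: int) -> np.ndarray:
  """Converts normalised [x1, y1, x2, y2] boxes to pixel [x, y, w, h]."""
  x1, y1, x2, y2 = boxes.T
  return np.stack([x1 * width, y1 * height,
                   (x2 - x1) * width, (y2 - y1) * height], axis=1)


def iou_xywh(box_a, box_b) -> float:
  """Computes IoU of two axis-aligned [x, y, w, h] boxes."""
  ax2, ay2 = box_a[0] + box_a[2], box_a[1] + box_a[3]
  bx2, by2 = box_b[0] + box_b[2], box_b[1] + box_b[3]
  inter_w = max(0.0, min(ax2, bx2) - max(box_a[0], box_b[0]))
  inter_h = max(0.0, min(ay2, by2) - max(box_a[1], box_b[1]))
  inter = inter_w * inter_h
  union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
  return float(inter / union) if union > 0 else 0.0


# Toy 100x100 video: two 50x50 boxes offset by 25 pixels in each direction.
pred_px = to_pixel_xywh(np.array([[0.0, 0.0, 0.5, 0.5]]), 100, 100)
gt_px = to_pixel_xywh(np.array([[0.25, 0.25, 0.75, 0.75]]), 100, 100)
iou = iou_xywh(pred_px[0], gt_px[0])
print(iou)  # intersection 625 / union 4375 = 1/7
```

In this toy case the intersection is a 25x25 square, so the IoU is 625 / (2500 + 2500 - 625) = 1/7.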

In [None]:
# @title Evaluate static baseline
label_path = './data/object_tracking_valid_subset.json'
cfg = {'video_folder_path': './data/videos/',
       'task': 'object_tracking',
       'split': 'valid'}

# when true the bounding boxes and frame IDs will be filtered to demonstrate
# how to submit for the subset of annotated frames which will be used for
# evaluation
filter_outputs = True

# init dataset
tracking_dataset = PerceptionDataset(label_path, **cfg)

# init tracking model
object_tracker = ObjectTracker()

# run model across full dataset
results = {}
for video_item in tracking_dataset:
  video_id = video_item['metadata']['video_id']
  video_pred_tracks = []
  for gt_track in video_item['object_tracking']:
    start_info = get_start_info(gt_track)
    pred_bounding_boxes, pred_frame_ids = (
        object_tracker.track_object_in_video(video_item['frames'], start_info)
    )

    # filtering bounding boxes and frame IDs
    if filter_outputs:
      pred_bounding_boxes, pred_frame_ids = (
          filter_pred_boxes(pred_bounding_boxes, pred_frame_ids,
                            gt_track['frame_ids'])
      )

    pred_track = {}
    # .tolist() to serialise without error
    pred_track['bounding_boxes'] = pred_bounding_boxes.tolist()
    pred_track['frame_ids'] = pred_frame_ids.tolist()
    pred_track['id'] = gt_track['id']
    video_pred_tracks.append(pred_track)

  results[video_id] = video_pred_tracks

In [None]:
# @title Compute average IoU
label_dict = load_db_json(label_path)
iou_results = run_iou(results, label_dict, filter=filter_outputs)
summarise_results(label_dict, iou_results)

Average IoU across all videos in dataset:
          0.639
Average IoU across static camera videos in dataset:
          0.676
Average IoU across moving camera videos in dataset:
          0.397
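
The averages above are computed per video first and then across videos, so a video with many tracks does not weigh more than a video with one. A minimal sketch of that aggregation, using invented IoU values and an invented `toy_is_moving` lookup in place of the real metadata:

```python
import numpy as np

# Invented per-track IoU scores, keyed by video ID and then track ID.
toy_results = {'video_a': {0: 0.8, 1: 0.6}, 'video_b': {0: 0.2}}
# Invented stand-in for the `is_camera_moving` metadata flag.
toy_is_moving = {'video_a': False, 'video_b': True}

# Step 1: average the track scores within each video.
per_video = {vid: float(np.mean(list(tracks.values())))
             for vid, tracks in toy_results.items()}

# Step 2: average the per-video scores, overall and by camera motion.
overall = float(np.mean(list(per_video.values())))
static = float(np.mean([v for vid, v in per_video.items()
                        if not toy_is_moving[vid]]))
moving = float(np.mean([v for vid, v in per_video.items()
                        if toy_is_moving[vid]]))
print(f'{overall:.3f} {static:.3f} {moving:.3f}')
# prints 0.450 0.700 0.200
```

Averaging all three track scores directly would instead give about 0.533, which is why the two-step average matters when comparing against the numbers reported here.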


In [None]:
# @title Serialise example results file
# Writing model outputs in the expected competition format. This
# JSON file contains the predicted tracks for all videos in the validation
# split, with boxes as [x1,y1,x2,y2], in the format:

# example_submission = {'video_1009' :{'object_tracking':[
#     {'id': 0, 'bounding_boxes': [[x1,y1,x2,y2],...], 'frame_ids': [n,...]},
#     {'id': 1, 'bounding_boxes': [[x1,y1,x2,y2],...], 'frame_ids': [n,...]}]}}

# This file could be used directly as a submission for the Eval.ai challenge
with open(f'{cfg["task"]}_{cfg["split"]}_results.json', 'w') as my_file:
  json.dump(results, my_file)

del results
del iou_results
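
Before uploading a results file, it can help to sanity-check its structure. The sketch below is a hypothetical validator, `check_tracks`, written against the structure this notebook builds in `results` (a dict from video ID to a list of tracks); it is an assumption that the challenge server applies similar checks, and the official submission instructions remain the authority on the exact format.

```python
from typing import Any, Dict, List


def check_tracks(results: Dict[str, List[Dict[str, Any]]]) -> None:
  """Raises ValueError if any predicted track looks malformed.

  Hypothetical helper: assumes boxes are [x1, y1, x2, y2] and that each
  track carries 'id', 'bounding_boxes' and 'frame_ids'.
  """
  for video_id, tracks in results.items():
    for track in tracks:
      for key in ('id', 'bounding_boxes', 'frame_ids'):
        if key not in track:
          raise ValueError(f'{video_id}: track is missing "{key}"')
      boxes, frame_ids = track['bounding_boxes'], track['frame_ids']
      if len(boxes) != len(frame_ids):
        raise ValueError(f'{video_id}: {len(boxes)} boxes for '
                         f'{len(frame_ids)} frame IDs')
      for box in boxes:
        if len(box) != 4 or box[0] > box[2] or box[1] > box[3]:
          raise ValueError(f'{video_id}: bad box {box}')


toy_results = {'video_x': [{'id': 0,
                            'bounding_boxes': [[0.1, 0.2, 0.4, 0.6]],
                            'frame_ids': [30]}]}
check_tracks(toy_results)  # passes silently for a well-formed track
```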

In [None]:
# @title Visualisation functions
def get_colors(num_colors: int) -> List[Tuple[int, int, int]]:
  """Generate random colormaps for visualizing different objects.

  Args:
    num_colors (int): The number of colors to generate.

  Returns:
    List[Tuple[int, int, int]]: A shuffled list of RGB tuples, one per
      generated color.
  """
  colors = []
  for i in np.arange(0., 360., 360. / num_colors):
    hue = i / 360.
    lightness = (50 + np.random.rand() * 10) / 100.
    saturation = (90 + np.random.rand() * 10) / 100.
    color = colorsys.hls_to_rgb(hue, lightness, saturation)
    color = (int(color[0] * 255), int(color[1] * 255), int(color[2] * 255))
    colors.append(color)
  random.seed(0)
  random.shuffle(colors)
  return colors


def display_video(vid_frames: np.array, fps: int = 30):
  """Create and display temporary video from numpy array frames.

  Args:
    vid_frames: (np.array): The frames of the video as a
      numpy array. Format of frames should be:
      (num_frames, height, width, channels)
    fps (int): Frames per second for the video playback. Default is 30.
  """
  kwargs = {'macro_block_size': None}
  imageio.mimwrite('tmp_video_display.mp4',
                   vid_frames[:, :, :, ::-1], fps=fps, **kwargs)
  display(mvp.ipython_display('tmp_video_display.mp4'))


def display_frame(frame: np.array):
  """Display a frame, converting from RGB to BGR for cv2.

  Args:
    frame (np.array): The frame to be displayed.
  """
  cv2_imshow(frame)


def paint_box(video: np.array, track: Dict[str, Any],
              color: Tuple[int, int, int] = (255, 0, 0),
              addn_label: str = '') -> np.array:
  """Paint bounding box and label on video for a given track.

  Args:
    video (np.array): The video frames as a numpy array.
    track (Dict): The track information containing bounding box
      and frame information, assumes Perception Test Dataset format.
    color (Tuple[int, int, int]): The RGB color values for the bounding box.
      Default is red (255, 0, 0).
    addn_label (str): Additional label to be added to the track label.
      Default is an empty string.

  Returns:
    np.array: The modified video frames with painted bounding box and
      label.
  """
  _, height, width, _ = video.shape
  name = str(track['id']) + ' : ' + track['label'] + addn_label
  bounding_boxes = np.array(track['bounding_boxes'])

  for box, frame_id in zip(bounding_boxes, track['frame_ids']):
    frame = np.array(video[frame_id])
    x1 = int(round(box[0] * width))
    y1 = int(round(box[1] * height))
    x2 = int(round(box[2] * width))
    y2 = int(round(box[3] * height))
    frame = cv2.rectangle(frame, (x1, y1), (x2, y2),
                          color=color, thickness=2)
    frame = cv2.putText(frame, name, (x1, y1 + 20),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.75, color, 2)
    video[frame_id] = frame

  return video


def paint_boxes(video: np.array, tracks: List[Dict],
                colors: List[Tuple[int, int, int]]) -> np.array:
  """Paint bounding boxes and labels on a video for multiple tracks.

  Args:
    video (np.array): The video frames as a numpy array.
    tracks (List): A list of track information,
      where each track contains bounding box and frame information.
    colors (List): List of randomly generated RGB color tuples, indexed by
      track position.

  Returns:
    np.array: The modified video frames with painted bounding boxes
      and labels.
  """
  for i, track in enumerate(tracks):
    video = paint_box(video, track, colors[i])
  return video

In [None]:
# @title Ground truth annotations visualised
COLORS = get_colors(num_colors=100)

# sample annotations and videos to visualise the annotations later
sample_annot_url = 'https://storage.googleapis.com/dm-perception-test/zip_data/sample_annotations.zip'
download_and_unzip(sample_annot_url, data_path)

sample_videos_url = 'https://storage.googleapis.com/dm-perception-test/zip_data/sample_videos.zip'
download_and_unzip(sample_videos_url, data_path)

# get sample video info to showcase the annotations
sample_db_path = './data/sample.json'
sample_db_dict = load_db_json(sample_db_path)
video_id = list(sample_db_dict.keys())[0]
video_item = sample_db_dict[video_id]
frames = get_video_frames(video_item, cfg['video_folder_path'])

tracks_to_show = 0.5  # @param {type:"slider", min:0, max:1, step:0.05}
num_tracks = int(len(video_item['object_tracking']) * tracks_to_show)
frames = paint_boxes(frames, video_item['object_tracking'][0 : num_tracks],
                     COLORS)

annotated_frames = []
for frame_idx in video_item['object_tracking'][0]['frame_ids']:
  annotated_frames.append(frames[frame_idx])

annotated_frames = np.array(annotated_frames)
display_video(annotated_frames, 1)
del frames
del annotated_frames

In [None]:
# @title Model outputs visualised (static boxes)
# here we show how actual inference would work with video frames loaded
frames = get_video_frames(video_item, cfg['video_folder_path'])

pred_tracks = []
for gt_track in video_item['object_tracking']:
  start_info = get_start_info(gt_track)
  pred_bounding_boxes, pred_frame_ids = (
      object_tracker.track_object_in_video(frames, start_info)
  )
  pred_track = {}
  pred_track['bounding_boxes'] = pred_bounding_boxes.tolist()
  pred_track['frame_ids'] = pred_frame_ids
  pred_track['label'] = gt_track['label']
  pred_track['id'] = gt_track['id']
  pred_tracks.append(pred_track)

tracks_to_show = 0.5  # @param {type:"slider", min:0, max:1, step:0.05}
num_tracks = int(len(pred_tracks) * tracks_to_show)
frames = paint_boxes(frames, pred_tracks[0: num_tracks], COLORS)

# can display video at full fps since all frames have bounding boxes from
# the model output
display_video(frames, video_item['metadata']['frame_rate'])
del frames

In [None]:
# @title Comparing ground truth labels vs model outputs
track_to_compare = 1  # @param {type:"integer"}

if not 0 <= track_to_compare < len(video_item['object_tracking']):
  raise ValueError(f'Track {track_to_compare} is not in the video')

frames = get_video_frames(video_item, cfg['video_folder_path'])
frames = paint_box(frames, video_item['object_tracking'][track_to_compare],
                   color=COLORS[0], addn_label=': label')
frames = paint_box(frames, pred_tracks[track_to_compare], color=COLORS[1],
                   addn_label=': predicted')

annotated_frames = []
for frame_idx in video_item['object_tracking'][track_to_compare]['frame_ids']:
  annotated_frames.append(frames[frame_idx])

annotated_frames = np.array(annotated_frames)
display_video(annotated_frames, 1)
del frames