# ORBIT Challenge - Getting Started

This notebook walks you through a simple starter task for the [ORBIT Few-Shot Object Recognition Challenge 2023](https://eval.ai/web/challenges/challenge-page/1896). In this starter task, you will download a few-shot learning model (Prototypical Networks, Snell et al., 2017) trained on the ORBIT train set, and use it to generate frame predictions on the ORBIT validation set. The predictions will be saved to a JSON file in the format required by the Challenge's evaluation server. You can upload this JSON file under the 'Starter Task' phase on the evaluation server to check your implementation.

This notebook has been tested using the conda environment specified in [environment.yml](environment.yml).

First, we need a local copy of the ORBIT dataset. If you already have a copy of the data, you can skip this step!

In this notebook, we will download a local copy of the validation data (already resized to 224x224 frames) as well as extra frame annotations (e.g. object bounding boxes, quality issues) for the train, validation and test data. This will take ~4.3GB of disk space. Note that the validation data comes from 6 validation users and is used here only for the starter task. For the main Challenge, you will need to use the test data, which comes from a different set of 17 test users.

To download the full dataset, you can use [download_pretrained_dataset.py](scripts/download_pretrained_dataset.py). The full dataset takes up 83GB (1080x1080 frames) or 54GB (224x224 frames).

In [None]:
from pathlib import Path

DATA_ROOT = "orbit_benchmark_224" # note, we are downloading the validation set already resized to 224x224 frames
DATA_SPLIT = "validation"
validation_path = Path(DATA_ROOT, DATA_SPLIT)
annotation_path = Path(DATA_ROOT, 'annotations')

# download validation split
if not validation_path.is_dir():
    validation_path.parent.mkdir(parents=True, exist_ok=True)
    print("Downloading validation.zip...")
    !wget -O validation.zip https://city.figshare.com/ndownloader/files/28368351 

    print("Unzipping validation.zip...")
    !unzip -q validation.zip -d {DATA_ROOT}

    if not validation_path.is_dir():
        raise ValueError(f"Path {validation_path} is not a directory.")
    else:
        print(f"Dataset ready at {validation_path}.")
    # You can now delete the zip file.
else:
    print(f"Dataset already saved at {validation_path}.")

# download (train, validation and test) annotations
if not annotation_path.is_dir():
    annotation_path.parent.mkdir(parents=True, exist_ok=True)
    print("Downloading orbit_extra_annotations.zip...")
    !wget -O orbit_extra_annotations.zip https://github.com/microsoft/ORBIT-Dataset/raw/dev/data/orbit_extra_annotations.zip

    print("Unzipping orbit_extra_annotations.zip...")
    !unzip -q orbit_extra_annotations.zip -d {annotation_path}

    if not annotation_path.is_dir():
        raise ValueError(f"Path {annotation_path} is not a directory.")
    else:
        print(f"Annotations ready at {annotation_path}.")
    # You can now delete the zip file.
else:
    print(f"Annotations already saved at {annotation_path}.")
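
The cell above relies on the IPython `!wget` and `!unzip` shell escapes, which only work inside a notebook and need both tools installed. If you want to run the same steps from a plain Python script, a minimal sketch using only the standard library could look like the following. The helper name `download_and_unzip` is hypothetical (it is not part of the ORBIT codebase), and some hosts may need extra handling (e.g. custom headers) that this sketch does not cover.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_unzip(url: str, zip_path: Path, extract_dir: Path) -> Path:
    """Download `url` to `zip_path` (unless it already exists) and extract it into `extract_dir`.

    Hypothetical helper for illustration; not part of the ORBIT codebase.
    """
    zip_path = Path(zip_path)
    extract_dir = Path(extract_dir)

    # skip the download if the archive is already on disk
    if not zip_path.exists():
        zip_path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, zip_path)

    # extract the whole archive into the target directory
    extract_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)
    return extract_dir
```

For example, `download_and_unzip("https://city.figshare.com/ndownloader/files/28368351", Path("validation.zip"), Path(DATA_ROOT))` would mirror the validation download above, assuming the server accepts requests from `urllib`.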

Now, we can create an instance of the dataset. This creates a queue of tasks from the dataset that can be divided between multiple workers.

In [None]:
from data.queues import UserEpisodicDatasetQueue

print("Creating data queue...")
data_queue = UserEpisodicDatasetQueue(
    root=Path(DATA_ROOT, DATA_SPLIT), # path to data
    way_method="max", # sample all objects per user
    object_cap=15, # cap number of objects per user to 15 (no ORBIT user has >15 objects)
    shot_method=["max", "max"], # sample [all context videos, all target videos] per object
    shots=[5, 2], # only relevant if shot_method contains strings "specific" or "fixed"
    video_types=["clean", "clutter"], # sample clips from [clean context videos, clutter target videos]
    subsample_factor=30, # sample every 30th clip from a video if clip_method = uniform
    clip_methods=["uniform", "random_200"], # sample [clips uniformly from each context video, 200 random target clips per target video]; note if test_mode=True, target clips will be flattened into a list of frames
    clip_length=1, # sample 1 frame per clip. Can be increased to sample multiple frames per clip.
    frame_size=224, # width and height of frame 
    frame_norm_method='imagenet_inception', # normalize frames using imagenet inception statistics since we're using ViT-B-32 pretrained on ImageNet-21K (see below).
    annotations_to_load=[], # do not load any frame annotations
    filter_by_annotations=[[], ['no_object_not_present_issue']], # only includes target frames with the 'object_not_present_issue=False' tag. Note, context frames are not filtered as extra annotations cannot be used for personalisation.
    num_tasks=10, # sample 10 tasks per user. Note, this is just for the starter task. The full challenge will require you to sample 50 tasks per user.
    test_mode=True, # sample test (rather than train) tasks
    with_cluster_labels=False, # use user's personalised object names as labels, rather than broader object categories
    with_caps=False, # do not impose any sampling caps
    shuffle=False, # do not shuffle task data
    num_workers=2 # use 2 workers to load data
)

print(f"Created data queue, queue uses {data_queue.num_workers} workers.")
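
To give some intuition for the `subsample_factor=30` and `clip_length=1` settings above, here is a toy sketch of uniform clip subsampling. It assumes a video is first split into non-overlapping clips of `clip_length` frames, and that every `subsample_factor`-th clip is then kept. The function `subsample_uniform` is hypothetical, and the repository's actual sampling logic may differ in its details.

```python
from typing import List


def subsample_uniform(frame_paths: List[str], subsample_factor: int = 30, clip_length: int = 1) -> List[List[str]]:
    """Split a video into non-overlapping clips and keep every `subsample_factor`-th clip.

    Hypothetical illustration; not the ORBIT implementation.
    """
    # consecutive, non-overlapping clips of `clip_length` frames (a trailing partial clip is dropped)
    clips = [
        frame_paths[start:start + clip_length]
        for start in range(0, len(frame_paths) - clip_length + 1, clip_length)
    ]
    # keep clips 0, subsample_factor, 2 * subsample_factor, ...
    return clips[::subsample_factor]


# toy video with 90 frames
frames = [f"frame_{i:04d}.jpg" for i in range(90)]
context_clips = subsample_uniform(frames, subsample_factor=30, clip_length=1)
print(context_clips)  # [['frame_0000.jpg'], ['frame_0030.jpg'], ['frame_0060.jpg']]
```

Under this assumption, a larger `subsample_factor` yields fewer context clips per video, which lowers the cost of personalisation but gives the model less data per object.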

We now need to set up the model. For the starter task, we will use a few-shot learning model called Prototypical Networks (Snell et al., 2017), using a cosine rather than a Euclidean distance. The model uses a Vision Transformer feature extractor which has been pre-trained on ImageNet-21K (i.e. 'vit_b_32'). This model was then meta-trained on the ORBIT train users for the Clutter Video Evaluation (CLUVE) task (trained on 224x224 frames, using LITE). First, we download the checkpoint file that corresponds to this model. We then create an instance of the model using the pretrained weights.

In [None]:
checkpoint_path = Path("orbit_pretrained_checkpoints", "orbit_cluve_protonets_cosine_vit_b_32_224_lite.pth")

if not checkpoint_path.exists():
    checkpoint_path.parent.mkdir(parents=True, exist_ok=True)
    print("Downloading checkpoint file...")
    !wget -O orbit_pretrained_checkpoints/orbit_cluve_protonets_cosine_vit_b_32_224_lite.pth https://taixmachinelearning.z5.web.core.windows.net/orbit_cluve_protonets_cosine_vit_b_32_224_lite.pth
    print(f"Checkpoint saved to {checkpoint_path}.")
else:
    print(f"Checkpoint already exists at {checkpoint_path}.")

In [None]:
import torch
from model.few_shot_recognisers import SingleStepFewShotRecogniser

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    map_location = lambda storage, _: storage.cuda()
else:
    device = torch.device("cpu")
    map_location = lambda storage, _: storage.cpu()

model = SingleStepFewShotRecogniser(
    feature_extractor_name="vit_b_32", # feature extractor is a Vision Transformer
    adapt_features=False, # do not generate FiLM Layers
    classifier="proto_cosine", # use a Prototypical Networks classifier head, with a cosine rather than Euclidean distance metric
    clip_length=1, # number of frames per clip; frame features are mean-pooled to get the clip feature
    batch_size=256, # number of clips within a task to process at a time
    learn_extractor=False, # only relevant when training ProtoNets
    num_lite_samples=16, # only relevant when training with LITE
    logit_scale=32.0 # scalar to scale logits (32.0 for proto_cosine, but typically 1.0 for proto)
)

model._set_device(device)
model._send_to_device()
# load in the pretrained checkpoint weights
model.load_state_dict(torch.load(checkpoint_path, map_location=map_location), strict=False)
# set the model to evaluation mode (ensures batch norm modules are in the correct state)
model.set_test_mode(True)
print(f"Instance of SingleStepFewShotRecogniser created on device {device}.")
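
To make the `proto_cosine` classifier head configured above more concrete, here is a minimal NumPy sketch of a Prototypical Networks classifier with a cosine distance. It assumes each class prototype is the mean of that class's context features, and that logits are cosine similarities multiplied by `logit_scale`. The function `cosine_prototype_logits` and the toy features are hypothetical, and the implementation in `SingleStepFewShotRecogniser` may differ.

```python
import numpy as np


def cosine_prototype_logits(context_features, context_labels, target_features, logit_scale=32.0):
    """Score target features against per-class mean prototypes using scaled cosine similarity.

    Hypothetical illustration; assumes labels are the integers 0..num_classes-1.
    """
    classes = np.unique(context_labels)
    # one prototype per class: the mean of that class's context features
    prototypes = np.stack([context_features[context_labels == c].mean(axis=0) for c in classes])
    # L2-normalise both sides so that the dot product equals the cosine similarity
    prototypes = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    targets = target_features / np.linalg.norm(target_features, axis=1, keepdims=True)
    # shape: (num_targets, num_classes)
    return logit_scale * (targets @ prototypes.T)


# toy task: 2 objects, 2 context features each, 2 target features
context_features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
context_labels = np.array([0, 0, 1, 1])
target_features = np.array([[0.8, 0.2], [0.2, 0.8]])

logits = cosine_prototype_logits(context_features, context_labels, target_features)
predictions = logits.argmax(axis=1)
print(predictions)  # [0 1]
```

Because cosine similarities lie in [-1, 1], scaling by a factor such as 32.0 spreads the logits out before any softmax is applied. This is one plausible reason for using a larger `logit_scale` with `proto_cosine` than with a Euclidean `proto` head.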

We are now going to run our data through our model. We go through each task (10 tasks per user, since we specified `num_tasks = 10` above) and use the task's context clips to create a personalized model for that user's task. We then evaluate the personalized model on each frame in the task's target videos.

The results for each task will be saved to a JSON file (this is what should be submitted to the evaluation server) and the aggregate stats will be printed to the console. You should get a frame accuracy of 83.24 +/- 1.69% - see `Average over all videos (leaderboard metric)`.

In [None]:
import numpy as np
import time
from tqdm.notebook import tqdm
from typing import Dict, Tuple
from data.utils import attach_frame_history
from utils.eval_metrics import TestEvaluator
from thop import clever_format

output_dir = Path("output", DATA_SPLIT)
output_dir.mkdir(exist_ok=True, parents=True)

metrics = ['frame_acc']
evaluator = TestEvaluator(metrics, output_dir, with_ops_counter=True)
evaluator.set_base_params(model)

def get_stats_str(stats: Dict[str, Tuple[float, float]], dps: int=2) -> str:
    stats_str = "\t".join([f"{metric}: {stats[metric][0]*100:.{dps}f} ({stats[metric][1]*100:.{dps}f})" for metric in metrics])
    return stats_str

num_context_clips_per_task = []
num_target_clips_per_task = []
num_test_tasks = data_queue.num_users * data_queue.num_tasks
macs_to_personalise = []
with torch.no_grad():
    for step, task in enumerate(tqdm(data_queue.get_tasks(), desc=f"Running evaluation on {data_queue.num_tasks} tasks per test user", total=num_test_tasks)):
        context_clips = task["context_clips"].to(device)        # Torch tensor of shape: (N, clip_length, C, H, W), dtype float32
        context_labels = task["context_labels"].to(device)      # Torch tensor of shape: (N), dtype int64
        object_list = task["object_list"]                       # List of str of length num_objects
        num_context_clips = len(context_clips)

        # log task in evaluator
        evaluator.set_task_object_list(object_list)
        #evaluator.set_task_context_paths(task["context_paths"])
        
        # personalise the pre-trained model to the current task and log the time it took
        t1 = time.time()
        model.personalise(context_clips, context_labels, ops_counter=evaluator.ops_counter)
        evaluator.log_time(time.time() - t1, 'personalise')

        # loop through each of the user's target videos, and get predictions from the personalised model for every frame
        num_target_clips = 0
        for video_frames, video_paths, video_label in zip(task['target_clips'], task["target_paths"], task['target_labels']):
            # video_frames is a Torch tensor of shape (frame_count, C, H, W), dtype float32
            # video_paths is a sequence of frame_count frame paths (Path objects; not a Torch tensor, since tensors cannot hold Path objects)
            # video_label is single int64

            # first, for each frame, attach a short history of its previous frames if clip_length > 1
            video_frames_with_history = attach_frame_history(video_frames, model.clip_length)      # Torch tensor of shape: (frame_count, clip_length, C, H, W), dtype float32
            num_clips = len(video_frames_with_history)

            t1 = time.time()
            # get predicted logits for each frame and log time per clip
            logits = model.predict(video_frames_with_history)                                      # Torch tensor of shape: (frame_count, num_objects), dtype float32
            evaluator.log_time((time.time() - t1)/float(num_clips), 'inference')
            evaluator.append_video(logits, video_label, video_paths)
            num_target_clips += num_clips

        # reset model for next task 
        model._reset()

        # complete the task (required for correct ops counter numbers)
        evaluator.task_complete()
        num_context_clips_per_task.append(num_context_clips)
        num_target_clips_per_task.append(num_target_clips)

        # check if the user has any more tasks; with num_tasks == 10 (as set above), we move on to the next user after every 10th task.
        if (step+1) % data_queue.num_tasks == 0:
            evaluator.set_current_user(task["task_id"])
            _,_,_,current_video_stats = evaluator.get_mean_stats(current_user=True)
            macs_to_personalise.append(np.mean(evaluator.macs_counter))
            tqdm.write(f"User {task['task_id']} ({evaluator.current_user+1}/{len(data_queue)}) {get_stats_str(current_video_stats)}, avg MACs to personalise / task: {np.mean(evaluator.macs_counter)}, avg # context clips/task: {np.mean(num_context_clips_per_task):.0f}, avg # target clips/task: {np.mean(num_target_clips_per_task):.0f}")
            if (step+1) < num_test_tasks:
                num_context_clips_per_task = []
                num_target_clips_per_task = []
                evaluator.next_user()
        else:
            evaluator.next_task()

# Compute the aggregate statistics averaged over users, objects, tasks and videos. We use the video aggregate stats for the competition.
stats_per_user, stats_per_obj, stats_per_task, stats_per_video = evaluator.get_mean_stats()
print('-'*20)
print(f"Average over all users: {get_stats_str(stats_per_user)}")
print(f"Average over all objects: {get_stats_str(stats_per_obj)}")
print(f"Average over all tasks: {get_stats_str(stats_per_task)}")
print(f"Average over all videos (leaderboard metric): {get_stats_str(stats_per_video)}")
mean_macs, std_macs = clever_format([np.mean(macs_to_personalise), np.std(macs_to_personalise)], "%.2f")
print(f"Average MACs to personalise over all tasks (leaderboard metric): {mean_macs} ({std_macs})")
evaluator.save()
print(f"Results saved to {evaluator.json_results_path}.")
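
To illustrate how a figure such as `83.24 +/- 1.69%` can arise, here is a minimal sketch of per-video frame accuracy averaged over videos. It assumes the reported spread is a 95% confidence interval under a normal approximation (`1.96 * std / sqrt(n)`); the exact computation in `TestEvaluator` may differ. The helper names and toy logits are hypothetical.

```python
import numpy as np


def frame_accuracy(logits: np.ndarray, label: int) -> float:
    """Fraction of frames in one video whose highest-scoring class equals the video's label."""
    predictions = np.argmax(logits, axis=1)
    return float(np.mean(predictions == label))


def mean_and_confidence_interval(scores, z: float = 1.96):
    """Mean of the per-video scores and an (assumed) 95% confidence interval half-width."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean()), float(z * scores.std() / np.sqrt(len(scores)))


# toy example: two target videos with per-frame logits over 2 objects
video_a_logits = np.array([[2.0, 1.0], [0.0, 3.0], [5.0, 0.0], [4.0, 1.0]])  # label 0, 3 of 4 frames correct
video_b_logits = np.array([[0.0, 1.0], [0.0, 2.0]])                          # label 1, 2 of 2 frames correct

per_video_acc = [frame_accuracy(video_a_logits, 0), frame_accuracy(video_b_logits, 1)]
mean_acc, ci = mean_and_confidence_interval(per_video_acc)
print(f"frame_acc: {mean_acc * 100:.2f} ({ci * 100:.2f})")
```

Averaging per video, rather than pooling all frames, gives every target video equal weight regardless of its length.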