<a href="https://colab.research.google.com/github/deepmind/perception_test/blob/main/baselines/temporal_action_localisation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Copyright 2023 DeepMind Technologies Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at [https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0).
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# Temporal Action Localisation ActionFormer Baseline

GitHub: https://github.com/deepmind/perception_test

## The Perception Test
[Perception Test: A Diagnostic Benchmark for Multimodal Video Models](https://arxiv.org/abs/2305.13786) is a multimodal benchmark designed to comprehensively evaluate the perception and reasoning skills of multimodal video models. The Perception Test dataset introduces real-world videos designed to show perceptually interesting situations and defines multiple computational tasks (object and point tracking, action and sound localisation, multiple-choice and grounded video question-answering). Here, we provide details and a baseline for the temporal action localisation task.

[![Perception Test Overview Presentation](https://img.youtube.com/vi/8BiajMOBWdk/maxresdefault.jpg)](https://youtu.be/8BiajMOBWdk?t=10)


## Temporal Action Localisation
In the temporal action localisation task, the model receives a video and is required to localise and classify the actions occurring in the video according to a predefined set of classes.

The image below shows examples of temporal action localisation annotations. Each action is represented by a start and an end timestamp together with its class label.

![image](https://storage.googleapis.com/dm-perception-test/img/action_annotations.png)
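
To make the annotation format concrete, here is a minimal sketch of how one annotated action could be represented in Python. The class name, the field names, the assumption that timestamps are in seconds, and the example values are all hypothetical and are not taken from the dataset files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionSegment:
  """One annotated action: a time span plus a class label."""
  start: float  # start timestamp (hypothetical field name, assumed seconds)
  end: float    # end timestamp (hypothetical field name, assumed seconds)
  label: str    # action class name (hypothetical field name)

  @property
  def duration(self) -> float:
    """Length of the action in the same unit as the timestamps."""
    return self.end - self.start


def actions_at(segments, t):
  """Returns the labels of all segments whose span contains time t."""
  return [s.label for s in segments if s.start <= t <= s.end]


# Made-up example values, not taken from the dataset.
segments = [
    ActionSegment(0.5, 2.0, 'action_a'),
    ActionSegment(1.5, 4.0, 'action_b'),
]
print(actions_at(segments, 1.75))  # both segments contain t=1.75
```

Because actions may overlap in time, a single query time can return several labels, which is why `actions_at` returns a list rather than a single class.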

## ActionFormer baseline
This notebook demonstrates how to use our modified [ActionFormer repository](https://github.com/ptchallenge-workshop/actionformer_release_PT) to train and evaluate an ActionFormer model, using either video-only input or combined video and audio input.

In [None]:
# Make sure GPU runtime is enabled!!

In [1]:
# @title Clone adapted ActionFormer repository and install
#!git clone https://github.com/ptchallenge-workshop/actionformer_release_PT.git

#%cd /content/actionformer_release_PT/libs/utils
#!python setup.py install --user
#%cd ../..

# dirs for storing models and data
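# use the %cd magic so the directory change persists across cells;
# `!cd` only changes directory inside a temporary subshell
%cd /home/action_recognition/actionformer_release_PT
# -p creates parent directories and does not fail if they already exist
!mkdir -p data/pt
!mkdir -p ckpt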

In [2]:
# @title Download Utility Function
import os
import zipfile
import requests


def download_and_unzip(url: str, destination: str):
  """Downloads and unzips a .zip file to a destination.

  Downloads a file from the specified URL, saves it to the destination
  directory, and then extracts its contents.

  If the file is larger than 1GB, it will be downloaded in chunks,
  and the download progress will be displayed.

  Args:
    url (str): The URL of the file to download.
    destination (str): The destination directory to save the file and
      extract its contents.
  """
  if not os.path.exists(destination):
    os.makedirs(destination)

  filename = url.split('/')[-1]
  file_path = os.path.join(destination, filename)

  if os.path.exists(file_path):
    print(f'{filename} already exists. Skipping download.')
    return

  response = requests.get(url, stream=True)
  response.raise_for_status()  # fail early on HTTP errors
  total_size = int(response.headers.get('content-length', 0))
  gb = 1024*1024*1024

  if total_size / gb > 1:
    print(f'{filename} is larger than 1GB, downloading in chunks')
    chunk_flag = True
    chunk_size = int(total_size/100)
  else:
    chunk_flag = False
    # avoid a chunk size of 0 when the server sends no content-length
    chunk_size = total_size or 1024*1024

  with open(file_path, 'wb') as file:
    for chunk_idx, chunk in enumerate(
        response.iter_content(chunk_size=chunk_size)):
      if chunk:
        if chunk_flag:
          # print progress on a single line
          print(f'{chunk_idx}% downloading '
                f'{round((chunk_idx*chunk_size)/gb, 1)}GB '
                f'/ {round(total_size/gb, 1)}GB')
        file.write(chunk)
  print(f"'{filename}' downloaded successfully.")

  with zipfile.ZipFile(file_path, 'r') as zip_ref:
    zip_ref.extractall(destination)
  print(f"'{filename}' extracted successfully.")

  os.remove(file_path)

In [3]:
# @title Download data and pretrained model
data_path = './data/pt'
model_path = './ckpt'

train_annot_url = 'https://storage.googleapis.com/dm-perception-test/zip_data/challenge_action_localisation_train_annotations.zip'
download_and_unzip(train_annot_url, data_path)
train_video_feat_url = 'https://storage.googleapis.com/dm-perception-test/zip_data/action_localisation_train_video_features.zip'
download_and_unzip(train_video_feat_url, data_path)
train_audio_feat_url = 'https://storage.googleapis.com/dm-perception-test/zip_data/sound_localisation_train_audio_features.zip'
download_and_unzip(train_audio_feat_url, data_path)


valid_annot_url = 'https://storage.googleapis.com/dm-perception-test/zip_data/challenge_action_localisation_valid_annotations.zip'
download_and_unzip(valid_annot_url, data_path)
valid_video_feat_url = 'https://storage.googleapis.com/dm-perception-test/zip_data/action_localisation_valid_video_features.zip'
download_and_unzip(valid_video_feat_url, data_path)
valid_audio_feat_url = 'https://storage.googleapis.com/dm-perception-test/zip_data/sound_localisation_valid_audio_features.zip'
download_and_unzip(valid_audio_feat_url, data_path)

# Here we download a pretrained model. This can be commented out and the
# training command below can be run instead to train the model from scratch.
model_url = 'https://storage.googleapis.com/dm-perception-test/saved_models/perception_tal_video_train_reproduce.zip'
download_and_unzip(model_url, model_path)

'challenge_action_localisation_train_annotations.zip' downloaded successfully.
'challenge_action_localisation_train_annotations.zip' extracted successfully.
'action_localisation_train_video_features.zip' downloaded successfully.
'action_localisation_train_video_features.zip' extracted successfully.
'sound_localisation_train_audio_features.zip' downloaded successfully.
'sound_localisation_train_audio_features.zip' extracted successfully.
'challenge_action_localisation_valid_annotations.zip' downloaded successfully.
'challenge_action_localisation_valid_annotations.zip' extracted successfully.
'action_localisation_valid_video_features.zip' downloaded successfully.
'action_localisation_valid_video_features.zip' extracted successfully.
'sound_localisation_valid_audio_features.zip' downloaded successfully.
'sound_localisation_valid_audio_features.zip' extracted successfully.
'perception_tal_video_train_reproduce.zip' downloaded successfully.
'perception_tal_video_train_reproduce.zip' extract
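
Once the downloads finish, it can help to check that the extracted files sit where the configs expect them before starting a long run. The following is a minimal sketch, assuming the archives extract to the folder and file names listed under `feat_folder` and `json_file` in the validation config printed by the evaluation cell. `missing_paths` is a hypothetical helper and is not part of the ActionFormer repository.

```python
from pathlib import Path


def missing_paths(root, relative_paths):
  """Returns the entries of relative_paths that do not exist under root."""
  root = Path(root)
  return [p for p in relative_paths if not (root / p).exists()]


# Names taken from the validation config (feat_folder and json_file);
# that the downloaded archives extract to exactly these names is an assumption.
expected = [
    'action_localisation_valid_video_features',
    'challenge_action_localisation_valid.json',
]
missing = missing_paths('./data/pt', expected)
print('All expected files found.' if not missing else f'Missing: {missing}')
```

The same helper can be pointed at the training data by passing the corresponding names from the training config.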

In [None]:
# @title Training ActionFormer model
# The pretrained model was downloaded in the cell above, so training is
# skipped by default. Uncomment the line below to train the model instead.
# !python3 train.py configs/perception_tal_video_train.yaml --output reproduce

# multimodal version using audio features alongside video features
# !python3 train.py configs/perception_tal_multi_train.yaml --output reproduce

In [1]:
# @title Evaluating ActionFormer model

# this saves a results JSON file in this location:
# /content/actionformer_release_PT/ckpt/perception_video_train_reproduce/eval_results.json
# this is in the correct format for submission on the eval.ai challenge
# https://eval.ai/web/challenges/challenge-page/2101/overview
!python eval.py configs/perception_tal_video_valid.yaml ckpt/perception_tal_video_train_reproduce/

# multimodal version using audio features alongside video features
# !python eval.py configs/perception_tal_multi_valid.yaml ckpt/perception_tal_multi_train_reproduce/

{'dataset': {'crop_ratio': [0.9, 1.0],
             'default_fps': 15,
             'downsample_rate': 1,
             'feat_folder': './data/pt/action_localisation_valid_video_features',
             'feat_stride': 16,
             'file_ext': '.npy',
             'file_prefix': 'v_',
             'force_upsampling': True,
             'input_dim': 512,
             'input_modality': 'video',
             'json_file': './data/pt/challenge_action_localisation_valid.json',
             'max_seq_len': 192,
             'mm_feat_folder': None,
             'num_classes': 63,
             'num_frames': 16,
             'task': 'action_localisation',
             'trunc_thresh': 0.5},
 'dataset_name': 'perception',
 'devices': ['cuda:0'],
 'init_rand_seed': 1234567891,
 'loader': {'batch_size': 16, 'num_workers': 4},
 'model': {'backbone_arch': (2, 2, 5),
           'backbone_type': 'convTransformer',
           'embd_dim': 512,
           'embd_kernel_size': 3,
           'embd_with_ln': T