Video classification is the task of assigning a single label or class to an entire video. Each video is expected to belong to exactly one class. A video classification model takes a video as input and returns a prediction of which class the video belongs to, so it can be used to categorize what a video is about. A real-world application of video classification is action / activity recognition, which is useful for fitness applications. It can also help vision-impaired individuals, or people who are commuting.

This guide will show you how to:

1. Fine-tune VideoMAE on a subset of the UCF101 dataset.
2. Use your fine-tuned model for inference.

# Libraries

In [1]:
pip install -q pytorchvideo transformers evaluate

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import tarfile
import pathlib
import imageio
import evaluate
import numpy as np
import pytorchvideo.data
from IPython.display import Image

from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# Load Data

In [3]:
from huggingface_hub import hf_hub_download

hf_dataset_identifier = "sayakpaul/ucf101-subset"
filename = "UCF101_subset.tar.gz"
file_path = hf_hub_download(repo_id=hf_dataset_identifier, filename=filename, repo_type="dataset")


In [4]:
# Extract the archive once the dataset has been downloaded
with tarfile.open(file_path) as t:
    t.extractall(".")

From a high-level view, this is how the data are organised:

UCF101_subset/
    train/
        BandMarching/
            video_1.avi
            video_2.avi
            ...
        Archery/
            video_1.avi
            video_2.avi
            ...
        ...
    val/
        BandMarching/
            video_1.avi
            video_2.avi
            ...
        Archery/
            video_1.avi
            video_2.avi
            ...
        ...
    test/
        BandMarching/
            video_1.avi
            video_2.avi
            ...
        Archery/
            video_1.avi
            video_2.avi
            ...
        ...
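
The layout above can be walked with `pathlib` to count clips per split and class. The following is a minimal sketch, not part of the original notebook: it builds a small throwaway tree in a temporary directory, and both the file names and the `count_videos` helper are illustrative assumptions rather than anything provided by the dataset or a library.

```python
import pathlib
import tempfile
from collections import Counter


def count_videos(root, pattern="*.avi"):
    """Count video files per (split, class) under a split/class/file layout."""
    counts = Counter()
    for video_path in pathlib.Path(root).glob(f"*/*/{pattern}"):
        # parts[-3] is the split directory, parts[-2] is the class directory
        split, class_name = video_path.parts[-3], video_path.parts[-2]
        counts[(split, class_name)] += 1
    return counts


# Build a tiny mock tree (hypothetical file names) to exercise the helper.
tmp_root = pathlib.Path(tempfile.mkdtemp()) / "UCF101_subset"
for split, n_videos in [("train", 3), ("val", 1), ("test", 2)]:
    for class_name in ["Archery", "BandMarching"]:
        class_dir = tmp_root / split / class_name
        class_dir.mkdir(parents=True)
        for i in range(n_videos):
            (class_dir / f"video_{i + 1}.avi").touch()

video_counts = count_videos(tmp_root)
for (split, class_name), n in sorted(video_counts.items()):
    print(f"{split:5s} {class_name:12s} {n}")
```

Pointing `count_videos` at the real `UCF101_subset` directory should give a per-class breakdown of the same totals that the next cell computes with `glob`.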

In [5]:
# Gather some metadata about the dataset

# 1. Count the number of videos
dataset_root_path = "UCF101_subset"
dataset_root_path = pathlib.Path(dataset_root_path)
video_count_train = len(list(dataset_root_path.glob("train/*/*.avi")))
video_count_val = len(list(dataset_root_path.glob("val/*/*.avi")))
video_count_test = len(list(dataset_root_path.glob("test/*/*.avi")))
video_total = video_count_train + video_count_val + video_count_test
print(f"Total videos: {video_total}")

# 2. Inspect the video paths
# NB: some clips belong to the same group/scene; the group is denoted by "g" in the file name.
# For example: v_ApplyEyeMakeup_g07_c04.avi and v_ApplyEyeMakeup_g07_c06.avi.
# Clips from the same group should not be spread across the training and validation/evaluation splits.
# The subset used in this tutorial takes this into account to prevent leakage.
all_video_file_paths = (
    list(dataset_root_path.glob("train/*/*.avi"))
    + list(dataset_root_path.glob("val/*/*.avi"))
    + list(dataset_root_path.glob("test/*/*.avi"))
)
all_video_file_paths[:5]

Total videos: 405


[PosixPath('UCF101_subset/train/BalanceBeam/v_BalanceBeam_g02_c03.avi'),
 PosixPath('UCF101_subset/train/BalanceBeam/v_BalanceBeam_g24_c03.avi'),
 PosixPath('UCF101_subset/train/BalanceBeam/v_BalanceBeam_g12_c04.avi'),
 PosixPath('UCF101_subset/train/BalanceBeam/v_BalanceBeam_g03_c01.avi'),
 PosixPath('UCF101_subset/train/BalanceBeam/v_BalanceBeam_g25_c01.avi')]
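
The group information mentioned in the comments can be checked programmatically. The sketch below assumes the `v_<Class>_g<group>_c<clip>.avi` naming seen in the paths above. The regular expression, the helper names, and the validation file names are illustrative assumptions, not part of the dataset tooling.

```python
import re

# Assumed naming scheme: v_<ClassName>_g<group>_c<clip>.avi
GROUP_PATTERN = re.compile(r"v_(?P<label>\w+?)_g(?P<group>\d+)_c(?P<clip>\d+)\.avi$")


def group_key(filename):
    """Return (class label, group id) parsed from a UCF101-style file name."""
    match = GROUP_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    return match["label"], int(match["group"])


def overlapping_groups(split_a, split_b):
    """Return the (label, group) pairs that appear in both splits."""
    return {group_key(f) for f in split_a} & {group_key(f) for f in split_b}


train_files = ["v_ApplyEyeMakeup_g07_c04.avi", "v_ApplyEyeMakeup_g07_c06.avi"]
clean_val_files = ["v_ApplyEyeMakeup_g08_c01.avi"]  # hypothetical file name
leaky_val_files = ["v_ApplyEyeMakeup_g07_c01.avi"]  # hypothetical file name

print(overlapping_groups(train_files, clean_val_files))
print(overlapping_groups(train_files, leaky_val_files))
```

An empty result means no group is shared between the two splits. Any returned pair identifies a group whose clips would leak between them.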

In [6]:
# Derive the set of labels present in the dataset.
# Also, create two dictionaries that’ll be helpful when initializing the model:
#
# - label2id: maps the class names to integers.
# - id2label: maps the integers to class names.

In [7]:
# The class name is the directory that directly contains each video file
class_labels = sorted({path.parent.name for path in all_video_file_paths})
label2id = {label: i for i, label in enumerate(class_labels)}
id2label = {i: label for label, i in label2id.items()}

print(f"Unique classes: {list(label2id.keys())}.")
# There should be 10 unique classes. For each class, there are 30 videos in the training set.

Unique classes: ['ApplyEyeMakeup', 'ApplyLipstick', 'Archery', 'BabyCrawling', 'BalanceBeam', 'BandMarching', 'BaseballPitch', 'Basketball', 'BasketballDunk', 'BenchPress'].
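
To illustrate how the two dictionaries are used, here is a small sketch that maps a predicted class index back to a class name. It uses only three of the class names, and the scores are made-up values standing in for model outputs.

```python
# Three class names from the dataset, sorted so the ids are deterministic.
class_labels = sorted(["BandMarching", "Archery", "BalanceBeam"])
label2id = {label: i for i, label in enumerate(class_labels)}
id2label = {i: label for label, i in label2id.items()}

# Hypothetical per-class scores for one video, in class-index order.
scores = [0.1, 2.3, -0.4]

# Pick the index with the highest score and translate it to a class name.
predicted_id = max(range(len(scores)), key=scores.__getitem__)
predicted_label = id2label[predicted_id]
print(predicted_id, predicted_label)
```

Because the labels are sorted before enumeration, the same set of class names always yields the same integer ids, which keeps the mapping consistent between training and inference.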


# Load Model

In [None]:
# Instantiate a video classification model from a pretrained checkpoint
# The model’s encoder comes with pre-trained parameters, and the classification head is randomly initialised
# Image processor will come in handy when writing the preprocessing pipeline

model_ckpt = "MCG-NJU/videomae-base"
image_processor = VideoMAEImageProcessor.from_pretrained(model_ckpt)
model = VideoMAEForVideoClassification.from_pretrained(
    model_ckpt,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)

# Preprocessing

In [None]:
# Leverage the PyTorchVideo library for preprocessing

# First, import dependencies
from pytorchvideo.transforms import (
    ApplyTransformToKey,
    Normalize,
    RandomShortSideScale,
    RemoveKey,
    ShortSideScale,
    UniformTemporalSubsample,
)

from torchvision.transforms import (
    Compose,
    Lambda,
    RandomCrop,
    RandomHorizontalFlip,
    Resize,
)

# Define some constants
mean = image_processor.image_mean
std = image_processor.image_std
if "shortest_edge" in image_processor.size:
    height = width = image_processor.size["shortest_edge"]
else:
    height = image_processor.size["height"]
    width = image_processor.size["width"]
resize_to = (height, width)

num_frames_to_sample = model.config.num_frames
sample_rate = 4
fps = 30
clip_duration = num_frames_to_sample * sample_rate / fps

# Define the training transformations and the training dataset
# For the training data, use a combination of uniform temporal subsampling, pixel normalization,
# random short-side scaling, random cropping, and random horizontal flipping
train_transform = Compose(
    [
        ApplyTransformToKey(
            key="video",
            transform=Compose(
                [
                    UniformTemporalSubsample(num_frames_to_sample),
                    Lambda(lambda x: x / 255.0),
                    Normalize(mean, std),
                    RandomShortSideScale(min_size=256, max_size=320),
                    RandomCrop(resize_to),
                    RandomHorizontalFlip(p=0.5),
                ]
            ),
        ),
    ]
)

train_dataset = pytorchvideo.data.Ucf101(
    data_path=os.path.join(dataset_root_path, "train"),
    clip_sampler=pytorchvideo.data.make_clip_sampler("random", clip_duration),
    decode_audio=False,
    transform=train_transform,
)

# For the validation and test sets, apply uniform temporal subsampling, pixel normalization,
# and resizing only (no random augmentation)
val_transform = Compose(
    [
        ApplyTransformToKey(
            key="video",
            transform=Compose(
                [
                    UniformTemporalSubsample(num_frames_to_sample),
                    Lambda(lambda x: x / 255.0),
                    Normalize(mean, std),
                    Resize(resize_to),
                ]
            ),
        ),
    ]
)

val_dataset = pytorchvideo.data.Ucf101(
    data_path=os.path.join(dataset_root_path, "val"),
    clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
    decode_audio=False,
    transform=val_transform,
)

test_dataset = pytorchvideo.data.Ucf101(
    data_path=os.path.join(dataset_root_path, "test"),
    clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
    decode_audio=False,
    transform=val_transform,
)

# NB: LabeledVideoDataset is the base class for all video datasets in PyTorchVideo
# If you want to use a custom dataset that PyTorchVideo does not support off the shelf, you can extend the LabeledVideoDataset class
# If your dataset follows a structure similar to the one shown above, pytorchvideo.data.Ucf101() should work just fine
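
The clip duration and the temporal subsampling can be worked through with plain arithmetic. The sketch below assumes `model.config.num_frames` is 16, which is an assumption about the checkpoint and not something shown in this notebook. The `uniform_frame_indices` helper is a NumPy approximation of evenly spaced frame selection, and PyTorchVideo's own implementation may round indices differently.

```python
import numpy as np

num_frames_to_sample = 16  # assumed value of model.config.num_frames
sample_rate = 4
fps = 30

# Each sampled clip spans num_frames * sample_rate raw frames at the given fps.
clip_duration = num_frames_to_sample * sample_rate / fps
frames_in_clip = round(clip_duration * fps)
print(f"clip_duration = {clip_duration:.3f} s, raw frames per clip = {frames_in_clip}")


def uniform_frame_indices(total_frames, num_samples):
    """Pick num_samples evenly spaced frame indices from a clip of total_frames frames."""
    indices = np.linspace(0, total_frames - 1, num_samples)
    return np.clip(indices, 0, total_frames - 1).astype(int)


indices = uniform_frame_indices(frames_in_clip, num_frames_to_sample)
print(indices)
```

Under these assumptions, each clip covers roughly 2.1 seconds of video, and the subsampling keeps 16 of its 64 raw frames, always including the first and the last.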

In [None]:
# Sanity check: inspect the number of videos in the dataset
print(train_dataset.num_videos, val_dataset.num_videos, test_dataset.num_videos)

## Visualisation of training data after processing (for debugging)

In [None]:

def unnormalize_img(img):
    """Un-normalizes the image pixels."""
    img = (img * std) + mean
    # Clip before casting so out-of-range values saturate instead of wrapping around
    img = (img * 255).clip(0, 255)
    return img.astype("uint8")

def create_gif(video_tensor, filename="sample.gif"):
    """Prepares a GIF from a video tensor.
    
    The video tensor is expected to have the following shape:
    (num_frames, num_channels, height, width).
    """
    frames = []
    for video_frame in video_tensor:
        frame_unnormalized = unnormalize_img(video_frame.permute(1, 2, 0).numpy())
        frames.append(frame_unnormalized)
    kwargs = {"duration": 0.25}
    imageio.mimsave(filename, frames, "GIF", **kwargs)
    return filename

def display_gif(video_tensor, gif_name="sample.gif"):
    """Prepares and displays a GIF from a video tensor."""
    video_tensor = video_tensor.permute(1, 0, 2, 3)
    gif_filename = create_gif(video_tensor, gif_name)
    return Image(filename=gif_filename)

sample_video = next(iter(train_dataset))
video_tensor = sample_video["video"]
display_gif(video_tensor)
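
The un-normalization step can be sanity-checked in isolation with a round trip on a random frame. This is a minimal sketch: the mean and standard deviation are the common ImageNet values, assumed here for illustration (in the notebook they come from `image_processor`), and rounding to the nearest integer is a choice made in this sketch.

```python
import numpy as np

# Assumed per-channel statistics (common ImageNet values).
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])


def normalize(frame_uint8):
    """Scale a (height, width, channels) uint8 frame to [0, 1] and standardize it."""
    return (frame_uint8 / 255.0 - mean) / std


def unnormalize(frame):
    """Invert normalize(); clip before casting so values cannot wrap around."""
    pixels = (frame * std + mean) * 255.0
    return np.rint(pixels).clip(0, 255).astype("uint8")


rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
restored = unnormalize(normalize(frame))
print(np.array_equal(restored, frame))

# Out-of-range values saturate at 255 instead of overflowing.
print(unnormalize(np.full((1, 1, 3), 10.0)))
```

Clipping before the `uint8` cast matters because augmented or interpolated frames can fall slightly outside the valid pixel range after un-normalization.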

# Evaluation

In [None]:
# Note: a checkpoint that was itself obtained by fine-tuning on a similar downstream task
# with considerable domain overlap leads to better performance on this task.
# Also note that the original VideoMAE paper uses a different evaluation method;
# here, accuracy is computed from the predicted class of each clip.
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
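
What `compute_metrics` calculates can be reproduced with NumPy alone. The sketch below uses made-up logits and labels, and `accuracy_from_logits` is an illustrative helper that is not part of `evaluate` or `transformers`.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the reference label."""
    predictions = np.argmax(logits, axis=1)
    return float((predictions == labels).mean())


# Hypothetical logits for 4 clips and 3 classes.
logits = np.array(
    [
        [2.0, 0.1, -1.0],
        [0.3, 0.2, 1.5],
        [0.0, 0.9, 0.4],
        [1.2, 1.1, 0.0],
    ]
)
labels = np.array([0, 2, 2, 0])

accuracy = accuracy_from_logits(logits, labels)
print(accuracy)
```

Three of the four predicted classes (0, 2, 1, 0) match the labels, so the accuracy is 0.75.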

# Training