In [41]:
# Transformers installation
! pip install transformers datasets
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git



# Video classification

Video classification is the task of assigning a label or class to an entire video. Each video is expected to have exactly one class. Video classification models take a video as input and return a prediction about which class the video belongs to, so they can be used to categorize what a video is about. A real-world application of video classification is action / activity recognition, which is useful for fitness applications. It is also helpful for vision-impaired individuals, especially when they are commuting.

This guide will show you how to:

1. Fine-tune [VideoMAE](https://huggingface.co/docs/transformers/main/en/model_doc/videomae) on a subset of the [UCF101](https://www.crcv.ucf.edu/data/UCF101.php) dataset.
2. Use your fine-tuned model for inference.

<Tip>
The task illustrated in this tutorial is supported by the following model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[TimeSformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/timesformer), [VideoMAE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/videomae)

<!--End of the generated tip-->

</Tip>

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install -q pytorchvideo transformers evaluate
```

You will use [PyTorchVideo](https://pytorchvideo.org/) (dubbed `pytorchvideo`) to process and prepare the videos.

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. The cell below reads your token from the `HUGGINGFACE_TOKEN` environment variable and saves it for later use:

In [5]:
import os
from huggingface_hub import HfFolder

# Read token from environment variable (more secure)
# You can set this environment variable before running the notebook
# export HUGGINGFACE_TOKEN=your_token_here (Linux/Mac)
# set HUGGINGFACE_TOKEN=your_token_here (Windows)
token = os.getenv("HUGGINGFACE_TOKEN")

if token:
    HfFolder.save_token(token)
    print("Hugging Face token successfully loaded from HUGGINGFACE_TOKEN environment variable.")
else:
    print("HUGGINGFACE_TOKEN environment variable not set. If you want to push models to the Hub, please set this variable before starting Jupyter Lab.")

# Commenting out other options to keep the cell clean
# Option 1: Set token directly in code (not recommended for shared notebooks)
# HfFolder.save_token("your_token_here")

# Option 3: Load token from a file (more secure)
# token_path = "path/to/token.txt"
# if os.path.exists(token_path):
#     with open(token_path, "r") as f:
#         token = f.read().strip()
#     HfFolder.save_token(token)
#     print("Hugging Face token successfully loaded from file.")
# else:
#     print(f"Token file not found at {token_path}")

Hugging Face token successfully loaded from HUGGINGFACE_TOKEN environment variable.


## Load UCF101 dataset

Start by loading a subset of the [UCF-101 dataset](https://www.crcv.ucf.edu/data/UCF101.php). This will give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

In [6]:
import os

# Set the path to the local processed dataset folder
dataset_root_path = "processed_dataset"
all_video_file_paths = []

# Read train, validation and test file paths from CSV files
with open(os.path.join(dataset_root_path, "train.csv"), "r") as f:
    train_paths = [line.strip().split()[0] for line in f.readlines()]
    all_video_file_paths.extend([os.path.join(dataset_root_path, path) for path in train_paths])
    
with open(os.path.join(dataset_root_path, "val.csv"), "r") as f:
    val_paths = [line.strip().split()[0] for line in f.readlines()]
    all_video_file_paths.extend([os.path.join(dataset_root_path, path) for path in val_paths])
    
with open(os.path.join(dataset_root_path, "test.csv"), "r") as f:
    test_paths = [line.strip().split()[0] for line in f.readlines()]
    all_video_file_paths.extend([os.path.join(dataset_root_path, path) for path in test_paths])

print(f"Total video files: {len(all_video_file_paths)}")

Total video files: 3950


Next, verify that the `videos` directory exists inside the dataset folder and check how many files it contains:

In [7]:
import os

# Just verify that the videos directory exists
videos_dir = os.path.join(dataset_root_path, "videos")
if os.path.exists(videos_dir):
    print(f"Videos directory exists at {videos_dir}")
    print(f"Number of video files: {len(os.listdir(videos_dir))}")
else:
    print(f"Warning: Videos directory not found at {videos_dir}")

Videos directory exists at processed_dataset\videos
Number of video files: 3885


At a high level, the dataset is organized like so:

```bash
UCF101_subset/
    train/
        BandMarching/
            video_1.mp4
            video_2.mp4
            ...
        Archery/
            video_1.mp4
            video_2.mp4
            ...
        ...
    val/
        BandMarching/
            video_1.mp4
            video_2.mp4
            ...
        Archery/
            video_1.mp4
            video_2.mp4
            ...
        ...
    test/
        BandMarching/
            video_1.mp4
            video_2.mp4
            ...
        Archery/
            video_1.mp4
            video_2.mp4
            ...
        ...
```

The (`sorted`) video paths appear like so:

```bash
...
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c04.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c06.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g09_c02.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g09_c06.avi'
...
```

You will notice that some video clips belong to the same group / scene, where the group is denoted by `g` in the video file paths. For example, `v_ApplyEyeMakeup_g07_c04.avi` and `v_ApplyEyeMakeup_g07_c06.avi` both come from group `g07`.

Clips from the same group / scene should not be spread across the training, validation, and evaluation splits, because that would cause [data leakage](https://www.kaggle.com/code/alexisbcook/data-leakage). The subset that you are using in this tutorial takes this information into account.
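
To make the group constraint concrete, here is a minimal sketch of a group-aware split. It assumes file names follow the UCF101 pattern `v_<Class>_g<group>_c<clip>.avi`; the helper names `group_of` and `split_by_group` are hypothetical and not part of any library.

```python
import re

# File names following the UCF101 naming pattern shown above.
video_names = [
    "v_ApplyEyeMakeup_g07_c04.avi",
    "v_ApplyEyeMakeup_g07_c06.avi",
    "v_ApplyEyeMakeup_g08_c01.avi",
    "v_ApplyEyeMakeup_g09_c02.avi",
    "v_ApplyEyeMakeup_g09_c06.avi",
]

GROUP_PATTERN = re.compile(r"_g(\d+)_c\d+\.avi$")


def group_of(name):
    """Extract the group id (e.g. '07') from a UCF101-style file name."""
    match = GROUP_PATTERN.search(name)
    if match is None:
        raise ValueError(f"Unexpected file name: {name}")
    return match.group(1)


def split_by_group(names, val_groups):
    """Send every clip of a held-out group to validation, the rest to training."""
    train, val = [], []
    for name in names:
        (val if group_of(name) in val_groups else train).append(name)
    return train, val


train_names, val_names = split_by_group(video_names, val_groups={"09"})

# No group should appear on both sides of the split.
train_groups = {group_of(n) for n in train_names}
val_groups_seen = {group_of(n) for n in val_names}
print("Shared groups:", train_groups & val_groups_seen)
```

Splitting on the group id rather than on individual files is what keeps near-duplicate clips of one scene out of both sets.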

Next up, you will derive the set of labels present in the dataset. Also, create two dictionaries that'll be helpful when initializing the model:

* `label2id`: maps the class names to integers.
* `id2label`: maps the integers to class names.

In [8]:
import os

# Assuming dataset_root_path is defined
# dataset_root_path = "path_to_your_dataset"

# Get labels from CSV files
labels = []

with open(os.path.join(dataset_root_path, "train.csv"), "r") as f:
    for line in f.readlines():
        parts = line.strip().split()
        if len(parts) > 1:
            labels.append(parts[1])

with open(os.path.join(dataset_root_path, "val.csv"), "r") as f:
    for line in f.readlines():
        parts = line.strip().split()
        if len(parts) > 1:
            labels.append(parts[1])
            
with open(os.path.join(dataset_root_path, "test.csv"), "r") as f:
    for line in f.readlines():
        parts = line.strip().split()
        if len(parts) > 1:
            labels.append(parts[1])

# Get unique labels and create mappings
class_labels = sorted(set(labels))
label2id = {label: i for i, label in enumerate(class_labels)}
id2label = {i: label for label, i in label2id.items()}

print(f"Unique classes: {len(label2id)}.")
print(f"Class labels: {class_labels}")

Unique classes: 46.
Class labels: ['A', 'B1', 'B1-0-0', 'B1-B2-0', 'B1-B2-B5', 'B1-B2-B6', 'B1-B2-G', 'B1-B4-0', 'B1-B5-0', 'B1-B5-B6', 'B1-B6-0', 'B1-G-0', 'B2-0-0', 'B2-B1-0', 'B2-B1-B5', 'B2-B5-0', 'B2-B5-G', 'B2-B6-0', 'B2-B6-B1', 'B2-B6-G', 'B2-G-0', 'B2-G-B1', 'B2-G-B6', 'B4-0-0', 'B4-B1-0', 'B4-B1-G', 'B4-B2-0', 'B4-B2-B1', 'B4-B5-B1', 'B5-0-0', 'B5-B1-0', 'B5-B1-B2', 'B5-B2-0', 'B6-0-0', 'B6-B2-0', 'B6-B2-G', 'B6-B4-0', 'B6-G-0', 'B6-G-B2', 'G', 'G-0-0', 'G-B1-0', 'G-B2-0', 'G-B2-B1', 'G-B2-B6', 'G-B6-0']


There are 46 unique classes.

## Load a model to fine-tune

Instantiate a video classification model from a pretrained checkpoint and its associated image processor. The model's encoder comes with pre-trained parameters, and the classification head is randomly initialized. The image processor will come in handy when writing the preprocessing pipeline for our dataset.

In [25]:
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

model_ckpt = "MCG-NJU/videomae-base"
image_processor = VideoMAEImageProcessor.from_pretrained(model_ckpt)
model = VideoMAEForVideoClassification.from_pretrained(
    model_ckpt,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)

Some weights of VideoMAEForVideoClassification were not initialized from the model checkpoint at MCG-NJU/videomae-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


While the model is loading, you might notice the following warning:

```bash
Some weights of the model checkpoint at MCG-NJU/videomae-base were not used when initializing VideoMAEForVideoClassification: [..., 'decoder.decoder_layers.1.attention.output.dense.bias', 'decoder.decoder_layers.2.attention.attention.key.weight']
- This IS expected if you are initializing VideoMAEForVideoClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing VideoMAEForVideoClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of VideoMAEForVideoClassification were not initialized from the model checkpoint at MCG-NJU/videomae-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
```

The warning is telling us we are throwing away some weights (e.g. the weights and biases of the `decoder` layers, which the classification model does not use) and randomly initializing some others (the weights and bias of a new `classifier` layer). This is expected in this case, because we are adding a new head for which we don't have pretrained weights, so the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.

**Note** that [this checkpoint](https://huggingface.co/MCG-NJU/videomae-base-finetuned-kinetics) leads to better performance on this task as the checkpoint was obtained fine-tuning on a similar downstream task having considerable domain overlap. You can check out [this checkpoint](https://huggingface.co/sayakpaul/videomae-base-finetuned-kinetics-finetuned-ucf101-subset) which was obtained by fine-tuning `MCG-NJU/videomae-base-finetuned-kinetics`.

## Prepare the datasets for training

For preprocessing the videos, you will leverage the [PyTorchVideo library](https://pytorchvideo.org/). Start by importing the dependencies we need.

In [26]:
!pip install pytorchvideo



In [27]:
import pytorchvideo.data

from pytorchvideo.transforms import (
    ApplyTransformToKey,
    Normalize,
    RandomShortSideScale,
    RemoveKey,
    ShortSideScale,
    UniformTemporalSubsample,
)

from torchvision.transforms import (
    Compose,
    Lambda,
    RandomCrop,
    RandomHorizontalFlip,
    Resize,
)

For the training dataset transformations, use a combination of uniform temporal subsampling, pixel normalization, random short-side scaling, random cropping, and random horizontal flipping. For the validation and evaluation dataset transformations, keep the temporal subsampling and normalization, replace the random scaling and cropping with a fixed resize, and drop the horizontal flipping. To learn more about the details of these transformations, check out the [official documentation of PyTorchVideo](https://pytorchvideo.org).
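
As a rough illustration of what uniform temporal subsampling does, the sketch below picks evenly spaced frame indices from a clip. It is a simplified stand-in written in plain Python, not PyTorchVideo's implementation, and the function name `uniform_temporal_indices` is hypothetical.

```python
def uniform_temporal_indices(total_frames, num_samples):
    """Return num_samples frame indices spread evenly over [0, total_frames - 1]."""
    if total_frames < 1 or num_samples < 1:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    # Integer arithmetic keeps the first index at 0 and the last at total_frames - 1.
    return [(i * (total_frames - 1)) // (num_samples - 1) for i in range(num_samples)]


# Example: reduce a 64-frame clip to 16 frames.
indices = uniform_temporal_indices(total_frames=64, num_samples=16)
print(indices)
```

The selected frames always cover the whole clip, from the first frame to the last, regardless of how long the clip is.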

Use the `image_processor` associated with the pre-trained model to obtain the following information:

* Image mean and standard deviation with which the video frame pixels will be normalized.
* Spatial resolution to which the video frames will be resized.

Start by defining some constants.

In [28]:
mean = image_processor.image_mean
std = image_processor.image_std
if "shortest_edge" in image_processor.size:
    height = width = image_processor.size["shortest_edge"]
else:
    height = image_processor.size["height"]
    width = image_processor.size["width"]
resize_to = (height, width)

num_frames_to_sample = model.config.num_frames
sample_rate = 4
fps = 30
clip_duration = num_frames_to_sample * sample_rate / fps
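
To see what these constants mean in seconds, here is the same arithmetic worked through with concrete numbers. The value 16 for `model.config.num_frames` is an assumption for illustration; check the configuration of the checkpoint you load.

```python
num_frames_to_sample = 16  # assumed value of model.config.num_frames
sample_rate = 4            # keep one frame out of every 4 source frames
fps = 30                   # frames per second of the source videos

# Sampling 16 frames with a stride of 4 spans 64 source frames.
frames_spanned = num_frames_to_sample * sample_rate
clip_duration = frames_spanned / fps

print(f"{frames_spanned} source frames span {clip_duration:.3f} seconds")
# 64 source frames span 2.133 seconds
```

Under these assumptions, each clip handed to the model covers a little over two seconds of video.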

Now, define the dataset-specific transformations and the datasets respectively. Starting with the training set:

In [29]:
# Define batch size
batch_size = 8

# Define the transforms for training data
train_transform = Compose(
    [
        ApplyTransformToKey(
            key="video",
            transform=Compose(
                [
                    UniformTemporalSubsample(num_frames_to_sample),
                    Lambda(lambda x: x / 255.0),
                    Normalize(mean, std),
                    RandomShortSideScale(min_size=256, max_size=320),
                    RandomCrop(resize_to),
                    RandomHorizontalFlip(p=0.5),
                ]
            ),
        ),
    ]
)

# Helper function to load labeled video paths from a CSV
# This function assumes 'os' module is imported, and 
# 'dataset_root_path' and 'label2id' are defined in the global scope of the notebook.
def load_labeled_video_paths(csv_filename, root_dir_for_csv_paths, label_to_id_map):
    labeled_paths = []
    # Construct the full path to the CSV file
    csv_path = os.path.join(root_dir_for_csv_paths, csv_filename)
    with open(csv_path, "r") as f:
        for line in f.readlines():
            parts = line.strip().split()
            if len(parts) >= 2:
                video_path_in_csv = parts[0]  # e.g., "videos/video_000000.mp4"
                label_str = parts[1]
                
                # Construct the full path to the video file
                # root_dir_for_csv_paths is dataset_root_path (e.g., "processed_dataset")
                full_video_path = os.path.join(root_dir_for_csv_paths, video_path_in_csv)
                
                if label_str in label_to_id_map:
                    label_id = label_to_id_map[label_str]
                    labeled_paths.append((full_video_path, {"label": label_id}))
                else:
                    print(f"Warning: Label '{label_str}' not in label2id map for video {full_video_path}. Skipping.")
            elif line.strip(): # Avoid warning for empty lines if any
                print(f"Warning: Malformed line in {csv_path}: '{line.strip()}'")
    return labeled_paths

# Load labeled video paths for training
# dataset_root_path, label2id, clip_duration, and train_transform are assumed to be defined in previous cells.
labeled_video_paths_train = load_labeled_video_paths("train.csv", dataset_root_path, label2id)

# Create LabeledVideoDataset for training
train_dataset = pytorchvideo.data.LabeledVideoDataset(
    labeled_video_paths=labeled_video_paths_train,
    clip_sampler=pytorchvideo.data.make_clip_sampler("random", clip_duration),
    decode_audio=False,
    transform=train_transform,
)

The same workflow can be applied to the validation and evaluation sets:

In [30]:
val_transform = Compose(
    [
        ApplyTransformToKey(
            key="video",
            transform=Compose(
                [
                    UniformTemporalSubsample(num_frames_to_sample),
                    Lambda(lambda x: x / 255.0),
                    Normalize(mean, std),
                    Resize(resize_to),
                ]
            ),
        ),
    ]
)

# Load labeled video paths for validation
# dataset_root_path, label2id, clip_duration, and val_transform are assumed to be defined in previous cells.
# The load_labeled_video_paths function is defined in the previous cell.
labeled_video_paths_val = load_labeled_video_paths("val.csv", dataset_root_path, label2id)

val_dataset = pytorchvideo.data.LabeledVideoDataset(
    labeled_video_paths=labeled_video_paths_val,
    clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
    decode_audio=False,
    transform=val_transform,
)

# Load labeled video paths for test
labeled_video_paths_test = load_labeled_video_paths("test.csv", dataset_root_path, label2id)

test_dataset = pytorchvideo.data.LabeledVideoDataset(
    labeled_video_paths=labeled_video_paths_test,
    clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
    decode_audio=False,
    transform=val_transform, # Using val_transform for test set as well
)

**Note**: The above dataset pipelines are adapted from the [official PyTorchVideo example](https://pytorchvideo.org/docs/tutorial_classification#dataset). Because the video paths and labels here are read from CSV files, we build [`pytorchvideo.data.labeled_video_dataset.LabeledVideoDataset`](https://pytorchvideo.readthedocs.io/en/latest/api/data/data.html#pytorchvideo.data.LabeledVideoDataset) objects directly rather than calling [`pytorchvideo.data.Ucf101()`](https://pytorchvideo.readthedocs.io/en/latest/api/data/data.html#pytorchvideo.data.Ucf101), which is tailored for the UCF-101 dataset and returns a `LabeledVideoDataset` under the hood. The `LabeledVideoDataset` class is the base class for all things video in the PyTorchVideo dataset. So, if you want to use a custom dataset not supported off-the-shelf by PyTorchVideo, you can extend the `LabeledVideoDataset` class accordingly. Refer to the [`data` API documentation](https://pytorchvideo.readthedocs.io/en/latest/api/data/data.html) to learn more. Also, if your dataset follows the class-per-folder structure shown above, then using `pytorchvideo.data.Ucf101()` should work just fine.

You can access the `num_videos` attribute to know the number of videos in each dataset.

In [31]:
print(train_dataset.num_videos, val_dataset.num_videos, test_dataset.num_videos)

3160 395 395


## Visualize the preprocessed video for better debugging

In [32]:
import imageio
import numpy as np
from IPython.display import Image

def unnormalize_img(img):
    """Un-normalizes the image pixels."""
    img = (img * std) + mean
    # Clip before casting so out-of-range values don't wrap around in uint8.
    img = (img * 255).clip(0, 255).astype("uint8")
    return img

def create_gif(video_tensor, filename="sample.gif"):
    """Prepares a GIF from a video tensor.

    The video tensor is expected to have the following shape:
    (num_frames, num_channels, height, width).
    """
    frames = []
    for video_frame in video_tensor:
        frame_unnormalized = unnormalize_img(video_frame.permute(1, 2, 0).numpy())
        frames.append(frame_unnormalized)
    kargs = {"duration": 0.25}
    imageio.mimsave(filename, frames, "GIF", **kargs)
    return filename

def display_gif(video_tensor, gif_name="sample.gif"):
    """Prepares and displays a GIF from a video tensor."""
    video_tensor = video_tensor.permute(1, 0, 2, 3)
    gif_filename = create_gif(video_tensor, gif_name)
    return Image(filename=gif_filename)

sample_video = next(iter(train_dataset))
video_tensor = sample_video["video"]
display_gif(video_tensor)

<IPython.core.display.Image object>

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/sample_gif.gif" alt="Person playing basketball"/>
</div>

## Train the model

Leverage [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer) from 🤗 Transformers for training the model. To instantiate a `Trainer`, you need to define the training configuration and an evaluation metric. The most important piece is [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), a class that contains all the attributes to configure the training. It requires an output folder name, which will be used to save the checkpoints of the model. It also helps sync all the information in the model repository on 🤗 Hub.

Most of the training arguments are self-explanatory, but one that is quite important here is `remove_unused_columns=False`. When `remove_unused_columns` is `True` (the default), the `Trainer` drops any features not used by the model's call function, which usually makes it easier to unpack inputs into that function. In this case, however, you need the unused features ('video' in particular) in order to create `pixel_values` (a mandatory key our model expects in its inputs), so you set it to `False`.

In [33]:
from transformers import TrainingArguments, Trainer

model_name = model_ckpt.split("/")[-1]
new_model_name = f"{model_name}-finetuned-xd-violence"
num_epochs = 4

args = TrainingArguments(
    new_model_name,
    remove_unused_columns=False,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
    max_steps=(train_dataset.num_videos // batch_size) * num_epochs,
)

The `LabeledVideoDataset` objects created above don't implement the `__len__` method. As such, we must define `max_steps` when instantiating `TrainingArguments`.
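
For reference, the `max_steps` expression can be worked through with the numbers from this notebook. This is a sketch that assumes a single device and that one clip is drawn per training video in each pass over the data.

```python
num_train_videos = 3160  # train_dataset.num_videos, as printed earlier
batch_size = 8
num_epochs = 4

# Integer division drops the final partial batch of each pass.
steps_per_epoch = num_train_videos // batch_size
max_steps = steps_per_epoch * num_epochs

print(f"{steps_per_epoch} steps per epoch, {max_steps} steps in total")
# 395 steps per epoch, 1580 steps in total
```

If you change the batch size or train on several devices, recompute this value so that the number of passes over the data stays what you intend.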

Next, you need to define a function to compute the metrics from the predictions, which will use the `metric` you'll load now. The only preprocessing you have to do is to take the argmax of our predicted logits:

In [34]:
import evaluate

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)

**A note on evaluation**:

In the [VideoMAE paper](https://arxiv.org/abs/2203.12602), the authors use the following evaluation strategy. They evaluate the model on several clips from test videos and apply different crops to those clips and report the aggregate score. However, in the interest of simplicity and brevity, we don't consider that in this tutorial.
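
To give an idea of what aggregating over several clips can look like, here is a simplified sketch that averages per-clip class probabilities and picks the best class for the whole video. It only illustrates the general idea and is not the paper's exact protocol; the logits are made-up numbers and `aggregate_clip_predictions` is a hypothetical helper.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def aggregate_clip_predictions(clip_logits):
    """Average class probabilities over clips; return (best_class_index, mean_probs)."""
    probs_per_clip = [softmax(logits) for logits in clip_logits]
    num_classes = len(probs_per_clip[0])
    mean_probs = [
        sum(probs[c] for probs in probs_per_clip) / len(probs_per_clip)
        for c in range(num_classes)
    ]
    best = max(range(num_classes), key=mean_probs.__getitem__)
    return best, mean_probs


# Made-up logits for three clips of one video over three classes.
clip_logits = [
    [2.0, 0.5, 0.1],
    [0.2, 1.5, 0.3],
    [1.8, 0.4, 0.0],
]
video_prediction, mean_probs = aggregate_clip_predictions(clip_logits)
print("Predicted class index:", video_prediction)
```

Here the second clip on its own would vote for class 1, but the averaged probabilities still favor class 0, which is the kind of smoothing multi-clip evaluation provides.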

Also, define a `collate_fn`, which will be used to batch examples together. Each batch consists of 2 keys, namely `pixel_values` and `labels`.

In [35]:
import torch


def collate_fn(examples):
    # permute to (num_frames, num_channels, height, width)
    pixel_values = torch.stack(
        [example["video"].permute(1, 0, 2, 3) for example in examples]
    )
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}

Then you just pass all of this along with the datasets to `Trainer`:

In [36]:
trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=image_processor,
    compute_metrics=compute_metrics,
    data_collator=collate_fn,
)



You might wonder why you passed along the `image_processor` as a tokenizer when you preprocessed the data already. This is only to make sure the image processor configuration file (stored as JSON) will also be uploaded to the repo on the Hub.

Now fine-tune our model by calling the `train` method:

In [37]:
import torch

In [41]:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
    print(f"Current CUDA device: {torch.cuda.current_device()}")
    print(f"Device name: {torch.cuda.get_device_name(torch.cuda.current_device())}")

print(f"Trainer device: {args.device}")
print(f"Model device: {model.device}")

CUDA available: True
Number of CUDA devices: 1
Current CUDA device: 0
Device name: NVIDIA GeForce RTX 3090
Trainer device: cuda:0
Model device: cuda:0


In [None]:
train_results = trainer.train()

Once training is completed, share your model to the Hub with the [push_to_hub()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) method so everyone can use your model:

In [None]:
trainer.push_to_hub()

## Inference

Great, now that you have fine-tuned a model, you can use it for inference!

Load a video for inference:

In [42]:
sample_test_video = next(iter(test_dataset))


<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/sample_gif_two.gif" alt="Teams playing basketball"/>
</div>

The simplest way to try out your fine-tuned model for inference is to use it in a [`pipeline`](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.VideoClassificationPipeline). Instantiate a `pipeline` for video classification with your model, and pass your video to it:

In [43]:
from transformers import pipeline
import os
import platform

# Load your fine-tuned model for inference
# This should be the directory where your fine-tuned model was saved.
# Based on the training setup, it's "videomae-base-finetuned-xd-violence"
# relative to the notebook's directory.
local_model_directory = "videomae-base-finetuned-xd-violence"
absolute_model_path = os.path.abspath(local_model_directory)

print(f"Attempting to load model from: {absolute_model_path}")
if not os.path.isdir(absolute_model_path):
    print(f"ERROR: Model directory not found at {absolute_model_path}")
    video_cls = None # Prevent further execution if model not found
else:
    print(f"Model directory found. Initializing pipeline...")
    video_cls = pipeline(task="video-classification", model=absolute_model_path)

if video_cls:
    # Use a local test video for inference
    # Ensure dataset_root_path uses raw string for Windows paths
    # The 'r' prefix is crucial here for Windows paths with backslashes
    dataset_root_path = r"D:\BIRKBECK\REPOS\VideoMAEOptimized_fixed\test"
    test_video_filename = "test0.mp4"
    
    # Construct the path
    raw_test_video_path = os.path.join(dataset_root_path, "videos", test_video_filename)
    
    # Normalize the path to resolve any OS-specific issues (like mixed slashes or redundant separators)
    # and to handle potential escape sequence issues more robustly.
    normalized_test_video_path = os.path.normpath(raw_test_video_path)

    print(f"Dataset root path: {dataset_root_path}")
    print(f"Raw video path constructed: {raw_test_video_path}")
    print(f"Normalized video path to be used by av.open: {normalized_test_video_path}")

    if not os.path.exists(normalized_test_video_path):
        print(f"ERROR: Video file not found at {normalized_test_video_path}.")
        print(f"Please ensure the file exists at this exact path.")
        print(f"Expected directory for videos: {os.path.normpath(os.path.join(dataset_root_path, 'videos'))}")
    else:
        print(f"Video file found at {normalized_test_video_path}. Proceeding with classification.")
        try:
            result = video_cls(normalized_test_video_path)
            print("Inference result:")
            print(result)
        except Exception as e:
            print(f"An error occurred during video classification: {e}")
            import traceback
            traceback.print_exc()
else:
    print("Skipping inference due to model loading error.")

Attempting to load model from: D:\BIRKBECK\REPOS\VideoMAEOptimized_fixed\videomae-base-finetuned-xd-violence
Model directory found. Initializing pipeline...


Device set to use cuda:0


Dataset root path: D:\BIRKBECK\REPOS\VideoMAEOptimized_fixed\test
Raw video path constructed: D:\BIRKBECK\REPOS\VideoMAEOptimized_fixed\test\videos\test0.mp4
Normalized video path to be used by av.open: D:\BIRKBECK\REPOS\VideoMAEOptimized_fixed\test\videos\test0.mp4
Video file found at D:\BIRKBECK\REPOS\VideoMAEOptimized_fixed\test\videos\test0.mp4. Proceeding with classification.
Inference result:
[{'score': 0.6736584305763245, 'label': 'A'}, {'score': 0.12989690899848938, 'label': 'B4-0-0'}, {'score': 0.05709938704967499, 'label': 'B1-0-0'}, {'score': 0.04276172071695328, 'label': 'B6-0-0'}, {'score': 0.03424839302897453, 'label': 'G-0-0'}]


You can also manually replicate the results of the `pipeline` if you'd like.

In [44]:
def run_inference(model, video, label=None):
    # permute to (num_frames, num_channels, height, width)
    permuted_video = video.permute(1, 0, 2, 3)
    inputs = {"pixel_values": permuted_video.unsqueeze(0)}
    if label is not None:
        # labels are optional; skip them if you don't have any available.
        inputs["labels"] = torch.tensor([label])

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model = model.to(device)

    # forward pass
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    return logits

Now, pass your input to the model and return the `logits`:

```
>>> logits = run_inference(model, sample_test_video["video"])
```

Decoding the `logits`, we get:

In [45]:
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])


## Evaluate on the full test set

This section evaluates the fine-tuned model on all videos specified in the `test.csv` file.
It calculates and reports both Top-1 and Top-5 accuracy.
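
As a reminder of how these metrics are defined, the sketch below computes top-k accuracy from ranked predictions. The sample rankings and true labels are made up for illustration (the label strings are borrowed from this dataset), and `top_k_accuracy` is a hypothetical helper.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the k highest-ranked predictions."""
    if not true_labels:
        raise ValueError("true_labels must not be empty")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Each inner list is ordered from most to least likely.
ranked_predictions = [
    ["A", "B4-0-0", "B1-0-0"],
    ["B1-0-0", "A", "G-0-0"],
    ["G-0-0", "B6-0-0", "A"],
]
true_labels = ["A", "A", "B2-0-0"]

top1 = top_k_accuracy(ranked_predictions, true_labels, k=1)
top3 = top_k_accuracy(ranked_predictions, true_labels, k=3)
print(f"Top-1: {top1:.2f}, Top-3: {top3:.2f}")
```

Top-5 accuracy works the same way with `k=5`. Note that the comparison is an exact string match, so predicted and true labels must use the same naming scheme.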

In [46]:
import os
from transformers import pipeline
import torch # For checking device

print("Starting evaluation on the full test set...")

# Define paths (relative to the notebook location)
dataset_root_path = "processed_dataset"
test_csv_filename = "test.csv"
test_csv_path = os.path.join(dataset_root_path, test_csv_filename)

local_model_directory = "videomae-base-finetuned-xd-violence"
absolute_model_path = os.path.abspath(local_model_directory)

# Function to load test data (video paths and true labels)
def load_test_data_from_csv(csv_file_path, data_root_path):
    test_samples = []
    if not os.path.exists(csv_file_path):
        print(f"ERROR: Test CSV file not found at {csv_file_path}")
        return test_samples
        
    with open(csv_file_path, "r") as f:
        for line in f.readlines():
            parts = line.strip().split()
            if len(parts) >= 2:
                relative_video_path = parts[0]  # e.g., "videos/video_000000.mp4"
                true_label_str = parts[1]       # e.g., "A", "B1"
                
                # Construct the full path to the video file
                # data_root_path is dataset_root_path (e.g., "processed_dataset")
                full_video_path = os.path.normpath(os.path.join(data_root_path, relative_video_path))
                test_samples.append((full_video_path, true_label_str))
            elif line.strip(): # Avoid warning for empty lines if any
                print(f"Warning: Malformed line in {csv_file_path}: '{line.strip()}'") # Corrected escaping here
    print(f"Loaded {len(test_samples)} samples from {csv_file_path}")
    return test_samples

# Initialize the video classification pipeline
video_cls = None
print(f"Attempting to load model from: {absolute_model_path}")
if not os.path.isdir(absolute_model_path):
    print(f"ERROR: Model directory not found at {absolute_model_path}")
else:
    print(f"Model directory found. Initializing pipeline...")
    try:
        video_cls = pipeline(
            task="video-classification",
            model=absolute_model_path,
            device=0 if torch.cuda.is_available() else -1 # Use GPU if available
        )
        print(f"Pipeline initialized. Using device: {'cuda:0' if torch.cuda.is_available() else 'cpu'}")
    except Exception as e:
        print(f"Error initializing pipeline: {e}")

if video_cls:
    # Load test data
    test_data = load_test_data_from_csv(test_csv_path, dataset_root_path)

    if test_data:
        top1_correct_predictions = 0
        top5_correct_predictions = 0
        total_videos_processed = 0

        print(f"\nStarting inference on {len(test_data)} test videos...")
        for i, (video_path, true_label) in enumerate(test_data):
            if not os.path.exists(video_path):
                print(f"Warning: Video file not found at {video_path}. Skipping.")
                continue

            try:
                # Perform inference
                raw_results = video_cls(video_path)

                if not raw_results:
                    print(f"Warning: No results returned for video {video_path}. Skipping.")
                    continue
                
                # Keep the main part of each of the top 5 predicted labels (e.g., "B4" from "B4-0-0")
                predicted_labels_top5 = [res['label'].split('-')[0] for res in raw_results[:5]]

                if not predicted_labels_top5:
                    print(f"Warning: Could not extract top 5 labels for {video_path}. Skipping.")
                    continue

                # Count the video only once it has usable predictions,
                # so skipped videos do not inflate the denominator
                total_videos_processed += 1
                predicted_label_top1 = predicted_labels_top5[0]

                # Check Top-1 accuracy
                if predicted_label_top1 == true_label:
                    top1_correct_predictions += 1
                
                # Check Top-5 accuracy
                if true_label in predicted_labels_top5:
                    top5_correct_predictions += 1
                
                if (i + 1) % 10 == 0 or (i + 1) == len(test_data): # Print progress
                    print(f"  Processed {i + 1}/{len(test_data)} videos...")

            except Exception as e:
                print(f"An error occurred during processing of {video_path}: {e}")

        # Calculate accuracies
        if total_videos_processed > 0:
            top1_accuracy = (top1_correct_predictions / total_videos_processed) * 100
            top5_accuracy = (top5_correct_predictions / total_videos_processed) * 100
            
            print("\n--- Evaluation Complete ---")
            print(f"Total videos processed: {total_videos_processed}")
            print(f"Top-1 Correct Predictions: {top1_correct_predictions}")
            print(f"Top-5 Correct Predictions: {top5_correct_predictions}")
            print(f"Top-1 Accuracy: {top1_accuracy:.2f}%")
            print(f"Top-5 Accuracy: {top5_accuracy:.2f}%")
        else:
            print("\n--- Evaluation Complete ---")
            print("No videos were processed successfully.")
    else:
        print("No test data loaded. Cannot perform evaluation.")
else:
    print("Video classification pipeline not initialized. Cannot perform evaluation.")

Starting evaluation on the full test set...
Attempting to load model from: D:\BIRKBECK\REPOS\VideoMAEOptimized_fixed\videomae-base-finetuned-xd-violence
Model directory found. Initializing pipeline...


Device set to use cuda:0


Pipeline initialized. Using device: cuda:0
Loaded 395 samples from processed_dataset\test.csv

Starting inference on 395 test videos...


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


  Processed 10/395 videos...
  Processed 20/395 videos...
  Processed 30/395 videos...
  Processed 40/395 videos...
  Processed 50/395 videos...
  Processed 60/395 videos...
  Processed 70/395 videos...
  Processed 80/395 videos...
  Processed 90/395 videos...
  Processed 100/395 videos...
  Processed 110/395 videos...
  Processed 120/395 videos...
  Processed 130/395 videos...
  Processed 140/395 videos...
  Processed 150/395 videos...
  Processed 160/395 videos...
  Processed 170/395 videos...
  Processed 180/395 videos...
  Processed 190/395 videos...
  Processed 200/395 videos...
  Processed 210/395 videos...
  Processed 220/395 videos...
  Processed 230/395 videos...
  Processed 240/395 videos...
  Processed 250/395 videos...
  Processed 260/395 videos...
  Processed 270/395 videos...
  Processed 280/395 videos...
  Processed 290/395 videos...
  Processed 300/395 videos...
  Processed 310/395 videos...
  Processed 320/395 videos...


moov atom not found


An error occurred during processing of processed_dataset\videos\video_000844.mp4: [Errno 1094995529] Invalid data found when processing input: 'processed_dataset\\videos\\video_000844.mp4'; last error log: [mov,mp4,m4a,3gp,3g2,mj2] moov atom not found

--- Evaluation Complete ---
Total videos processed: 329
Top-1 Correct Predictions: 153
Top-5 Correct Predictions: 172
Top-1 Accuracy: 46.50%
Top-5 Accuracy: 52.28%
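
The `moov atom not found` message in the output above is raised by the decoder. In the MP4 container, the `moov` box carries the index needed to play the file, and it is typically missing when a file was truncated or never finalized. The sketch below shows one way to screen files before inference by walking the top-level boxes. The helper names `has_top_level_box` and `looks_like_complete_mp4` are hypothetical (they are not part of any library), and the check only confirms that a `moov` box header is present, not that the video actually decodes.

```python
import io
import struct


def has_top_level_box(stream, box_type=b"moov"):
    """Return True if a top-level box of `box_type` starts in `stream`.

    `stream` is any seekable binary file object. Each MP4 box begins with a
    4-byte big-endian size and a 4-byte type; a size of 1 means a 64-bit
    size follows, and a size of 0 means the box runs to the end of the file.
    """
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return False  # Reached the end without finding the box
        size, kind = struct.unpack(">I4s", header)
        if kind == box_type:
            return True
        if size == 1:
            extended = stream.read(8)
            if len(extended) < 8:
                return False
            payload = struct.unpack(">Q", extended)[0] - 16
        elif size == 0:
            return False  # Last box in the file, and it is not the one we want
        else:
            payload = size - 8
        if payload < 0:
            return False  # Corrupt size field
        stream.seek(payload, io.SEEK_CUR)  # Skip the box contents


def looks_like_complete_mp4(path):
    """Hypothetical convenience wrapper that opens `path` and looks for `moov`."""
    with open(path, "rb") as f:
        return has_top_level_box(f)
```

Under these assumptions, the test list could be filtered before the inference loop, for example with `test_data = [s for s in test_data if looks_like_complete_mp4(s[0])]`, so that unreadable files are reported up front rather than in the middle of the run.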

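
The accuracies above are computed over the videos that were counted as processed (329 of the 395 loaded samples in this run), so videos that failed or were skipped do not appear in the denominator. A minimal sketch of a summary that reports both views side by side follows. The function name `summarize_top_k` and the `(true_label, ranked_labels)` record format are assumptions made for illustration, with `ranked_labels` set to `None` for a video that could not be scored.

```python
def summarize_top_k(results, k_values=(1, 5)):
    """Summarize top-k accuracy over scored and over attempted videos.

    `results` is a list of (true_label, ranked_labels) pairs, where
    `ranked_labels` is a list ordered from most to least likely, or
    None/empty when the video could not be scored.
    """
    attempted = len(results)
    scored = [(true, ranked) for true, ranked in results if ranked]
    summary = {"attempted": attempted, "scored": len(scored)}
    for k in k_values:
        hits = sum(1 for true, ranked in scored if true in ranked[:k])
        # Accuracy among videos that produced predictions
        summary[f"top{k}_over_scored"] = hits / len(scored) if scored else 0.0
        # Accuracy when every failed video is treated as a miss
        summary[f"top{k}_over_attempted"] = hits / attempted if attempted else 0.0
    return summary


# Toy records with made-up labels, only to show the call shape
example = [
    ("A", ["A", "B1"]),
    ("B1", ["A", "B1"]),
    ("B2", None),          # e.g. a video that failed to decode
    ("A", ["B1", "B2"]),
]
print(summarize_top_k(example))
```

Reporting the "over attempted" figure alongside the "over scored" one makes it visible when a large share of the test set was dropped.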