# VideoMAE Pre-training (Production-Scale) on Something-Something-V2

This notebook runs full-scale self-supervised pre-training with VideoMAE on the **Something-Something-V2 (SSv2)** dataset. With roughly 220,000 videos across 174 action classes, SSv2 is far larger than HMDB51 (about 7,000 clips in 51 classes), and its classes are defined by motion rather than appearance, which makes it well suited to training a general-purpose video understanding model.

**Key changes from the prototype:**
- **Dataset:** Switched from HMDB51 to the large-scale Something-Something-V2 dataset.
- **Training Duration:** Increased from 3 epochs to 800 epochs for comprehensive learning.
- **Learning Rate Schedule:** Adjusted warmup period to 40 epochs to match the longer training cycle.
- **Performance:** Enabled mixed-precision training (`fp16`) to accelerate the process and reduce GPU memory footprint.
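
To make the schedule concrete, here is a minimal sketch of a per-epoch learning rate with a 40-epoch linear warmup followed by cosine annealing over the remaining 760 epochs. The base rate of 1.5e-4 matches the optimizer setting used later in this notebook. The function name `lr_at_epoch` and the warmup starting point are assumptions for illustration, and the training framework's own scheduler may differ in detail (for example, in its starting warmup ratio).

```python
import math

def lr_at_epoch(epoch, base_lr=1.5e-4, warmup_epochs=40,
                total_epochs=800, min_lr=0.0):
    """Illustrative schedule: linear warmup, then cosine annealing."""
    if epoch < warmup_epochs:
        # Ramp linearly so the last warmup epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

for e in (0, 39, 40, 420, 800):
    print(f"epoch {e:3d}: lr = {lr_at_epoch(e):.3e}")
```

With these assumptions the rate peaks at the end of warmup, falls to half the base rate at epoch 420 (the midpoint of the cosine phase), and reaches zero at epoch 800.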

## 1. Setup & Environment

First, we clone the MMAction2 repository and install the required dependencies. We also confirm that the Colab runtime has a GPU attached.

In [None]:
import os

# Ensure we are using a GPU runtime
!nvidia-smi

In [None]:
print("Cloning MMAction2 repository...")
!git clone https://github.com/open-mmlab/mmaction2.git
os.chdir('mmaction2')

In [None]:
print("Installing dependencies...")
# Uninstall existing incompatible versions
!pip uninstall mmcv -y
!pip uninstall mmcv-full -y

# Install the correct version of mmcv
!pip install mmcv==2.1.0 -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.0/index.html

# Install other required packages
!pip install -q decord einops timm

# Install mmaction2 from source
!pip install -e .
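
Version mismatches between PyTorch, CUDA, and mmcv are a common source of import errors, so it can help to confirm what actually got installed before moving on. The helper below is a small sketch that uses only the standard library. `installed_version` is a hypothetical name, and the package list simply mirrors the installs above.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Report each dependency installed in the cells above.
for name in ("torch", "mmcv", "mmaction2", "decord", "einops", "timm"):
    print(f"{name:10s} {installed_version(name) or 'not installed'}")
```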

## 2. Data Preparation

Next, we download and extract the Something-Something-V2 dataset. This is a large dataset, so the download and extraction process will take a significant amount of time and disk space.
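
Because the archive and its extracted contents both have to fit on disk, a quick free-space check can save a failed download. The sketch below assumes the archive is about 20 GB and that keeping the archive plus the extracted videos needs roughly twice that. The 2x headroom factor and the helper names are assumptions, not measured values.

```python
import shutil

def free_space_gb(path="."):
    """Free space on the filesystem containing `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9

def has_room(path=".", archive_gb=20, headroom=2.0):
    """True if free space covers the archive plus extracted files (assumed ~2x)."""
    return free_space_gb(path) >= archive_gb * headroom

print(f"Free space: {free_space_gb():.1f} GB, enough: {has_room()}")
```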

In [None]:
# Create a data directory
!mkdir -p ../data/ssv2
os.chdir('../data/ssv2')

In [None]:
print("Downloading Something-Something-V2 videos (20GB). This will take a while...")
# Download the main video files (Note: The original file is a .zip, not .tgz)
!wget https://s3.amazonaws.com/something-something-v2/20bn-something-something-v2-video.zip -O 20bn-something-something-v2-video.zip

In [None]:
print("Downloading train/validation splits and labels...")
# Download the official train/validation splits and labels
!wget https://s3.amazonaws.com/something-something-v2/20bn-something-something-v2-labels.json -O labels.json
!wget https://s3.amazonaws.com/something-something-v2/20bn-something-something-v2-train.json -O train.json
!wget https://s3.amazonaws.com/something-something-v2/20bn-something-something-v2-validation.json -O validation.json

In [None]:
print("Extracting video files...")
# Extract the video files
!unzip 20bn-something-something-v2-video.zip -d videos/

In [None]:
print("Data preparation complete.")
os.chdir('../../mmaction2') # Go back to mmaction2 directory

### Generate Annotation Files

We now parse the downloaded JSON files to create annotation lists in the format required by MMAction2 for pre-training. For self-supervised learning, we only need the video paths, not the labels.

In [None]:
import json

print("Generating annotation files for MMAction2...")

data_root = '../data/ssv2/'
video_root = os.path.join(data_root, 'videos')
anno_dir = os.path.join(data_root, 'annotations')
os.makedirs(anno_dir, exist_ok=True)

def generate_ssv2_anno(json_path, output_path):
    with open(json_path, 'r') as f:
        data = json.load(f)
    
    with open(output_path, 'w') as f_out:
        for item in data:
            video_id = item['id']
            video_path = os.path.join('videos', f"{video_id}.webm")
            # For pre-training, we don't need labels, but mmaction2 expects a placeholder (-1)
            f_out.write(f"{video_path} -1\n")

# We use the training set for pre-training
train_json_path = os.path.join(data_root, 'train.json')
output_train_anno = os.path.join(anno_dir, 'ssv2_train_list_videos.txt')
generate_ssv2_anno(train_json_path, output_train_anno)

print(f"Annotation file created at: {output_train_anno}")

In [None]:
!echo "Sample from annotation file:"
!head -n 5 {output_train_anno}
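
Before training, it is worth checking that every path in the annotation list points to a video that was actually extracted, since a missing file would otherwise surface only when the data loader reaches it. The following is a minimal sketch under the assumption that each line has the form `<relative_path> <label>`, as written by `generate_ssv2_anno`. The helper name `find_missing_videos` is hypothetical.

```python
import os

def find_missing_videos(anno_path, data_prefix):
    """Return annotation entries whose video file is absent under data_prefix."""
    missing = []
    with open(anno_path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Each line is "<relative_path> <label>"; split the label off from the right.
            rel_path = line.rsplit(' ', 1)[0]
            if not os.path.isfile(os.path.join(data_prefix, rel_path)):
                missing.append(rel_path)
    return missing
```

In this notebook it could be called as `find_missing_videos(output_train_anno, data_root)`; an empty list means every listed video was found.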

## 3. Configuration

We will now create a custom configuration file for the VideoMAE pre-training task on SSv2. This config inherits from the base VideoMAE config and overrides data paths and training parameters for our production-scale run.

In [None]:
# NOTE: `_base_` paths are resolved relative to the config file's own directory
# (configs/recognition/videomae/), so they point two levels up to configs/_base_/.
config_content = """_base_ = [
    '../../_base_/models/videomae_vit-base-p16.py',
    '../../_base_/default_runtime.py'
]

# model settings
model = dict(
    backbone=dict(drop_path_rate=0.1),
    neck=dict(type='VideoMAEPretrainNeck',
        embed_dims=768,
        patch_size=16,
        tube_size=2,
        decoder_embed_dims=384,
        decoder_depth=4,
        decoder_num_heads=6,
        mlp_ratio=4.,
        norm_pix_loss=True),
    head=dict(type='VideoMAEPretrainHead',
        norm_pix_loss=True,
        patch_size=16,
        tube_size=2))

# dataset settings
dataset_type = 'VideoDataset'
data_root = '../data/ssv2/'
ann_file_train = '../data/ssv2/annotations/ssv2_train_list_videos.txt'

train_pipeline = [
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=16, frame_interval=4, num_clips=1),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='RandomResizedCrop', area_range=(0.5, 1.0)),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='MaskingGenerator', mask_window_size=(8, 7, 7), mask_ratio=0.75),
    dict(type='Collect', keys=['imgs', 'mask'], meta_keys=()),
    dict(type='ToTensor', keys=['imgs', 'mask'])]

data = dict(
    videos_per_gpu=8,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=data_root,
        pipeline=train_pipeline))

# optimizer
optimizer = dict(
    type='AdamW',
    lr=1.5e-4,
    betas=(0.9, 0.95),
    weight_decay=0.05)

# learning policy
lr_config = dict(
    policy='CosineAnnealing',
    min_lr=0,
    warmup='linear',
    warmup_by_epoch=True,
    warmup_iters=40) # Increased warmup for longer run

total_epochs = 800 # Production-scale training duration

# runtime settings
work_dir = './work_dirs/videomae_pretrain_ssv2_production'
log_config = dict(interval=50)

# enable mixed-precision training
fp16 = dict(loss_scale='dynamic')
"""

config_path = './configs/recognition/videomae/videomae_pretrain_ssv2_production.py'
with open(config_path, 'w') as f:
    f.write(config_content)

print(f"Configuration file created at: {config_path}")
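
The masking settings are easier to sanity-check with the token arithmetic written out. Assuming the encoder splits each clip into non-overlapping cubes of `tube_size` frames by `patch_size` x `patch_size` pixels, the sketch below computes the token grid for the 16-frame, 224x224 clips used here and the number of tokens hidden at a 0.75 mask ratio. The helper names are hypothetical. If the masking window is meant to span the whole token grid, its spatial extent for these settings would be 14x14.

```python
def token_grid(num_frames=16, image_size=224, tube_size=2, patch_size=16):
    """Number of tokens along (time, height, width) after cube embedding."""
    return (num_frames // tube_size,
            image_size // patch_size,
            image_size // patch_size)

def masked_token_count(grid, mask_ratio=0.75):
    """Total tokens in the grid and how many are hidden at the given ratio."""
    total = grid[0] * grid[1] * grid[2]
    return total, int(total * mask_ratio)

grid = token_grid()
total, masked = masked_token_count(grid)
print(grid, total, masked)  # (8, 14, 14) 1568 1176
```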

## 4. Run Pre-training

With the data and configuration ready, we can now launch the pre-training script. This will be a long-running job.

In [None]:
# The config defines no validation dataset, so the run is launched without validation.
!python tools/train.py \
    ./configs/recognition/videomae/videomae_pretrain_ssv2_production.py \
    --work-dir ./work_dirs/videomae_pretrain_ssv2_production \
    --seed 42 \
    --deterministic \
    --gpu-ids 0
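
For a rough sense of how long the job is, the number of optimizer steps can be estimated from the dataset size and batch size. The sketch below assumes the official SSv2 training split of 168,913 videos, the `videos_per_gpu=8` setting from the config, and a single GPU as implied by `--gpu-ids 0`. The helper name is hypothetical, and the estimate ignores any videos skipped at load time.

```python
import math

def training_iterations(num_videos, videos_per_gpu=8, num_gpus=1, epochs=800):
    """Iterations per epoch and in total, assuming the last partial batch is kept."""
    per_epoch = math.ceil(num_videos / (videos_per_gpu * num_gpus))
    return per_epoch, per_epoch * epochs

# 168,913 is the size of the official SSv2 training split.
per_epoch, total = training_iterations(168_913)
print(f"{per_epoch:,} iterations/epoch, {total:,} iterations in total")
```

Under these assumptions the run comes to 21,115 iterations per epoch and about 16.9 million iterations overall, which is why a multi-GPU setup or a larger per-GPU batch is worth considering.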

## 5. Results & Validation (Placeholder)

After the full 800-epoch training run, this section would involve:

1.  **Analyzing Loss Curves:** Plotting the training loss from the log files (e.g., using TensorBoard) to ensure the model was learning effectively over the long run.
2.  **Visualizing Reconstructions:** Running inference on a few validation videos and visualizing the model's masked reconstructions to qualitatively assess its understanding of motion and appearance.
3.  **Downstream Task Evaluation:** The ultimate validation is fine-tuning this pre-trained model on a downstream task (like action classification) and measuring its performance improvement compared to training from scratch.
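
As a starting point for the loss-curve analysis, the sketch below averages the training loss per epoch from a JSON-lines log. It assumes each training record is a JSON object with `mode`, `epoch`, and `loss` fields; those field names, the helper name, and the sample records are assumptions for illustration and should be adapted to whatever the training run actually writes.

```python
import json
from collections import defaultdict

def mean_loss_per_epoch(log_lines):
    """Average the 'loss' field of training records, grouped by epoch."""
    sums, counts = defaultdict(float), defaultdict(int)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Skip header/metadata lines and any non-training records.
        if record.get('mode') != 'train' or 'loss' not in record:
            continue
        sums[record['epoch']] += record['loss']
        counts[record['epoch']] += 1
    return {epoch: sums[epoch] / counts[epoch] for epoch in sorted(sums)}

# Tiny synthetic log for illustration (not real training output).
sample = [
    '{"env_info": "..."}',
    '{"mode": "train", "epoch": 1, "iter": 50, "loss": 1.0}',
    '{"mode": "train", "epoch": 1, "iter": 100, "loss": 0.5}',
    '{"mode": "train", "epoch": 2, "iter": 50, "loss": 0.25}',
]
print(mean_loss_per_epoch(sample))  # {1: 0.75, 2: 0.25}
```

The resulting dictionary can be passed straight to a plotting library to check that the reconstruction loss keeps decreasing over the 800 epochs.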

## 6. ONNX Export

Finally, we export the trained ViT encoder backbone to the ONNX format. This makes the model portable and ready for deployment in various inference environments. We will use the checkpoint from the final epoch of our training run.

In [None]:
import torch
from mmaction.apis import init_recognizer

print("Exporting model to ONNX...")

# Path to the config file we created
config_file = './configs/recognition/videomae/videomae_pretrain_ssv2_production.py'

# Path to the checkpoint file from the training run
# NOTE: MMAction2 saves checkpoints as epoch_X.pth
checkpoint_file = './work_dirs/videomae_pretrain_ssv2_production/epoch_800.pth'
output_file = '../video_mae_encoder_ssv2_production.onnx'

# Build the model from a config file and a checkpoint file
model = init_recognizer(config_file, checkpoint_file, device='cpu')

# We only want to export the encoder (backbone)
encoder = model.backbone

# Create a dummy input with the expected shape
# (batch_size, num_channels, num_frames, height, width)
dummy_input = torch.randn(1, 3, 16, 224, 224)

# Export with PyTorch's built-in ONNX exporter
torch.onnx.export(
    encoder,
    dummy_input,
    output_file,
    input_names=['input'],
    output_names=['output'],
    opset_version=11,
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)

print(f"ONNX model saved to: {output_file}")

In [None]:
!ls -lh ../