# VideoMAE Pre-training Prototype on HMDB51

This notebook implements the initial experiment for self-supervised pre-training using VideoMAE on the HMDB51 dataset, as outlined in the research plan. We use OpenMMLab's MMAction2 repository as the foundation for this experiment.

## 1. Setup & Environment

First, we clone the MMAction2 repository and install the required dependencies. We also confirm that the Colab runtime has a GPU attached.

In [None]:
import os

# Ensure we are using a GPU runtime
!nvidia-smi

In [None]:
print("Cloning MMAction2 repository...")
!git clone https://github.com/open-mmlab/mmaction2.git
os.chdir('mmaction2')

In [None]:
print("Installing dependencies...")
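# Remove any preinstalled mmcv builds that may conflict
!pip uninstall -y mmcv mmcv-full
# Install the mmcv build for this runtime (here CUDA 11.8 and PyTorch 2.0)
!pip install -q mmcv==2.0.0 -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.0/index.html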

# Install other required packages
!pip install -q decord einops timm

# Install mmaction2 from source
!pip install -e .

## 2. Data Preparation

Next, we download and extract the HMDB51 dataset. The download consists of two archives: the video clips (`hmdb51_org.rar`) and the official train/test split lists (`test_train_splits.rar`).

In [None]:
# Create a data directory
!mkdir -p ../data/hmdb51
os.chdir('../data/hmdb51')

print("Downloading HMDB51 videos...")
# Download the main video files
!wget http://serre-lab.clps.brown.edu/wp-content/uploads/2013/10/hmdb51_org.rar -O hmdb51_org.rar

print("Downloading train/test splits...")
# Download the official train/test splits
!wget http://serre-lab.clps.brown.edu/wp-content/uploads/2013/10/test_train_splits.rar -O test_train_splits.rar

print("Installing unrar...")
!sudo apt-get install -y unrar

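print("Extracting files...")
import glob, subprocess
# 'unrar x' keeps archived paths; the outer archive holds one .rar per action class
!mkdir -p videos
!unrar x hmdb51_org.rar videos/
for class_archive in sorted(glob.glob('videos/*.rar')):
    subprocess.run(['unrar', 'x', '-o+', class_archive, 'videos/'], check=True, stdout=subprocess.DEVNULL)
    os.remove(class_archive)
# Extract the split files into testTrainMulti_7030_splits/
!unrar x test_train_splits.rar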

print("Data preparation complete.")
os.chdir('../../mmaction2') # Go back to mmaction2 directory
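
Before generating annotations, it can help to confirm that extraction produced one directory per action class. The following is a minimal sketch, assuming the layout `videos/<class_name>/*.avi`; the helper name `count_videos_per_class` is hypothetical and not part of MMAction2.

```python
from collections import Counter
from pathlib import Path


def count_videos_per_class(video_root, suffix=".avi"):
    """Return a Counter mapping each class directory name to its number of video files."""
    counts = Counter()
    root = Path(video_root)
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Compare suffixes case-insensitively so .AVI files are counted too
        counts[class_dir.name] = sum(
            1 for f in class_dir.iterdir() if f.suffix.lower() == suffix
        )
    return counts


if __name__ == "__main__":
    root = Path("../data/hmdb51/videos")  # path used in this notebook
    if root.is_dir():
        counts = count_videos_per_class(root)
        print(f"{len(counts)} classes, {sum(counts.values())} videos")
        empty = [name for name, n in counts.items() if n == 0]
        if empty:
            print("Classes with no videos:", empty)
    else:
        print(f"{root} not found; run the extraction cell first.")
```

For HMDB51 the class count should be 51; a different number usually means the per-class archives were not unpacked.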

In [None]:
import re
import random

print("Generating annotation files for MMAction2...")

data_root = '../data/hmdb51/'
video_root = os.path.join(data_root, 'videos')
split_dir = os.path.join(data_root, 'testTrainMulti_7030_splits')
anno_dir = os.path.join(data_root, 'annotations')
os.makedirs(anno_dir, exist_ok=True)

# We will use the first split for this prototype
split_num = 1

def generate_anno_file(split_file_pattern, output_path):
    with open(output_path, 'w') as f_out:
        for class_name in sorted(os.listdir(video_root)):
            split_file = os.path.join(split_dir, f"{class_name}_{split_file_pattern}")
            if not os.path.exists(split_file):
                continue
            
            with open(split_file, 'r') as f_in:
                for line in f_in:
                    video_name, tag = line.strip().split()
                    if int(tag) == 1: # 1 indicates training video for this class
                        video_path = os.path.join('videos', class_name, video_name)
                        # For pre-training, we don't need labels, but mmaction2 expects a placeholder
                        # We also don't need frame counts for this step
                        f_out.write(f"{video_path} -1\n")

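# For pre-training, we typically use the entire training set.
# The official split files are named <class>_test_split<N>.txt, yet each one lists
# training (tag 1), test (tag 2), and unused (tag 0) videos; only tag 1 is kept above.
train_split_pattern = f"test_split{split_num}.txt"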
output_train_anno = os.path.join(anno_dir, 'hmdb51_train_list_videos.txt')
generate_anno_file(train_split_pattern, output_train_anno)

print(f"Annotation file created at: {output_train_anno}")
!echo "Sample from annotation file:"
!head -n 5 {output_train_anno}
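
The tag convention in the split files is easy to misread, so here is a small standalone sketch of how one file's lines break down. It assumes the `<video_name> <tag>` line format used above; the function name `parse_split_lines` and the sample file names are illustrative only.

```python
def parse_split_lines(lines):
    """Group video names by HMDB51 split tag: 1 -> train, 2 -> test, 0 -> unused."""
    groups = {"train": [], "test": [], "unused": []}
    tag_names = {1: "train", 2: "test", 0: "unused"}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        video_name, tag = parts
        groups[tag_names[int(tag)]].append(video_name)
    return groups


# Synthetic lines in the same format as a <class>_test_split1.txt file
sample = [
    "clip_a.avi 1",
    "clip_b.avi 2",
    "clip_c.avi 0",
    "",
    "clip_d.avi 1",
]
groups = parse_split_lines(sample)
print({name: len(videos) for name, videos in groups.items()})
# {'train': 2, 'test': 1, 'unused': 1}
```

Only the `train` group ends up in the pre-training annotation file, which keeps the test videos unseen for any later fine-tuning evaluation.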

## 3. Configuration

We will now create a custom configuration file for the VideoMAE pre-training task. This config inherits from a base VideoMAE config and overrides the data paths and training parameters for our specific experiment. We'll set it for a very short run (3 epochs, the first 2 of which are learning-rate warmup) to verify the pipeline works.

In [None]:
config_content = """_base_ = [
    # _base_ paths are resolved relative to this config file's directory
    '../../_base_/models/videomae_vit-base-p16.py',
    '../../_base_/default_runtime.py'
]

# model settings
model = dict(
    backbone=dict(drop_path_rate=0.1),
    neck=dict(type='VideoMAEPretrainNeck',
        embed_dims=768,
        patch_size=16,
        tube_size=2,
        decoder_embed_dims=384,
        decoder_depth=4,
        decoder_num_heads=6,
        mlp_ratio=4.,
        norm_pix_loss=True),
    head=dict(type='VideoMAEPretrainHead',
        norm_pix_loss=True,
        patch_size=16,
        tube_size=2))

# dataset settings
dataset_type = 'VideoDataset'
data_root = '../data/hmdb51/'
data_prefix = 'videos'
ann_file_train = '../data/hmdb51/annotations/hmdb51_train_list_videos.txt'

train_pipeline = [
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=16, frame_interval=4, num_clips=1),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='RandomResizedCrop', area_range=(0.5, 1.0)),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='FormatShape', input_format='NCTHW'),
    # token grid: 16 frames / tube_size 2 = 8, and 224 / patch_size 16 = 14 per side
    dict(type='MaskingGenerator', mask_window_size=(8, 14, 14), mask_ratio=0.75),
    dict(type='Collect', keys=['imgs', 'mask'], meta_keys=()),
    dict(type='ToTensor', keys=['imgs', 'mask'])]

data = dict(
    videos_per_gpu=8,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=data_root,
        pipeline=train_pipeline))

# optimizer
optimizer = dict(
    type='AdamW',
    lr=1.5e-4,
    betas=(0.9, 0.95),
    weight_decay=0.05)

# learning policy
lr_config = dict(
    policy='CosineAnnealing',
    min_lr=0,
    warmup='linear',
    warmup_by_epoch=True,
    warmup_iters=2)

total_epochs = 3

# runtime settings
work_dir = './work_dirs/videomae_pretrain_hmdb51_prototype'
log_config = dict(interval=10)
"""

config_path = './configs/recognition/videomae/videomae_pretrain_hmdb51_prototype.py'
# Create the target directory in case this checkout does not ship it
os.makedirs(os.path.dirname(config_path), exist_ok=True)
with open(config_path, 'w') as f:
    f.write(config_content)

print(f"Configuration file created at: {config_path}")
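
To make the masking settings concrete, the sketch below works out the token grid implied by the config values (16 frames, tube size 2, 224-pixel crops, patch size 16) and builds a tube mask, in which the same spatial positions are hidden in every temporal slice. This is an illustrative stand-in under those assumptions, not the implementation behind the `MaskingGenerator` transform named in the config; `token_grid` and `tube_mask` are hypothetical helper names.

```python
import random


def token_grid(num_frames=16, tube_size=2, image_size=224, patch_size=16):
    """Return (temporal, height, width) token counts for one clip."""
    return (num_frames // tube_size, image_size // patch_size, image_size // patch_size)


def tube_mask(grid, mask_ratio=0.75, seed=None):
    """Return a flat list of 0/1 flags (1 = masked), identical for every temporal slice."""
    t, h, w = grid
    per_slice = h * w
    num_masked = int(mask_ratio * per_slice)
    slice_mask = [1] * num_masked + [0] * (per_slice - num_masked)
    random.Random(seed).shuffle(slice_mask)
    # Repeating the same spatial mask along time is what makes it a "tube"
    return slice_mask * t


grid = token_grid()
mask = tube_mask(grid, seed=0)
print(grid, len(mask), sum(mask))  # (8, 14, 14) 1568 1176
```

With a 0.75 mask ratio the encoder sees only 392 of the 1568 tokens per clip, which is where most of the pre-training speedup comes from.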

## 4. Run Pre-training

With the data and configuration ready, we can now launch the pre-training script. We'll use a single GPU for this experiment.

In [None]:
# --validate is omitted because the config above defines only a training set
!python tools/train.py \
    ./configs/recognition/videomae/videomae_pretrain_hmdb51_prototype.py \
    --work-dir ./work_dirs/videomae_pretrain_hmdb51_prototype \
    --seed 42 \
    --deterministic \
    --gpu-ids 0
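
It can be useful to see what the learning-rate settings from Section 3 imply for such a short run. The sketch below is a simplified, per-epoch approximation of cosine annealing with linear warmup. It is not MMCV's scheduler hook, and the `warmup_ratio=0.1` starting fraction is an assumed value that the config does not set explicitly.

```python
import math


def lr_at_epoch(epoch, base_lr=1.5e-4, min_lr=0.0, total_epochs=3,
                warmup_epochs=2, warmup_ratio=0.1):
    """Cosine-annealed learning rate with linear warmup, evaluated at an epoch boundary."""
    progress = epoch / total_epochs
    lr = min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
    if epoch < warmup_epochs:
        # Scale the annealed rate up linearly from warmup_ratio to 1
        k = (1 - epoch / warmup_epochs) * (1 - warmup_ratio)
        lr *= 1 - k
    return lr


for epoch in range(4):
    print(epoch, f"{lr_at_epoch(epoch):.2e}")
```

Under these assumptions the rate never reaches the nominal 1.5e-4 peak, because warmup covers two of the three epochs. That is acceptable for a pipeline check but worth revisiting before a full run.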


## 5. Results & Validation (Placeholder)

After a full training run, this section would involve:
1.  **Analyzing Loss Curves:** Plotting the training loss from the log files (e.g., using TensorBoard) to ensure the model was learning effectively.
2.  **Visualizing Reconstructions:** Running inference on a few validation videos and visualizing the model's masked reconstructions to qualitatively assess its understanding of motion and appearance.
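
As a starting point for the loss-curve analysis, the sketch below extracts training losses from a JSON-lines log. The log format is an assumption (one JSON object per line with `mode`, `epoch`, `iter`, and `loss` keys); check the files in the work directory before relying on it. The sample records are synthetic, and `read_train_losses` is a hypothetical helper name.

```python
import json


def read_train_losses(lines):
    """Extract (epoch, iter, loss) tuples from JSON-lines records logged in train mode."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(entry, dict) and entry.get("mode") == "train" and "loss" in entry:
            records.append((entry.get("epoch"), entry.get("iter"), float(entry["loss"])))
    return records


# Synthetic records for illustration only; these are not results from a real run
sample_log = [
    '{"env_info": "placeholder"}',
    '{"mode": "train", "epoch": 1, "iter": 10, "loss": 1.25}',
    '{"mode": "train", "epoch": 1, "iter": 20, "loss": 1.10}',
]
for epoch, iteration, loss in read_train_losses(sample_log):
    print(epoch, iteration, loss)
```

The resulting tuples can be passed to any plotting library to check that the reconstruction loss trends downward.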

## 6. ONNX Export

Finally, we export the trained ViT encoder backbone to the ONNX format. This makes the model portable and ready for deployment in various inference environments. We will use the checkpoint from the last epoch of our short training run.

In [None]:
import torch
from mmaction.apis import init_recognizer

print("Exporting model to ONNX...")

# Path to the config file we created
config_file = './configs/recognition/videomae/videomae_pretrain_hmdb51_prototype.py'

# Path to the checkpoint file from the training run
# NOTE: MMAction2 saves checkpoints as epoch_X.pth
checkpoint_file = './work_dirs/videomae_pretrain_hmdb51_prototype/epoch_3.pth'
output_file = '../video_mae_encoder.onnx'

# Build the model from a config file and a checkpoint file
model = init_recognizer(config_file, checkpoint_file, device='cpu')

# We only want to export the encoder (backbone)
encoder = model.backbone

# Create a dummy input with the expected shape
# (batch_size, num_channels, num_frames, height, width)
dummy_input = torch.randn(1, 3, 16, 224, 224)

# Export directly with torch.onnx.export, keeping the batch dimension dynamic
torch.onnx.export(
    encoder,
    dummy_input,
    output_file,
    input_names=['input'],
    output_names=['output'],
    opset_version=11,
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)

print(f"ONNX model saved to: {output_file}")

In [None]:
!ls -lh ../
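
The export cell hard-codes `epoch_3.pth`, which breaks as soon as `total_epochs` changes. A minimal sketch for picking the last-epoch checkpoint automatically is shown below, assuming the `epoch_N.pth` naming noted above; the helper name `latest_epoch_checkpoint` is hypothetical.

```python
import re
from pathlib import Path


def latest_epoch_checkpoint(work_dir):
    """Return the epoch_N.pth path with the largest N in work_dir, or None if absent."""
    pattern = re.compile(r"epoch_(\d+)\.pth$")
    best = None
    for path in Path(work_dir).glob("epoch_*.pth"):
        match = pattern.match(path.name)
        # Compare epoch numbers as integers so epoch_10 ranks above epoch_9
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), path)
    return best[1] if best else None


print(latest_epoch_checkpoint("./work_dirs/videomae_pretrain_hmdb51_prototype"))
```

If this prints `None`, training did not write a checkpoint and the export step should be skipped.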
