# Production-Ready VideoMAE Pre-training on Google Colab

This notebook provides a robust and user-friendly workflow to run VideoMAE (Masked Autoencoders for Video) pre-training on a public dataset using Google Colab's resources. It has been refactored for better configuration, persistence, and rapid experimentation.

The process is broken down into four main stages:

1.  **Configuration & Environment Setup**: Setting all parameters in one place, mounting Google Drive for persistent storage, and checking the GPU environment.
2.  **Data Acquisition & Preparation**: Efficiently downloading a video dataset and converting it to the format required by the training script. Includes a "smoke test" for fast validation.
3.  **Execute Training**: Launching the `train.py` script with dynamically configured parameters.
4.  **Cleanup**: An optional step to remove temporary data from the Colab runtime.

## 1. Centralized Configuration

**Instructions**: Adjust all parameters for your training run in the code cell below. This is the only place you should need to make changes.

In [None]:
# --- General Project Setup ---
# After cloning, the root folder will be 'miqai'.
PROJECT_ROOT_DIR = 'miqai'
# The notebook and scripts are located in a subdirectory.
PROJECT_DIR = f'{PROJECT_ROOT_DIR}/video-ai-system'

# --- Google Drive Integration ---
# All outputs (models, logs) will be saved here for persistence.
GDRIVE_MOUNT_PATH = '/content/drive'
GDRIVE_OUTPUT_DIR = f'{GDRIVE_MOUNT_PATH}/MyDrive/VideoAI_Outputs'

# --- Dataset Configuration ---
TFDS_DATASET_NAME = 'ucf101'  # Dataset to download from TensorFlow Datasets
DATA_DIR = '/content/data'      # Local Colab path for raw TFDS download
VIDEO_OUTPUT_DIR = f'{DATA_DIR}/{TFDS_DATASET_NAME}_videos' # Local path for converted videos

# --- Experiment Controls ---
SMOKE_TEST = True  # If True, only processes 50 examples to quickly test the pipeline
SMOKE_TEST_EXAMPLE_COUNT = 50

# --- Training Hyperparameters ---
TRAINING_PARAMS = {
    "total_epochs": 5,
    "batch_size": 4,       # Adjust based on available GPU memory
    "learning_rate": 1.5e-4
    # Add other train.py arguments here as needed
}

# --- Print configuration for verification ---
print(f"✅ Configuration loaded.")
print(f"   Project Root Directory: {PROJECT_ROOT_DIR}")
print(f"   Project Subdirectory: {PROJECT_DIR}")
print(f"   Dataset: {TFDS_DATASET_NAME}")
print(f"   Smoke Test Enabled: {SMOKE_TEST}")
print(f"   Outputs will be saved to: {GDRIVE_OUTPUT_DIR}")

## Authentication for Private Repository

Because the repository is private, a GitHub Personal Access Token (PAT) is required for `git clone` to work.

**Instructions:**

1.  Create a GitHub Personal Access Token (PAT) with `repo` scope. Follow the official guide: [Managing your personal access tokens](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens).
2.  In this Colab notebook, go to the "Secrets" tab (key icon on the left pane).
3.  Add two new secrets:
    *   `GITHUB_USER`: Your GitHub username.
    *   `GITHUB_TOKEN`: The Personal Access Token you just created.

**Security Warning**: Do NOT save your token directly in the notebook code. Always use the secrets manager.

In [None]:
from google.colab import userdata
import os

# Fetch secrets from Colab's secret manager
# IMPORTANT: User must store these secrets in Colab before running.
# Go to "Secrets" (key icon) on the left pane and add:
# 1. GITHUB_USER -> Your GitHub username
# 2. GITHUB_TOKEN -> Your GitHub Personal Access Token
try:
    GITHUB_USER = userdata.get('GITHUB_USER')
    GITHUB_TOKEN = userdata.get('GITHUB_TOKEN')
except userdata.SecretNotFoundError:
    raise userdata.SecretNotFoundError('Please store your GITHUB_USER and GITHUB_TOKEN in Colab secrets.')

# Construct the authenticated Git URL
# This is the correct way to clone a private repo non-interactively.
GIT_REPO_URL_AUTH = f"https://{GITHUB_USER}:{GITHUB_TOKEN}@github.com/m1qaweb/miqai.git"
print("✅ GitHub credentials fetched successfully.")
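
A token-bearing clone URL is easy to leak through logs, and a username or token containing reserved characters can break the URL. The sketch below is one possible way to build the URL defensively. It is not part of the project code: `build_auth_url` is a hypothetical helper, and the `octocat` user and token value are placeholders. It percent-encodes both credentials and returns a redacted copy that is safe to print.

```python
from urllib.parse import quote


def build_auth_url(user, token, repo_path="m1qaweb/miqai.git"):
    """Return (clone_url, redacted_url) for an HTTPS clone that carries a token."""
    # Percent-encode credentials so characters such as '/' or '@' cannot alter the URL.
    safe_user = quote(user, safe="")
    safe_token = quote(token, safe="")
    clone_url = f"https://{safe_user}:{safe_token}@github.com/{repo_path}"
    # The redacted form hides the token and is the only one that should be logged.
    redacted_url = f"https://{safe_user}:***@github.com/{repo_path}"
    return clone_url, redacted_url


# Placeholder credentials for illustration only.
clone_url, redacted_url = build_auth_url("octocat", "ghp_example/token")
print(redacted_url)
```

Only `redacted_url` is printed here; `clone_url` would be handed to `git clone` and kept out of any output.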


## 2. Environment Setup

This section prepares the Colab environment. It performs the following critical steps:

1.  **Mount Google Drive**: This is essential for saving your trained models and logs. Without this, all outputs will be lost when the Colab runtime disconnects.
2.  **Check for GPU**: This confirms that you are using a GPU-accelerated runtime, which is necessary for efficient model training.
3.  **Clone Repository & Install Dependencies**: Clones the project code from GitHub and installs all required Python packages.

In [None]:
from google.colab import drive
import os

print("Mounting Google Drive...")
drive.mount(GDRIVE_MOUNT_PATH, force_remount=True)

# Create the output directory in Google Drive if it doesn't exist
os.makedirs(GDRIVE_OUTPUT_DIR, exist_ok=True)
print(f"Google Drive mounted. Output directory is ready at: {GDRIVE_OUTPUT_DIR}")

In [None]:
# Verify that a GPU is available
print("Checking for GPU...")
!nvidia-smi

***Note on GPU Check***: *If the `!nvidia-smi` command fails or shows no devices, go to `Runtime > Change runtime type` and select a GPU Hardware accelerator (e.g., T4 GPU).*
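
The GPU check can also be done from Python, which makes it possible to stop early instead of reading the `nvidia-smi` table by eye. The following is a minimal sketch using only the standard library; `gpu_summary` is a hypothetical helper name and not part of the project.

```python
import shutil
import subprocess


def gpu_summary() -> list[str]:
    """Return one 'name, total memory' line per visible NVIDIA GPU, or [] if none is usable."""
    # nvidia-smi is missing entirely on CPU-only runtimes.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    # A non-zero exit code means the driver could not talk to a device.
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = gpu_summary()
if gpus:
    print("GPU(s) detected:", "; ".join(gpus))
else:
    print("No GPU detected. Select a GPU under Runtime > Change runtime type.")
```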

In [None]:
import os

# Clean up any previous clones. We check for the root directory created by git.
if os.path.exists(PROJECT_ROOT_DIR):
    print(f"Removing existing project directory: {PROJECT_ROOT_DIR}...")
    !rm -rf {PROJECT_ROOT_DIR}

# Note: `git clone` always fetches the whole repository; a single subdirectory cannot be cloned by URL.
# We clone the entire repository and then change our working directory into the correct subfolder.
# GIT_REPO_URL_AUTH (built in the authentication cell) embeds the access token, so the URL is not printed.
print("Cloning repository...")
!git clone -q {GIT_REPO_URL_AUTH}

# Change directory into the project folder where scripts are located
print(f"Changing directory to {PROJECT_DIR}")
%cd {PROJECT_DIR}

print("\nInstalling dependencies...")
!pip install -q -r scripts/requirements.txt
# imageio needs the imageio-ffmpeg backend to write .avi files
!pip install -q tensorflow tensorflow_datasets imageio imageio-ffmpeg
print("\n✅ Dependencies installed successfully.")

## 3. Data Acquisition and Preparation

This section handles downloading the dataset and preparing it for our training script.

**Why do we need this step?**
The `train.py` script expects video files (like `.mp4` or `.avi`) organized into folders where each folder name corresponds to a class label. Datasets from `tensorflow_datasets` (`tfds`), however, are exposed as `tf.data.Dataset` objects that yield decoded frame arrays and integer labels rather than files on disk. This code bridges that gap by:

1.  **Downloading the data** using the efficient `tfds.load()` function.
2.  **Iterating through the dataset** and saving each video example as a standard video file.
3.  **Organizing the videos** into the required class-based directory structure.

If `SMOKE_TEST` is `True`, only the first `SMOKE_TEST_EXAMPLE_COUNT` examples are converted, so you can verify the rest of the pipeline without waiting for the full conversion. The dataset download itself still runs in full.
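
The class-based layout described above comes down to one path rule: `<output root>/<class name>/<video name>.avi`. The sketch below shows that rule in isolation. `video_output_path` is a hypothetical helper, and the two class names are just an illustrative subset; the bytes check is there because TFDS string features arrive as `bytes` once converted to NumPy.

```python
import os


def video_output_path(root, label_names, label_index, video_name, ext=".avi"):
    """Return <root>/<class_name>/<video_name><ext>, decoding bytes names if needed."""
    if isinstance(video_name, bytes):
        video_name = video_name.decode("utf-8")
    # int() accepts both Python ints and NumPy integer scalars.
    class_name = label_names[int(label_index)]
    return os.path.join(root, class_name, f"{video_name}{ext}")


labels = ["ApplyEyeMakeup", "Archery"]  # illustrative subset of class names
example_path = video_output_path(
    "/content/data/ucf101_videos", labels, 1, b"v_Archery_g01_c01"
)
print(example_path)
```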

In [None]:
import tensorflow_datasets as tfds
import tensorflow as tf
import imageio
import numpy as np
import os
from tqdm.auto import tqdm

print(f"Downloading '{TFDS_DATASET_NAME}' dataset...")

# Use the more direct tfds.load API
dataset, info = tfds.load(
    TFDS_DATASET_NAME,
    split='train',
    with_info=True,
    data_dir=DATA_DIR,
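    # Workaround: skip TLS certificate verification for the dataset download.
    # Remove this argument if the download succeeds without it.
    download_and_prepare_kwargs={'download_config': tfds.download.DownloadConfig(verify_ssl=False)}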
)
label_names = info.features['label'].names

# Apply smoke test if enabled
if SMOKE_TEST:
    print(f"Smoke test is ON. Processing only {SMOKE_TEST_EXAMPLE_COUNT} examples.")
    dataset = dataset.take(SMOKE_TEST_EXAMPLE_COUNT)

os.makedirs(VIDEO_OUTPUT_DIR, exist_ok=True)
print(f"Converting dataset and saving videos to {VIDEO_OUTPUT_DIR}...")

converted_count = 0
class_folders = set()

# Convert tf.data.Dataset to a numpy iterator for easier handling
dataset_numpy = tfds.as_numpy(dataset)

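for example in tqdm(dataset_numpy):
    video_frames = example['video']
    label_index = example['label']
    # Fall back to a generated name when the example has no 'video_name' field.
    # TFDS string features arrive as bytes; the generated fallback is already a str.
    video_name = example.get('video_name', f'video_{converted_count}')
    if isinstance(video_name, bytes):
        video_name = video_name.decode('utf-8')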

    class_name = label_names[label_index]
    class_dir = os.path.join(VIDEO_OUTPUT_DIR, class_name)
    os.makedirs(class_dir, exist_ok=True)
    class_folders.add(class_name)

    output_video_path = os.path.join(class_dir, f"{video_name}.avi")

    # Save the frames as a video file
    try:
        imageio.mimsave(output_video_path, video_frames, fps=25, macro_block_size=1)
        converted_count += 1
    except Exception as e:
        print(f"Could not save {output_video_path}. Error: {e}")

print("\n--- Data Conversion Summary ---")
print(f"✅ Successfully converted {converted_count} videos.")
print(f"✅ Found {len(class_folders)} class folders.")

## 4. Execute Training

Now we are ready to launch the pre-training script. The command below is constructed dynamically using the parameters you defined in the configuration cell.

**Key Improvements**:

- **Dynamic Worker Count**: `--num-workers` is set to the number of CPU cores reported by the Colab runtime, so data loading can use every available core.
- **Centralized Params**: All arguments (`--data-path`, `--total-epochs`, etc.) are read from the configuration variables.
- **Persistent Outputs**: The `--output-dir` is pointed to your Google Drive, ensuring that model checkpoints and logs are saved.
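
Assembling the command as an argument list and quoting it once avoids surprises if a path ever contains spaces. The sketch below is one way to do that with the same flags used in this notebook. `build_train_command` and the `max_workers` cap are assumptions added for illustration, not part of `train.py`.

```python
import os
import shlex


def build_train_command(data_path, output_dir, params, max_workers=4):
    """Build a shell-safe command string for scripts/train.py."""
    # os.cpu_count() may return None; cap the worker count to limit memory use.
    num_workers = min(os.cpu_count() or 1, max_workers)
    argv = [
        "python", "scripts/train.py",
        "--data-path", data_path,
        "--output-dir", output_dir,
        "--total-epochs", str(params["total_epochs"]),
        "--batch-size", str(params["batch_size"]),
        "--lr", str(params["learning_rate"]),
        "--num-workers", str(num_workers),
    ]
    # shlex.join quotes any argument that contains spaces or shell metacharacters.
    return shlex.join(argv)


demo_params = {"total_epochs": 5, "batch_size": 4, "learning_rate": 1.5e-4}
command = build_train_command("/content/data/my videos", "/content/out", demo_params)
print(command)
```

The returned string can be passed to `!{...}` in the same way as the command built in the cell below.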

In [None]:
import os

# Use one data-loading worker per CPU core; os.cpu_count() can return None, so fall back to 1
num_workers = os.cpu_count() or 1
print(f"Using {num_workers} workers for data loading.")

# Construct the training command from the configuration
training_command = (
    f"python scripts/train.py "
    f"--data-path {VIDEO_OUTPUT_DIR} "
    f"--output-dir {GDRIVE_OUTPUT_DIR} "
    f"--total-epochs {TRAINING_PARAMS['total_epochs']} "
    f"--batch-size {TRAINING_PARAMS['batch_size']} "
    f"--lr {TRAINING_PARAMS['learning_rate']} "
    f"--num-workers {num_workers}"
)

print("\n--- Starting Training ---")
print(f"Executing command:\n{training_command}")

# Run the training
!{training_command}

## 5. Cleanup (Optional)

The following cell will delete the downloaded raw data and the converted videos from the local Colab runtime. This is useful for managing disk space, especially if you are working with large datasets.

**Important**: This will NOT delete the model outputs saved in your Google Drive. It only cleans up the temporary files on the Colab machine.
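
Since the cleanup relies on `rm -rf`, a guard against deleting anything under the Drive mount can be reassuring when paths have been edited. The following is a minimal sketch of such a guard in pure Python. `remove_local_dir` is a hypothetical helper, and the default `protected_root` simply mirrors the mount path used in this notebook.

```python
import os
import shutil
import tempfile


def remove_local_dir(path, protected_root="/content/drive"):
    """Delete a local directory tree, refusing anything at or under protected_root."""
    target = os.path.realpath(path)
    protected = os.path.realpath(protected_root)
    # commonpath equals the protected root only when target lies inside it.
    if os.path.commonpath([target, protected]) == protected:
        raise ValueError(f"Refusing to delete {target}: it is inside {protected}")
    if os.path.isdir(target):
        shutil.rmtree(target)
        return True
    return False


# Demo on a throwaway directory so nothing real is touched.
scratch = tempfile.mkdtemp()
removed = remove_local_dir(scratch)
print(f"Removed {scratch}: {removed}")
```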

In [None]:
print("Cleaning up local data directories...")
!rm -rf {DATA_DIR}
!rm -rf {VIDEO_OUTPUT_DIR}
print("✅ Local data directories removed.")