# SAAB-v3 Training on Google Colab

This notebook trains and evaluates SAAB-v3 models on Google Colab with GPU support, using Google Drive to persist data, checkpoints, and logs.

## Setup & Installation


### Cell 1: Installation and Dependencies

Install PyTorch with CUDA support and the project dependencies, then verify the GPU and mount Google Drive.


In [1]:
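# Install PyTorch with CUDA support for Colab.
# Note: the verification output below reports CUDA 12.6, so the PyTorch build
# actually imported was not a cu118 wheel (presumably Colab's preinstalled one).
# Adjust the index URL if a specific CUDA build is required.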
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install project dependencies
!pip install pandas numpy pydantic networkx tqdm scikit-learn matplotlib pyyaml


Looking in indexes: https://download.pytorch.org/whl/cu118

In [16]:
# Verify GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")

# Mount Google Drive for data/checkpoint persistence
from google.colab import drive
drive.mount('/content/drive')
print("\nGoogle Drive mounted successfully!")

CUDA available: True
GPU: NVIDIA A100-SXM4-40GB
CUDA version: 12.6
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Google Drive mounted successfully!


### Cell 2: Project Upload

Upload your project ZIP file and extract it.
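
No code is shown for this step. A minimal sketch of the extraction, assuming the archive is a standard ZIP; `extract_project` and the `/content/project` destination are hypothetical names chosen for illustration, not part of the project:

```python
import zipfile
from pathlib import Path


def extract_project(zip_path, dest_dir):
    """Extract a project ZIP into dest_dir and return the top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        # Collect the first path component of every archive member
        top_level = sorted({name.split("/")[0] for name in archive.namelist() if name})
    return top_level


# In Colab, the archive could come from the upload widget (assumption: one file selected):
#   from google.colab import files
#   uploaded = files.upload()            # dict mapping filename -> bytes
#   zip_name = next(iter(uploaded))
#   print(extract_project(zip_name, "/content/project"))
```

For `python -m saab_v3.train` to resolve later, the directory containing the `saab_v3` package has to be the working directory or on `sys.path`; whether the ZIP unpacks to that layout depends on how it was created.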


### Cell 3: Path Setup

Set up Google Drive paths for data, checkpoints, and logs.


In [17]:
from pathlib import Path

# Google Drive base path
drive_base = Path("/content/drive/MyDrive/SAAB")

# Create directory structure in Drive
directories = [
    drive_base / "dataset" / "raw",
    drive_base / "dataset" / "artifacts",
    drive_base / "checkpoints",
    drive_base / "logs"
]

for directory in directories:
    directory.mkdir(parents=True, exist_ok=True)

print("Google Drive directory structure:")
print(f"  Dataset: {drive_base / 'dataset' / 'raw'}")
print(f"  Artifacts: {drive_base / 'dataset' / 'artifacts'}")
print(f"  Checkpoints: {drive_base / 'checkpoints'}")
print(f"  Logs: {drive_base / 'logs'}")
print("\n✓ Directories created/verified")


Google Drive directory structure:
  Dataset: /content/drive/MyDrive/SAAB/dataset/raw
  Artifacts: /content/drive/MyDrive/SAAB/dataset/artifacts
  Checkpoints: /content/drive/MyDrive/SAAB/checkpoints
  Logs: /content/drive/MyDrive/SAAB/logs

✓ Directories created/verified


## Configuration

Set your training parameters below.


### Cell 4: Training Configuration

Edit these variables to configure your training run.


In [24]:
# Training configuration
dataset_name = "dbpedia"  # Dataset identifier
model_type = "flat"     # "flat", "scratch", or "saab"
experiment_name = None      # Optional: custom experiment name (defaults to {dataset_name}_{model_type})
resume_checkpoint = None    # Optional: path to checkpoint file for resuming training

# Auto-generate experiment name if not provided
if experiment_name is None:
    experiment_name = f"{dataset_name}_{model_type}"

# Display configuration
print("Training Configuration:")
print(f"  Dataset: {dataset_name}")
print(f"  Model Type: {model_type}")
print(f"  Experiment Name: {experiment_name}")
if resume_checkpoint:
    print(f"  Resume from: {resume_checkpoint}")
else:
    print("  Resume from: None (starting fresh)")


Training Configuration:
  Dataset: dbpedia
  Model Type: flat
  Experiment Name: dbpedia_flat
  Resume from: None (starting fresh)
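
A typo in these variables only surfaces once the training subprocess starts, so a quick check beforehand may save a failed run. The sketch below is an assumption-laden illustration: `check_config` is a hypothetical helper, the allowed model types are taken from the comment in the configuration cell, and the `<raw_root>/<dataset_name>/train.csv` layout is inferred from the path printed in the training log rather than from the project's documentation.

```python
from pathlib import Path

# Model types listed in the configuration cell's comment
VALID_MODEL_TYPES = ("flat", "scratch", "saab")


def check_config(dataset_name, model_type, raw_root):
    """Return a list of human-readable problems; an empty list means the config looks usable."""
    problems = []
    if model_type not in VALID_MODEL_TYPES:
        problems.append(f"model_type must be one of {VALID_MODEL_TYPES}, got {model_type!r}")
    # Assumed layout: <raw_root>/<dataset_name>/train.csv
    train_csv = Path(raw_root) / dataset_name / "train.csv"
    if not train_csv.is_file():
        problems.append(f"missing training file: {train_csv}")
    return problems


# Example call with the notebook's values:
#   check_config("dbpedia", "flat", "/content/drive/MyDrive/SAAB/dataset/raw")
```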


## Training

Execute the training command.


### Cell 5: Run Training

This cell runs the training CLI (`saab_v3.train`) as a subprocess and streams its output as it is produced.


In [25]:
import subprocess
import sys

cmd = [
    sys.executable,
    "-u",
    "-m",
    "saab_v3.train",
    "--dataset",
    str(dataset_name),
    "--model",
    str(model_type),
    "--experiment-name",
    str(experiment_name),
]

if resume_checkpoint:
    cmd += ["--resume", str(resume_checkpoint)]

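# Wrap the command in stdbuf so stdout and stderr are line-buffered; stderr is
# merged into stdout below so both streams print in order.
p = subprocess.Popen(
    ["stdbuf", "-oL", "-eL", *cmd],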
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
    bufsize=1,
)

for line in p.stdout:
    print(line, end="")

p.wait()
if p.returncode:
    raise RuntimeError(f"exit code {p.returncode}")


Using Pydantic defaults for configuration...

Preprocessing:
  - vocab_size: 30000
  - max_seq_len: 512
  - device: cuda

Model:
  - d_model: 768
  - num_layers: 4
  - num_heads: 6
  - dropout: 0.2
  - device: cuda

Training:
  - learning_rate: 1e-06
  - batch_size: 64
  - num_epochs: None
  - lr_schedule: reduce_on_plateau
  - max_grad_norm: 0.1
  - early_stopping_patience: 3
  - device: cuda

Fitting preprocessor on training data from /content/drive/MyDrive/SAAB/dataset/raw/dbpedia/train.csv...

Extracting tokens:   0%|          | 0/560000 [00:00<?, ?it/s]
Extracting tokens:   0%|          | 1908/560000 [00:00<00:40, 13790.16it/s]
Extracting tokens:   1%|          | 5937/560000 [00:00<00:20, 27250.15it/s]
Extracting tokens:   2%|▏         | 8875/560000 [00:00<00:25, 21618.82it/s]
Extracting tokens:   2%|▏         | 12696/560000 [00:00<00:27, 19693.87it/s]
Extracting tokens:   3%|▎         | 16734/560000 [00:00<00:21, 24834.08it/s]
[output truncated]

## Evaluation

Evaluate a trained model on the validation or test split.


### Cell 6: Evaluation Configuration

Edit these variables to configure your evaluation run.


In [None]:
# Evaluation configuration
checkpoint_path = "/content/drive/MyDrive/SAAB/checkpoints/dbpedia_flat/best_model.pt"  # Path to checkpoint file
eval_dataset_name = "dbpedia"  # Dataset identifier (usually same as training)
eval_split = "test"  # "val" or "test"
eval_batch_size = 64  # Batch size for evaluation

# Display configuration
print("Evaluation Configuration:")
print(f"  Checkpoint: {checkpoint_path}")
print(f"  Dataset: {eval_dataset_name}")
print(f"  Split: {eval_split}")
print(f"  Batch Size: {eval_batch_size}")
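
The hard-coded checkpoint path above suggests a `checkpoints/<experiment_name>/best_model.pt` layout. Assuming that layout holds (the source does not confirm what the training CLI writes), the path could be derived from the experiment name and checked before launching evaluation. `best_checkpoint` is a hypothetical helper for illustration:

```python
from pathlib import Path


def best_checkpoint(checkpoints_root, experiment_name):
    """Build <checkpoints_root>/<experiment_name>/best_model.pt and fail early if it is absent."""
    path = Path(checkpoints_root) / experiment_name / "best_model.pt"
    if not path.is_file():
        raise FileNotFoundError(f"No checkpoint at {path}; has training finished for this experiment?")
    return path


# Example with the notebook's values:
#   checkpoint_path = best_checkpoint("/content/drive/MyDrive/SAAB/checkpoints", "dbpedia_flat")
```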


### Cell 7: Run Evaluation

This cell runs the evaluation CLI (`saab_v3.evaluate`) as a subprocess and streams its output as it is produced.


In [None]:
import subprocess
import sys

cmd = [
    sys.executable,
    "-u",
    "-m",
    "saab_v3.evaluate",
    "--checkpoint",
    str(checkpoint_path),
    "--dataset-name",
    str(eval_dataset_name),
    "--split",
    str(eval_split),
    "--device",
    "cuda",
    "--batch-size",
    str(eval_batch_size),
]

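# Wrap the command in stdbuf so stdout and stderr are line-buffered; stderr is
# merged into stdout below so both streams print in order.
p = subprocess.Popen(
    ["stdbuf", "-oL", "-eL", *cmd],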
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
    bufsize=1,
)

for line in p.stdout:
    print(line, end="")

p.wait()
if p.returncode:
    raise RuntimeError(f"exit code {p.returncode}")
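
Cells 5 and 7 repeat the same launch-and-stream logic. One way to factor it out is sketched below; `run_streaming` is a hypothetical helper, not part of `saab_v3`. Unlike the cells above, it adds the `stdbuf` prefix only when the tool is found on `PATH`, and it returns the captured lines so they can be inspected afterwards.

```python
import shutil
import subprocess
import sys


def run_streaming(cmd):
    """Run cmd, echo its combined stdout/stderr line by line, and return the lines."""
    # Prefix with stdbuf only when it is installed, so the helper also works without coreutils
    if shutil.which("stdbuf"):
        cmd = ["stdbuf", "-oL", "-eL", *cmd]
    lines = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1,
    ) as proc:
        for line in proc.stdout:
            print(line, end="")
            lines.append(line)
    # Leaving the with-block waits for the process, so returncode is set here
    if proc.returncode:
        raise RuntimeError(f"Command failed with exit code {proc.returncode}")
    return lines
```

Both cells could then build `cmd` as before and call `run_streaming(cmd)`.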
