# 🚀 Transformer Training & Fine-Tuning Notebook

**Professional ML Training Environment** for transformer models exported from [Transformer Builder](https://transformer-builder.com).

## Quick Start Modes

| Mode | Epochs | Time | Use Case |
|------|--------|------|----------|
| **⚡ Fast** | 3 | ~5 min | Quick validation |
| **⚖️ Balanced** | 10 | ~15 min | Development |
| **💎 Quality** | 20 | ~45 min | Production |

## Features
- ✅ 5 Data Sources (HuggingFace, Drive, Upload, Local, Synthetic)
- ✅ Live Training Visualization
- ✅ Google Drive Checkpoints
- ✅ W&B + Local SQLite Tracking
- ✅ Hyperparameter Search
- ✅ Export & Comparison Tools

**📌 Tip**: Run all cells in order for best results. Adjust hyperparameters in Section 4.

## 📋 Table of Contents

1. [Section 0: Quick Start](#section-0) ← You are here
2. [Section 1: Setup & Drive Workspace](#section-1) (2 min)
3. [Section 2: Model Loading](#section-2) (Load custom or example model)
4. [Section 3: Data Loading](#section-3) (5 sources)
5. [Section 4: Training Configuration](#section-4) (Hyperparameters)
6. [Section 5: W&B Tracking Setup](#section-5) (Optional)
7. [Section 6: Training Loop](#section-6) (Main training)
8. [Section 7: Analysis & Visualization](#section-7) (Dashboards)
9. [Section 8: Export & Results](#section-8) (Download checkpoints)
10. [Section 9: Advanced Features](#section-9) (Hyperparameter search)

⏱️ **Total Time**: ~20-60 minutes depending on mode


## 📦 Requirements

This notebook requires:
- Python >= 3.10
- PyTorch (pre-installed in Colab)
- Transformer Builder utilities (auto-downloaded)

**GPU Recommended** but not required. Training will auto-detect and use GPU if available.

---
<a id="section-1"></a>

In [None]:
# Install training dependencies
!pip install -q -r https://raw.githubusercontent.com/matt-hans/transformer-builder-colab-templates/main/requirements-training.txt

print("✅ Dependencies installed")

In [None]:
import os

print("📥 Downloading training utilities...")

# Remove old utils directory if exists
!rm -rf utils/

# Download complete utils package from GitHub
!git clone --depth 1 --branch main https://github.com/matt-hans/transformer-builder-colab-templates.git temp_repo 2>/dev/null

# Copy utils directory
!cp -r temp_repo/utils ./

# Cleanup
!rm -rf temp_repo

# Verify package structure
utils_path = os.path.join(os.getcwd(), 'utils')
if os.path.exists(utils_path):
    print("✅ Utils package downloaded")

    # Verify training subdirectory
    training_path = os.path.join(utils_path, 'training')
    if os.path.exists(training_path):
        n_files = len([f for f in os.listdir(training_path) if f.endswith('.py')])
        print(f"✅ Training utilities: {n_files} modules found")

    # Verify tier3 utilities
    tier3_path = os.path.join(utils_path, 'tier3_training_utilities.py')
    if os.path.exists(tier3_path):
        print("✅ Tier 3 training utilities ready")
else:
    print("❌ Failed to download utils package")
    raise RuntimeError("Could not download training utilities")

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Create workspace folders
workspace_root = '/content/drive/MyDrive/TransformerTraining'
os.makedirs(f'{workspace_root}/checkpoints', exist_ok=True)
os.makedirs(f'{workspace_root}/configs', exist_ok=True)
os.makedirs(f'{workspace_root}/results', exist_ok=True)
os.makedirs(f'{workspace_root}/datasets', exist_ok=True)

print(f"✅ Workspace created at: {workspace_root}")
print("   📁 checkpoints/ - Saved model weights")
print("   📁 configs/ - Training configurations")
print("   📁 results/ - Metrics, plots, dashboards")
print("   📁 datasets/ - Cached datasets")

In [None]:
from utils.training.experiment_db import ExperimentDB

# Initialize local SQLite tracking (backup to W&B)
db = ExperimentDB(f'{workspace_root}/experiments.db')

print("✅ Experiment database initialized")
print(f"   Database: {workspace_root}/experiments.db")
print(f"   Recent runs:")
recent_runs = db.list_runs(limit=5)
if recent_runs:
    print(recent_runs)
else:
    print("   (No previous runs found)")

<a id="section-2"></a>
# 📦 Section 2: Model Loading


Load your transformer model from Transformer Builder or use the example model.

**Options:**
- **Custom Model**: Provide Gist ID from Transformer Builder (auto-detected from URL)
- **Example Model**: GPT-2 style architecture for testing

**You will see:**
1. Model code preview
2. Architecture summary (layers, parameters, size)
3. GPU compatibility check
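
The URL-based auto-detection can be sketched in plain Python. This is a minimal illustration, assuming the notebook is opened with a fragment of the form `#gist_id=abc123&name=MyModel`; the helper name `parse_model_hash` and the example URL are hypothetical and are not part of the Transformer Builder utilities.

```python
from urllib.parse import parse_qs, urlsplit


def parse_model_hash(url: str) -> dict:
    """Read gist_id and model name from a URL fragment like '#gist_id=abc123&name=MyModel'."""
    fragment = urlsplit(url).fragment      # text after '#', or '' if there is none
    params = parse_qs(fragment)            # e.g. {'gist_id': ['abc123'], 'name': ['MyModel']}
    return {
        "gist_id": params.get("gist_id", [""])[0],
        "model_name": params.get("name", [""])[0],
    }


# Hypothetical notebook URL, for illustration only
print(parse_model_hash("https://colab.example/notebook.ipynb#gist_id=abc123&name=MyModel"))
```

A missing fragment yields empty strings, which is the case where the manual form field or the `GIST_ID` environment variable would be consulted instead.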


In [None]:
# @title 🔗 Model Source Configuration { display-mode: "form" }

# Step 1: Try to extract from URL hash using JavaScript
from google.colab import output
import os
import json

# JavaScript to extract gist_id and model_name from URL hash
js_code = """
(function() {
    let gist_id = '';
    let model_name = '';

    try {
        // Try to read URL hash from parent window (Colab embedding)
        const hash = window.parent.location.hash || window.location.hash || '';

        if (hash) {
            // Parse hash parameters (e.g., #gist_id=abc123&name=MyModel)
            const params = new URLSearchParams(hash.substring(1));
            gist_id = params.get('gist_id') || '';
            model_name = params.get('name') || '';

            console.log('Extracted from URL hash:', {gist_id, model_name});
        }
    } catch (e) {
        console.log('Could not access URL hash:', e);
    }

    // Return as JSON string
    return JSON.stringify({gist_id: gist_id, model_name: model_name});
})();
"""

# Execute JavaScript and get returned values
try:
    url_params_json = output.eval_js(js_code)
    url_params = json.loads(url_params_json)
    gist_id_from_url = url_params.get('gist_id', '')
    model_name_from_url = url_params.get('model_name', '')
except Exception as e:
    print(f"⚠️  Could not extract from URL hash: {e}")
    gist_id_from_url = ''
    model_name_from_url = ''

# Step 2: Manual input forms (as fallback)
gist_id_manual = ""  #@param {type:"string"}
model_name_manual = "CustomTransformer"  #@param {type:"string"}

# Step 3: Environment variables (lowest priority)
gist_id_env = os.getenv('GIST_ID', '')
model_name_env = os.getenv('MODEL_NAME', '')

# Step 4: Determine final values (URL > Manual > Env)
gist_id = gist_id_from_url or gist_id_manual or gist_id_env
model_name = model_name_from_url or model_name_manual or model_name_env or 'CustomTransformer'

# Display source
print("="*60)
if gist_id:
    source = "URL hash" if gist_id_from_url else ("Manual input" if gist_id_manual else "Environment variable")
    print(f"✅ Model Source: {source}")
    print(f"   Gist ID: {gist_id}")
    print(f"   Model Name: {model_name}")
    print(f"\n   Loading custom model from Transformer Builder...")
else:
    print("ℹ️  No Gist ID provided")
    print("   Options to provide Gist ID:")
    print("   1. Open via Transformer Builder link (auto-detects from URL)")
    print("   2. Enter Gist ID in the form above")
    print("   3. Set GIST_ID environment variable")
    print("\n   Proceeding with example model for demonstration...")
print("="*60)


In [None]:
# @title 📦 Load Model from Gist { display-mode: "form" }

import urllib.error
import urllib.request
import json
import sys
import tempfile
import shutil

print("=" * 70)
print("MODEL LOADING")
print("=" * 70)
print()

# ==============================================================================
# VERIFY GIST ID WAS PROVIDED
# ==============================================================================

if 'gist_id' not in globals():
    print("❌ ERROR: Model source has not been configured!")
    print()
    print("==" * 35)
    print("🔙 GO BACK TO PREVIOUS CELL")
    print("==" * 35)
    print()
    print("You must run the Model Source Configuration cell first.")
    print()
    raise ValueError("Model source not configured - run previous cell first")

if gist_id:
    print(f"📥 Loading model from GitHub Gist: {gist_id}")
else:
    # An empty Gist ID is allowed: the fetch below fails fast and the
    # notebook falls back to the example model.
    print("ℹ️  No Gist ID provided - the example model will be used")
print()

# ==============================================================================
# FETCH GIST AND LOAD MODEL FILES - GitHub API Approach
# ==============================================================================

def _fetch_gist(gid: str) -> dict:
    """Fetch Gist data from GitHub API."""
    if not gid:
        raise RuntimeError("No Gist ID provided")
    url = f"https://api.github.com/gists/{gid}"
    req = urllib.request.Request(url, headers={
        "Accept": "application/vnd.github+json",
        "User-Agent": "transformer-builder-colab"
    })
    try:
        with urllib.request.urlopen(req, timeout=20) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as e:
        detail = f"HTTP {e.code}"
        try:
            body = e.read().decode("utf-8")
            if "rate limit" in body.lower():
                detail += " - GitHub API rate limit (try again in an hour)"
            elif e.code == 404:
                detail += " - Gist not found (check your Gist ID)"
        except Exception:
            # The error body is optional detail; ignore failures while reading it
            pass
        raise RuntimeError(f"GitHub API error: {detail}") from e
    except Exception as e:
        raise RuntimeError(f"Network error: {e}") from e

def _write(path: str, text: str):
    """Write text to file."""
    with open(path, "w") as f:
        f.write(text)

# Fetch Gist
try:
    gist_data = _fetch_gist(gist_id)
    files = gist_data.get("files") or {}

    # Check for required files
    if "model.py" not in files:
        raise RuntimeError("Gist is missing 'model.py' - please re-export from Transformer Builder")
    if "config.json" not in files:
        raise RuntimeError("Gist is missing 'config.json' - please re-export from Transformer Builder")

    model_code = files["model.py"].get("content", "")
    config_json = files["config.json"].get("content", "")

    if not model_code or not config_json:
        raise RuntimeError("Empty content in model.py or config.json")

    # Write to files
    _write("model.py", model_code)
    _write("config.json", config_json)

    print("✅ Model loaded successfully!")
    print(f"✅ Gist URL: {gist_data.get('html_url', 'N/A')}")
    print(f"✅ Model code: {len(model_code):,} bytes")
    print(f"✅ Config: {len(config_json):,} bytes")
    print()

    # Parse model name from config if available
    try:
        model_config = json.loads(config_json)
        if 'model_name' in model_config:
            model_name = model_config['model_name']
            print(f"✅ Model name: {model_name}")
        else:
            model_name = 'CustomTransformer'
            print(f"ℹ️  Using default name: {model_name}")
        print()
    except Exception:
        # Keep model_config defined so later cells do not hit a NameError
        model_config = {}
        model_name = 'CustomTransformer'
        print(f"⚠️  Could not parse config, using default name: {model_name}")

    # Store for next cell
    gist_loaded = True

except Exception as e:
    print("❌ Failed to load model from Gist!")
    print()
    print(f"Error: {e}")
    print()
    print("=" * 70)
    print("TROUBLESHOOTING")
    print("=" * 70)
    print()
    print("Common issues:")
    print("  1. Check your Gist ID is correct (go back to previous cell)")
    print("  2. Ensure you exported from Transformer Builder successfully")
    print("  3. Check you're not hitting GitHub rate limit (60 requests/hour)")
    print("  4. Try re-exporting from Transformer Builder")
    print()
    print("If the problem persists:")
    print(f"  • Gist URL: https://gist.github.com/{gist_id}")
    print("  • Verify the Gist contains model.py and config.json")
    print()

    # Fallback to example model
    print("⚠️  Falling back to example model for demonstration...")
    gist_loaded = False
    model_name = 'ExampleTransformer'

print("=" * 70)
print("✅ MODEL LOADING COMPLETE")
print("=" * 70)
print()
print("Model will be instantiated in the next cell.")
print()

# Display downloaded model code preview
if gist_loaded:
    print("\n📄 Model Code Preview:")
    print("=" * 60)
    with open('model.py', 'r') as f:
        model_lines = f.read().split('\n')
        # Show first 20 lines
        for i, line in enumerate(model_lines[:20], 1):
            print(f"{i:3d} | {line}")
        if len(model_lines) > 20:
            print(f"... ({len(model_lines) - 20} more lines)")
    print("=" * 60)

print(f"\n📊 Model: {model_name}")
if gist_loaded:
    print(f"   Config: {json.dumps(model_config, indent=2)}")


In [None]:
# @title 🚀 Initialize Model { display-mode: "form" }

import sys
import torch
import torch.nn as nn
import inspect
from types import SimpleNamespace

# Detect device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🖥️  Device: {device}")

if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Create model instance
if gist_loaded:
    # Custom model from Transformer Builder
    # Import the model from downloaded file
    try:
        sys.path.insert(0, '.')

        # Import all classes from model.py
        import importlib.util
        spec = importlib.util.spec_from_file_location("custom_model", "model.py")
        custom_model_module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(custom_model_module)

        # Find the model class
        model_class = None
        for name, obj in vars(custom_model_module).items():
            if isinstance(obj, type) and issubclass(obj, nn.Module) and obj is not nn.Module:
                if name == model_name:
                    model_class = obj
                    break
        
        if model_class is None:
            # Fallback: find any nn.Module subclass
            for name, obj in vars(custom_model_module).items():
                if isinstance(obj, type) and issubclass(obj, nn.Module) and obj is not nn.Module:
                    model_class = obj
                    print(f"⚠️ Using {name} (expected {model_name})")
                    break
        
        if model_class:
            # Check constructor signature (KEY FIX from template.ipynb)
            sig = inspect.signature(model_class.__init__)
            params_list = [p for p in sig.parameters.values() if p.name != 'self']
            
            if len(params_list) == 0:
                # Parameterless constructor (Transformer Builder models)
                print("ℹ️  Model has parameterless constructor (Transformer Builder export)")
                model = model_class()
            else:
                # Parameterized constructor (traditional models)
                print(f"ℹ️  Model accepts {len(params_list)} parameter(s)")
                model = model_class(**model_config)

            print(f"✅ Custom model instantiated: {model.__class__.__name__}")
        else:
            raise Exception("No model class found in model.py")

    except Exception as e:
        print(f"❌ Failed to instantiate custom model: {e}")
        print("   Falling back to example model...")
        gist_loaded = False

if not gist_loaded:
    # Example model (fallback)
    print("📦 Loading example model (GPT-2 architecture)...")

    class ExampleTransformer(nn.Module):
        """Example GPT-2 style transformer for demonstration."""

        def __init__(self, vocab_size=50257, d_model=768, n_layers=12, n_heads=12, max_seq_len=1024):
            super().__init__()
            self.vocab_size = vocab_size
            self.d_model = d_model
            self.n_layers = n_layers
            self.n_heads = n_heads
            self.max_seq_len = max_seq_len

            self.embedding = nn.Embedding(vocab_size, d_model)
            self.position_embedding = nn.Embedding(max_seq_len, d_model)

            # Simple transformer layers
            self.layers = nn.ModuleList([
                nn.TransformerEncoderLayer(
                    d_model,
                    n_heads,
                    dim_feedforward=d_model*4,
                    batch_first=True,
                    dropout=0.1
                )
                for _ in range(n_layers)
            ])

            self.ln_f = nn.LayerNorm(d_model)
            self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

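        def forward(self, input_ids):
            batch_size, seq_len = input_ids.shape

            # Embeddings
            token_emb = self.embedding(input_ids)
            pos_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
            pos_emb = self.position_embedding(pos_ids)

            x = token_emb + pos_emb

            # Causal mask: True marks future positions a token may NOT attend to.
            # Without it, next-token prediction could read the answer from the input.
            causal_mask = torch.triu(
                torch.ones(seq_len, seq_len, dtype=torch.bool, device=input_ids.device),
                diagonal=1
            )

            # Transformer layers
            for layer in self.layers:
                x = layer(x, src_mask=causal_mask)

            x = self.ln_f(x)
            logits = self.lm_head(x)

            return logits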

    # Create example model
    model = ExampleTransformer()
    model_config = {
        'vocab_size': 50257,
        'd_model': 768,
        'n_layers': 12,
        'n_heads': 12,
        'max_seq_len': 1024
    }

    print("✅ Example model definition loaded")

# Move to device
model = model.to(device)

# Model summary
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\n✅ Model initialized on {device}")
print(f"   Total parameters: {total_params:,}")
print(f"   Trainable parameters: {trainable_params:,}")
print(f"   Model size: {total_params * 4 / 1e6:.1f} MB (fp32)")

# Create config object for training utilities
config_obj = SimpleNamespace(**model_config)
if not hasattr(config_obj, 'vocab_size'):
    config_obj.vocab_size = model_config.get('vocab_size', 50257)
if not hasattr(config_obj, 'max_seq_len'):
    config_obj.max_seq_len = model_config.get('max_seq_len', 1024)

print("\n🎯 Ready for training!")
print("\nℹ️  Note: Update Section 4 training config before starting training loop.")


<a id="section-3"></a>
# 📊 Section 3: Data Loading

Choose your data source (run ONE of the following cells):
- **Option 1**: HuggingFace Datasets (recommended)
- **Option 2**: Google Drive Upload
- **Option 3**: File Upload (small datasets)
- **Option 4**: Local Files (from previous sessions)
- **Option 5**: Synthetic Data (testing only)
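
Options 2 and 3 split a plain-text file into training and validation sets by line, using the first 90% for training. The sketch below shows that split as a standalone helper. The name `split_lines` is hypothetical, and dropping blank lines is an added assumption that the notebook cells themselves do not make.

```python
def split_lines(lines, train_fraction=0.9):
    """Strip each line, drop blanks (an assumption), and split by position."""
    cleaned = [line.strip() for line in lines if line.strip()]
    split_idx = int(train_fraction * len(cleaned))
    return cleaned[:split_idx], cleaned[split_idx:]


# Ten sentences plus a trailing blank line, as a text file might contain
sample = [f"sentence {i}\n" for i in range(10)] + ["\n"]
train, val = split_lines(sample)
print(len(train), len(val))
```

Because the split is positional, the validation set is simply the tail of the file; shuffle the lines first if the file is ordered by topic or date.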


In [None]:
from datasets import load_dataset

# CONFIGURATION: Edit dataset name
dataset_name = "wikitext"  #@param {type:"string"}
config_name = "wikitext-2-raw-v1"  #@param {type:"string"}
max_samples = 1000  #@param {type:"integer"}

# Load dataset
dataset = load_dataset(dataset_name, config_name)
train_data = dataset['train'].select(range(min(max_samples, len(dataset['train']))))
val_data = dataset['validation'].select(range(min(100, len(dataset['validation']))))

print(f"✅ Loaded {len(train_data)} training samples, {len(val_data)} validation samples")
print(f"   Example: {train_data[0]}")

data_source = "huggingface"
dataset_info = {'name': dataset_name, 'config': config_name, 'train_size': len(train_data), 'val_size': len(val_data)}

In [None]:
import os

drive_data_path = "/content/drive/MyDrive/TransformerTraining/datasets/my_data.txt"  #@param {type:"string"}

if os.path.exists(drive_data_path):
    with open(drive_data_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    split_idx = int(0.9 * len(lines))
    train_data = [line.strip() for line in lines[:split_idx]]
    val_data = [line.strip() for line in lines[split_idx:]]

    print(f"✅ Loaded {len(train_data)} training samples, {len(val_data)} validation samples")
    data_source = "google_drive"
    dataset_info = {'path': drive_data_path, 'train_size': len(train_data), 'val_size': len(val_data)}
else:
    print(f"❌ File not found: {drive_data_path}")
    print("   Please upload your data to Google Drive first")

In [None]:
from google.colab import files
import io

# Upload file
uploaded = files.upload()

if uploaded:
    filename = list(uploaded.keys())[0]
    content = uploaded[filename].decode('utf-8')
    lines = content.split('\n')

    split_idx = int(0.9 * len(lines))
    train_data = [line.strip() for line in lines[:split_idx]]
    val_data = [line.strip() for line in lines[split_idx:]]

    print(f"✅ Loaded {len(train_data)} training samples, {len(val_data)} validation samples")
    data_source = "file_upload"
    dataset_info = {'filename': filename, 'train_size': len(train_data), 'val_size': len(val_data)}

In [None]:
import pickle
import os

cache_path = f'{workspace_root}/datasets/cached_data.pkl'

if os.path.exists(cache_path):
    with open(cache_path, 'rb') as f:
        data = pickle.load(f)

    train_data = data['train']
    val_data = data['val']

    print(f"✅ Loaded cached data: {len(train_data)} train, {len(val_data)} val")
    data_source = "cached"
    dataset_info = {'path': cache_path, 'train_size': len(train_data), 'val_size': len(val_data)}
else:
    print(f"❌ No cached data found at {cache_path}")
    print("   Run one of the other data loading options first")

In [None]:
import torch

# Generate synthetic data for testing
vocab_size = 50257  # GPT-2 vocab
seq_len = 32
n_samples = 100

train_data = [torch.randint(0, vocab_size, (seq_len,)) for _ in range(n_samples)]
val_data = [torch.randint(0, vocab_size, (seq_len,)) for _ in range(20)]

print(f"✅ Generated {len(train_data)} synthetic training samples")
print("   ⚠️ Warning: Synthetic data is for testing only")
data_source = "synthetic"
dataset_info = {'vocab_size': vocab_size, 'seq_len': seq_len, 'train_size': len(train_data), 'val_size': len(val_data)}

<a id="section-4"></a>
# ⚙️ Section 4: Training Configuration

Configure hyperparameters using Colab forms below.
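
Before picking values, it can help to see how they combine into step counts. The following is a rough sketch of that arithmetic, not part of `TrainingConfig`. The helper name `schedule_summary` is hypothetical, and it assumes the last partial batch is kept, which is the `DataLoader` default.

```python
import math


def schedule_summary(n_samples, batch_size, epochs, warmup_ratio,
                     gradient_accumulation_steps=1):
    """Derive step counts from the hyperparameters (illustrative only)."""
    steps_per_epoch = math.ceil(n_samples / batch_size)  # last partial batch is kept
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch_size": batch_size * gradient_accumulation_steps,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": round(warmup_ratio * total_steps),
    }


# Values matching the notebook defaults with the 100-sample synthetic dataset
print(schedule_summary(n_samples=100, batch_size=4, epochs=10, warmup_ratio=0.1))
```

With the defaults this gives 25 steps per epoch and 250 steps in total, of which roughly the first 25 are warmup.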

In [None]:
from utils.training.training_config import TrainingConfig

# HYPERPARAMETERS (edit via forms)
learning_rate = 5e-5  #@param {type:"number"}
batch_size = 4  #@param {type:"integer"}
epochs = 10  #@param {type:"integer"}
warmup_ratio = 0.1  #@param {type:"number"}
weight_decay = 0.01  #@param {type:"number"}
gradient_clip_norm = 1.0  #@param {type:"number"}

# TRAINING FEATURES
use_amp = True  #@param {type:"boolean"}
gradient_accumulation_steps = 1  #@param {type:"integer"}
deterministic = False  #@param {type:"boolean"}

# EXPERIMENT
run_name = "training-run"  #@param {type:"string"}
random_seed = 42  #@param {type:"integer"}

# Create config
config = TrainingConfig(
    learning_rate=learning_rate,
    batch_size=batch_size,
    epochs=epochs,
    warmup_ratio=warmup_ratio,
    weight_decay=weight_decay,
    max_grad_norm=gradient_clip_norm,
    use_amp=use_amp,
    gradient_accumulation_steps=gradient_accumulation_steps,
    deterministic=deterministic,
    random_seed=random_seed,
    run_name=run_name
)

# Validate
config.validate()

# Save to Drive
config_path = config.save(f'{workspace_root}/configs/')
print(f"✅ Config saved: {config_path}")
print(f"\n{config}")

In [None]:
# Display configuration summary
print("=" * 60)
print(" " * 15 + "TRAINING CONFIGURATION")
print("=" * 60)
print(f"{'Run Name:':<25} {config.run_name}")
print(f"{'Learning Rate:':<25} {config.learning_rate}")
print(f"{'Batch Size (effective):':<25} {config.batch_size * config.gradient_accumulation_steps}")
print(f"{'Epochs:':<25} {config.epochs}")
print(f"{'Warmup Ratio:':<25} {config.warmup_ratio}")
print(f"{'Gradient Clipping:':<25} {config.max_grad_norm}")
print(f"{'AMP Enabled:':<25} {config.use_amp}")
print(f"{'Deterministic:':<25} {config.deterministic}")
print(f"{'Random Seed:':<25} {config.random_seed}")
print(f"{'Data Source:':<25} {data_source}")
print("=" * 60)

### Training Mode Selection

Based on your `epochs` setting:
- **epochs <= 5**: ⚡ Fast Mode (~5 min)
- **epochs <= 15**: ⚖️ Balanced Mode (~15 min)
- **epochs > 15**: 💎 Quality Mode (45+ min)
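
The thresholds above amount to a small lookup. A minimal sketch follows; `training_mode` is a hypothetical helper and is not defined by the notebook or its utilities.

```python
def training_mode(epochs: int) -> str:
    """Map an epoch count to the mode names used in this notebook."""
    if epochs <= 5:
        return "Fast"
    if epochs <= 15:
        return "Balanced"
    return "Quality"


for n in (3, 10, 20):
    print(n, training_mode(n))
```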

Proceed to W&B setup in Section 5, then training in Section 6 ⬇️

<a id="section-5"></a>
# 🔬 Section 5: W&B Tracking Setup (Optional)

Enable Weights & Biases for cloud-based experiment tracking.

In [None]:
import wandb
from getpass import getpass

use_wandb = True  #@param {type:"boolean"}
wandb_project = "transformer-training"  #@param {type:"string"}
wandb_entity = ""  #@param {type:"string"}

if use_wandb:
    # Login to W&B
    wandb_key = getpass("Enter W&B API key (or leave blank to skip): ")
    if wandb_key:
        wandb.login(key=wandb_key)

        # Initialize run
        wandb.init(
            project=wandb_project,
            entity=wandb_entity if wandb_entity else None,
            name=config.run_name,
            config=config.to_dict(),
            tags=[data_source, f"epochs_{epochs}"]
        )
        print(f"✅ W&B initialized: {wandb.run.url}")
    else:
        use_wandb = False
        print("⚠️ W&B skipped - training will use local tracking only")
else:
    print("ℹ️ W&B disabled - using local SQLite tracking")

<a id="section-6"></a>
# 🏋️ Section 6: Training Loop

Main training loop with live visualization and checkpointing.
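
The loop trains a next-token objective: the model sees every token except the last and is scored against the same sequence shifted left by one. Validation perplexity is the exponential of the mean cross-entropy. The sketch below illustrates both ideas with plain lists and the standard library; the helper names are hypothetical.

```python
import math


def next_token_pairs(token_ids):
    """Inputs are all tokens but the last; targets are all tokens but the first."""
    return token_ids[:-1], token_ids[1:]


def perplexity(mean_cross_entropy):
    """Perplexity is exp(mean cross-entropy in nats)."""
    return math.exp(mean_cross_entropy)


inputs, targets = next_token_pairs([11, 22, 33, 44])
print(inputs, targets)   # [11, 22, 33] [22, 33, 44]
print(perplexity(0.0))   # 1.0
```

For a sense of scale, a model that guesses uniformly over a vocabulary of 50257 tokens has a loss of ln(50257), about 10.8, so its perplexity equals the vocabulary size.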

In [None]:
from utils.training.metrics_tracker import MetricsTracker
from utils.training.live_plotting import LivePlotter
from utils.training.seed_manager import set_random_seed
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Set random seed
set_random_seed(config.random_seed, config.deterministic)

# Initialize metrics tracker
tracker = MetricsTracker(use_wandb=use_wandb)

# Initialize live plotter
plotter = LivePlotter(update_interval=1)

# Create DataLoader (simplified - adapt to your data format)
if data_source == "synthetic":
    train_dataset = TensorDataset(torch.stack(train_data))
    val_dataset = TensorDataset(torch.stack(val_data))
else:
    # For HuggingFace datasets or text data, you'll need proper tokenization
    print("⚠️ Using synthetic data - implement proper tokenization for real datasets")
    train_dataset = TensorDataset(torch.stack([torch.randint(0, 50257, (32,)) for _ in range(100)]))
    val_dataset = TensorDataset(torch.stack([torch.randint(0, 50257, (32,)) for _ in range(20)]))

train_loader = DataLoader(train_dataset, batch_size=config.batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=config.batch_size, shuffle=False)

# Initialize optimizer
optimizer = optim.AdamW(
    model.parameters(),
    lr=config.learning_rate,
    weight_decay=config.weight_decay
)

# Learning rate scheduler (warmup + cosine decay)
from torch.optim.lr_scheduler import OneCycleLR
scheduler = OneCycleLR(
    optimizer,
    max_lr=config.learning_rate,
    epochs=config.epochs,
    steps_per_epoch=len(train_loader),
    pct_start=config.warmup_ratio
)

print("✅ Training initialized")
print(f"   Train batches: {len(train_loader)}")
print(f"   Val batches: {len(val_loader)}")

In [None]:
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler
import time

# Initialize gradient scaler for AMP
scaler = GradScaler(enabled=config.use_amp)

# Training loop
for epoch in range(config.epochs):
    epoch_start = time.time()
    model.train()
    train_loss = 0.0

    for batch_idx, (input_ids,) in enumerate(train_loader):
        input_ids = input_ids.to(device)

        # Forward pass with AMP
        with autocast(enabled=config.use_amp):
            # Shift for language modeling: predict next token
            logits = model(input_ids[:, :-1])
            targets = input_ids[:, 1:]

            # Compute loss
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets.reshape(-1)
            )

        # Backward pass
        scaler.scale(loss).backward()

        # Gradient clipping
        if config.max_grad_norm is not None:
            scaler.unscale_(optimizer)
            grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), config.max_grad_norm)
        else:
            grad_norm = 0.0

        # Optimizer step
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        scheduler.step()

        train_loss += loss.item()

        # Log batch metrics
        global_step = epoch * len(train_loader) + batch_idx
        tracker.log_scalar('train/batch_loss', loss.item(), step=global_step)
        tracker.log_scalar('train/learning_rate', scheduler.get_last_lr()[0], step=global_step)

        if batch_idx % 10 == 0:
            print(f"Epoch {epoch+1}/{config.epochs} | Batch {batch_idx}/{len(train_loader)} | Loss: {loss.item():.4f}")

    # Validation
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for input_ids, in val_loader:
            input_ids = input_ids.to(device)

            with autocast(enabled=config.use_amp):
                logits = model(input_ids[:, :-1])
                targets = input_ids[:, 1:]
                loss = F.cross_entropy(
                    logits.reshape(-1, logits.size(-1)),
                    targets.reshape(-1)
                )

            val_loss += loss.item()

    # Compute epoch metrics
    avg_train_loss = train_loss / len(train_loader)
    avg_val_loss = val_loss / len(val_loader)
    epoch_time = time.time() - epoch_start

    # Log epoch metrics
    tracker.log_epoch(
        epoch=epoch,
        train_metrics={'loss': avg_train_loss},
        val_metrics={'loss': avg_val_loss, 'perplexity': torch.exp(torch.tensor(avg_val_loss)).item()},
        learning_rate=scheduler.get_last_lr()[0],
        gradient_norm=grad_norm if isinstance(grad_norm, float) else grad_norm.item(),
        epoch_duration=epoch_time
    )

    # Update live plot
    plotter.update(tracker.get_summary())

    # Save checkpoint
    if (epoch + 1) % 5 == 0 or epoch == config.epochs - 1:
        checkpoint_path = f"{workspace_root}/checkpoints/{config.run_name}_epoch{epoch+1}.pt"
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'scheduler_state_dict': scheduler.state_dict(),
            'train_loss': avg_train_loss,
            'val_loss': avg_val_loss,
            'config': config.to_dict()
        }, checkpoint_path)
        print(f"💾 Checkpoint saved: {checkpoint_path}")

    print(f"Epoch {epoch+1}/{config.epochs} | Train Loss: {avg_train_loss:.4f} | Val Loss: {avg_val_loss:.4f} | Time: {epoch_time:.1f}s")

print("\n✅ Training completed!")

# Save experiment to database
db.save_run(
    run_name=config.run_name,
    config=config.to_dict(),
    metrics=tracker.get_summary().to_dict('records')[-1],
    data_source=data_source
)

<a id="section-7"></a>
# 📈 Section 7: Analysis & Visualization

Analyze training results with comprehensive dashboards.
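
The training loop writes a checkpoint every 5 epochs and at the final epoch, so the epoch with the lowest validation loss may not have a checkpoint of its own. The sketch below lists which epochs are saved under that rule and finds the first saved checkpoint at or after a given best epoch. Both helper names are hypothetical, and epochs are counted from 1 as in the checkpoint filenames.

```python
def saved_checkpoint_epochs(total_epochs, every=5):
    """1-based epochs at which a checkpoint is written (every N epochs plus the last)."""
    return [e for e in range(1, total_epochs + 1)
            if e % every == 0 or e == total_epochs]


def nearest_saved_epoch(best_epoch, total_epochs, every=5):
    """First saved epoch at or after best_epoch."""
    if not 1 <= best_epoch <= total_epochs:
        raise ValueError("best_epoch must lie in 1..total_epochs")
    return next(e for e in saved_checkpoint_epochs(total_epochs, every)
                if e >= best_epoch)


print(saved_checkpoint_epochs(12))   # [5, 10, 12]
print(nearest_saved_epoch(7, 12))    # 10
```

The checkpoint found this way has trained a few epochs past the best one. If exact recovery matters, saving on every validation improvement is a common alternative.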

In [None]:
from utils.training.dashboard import TrainingDashboard

# Create comprehensive 6-panel dashboard
metrics_df = tracker.get_summary()
dashboard = TrainingDashboard(figsize=(18, 12))

fig = dashboard.plot(
    metrics_df,
    config=config,
    title=f"Training Dashboard: {config.run_name}"
)

# Save to Drive
dashboard_path = f'{workspace_root}/results/{config.run_name}_dashboard.png'
dashboard.save(dashboard_path, dpi=150)
print(f"✅ Dashboard saved to Drive: {dashboard_path}")

In [None]:
# Find best epoch based on validation loss
best_epoch_idx = metrics_df['val/loss'].idxmin()
best_epoch = metrics_df.loc[best_epoch_idx]

print("=" * 60)
print(" " * 20 + "BEST EPOCH ANALYSIS")
print("=" * 60)
print(f"{'Best Epoch:':<25} {int(best_epoch['epoch']) + 1}")
print(f"{'Validation Loss:':<25} {best_epoch['val/loss']:.4f}")
print(f"{'Validation Perplexity:':<25} {best_epoch['val/perplexity']:.2f}")
print(f"{'Training Loss:':<25} {best_epoch['train/loss']:.4f}")
print(f"{'Learning Rate:':<25} {best_epoch['train/learning_rate']:.2e}")
print("=" * 60)

# Locate best checkpoint (checkpoints are written every 5 epochs and at the final epoch)
best_checkpoint_path = f"{workspace_root}/checkpoints/{config.run_name}_epoch{int(best_epoch['epoch']) + 1}.pt"
if os.path.exists(best_checkpoint_path):
    print(f"\n💾 Best checkpoint: {best_checkpoint_path}")
else:
    print("\n⚠️ No checkpoint exists for the best epoch (only every 5th epoch and the final epoch are saved)")

In [None]:
# Display metrics table
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: f'{x:.4f}')

display_cols = ['epoch', 'train/loss', 'val/loss', 'val/perplexity', 'train/learning_rate']
available_cols = [col for col in display_cols if col in metrics_df.columns]

print("\nTraining Metrics Summary:")
print(metrics_df[available_cols].to_string(index=False))

# Export to CSV
csv_path = f'{workspace_root}/results/{config.run_name}_metrics.csv'
metrics_df.to_csv(csv_path, index=False)
print(f"\n✅ Metrics exported to: {csv_path}")

In [None]:
import torch

if torch.cuda.is_available():
    print("=" * 60)
    print(" " * 20 + "GPU METRICS")
    print("=" * 60)

    gpu_cols = [col for col in metrics_df.columns if col.startswith('gpu/')]
    if gpu_cols:
        print(metrics_df[['epoch'] + gpu_cols].tail(5).to_string(index=False))

        # Plot GPU utilization
        import matplotlib.pyplot as plt
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

        if 'gpu/memory_allocated_mb' in metrics_df.columns:
            ax1.plot(metrics_df['epoch'], metrics_df['gpu/memory_allocated_mb'])
            ax1.set_xlabel('Epoch')
            ax1.set_ylabel('GPU Memory (MB)')
            ax1.set_title('GPU Memory Usage')
            ax1.grid(True)

        if 'gpu/utilization_percent' in metrics_df.columns:
            ax2.plot(metrics_df['epoch'], metrics_df['gpu/utilization_percent'])
            ax2.set_xlabel('Epoch')
            ax2.set_ylabel('GPU Utilization (%)')
            ax2.set_title('GPU Utilization')
            ax2.grid(True)

        plt.tight_layout()
        plt.savefig(f'{workspace_root}/results/{config.run_name}_gpu_metrics.png', dpi=100)
        plt.show()
        print(f"\n✅ GPU metrics saved")
    else:
        print("⚠️ No GPU metrics collected during training")
    print("=" * 60)
else:
    print("ℹ️ Training was performed on CPU (no GPU metrics available)")

<a id="section-8"></a>
# 💾 Section 8: Export & Results

Download checkpoints, configs, and results.
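
The download cell below sends each file to the browser separately. As an alternative sketch (not part of the notebook's utilities), the files could be bundled into one zip archive first, using only the standard library. The helper name `bundle_files` and the demo file names are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path

def bundle_files(paths, archive_path):
    """Write the existing files in `paths` into one zip; return names added."""
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in map(Path, paths):
            if path.is_file():  # silently skip files that were never saved
                archive.write(path, arcname=path.name)
                added.append(path.name)
    return added

# Demo with throwaway files; in the notebook the inputs would be the
# dashboard, metrics CSV, config, and best checkpoint paths.
with tempfile.TemporaryDirectory() as tmp:
    metrics = Path(tmp) / "run_metrics.csv"
    metrics.write_text("epoch,val/loss\n0,3.2\n")
    missing = Path(tmp) / "run_dashboard.png"  # never created, so skipped
    archive = Path(tmp) / "run_results.zip"
    print(bundle_files([metrics, missing], archive))
```

In Colab, the resulting archive path could then be passed to a single `files.download(...)` call.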

In [None]:
import os
from google.colab import files

print("=" * 60)
print(" " * 20 + "EXPORT SUMMARY")
print("=" * 60)
print(f"\nüìÅ Workspace: {workspace_root}")
print(f"\nüìä Results:")
print(f"   - Dashboard: {config.run_name}_dashboard.png")
print(f"   - Metrics CSV: {config.run_name}_metrics.csv")
print(f"   - Config: {os.path.basename(config_path)}")
print(f"\nüíæ Checkpoints:")

checkpoint_dir = f"{workspace_root}/checkpoints"
checkpoints = [f for f in os.listdir(checkpoint_dir) if f.startswith(config.run_name)]
for ckpt in sorted(checkpoints):
    ckpt_path = os.path.join(checkpoint_dir, ckpt)
    size_mb = os.path.getsize(ckpt_path) / (1024 * 1024)
    print(f"   - {ckpt} ({size_mb:.1f} MB)")

print("=" * 60)

In [None]:
# Download results to local machine
download_results = False  #@param {type:"boolean"}

if download_results:
    print("Downloading files...")

    # Download dashboard
    dashboard_file = f'{workspace_root}/results/{config.run_name}_dashboard.png'
    if os.path.exists(dashboard_file):
        files.download(dashboard_file)

    # Download metrics CSV
    metrics_file = f'{workspace_root}/results/{config.run_name}_metrics.csv'
    if os.path.exists(metrics_file):
        files.download(metrics_file)

    # Download config
    if os.path.exists(config_path):
        files.download(config_path)

    # Download best checkpoint
    if os.path.exists(best_checkpoint_path):
        files.download(best_checkpoint_path)
        print(f"✅ Downloaded {os.path.basename(best_checkpoint_path)}")

    print("✅ Downloads complete")
else:
    print("ℹ️ Downloads skipped. Files are saved in Google Drive.")
    print(f"   Access them at: {workspace_root}")

In [None]:
# Compare with previous runs
all_runs = db.list_runs(limit=10)

if len(all_runs) > 1:
    print("=" * 60)
    print(" " * 15 + "COMPARISON WITH PREVIOUS RUNS")
    print("=" * 60)

    comparison_data = []
    for run in all_runs:
        comparison_data.append({
            'run_name': run.get('run_name', 'unknown'),
            'final_val_loss': run.get('metrics', {}).get('val/loss', float('nan')),
            'final_perplexity': run.get('metrics', {}).get('val/perplexity', float('nan')),
            'data_source': run.get('data_source', 'unknown'),
            'timestamp': run.get('timestamp', 'unknown')
        })

    comparison_df = pd.DataFrame(comparison_data)
    print(comparison_df.to_string(index=False))
    print("=" * 60)
else:
    print("ℹ️ No previous runs to compare (this is your first run)")

<a id="section-9"></a>
# 🔬 Section 9: Advanced Features

Hyperparameter search, multi-run experiments, and optimization.
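
The search space used in this section mixes continuous `(low, high)` ranges with discrete lists of choices. How `test_hyperparameter_search` samples from it is not shown in this notebook, so the following is only a plain random-sampling sketch of the idea. The log-scale treatment of `learning_rate` and the helper name `sample_params` are assumptions made for illustration.

```python
import math
import random

# Same shape as the notebook's search space: tuples are continuous
# (low, high) ranges, lists are discrete choices.
search_space = {
    "learning_rate": (1e-5, 1e-3),
    "batch_size": [4, 8, 16],
    "warmup_ratio": (0.0, 0.2),
    "weight_decay": (0.0, 0.1),
}

# Assumption: learning rate is sampled on a log scale, the rest uniformly.
LOG_SCALE = {"learning_rate"}

def sample_params(space, rng):
    """Draw one hyperparameter configuration from `space`."""
    params = {}
    for name, spec in space.items():
        if isinstance(spec, list):
            params[name] = rng.choice(spec)
        elif name in LOG_SCALE:
            low, high = spec
            params[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            low, high = spec
            params[name] = rng.uniform(low, high)
    return params

rng = random.Random(0)  # fixed seed so the draws are repeatable
trials = [sample_params(search_space, rng) for _ in range(5)]
for i, params in enumerate(trials):
    print(i, params)
```

Sampling the learning rate on a log scale spreads trials evenly across orders of magnitude, which a uniform draw over `(1e-5, 1e-3)` would not do.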

In [None]:
from utils.tier3_training_utilities import test_hyperparameter_search

# Hyperparameter search configuration
run_hp_search = False  #@param {type:"boolean"}
n_trials = 10  #@param {type:"integer"}
search_timeout = 3600  #@param {type:"integer"}

if run_hp_search:
    print("üîç Starting hyperparameter search...")
    print(f"   Trials: {n_trials}")
    print(f"   Timeout: {search_timeout}s ({search_timeout/60:.1f} min)")
    print("\n‚ö†Ô∏è This may take a while. Progress will be shown below.")

    # Define search space (printed for reference only; the search call in
    # the next cell does not receive this dict)
    search_space = {
        'learning_rate': (1e-5, 1e-3),
        'batch_size': [4, 8, 16],
        'warmup_ratio': (0.0, 0.2),
        'weight_decay': (0.0, 0.1)
    }

    print(f"\nSearch space: {search_space}")
else:
    print("ℹ️ Hyperparameter search disabled")
    print("   Set 'run_hp_search = True' to enable")

In [None]:
if run_hp_search:
    # Run search
    hp_results = test_hyperparameter_search(
        model=model,
        config=config,
        train_data=train_data,
        val_data=val_data,
        n_trials=n_trials,
        timeout=search_timeout,
        use_wandb=use_wandb
    )

    # Display results
    print("\n" + "=" * 60)
    print(" " * 15 + "HYPERPARAMETER SEARCH RESULTS")
    print("=" * 60)
    print(f"\nBest parameters:")
    for param, value in hp_results['best_params'].items():
        print(f"   {param}: {value}")

    print(f"\nBest validation loss: {hp_results['best_value']:.4f}")
    print(f"\nAll trials:")
    print(hp_results['trials_df'].to_string(index=False))

    # Save results
    hp_results['trials_df'].to_csv(
        f'{workspace_root}/results/{config.run_name}_hp_search.csv',
        index=False
    )
    print(f"\n✅ Results saved to: {config.run_name}_hp_search.csv")
    print("=" * 60)
else:
    print("⏭️ Hyperparameter search skipped")

## 🎉 Training Complete!

### Next Steps

1. **Review Results**: Check the dashboard in Section 7
2. **Download Files**: Use Section 8 to download checkpoints
3. **Compare Runs**: See Section 8 for comparison with previous experiments
4. **Optimize**: Try hyperparameter search in Section 9

### Workspace Structure

All files are saved in Google Drive:
```
/content/drive/MyDrive/TransformerTraining/
├── checkpoints/     # Model weights (.pt files)
├── configs/         # Training configs (.json files)
├── results/         # Dashboards, metrics, plots
├── datasets/        # Cached datasets
└── experiments.db   # SQLite tracking database
```
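
To check what a run actually wrote into this layout, a short standard-library sketch can walk the workspace and print each file with its size. The helper name `list_workspace` is hypothetical, and the root path is the one shown above; adjust it if your `workspace_root` differs.

```python
from pathlib import Path

def list_workspace(root):
    """Return (relative_path, size_in_mb) for every file under root."""
    root = Path(root)
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size_mb = path.stat().st_size / (1024 * 1024)
            entries.append((path.relative_to(root).as_posix(), size_mb))
    return entries

workspace = Path("/content/drive/MyDrive/TransformerTraining")
if workspace.exists():
    for rel_path, size_mb in list_workspace(workspace):
        print(f"{rel_path} ({size_mb:.1f} MB)")
else:
    print(f"Workspace not found: {workspace}")
```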

### Resources

- [Transformer Builder Documentation](https://transformer-builder.com/docs)
- [Training Utilities Reference](https://github.com/matt-hans/transformer-builder-colab-templates)
- [W&B Dashboard](https://wandb.ai) (if enabled)

---

**💡 Tip**: Save this notebook to Google Drive for future use!