# Cricket GNN Hyperparameter Search on Colab

This notebook runs hyperparameter optimization for the Cricket Ball Prediction GNN model using a free T4 GPU on Google Colab.

**Prerequisites:**
- Google account
- WandB account (free at wandb.ai)
- GitHub Personal Access Token (for private repo access) - [Create one here](https://github.com/settings/tokens)

**Important:** Colab sessions have a 12-hour limit, and files on the runtime's local disk are lost when the runtime is recycled. Save your results frequently!

For detailed instructions, see `notes/training-guides/colab-guide.md`

## 1. GPU Check & Setup

First, verify that a GPU is available. You should see a T4 GPU listed.

In [None]:
# Verify GPU is available
!nvidia-smi

import torch
print(f"\nCUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("\n** WARNING: No GPU detected! **")
    print("Go to Runtime > Change runtime type > Hardware accelerator > T4 GPU")

## 2. Clone Repository

Clone the Cricket GNN repository from GitHub. Since this is a private repo, you'll need a GitHub Personal Access Token (PAT).

**To create a PAT:**
1. Go to [GitHub Settings > Tokens](https://github.com/settings/tokens)
2. Click "Generate new token (classic)"
3. Give it a name, select `repo` scope, and generate
4. Copy the token (you won't see it again)

In [None]:
from getpass import getpass
import os

# Securely input your GitHub PAT (won't be displayed)
github_token = getpass("Enter your GitHub Personal Access Token: ")

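# Clone using the token
!git clone https://{github_token}@github.com/lyndonkl/cricketmodel.git

# git stores the clone URL (token included) in .git/config, so reset it
!git -C cricketmodel remote set-url origin https://github.com/lyndonkl/cricketmodel.git

# Remove the token from the Python namespace
del github_token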

%cd cricketmodel
!git log --oneline -3  # Show recent commits

## 3. Install Dependencies

Install PyTorch Geometric and other required packages. This may take a few minutes.

In [None]:
# Install PyTorch with CUDA (matching local environment versions)
# Note: torch 2.8.0 wheels are published for cu126/cu128/cu129; there is no cu121 build
!pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126

# Install torch-geometric (matching local version)
!pip install torch-geometric==2.7.0

# Install Optuna and WandB for hyperparameter search
!pip install optuna "optuna-integration[wandb]" wandb

# Install other dependencies
!pip install tqdm pyyaml scikit-learn plotly

# Verify versions
import torch
import torch_geometric
print("\n" + "="*50)
print(f"PyTorch: {torch.__version__}")
print(f"torch-geometric: {torch_geometric.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print("="*50)

## 4. WandB Login

Login to Weights & Biases for experiment tracking. You'll need your API key from [wandb.ai/authorize](https://wandb.ai/authorize).

In [None]:
import wandb

# This will prompt you to enter your API key
# Get it from: https://wandb.ai/authorize
wandb.login()

## 5. Load Processed Data from Google Drive

The processed data (~97 GB) should be uploaded to your Google Drive as `processed.zip`.

**One-time setup (on your local machine):**
```bash
cd data && zip -r processed.zip processed/
# Upload processed.zip to Google Drive
```

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Path to your processed.zip in Google Drive (adjust if in a subfolder)
drive_zip_path = '/content/drive/MyDrive/processed.zip'

# Check if file exists
if os.path.exists(drive_zip_path):
    print(f"Found {drive_zip_path}")
    print("Extracting to data/processed/ ...")
    !unzip -q "{drive_zip_path}" -d data/
    
    # Verify extraction
    file_count = len([f for f in os.listdir('data/processed') if f.endswith('.pt')])
    print(f"Extracted {file_count} .pt files to data/processed/")
else:
    print(f"ERROR: {drive_zip_path} not found!")
    print("Please upload processed.zip to your Google Drive root folder.")
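
Because the archive is large, it can help to confirm that the extracted files will fit on the runtime's disk before running `unzip`. This is a minimal sketch using only the standard library; the helper name `fits_on_disk` is hypothetical.

```python
import shutil
import zipfile


def fits_on_disk(zip_path: str, target_dir: str = ".") -> bool:
    """Compare the archive's total uncompressed size with free space in target_dir."""
    with zipfile.ZipFile(zip_path) as archive:
        # file_size is the uncompressed size recorded in the zip's directory
        needed = sum(info.file_size for info in archive.infolist())
    free = shutil.disk_usage(target_dir).free
    print(f"Needs {needed / 1e9:.1f} GB, {free / 1e9:.1f} GB free")
    return needed <= free
```

For example, call `fits_on_disk('/content/drive/MyDrive/processed.zip', '.')` after mounting Drive and extract only if it returns `True`.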

## 6. Hyperparameter Search (4 Phases)

The HP search runs in 4 phases, each building on the previous results:

| Phase | Parameters | Purpose | Trials |
|-------|------------|---------|--------|
| 1. Coarse | hidden_dim, lr | Find ballpark model size and learning rate | 10 |
| 2. Architecture | num_layers, num_heads | Tune depth and attention | 12 |
| 3. Training | lr, dropout, weight_decay | Regularization tuning | 15 |
| 4. Loss | focal_gamma, use_class_weights | Loss function tuning | 10 |

**Why `--n-jobs 1`?** On GPU, parallel jobs compete for VRAM. The speedup comes from the GPU itself, not parallelism. Use `--n-jobs > 1` only for CPU training.

**Time estimate:** ~1-2 hours per phase with default settings.
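
As a rough check against the 12-hour session limit, the figures above can be added up. The numbers below come only from the table and the time estimate. Actual runtimes will vary with `--n-trials` and `--epochs`.

```python
# Figures taken from the phase table and the time estimate above.
trials_per_phase = {
    "phase1_coarse": 10,
    "phase2_architecture": 12,
    "phase3_training": 15,
    "phase4_loss": 10,
}
hours_per_phase = (1, 2)    # "~1-2 hours per phase"
session_limit_hours = 12    # Colab hard limit

n_phases = len(trials_per_phase)
low, high = (n_phases * h for h in hours_per_phase)

print(f"Total trials: {sum(trials_per_phase.values())}")
print(f"Estimated search time: {low}-{high} h of a {session_limit_hours} h session")
```

With default settings, all four phases should fit in one session, but the upper estimate leaves limited headroom if you increase trials or epochs.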

In [None]:
# Phase 1: Coarse search - find ballpark hidden_dim and learning rate
# Adjust --n-trials and --epochs based on your time budget

!python scripts/hp_search.py \
    --phase phase1_coarse \
    --n-trials 10 \
    --epochs 25 \
    --wandb \
    --device cuda \
    --n-jobs 1

print("\n" + "="*50)
print("Phase 1 complete! Run the next cell to see results.")
print("="*50)

### Phase 1 Results

In [None]:
import json
import glob
import os

# Find the most recent Phase 1 best_params.json
phase1_files = glob.glob('checkpoints/optuna/cricket_gnn_phase1_coarse_*/best_params.json')

if phase1_files:
    PHASE1_BEST = max(phase1_files, key=os.path.getmtime)
    print(f"Phase 1 results: {PHASE1_BEST}")
    print("="*50)
    
    with open(PHASE1_BEST) as f:
        best_params = json.load(f)
    
    for key, value in best_params.items():
        print(f"{key}: {value}")
else:
    print("No Phase 1 results found. Run Phase 1 first.")
    PHASE1_BEST = None

### Phase 2: Architecture Tuning

Tune num_layers and num_heads using Phase 1's best hidden_dim and lr.

In [None]:
# Phase 2: Architecture tuning - num_layers and num_heads
if PHASE1_BEST:
    !python scripts/hp_search.py \
        --phase phase2_architecture \
        --n-trials 12 \
        --epochs 25 \
        --best-params "{PHASE1_BEST}" \
        --wandb \
        --device cuda \
        --n-jobs 1
    
    print("\n" + "="*50)
    print("Phase 2 complete!")
    print("="*50)
else:
    print("ERROR: Run Phase 1 first!")

### Phase 3: Training Dynamics

Fine-tune learning rate, dropout, and weight_decay.

In [None]:
import glob
import os

# Find Phase 2 results
phase2_files = glob.glob('checkpoints/optuna/cricket_gnn_phase2_architecture_*/best_params.json')

if phase2_files:
    PHASE2_BEST = max(phase2_files, key=os.path.getmtime)
    print(f"Using Phase 2 results: {PHASE2_BEST}")
    
    # Phase 3: Training dynamics - lr, dropout, weight_decay
    !python scripts/hp_search.py \
        --phase phase3_training \
        --n-trials 15 \
        --epochs 30 \
        --best-params "{PHASE2_BEST}" \
        --wandb \
        --device cuda \
        --n-jobs 1
    
    print("\n" + "="*50)
    print("Phase 3 complete!")
    print("="*50)
else:
    print("ERROR: Run Phase 2 first!")

### Phase 4: Loss Function Tuning

Optimize focal_gamma and class weighting.

In [None]:
import glob
import os

# Find Phase 3 results
phase3_files = glob.glob('checkpoints/optuna/cricket_gnn_phase3_training_*/best_params.json')

if phase3_files:
    PHASE3_BEST = max(phase3_files, key=os.path.getmtime)
    print(f"Using Phase 3 results: {PHASE3_BEST}")
    
    # Phase 4: Loss function - focal_gamma, use_class_weights
    !python scripts/hp_search.py \
        --phase phase4_loss \
        --n-trials 10 \
        --epochs 30 \
        --best-params "{PHASE3_BEST}" \
        --wandb \
        --device cuda \
        --n-jobs 1
    
    print("\n" + "="*50)
    print("Phase 4 complete! All phases done.")
    print("="*50)
else:
    print("ERROR: Run Phase 3 first!")

## 7. Final Results

Display the best hyperparameters from all phases.

In [None]:
import glob
import json
import os

# Show final results from Phase 4 (or latest completed phase)
for phase_name, pattern in [
    ("Phase 4 (Loss)", "phase4_loss"),
    ("Phase 3 (Training)", "phase3_training"),
    ("Phase 2 (Architecture)", "phase2_architecture"),
    ("Phase 1 (Coarse)", "phase1_coarse"),
]:
    files = glob.glob(f'checkpoints/optuna/cricket_gnn_{pattern}_*/best_params.json')
    if files:
        latest = max(files, key=os.path.getmtime)
        print(f"=== {phase_name} ===")
        print(f"File: {latest}")
        with open(latest) as f:
            params = json.load(f)
        for k, v in params.items():
            print(f"  {k}: {v}")
        print()
        break
else:
    print("No results found. Run the HP search phases first.")
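
The "find the newest `best_params.json` for a phase" lookup appears in several cells above. It could be factored into one helper, sketched here under the assumption that results keep the `checkpoints/optuna/cricket_gnn_<phase>_*/best_params.json` layout used in this notebook. The function name `latest_best_params` is hypothetical.

```python
import glob
import json
import os


def latest_best_params(phase, root="checkpoints/optuna"):
    """Return (path, params) for the newest best_params.json of a phase, or None."""
    pattern = os.path.join(root, f"cricket_gnn_{phase}_*", "best_params.json")
    matches = glob.glob(pattern)
    if not matches:
        return None
    # Several runs of the same phase may exist; pick the most recently modified
    path = max(matches, key=os.path.getmtime)
    with open(path) as f:
        return path, json.load(f)
```

For example, `latest_best_params("phase2_architecture")` returns `None` until Phase 2 has produced a result, which makes the "run the previous phase first" checks a single `if`.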

## 8. Download Results

Download all checkpoints and results to your local machine.

In [None]:
import os

from google.colab import files

# Create results archive
!zip -r results.zip checkpoints/optuna/ optuna_studies.db

# Show archive contents
!unzip -l results.zip | head -30
print("...")

# Get file size
size_mb = os.path.getsize('results.zip') / 1e6
print(f"\nTotal archive size: {size_mb:.1f} MB")

# Download
print("\nStarting download...")
files.download('results.zip')

## Troubleshooting

**Git clone fails (private repo):**
- Error: `could not read Username for 'https://github.com'`
- Solution: Use a GitHub Personal Access Token (PAT) - the notebook prompts for this securely
- Create a PAT at [github.com/settings/tokens](https://github.com/settings/tokens) with `repo` scope

**GPU not detected:**
- Go to Runtime > Change runtime type > Hardware accelerator > T4 GPU

**Out of memory:**
- Reduce batch size: add `--batch-size 32` to the hp_search command
- Restart runtime: Runtime > Restart runtime

**Session disconnected:**

Colab has two timeout mechanisms:
- **Browser inactivity (~90 min):** If your browser tab is idle (no interaction), Colab disconnects
- **Max runtime (12 hours):** Hard limit regardless of activity

To prevent idle disconnects: keep the tab visible, occasionally scroll/click, don't let your computer sleep.

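To recover:
1. Click "Reconnect" in the toolbar
2. Re-run sections 1-5 (setup)
3. Section 6 will use cached data if available
4. Your Optuna study is saved to SQLite (`optuna_studies.db`) and resumes automatically, provided the file is still on disk. If the runtime was recycled, local files are gone unless you copied them elsewhere, for example to Google Drive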
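
Since local files do not survive a recycled runtime, one option is to copy the results to Google Drive between phases. The sketch below uses only the standard library. The helper name `backup_results` and the destination folder are hypothetical; the source paths are the ones this notebook archives in section 8.

```python
import os
import shutil


def backup_results(dest_dir, sources=("checkpoints/optuna", "optuna_studies.db")):
    """Copy result files and folders into dest_dir, skipping any that do not exist yet."""
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for src in sources:
        if not os.path.exists(src):
            continue
        target = os.path.join(dest_dir, os.path.basename(os.path.normpath(src)))
        if os.path.isdir(src):
            # dirs_exist_ok lets repeated backups overwrite the previous copy
            shutil.copytree(src, target, dirs_exist_ok=True)
        else:
            shutil.copy2(src, target)
        copied.append(target)
    return copied
```

For example, `backup_results('/content/drive/MyDrive/cricket_hp_backup')` after each phase finishes. Run it between phases rather than during a search, so the SQLite file is not copied while it is being written. To restore after a reset, copy the files back into the repository folder before re-running section 6.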

**WandB errors:**
- Run `wandb.login()` again if your session expired

**Why pip instead of conda?**

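Colab uses pip, not conda. Many packages (torch, numpy, pandas, sklearn, matplotlib) are pre-installed. Section 3 adds what is missing (torch-geometric, optuna, wandb) and pins torch to match the local environment. For packages that are already present, such as tqdm or scikit-learn, `pip install` without a version pin changes nothing.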