# SASLM v2 Training Notebook

Train the Sri Aurobindo Small Language Model with:
- Weighted sampling (mature works prioritized)
- Checkpointing to Google Drive (survives disconnects)
- Grokking detection
- Comprehensive logging
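
The weighted sampling idea can be pictured with a minimal sketch. The file names, the weight values, and the helper `sample_documents` are hypothetical illustrations, not the repository's actual sampler; the only point is that a document with a larger weight is drawn proportionally more often.

```python
import random
from collections import Counter

# Hypothetical per-document weights: more mature works get a larger weight.
DOC_WEIGHTS = {
    "doc_early.txt": 1.0,
    "doc_middle.txt": 2.0,
    "doc_mature.txt": 4.0,
}

def sample_documents(weights, k, seed=0):
    """Draw k document names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)

draws = sample_documents(DOC_WEIGHTS, k=7000)
counts = Counter(draws)
for name in DOC_WEIGHTS:
    print(f"{name}: {counts[name] / len(draws):.2%}")
```

With weights 1 : 2 : 4, the mature document should account for roughly four sevenths of the draws.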

## Experiments
- **EXP-A1**: From scratch, prose only
- **EXP-B1**: Fine-tune GPT-2, prose only
- **EXP-A2**: From scratch, prose + poetry
- **EXP-B2**: Fine-tune GPT-2, prose + poetry

## 1. Setup

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Define paths - CORPUS IS ON GOOGLE DRIVE
DRIVE_BASE = '/content/drive/MyDrive/saslm'
CORPUS_PATH = f'{DRIVE_BASE}/clean_prose'  # Your uploaded corpus
EXPERIMENTS_PATH = f'{DRIVE_BASE}/experiments'  # Checkpoints will be saved here
TOKENIZER_PATH = f'{DRIVE_BASE}/tokenizers/tokenizer_16k'  # Tokenizer on Drive too

print(f"Corpus path: {CORPUS_PATH}")
print(f"Experiments path: {EXPERIMENTS_PATH}")
print(f"Tokenizer path: {TOKENIZER_PATH}")

In [None]:
# Verify corpus exists on Drive
import os

corpus_files = os.listdir(CORPUS_PATH)
print(f"Found {len(corpus_files)} files in corpus:")
for f in sorted(corpus_files)[:10]:
    print(f"  {f}")
if len(corpus_files) > 10:
    print(f"  ... and {len(corpus_files) - 10} more")

In [None]:
# Clone the code repository to Colab's LOCAL storage (not Drive!)
# Code goes to /content/saslm, Data stays on Drive at /content/drive/MyDrive/saslm
import os

CODE_PATH = '/content/saslm'  # Local Colab storage (fast, temporary)

if os.path.exists(CODE_PATH):
    print('Repo already exists, pulling latest changes...')
    !cd {CODE_PATH} && git pull
else:
    print('Cloning repository...')
    !git clone https://github.com/maheshcr/saslm.git {CODE_PATH}

%cd {CODE_PATH}
print(f"\nWorking directory: {os.getcwd()}")

In [None]:
# Install dependencies
!pip install torch transformers tokenizers datasets wandb tqdm pyyaml numpy -q

In [None]:
# Verify GPU
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2. Train Tokenizer (if not already done)

The tokenizer is saved to Google Drive, so it only needs to be trained once. Later sessions reuse the saved copy.

In [None]:
# Check if tokenizer exists on Drive
import os

tokenizer_file = f'{TOKENIZER_PATH}/tokenizer.json'

if not os.path.exists(tokenizer_file):
    print("Training tokenizer (this takes ~5 minutes)...")
    !python src/data/train_tokenizer.py \
        --corpus {CORPUS_PATH} \
        --vocab-size 16384 \
        --output {TOKENIZER_PATH}
else:
    print(f"Tokenizer already exists at {TOKENIZER_PATH}")
    !python src/data/train_tokenizer.py --analyze {TOKENIZER_PATH} --corpus {CORPUS_PATH}

## 3. Select Experiment

In [None]:
# Choose experiment
EXPERIMENT = "EXP-A1"  # Options: EXP-A1, EXP-B1, EXP-A2, EXP-B2

config_map = {
    "EXP-A1": "configs/exp_a1_prose_only.yaml",
    "EXP-B1": "configs/exp_b1_prose_only_finetune.yaml",
    "EXP-A2": "configs/exp_a2_prose_poetry.yaml",
    "EXP-B2": "configs/exp_b2_prose_poetry_finetune.yaml",
}

CONFIG_PATH = config_map[EXPERIMENT]
print(f"Selected: {EXPERIMENT}")
print(f"Config: {CONFIG_PATH}")

In [None]:
# Update config to use Drive paths
import yaml

with open(CONFIG_PATH, 'r') as f:
    config = yaml.safe_load(f)

# Update paths to use Google Drive
config['data']['corpus_path'] = CORPUS_PATH
config['tokenizer']['tokenizer_path'] = TOKENIZER_PATH
config['hardware']['drive_path'] = EXPERIMENTS_PATH

# Save updated config
updated_config_path = f'{CODE_PATH}/configs/{EXPERIMENT.lower()}_updated.yaml'
with open(updated_config_path, 'w') as f:
    yaml.dump(config, f, default_flow_style=False)

print(f"Updated config saved to: {updated_config_path}")
print(f"\nKey paths:")
print(f"  Corpus: {config['data']['corpus_path']}")
print(f"  Tokenizer: {config['tokenizer']['tokenizer_path']}")
print(f"  Checkpoints: {config['hardware']['drive_path']}")
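
The override cell above assumes the YAML already contains `data`, `tokenizer`, and `hardware` sections; if one is missing, the assignment raises a `KeyError`. A small helper along these lines would create missing sections instead. The name `set_nested` and the example dictionary are hypothetical and not part of the repository.

```python
def set_nested(config, dotted_key, value):
    """Set config['a']['b'] = value for dotted_key 'a.b', creating missing sections."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        child = node.setdefault(key, {})
        if not isinstance(child, dict):
            raise TypeError(f"'{key}' is not a mapping, cannot set '{dotted_key}'")
        node = child
    node[leaf] = value
    return config

# Example with a config that lacks a 'hardware' section (made-up contents).
example = {"data": {"corpus_path": "old/path"}}
set_nested(example, "data.corpus_path", "/content/drive/MyDrive/saslm/clean_prose")
set_nested(example, "hardware.drive_path", "/content/drive/MyDrive/saslm/experiments")
print(example)
```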

In [None]:
# View updated config
!cat {updated_config_path}

## 4. Run Training

Training will:
- Auto-resume from checkpoint if disconnected
- Save checkpoints to Google Drive every 1000 steps
- Log metrics to wandb (optional)
- Watch for grokking (validation loss dropping sharply long after training loss has plateaued)

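**If Colab disconnects**: If the runtime was reset, re-run the Setup and Select Experiment cells first (the Drive mount, the local repo clone, and the updated config do not survive a reset), then re-run the training cell below. Training resumes from the latest checkpoint on Drive.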
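
Auto-resume only works if the latest checkpoint on Drive is complete, even when the session dies in the middle of a save. One common pattern is to write to a temporary file and then rename it over the target. The sketch below uses a plain JSON state and hypothetical names (`save_state_atomically`, `load_state_or_default`, `checkpoint_latest.json`); the actual training script presumably stores PyTorch state and may work differently.

```python
import json
import os
import tempfile

def save_state_atomically(state, path):
    """Write state to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # Atomic on a local POSIX filesystem; a Drive FUSE mount may behave differently.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_state_or_default(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}

# Simulate a run that stops at step 2500 and is then restarted.
ckpt_dir = tempfile.mkdtemp()
ckpt_path = os.path.join(ckpt_dir, "checkpoint_latest.json")

state = load_state_or_default(ckpt_path)       # fresh start: step 0
for step in range(state["step"] + 1, 2501):
    if step % 1000 == 0:                       # same cadence as the 1000-step saves above
        save_state_atomically({"step": step}, ckpt_path)

resumed = load_state_or_default(ckpt_path)     # a restart picks up at the last save
print(f"Resuming from step {resumed['step']}")
```

Work done after the last save (steps 2001 to 2500 in this simulation) is repeated after a restart, which is the cost of checkpointing at a fixed interval.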

In [None]:
# Optional: Login to Weights & Biases for tracking
# import wandb
# wandb.login()

In [None]:
# Run training (will auto-resume if checkpoint exists)
!python src/training/train.py \
    --config {updated_config_path} \
    --resume

## 5. Evaluate Model

In [None]:
# Load the best model and generate samples
import torch
from tokenizers import Tokenizer
import sys
sys.path.insert(0, '/content/saslm')

from src.training.train import GPT
from src.training.checkpoint_manager import CheckpointManager

# Load tokenizer from Drive
tokenizer = Tokenizer.from_file(f'{TOKENIZER_PATH}/tokenizer.json')
vocab_size = tokenizer.get_vocab_size()
print(f"Loaded tokenizer with {vocab_size:,} tokens")

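# Create model
# NOTE: these hyperparameters are hardcoded and must match the config the
# checkpoint was trained with; adjust them if your experiment uses other values.
model = GPT(
    vocab_size=vocab_size,
    block_size=512,
    n_layer=6,
    n_head=6,
    n_embd=384,
)

# Load best checkpoint from Drive (fall back to CPU if no GPU is attached)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
checkpoint_mgr = CheckpointManager(
    experiment_name=EXPERIMENT,
    base_path=EXPERIMENTS_PATH,
)
checkpoint_mgr.load_best(model, device=device)
model = model.to(device)
model.eval()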

print("Model loaded!")
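
As a rough sanity check on model size, the parameter count for the hyperparameters above can be estimated by hand. The sketch assumes a standard GPT-2-style block (fused QKV projection, 4x MLP expansion, biases, two LayerNorms per block), tied input and output embeddings, and a tokenizer that reached the requested 16,384 entries; the repository's `GPT` class may differ in these details. The function name `gpt_param_count` is hypothetical.

```python
def gpt_param_count(vocab_size, block_size, n_layer, n_embd, tied_embeddings=True):
    """Estimate parameters of a GPT-2-style decoder (biases and LayerNorms included)."""
    d = n_embd
    token_emb = vocab_size * d
    pos_emb = block_size * d
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)     # expand to 4d, project back to d
    layer_norms = 2 * (2 * d)                       # two LayerNorms per block
    per_block = attention + mlp + layer_norms
    final_norm = 2 * d
    lm_head = 0 if tied_embeddings else vocab_size * d
    return token_emb + pos_emb + n_layer * per_block + final_norm + lm_head

total = gpt_param_count(vocab_size=16384, block_size=512, n_layer=6, n_embd=384)
print(f"Estimated parameters: {total:,} (~{total / 1e6:.1f}M)")
```

The number of heads does not appear in the formula because splitting the embedding across heads does not change the size of the projection matrices.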

In [None]:
# Generate samples
prompts = [
    "The Supermind is",
    "The psychic being differs from the soul in that",
    "The goal of Integral Yoga is not merely liberation but",
    "In the process of spiritual evolution,",
    "The three modes of Nature are",
]

for prompt in prompts:
    # Encode
    encoded = tokenizer.encode(prompt)
    input_ids = torch.tensor([encoded.ids], device=next(model.parameters()).device)
    
    # Generate
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=100, temperature=0.8, top_k=50)
    
    # Decode
    generated = tokenizer.decode(output[0].tolist())
    
    print(f"\n{'='*60}")
    print(f"Prompt: {prompt}")
    print(f"Generated: {generated}")
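
The `temperature` and `top_k` arguments passed to `generate` control how the next token is picked. The pure-Python sketch below illustrates the usual technique on a toy vocabulary; `sample_top_k` and the logits are made up for illustration, and the repository's `generate` method may implement the step differently.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one token id from logits after temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the k highest-scoring token ids.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [logits[i] / temperature for i in top_ids]
    # Numerically stable softmax weights over the surviving logits.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]   # made-up scores for a 5-token vocabulary
rng = random.Random(0)
samples = [sample_top_k(toy_logits, temperature=0.8, top_k=2, rng=rng) for _ in range(1000)]
print({token_id: samples.count(token_id) for token_id in sorted(set(samples))})
```

With `top_k=2`, only token ids 0 and 1 can ever be sampled, and id 0 dominates because it has the higher logit.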

## 6. Run LLM Judge Evaluation

In [None]:
# Set API key for evaluation (choose one)
import os

# Option 1: OpenAI
# os.environ['OPENAI_API_KEY'] = 'your-key-here'

# Option 2: Anthropic
# os.environ['ANTHROPIC_API_KEY'] = 'your-key-here'

# Option 3: Google
# os.environ['GEMINI_API_KEY'] = 'your-key-here'

In [None]:
# Run evaluation
# !python src/evaluate.py \
#     --model-path {EXPERIMENTS_PATH}/{EXPERIMENT}/checkpoint_best.pt \
#     --tokenizer {TOKENIZER_PATH} \
#     --judge claude \
#     --output {DRIVE_BASE}/results/{EXPERIMENT}_eval.csv

## 7. View Training Curves

In [None]:
import json
import os
import matplotlib.pyplot as plt

# Load metrics from Drive
metrics_path = f'{EXPERIMENTS_PATH}/{EXPERIMENT}/metrics.jsonl'

if os.path.exists(metrics_path):
    train_losses = []
    val_losses = []

    with open(metrics_path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                data = json.loads(line)
            except json.JSONDecodeError:
                # A disconnect mid-write can leave a truncated line; skip it
                continue
            if 'train_loss' in data:
                train_losses.append((data['step'], data['train_loss']))
            if 'val_loss' in data:
                val_losses.append((data['step'], data['val_loss']))

    # Plot
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

    # Training loss
    if train_losses:
        x, y = zip(*train_losses)
        ax1.plot(x, y, label='Train Loss', alpha=0.7)
    if val_losses:
        x, y = zip(*val_losses)
        ax1.plot(x, y, label='Val Loss', alpha=0.7)
    ax1.set_xlabel('Step')
    ax1.set_ylabel('Loss')
    ax1.set_title('Training and Validation Loss')
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # Gap (for grokking detection)
    if train_losses and val_losses:
        train_dict = dict(train_losses)
        val_dict = dict(val_losses)
        common_steps = sorted(set(train_dict.keys()) & set(val_dict.keys()))
        if common_steps:
            gaps = [val_dict[s] - train_dict[s] for s in common_steps]
            ax2.plot(common_steps, gaps, label='Val - Train Gap', color='purple')
            ax2.axhline(y=0, color='gray', linestyle='--')
            ax2.set_xlabel('Step')
            ax2.set_ylabel('Gap')
            ax2.set_title('Generalization Gap (Grokking Indicator)')
            ax2.legend()
            ax2.grid(True, alpha=0.3)

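    plt.tight_layout()
    results_dir = f'{DRIVE_BASE}/results'
    os.makedirs(results_dir, exist_ok=True)  # savefig fails if the folder is missing
    plt.savefig(f'{results_dir}/{EXPERIMENT}_training_curves.png', dpi=150)
    plt.show()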
else:
    print(f"No metrics file found at {metrics_path}")
    print("Run training first!")
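
The gap plot is read by eye. A simple heuristic can also flag a candidate grokking step automatically: wait until training loss is below a threshold, track the highest validation loss seen after that point, and report the first step where validation loss falls to a fraction of that peak. The function `find_grokking_step`, its thresholds, and the loss values below are hypothetical and are not the detector used by the training script.

```python
def find_grokking_step(steps, train_losses, val_losses,
                       train_threshold=0.1, drop_fraction=0.5):
    """Return the first step where val loss falls to drop_fraction of its
    post-memorization peak, or None if that never happens."""
    memorized = False
    peak_val = float("-inf")
    for step, train, val in zip(steps, train_losses, val_losses):
        if train < train_threshold:
            memorized = True          # training loss has reached the low regime
        if not memorized:
            continue
        peak_val = max(peak_val, val)
        if val <= drop_fraction * peak_val:
            return step
    return None

# Synthetic curves: train loss collapses early, val loss drops much later.
steps = list(range(0, 10000, 1000))
train = [3.0, 1.5, 0.5, 0.08, 0.05, 0.04, 0.03, 0.03, 0.02, 0.02]
val = [3.1, 2.5, 2.4, 2.5, 2.6, 2.6, 2.5, 1.0, 0.4, 0.3]

grok_step = find_grokking_step(steps, train, val)
print(f"Candidate grokking step: {grok_step}")
```

Both thresholds would need tuning for real language-model losses, which rarely approach zero on natural text.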

## 8. Upload to HuggingFace (Optional)

In [None]:
# Login to HuggingFace
# from huggingface_hub import login
# login()

In [None]:
# Upload model
# from huggingface_hub import HfApi
# api = HfApi()
# 
# api.upload_folder(
#     folder_path=f'{EXPERIMENTS_PATH}/{EXPERIMENT}',
#     repo_id='your-username/saslm-v2',
#     repo_type='model',
# )

---

## Notes

### If Colab Disconnects
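If the runtime is still alive, re-run the "Run Training" cell. If the runtime was reset, re-run the notebook from Setup first, since the Drive mount and the local repo clone are lost. Either way, training resumes from the last checkpoint saved to Google Drive.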

### File Locations on Google Drive
```
/content/drive/MyDrive/saslm/
├── clean_prose/              # Your uploaded corpus (23 files)
├── tokenizers/
│   └── tokenizer_16k/        # Trained tokenizer
├── experiments/
│   └── EXP-A1/               # Checkpoints & metrics
│       ├── checkpoint_latest.pt
│       ├── checkpoint_best.pt
│       └── metrics.jsonl
└── results/                  # Evaluation results
```

### Expected Training Time
- EXP-A1 (from scratch): ~8-12 hours for 100K steps on T4
- EXP-B1 (fine-tune): ~4-6 hours for 50K steps on T4
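
These estimates translate into a throughput range that can be compared against the step rate seen in the training logs. The helpers below are a hypothetical convenience, not part of the repository; they only do the arithmetic on the figures listed above.

```python
def steps_per_second(total_steps, hours):
    """Average throughput implied by finishing total_steps in the given hours."""
    return total_steps / (hours * 3600)

def hours_remaining(total_steps, current_step, steps_per_sec):
    """Time left at a given throughput."""
    return (total_steps - current_step) / steps_per_sec / 3600

# (total steps, fastest hours, slowest hours) from the estimates above.
estimates = {"EXP-A1": (100_000, 8, 12), "EXP-B1": (50_000, 4, 6)}
for name, (total, fast_h, slow_h) in estimates.items():
    low = steps_per_second(total, slow_h)
    high = steps_per_second(total, fast_h)
    print(f"{name}: {low:.2f} to {high:.2f} steps/s")

# Example: EXP-A1 at step 40,000 running at the slow end of the range.
eta = hours_remaining(100_000, 40_000, steps_per_second(100_000, 12))
print(f"Remaining: {eta:.1f} hours")
```

Both experiments imply the same throughput range, since EXP-B1 halves both the step count and the time.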