# Phase 1: Image-like Latent Training on Google Colab

This notebook trains a text encoder to produce 2D visual latents (32×32×6) with natural image statistics.

**Goal**: Make the latent grid look image-like using only analytic priors (no semantic understanding).

**Training time**: ~2-3 hours on T4 GPU

---

## Setup Instructions

1. **Runtime → Change runtime type → T4 GPU**
2. Run all cells in order
3. Checkpoints save to Google Drive automatically
4. Results appear at the end

## 1. Environment Setup

In [None]:
# Check GPU availability
import torch
print("="*70)
print("GPU Check")
print("="*70)
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print(f"CUDA Version: {torch.version.cuda}")
else:
    print("⚠️  WARNING: No GPU found! Training will be very slow.")
    print("   Please go to Runtime → Change runtime type → T4 GPU")
print("="*70)

In [None]:
# Mount Google Drive to save checkpoints
from google.colab import drive
drive.mount('/content/drive')

# Create output directory on Drive
!mkdir -p /content/drive/MyDrive/blind_lm_outputs
print("✓ Google Drive mounted")
print("✓ Checkpoints will save to: /content/drive/MyDrive/blind_lm_outputs/")

In [None]:
# Clone the repository
!git clone https://github.com/jtooates/blind_lm.git
%cd blind_lm

print("\n" + "="*70)
print("Repository cloned successfully!")
print("="*70)

In [None]:
# Install dependencies
print("Installing dependencies...")
!pip install -q transformers scipy tqdm matplotlib

# Suppress tokenizer warning
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print("✓ Dependencies installed")

## 2. Data Preparation

In [None]:
# Check if training data exists, generate if needed
import os

if not os.path.exists('train_sentences.txt'):
    print("Generating training data (10,000 sentences)...")
    !python generate_sentences.py --num 10000 --complexity 1 --seed 42 --output train_sentences.txt
    print("✓ Training data generated")
else:
    print("✓ Training data already exists")

if not os.path.exists('val_sentences.txt'):
    print("Generating validation data (1,000 sentences)...")
    !python generate_sentences.py --num 1000 --complexity 1 --seed 100 --output val_sentences.txt
    print("✓ Validation data generated")
else:
    print("✓ Validation data already exists")

# Show stats
print("\n" + "="*70)
print("Data Statistics")
print("="*70)
!wc -l train_sentences.txt val_sentences.txt

print("\nSample sentences:")
!head -5 train_sentences.txt

## 3. Configuration

In [None]:
# Create Colab-optimized config
import json
import torch  # re-imported so this cell also works after a runtime restart

config = {
    "description": "Colab training configuration with T4 GPU optimizations",
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "output_dir": "/content/drive/MyDrive/blind_lm_outputs/phase1",

    "model": {
        "vocab_size": 50257,
        "max_seq_len": 64,
        "hidden_size": 384,
        "num_layers": 6,
        "num_heads": 8,
        "ffn_size": 1536,
        "dropout": 0.1,
        "grid_size": 32,
        "num_channels": 6,
        "use_rope": True,
        "use_smooth_head": True,
        "tokenizer_name": "gpt2"
    },

    "loss": {
        "lambda_spec": 0.5,
        "lambda_tv": 0.1,
        "lambda_wav": 0.1,
        "lambda_kurt": 0.05,
        "lambda_cov": 0.05,
        "lambda_var": 0.05
    },

    "training": {
        "batch_size": 128,  # Reduced for T4 GPU
        "lr": 2e-4,
        "beta1": 0.9,
        "beta2": 0.95,
        "weight_decay": 0.01,
        "warmup_steps": 1000,
        "num_epochs": 10,
        "max_steps": 50000,
        "ema_decay": 0.999,
        "grad_clip": 1.0,
        "blur_sigma": 0.8,
        "blur_warmup_steps": 2000
    },

    "data": {
        "train_file": "../train_sentences.txt",
        "val_file": "../val_sentences.txt",
        "num_workers": 2,  # Colab-optimized
        "file_format": "txt"
    },

    "eval": {
        "eval_interval": 500,
        "save_interval": 2000,
        "num_fixed_sentences": 16
    }
}

# Save config
!mkdir -p phase1/configs
with open('phase1/configs/phase1_colab.json', 'w') as f:
    json.dump(config, f, indent=2)

print("Configuration created:")
print(f"  Device: {config['device']}")
print(f"  Batch size: {config['training']['batch_size']}")
print(f"  Max steps: {config['training']['max_steps']}")
print(f"  Output: {config['output_dir']}")
print("\n✓ Ready to train!")

## 4. Training

This will take approximately **2-3 hours** on a T4 GPU.

The training loop will:
- Train for at most 50,000 steps (`max_steps`) or 10 epochs (`num_epochs`)
- Evaluate every 500 steps
- Save checkpoints every 2,000 steps to Google Drive
- Display progress bars and loss values
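
The two limits are far apart for the dataset generated in section 2. The arithmetic below is only a sketch: it assumes one optimizer step per batch and that the last partial batch is kept. How `train.py` actually combines `num_epochs` and `max_steps` is not shown in this notebook.

```python
import math

# Values copied from the configuration and data-preparation cells.
num_train_sentences = 10_000
batch_size = 128
num_epochs = 10
max_steps = 50_000

# Assumption: one optimizer step per batch, last partial batch kept.
steps_per_epoch = math.ceil(num_train_sentences / batch_size)
steps_for_all_epochs = steps_per_epoch * num_epochs
epochs_for_max_steps = math.ceil(max_steps / steps_per_epoch)

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Steps for {num_epochs} epochs: {steps_for_all_epochs}")
print(f"Epochs needed to reach max_steps: {epochs_for_max_steps}")
```

If the epoch limit turns out to be the binding one, the run would end before `warmup_steps` (1,000) and before the first periodic checkpoint (`save_interval` of 2,000), so it is worth confirming which limit `train.py` applies.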

In [None]:
# Run training
%cd phase1

print("="*70)
print("Starting Phase 1 Training")
print("="*70)
print("This will take approximately 2-3 hours on T4 GPU")
print("You can monitor progress below...")
print("="*70)
print()

!python train.py --config configs/phase1_colab.json

## 5. Monitor Training (Optional)

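Use this cell to inspect intermediate results. Colab executes one cell at a time, so if you run it while the training cell is still active, it is queued until training finishes or is interrupted.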

In [None]:
# Check training progress
import os
import json

output_dir = "/content/drive/MyDrive/blind_lm_outputs/phase1"

if os.path.exists(output_dir):
    print("Checkpoint files:")
    !ls -lh {output_dir}/checkpoint_*.pt
    
    # Try to load latest checkpoint and show metrics
    latest = os.path.join(output_dir, "checkpoint_latest.pt")
    if os.path.exists(latest):
        import torch
        checkpoint = torch.load(latest, map_location='cpu')
        print(f"\nCurrent step: {checkpoint['step']}")
        print(f"Current epoch: {checkpoint['epoch']}")
        
        if 'metrics_history' in checkpoint and checkpoint['metrics_history']:
            latest_metrics = checkpoint['metrics_history'][-1]
            print(f"\nLatest evaluation metrics:")
            print(f"  Loss: {latest_metrics.get('eval_loss', 'N/A')}")
            if 'eval_metrics' in latest_metrics:
                print(f"  Mean slope: {latest_metrics['eval_metrics'].get('mean_slope', 'N/A')}")
else:
    print("No checkpoints found yet. Training may not have started.")

## 6. Evaluation and Visualization

After training completes, generate a comprehensive evaluation report.

In [None]:
# Generate evaluation report
print("Generating evaluation report...")

eval_code = """
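import sys
# The script is written to /tmp, so add the working directory (phase1)
# and the repository root to the import path explicitly.
sys.path.append('.')
sys.path.append('..')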

from visualize import create_evaluation_report
from model import create_model
from dataloader import create_fixed_eval_set
from transformers import AutoTokenizer
import torch
import json
import os

output_dir = '/content/drive/MyDrive/blind_lm_outputs/phase1'

# Load config
with open(os.path.join(output_dir, 'config.json')) as f:
    config = json.load(f)

# Load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = create_model(config['model']).to(device)

# Load checkpoint
checkpoint_path = os.path.join(output_dir, 'checkpoint_latest.pt')
checkpoint = torch.load(checkpoint_path, map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])

print(f'Loaded checkpoint from step {checkpoint[\"step\"]}')

# Create eval set
tokenizer = AutoTokenizer.from_pretrained('gpt2')
eval_set = create_fixed_eval_set(tokenizer)

# Generate report
eval_report_dir = os.path.join(output_dir, 'eval_report')
summary = create_evaluation_report(
    model, eval_set, device=device, output_dir=eval_report_dir
)

print('\\nEvaluation complete!')
print(f'Report saved to: {eval_report_dir}')
"""

with open('/tmp/eval_script.py', 'w') as f:
    f.write(eval_code)

!python /tmp/eval_script.py

In [None]:
# Display evaluation results
from IPython.display import Image, display
import json
import os

eval_dir = "/content/drive/MyDrive/blind_lm_outputs/phase1/eval_report"

print("="*70)
print("EVALUATION RESULTS")
print("="*70)

# Show summary
summary_path = os.path.join(eval_dir, "evaluation_summary.json")
if os.path.exists(summary_path):
    with open(summary_path) as f:
        summary = json.load(f)
    
    print("\nSummary:")
    print(f"  Mean slope: {summary['mean_slope']:.2f}")
    print(f"  Slopes in target range [1.5, 2.5]: {summary['slopes_in_range']}/6")
    print(f"  Gradient kurtosis (h): {summary['kurtosis_h']:.2f}")
    print(f"  Gradient kurtosis (w): {summary['kurtosis_w']:.2f}")
    
    print("\nSlopes per channel:")
    for i, slope in enumerate(summary['slopes']):
        in_range = "✓" if 1.5 <= slope <= 2.5 else "✗"
        print(f"  Channel {i+1}: {slope:.2f} {in_range}")

print("\n" + "="*70)
print("VISUALIZATIONS")
print("="*70)

# Display visualizations
viz_files = [
    ("power_spectra.png", "Power Spectra (target: α ∈ [1.5, 2.5])"),
    ("channel_montage.png", "Channel Montage (16 sentences × 6 channels)"),
    ("gradient_histograms.png", "Gradient Distributions (target: kurtosis > 3)"),
    ("channel_covariance.png", "Channel Covariance (target: diagonal)")
]

for filename, title in viz_files:
    path = os.path.join(eval_dir, filename)
    if os.path.exists(path):
        print(f"\n{title}:")
        display(Image(path))
    else:
        print(f"\n⚠️  {filename} not found")

## 7. Interpret Results

### Pass Criteria

Phase 1 **PASSES** if:
- ✅ ≥ 4/6 channels have a power-spectrum slope α ∈ [1.5, 2.5]
- ✅ Gradient kurtosis > 3
- ✅ Channel covariance is approximately diagonal
- ✅ Visual montage shows smooth blobs/edges (no checkerboards)
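
The first two criteria can be checked mechanically from the summary dictionary loaded in section 6. The helper below is a hypothetical sketch, not part of the repository. It uses only the `slopes`, `kurtosis_h`, and `kurtosis_w` keys that the display cell reads, and it assumes both kurtosis values must exceed 3. The example values are made up for illustration. The covariance and montage criteria still need a visual check.

```python
def check_phase1_summary(summary, slope_range=(1.5, 2.5),
                         min_channels=4, min_kurtosis=3.0):
    """Return pass/fail flags for the numeric Phase 1 criteria.

    Hypothetical helper: `summary` is expected to look like the
    evaluation_summary.json dictionary read in section 6.
    """
    low, high = slope_range
    channels_in_range = sum(low <= s <= high for s in summary["slopes"])
    return {
        "slopes": channels_in_range >= min_channels,
        # Assumption: both gradient directions must be heavy-tailed.
        "kurtosis": (summary["kurtosis_h"] > min_kurtosis
                     and summary["kurtosis_w"] > min_kurtosis),
    }

# Made-up values for illustration only, not real training results.
example = {
    "slopes": [1.9, 2.1, 1.4, 2.6, 2.0, 1.7],
    "kurtosis_h": 4.2,
    "kurtosis_w": 3.8,
}
result = check_phase1_summary(example)
print(result)
```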

### What to Look For

**Power Spectra**: Each channel's log-log power spectrum should have a slope α between 1.5 and 2.5 (natural images follow roughly 1/f^α with α ≈ 2)
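
For intuition, the slope can be estimated by radially averaging the 2D power spectrum and fitting a line in log-log space. The NumPy sketch below is an assumption about how such a measurement might look, not the repository's implementation. The function names are hypothetical and the input is assumed to be a single square channel. `synth_power_law` only generates test noise with a known slope.

```python
import numpy as np

def radial_frequency(n):
    """Distance of each FFT bin from the zero frequency, in cycles per image."""
    freqs = np.fft.fftfreq(n) * n
    return np.hypot(freqs[:, None], freqs[None, :])

def spectral_slope(grid):
    """Estimate alpha in P(f) ~ 1/f^alpha for one square 2D channel."""
    n = grid.shape[0]
    power = np.abs(np.fft.fft2(grid - grid.mean())) ** 2
    bins = np.rint(radial_frequency(n)).astype(int).ravel()
    radial_mean = np.bincount(bins, weights=power.ravel()) / np.bincount(bins)
    f = np.arange(1, n // 2)  # skip DC, stay below the Nyquist radius
    slope, _ = np.polyfit(np.log(f), np.log(radial_mean[1:n // 2]), 1)
    return float(-slope)

def synth_power_law(n, alpha, seed=0):
    """Shape white Gaussian noise so its power falls off as 1/f^alpha."""
    rng = np.random.default_rng(seed)
    radius = radial_frequency(n)
    radius[0, 0] = 1.0  # avoid dividing by zero at the DC bin
    spectrum = np.fft.fft2(rng.standard_normal((n, n))) / radius ** (alpha / 2)
    return np.fft.ifft2(spectrum).real

estimate = spectral_slope(synth_power_law(128, alpha=2.0))
print(f"Estimated alpha: {estimate:.2f}")
```

On a 32×32 grid there are only about 15 usable frequency bins, so a single-sample estimate is noticeably noisier than on the 128×128 test image used here.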

**Channel Montage**: Should show varied patterns with:
- Smooth regions
- Clear edges
- Different patterns per channel
- NO checkerboard artifacts
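
Checkerboard artifacts concentrate energy at the highest spatial frequency in both directions, which gives a quick numeric check to go with visual inspection. The function below is a hypothetical sketch, not something the repository provides. It assumes a single channel with even height and width.

```python
import numpy as np

def checkerboard_fraction(grid):
    """Fraction of non-DC spectral energy at the (Nyquist, Nyquist) bin."""
    power = np.abs(np.fft.fft2(grid - grid.mean())) ** 2
    total = power.sum()
    if total == 0.0:  # constant grid: no structure at all
        return 0.0
    h, w = grid.shape
    return float(power[h // 2, w // 2] / total)

i, j = np.indices((32, 32))
checker = (-1.0) ** (i + j)  # pure pixel-level checkerboard
blob = np.exp(-((i - 16) ** 2 + (j - 16) ** 2) / (2 * 6.0 ** 2))  # smooth blob

print(f"checkerboard: {checkerboard_fraction(checker):.3f}")
print(f"smooth blob:  {checkerboard_fraction(blob):.6f}")
```

A value near 1 indicates an almost pure checkerboard, while smooth patterns should score close to 0.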

**Gradient Histograms**: Should show heavy tails (kurtosis > 3, where 3 is the value for a Gaussian)
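
As a rough illustration of the quantity being reported, the sketch below computes the kurtosis of finite-difference gradients along each axis. It assumes the non-excess (Pearson) definition, under which a Gaussian scores 3, and it assumes `kurtosis_h` and `kurtosis_w` refer to differences along height and width. The repository's exact definition may differ.

```python
import numpy as np

def gradient_kurtosis(grid):
    """Return (kurtosis_h, kurtosis_w) of finite-difference gradients."""
    def kurtosis(values):
        centered = values.ravel() - values.mean()
        return float(np.mean(centered ** 4) / np.mean(centered ** 2) ** 2)

    grad_h = np.diff(grid, axis=0)  # differences along height
    grad_w = np.diff(grid, axis=1)  # differences along width
    return kurtosis(grad_h), kurtosis(grad_w)

rng = np.random.default_rng(0)
noise = rng.standard_normal((256, 256))  # Gaussian gradients
block = np.zeros((64, 64))
block[16:48, 16:48] = 1.0                # flat regions with sharp edges

print("white noise:", gradient_kurtosis(noise))
print("sharp block:", gradient_kurtosis(block))
```

White noise should land near 3 in both directions, while the block image, whose gradients are zero almost everywhere with a few large jumps, should score far above 3.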

**Channel Covariance**: Should be near-diagonal (channels approximately uncorrelated)
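
One way to put a number on "near-diagonal" is the largest absolute off-diagonal entry of the channel correlation matrix. The sketch below is hypothetical and assumes latents laid out as `(batch, channels, height, width)`. The repository may store the 32×32×6 grid in a different order.

```python
import numpy as np

def max_offdiag_correlation(latents):
    """Largest |correlation| between two different channels.

    Assumes `latents` has shape (batch, channels, height, width).
    """
    num_channels = latents.shape[1]
    # One row per channel, pooling all samples and pixel positions.
    flat = np.moveaxis(latents, 1, 0).reshape(num_channels, -1)
    corr = np.corrcoef(flat)
    off_diagonal = corr[~np.eye(num_channels, dtype=bool)]
    return float(np.abs(off_diagonal).max())

rng = np.random.default_rng(0)
independent = rng.standard_normal((16, 6, 32, 32))
duplicated = independent.copy()
duplicated[:, 1] = duplicated[:, 0]  # channel 1 copies channel 0

print(f"independent channels: {max_offdiag_correlation(independent):.3f}")
print(f"duplicated channel:   {max_offdiag_correlation(duplicated):.3f}")
```

Values close to 0 correspond to a near-diagonal covariance, while values close to 1 mean at least two channels carry nearly the same signal.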

## 8. Download Checkpoints (Optional)

Download the final checkpoint and visualizations to your local machine.

In [None]:
# Create a zip file with important results
import shutil
import os

output_dir = "/content/drive/MyDrive/blind_lm_outputs/phase1"
zip_path = "/content/phase1_results.zip"

print("Creating results archive...")

# Create temporary directory
temp_dir = "/content/phase1_results_temp"
os.makedirs(temp_dir, exist_ok=True)

# Copy important files
files_to_include = [
    "config.json",
    "checkpoint_latest.pt",
    "eval_report/power_spectra.png",
    "eval_report/channel_montage.png",
    "eval_report/gradient_histograms.png",
    "eval_report/channel_covariance.png",
    "eval_report/evaluation_summary.json"
]

for file in files_to_include:
    src = os.path.join(output_dir, file)
    if os.path.exists(src):
        dst_dir = os.path.join(temp_dir, os.path.dirname(file))
        os.makedirs(dst_dir, exist_ok=True)
        shutil.copy2(src, os.path.join(temp_dir, file))
        print(f"  ✓ {file}")

# Create zip (make_archive appends the .zip extension and returns the archive path)
zip_path = shutil.make_archive('/content/phase1_results', 'zip', temp_dir)

# Download
from google.colab import files
print("\nDownloading...")
files.download(zip_path)

print("\n✓ Download started")

## 9. Next Steps

After Phase 1 passes:

1. **Phase 2**: Add semantic meaning via contrastive learning
   - Paraphrases should produce similar latents
   - Counterfactuals should produce different latents

2. **Phase 3**: Spatial jitter robustness
   - Latents should be invariant to small shifts

3. **Phase 4**: Add text decoder
   - Reconstruct text from latent

4. **Phase 5**: Round-trip generation
   - Generate paraphrases without copying

---

**Questions or issues?** Check the [project documentation](https://github.com/jtooates/blind_lm)