# IaC-GPT GPU Training (Kaggle/Colab)

Train nanochat IaC-GPT on GPU accelerators.

**Setup:**
1. Kaggle: Settings → Accelerator → GPU T4 x2 or P100 x2
2. Colab: Runtime → Change runtime type → GPU (T4)
3. Run all cells
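
The two hosts can be told apart at runtime by their working directories. A minimal sketch, assuming the same `/content` and `/kaggle/working` paths that the clone cell checks; `detect_platform` and its `exists` parameter are hypothetical names used only for illustration:

```python
import os
from typing import Callable


def detect_platform(exists: Callable[[str], bool] = os.path.exists) -> str:
    """Guess the notebook host from its well-known working directory."""
    if exists("/kaggle/working"):
        return "kaggle"
    if exists("/content"):
        return "colab"
    return "local"  # neither directory found; assume a local machine


print(f"Detected platform: {detect_platform()}")
```

Passing the `exists` check as a parameter keeps the helper easy to exercise outside a notebook.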

**GPU Support:**
- Kaggle: GPU T4 x2 (2x 16GB VRAM, 30 hrs/week free) or P100 x2 (2x 16GB)
- Colab: T4 (16GB VRAM, limited hours)
- Native bfloat16 on Ampere+ GPUs (A100, H100)
- ~2-4 hours for full d12 training on T4 x2
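
The bfloat16 note follows from CUDA compute capability: Ampere starts at 8.0, while the T4 (7.5) and P100 (6.0) sit below it. A minimal sketch of that decision; `pick_autocast_dtype` is a hypothetical helper, and the float16 fallback is an assumption rather than something this notebook states about nanochat:

```python
def pick_autocast_dtype(compute_capability: tuple[int, int]) -> str:
    """Return a mixed-precision dtype name for a CUDA compute capability."""
    major, _minor = compute_capability
    # Ampere and newer (8.x+) support bfloat16 natively.
    # Assumption: older GPUs fall back to float16.
    return "bfloat16" if major >= 8 else "float16"


for name, cap in [("P100", (6, 0)), ("T4", (7, 5)), ("A100", (8, 0)), ("H100", (9, 0))]:
    print(f"{name}: {pick_autocast_dtype(cap)}")
```

At runtime the tuple can be read with `torch.cuda.get_device_capability(0)`.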

In [None]:
# Install dependencies using uv (better dependency resolution than pip)
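# Step 1: Install uv
!curl -LsSf https://astral.sh/uv/install.sh | sh

# Each `!` command runs in its own subshell, so `source`-ing an env file there
# has no lasting effect. Extend PATH from Python instead; the installer places
# uv in ~/.local/bin or ~/.cargo/bin depending on its version.
import os
os.environ["PATH"] = os.pathsep.join([
    os.path.expanduser("~/.local/bin"),
    os.path.expanduser("~/.cargo/bin"),
    os.environ["PATH"],
])

# Step 2: Use uv to install all dependencies (GPU version - no torch-xla)
!uv pip install --system \
    torch \
    tiktoken pyarrow filelock rustbpe wandb tabulate regex zstandard pyyaml \
    anthropic

print("✅ Installation complete via uv")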

In [None]:
# Clone nanochat repo (ROBUST FIX - prevents nested directories)
import os
import subprocess

# Determine root directory
root_dir = "/content" if os.path.exists("/content") else ("/kaggle/working" if os.path.exists("/kaggle/working") else os.path.expanduser("~"))
os.chdir(root_dir)
print(f"Working from: {os.getcwd()}")

# Clean up any nested mess
if os.path.exists("nanochat/nanochat"):
    print("⚠️  Detected nested directories, removing entire nanochat folder...")
    import shutil
    shutil.rmtree("nanochat")
    print("✅ Cleaned up nested directories")

# Clone or update
if os.path.exists("nanochat/.git"):
    print("✅ Updating existing nanochat repo...")
    subprocess.run(["git", "-C", "nanochat", "pull", "origin", "master"], check=True)
else:
    print("📥 Cloning fresh nanochat repo...")
    subprocess.run(["git", "clone", "https://github.com/holynakamoto/iacgpt.git", "nanochat"], check=True)

# Change to nanochat directory
os.chdir("nanochat")
final_path = os.getcwd()
print(f"\n✅ Repository ready at: {final_path}")

# Safety check
if final_path.count("nanochat") > 1:
    print("❌ ERROR: Still nested! Path contains 'nanochat' multiple times")
    print("   Please manually delete the nanochat folder and re-run this cell")
else:
    print("✅ Path is clean (no nesting)")

In [None]:
# Verify GPU detection
import torch

print("=" * 70)
print("GPU DETECTION TEST")
print("=" * 70)

if torch.cuda.is_available():
    print(f"✅ CUDA available: {torch.cuda.get_device_name(0)}")
    print(f"   GPU count: {torch.cuda.device_count()}")
    print(f"   CUDA version: {torch.version.cuda}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    
    # Test tensor operation
    x = torch.randn(3, 3, device='cuda')
    y = x @ x.t()
    print(f"\n✅ Test matmul successful: {y.shape}")
else:
    print("⚠️  No CUDA GPU detected! Training will be VERY slow on CPU.")
    print("   Change Kaggle accelerator to: GPU T4 x2 or P100")

print("=" * 70)

## GPU Support Status

✅ **Native GPU/CUDA support built into nanochat!**

The following files auto-detect and optimize for GPU:
- `common.py`: Auto-detects CUDA and handles device initialization
- `scripts/base_train.py`: Uses optimized CUDA kernels and mixed precision
- `gpt.py`: Flash Attention 2/3 for Ampere+ GPUs, SDPA fallback for older GPUs

No manual configuration needed - just run the training command!
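
The auto-detection described above can be pictured roughly as follows. This is an illustrative sketch, not the code in `common.py` or `gpt.py`: both helper names are hypothetical, and the "compute capability 8.0 or higher means Flash Attention" rule is an assumption drawn from the "Ampere+" wording.

```python
from typing import Optional


def autodetect_device() -> str:
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is no GPU path
    return "cuda" if torch.cuda.is_available() else "cpu"


def select_attention_backend(device: str,
                             compute_capability: Optional[tuple[int, int]]) -> str:
    """Pick "flash" on Ampere+ CUDA GPUs and "sdpa" everywhere else."""
    if device == "cuda" and compute_capability is not None:
        if compute_capability[0] >= 8:  # assumption: Ampere starts at 8.0
            return "flash"
    return "sdpa"


device = autodetect_device()
print(f"device={device}, T4 backend={select_attention_backend('cuda', (7, 5))}")
```

Under these assumptions, both Kaggle options (T4 and P100) would take the SDPA fallback path.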

In [None]:
# Setup Claude API for automated log analysis (OPTIONAL)
import os
from getpass import getpass

# Check if API key is already set
if not os.environ.get("ANTHROPIC_API_KEY"):
    try:
        api_key = getpass("Enter your Anthropic API key (or press Enter to skip): ")
        if api_key:
            os.environ["ANTHROPIC_API_KEY"] = api_key
            print("✅ API key set! Claude will analyze logs automatically.")
        else:
            print("⚠️  Skipped. Logs won't be auto-analyzed by Claude.")
    except Exception:
        print("⚠️  Could not set API key. Continuing without auto-analysis.")
else:
    print("✅ API key already set from environment.")

## Prepare IaC Training Data

**Expanded corpus: 110+ repos across Terraform, Kubernetes, Ansible, Crossplane, Helm, Docker, Pulumi**

**⚡ Smart Caching:**
- **First run**: Scrapes 110+ repos (~15-30 min) → creates 8-15 parquet shards
- **Future runs**: Uses cached Kaggle Dataset (~5 seconds) if available

**To cache data across sessions:**
1. After first run, click **"Save Version"** → **"Save & Run All"**
2. In Output sidebar, click **"⋮"** next to `/kaggle/working/iac_training_data` → **"Create Dataset"**
3. Name it `iac-training-corpus`, click **"Create"**
4. In future notebooks: **"Add Input"** → search `iac-training-corpus` → attach it
5. Cell will auto-detect and use cached data!
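
The cache lookup amounts to searching the attached inputs for parquet files. The sketch below is a slightly more permissive variant than the top-level check in the data-preparation cell: it also searches subdirectories, on the assumption that an attached dataset may nest its files. `find_cached_shards` is a hypothetical helper name.

```python
import glob
import os
from typing import Optional


def find_cached_shards(input_root: str = "/kaggle/input") -> Optional[str]:
    """Return the first directory under input_root that contains .parquet files."""
    # "**" with recursive=True matches zero or more nested directories.
    pattern = os.path.join(input_root, "**", "*.parquet")
    shards = sorted(glob.glob(pattern, recursive=True))
    return os.path.dirname(shards[0]) if shards else None


print(find_cached_shards() or "No cached shards found; scraping would run.")
```

Returning `None` rather than raising lets the caller fall through to the slow scraping path.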

In [None]:
import os, glob, subprocess, io, shutil
from contextlib import redirect_stdout, redirect_stderr

# Helper function to capture and analyze logs with Claude
def analyze_with_claude(logs, task_name):
    """Send logs to Claude for analysis"""
    if not os.environ.get("ANTHROPIC_API_KEY"):
        return  # Skip if no API key
    
    try:
        import anthropic
        client = anthropic.Anthropic()
        
        print("\n" + "=" * 80)
        print(f"🤖 Analyzing {task_name} with Claude...")
        print("=" * 80)
        
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=2000,
            messages=[{
                "role": "user",
                "content": f"""Analyze these logs from IaC data scraping in a Kaggle notebook.

Task: {task_name}

Logs:
{logs}

Provide:
1. Summary: What happened (success/failure counts, extracted file counts)
2. Issues: Any errors or warnings that need attention
3. Recommendations: How to improve results if only 1 shard was created

Be concise and actionable."""
            }]
        )
        
        print(response.content[0].text)
        print("=" * 80 + "\n")
        
    except Exception as e:
        print(f"⚠️  Claude analysis failed: {e}\n")

# Setup directories
CACHE_DIR = os.path.expanduser("~/.cache/nanochat")
DATA_DIR = os.path.join(CACHE_DIR, "iac_data")
BASE_DATA = os.path.join(CACHE_DIR, "base_data")
KAGGLE_OUTPUT = "/kaggle/working/iac_training_data"

# Check for cached Kaggle Dataset
print("=" * 80)
print("Checking for cached IaC training data...")
print("=" * 80)

cached_data_path = None
# Look for any attached dataset with parquet files
for input_dir in glob.glob("/kaggle/input/*"):
    parquet_files = glob.glob(f"{input_dir}/*.parquet")
    if parquet_files:
        cached_data_path = input_dir
        print(f"✅ Found cached data: {cached_data_path}")
        print(f"   Contains {len(parquet_files)} parquet shards")
        break

if cached_data_path:
    # Use cached data (fast path)
    print("\n⚡ Using cached dataset - skipping 15-30 min scraping!")
    
    # Create symlink to cached data
    os.makedirs(CACHE_DIR, exist_ok=True)
    if os.path.islink(BASE_DATA):
        os.unlink(BASE_DATA)
    elif os.path.exists(BASE_DATA):
        shutil.rmtree(BASE_DATA)
    
    os.symlink(cached_data_path, BASE_DATA)
    
    shard_count = len(glob.glob(f'{BASE_DATA}/*.parquet'))
    print(f"✅ Loaded {shard_count} parquet shards from cache")
    
    # Show shard contents
    print("\nCached data contents:")
    for f in sorted(glob.glob(f'{BASE_DATA}/*.parquet')):
        size_mb = os.path.getsize(f) / (1024 * 1024)
        print(f"  {os.path.basename(f):30s} {size_mb:6.2f} MB")
    
    print("\n" + "=" * 80)
    print("✅ Data ready - proceed to Cell 9 to train tokenizer")
    print("=" * 80)

else:
    # No cache found - run full scraping (slow path)
    print("⚠️  No cached data found - running full scraping (~15-30 min)")
    print("💡 After this completes, save as Dataset to avoid re-scraping!\n")
    
    captured_logs = io.StringIO()
    
    print("=" * 80)
    print("Updating nanochat to latest version...")
    print("=" * 80)
    result = subprocess.run(["git", "pull", "origin", "master"], cwd=".", capture_output=True, text=True)
    print(result.stdout)
    captured_logs.write(result.stdout + "\n")
    if result.stderr:
        print(result.stderr)
        captured_logs.write(result.stderr + "\n")
    
    # Verify we have the expanded repo list
    result = subprocess.run(["grep", "-c", "terraform-aws-modules", "dev/fast_scrape_iac.sh"], 
                           capture_output=True, text=True)
    repo_count = int(result.stdout.strip()) if result.returncode == 0 else 0
    msg = f"✅ Script has {repo_count} terraform-aws-modules repos\n"
    print(msg)
    captured_logs.write(msg)
    
    if repo_count < 20:
        msg = "⚠️  WARNING: Script may not be updated! Should have 20+ terraform-aws-modules repos\n"
        print(msg)
        captured_logs.write(msg)
    else:
        msg = "✅ Script is updated with expanded repo list\n"
        print(msg)
        captured_logs.write(msg)
    
    # Scrape 110+ IaC repositories
    print("\n" + "=" * 80)
    print("Scraping 110+ IaC repositories...")
    print("This will take ~15-30 minutes")
    print("=" * 80)
    
    scrape_result = subprocess.run(["bash", "dev/fast_scrape_iac.sh"], 
                                   input="n\n", capture_output=True, text=True)
    print(scrape_result.stdout)
    captured_logs.write(scrape_result.stdout + "\n")
    if scrape_result.stderr:
        print(scrape_result.stderr)
        captured_logs.write(scrape_result.stderr + "\n")
    
    # Convert to training shards
    print("\n" + "=" * 80)
    print("Converting to parquet shards...")
    print("=" * 80)
    repack_result = subprocess.run([
        "python3", "dev/repackage_iac_data.py",
        "--input-dir", "data/iac_raw_cloned",
        "--output-dir", DATA_DIR,
        "--include-synthetic", "--include-docs"
    ], capture_output=True, text=True)
    print(repack_result.stdout)
    captured_logs.write(repack_result.stdout + "\n")
    if repack_result.stderr:
        print(repack_result.stderr)
        captured_logs.write(repack_result.stderr + "\n")
    
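    # Link base_data, replacing any stale symlink or directory so the fresh
    # shards are the ones counted and trained on
    if os.path.islink(BASE_DATA):
        os.unlink(BASE_DATA)
    elif os.path.exists(BASE_DATA):
        shutil.rmtree(BASE_DATA)
    os.symlink(DATA_DIR, BASE_DATA)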
    
    shard_count = len(glob.glob(f'{BASE_DATA}/*.parquet'))
    msg = f"\n{'=' * 80}\n✅ Data ready: {shard_count} shards\nLocation: {BASE_DATA}\n{'=' * 80}\n"
    print(msg)
    captured_logs.write(msg)
    
    # Show shard contents
    print("\nData directory contents:")
    for f in sorted(glob.glob(f'{BASE_DATA}/*.parquet')):
        size_mb = os.path.getsize(f) / (1024 * 1024)
        shard_info = f"  {os.path.basename(f):30s} {size_mb:6.2f} MB"
        print(shard_info)
        captured_logs.write(shard_info + "\n")
    
    # Copy to Kaggle output directory for dataset creation
    print("\n" + "=" * 80)
    print("Copying data to /kaggle/working for dataset creation...")
    print("=" * 80)
    
    if os.path.exists(KAGGLE_OUTPUT):
        shutil.rmtree(KAGGLE_OUTPUT)
    shutil.copytree(BASE_DATA, KAGGLE_OUTPUT)
    
    print(f"✅ Copied {len(os.listdir(KAGGLE_OUTPUT))} files to {KAGGLE_OUTPUT}")
    print("\n📋 To cache this data for future runs:")
    print("   1. Click 'Save Version' → 'Save & Run All'")
    print("   2. In Output sidebar → '⋮' next to iac_training_data → 'Create Dataset'")
    print("   3. Name: iac-training-corpus")
    print("   4. Future notebooks: Add Input → search iac-training-corpus")
    print("=" * 80)
    
    # Analyze logs with Claude
    analyze_with_claude(captured_logs.getvalue(), "IaC Data Scraping")

## Train Tokenizer

In [None]:
# Train BPE tokenizer
!python3 -m scripts.tok_train

## Train on GPU (CUDA)

Uses PyTorch DDP for multi-GPU training (T4 x2, P100 x2).
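
With DDP, each GPU processes its own micro-batch, so a second GPU halves the number of gradient-accumulation steps needed to reach a fixed global batch. A minimal sketch of that arithmetic; the 524,288-token global batch and 2,048-token sequence length are assumed values for illustration and are not taken from this notebook:

```python
def grad_accum_steps(total_batch_tokens: int, device_batch_size: int,
                     seq_len: int, world_size: int) -> int:
    """Micro-steps needed per optimizer step to reach the global token batch."""
    tokens_per_micro_step = device_batch_size * seq_len * world_size
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("global batch must be a multiple of the per-step token count")
    return total_batch_tokens // tokens_per_micro_step


# Assumed values: 524,288-token global batch, 2,048-token sequences.
for gpus in (1, 2):
    steps = grad_accum_steps(524_288, device_batch_size=4, seq_len=2048, world_size=gpus)
    print(f"{gpus} GPU(s): {steps} accumulation steps per optimizer step")
```

Under these assumptions a device batch size of 4 needs 64 accumulation steps on one GPU and 32 on two.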

In [None]:
# GPU training command (optimized for T4 x2)
MODEL_DEPTH = 12
BATCH_SIZE = 4  # T4 has 16GB VRAM, conservative batch size
WINDOW_PATTERN = "L"  # Full attention (T4 has good memory bandwidth)

# Single GPU command
single_gpu_cmd = f"""python3 scripts/base_train.py \
    --depth={MODEL_DEPTH} \
    --device-batch-size={BATCH_SIZE} \
    --window-pattern={WINDOW_PATTERN} \
    --target-param-data-ratio=8 \
    --run=iac-gpt-kaggle \
    --model-tag=iac-gpt-gpu-d{MODEL_DEPTH} \
    --eval-every=100 \
    --sample-every=100 \
    --save-every=500"""

# Multi-GPU command (T4 x2, P100 x2)
multi_gpu_cmd = f"""torchrun --nproc_per_node=2 -m scripts.base_train -- \
    --depth={MODEL_DEPTH} \
    --device-batch-size={BATCH_SIZE} \
    --window-pattern={WINDOW_PATTERN} \
    --target-param-data-ratio=8 \
    --run=iac-gpt-kaggle \
    --model-tag=iac-gpt-gpu-d{MODEL_DEPTH} \
    --eval-every=100 \
    --sample-every=100 \
    --save-every=500"""

# Detect GPU count
import torch
gpu_count = torch.cuda.device_count()

print("=" * 80)
print(f"GPU Training Command ({gpu_count} GPU{'s' if gpu_count > 1 else ''}):")
print("=" * 80)

if gpu_count > 1:
    print(f"\nMulti-GPU (torchrun):\n{multi_gpu_cmd}\n")
    print("Or run in a new cell:")
    print(f"!{multi_gpu_cmd}")
else:
    print(f"\nSingle GPU:\n{single_gpu_cmd}\n")
    print("Or run in a new cell:")
    print(f"!{single_gpu_cmd}")

print("\n⚠️  IMPORTANT: Run cells 7 and 9 first to prepare data and tokenizer!")
print("=" * 80)

## Full Training Pipeline

**IMPORTANT: Run cells in this order:**

1. **Cells 1-3**: Setup (install dependencies, clone repo, verify GPU)
2. **Cell 5**: (Optional) Enter Anthropic API key for auto-log analysis
3. **Cell 7**: 🔴 Prepare IaC training data (uses cache if available, otherwise scrapes)
   - **First run**: ~15-30 min to scrape 110+ repos
   - **Future runs**: ~5 seconds with cached dataset
4. **Cell 9**: 🔴 Train BPE tokenizer on IaC data (~2-3 min)
5. **Cell 11**: Copy the training command and run it

**What each step does:**
- **Cell 7**: 
  - Checks for cached dataset first (fast path)
  - If no cache: Clones 110+ repos (Terraform, K8s, Ansible, Crossplane, Helm, Docker, Pulumi) → parquet shards
  - Saves data to `/kaggle/working/iac_training_data/` for dataset creation
- **Cell 9**: Trains 49K vocab BPE tokenizer on IaC corpus → saves to `~/.cache/nanochat/tokenizer/`
- **Training**: Pretrains d12 model (124M params) on IaC data with Muon optimizer

**Caching workflow (do this after first run):**
1. **After Cell 7 completes**, click **"Save Version"** → **"Save & Run All"**
2. In **Output** sidebar (right), find `/kaggle/working/iac_training_data/`
3. Click **"⋮"** → **"Create Dataset"**
4. Name: `iac-training-corpus`, click **"Create"**
5. **In future notebooks**: 
   - Click **"Add Input"** (right sidebar)
   - Search `iac-training-corpus`
   - Click **"+"** to attach
   - Cell 7 will auto-detect and use it!

**Expected corpus size:**
- 110+ repos → ~100-200MB raw IaC code
- ~50-100M tokens (after tokenization with compression ratio 3-4x)
- 8-15 parquet shards for training
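
These figures can be related to the training command's `--target-param-data-ratio=8`. The sketch below assumes the ratio multiplies the full 124M parameter count and that the trainer cycles through the shards once they are exhausted; both are assumptions about nanochat's behaviour, and the helper names are hypothetical.

```python
def training_token_budget(params: int, data_ratio: float) -> int:
    """Tokens to train on when the budget is data_ratio tokens per parameter."""
    return int(params * data_ratio)


def passes_over_corpus(budget_tokens: int, corpus_tokens: int) -> float:
    """How many times the corpus is seen while consuming the token budget."""
    return budget_tokens / corpus_tokens


budget = training_token_budget(124_000_000, 8)
for corpus in (50_000_000, 100_000_000):
    passes = passes_over_corpus(budget, corpus)
    print(f"corpus={corpus:,} tokens -> {passes:.1f} passes for {budget:,} tokens")
```

Under those assumptions the budget is about 992M tokens, or roughly 10 to 20 passes over a corpus of this size.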

**Training time on Kaggle:**
- **T4 x2**: ~3-4 hours for full training (good for d12)
- **P100 x2**: ~4-5 hours (older architecture)
- **A100** (paid): ~1-2 hours (best performance)

**For distributed training on multiple nodes:**
```bash
# On each node, run:
torchrun --nproc_per_node=2 \
    --nnodes=2 --node_rank=<0 or 1> \
    --master_addr=<ip> --master_port=29500 \
    -m scripts.base_train -- [args...]
```
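
The per-node command differs only in `--node_rank`, so it can be generated rather than typed by hand. A minimal sketch that assembles the same flags as the command above; `torchrun_command` is a hypothetical helper, and `10.0.0.1` and `--depth=12` are placeholders for the real master address and training arguments:

```python
import shlex
from typing import Sequence


def torchrun_command(node_rank: int, master_addr: str, nnodes: int = 2,
                     nproc_per_node: int = 2, master_port: int = 29500,
                     train_args: Sequence[str] = ()) -> list[str]:
    """Build the torchrun argument list for one node of a multi-node run."""
    return [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        "-m", "scripts.base_train", "--",
        *train_args,
    ]


# Placeholder address and arguments; substitute the real values.
for rank in range(2):
    print(shlex.join(torchrun_command(rank, "10.0.0.1", train_args=["--depth=12"])))
```

Returning a list rather than a string lets the result be passed directly to `subprocess.run` without shell quoting concerns.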