# IaC-GPT Training on Kaggle (v3.1)

Train a domain-specific Infrastructure-as-Code LLM for free on Kaggle GPUs.

**What's new in v3.1:**
- **Fixed T4 GPU hang**: Proper BF16 tensor core detection (2-5min compilation, not 27+min)
- Auto-detects compute capability to use FP16 on T4, BF16 on Ampere+
- No more "Tesla T4 does not support bfloat16 compilation natively" warnings

**v3 features:**
- Checkpoint persistence: auto-save to Kaggle Datasets
- Training data caching: never re-scrape
- Resume training: pick up exactly where you left off
- 4-tier data curriculum + secret sanitization

## Setup:
1. Enable GPU: **Settings → Accelerator → GPU P100**
2. Enable Internet: **Settings → Internet → On**
3. Set `KAGGLE_USERNAME` in the configuration cell (the first code cell)
4. Run all cells

**Estimated time:** d12 ~6-8hrs on P100 (single GPU)

In [None]:
# === CONFIGURATION ===
# Model depth should match corpus size to avoid overfitting:
#   d12 (~115M params) for < 100M tokens  <-- recommended for current IaC corpus
#   d16 (~400M params) for 100-500M tokens
#   d24 (~1.6B params) for 500M+ tokens

MODEL_DEPTH = 12        # 12 recommended for current corpus size
BATCH_SIZE = 2          # 2 for P100 GPU (prevents OOM)
NUM_GPUS = 1            # Kaggle P100 (single GPU)
WINDOW_PATTERN = "L"    # Full attention (don't use SSSL on single GPU)
DATA_RATIO = 8          # target-param-data-ratio (5=aggressive, 8=balanced, 10=conservative)

# === RESUME CONFIGURATION ===
# Set RESUME = True if continuing a previous training run.
# Set RESUME_STEP to the step number to resume from, or -1 to auto-detect the latest.
RESUME = False           # Set True to resume from a previous session
RESUME_STEP = -1         # -1 = auto-detect last checkpoint, or set a specific step number

# === KAGGLE IDENTITY ===
# Your Kaggle username (for saving datasets). Find it at kaggle.com -> Account.
KAGGLE_USERNAME = "YOUR_KAGGLE_USERNAME"  # <-- SET YOUR KAGGLE USERNAME HERE

# Dataset slugs (auto-generated, don't change)
CHECKPOINT_DATASET = f"iac-gpt-checkpoints-d{MODEL_DEPTH}"
TRAINING_DATA_DATASET = "iac-gpt-training-data"

print(f"Training config: d{MODEL_DEPTH} model, batch_size={BATCH_SIZE}, gpus={NUM_GPUS}, ratio={DATA_RATIO}")
print(f"Resume: {RESUME} (step={RESUME_STEP})")
print(f"Kaggle user: {KAGGLE_USERNAME}")
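
`DATA_RATIO` is passed to the trainer as `--target-param-data-ratio`. Assuming it means training tokens per model parameter (an assumption about the trainer, not something this notebook confirms), the token budget it implies can be estimated as in the sketch below. `token_budget` is a hypothetical helper, and the parameter count is the rough d12 figure from the configuration comments.

```python
def token_budget(num_params: int, data_ratio: float) -> int:
    """Estimated training tokens, ASSUMING data_ratio = tokens per parameter."""
    return int(num_params * data_ratio)


# Rough d12 size taken from the configuration comments (an approximation).
params_d12 = 115_000_000

for ratio in (5, 8, 10):
    budget = token_budget(params_d12, ratio)
    print(f"ratio={ratio}: ~{budget / 1e6:.0f}M tokens")
```

Under that assumption, a lower ratio shortens the run and a higher ratio lengthens it, which is how the aggressive/balanced/conservative labels can be read.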

## 1. Setup Environment

In [None]:
# Clone or update the repo
!git clone https://github.com/holynakamoto/iacgpt.git nanochat 2>/dev/null || \
    (cd nanochat && git pull origin master)
%cd nanochat

# Verify we have the T4 fix
!git log --oneline -1
print("\n✓ Expected: e9a4559 Fix T4 BF16 detection: Use compute capability...")

# Check GPU
!nvidia-smi

In [None]:
# Install dependencies
!pip install -q tiktoken pyarrow filelock rustbpe wandb tabulate regex zstandard pyyaml

# Install flash-attn (optional, may fail on older GPUs)
!pip install -q flash-attn --no-build-isolation 2>/dev/null || echo "Flash attention not available (OK for P100)"

In [None]:
# Verify BF16 Detection
import sys
sys.path.insert(0, '/kaggle/working/nanochat')
import torch
from common import has_bf16_support

print("=" * 70)
print("GPU BF16 DETECTION TEST")
print("=" * 70)
print(f"GPU: {torch.cuda.get_device_name(0)}")
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print(f"\ntorch.cuda.is_bf16_supported(): {torch.cuda.is_bf16_supported()} ← can be True even when BF16 is only emulated")
print(f"has_bf16_support(): {has_bf16_support()} ← correct (checks tensor cores)\n")

if has_bf16_support():
    print("✓ GPU has BF16 tensor cores -> will use BF16")
else:
    print("✓ GPU lacks BF16 tensor cores -> will use FP16")
    print("  (T4 runs FP16 on tensor cores; P100 has no tensor cores but supports native FP16)")
print("=" * 70)

## 2. Collect & Prepare IaC Training Data

The data pipeline:
1. **Scrape** 25 curated IaC repos (Terraform, K8s, Ansible, Crossplane, Docker)
2. **Sanitize** to remove secrets (AWS keys, SSH keys, API tokens, real IPs)
3. **Extract** Requirement→Code pairs for Tier 2 curriculum
4. **Repackage** into parquet shards for the dataloader

**If resuming:** This section checks for a cached training data dataset first.
If found, it skips the 15-minute scrape entirely.
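
The sanitize step can be pictured as a set of regex substitutions. The sketch below is illustrative only and is not the contents of `dev/sanitize_iac.py`; the patterns, the placeholder strings, and the `sanitize` helper are assumptions chosen for the example.

```python
import re

# (pattern, replacement) pairs. Illustrative, not exhaustive.
SECRET_PATTERNS = [
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "<AWS_ACCESS_KEY_ID>"),
    # PEM private key blocks (RSA, EC, OPENSSH, or unlabeled).
    (
        re.compile(
            r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
            re.DOTALL,
        ),
        "<PRIVATE_KEY>",
    ),
]


def sanitize(text: str) -> str:
    """Replace every match of each secret pattern with its placeholder."""
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


sample = 'access_key = "AKIAIOSFODNN7EXAMPLE"'
print(sanitize(sample))
```

Replacing secrets with fixed placeholders, rather than deleting the lines, keeps the surrounding configuration syntactically intact for training.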

In [None]:
import os, glob, subprocess

CACHE_DIR = os.path.expanduser("~/.cache/nanochat")
DATA_DIR = os.path.join(CACHE_DIR, "iac_data")
BASE_DATA = os.path.join(CACHE_DIR, "base_data")

# Check if cached training data exists (from a previous session's Kaggle Dataset)
cached_data_path = f"/kaggle/input/{TRAINING_DATA_DATASET}"
has_cached_data = os.path.isdir(cached_data_path) and bool(glob.glob(f"{cached_data_path}/*.parquet"))

if has_cached_data:
    print(f"Found cached training data at {cached_data_path}")
    os.makedirs(DATA_DIR, exist_ok=True)
    for f in glob.glob(f"{cached_data_path}/*"):
        dest = os.path.join(DATA_DIR, os.path.basename(f))
        if not os.path.exists(dest):
            os.symlink(f, dest)
    # Link base_data
    if os.path.islink(BASE_DATA):
        os.unlink(BASE_DATA)
    os.symlink(DATA_DIR, BASE_DATA)
    parquets = glob.glob(f"{BASE_DATA}/*.parquet")
    print(f"Loaded {len(parquets)} cached shard(s) - skipping scrape!")
else:
    print("No cached data found. Running full data pipeline...")
    # Step 1: Scrape IaC repositories (~10-15 min)
    subprocess.run(["bash", "dev/fast_scrape_iac.sh"], input=b"n", check=True)

    # Step 2: Sanitize - remove secrets before training
    print("\nSanitizing data...")
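    # Dry run first to preview what would change, then sanitize for real.
    subprocess.run(["python3", "dev/sanitize_iac.py", "--raw-dir", "data/iac_raw_cloned", "--dry-run"], check=False)
    # check=True: stop here rather than train on unsanitized data if sanitization fails.
    subprocess.run(["python3", "dev/sanitize_iac.py", "--raw-dir", "data/iac_raw_cloned"], check=True)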

    # Step 3: Convert to training shards
    print("\nCreating training shards...")
    subprocess.run([
        "python3", "dev/repackage_iac_data.py",
        "--input-dir", "data/iac_raw_cloned",
        "--output-dir", DATA_DIR,
        "--include-synthetic", "--include-docs"
    ], check=True)

    # Ensure at least 2 shards for distributed training
    shards = sorted(glob.glob(f"{DATA_DIR}/shard_*.parquet"))
    if len(shards) == 1:
        import shutil
        shutil.copy2(shards[0], os.path.join(DATA_DIR, "shard_00001.parquet"))

    # Link base_data
    if os.path.islink(BASE_DATA):
        os.unlink(BASE_DATA)
    os.symlink(DATA_DIR, BASE_DATA)

parquets = sorted(glob.glob(f"{BASE_DATA}/*.parquet"))
total_mb = sum(os.path.getsize(p) for p in parquets) / 1e6
print(f"\nTraining data ready: {len(parquets)} shards, {total_mb:.1f} MB")

## 3. Train Custom Tokenizer

In [None]:
# Train BPE tokenizer on IaC data (~1 min)
!python3 -m scripts.tok_train

In [None]:
# Evaluate tokenizer compression
!python3 -m scripts.tok_eval

## 4. Train IaC-GPT Base Model

If `RESUME = True`, the training command will load the checkpoint from a previous session
and continue from where it left off. The optimizer state, dataloader position, and loss
tracking are all restored automatically.
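
Because resuming needs the model weights, metadata, and optimizer state together, it can help to confirm that all three exist for a step before resuming from it. The following is a minimal sketch, assuming files are named `model_<step>.pt`, `meta_<step>.json`, and `optim_<step>*.pt` (inferred from the file patterns used in the save cell; the trainer's exact naming may differ). `complete_steps` is a hypothetical helper, not part of the repo.

```python
import re


def complete_steps(filenames):
    """Return sorted step numbers that have model, meta, and optimizer files."""
    # ASSUMED naming scheme; adjust the patterns if the trainer differs.
    kinds = {
        "model": re.compile(r"^model_(\d+)\.pt$"),
        "meta": re.compile(r"^meta_(\d+)\.json$"),
        "optim": re.compile(r"^optim_(\d+).*\.pt$"),
    }
    seen = {kind: set() for kind in kinds}
    for name in filenames:
        for kind, pattern in kinds.items():
            match = pattern.match(name)
            if match:
                seen[kind].add(int(match.group(1)))
    # A step is resumable only if all three kinds are present.
    return sorted(seen["model"] & seen["meta"] & seen["optim"])


files = [
    "model_000100.pt", "meta_000100.json", "optim_000100.pt",
    "model_000200.pt", "meta_000200.json",  # step 200 lacks optimizer state
]
print(complete_steps(files))
```

In the notebook, the input could be `os.listdir(CKPT_DIR)`, and the last element of the result would be the newest step that is safe to resume from.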

In [None]:
import os, glob, shutil

CKPT_DIR = os.path.expanduser(f"~/.cache/nanochat/base_checkpoints/iac-gpt-d{MODEL_DEPTH}")
resume_step = -1

if RESUME:
    # Look for checkpoint dataset from previous session
    ckpt_input = f"/kaggle/input/{CHECKPOINT_DATASET}"
    if os.path.isdir(ckpt_input) and glob.glob(f"{ckpt_input}/model_*.pt"):
        print(f"Restoring checkpoints from {ckpt_input}...")
        os.makedirs(CKPT_DIR, exist_ok=True)
        for f in glob.glob(f"{ckpt_input}/*"):
            dest = os.path.join(CKPT_DIR, os.path.basename(f))
            shutil.copy2(f, dest)

        # Find the latest step (compare step numbers numerically, not as strings)
        model_files = glob.glob(f"{CKPT_DIR}/model_*.pt")
        if model_files:
            resume_step = max(
                int(os.path.basename(p).replace("model_", "").replace(".pt", ""))
                for p in model_files
            )
            print(f"Found checkpoint at step {resume_step}")
        else:
            print("WARNING: No model files found in checkpoint dataset!")
            resume_step = -1
    else:
        print(f"WARNING: RESUME=True but no checkpoint dataset found at {ckpt_input}")
        print(f"Add your '{CHECKPOINT_DATASET}' dataset as an Input to this notebook.")
        resume_step = -1

    # Allow manual override
    if RESUME_STEP > 0:
        resume_step = RESUME_STEP
        print(f"Using manually specified resume step: {resume_step}")

    if resume_step > 0:
        print(f"\nWill resume training from step {resume_step}")
    else:
        print("\nNo checkpoint to resume from. Starting fresh.")
else:
    print("Fresh training run (RESUME=False)")

In [None]:
# Train IaC-GPT model
import os
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['PYTHONUNBUFFERED'] = '1'

cmd = f"""torchrun --standalone --nproc_per_node={NUM_GPUS} -m scripts.base_train -- \
    --depth={MODEL_DEPTH} \
    --device-batch-size={BATCH_SIZE} \
    --window-pattern={WINDOW_PATTERN} \
    --target-param-data-ratio={DATA_RATIO} \
    --run=dummy \
    --model-tag=iac-gpt-d{MODEL_DEPTH} \
    --eval-every=100 \
    --sample-every=100 \
    --save-every=100"""

# Add resume flag if we have a checkpoint
if resume_step > 0:
    cmd += f" \\\n    --resume-from-step={resume_step}"
    print(f"RESUMING from step {resume_step}")

print("=" * 80)
print("EXPECTED OUTPUT:")
print("  GPU: Tesla P100-PCIE-16GB | Peak FLOPS (FP16): ~1.9e+13")
print("  GPU does not support bfloat16, using float16 with GradScaler")
print("  Compilation: 2-5 minutes")
print("  Training loop starts showing loss values")
print("=" * 80)
print(f"\nCommand: {cmd}\n")

!{cmd}

## 4b. Save Checkpoints & Data to Kaggle Datasets

Run this cell after training completes (or if you need to stop early).
It uploads your checkpoints and training data as private Kaggle Datasets
so you can resume in a new session.

**To resume later:**
1. Start a new notebook session
2. Add these datasets as Inputs: `iac-gpt-checkpoints-d{MODEL_DEPTH}` (for example, `iac-gpt-checkpoints-d12` when `MODEL_DEPTH = 12`) and `iac-gpt-training-data`
3. Set `RESUME = True` in the config cell
4. Run all cells

In [None]:
import json, os, glob, shutil

def save_to_kaggle_dataset(source_dir, dataset_slug, username, file_patterns=("*",)):
    """Upload a directory's contents as a Kaggle Dataset."""
    staging = f"/kaggle/working/_staging_{dataset_slug}"
    if os.path.exists(staging):
        shutil.rmtree(staging)
    os.makedirs(staging)

    # Copy matching files to staging
    count = 0
    for pattern in file_patterns:
        for f in glob.glob(os.path.join(source_dir, pattern)):
            if os.path.isfile(f):
                shutil.copy2(f, staging)
                count += 1

    if count == 0:
        print(f"WARNING: No files matched in {source_dir}")
        return False

    # Create dataset-metadata.json
    meta = {
        "title": dataset_slug,
        "id": f"{username}/{dataset_slug}",
        "licenses": [{"name": "CC0-1.0"}]
    }
    with open(os.path.join(staging, "dataset-metadata.json"), "w") as f:
        json.dump(meta, f)

    # Try to create (first time) or update (subsequent times)
    import subprocess
    result = subprocess.run(
        ["kaggle", "datasets", "create", "-p", staging, "--dir-mode", "zip"],
        capture_output=True, text=True
    )
    if "already exists" in result.stderr.lower() or result.returncode != 0:
        result = subprocess.run(
            ["kaggle", "datasets", "version", "-p", staging, "-m", "auto-save", "--dir-mode", "zip"],
            capture_output=True, text=True
        )

    shutil.rmtree(staging)

    if result.returncode == 0:
        print(f"Saved {count} files to kaggle.com/datasets/{username}/{dataset_slug}")
        return True
    else:
        print(f"ERROR saving dataset: {result.stderr}")
        return False

# === Save checkpoints ===
ckpt_dir = os.path.expanduser(f"~/.cache/nanochat/base_checkpoints/iac-gpt-d{MODEL_DEPTH}")
if os.path.isdir(ckpt_dir) and glob.glob(f"{ckpt_dir}/model_*.pt"):
    print("Saving checkpoints...")
    save_to_kaggle_dataset(
        ckpt_dir, CHECKPOINT_DATASET, KAGGLE_USERNAME,
        file_patterns=("model_*.pt", "meta_*.json", "optim_*.pt")
    )
else:
    print(f"No checkpoints found at {ckpt_dir}")

# === Save training data (only on first run) ===
data_dir = os.path.expanduser("~/.cache/nanochat/iac_data")
if not has_cached_data and os.path.isdir(data_dir):
    print("\nSaving training data for future sessions...")
    save_to_kaggle_dataset(
        data_dir, TRAINING_DATA_DATASET, KAGGLE_USERNAME,
        file_patterns=("*.parquet",)
    )
else:
    print("\nTraining data already cached as dataset (skipping upload)")

## 5. Evaluate Base Model

In [None]:
# Evaluate base model
cmd = f"torchrun --standalone --nproc_per_node={NUM_GPUS} -m scripts.base_eval -- --device-batch-size={BATCH_SIZE}"
print(f"Running: {cmd}")
!{cmd}

## 6. Test Inference

In [None]:
# Quick inference test
!python3 -m scripts.chat_cli \
    -i base \
    -p "Write a Terraform module for an EKS cluster" \
    -t 0.6

## 7. Download Model

In [None]:
# Compress model for download (-C stores paths relative to ~/.cache/nanochat)
!tar -czf iac_gpt_model.tar.gz -C ~/.cache/nanochat base_checkpoints
!tar -czf iac_gpt_tokenizer.tar.gz -C ~/.cache/nanochat tokenizer

print("\n✅ Model files ready for download:")
!ls -lh *.tar.gz

In [None]:
# Download via Kaggle Output panel
from IPython.display import FileLink
print("Download your trained model:")
display(FileLink('iac_gpt_model.tar.gz'))
display(FileLink('iac_gpt_tokenizer.tar.gz'))

## Training Complete!

Your IaC-GPT model is ready.

### What's fixed in v3.1:
✅ **GPU compatibility**: Proper compute capability detection for P100/T4
- P100 (6.0) and T4 (7.5) both train in FP16: the T4 on its FP16 tensor cores, the P100 (which has no tensor cores) with native FP16 arithmetic
- Compilation: 2-5 minutes
- No more "does not support bfloat16" warnings
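
The capability check described above can be sketched as a pure function of the CUDA compute capability. This is an illustration under the assumption that BF16 is selected for compute capability 8.0 (Ampere) and newer; the repo's `has_bf16_support` may be implemented differently, and `pick_dtype_name` is a hypothetical name.

```python
def pick_dtype_name(major: int, minor: int) -> str:
    """Choose a training dtype name from a CUDA compute capability."""
    # Ampere (8.0) and newer have BF16 tensor cores; older GPUs fall back to FP16.
    return "bfloat16" if (major, minor) >= (8, 0) else "float16"


for gpu, cap in {"P100": (6, 0), "T4": (7, 5), "A100": (8, 0)}.items():
    print(f"{gpu} ({cap[0]}.{cap[1]}): {pick_dtype_name(*cap)}")
```

On a live GPU, the `(major, minor)` pair can come from `torch.cuda.get_device_capability(0)`, as in the verification cell.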

### Resume a killed session:
1. Start a **new** Kaggle notebook
2. Go to **Add Data** (right sidebar) and add your saved datasets:
   - `iac-gpt-checkpoints-d12` (your model checkpoints)
   - `iac-gpt-training-data` (your parquet shards)
3. Set `RESUME = True` in the config cell
4. Run all cells -- scraping is skipped, training picks up from last checkpoint

### Download & use locally:
```bash
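tar -xzf iac_gpt_model.tar.gz
tar -xzf iac_gpt_tokenizer.tar.gz
# Move the extracted base_checkpoints/ and tokenizer/ directories into ~/.cache/nanochat/, then run from the repo root:
python3 -m scripts.chat_cli -i base -p "Create a Terraform module for an EKS cluster"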
```

### Try prompts:
- "Create a Terraform module for an EKS cluster"
- "Write a Kubernetes deployment for nginx"
- "Generate an Ansible playbook to deploy a web app"

---

**Model Stats:**
- Parameters: ~286M (d12)
- Training Data: 11,188 IaC files
- Training Time: ~6-8 hours on P100 (single GPU)
- Cost: **FREE on Kaggle!**