# IaC-GPT Training on Kaggle

Train a domain-specific Infrastructure-as-Code LLM for free on Kaggle GPUs.

## Setup:
1. Enable GPU: **Settings → Accelerator → GPU T4 x2**
2. Enable Internet: **Settings → Internet → On**
3. Run all cells

**Estimated training time (T4 x2):** d12 ~3 hrs, d16 ~8 hrs, d20 ~18 hrs

In [None]:
# === CONFIGURATION ===
# Change these settings based on your available Kaggle hours

MODEL_DEPTH = 12        # 12 (~300M, ~3hrs) | 16 (~500M, ~8hrs) | 20 (~800M, ~18hrs)
BATCH_SIZE = 4          # 4 for T4 GPUs (DO NOT use 8 - will OOM!)
NUM_GPUS = 2            # Kaggle T4 x2
WINDOW_PATTERN = "L"    # Full attention (don't use SSSL on T4s)

print(f"Training config: d{MODEL_DEPTH} model, batch_size={BATCH_SIZE}, gpus={NUM_GPUS}")
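
The depth setting above can be matched to the GPU hours you have left. The sketch below only encodes the time estimates quoted in this notebook; `EST_HOURS` and `pick_depth` are hypothetical helpers for illustration, not part of nanochat.

```python
# Rough training-time estimates (hours) quoted in this notebook for Kaggle T4 x2.
# These are the notebook's own ballpark figures, not measurements.
EST_HOURS = {12: 3, 16: 8, 20: 18}


def pick_depth(hours_available: float) -> int:
    """Return the largest depth whose estimated time fits in the given hours."""
    fitting = [depth for depth, hours in EST_HOURS.items() if hours <= hours_available]
    if not fitting:
        raise ValueError(f"No configuration fits in {hours_available} hours")
    return max(fitting)


print(f"Suggested MODEL_DEPTH for 9 free hours: {pick_depth(9)}")
```

Treat the result as a starting point only; actual run time depends on the dataset size and the GPUs Kaggle assigns.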

## 1. Setup Environment

In [None]:
# Clone the repo
!git clone https://github.com/holynakamoto/nanochat.git
%cd nanochat

# Check GPU
!nvidia-smi
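
To confirm that the accelerator setting took effect, the GPU count can also be checked programmatically. This is a minimal sketch that parses the output of `nvidia-smi -L`, which prints one `GPU n: ...` line per device; `count_gpus` and `detect_gpus` are hypothetical helper names.

```python
import subprocess


def count_gpus(listing: str) -> int:
    """Count devices in `nvidia-smi -L` output (one 'GPU n: ...' line per device)."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def detect_gpus() -> int:
    """Run `nvidia-smi -L`; return 0 if the tool is missing or exits with an error."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return 0
    return count_gpus(result.stdout)


found = detect_gpus()
print(f"Detected {found} GPU(s)")  # compare this against NUM_GPUS from the config cell
```

If the count is lower than `NUM_GPUS`, re-check the accelerator setting before starting a long run.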

In [None]:
# Install dependencies
!pip install -q tiktoken pyarrow filelock rustbpe wandb tabulate regex zstandard

# Install flash-attn (optional; the build may fail, and FlashAttention-2 does not support T4/Turing GPUs)
!pip install -q flash-attn --no-build-isolation 2>/dev/null || echo "Flash attention not available"

## 2. Collect IaC Training Data

In [None]:
# Scrape IaC repositories (takes ~10-15 min)
!bash dev/fast_scrape_iac.sh <<< 'n'

In [None]:
# Convert to training shards
!python3 dev/repackage_iac_data.py \
    --input-dir data/iac_raw_cloned \
    --output-dir ~/.cache/nanochat/iac_data \
    --include-synthetic \
    --include-docs

In [None]:
# Set up data directories: copy shard 0 to shard 1, then symlink the IaC data as base_data
!cp ~/.cache/nanochat/iac_data/shard_00000.parquet ~/.cache/nanochat/iac_data/shard_00001.parquet
!ln -sf ~/.cache/nanochat/iac_data ~/.cache/nanochat/base_data
!ls -la ~/.cache/nanochat/base_data/
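
Before moving on, it can help to confirm that the shards are visible through the `base_data` symlink. The sketch below only lists `.parquet` files with the standard library. `list_shards` is a hypothetical helper, and the two-shard expectation is an assumption based on the copy step above, not a documented requirement of the repo.

```python
from pathlib import Path


def list_shards(data_dir: Path) -> list[str]:
    """Return the sorted file names of all .parquet shards in data_dir."""
    return sorted(p.name for p in data_dir.glob("*.parquet"))


data_dir = Path.home() / ".cache" / "nanochat" / "base_data"
shards = list_shards(data_dir)
print(f"{len(shards)} shard(s) in {data_dir}: {shards}")

# Assumption: the cell above creates shard_00001 so that at least two shards exist.
if len(shards) < 2:
    print("Warning: fewer than two shards found; re-run the data preparation cells")
```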

## 3. Train Custom Tokenizer

In [None]:
# Train BPE tokenizer on IaC data (~1 min)
!python3 -m scripts.tok_train

In [None]:
# Evaluate tokenizer compression
!python3 -m scripts.tok_eval
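
Tokenizer compression is commonly reported as bytes of UTF-8 text per token, where higher means fewer tokens for the same input. The sketch below illustrates that metric only. It is not the implementation of `scripts.tok_eval`, and `toy_encode` is a stand-in encoder used so the example runs without a trained tokenizer.

```python
from typing import Callable, List


def bytes_per_token(text: str, encode: Callable[[str], List[int]]) -> float:
    """Compression ratio: UTF-8 bytes in `text` divided by the number of tokens."""
    n_tokens = len(encode(text))
    if n_tokens == 0:
        raise ValueError("encoder produced no tokens")
    return len(text.encode("utf-8")) / n_tokens


def toy_encode(text: str) -> List[int]:
    """Stand-in encoder (assumption): one token per whitespace-separated word."""
    return list(range(len(text.split())))


sample = 'resource "aws_s3_bucket" "logs" {}'
print(f"{bytes_per_token(sample, toy_encode):.2f} bytes/token")
```

To compare real tokenizers, pass their `encode` function in place of `toy_encode` and use the same sample text for each.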

## 4. Train IaC-GPT Base Model

In [None]:
# Train IaC-GPT model (in upstream nanochat, --run=dummy skips wandb logging)
import os
os.environ['OMP_NUM_THREADS'] = '1'

cmd = f"""torchrun --standalone --nproc_per_node={NUM_GPUS} -m scripts.base_train -- \
    --depth={MODEL_DEPTH} \
    --device-batch-size={BATCH_SIZE} \
    --window-pattern={WINDOW_PATTERN} \
    --target-param-data-ratio=5 \
    --run=dummy \
    --model-tag=iac-gpt-d{MODEL_DEPTH} \
    --eval-every=200 \
    --sample-every=500 \
    --save-every=1000"""

print(f"Running: {cmd}")
!{cmd}
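
The `--target-param-data-ratio=5` flag controls how much data the run consumes. The sketch below assumes, as in upstream nanochat, that the training token budget is roughly the ratio multiplied by the parameter count; the exact parameter accounting in this fork is not shown here. `token_budget` is a hypothetical helper.

```python
def token_budget(num_params: int, param_data_ratio: float) -> int:
    """Approximate training tokens as ratio x parameters (assumed interpretation)."""
    return int(num_params * param_data_ratio)


# ~286M parameters is the d12 figure quoted in this notebook's model stats.
d12_params = 286_000_000
tokens = token_budget(d12_params, 5)
print(f"d12 at ratio 5: about {tokens / 1e9:.2f}B training tokens")
```

A lower ratio shortens training at the cost of a less thoroughly trained model; raise it only if your GPU quota allows.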

## 5. Evaluate Base Model

In [None]:
# Evaluate base model
cmd = f"torchrun --standalone --nproc_per_node={NUM_GPUS} -m scripts.base_eval -- --device-batch-size={BATCH_SIZE}"
print(f"Running: {cmd}")
!{cmd}

## 6. Test Inference

In [None]:
# Quick inference test
!python3 -m scripts.chat_cli \
    --model ~/.cache/nanochat/base_checkpoints/iac-gpt-d{MODEL_DEPTH}/latest_checkpoint \
    --max-new-tokens 100

## 7. Download Model

In [None]:
# Compress model for download (-C stores paths relative to ~/.cache/nanochat)
!tar -czf iac_gpt_model.tar.gz -C ~/.cache/nanochat base_checkpoints
!tar -czf iac_gpt_tokenizer.tar.gz -C ~/.cache/nanochat tokenizer

print("\n✅ Model files ready for download:")
!ls -lh *.tar.gz
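
Before downloading, it is worth checking which top-level directory each archive will create when extracted, since that determines where the files land on your machine. This is a minimal sketch using the standard `tarfile` module; `top_level_entries` is a hypothetical helper.

```python
import os
import tarfile


def top_level_entries(archive_path: str) -> list[str]:
    """Return the sorted set of top-level names stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted({name.split("/")[0] for name in tar.getnames() if name})


for archive in ("iac_gpt_model.tar.gz", "iac_gpt_tokenizer.tar.gz"):
    if os.path.exists(archive):
        print(archive, "->", top_level_entries(archive))
    else:
        print(archive, "not found in the current directory")
```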

In [None]:
# Download via Kaggle Output panel
from IPython.display import FileLink
print("Download your trained model:")
display(FileLink('iac_gpt_model.tar.gz'))
display(FileLink('iac_gpt_tokenizer.tar.gz'))

## 🎉 Training Complete!

Your IaC-GPT model is ready!

### Next Steps:
1. Download the model files above
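2. Extract locally with `tar -xzf iac_gpt_model.tar.gz` and `tar -xzf iac_gpt_tokenizer.tar.gz`, then move the extracted `base_checkpoints` and `tokenizer` directories into `~/.cache/nanochat/`
3. Run inference from the cloned repo (replace `d12` with your `MODEL_DEPTH`): `python3 -m scripts.chat_cli --model ~/.cache/nanochat/base_checkpoints/iac-gpt-d12/latest_checkpoint`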

### Try it out:
- "Create a Terraform module for an EKS cluster"
- "Write a Kubernetes deployment for nginx"
- "Generate an Ansible playbook to deploy a web app"

---

**Model Stats:**
- Parameters: ~286M (d12)
- Training Data: 11,188 IaC files
- Training Time: ~3 hours
- Cost: **FREE on Kaggle!**