# IaC-GPT Training on Kaggle

Train an Infrastructure-as-Code specialist LLM (~800M parameters at depth 20) for free on Kaggle GPUs.

**Setup:**
1. Enable GPU: Settings → Accelerator → GPU T4 x2
2. Enable Internet: Settings → Internet → On
3. Run all cells

**Estimated time:** 15-20 hours for the d20 model (`--depth=20`)

## 1. Setup Environment

In [None]:
# Clone the repo
!git clone https://github.com/holynakamoto/iacgpt.git
%cd iacgpt

# Check GPU
!nvidia-smi

In [None]:
# Install dependencies (Kaggle already has most, just need a few extras)
!pip install -q tiktoken pyarrow filelock rustbpe wandb tabulate regex zstandard

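# Optional: try flash-attn for faster attention. FlashAttention-2 targets Ampere or newer GPUs,
# so on Kaggle's T4 (Turing) the install or import is likely to fail and training uses the SDPA fallback.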
!pip install -q flash-attn --no-build-isolation 2>/dev/null || echo "Flash attention not available, using SDPA fallback"
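
Whether flash-attn is worth installing depends on the GPU's CUDA compute capability. The following is a minimal sketch of that check, assuming the FlashAttention-2 requirement of Ampere (compute capability 8.0) or newer; the function names `supports_flash_attn2` and `attention_backend` are hypothetical helpers, not part of the repo.

```python
def supports_flash_attn2(major: int, minor: int) -> bool:
    """Return True if a GPU with this compute capability meets the
    FlashAttention-2 requirement (assumed here: Ampere, 8.0, or newer)."""
    return (major, minor) >= (8, 0)


def attention_backend(major: int, minor: int) -> str:
    """Name the attention path to expect for a given compute capability."""
    return "flash-attn" if supports_flash_attn2(major, minor) else "sdpa"


# A T4 reports compute capability 7.5 (Turing).
print(attention_backend(7, 5))  # sdpa
```

On a live session the capability tuple can be read with `torch.cuda.get_device_capability()` and passed in as `attention_backend(*capability)`.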

## 2. Collect IaC Training Data

In [None]:
# Scrape IaC repositories (takes ~10-15 min); the here-string answers 'n' to the script's interactive prompt
!bash dev/fast_scrape_iac.sh <<< 'n'

In [None]:
# Convert to training shards
!python dev/repackage_iac_data.py \
    --input-dir data/iac_raw_cloned \
    --output-dir ~/.cache/nanochat/iac_data \
    --include-synthetic --include-docs

In [None]:
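# Set up data directories
# Copy shard 0 to shard 1 so that at least two shard files exist (presumably so the
# data loader has separate train and validation shards). This overwrites any existing shard_00001.parquet.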
!cp ~/.cache/nanochat/iac_data/shard_00000.parquet ~/.cache/nanochat/iac_data/shard_00001.parquet
!ln -sf ~/.cache/nanochat/iac_data ~/.cache/nanochat/base_data
!ls -la ~/.cache/nanochat/base_data/
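
The unconditional `cp` above overwrites `shard_00001.parquet` if the repackaging step already produced one. A more defensive version might look like the sketch below, which only duplicates a shard when fewer than two exist. It assumes the `shard_NNNNN.parquet` naming seen in the cell above; `ensure_min_shards` is a hypothetical helper, not a function from the repo.

```python
import shutil
from pathlib import Path


def ensure_min_shards(data_dir: Path, min_shards: int = 2) -> list[Path]:
    """Duplicate the last shard until data_dir holds at least min_shards files.

    Existing shards are never overwritten; new copies get the next free index.
    """
    shards = sorted(data_dir.glob("shard_*.parquet"))
    if not shards:
        raise FileNotFoundError(f"no shard_*.parquet files in {data_dir}")
    # Zero-padded names sort lexicographically, so the last entry has the highest index.
    next_index = int(shards[-1].stem.split("_")[1]) + 1
    while len(shards) < min_shards:
        new_path = data_dir / f"shard_{next_index:05d}.parquet"
        shutil.copyfile(shards[-1], new_path)
        shards.append(new_path)
        next_index += 1
    return shards
```

In the notebook this would be called as `ensure_min_shards(Path.home() / ".cache/nanochat/iac_data")`. When a shard is duplicated this way, any metric computed on the copy reflects training data rather than held-out data.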

## 3. Train Custom Tokenizer

In [None]:
# Train BPE tokenizer on IaC data (~1 min)
!python -m scripts.tok_train

In [None]:
# Evaluate tokenizer compression
!python -m scripts.tok_eval
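
Tokenizer compression is commonly reported as UTF-8 bytes per token: the higher the number, the fewer tokens a file costs. The sketch below shows that calculation on its own. Whether `scripts.tok_eval` reports exactly this metric is an assumption, and `bytes_per_token` and `byte_level_encode` are hypothetical names used only for illustration.

```python
from typing import Callable, Iterable, Sequence


def bytes_per_token(texts: Iterable[str], encode: Callable[[str], Sequence[int]]) -> float:
    """Average number of UTF-8 bytes represented by one token over all texts."""
    total_bytes = 0
    total_tokens = 0
    for text in texts:
        total_bytes += len(text.encode("utf-8"))
        total_tokens += len(encode(text))
    if total_tokens == 0:
        raise ValueError("encoder produced no tokens")
    return total_bytes / total_tokens


def byte_level_encode(text: str) -> list[int]:
    """Toy stand-in encoder: one token per UTF-8 byte, so the ratio is exactly 1.0."""
    return list(text.encode("utf-8"))


sample = ['resource "aws_vpc" "main" {', "  cidr_block = var.vpc_cidr", "}"]
print(bytes_per_token(sample, byte_level_encode))  # 1.0
```

Any callable that maps a string to a sequence of token ids can be passed as `encode`, for example `tiktoken.get_encoding("gpt2").encode` as a baseline to compare the custom tokenizer against.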

## 4. Train IaC-GPT Base Model

**Model options:**
- `--depth=12` (~300M params, ~3 hrs) - Quick test
- `--depth=16` (~500M params, ~8 hrs) - Good balance  
- `--depth=20` (~800M params, ~18 hrs) - Best quality

Adjust based on your available Kaggle hours.
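
The training commands in this section pass `--target-param-data-ratio=5`. Assuming that flag means "training tokens per model parameter" (an assumption about `scripts.base_train`, not something stated here), the token budget and the throughput implied by each time estimate can be sketched as follows. The parameter counts and hours are the approximate figures from the list above; `token_budget` and `implied_tokens_per_second` are hypothetical helper names.

```python
def token_budget(num_params: float, tokens_per_param: float) -> float:
    """Total training tokens for a given parameter count and data ratio."""
    return num_params * tokens_per_param


def implied_tokens_per_second(total_tokens: float, hours: float) -> float:
    """Throughput needed to consume total_tokens within the given wall-clock hours."""
    return total_tokens / (hours * 3600)


# depth -> (approximate parameter count, estimated hours), taken from the list above
options = {12: (300e6, 3), 16: (500e6, 8), 20: (800e6, 18)}

for depth, (params, hours) in options.items():
    tokens = token_budget(params, tokens_per_param=5)
    rate = implied_tokens_per_second(tokens, hours)
    print(f"d{depth}: {tokens / 1e9:.1f}B tokens, ~{rate / 1e3:.0f}k tokens/s implied")
```

For d20 this works out to 4.0B tokens, or roughly 62k tokens/s sustained over 18 hours. If the throughput logged during the first training steps is well below the implied figure, a smaller depth is the safer choice for the hours available.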

In [None]:
# Train d20 model (~800M params)
# For 2x T4 GPUs, use torchrun
# Estimated time: 15-20 hours

!OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=2 -m scripts.base_train -- \
    --depth=20 \
    --device-batch-size=8 \
    --target-param-data-ratio=5 \
    --run=dummy \
    --model-tag=iac-gpt-d20 \
    --eval-every=500 \
    --sample-every=1000 \
    --save-every=2000

In [None]:
# Alternative: Train d16 model (~500M params) - faster, ~8 hours
# Uncomment to use instead of d20

# !OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=2 -m scripts.base_train -- \
#     --depth=16 \
#     --device-batch-size=8 \
#     --target-param-data-ratio=5 \
#     --run=dummy \
#     --model-tag=iac-gpt-d16 \
#     --eval-every=500 \
#     --sample-every=1000 \
#     --save-every=2000

## 5. Evaluate Base Model

In [None]:
!torchrun --standalone --nproc_per_node=2 -m scripts.base_eval -- --device-batch-size=8

## 6. SFT: Add DevOps Architect Persona

In [None]:
# Generate IaC-specific identity conversations
import json
import os

conversations = [
    {'messages': [
        {'role': 'user', 'content': 'Who are you?'},
        {'role': 'assistant', 'content': 'I am IaC-GPT, a specialized AI assistant for Infrastructure-as-Code. I can help you write Terraform modules, Kubernetes manifests, Ansible playbooks, and Crossplane compositions.'}
    ]},
    {'messages': [
        {'role': 'user', 'content': 'Write a Terraform module for an AWS VPC'},
        {'role': 'assistant', 'content': '''Here is a Terraform module for an AWS VPC:\n\n```hcl\nvariable "vpc_cidr" {\n  default = "10.0.0.0/16"\n}\n\nresource "aws_vpc" "main" {\n  cidr_block = var.vpc_cidr\n  enable_dns_hostnames = true\n  tags = { Name = "main-vpc" }\n}\n```'''}
    ]},
    {'messages': [
        {'role': 'user', 'content': 'Create a Kubernetes deployment'},
        {'role': 'assistant', 'content': '''Here is a Kubernetes Deployment:\n\n```yaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: app\nspec:\n  replicas: 3\n  selector:\n    matchLabels:\n      app: app\n  template:\n    metadata:\n      labels:\n        app: app\n    spec:\n      containers:\n      - name: app\n        image: nginx:alpine\n        ports:\n        - containerPort: 80\n```'''}
    ]},
]

# Write one JSON object per line (JSONL) into the nanochat cache directory
out_path = os.path.expanduser('~/.cache/nanochat/iac_identity_conversations.jsonl')
os.makedirs(os.path.dirname(out_path), exist_ok=True)
with open(out_path, 'w') as f:
    for _ in range(50):  # Repeat for more training signal
        for conv in conversations:
            f.write(json.dumps(conv) + '\n')
print('Generated identity conversations')
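
Before spending GPU time on SFT, it can help to confirm that the identity file is well-formed. The sketch below checks that every line parses as JSON and that turns alternate between `user` and `assistant`, starting with `user`. That turn order is an assumption about what `scripts.chat_sft` expects, and `validate_conversations` is a hypothetical helper, not part of the repo.

```python
import json
from pathlib import Path


def validate_conversations(path: Path) -> int:
    """Return the number of conversations in a JSONL file.

    Raises ValueError on the first line that is not valid JSON, lacks a
    'messages' list, or does not alternate user/assistant turns.
    """
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # JSONDecodeError is a ValueError subclass
            messages = record.get("messages") if isinstance(record, dict) else None
            if not messages:
                raise ValueError(f"line {line_no}: missing or empty 'messages'")
            for i, msg in enumerate(messages):
                expected = "user" if i % 2 == 0 else "assistant"
                if msg.get("role") != expected:
                    raise ValueError(f"line {line_no}: message {i} should be from {expected}")
                if not isinstance(msg.get("content"), str) or not msg["content"]:
                    raise ValueError(f"line {line_no}: message {i} has no text content")
            count += 1
    return count
```

For the file written above, `validate_conversations(Path.home() / ".cache/nanochat/iac_identity_conversations.jsonl")` should return 150 (3 conversations repeated 50 times).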

In [None]:
# Run SFT
!torchrun --standalone --nproc_per_node=2 -m scripts.chat_sft -- --device-batch-size=8 --run=dummy

## 7. Test Your Model

In [None]:
# Test with IaC prompts
!python -m scripts.chat_cli -p "Write a Terraform module for an EKS cluster with 3 node groups"

In [None]:
!python -m scripts.chat_cli -p "Create a Kubernetes deployment with resource limits and health checks"

In [None]:
!python -m scripts.chat_cli -p "Review this for security issues: resource aws_s3_bucket data { acl = public-read }"

## 8. Download Model Checkpoint

Save your trained model before the Kaggle session ends!

In [None]:
# Zip the model checkpoint
!zip -r iac-gpt-model.zip ~/.cache/nanochat/checkpoints/ ~/.cache/nanochat/tokenizer/

# Download from Kaggle output (check Output tab after running)
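
If a directory passed to `zip` does not exist, the archive may be incomplete without it being obvious in a long log. As an alternative, the sketch below builds the archive in Python, reports any directory it skips, and returns the number of files written. `zip_dirs` is a hypothetical helper; the directory names are the ones used in the cell above.

```python
import zipfile
from pathlib import Path


def zip_dirs(dirs: list[Path], out_zip: Path) -> int:
    """Write every file under the given directories into out_zip.

    Missing directories are reported and skipped. Returns the file count,
    so a result of 0 signals that nothing was saved.
    """
    written = 0
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for d in dirs:
            if not d.is_dir():
                print(f"skipping missing directory: {d}")
                continue
            for file in sorted(d.rglob("*")):
                if file.is_file():
                    # Keep each top-level folder name inside the archive.
                    zf.write(file, arcname=file.relative_to(d.parent))
                    written += 1
    return written
```

On Kaggle, a call such as `zip_dirs([base / "checkpoints", base / "tokenizer"], Path("/kaggle/working/iac-gpt-model.zip"))` with `base = Path.home() / ".cache/nanochat"` places the archive in the notebook's output directory.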

In [None]:
# Or upload to HuggingFace Hub (optional)
# !pip install huggingface_hub
# from huggingface_hub import notebook_login
# notebook_login()
# !huggingface-cli upload holynakamoto/iac-gpt ~/.cache/nanochat/checkpoints/