# Goldilocks Reference Model

**This is the canonical foundation for all Azimuth experiments.**

Copy this notebook as your starting point. Everything here is locked down for reproducibility:

| Property | Value |
|----------|-------|
| Architecture | Rich (4L/128D/2H/256FF) |
| Parameters | ~1.05M |
| Vocab size | 3,988 |
| Dead tokens | 1,914 |
| Sequence length | 128 |
| Batch size | 8 |
| Dtype | bfloat16 |
| Optimizer | AdamW (lr=1e-3) |
| Seed | 42 |

**What you get:**
- ~100 steps/sec on M4 Pro
- Fimbulwinter onset around step 2000-3000
- 95%+ of dead tokens frozen by step 5000

**To use:** Copy this notebook and add your experiment code in the marked section.
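
The parameter count in the table can be sanity-checked by hand. The sketch below assumes PyTorch's standard `nn.TransformerEncoderLayer` layout (biased attention and feed-forward projections, two LayerNorms per layer) and counts the tied output head once. `count_params` is an illustrative helper, not part of the notebook.

```python
def count_params(vocab_size, d_model, n_layers, d_ff, seq_len):
    """Hand count of trainable parameters for the reference architecture."""
    tok_emb = vocab_size * d_model                 # shared with the output head (weight tying)
    pos_emb = seq_len * d_model                    # learned positional embedding
    attn = 3 * d_model * d_model + 3 * d_model     # fused Q/K/V projection + bias
    attn += d_model * d_model + d_model            # attention output projection + bias
    ff = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # two linear layers + biases
    norms = 2 * (2 * d_model)                      # two LayerNorms (weight + bias each)
    per_layer = attn + ff + norms
    final_norm = 2 * d_model                       # ln_f weight + bias
    return tok_emb + pos_emb + n_layers * per_layer + final_norm

total = count_params(vocab_size=3988, d_model=128, n_layers=4, d_ff=256, seq_len=128)
print(f"{total:,}")  # 1,057,024
```

The head count does not enter the formula, since splitting `d_model` across heads leaves the projection sizes unchanged. With the table's values, the token embedding accounts for a little under half of all parameters.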

## Parameters

**Do not modify these** unless you're intentionally deviating from the reference.

In [None]:
# === GOLDILOCKS REFERENCE PARAMETERS ===
# Do not modify unless intentionally deviating

import torch

# Paths: Goldilocks data lives in the sibling Goldilocks project
GOLDILOCKS_DATA = "../Goldilocks/data"
TOKENIZER_PATH = f"{GOLDILOCKS_DATA}/tokenizer.json"
TOKENS_PATH = f"{GOLDILOCKS_DATA}/model_corpus_tokens.safetensors"
CENSUS_PATH = f"{GOLDILOCKS_DATA}/token_census.json"

# Architecture: Rich
N_LAYERS = 4
D_MODEL = 128
N_HEADS = 2
D_FF = 256
SEQ_LEN = 128
DROPOUT = 0.0

# Training
BATCH_SIZE = 8
LEARNING_RATE = 1e-3
MODEL_DTYPE = torch.bfloat16

# Reproducibility
RANDOM_SEED = 42

## Imports & Device

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from safetensors.torch import load_file
from tokenizers import Tokenizer
from pathlib import Path
import json
import time
import numpy as np
from tqdm.auto import tqdm

torch.manual_seed(RANDOM_SEED)

# Device detection
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f"Device: {device}")
print(f"Dtype: {MODEL_DTYPE}")

## Load Data

Uses the cached tokenized corpus, so loading is near-instant (versus ~30 s without the cache).

In [None]:
# Load tokenizer
tokenizer = Tokenizer.from_file(TOKENIZER_PATH)
vocab_size = tokenizer.get_vocab_size()
print(f"✓ Tokenizer: {vocab_size:,} tokens")

# Load cached tokenized corpus (fast!)
tokens_data = load_file(TOKENS_PATH)
all_tokens = tokens_data["tokens"].to(torch.long)
print(f"✓ Corpus: {len(all_tokens):,} tokens")

# Load dead token census
with open(CENSUS_PATH, 'r') as f:
    census = json.load(f)
dead_token_ids = set(census['dead_token_ids'])
print(f"✓ Dead tokens: {len(dead_token_ids):,}")
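
The census file is precomputed and its generation is not shown in this notebook. As a rough illustration of what "dead" means here (a vocabulary ID that never occurs in the training corpus), a census could be derived along these lines. The function name and toy data are hypothetical.

```python
from collections import Counter

def find_dead_token_ids(corpus_token_ids, vocab_size):
    """Return vocabulary IDs with zero occurrences in the corpus."""
    counts = Counter(corpus_token_ids)
    # Counter returns 0 for missing keys, so unseen IDs are picked up directly
    return sorted(tid for tid in range(vocab_size) if counts[tid] == 0)

toy_corpus = [0, 1, 1, 3, 0, 5]   # toy stand-in for the tokenized corpus
toy_dead = find_dead_token_ids(toy_corpus, vocab_size=6)
print(toy_dead)  # [2, 4]
```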

## Dataset

In [None]:
class TokenDataset(Dataset):
    """Random chunks from the tokenized corpus."""
    
    def __init__(self, tokens, seq_len, num_samples=100_000):
        self.tokens = tokens
        self.seq_len = seq_len
        self.num_samples = num_samples
        max_start = len(tokens) - seq_len - 1
        self.starts = torch.randint(0, max_start, (num_samples,))
    
    def __len__(self):
        return self.num_samples
    
    def __getitem__(self, idx):
        start = self.starts[idx]
        chunk = self.tokens[start:start + self.seq_len + 1]
        return chunk[:-1], chunk[1:]  # input, target

dataset = TokenDataset(all_tokens, SEQ_LEN)
print(f"✓ Dataset: {len(dataset):,} samples")
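
Each sample is a window of `seq_len + 1` tokens split into an input and a target that is the same sequence shifted by one position. A minimal plain-Python illustration of that slicing, with toy values and a hypothetical `make_pair` helper:

```python
def make_pair(tokens, start, seq_len):
    """Slice seq_len + 1 tokens and split into (input, next-token target)."""
    chunk = tokens[start:start + seq_len + 1]
    return chunk[:-1], chunk[1:]

toy_tokens = [10, 11, 12, 13, 14, 15]
inputs, targets = make_pair(toy_tokens, start=1, seq_len=3)
print(inputs)   # [11, 12, 13]
print(targets)  # [12, 13, 14]
```

Position `i` of the target is the token that follows position `i` of the input, which is what the cross-entropy loss in the training loop compares against.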

## Model

In [None]:
class GPT(nn.Module):
    """Minimal GPT — the Goldilocks reference architecture."""
    
    def __init__(self, vocab_size, d_model, n_heads, n_layers, d_ff, seq_len, dropout=0.0):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(seq_len, d_model)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
                dropout=dropout, activation='gelu', batch_first=True, norm_first=True
            ) for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.tok_emb.weight  # Weight tying
        self.seq_len = seq_len
        self.register_buffer('causal_mask', None)
        
        # Explicit initialization of token embeddings only
        self._init_weights()
    
    def _init_weights(self):
        """Initialize token embeddings explicitly. No magic."""
        # Token embeddings (W): normal with mean 0 and std 0.02, via torch.randn * 0.02
        # This is the matrix we're studying. Be explicit.
        with torch.no_grad():
            self.tok_emb.weight.copy_(torch.randn(self.tok_emb.weight.shape) * 0.02)
        # pos_emb: leave as PyTorch default (not part of our investigation)
        # head.weight: tied to tok_emb.weight, already initialized
    
    def forward(self, x):
        B, T = x.shape
        if self.causal_mask is None or self.causal_mask.shape[0] != T:
            self.causal_mask = torch.triu(
                torch.ones(T, T, device=x.device, dtype=torch.bool), diagonal=1
            )
        pos = torch.arange(T, device=x.device)
        h = self.tok_emb(x) + self.pos_emb(pos)
        for layer in self.layers:
            h = layer(h, src_mask=self.causal_mask, is_causal=True)
        return self.head(self.ln_f(h))

# Create model
model = GPT(
    vocab_size=vocab_size,
    d_model=D_MODEL,
    n_heads=N_HEADS,
    n_layers=N_LAYERS,
    d_ff=D_FF,
    seq_len=SEQ_LEN,
    dropout=DROPOUT
).to(device).to(MODEL_DTYPE)

n_params = sum(p.numel() for p in model.parameters())
print(f"✓ Model: {n_params:,} parameters ({MODEL_DTYPE})")
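
The causal mask built in `forward` follows PyTorch's boolean-mask convention, where `True` marks a position that may not be attended to. A small standalone sketch with `T = 4` and uniform attention scores (toy values, not taken from the model) shows the effect:

```python
import torch

T = 4
# True above the diagonal: position i may not attend to any position j > i
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
print(mask)

# Uniform scores make the effect of the mask easy to read off
scores = torch.zeros(T, T)
weights = scores.masked_fill(mask, float("-inf")).softmax(dim=-1)
print(weights)  # row i spreads its weight evenly over positions 0..i
```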

## Training Setup

In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)

print(f"✓ Optimizer: AdamW (lr={LEARNING_RATE})")
print(f"✓ Batch size: {BATCH_SIZE}")
print(f"✓ Tokens per step: {BATCH_SIZE * SEQ_LEN:,}")
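
For planning run lengths, the reference settings translate into the following back-of-the-envelope numbers. The throughput is the approximate M4 Pro figure quoted at the top of this notebook, so the wall-clock estimate is only indicative.

```python
batch_size, seq_len, num_samples = 8, 128, 100_000
steps_per_sec = 100  # approximate figure quoted for an M4 Pro

tokens_per_step = batch_size * seq_len
steps_per_pass = num_samples // batch_size   # drop_last=True discards any remainder
seconds_for_5000 = 5000 / steps_per_sec

print(tokens_per_step)   # 1024
print(steps_per_pass)    # 12500
print(seconds_for_5000)  # 50.0
```

A 5000-step run therefore never exhausts the 100,000 pre-sampled windows, so the loader only needs to be restarted for longer runs.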

---

# ⬇️ YOUR EXPERIMENT HERE ⬇️

Everything above is the locked-down Goldilocks foundation.

Add your experiment code below. You have access to:
- `model` — the GPT model (bf16)
- `optimizer` — AdamW optimizer
- `loader` — DataLoader for training batches
- `dataset` — the TokenDataset
- `dead_token_ids` — set of token IDs that never appear in training
- `tokenizer` — the HuggingFace tokenizer
- `device` — 'mps', 'cuda', or 'cpu'

**Example training loop:**
```python
NUM_STEPS = 5000  # example value; set to your experiment's length

model.train()
loader_iter = iter(loader)

for step in tqdm(range(NUM_STEPS)):
    try:
        x, y = next(loader_iter)
    except StopIteration:
        loader_iter = iter(loader)
        x, y = next(loader_iter)
    
    x, y = x.to(device), y.to(device)
    
    optimizer.zero_grad()
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()
    optimizer.step()
    
    # Your instrumentation here
```
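
One possible piece of instrumentation is to track how many dead-token embedding rows stop changing between steps. The sketch below assumes "frozen" means bit-for-bit identical across two snapshots, which is an illustrative definition rather than the one used in the reference results. `frozen_fraction` and the toy embedding are hypothetical stand-ins for `model.tok_emb` and `dead_token_ids`.

```python
import torch
import torch.nn as nn

def frozen_fraction(prev_rows, curr_rows):
    """Fraction of rows that are bit-for-bit identical between two snapshots."""
    unchanged = (prev_rows == curr_rows).all(dim=1)
    return unchanged.float().mean().item()

# Toy stand-ins for model.tok_emb and dead_token_ids
torch.manual_seed(0)
emb = nn.Embedding(10, 4)
dead_idx = torch.tensor([2, 5, 7])

prev = emb.weight.detach()[dead_idx].clone()
with torch.no_grad():
    emb.weight[2] += 0.1   # simulate an update that moves one dead-token row
curr = emb.weight.detach()[dead_idx].clone()

print(frozen_fraction(prev, curr))  # 2 of the 3 tracked rows are unchanged
```

In the training loop, the snapshots would be taken from `model.tok_emb.weight` before and after `optimizer.step()`, with `dead_idx` built once from `sorted(dead_token_ids)` and moved to `device`.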

In [None]:
# Your experiment code here
pass