# Duckling II Experiment Template

**Copy this notebook as your starting point for dead token dynamics experiments.**

Everything here is the locked-down Duckling II foundation:

| Property | Value | Notes |
|----------|-------|-------|
| Architecture | GPT2LMHeadModel | HuggingFace |
| Layers | 2 | Layer sweep: 2 to 8 layers learned equivalently |
| Hidden dim | 128 | |
| Attention heads | 2 | |
| FFN dim | 512 | 4x hidden |
| Context length | 512 | |
| Vocab size | 8,192 | 6,144 live + 2,048 dead |
| Parameters | ~1.5M | |
| Batch size | 2 | Speed-optimized |
| Learning rate | 1e-4 | Initial value; linear decay, no warmup |
| Precision | bf16 (mixed) | Via TrainingArguments |
| Seed | 42 | |
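
The ~1.5M parameter figure can be checked by hand from the table. The sketch below assumes the standard GPT-2 layout (learned position embeddings, a bias on every linear layer, two LayerNorms per block plus a final one) and that the output head shares the embedding matrix, which is the `GPT2LMHeadModel` default. `count_gpt2_params` is a hypothetical helper, not part of the notebook.

```python
def count_gpt2_params(vocab, n_embd, n_layer, n_positions, n_inner):
    """Estimate the GPT-2 parameter count, assuming a tied output head."""
    wte = vocab * n_embd                    # token embeddings (shared with lm_head)
    wpe = n_positions * n_embd              # learned position embeddings
    # Attention: fused QKV projection plus output projection, each with bias
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward: up projection plus down projection, each with bias
    mlp = (n_embd * n_inner + n_inner) + (n_inner * n_embd + n_embd)
    norms = 2 * (2 * n_embd)                # two LayerNorms per block (weight + bias)
    final_norm = 2 * n_embd                 # LayerNorm after the last block
    return wte + wpe + n_layer * (attn + mlp + norms) + final_norm

total = count_gpt2_params(vocab=8192, n_embd=128, n_layer=2, n_positions=512, n_inner=512)
print(f"{total:,}")
```

Under these assumptions the embedding matrix alone accounts for roughly 69% of the total, so most of the model's capacity sits in the matrix under study.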

**What you get:**
- ~75 steps/sec on M4 Pro
- 10K steps in ~2.3 minutes
- Loss ~3.7 at 10K steps, still learning
- Generates coherent TinyStories-style sentences

**To use:** Copy this notebook, add your experiment code in the marked section at the bottom.
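
The speed figures above agree with each other to within rounding, as a quick back-of-the-envelope check shows. The numbers are copied from this notebook's tables; nothing here is measured.

```python
STEPS_PER_SEC = 75        # quoted throughput on M4 Pro
NUM_STEPS = 10_000
BATCH_SIZE = 2
N_POSITIONS = 512

minutes = NUM_STEPS / STEPS_PER_SEC / 60
tokens_per_step = BATCH_SIZE * N_POSITIONS
total_tokens = NUM_STEPS * tokens_per_step

print(f"{minutes:.1f} min, {tokens_per_step:,} tokens/step, {total_tokens:,} tokens total")
```

The quoted ~2.3 minutes is slightly above the pure-throughput estimate, which would be consistent with some fixed startup overhead, though that is an assumption.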

## Parameters

**Do not modify these** unless you're intentionally deviating from the reference.

In [None]:
# === DUCKLING II REFERENCE PARAMETERS ===
# Do not modify unless intentionally deviating

# Architecture
VOCAB_SIZE = 8192      # 6,144 live + 2,048 dead
N_EMBD = 128
N_HEAD = 2
N_LAYER = 2            # Winner from layer sweep
N_POSITIONS = 512
N_INNER = 512          # 4x hidden dim

# Training
BATCH_SIZE = 2         # Speed-optimized (75 steps/sec)
LEARNING_RATE = 1e-4
WEIGHT_DECAY = 0.01
MAX_GRAD_NORM = 1.0

# Reproducibility
SEED = 42

# Paths
DATASET_PATH = "tokenized_dataset"
TOKEN_MAPPING_PATH = "token_mapping.json"

## Imports & Setup

In [None]:
import json
import time
import torch
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_from_disk
from transformers import (
    GPT2Config,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
    set_seed,
)
from transformers.trainer_callback import TrainerCallback

set_seed(SEED)

# Device detection
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f"Device: {device}")
print(f"Seed: {SEED}")

## Load Data & Tokenizer

In [None]:
# Load tokenized dataset
dataset = load_from_disk(DATASET_PATH)
print(f"✓ Dataset: {len(dataset):,} sequences of {N_POSITIONS} tokens")

# Load token mapping
with open(TOKEN_MAPPING_PATH, "r") as f:
    token_mapping = json.load(f)

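LIVE_TOKENS = token_mapping["live_tokens"]
DEAD_TOKENS = token_mapping["dead_tokens"]
# Dead tokens occupy the top of the ID range, so the two counts must fill the vocabulary
assert LIVE_TOKENS + DEAD_TOKENS == VOCAB_SIZE, "token mapping does not match VOCAB_SIZE"
dead_token_ids = set(range(LIVE_TOKENS, VOCAB_SIZE))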

print(f"✓ Vocabulary: {VOCAB_SIZE:,} total ({LIVE_TOKENS:,} live, {DEAD_TOKENS:,} dead)")
print(f"✓ Dead token IDs: {LIVE_TOKENS} to {VOCAB_SIZE - 1}")

# Data collator
class CausalLMCollator:
    def __call__(self, examples):
        input_ids = torch.tensor([ex["input_ids"] for ex in examples], dtype=torch.long)
        return {"input_ids": input_ids, "labels": input_ids.clone()}

collator = CausalLMCollator()
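
Dead-token experiments rest on the claim that IDs at or above `LIVE_TOKENS` never occur in the training data. Below is a minimal sketch of how that could be verified, using a toy list of sequences in place of the real `dataset`. `count_dead_hits` is a hypothetical helper.

```python
def count_dead_hits(sequences, live_tokens):
    """Count token occurrences whose ID falls in the dead range (>= live_tokens)."""
    return sum(1 for seq in sequences for tok in seq if tok >= live_tokens)

# Toy stand-in for the tokenized dataset (the real one holds 512-token sequences)
toy_sequences = [[5, 17, 6143], [0, 42, 99]]
print(count_dead_hits(toy_sequences, live_tokens=6144))       # all IDs are live
print(count_dead_hits([[5, 6144, 8191]], live_tokens=6144))   # two IDs are in the dead range
```

On the real data, something like `count_dead_hits((ex["input_ids"] for ex in dataset), LIVE_TOKENS)` should return 0, at the cost of one full pass over the dataset.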

In [None]:
# Tokenizer for encoding/decoding (reconstructed from mapping)
class DucklingTokenizer:
    """Wrapper for encode/decode using our compact vocabulary."""
    def __init__(self, mapping):
        self.base = GPT2TokenizerFast.from_pretrained("gpt2")
        self.gpt2_to_compact = {int(k): v for k, v in mapping["gpt2_to_compact"].items()}
        self.compact_to_gpt2 = {int(k): v for k, v in mapping["compact_to_gpt2"].items()}
        self.live_size = mapping["live_tokens"]
        self.unk_id = mapping["unk_id"]

    def encode(self, text):
        gpt2_ids = self.base.encode(text)
        # GPT-2 IDs outside the compact vocabulary map to the unknown token
        return [self.gpt2_to_compact.get(tok, self.unk_id) for tok in gpt2_ids]

    def decode(self, ids):
        # Skip compact IDs that have no GPT-2 counterpart
        gpt2_ids = [self.compact_to_gpt2[tok] for tok in ids if tok in self.compact_to_gpt2]
        return self.base.decode(gpt2_ids)

tokenizer = DucklingTokenizer(token_mapping)
print("✓ Tokenizer ready")
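
The wrapper above expects `token_mapping.json` to hold two ID tables plus a few scalars. The toy mapping below mirrors the keys the class reads (the IDs themselves are made up for illustration) and shows why the `int(k)` conversion is needed: JSON object keys are always strings.

```python
import json

# Toy mapping with the same keys DucklingTokenizer reads; the IDs are invented
toy_mapping = {
    "live_tokens": 3,
    "dead_tokens": 1,
    "unk_id": 0,
    "gpt2_to_compact": {50256: 0, 262: 1, 257: 2},
    "compact_to_gpt2": {0: 50256, 1: 262, 2: 257},
}

# A round trip through JSON turns the integer keys into strings
loaded = json.loads(json.dumps(toy_mapping))
print(list(loaded["gpt2_to_compact"]))

# Restore integer keys, as the tokenizer's __init__ does
gpt2_to_compact = {int(k): v for k, v in loaded["gpt2_to_compact"].items()}

# GPT-2 ID 999 is not in the toy mapping, so it falls back to unk_id
compact_ids = [gpt2_to_compact.get(tok, loaded["unk_id"]) for tok in [262, 257, 999]]
print(compact_ids)
```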

## Create Model

In [None]:
config = GPT2Config(
    vocab_size=VOCAB_SIZE,
    n_embd=N_EMBD,
    n_head=N_HEAD,
    n_layer=N_LAYER,
    n_positions=N_POSITIONS,
    n_inner=N_INNER,
    activation_function='gelu',
    resid_pdrop=0.0,
    embd_pdrop=0.0,
    attn_pdrop=0.0,
    layer_norm_epsilon=1e-5,
    initializer_range=0.02,
    use_cache=False,
)

model = GPT2LMHeadModel(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"✓ Model: {n_params:,} parameters")

# Access the embedding matrix (this is what we study).
# GPT-2 ties wte to the output head (lm_head) by default, so W is also the unembedding matrix.
W = model.transformer.wte.weight
print(f"✓ Embedding matrix W: {W.shape} (vocab × hidden)")

## Training Setup

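Training uses the HuggingFace `Trainer` with bf16 mixed precision.

**Critical:** Set `bf16=True` in `TrainingArguments`; do NOT call `model.to(torch.bfloat16)`. Casting the model stores the weights themselves in bf16, where small optimizer updates can round away to nothing and training stalls. With `bf16=True` the stored weights stay in fp32 and only the forward/backward computation is autocast to bf16.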
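
To see why a pure-bf16 model can stall, consider how coarse bf16 is near 1.0: with 8 bits of significand precision, the next representable value above 1.0 is 1.0078125, so an update of 1e-4 rounds away entirely. The sketch below emulates bf16 rounding with the standard library rather than calling PyTorch. `to_bf16` is an illustrative stand-in, and the fixed update size of 1e-4 is an assumption chosen to match `LEARNING_RATE`.

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even), returned as a float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # float32 bit pattern
    lsb = (bits >> 16) & 1                               # lowest bit that bf16 keeps
    bits = (bits + 0x7FFF + lsb) & 0xFFFF0000            # round, then drop the low 16 bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

UPDATE = 1e-4      # assumed per-step update, same order of magnitude as LEARNING_RATE
w_bf16 = 1.0       # weight stored in bf16 (e.g. a LayerNorm gain)
w_master = 1.0     # weight stored at higher precision

for _ in range(100):
    w_bf16 = to_bf16(w_bf16 + UPDATE)   # every update is rounded back to bf16
    w_master = w_master + UPDATE        # updates accumulate

print(w_bf16, round(w_master, 4))
```

After 100 steps the bf16 weight has not moved, while the higher-precision copy has drifted by about 1%. How much is lost in practice depends on the ratio of update size to weight magnitude, so treat this as an illustration of the failure mode rather than a prediction for any particular parameter.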

In [None]:
class LossTracker(TrainerCallback):
    """Track loss at logging steps."""
    def __init__(self):
        self.losses = []
        self.steps = []
    
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "loss" in logs:
            self.losses.append(logs["loss"])
            self.steps.append(state.global_step)

def make_training_args(num_steps, output_dir="experiment", logging_steps=10):
    """Create TrainingArguments with Duckling II defaults."""
    return TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        max_steps=num_steps,
        per_device_train_batch_size=BATCH_SIZE,
        learning_rate=LEARNING_RATE,
        weight_decay=WEIGHT_DECAY,
        max_grad_norm=MAX_GRAD_NORM,
        bf16=True,  # CRITICAL: proper mixed precision
        optim="adamw_torch",
        lr_scheduler_type="linear",
        warmup_steps=0,
        logging_steps=logging_steps,
        logging_first_step=True,
        report_to="none",
        save_strategy="no",
        dataloader_num_workers=0,
        seed=SEED,
    )

print("✓ Training utilities ready")
print("  Expected speed: ~75 steps/sec")
print(f"  Tokens per step: {BATCH_SIZE * N_POSITIONS:,}")
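
With `lr_scheduler_type="linear"` and `warmup_steps=0`, the learning rate is not constant: it starts at `LEARNING_RATE` and decays linearly to zero at `max_steps`. Here is a small sketch of that schedule, written out by hand rather than through the `transformers` scheduler (`linear_lr` is a hypothetical helper):

```python
def linear_lr(step, num_steps, peak_lr=1e-4):
    """Learning rate after `step` optimizer steps under linear decay with no warmup."""
    remaining = max(0, num_steps - step)
    return peak_lr * remaining / num_steps

for step in (0, 2_500, 5_000, 10_000):
    print(step, linear_lr(step, num_steps=10_000))
```

Runs of different lengths therefore follow different learning-rate curves, which matters when comparing, say, a 1K-step run against the first 1K steps of a 10K-step run.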

---

# ⬇️ YOUR EXPERIMENT HERE ⬇️

Everything above is the locked-down Duckling II foundation.

Add your experiment code below. You have access to:

| Variable | Description |
|----------|-------------|
| `model` | GPT2LMHeadModel (untrained) |
| `W` | Embedding matrix `model.transformer.wte.weight` |
| `dataset` | Tokenized TinyStories sequences |
| `collator` | Data collator for Trainer |
| `tokenizer` | For encode/decode |
| `dead_token_ids` | Set of token IDs 6144-8191 (never appear in training) |
| `device` | 'mps', 'cuda', or 'cpu' |
| `make_training_args()` | Helper to create TrainingArguments |
| `LossTracker` | Callback class for tracking loss |

**Example: Basic training run**
```python
NUM_STEPS = 10_000

tracker = LossTracker()
trainer = Trainer(
    model=model,
    args=make_training_args(NUM_STEPS),
    train_dataset=dataset,
    data_collator=collator,
    callbacks=[tracker],
)

trainer.train()
```

**Example: Record W snapshots during training**
```python
class WRecorder(TrainerCallback):
    def __init__(self, record_every=100):
        self.snapshots = []
        self.record_every = record_every
    
    def on_step_end(self, args, state, control, model=None, **kwargs):
        if state.global_step % self.record_every == 0:
            W = model.transformer.wte.weight.detach().cpu().clone()
            self.snapshots.append((state.global_step, W))
```
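
Snapshots add up quickly. Assuming `W` is held in fp32 (4 bytes per element, which is what `bf16=True` mixed precision should leave the stored weights as), the memory cost of a `WRecorder` run can be estimated up front. `snapshot_bytes` is a hypothetical helper.

```python
def snapshot_bytes(vocab_size=8192, n_embd=128, bytes_per_elem=4):
    """Size in bytes of one copy of the embedding matrix."""
    return vocab_size * n_embd * bytes_per_elem

num_steps, record_every = 10_000, 100
n_snapshots = num_steps // record_every          # steps 100, 200, ..., 10_000
total_mib = n_snapshots * snapshot_bytes() / 2**20

print(f"{n_snapshots} snapshots, {total_mib:.0f} MiB")
```

The callback likely never fires at step 0, since `global_step` is incremented before `on_step_end` runs, so the initial embeddings would need to be copied separately before `trainer.train()`.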

**Example: Extract dead token embeddings**
```python
dead_ids = sorted(dead_token_ids)  # sets are unordered; sort so row i maps to ID 6144 + i
W_dead = W[dead_ids].detach()  # Shape: (2048, 128)
```

In [None]:
# Your experiment code here
pass