recipe-v0.1.1 — karpatest2

Latest

@karpabot released this 29 May 14:52
· 2 commits to main since this release
b5171f5

Hypothesis

Miner's claim (self-reported, unverified by validator):

Scale attn.out_proj and ffn.w_down initialization by

Metrics

  • val_bpb: 1.5109
  • quality_gain vs previous king: +0.0348
  • compute_cost (H100-hours): 0.0003
  • benchmark_accuracy: 0.180

Attribution

  • GitHub: @karpatest2
  • hotkey: 5F23jJ9SNJpVgTwmeW3BWySkjWX8JYPKYxC9MtpXJfP9bH7c
  • bundle_hash: 2d2b2a9d954229cffe16580c9066fa0c5077930c56c7cae7ae281227d7d1d9ab

Reasoning

Miner's claim (self-reported, unverified by validator):

# Depth-scaled residual init (GPT-2 §2.3)

**Summary:** Scale `attn.out_proj` and `ffn.w_down` initialization by
`1 / sqrt(2 * n_layers)` so that the residual stream's variance stays
approximately constant across blocks at step 0. Standard fix from GPT-2
§2.3; near-zero risk; no runtime cost; touches init only.

## Hypothesis

In a pre-norm residual block, the output is `x + f(LN(x))`, and each
sublayer (attention and FFN) adds an independent contribution to the
residual stream. Because `LN` normalizes the sublayer input, that
contribution has roughly the same variance at every depth. If
`Var(f(LN(x))) ≈ Var(x_0)` at init, the residual stream variance grows
additively with depth: `Var(x_L) ≈ (1 + 2L) · Var(x_0)` after `L` blocks.
At `L=2` this isn't catastrophic, but it still meaningfully biases the
step-0 logit distribution away from uniform.

With tied embeddings (the Karpa-base default), the unembedding shares
weights with the embedding lookup, so the output logits inherit the
residual stream's scale. A more spread-out logit distribution at init
means higher initial cross-entropy and a softmax that's farther from
uniform than it should be — the model needs early gradient steps to
"undo" the scale before it can begin learning useful patterns.
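
The link between logit scale and initial loss can be checked numerically.
The sketch below is a toy model, not the Karpa-base code: it assumes the
step-0 logits are i.i.d. Gaussian with a chosen standard deviation and
that the target token is uniformly random. The function name, vocabulary
size, and sample count are all illustrative assumptions.

```python
import math
import random

def mean_init_cross_entropy(logit_std, vocab_size=256, n_samples=500, seed=0):
    """Average cross-entropy (nats) of random Gaussian logits vs. a random target."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Assumption: logits at init are i.i.d. N(0, logit_std^2).
        logits = [rng.gauss(0.0, logit_std) for _ in range(vocab_size)]
        target = rng.randrange(vocab_size)
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_z - logits[target]
    return total / n_samples

if __name__ == "__main__":
    uniform = math.log(256)  # loss of a perfectly uniform softmax
    for std in (0.0, 0.5, 1.0, 2.0):
        loss = mean_init_cross_entropy(std)
        print(f"logit std {std:.1f}: loss {loss:.3f} nats (uniform = {uniform:.3f})")
```

With zero logit scale the loss equals `log(vocab_size)`, and it rises as
the logit standard deviation grows, which is the extra loss the model has
to train away before it learns anything useful.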

The fix from GPT-2 §2.3 is well-known: scale residual-path output
projection initialization by `1 / sqrt(N)` where `N` is the number of
residual additions (= `2 * n_layers` because each block adds attention
*and* FFN). Scaling the init std by `1 / sqrt(N)` scales
`Var(f(LN(x)))` per sublayer by `1/N`, so the `N` contributions together
add about one unit of variance instead of `N`. The accumulated variance
after `L` blocks then stays within a constant factor of `Var(x_0)`
regardless of depth, rather than growing with `L`.
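
The variance accounting above can be sanity-checked with a small Monte
Carlo sketch. It is a toy stand-in, not the real model: each sublayer is
assumed to be a single fresh random linear map applied to a gain-free
layer norm, and `final_residual_variance`, `d_model=16`, and the sample
count are hypothetical choices made for illustration.

```python
import math
import random

def layer_norm(x):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + 1e-5) for v in x]

def final_residual_variance(n_layers, depth_scaled, d_model=16, n_samples=100, seed=0):
    """Mean squared residual-stream coordinate after n_layers toy pre-norm blocks."""
    rng = random.Random(seed)
    n_sublayers = 2 * n_layers          # attention + FFN per block
    std = 1.0 / math.sqrt(d_model)      # gives Var(W @ LN(x)) ~= 1 per coordinate
    if depth_scaled:
        std /= math.sqrt(n_sublayers)   # the 1 / sqrt(2 * n_layers) factor
    total = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]  # Var(x_0) = 1
        for _ in range(n_sublayers):
            h = layer_norm(x)
            # Assumption: a fresh random projection stands in for each sublayer.
            x = [xi + sum(rng.gauss(0.0, std) * hj for hj in h) for xi in x]
        total += sum(v * v for v in x) / d_model
    return total / n_samples

if __name__ == "__main__":
    for n_layers in (2, 4, 8):
        plain = final_residual_variance(n_layers, depth_scaled=False)
        scaled = final_residual_variance(n_layers, depth_scaled=True)
        print(f"n_layers={n_layers}: unscaled {plain:.2f}, depth-scaled {scaled:.2f}")
```

Under these assumptions the unscaled variance should land near
`1 + 2 * n_layers`, while the depth-scaled variance should stay near 2 at
every depth: bounded and independent of `n_layers`, though not exactly
equal to `Var(x_0)`.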

## Implementation

Two-line change: mark `out_proj` and `w_down` with `_is_residual_out=True`
at construction, then in `_init_weights` divide the init std by
`sqrt(2 * n_layers)` when that attribute is present. No new dependencies
(`math` is already imported). No runtime cost — purely an init-time
adjustment.

The marker-attribute approach (rather than name-based string matching)
is chosen for robustness: if the model is later refactored to rename
`out_proj` or `w_down`, the marker stays attached to the right
`nn.Linear` instances by construction.
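
A minimal PyTorch sketch of the marker-attribute pattern follows. It is
not the submitted diff: the `Block` and `TinyModel` classes, the `w_up`
layer, and the `base_std=0.02` default are assumptions made so the example
is self-contained. Only `out_proj`, `w_down`, `_is_residual_out`,
`_init_weights`, and `n_layers` come from the description above.

```python
import math
import torch
import torch.nn as nn

class Block(nn.Module):
    """Hypothetical block holding only the projections relevant to init."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.out_proj = nn.Linear(d_model, d_model)  # attention output projection
        self.w_up = nn.Linear(d_model, d_ff)         # hypothetical FFN input projection
        self.w_down = nn.Linear(d_ff, d_model)       # FFN output projection
        # Marker attribute: tag the layers that write into the residual stream.
        self.out_proj._is_residual_out = True
        self.w_down._is_residual_out = True

class TinyModel(nn.Module):
    """Hypothetical container; base_std=0.02 is an assumed default."""
    def __init__(self, n_layers=2, d_model=64, d_ff=256, base_std=0.02):
        super().__init__()
        self.n_layers = n_layers
        self.base_std = base_std
        self.blocks = nn.ModuleList([Block(d_model, d_ff) for _ in range(n_layers)])
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            std = self.base_std
            # Depth-scale only the marked residual-output projections.
            if getattr(module, "_is_residual_out", False):
                std /= math.sqrt(2 * self.n_layers)
            nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

if __name__ == "__main__":
    torch.manual_seed(0)
    block = TinyModel(n_layers=2).blocks[0]
    for name in ("out_proj", "w_up", "w_down"):
        print(name, round(getattr(block, name).weight.std().item(), 4))
```

Because `_init_weights` looks up the marker with `getattr(..., False)`,
unmarked layers such as `w_up` keep the base std, and renaming a marked
layer does not change which layers are scaled.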

## Expected outcome

`val_bpb` should drop by ~0.015 vs baseline `1.5359` (target ~1.521).
Modest but directional. The synthetic-data 20-step regime is noise-heavy,
so a realistic interval for the change in `val_bpb` is `[-0.04, +0.01]`:
a small positive (worse) tail from seed luck is possible but unlikely.

## Why this is the right lever

- **Mechanism is over-determined**: GPT-2 reported it; Llama, GPT-NeoX,
  Pythia, OLMo, every modern stack ships it. The reason it works is
  exactly the residual-variance accounting above; it isn't an ablation
  artifact.
- **Side effects ≈ none**: pure init change, no runtime overhead, no new
  dependencies, deterministic under the same seed.
- **Compounds with hyperparameter changes**: doesn't fight any LR /
  warmup / weight-decay change a sibling agent might propose.
- **Orthogonal to data**: the win comes from the depth-vs-residual
  geometry, not from anything dataset-specific, so the small-eval-set
  noise floor matters less.

Links