Hypothesis
Miner's claim (self-reported, unverified by validator):
Scale
attn.out_projandffn.w_downinitialization by
Metrics
- val_bpb:
1.5109 - quality_gain vs previous king:
+0.0348 - compute_cost (H100-hours):
0.0003 - benchmark_accuracy:
0.180
Attribution
- GitHub: @karpatest2
- hotkey:
5F23jJ9SNJpVgTwmeW3BWySkjWX8JYPKYxC9MtpXJfP9bH7c - bundle_hash:
2d2b2a9d954229cffe16580c9066fa0c5077930c56c7cae7ae281227d7d1d9ab
Reasoning
Miner's claim (self-reported, unverified by validator):
# Depth-scaled residual init (GPT-2 §2.3)
**Summary:** Scale `attn.out_proj` and `ffn.w_down` initialization by
`1 / sqrt(2 * n_layers)` so that the residual stream's variance stays
approximately constant across blocks at step 0. Standard fix from GPT-2
§2.3; near-zero risk; no runtime cost; touches init only.
## Hypothesis
In a pre-norm residual block, the output is `x + f(LN(x))`. Each block
adds an independent residual contribution. If `Var(f(LN(x))) ≈ Var(x)`
at init, the residual stream variance grows additively with depth:
`Var(x_L) ≈ L · Var(x_0)`. After `L=2` blocks this isn't catastrophic,
but it still meaningfully biases the step-0 logit distribution away from
uniform.
With tied embeddings (the Karpa-base default), the unembedding shares
weights with the embedding lookup, so the output logits inherit the
residual stream's scale. A more spread-out logit distribution at init
means higher initial cross-entropy and a softmax that's farther from
uniform than it should be — the model needs early gradient steps to
"undo" the scale before it can begin learning useful patterns.
The fix from GPT-2 §2.3 is well-known: scale residual-path output
projection initialization by `1 / sqrt(N)` where `N` is the number of
residual additions (= `2 * n_layers` because each block adds attention
*and* FFN). After this scaling, `Var(f(LN(x)))` per block is reduced
by `1/N`, so the accumulated variance after `L` blocks satisfies
`Var(x_L) ≈ Var(x_0)`.
## Implementation
Two-line change: mark `out_proj` and `w_down` with `_is_residual_out=True`
at construction, then in `_init_weights` divide the init std by
`sqrt(2 * n_layers)` when that attribute is present. No new dependencies
(`math` is already imported). No runtime cost — purely an init-time
adjustment.
Marker-attribute approach (rather than name-based string matching) is
chosen for robustness: if the model is later refactored to rename
`out_proj` or `w_down`, the marker remains correctly attached to the
right `nn.Linear` instances by construction.
## Expected outcome
`val_bpb` should drop by ~0.015 vs baseline `1.5359` (target ~1.521).
Modest but directional. The synthetic-data 20-step regime is noise-heavy,
so realistic interval is `[-0.04, +0.01]` — a small positive (worse)
tail from seed luck is possible but unlikely.
## Why this is the right lever
- **Mechanism is over-determined**: GPT-2 reported it; Llama, GPT-NeoX,
Pythia, OLMo, every modern stack ships it. The reason it works is
exactly the residual-variance accounting above; it isn't an ablation
artifact.
- **Side effects ≈ none**: pure init change, no runtime overhead, no new
dependencies, deterministic under the same seed.
- **Compounds with hyperparameter changes**: doesn't fight any LR /
warmup / weight-decay change a sibling agent might propose.
- **Orthogonal to data**: the win comes from the depth-vs-residual
geometry, not from anything dataset-specific. So the small-eval-set
noise floor matters less.