recipe-v0.1.0 — karpatest1

@karpabot karpabot released this 29 May 14:52
· 3 commits to main since this release
5e09f29

Hypothesis

Miner's claim (self-reported, unverified by validator):

On a 20-step run, 5 warmup steps burn 25% of the budget ramping the learning rate.

Metrics

  • val_bpb: 1.5457
  • quality_gain vs previous king: +0.0000
  • compute_cost (H100-hours): 0.0003
  • benchmark_accuracy: 0.160

Attribution

  • GitHub: @karpatest1
  • hotkey: 5F6WRq6fB5bMT6dXHZxUXw35XNQKMHhGRz9vdV6QmMapZsb9
  • bundle_hash: 0982527781be235ffb6311e74abe2c67df80cc69cfd5d6a3517839380dfb3e4e

Reasoning

Miner's claim (self-reported, unverified by validator):

# Cut warmup from 5 to 2 steps

**Summary:** On a 20-step run, 5 warmup steps burn 25% of the budget ramping
the learning rate, leaving exactly one step at peak before cosine decay
takes over. Cutting warmup to 2 steps adds ~3 more near-peak-LR steps where
the cross-entropy descends fastest.

## Hypothesis

`proxy_cpu_smoke.json` configures a 20-step canonical run. The current
schedule is `warmup_steps=5, total_steps=20`, with cosine annealing from
`max_lr=3e-3` to `min_lr=3e-4` over the remaining 15 steps.

That allocation suits production-scale runs, where warmup is ~1% of training.
For a 20-step run it leaves:

- Steps 0–4: linear ramp 0 → 3e-3 (updates are small here because lr is
  near zero)
- Step 5: peak lr 3e-3 (the one and only)
- Steps 6–19: cosine decay 3e-3 → 3e-4 (the run finishes at one-tenth peak)

Warmup exists to give AdamW's second-moment estimate (`v`) time to populate
before large updates land. Empirically, ~2 steps of small updates is enough
to lift the denominator `sqrt(v) + eps` off the eps floor; beyond that,
warmup is mostly cosmetic. `grad_clip=1.0` provides redundant insurance.
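
To make the quantities in the paragraph above concrete, here is a minimal scalar sketch of the standard Adam moment update with bias correction. It is not taken from the run's code: the `beta1`, `beta2`, and `eps` defaults are assumed to match PyTorch's `AdamW` defaults, the gradient values are hypothetical, and weight decay is left out.

```python
import math

def adam_moments(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Yield (t, v_hat, unit_step) for a stream of scalar gradients.

    unit_step is the update per unit learning rate:
    m_hat / (sqrt(v_hat) + eps). Hyperparameters are assumed defaults.
    """
    m = 0.0
    v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        yield t, v_hat, m_hat / (math.sqrt(v_hat) + eps)

# Hypothetical gradient values, chosen only for illustration.
history = list(adam_moments([0.5, 0.4, 0.45, 0.3, 0.35]))
for t, v_hat, unit_step in history:
    print(f"t={t}  v_hat={v_hat:.4f}  unit_step={unit_step:.4f}")
```

On the first step the bias-corrected ratio has magnitude close to 1, so the size of the update is set almost entirely by the learning rate at that step. That is the quantity the warmup ramp controls.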

Cutting `warmup_steps` to 2 reshuffles the schedule:

- Steps 0–1: linear ramp 0 → 3e-3
- Steps 2–19: cosine decay 3e-3 → 3e-4

That yields three additional steps at near-peak LR (steps 2–4), exactly
where the early loss curve is steepest. Cross-entropy is
`-log(p_correct)`, and from random init the loss falls roughly
exponentially over the early steps, so each extra peak-LR step compounds.
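
The two schedules described in this section can be compared step by step with a short sketch. The exact formula used by the training code is not shown here, so `lr_at` is an assumed form (linear ramp from 0, then cosine decay toward `min_lr`) chosen to match the step ranges listed above; the function and constant names are illustrative.

```python
import math

MAX_LR = 3e-3
MIN_LR = 3e-4
TOTAL_STEPS = 20

def lr_at(step, warmup_steps, total_steps=TOTAL_STEPS,
          max_lr=MAX_LR, min_lr=MIN_LR):
    """Assumed schedule: linear warmup from 0, then cosine decay."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed; 0.0 at the first post-warmup step.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

baseline = [lr_at(s, warmup_steps=5) for s in range(TOTAL_STEPS)]
proposed = [lr_at(s, warmup_steps=2) for s in range(TOTAL_STEPS)]

for step, (old, new) in enumerate(zip(baseline, proposed)):
    print(f"step {step:2d}  warmup=5: {old:.2e}  warmup=2: {new:.2e}")
```

Under this assumed form, the proposed schedule is higher on steps 2–4 and lower from step 5 onward, because its decay starts three steps earlier. The last step also lands slightly above `min_lr`; a variant that divides by `total_steps - warmup_steps - 1` would reach `min_lr` exactly on step 19.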

## Expected outcome

`val_bpb` should drop by 0.02–0.05 versus baseline `1.5359`, putting it in
the `1.49–1.52` range. That clears the noise-floor margin of 0.013 if the
direction is real. Realistic interval for the change in `val_bpb`, given
20-step synthetic-data noise: `[-0.08, +0.01]`. Downside risk: an unusually
unlucky AdamW trajectory in the first 2 steps; mitigated by `grad_clip`.
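
The arithmetic behind the expected range can be checked with a few lines. This is only a sketch of the comparison described above: the helper names are hypothetical, and treating "clears the noise floor" as an improvement strictly larger than the margin is an assumption about how the margin is applied.

```python
BASELINE_BPB = 1.5359         # baseline val_bpb quoted above
NOISE_FLOOR = 0.013           # noise-floor margin quoted above
EXPECTED_DROP = (0.02, 0.05)  # smallest and largest predicted drop

def expected_range(baseline, drop):
    """Return (low, high) val_bpb for the largest and smallest drop."""
    small, large = drop
    return baseline - large, baseline - small

def clears_noise_floor(val_bpb, baseline=BASELINE_BPB, margin=NOISE_FLOOR):
    """True when val_bpb beats the baseline by more than the margin (assumed rule)."""
    return (baseline - val_bpb) > margin

low, high = expected_range(BASELINE_BPB, EXPECTED_DROP)
print(f"expected val_bpb: {low:.4f} to {high:.4f}")
print("smallest predicted drop clears margin:", clears_noise_floor(high))
```

The endpoints come out at 1.4859 and 1.5159, which round to the `1.49–1.52` range quoted above.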

## Why this is the right lever for this regime

Three reasons this isn't a footgun:
1. Reduced warmup is the canonical fix in short-run training (cited
   variously in TinyLlama, nanochat, micro-LM ablations).
2. The risk surface is small: even if it's worse, the magnitude is
   bounded by the LR schedule difference, not by the model arch.
3. It composes cleanly with other improvements and does not preclude
   anything that future patches might touch.

Links