Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean) by abaybektursun · Pull Request #399 · openai/parameter-golf

abaybektursun · 2026-03-22T04:52:11Z

Systems Optimization: 81.87 ms/step, all artifacts under 16 MB

Pure training speed optimization. Model architecture and hyperparameters are unchanged — only the optimizer and weight storage layout are modified.

3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128, 600s)

Seed	step_avg	steps	int6 sliding val_bpb	artifact
1337	81.86 ms	7,331	1.1241	15,830,960 bytes
42	81.88 ms	7,328	1.1253	15,819,728 bytes
2025	81.86 ms	7,330	1.1247	15,796,052 bytes
Mean	81.87 ms	7,330	1.1247 (std 0.0006)	~15.8 MB

What changed

Three optimizer techniques replacing 66 sequential individual Newton-Schulz calls with 4 batched operations:

1. Parameter Banking

3D nn.Parameter banks replace 66 separate nn.Linear weights:

qo_bank: (22, 512, 512) — Q + Out projections
kv_bank: (22, 256, 512) — K + V projections
mlp_up_bank: (11, 1536, 512) — MLP up
mlp_down_bank: (11, 512, 1536) — MLP down

Forward: F.linear(x, bank[layer_idx]). Compiled forward+backward verified identical: 72.33ms vs 72.59ms.

2. Batched Newton-Schulz

Standard NS coefficients (a=3.4445, b=-4.7750, c=2.0315) applied in a single batched torch.bmm over all bank layers, instead of 66 sequential torch.mm calls.

Note: We initially used Polar Express (arXiv:2505.16932) — per-iteration minimax-optimal polynomial coefficients that achieve 35% tighter orthogonalization (0.21 vs 0.32 relative error). However, PE produces weight value distributions that are ~190KB harder to compress, pushing artifacts over 16MB. Reverting to standard NS fixed the artifact size with no measurable quality difference.

3. Parallel Muon (arXiv:2511.07464)

DDP removed for bank params. Post-backward communication scheduled explicitly:

Launch async reduce_scatter for all banks (biggest first)
all_reduce + Adam step on small params (while bank RS is in-flight)
Wait for RS, local batched NS on each GPU's shard, async all_gather

Why DDP doesn't work with banking

Bank gradients aggregate across all 11 layers → available only at end of backward → DDP can't overlap all-reduce with compute. Result: +4ms regression with banks in DDP (88.8ms vs 84.8ms baseline). Fix: remove DDP for banks, schedule communication explicitly in optimizer step. This follows the approach used in modded-nanogpt — no DDP at all, fully manual communication scheduling.

What we tried and learned

Approach	Result	Lesson
Non-surgery batching (keep 66 params, batch in optimizer)	85.73ms (+1ms vs baseline)	Gather/scatter kernel launch overhead offsets PE speedup
DDP with banks	88.8ms (+4ms regression)	Bank grads only available at end of backward, zero overlap
DDP with `_ddp_params_and_buffers_to_ignore`	88ms	Still no overlap for bank all-reduce
Polar Express	82ms but 16.2MB artifacts	PE weights compress ~190KB worse than NS
Parallel Muon + NS	81.87ms, 15.8MB	Winner

PRs we tested our optimizer against

Base PR	Speed win?	Score win?	Why
#315 (EMA only)	Yes (-3.4%)	Yes (-0.0006)	Extra steps directly improve EMA score
#374 (Tight SWA)	Yes (-3.5%)	No (+0.001)	SWA averages warmdown weights; extra steps don't help
#401 (EMA+SWA stack)	Yes (-2.8%)	No (+0.0005)	Same SWA dilution
#398 (aggressive TTT)	Yes (-2.3%)	No (+0.004)	TTT paradox: more trained model = less TTT gain
#332 (12L grad-quant)	Yes (-12%)	Untested	Gradient-guided quant didn't activate with our optimizer

Key finding: Only PRs using simple EMA (no SWA/TTT) benefit from faster training, because EMA quality improves monotonically with more steps. SWA averages warmdown weights, and TTT paradoxically benefits from less-trained models.

Credits

PR #315 by @jfprincz — architecture config (11L Partial RoPE + LN Scale + EMA + XSA4) used for benchmarking
PR #287 — base 11L stack (SmearGate, BigramHash, OrthoInit, FA3)
Parallel Muon — async reduce-scatter/all-gather scheduling pattern
modded-nanogpt — inspiration for DDP-free manual communication

🤖 Generated with Claude Code

@jfprincz

Systems optimization built on PR openai#315 by @jfprincz (11L XSA4+EMA, 1.1248 bpb). Same architecture, same hyperparameters, only optimizer changed. 82.14ms/step vs 84.76ms baseline = 7,306 steps vs 7,079 in 600s. Pre-quant val_bpb 1.1421 (identical to baseline). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…1.1248) Unbank state dict before quantization so int6 per-row scales match baseline. Rebank after dequantization for roundtrip eval. Results: 82.13ms/step, 7,306 steps, int6 sliding window val_bpb 1.1238. Artifact: 16.06MB (int6+zstd). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Seeds 42, 1337, 2025: mean 82.08ms/step, val_bpb 1.1239 (std 0.0001). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replaced Polar Express with standard Newton-Schulz + switched to lzma compression. 3-seed results: 81.87ms/step mean, 1.1247 sliding bpb mean, all artifacts ~15.8MB. Seed 1337: 7331 steps, 1.1241 bpb, 15,830,960 bytes Seed 42: 7328 steps, 1.1253 bpb, 15,819,728 bytes Seed 2025: 7330 steps, 1.1247 bpb, 15,796,052 bytes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Legal score-first TTT (PR openai#461 recipe) applied to openai#414 stack with Parameter Banking + Parallel Muon (first introduced in PR openai#399). Pre-TTT: 1.1234, post-TTT: 1.1213 (-0.0021). TTT eval: 400s. Artifact: 15.84 MB. Seed 1337, 8×H100 SXM, PyTorch 2.9.1+cu128. Every token scored BEFORE model adapts (inference_mode enforced). SGD+momentum(0.9), 3 epochs/32K chunk, freeze first 2 blocks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…d mean) Legal score-first TTT (PR openai#461 recipe) + BigramHash(3072) + freeze=0 on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399). 3-seed results (BIGRAM=3072, 3ep, freeze=0, SGD+mom=0.9): Seed 1337: 1.1204 bpb, 413s TTT, 15.98 MB Seed 42: 1.1216 bpb, 406s TTT, 15.99 MB Seed 2025: 1.1221 bpb, 405s TTT, 15.99 MB Mean: 1.1214 (std 0.0009) All artifacts under 16MB. All eval times under 600s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ed mean) LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399). 3-seed results: Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB Seed 1337: pending (log will be added) Mean: 1.1195 (std 0.0008) All artifacts under 16MB. All eval under 10 min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ed mean) LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT (PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399). 3-seed results: Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB Seed 42: 1.1200 bpb, 408s TTT, 15.88 MB Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB Mean: 1.1194 (std 0.0006) All artifacts under 16MB. All eval under 10 min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

notapplica mentioned this pull request Mar 22, 2026

Parameter Golf Live AI Commentary + Analysis / Ideas | every 10 minutes #140

Open

abaybektursun and others added 2 commits March 22, 2026 00:13

Add 3-seed results + train logs

4db0057

Seeds 42, 1337, 2025: mean 82.08ms/step, val_bpb 1.1239 (std 0.0001). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

abaybektursun changed the title ~~Record: Parallel Muon + Parameter Banking — 82.14ms/step (3.1% faster than PR #315)~~ Record: Parallel Muon + Parameter Banking — 82.08ms/step (3.2% faster than PR #315) Mar 22, 2026

abaybektursun force-pushed the submission/parallel-muon-82ms branch from 5f4d141 to 4db0057 Compare March 22, 2026 15:24

abaybektursun changed the title ~~Record: Parallel Muon + Parameter Banking — 82.08ms/step (3.2% faster than PR #315)~~ Record: Parallel Muon + Parameter Banking + Polar Express — 82.14ms/step (3.1% faster than PR #315) Mar 22, 2026

abaybektursun changed the title ~~Record: Parallel Muon + Parameter Banking + Polar Express — 82.14ms/step (3.1% faster than PR #315)~~ Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean) Mar 22, 2026

abaybektursun mentioned this pull request Mar 22, 2026

Record: Legal Score-First TTT + Parallel Muon — val_bpb 1.1214 (3-seed mean) #473

Open

abaybektursun mentioned this pull request Mar 23, 2026

Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean) #549

Open

abaybektursun mentioned this pull request Mar 24, 2026

Record: Full GPTQ + LeakyReLU² + Parallel Muon — val_bpb 1.1171 (3-seed mean, no TTT) #593

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean)#399

Record: Parallel Muon + Parameter Banking — 81.87ms/step, val_bpb 1.1247 (3-seed mean)#399
abaybektursun wants to merge 4 commits intoopenai:mainfrom
abaybektursun:submission/parallel-muon-82ms

abaybektursun commented Mar 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

abaybektursun commented Mar 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Systems Optimization: 81.87 ms/step, all artifacts under 16 MB

3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128, 600s)

What changed

1. Parameter Banking

2. Batched Newton-Schulz

3. Parallel Muon (arXiv:2511.07464)

Why DDP doesn't work with banking

What we tried and learned

PRs we tested our optimizer against

Credits

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

abaybektursun commented Mar 22, 2026 •

edited

Loading