
Record: 1.0362 BPB — SGD Momentum 0.95 TTT + HedgeMixer + Per-Layer LR #995

Closed
dexhunter wants to merge 1 commit into openai:main from dexhunter:submission/2026-03-27-sgd-momentum95-1.0362

Conversation

@dexhunter

Built on PR #720 by @agalimova. Two key improvements to the TTT recipe:

  1. SGD optimizer (lr=0.002, momentum=0.95) replaces AdamW — accounts for -0.036 BPB
  2. Per-layer LR groups (3x output projections, 0.5x input) + cosine schedule + 4 epochs + zero frozen blocks
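The two changes above can be sketched in plain Python to make the arithmetic explicit. This is illustrative only: the actual run uses `torch.optim.SGD` with parameter groups inside `train_gpt.py`, and the layer-kind labels here are assumptions, not the repo's real parameter names.

```python
# Illustrative sketch of the TTT optimizer settings described above.
BASE_LR = 0.002
LAYER_LR_MULT = {"out_proj": 3.0, "in_proj": 0.5}  # 3x output projections, 0.5x input

def layer_lr(kind, base_lr=BASE_LR):
    """Per-layer learning rate: base LR scaled by the group multiplier (1x default)."""
    return LAYER_LR_MULT.get(kind, 1.0) * base_lr

def sgd_momentum_step(param, grad, velocity, lr, momentum=0.95):
    """PyTorch-style SGD with momentum: v <- momentum*v + g; p <- p - lr*v."""
    velocity = momentum * velocity + grad
    return param - lr * velocity, velocity
```

With momentum 0.95 the velocity buffer retains 95% of its previous value each step, smoothing per-chunk gradients more aggressively than the 0.9 default; per the ablation below, this lowers both the mean and the seed-to-seed variance.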

3-Seed Results

| Seed | TTT BPB | Eval time | Artifact |
|------|---------|-----------|----------|
| 1337 | 1.0302 | 513s | 15.57MB |
| 42 | 1.0365 | 533s | 15.67MB |
| 2025 | 1.0419 | 539s | 15.15MB |
| **Mean** | **1.0362** | | |

Ablation

| Config | Mean BPB | Notes |
|--------|----------|-------|
| This (SGD m=0.95 + per-layer LR) | 1.0362 | Best |
| SGD m=0.9 + per-layer LR | 1.0450 | Higher variance |
| AdamW + per-layer LR | 1.0722 | First submission |
| No mixer (SGD TTT only) | ~1.156 | Mixer is essential |
| No TTT (sliding window) | ~1.121 | Neural-only baseline |

Validated via 15+ single-knob sweeps: LR (0.001–0.003), momentum (0.9–0.97), freeze depth (0–2), epochs (3–5), per-layer LR mult (2x–4x), chunk size (24K–32K), trigram hash (64K–128K).

Run

```sh
export NCCL_NET=Socket SKIP_SLIDING=1
export TTT_OPTIMIZER=sgd TTT_LR=0.002 TTT_MOMENTUM=0.95
SEED=1337 torchrun --nproc_per_node=8 train_gpt.py
```

Compliance

- 3 seeds on 8xH100 SXM, all train ≤600s, eval ≤600s, artifact ≤16MB
- Score-first legal TTT + backward-looking HedgeMixer
- No external data access during eval

Test Plan

- 3-seed verification (1337, 42, 2025)
- 15+ hyperparameter configs swept to confirm optimality
- Ablation: no-mixer, no-TTT, AdamW baselines measured
- All seeds emit final_int6_ttt_exact metrics

Built on PR openai#720 by @agalimova. Key improvement: momentum 0.95 (vs 0.9)
reduces variance and improves mean by 0.009 BPB.

3-seed results:
  Seed 1337: 1.0302 BPB (513s eval)
  Seed   42: 1.0365 BPB (533s eval)
  Seed 2025: 1.0419 BPB (539s eval)
  Mean:      1.0362 ± 0.006

Validated via comprehensive hyperparameter sweep:
  LR: 0.001/0.002/0.003 → 0.002 optimal
  Freeze: 0/1/2 → 0 optimal
  Epochs: 3/4/5 → 4 optimal
  Per-layer LR: 2x/3x/4x proj → 3x optimal
  Momentum: 0.9/0.95 → 0.95 optimal
@NoesisGenesis

FYI, the entropy expert in LogisticContextMixer fails condition 2. It is not -log q_t(x_t) for any normalized distribution q_t over Σ, but a scalar functional of the neural distribution itself. So once it is mixed in as an expert, the resulting object is not, in general, a probability distribution over Σ.
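A toy numeric check makes the failure concrete (illustrative only, not the actual LogisticContextMixer code): any convex mixture of proper experts, each a normalized distribution over Σ, still sums to 1, whereas a scalar entropy functional broadcast across the alphabet breaks that invariant.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Two proper experts over a toy 4-symbol alphabet: each sums to 1.
neural = softmax([2.0, 0.5, -1.0, 0.1])
unigram = [0.25] * 4

# The "entropy expert": a scalar functional of the neural distribution,
# broadcast across the alphabet. It sums to 4*H, not 1.
H = -sum(p * math.log(p) for p in neural)
entropy_expert = [H] * 4

w = [0.6, 0.3, 0.1]  # mixture weights
mixed = [w[0] * n + w[1] * u + w[2] * e
         for n, u, e in zip(neural, unigram, entropy_expert)]
# sum(mixed) = 0.9 + 0.1 * (4 * H), which equals 1 only if H happens to be 0.25
```

Here the mixture's total mass depends on H, i.e. on the neural model's own confidence, which is exactly the Condition 2 violation described above.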

aryanbhosale added a commit to aryanbhosale/parameter-golf that referenced this pull request Mar 28, 2026
slope 0.75 + LR 0.027 + warmdown 3700 (PR openai#977)
No SWA with QAT (PR openai#989)
QAT from 50% + range fix [-31,31]
mHC 22-param residual mixing (PR openai#928)
VE128 + no gated_attn + no value_residual (PR openai#549)
LZMA preset 7 compression (PR openai#999)
Muon TTT with NS3 (PR openai#999)
Entropy-adaptive TTT epochs 2/3/4 (PR openai#999)
Per-layer TTT LR (PR openai#995)
TTT momentum 0.95 (PR openai#995)
@dexhunter
Author

dexhunter commented Mar 28, 2026

@NoesisGenesis — you're right. The entropy expert violates Condition 2. I independently identified this through my own audit and ran a thorough ablation to understand the mechanism. Sharing the results here since they may be useful to the community.

Ablation Matrix

| Variant | Description | Scored dist sums to 1? | val_bpb | Δ vs A |
|---------|-------------|------------------------|---------|--------|
| A | Original (5 experts incl. entropy) | No (~1.057) | 1.0699 | (baseline) |
| B | 4-expert (neural+uni+bi+tri), no entropy | Yes | 1.1563 | +0.086 |
| C | 4-expert (neural+uni+bi+entropy), no trigram | No | 1.0801 | +0.010 |
| D | Mixer off entirely (raw neural CE) | N/A (standard softmax) | 1.1563 | +0.086 |
| E | 4-expert scored + entropy as Hedge eta modulator only | Yes | 1.1562 | +0.086 |

Key Findings

  1. B ≈ D ≈ E — a properly normalized 4-expert mixture performs identically to no mixer at all. The n-gram experts (unigram, bigram, trigram) contribute nothing to scoring on their own.
  2. Entropy as a control signal (E) has zero effect — modulating the Hedge learning rate with entropy is indistinguishable from removing it entirely.
  3. The trigram expert is a small contributor — removing only the trigram (C) costs 0.010 BPB. The entropy expert captures most of the mixer benefit.
  4. The mechanism is direct mixing, not information — entropy's value comes entirely from being in the logsumexp scoring path, acting as a confidence-dependent score modifier. Used only as a control signal, it has zero measurable effect.
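For concreteness, here is a hedged sketch of the A-vs-E distinction (function names and the modulation form are assumptions; the real mixer differs in detail). In A-style scoring, the entropy term sits inside the logsumexp over expert log-scores and directly shifts the combined score; in E-style use, entropy only scales the Hedge learning rate, and the scored mixture stays on the simplex.

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Variant A style (illustrative): entropy enters the scoring path as an
# extra term inside the logsumexp, acting as a confidence-dependent modifier.
def score_A(expert_logscores, log_weights, entropy, alpha=1.0):
    terms = [lw + s for lw, s in zip(log_weights, expert_logscores)]
    terms.append(math.log(0.01) + alpha * entropy)  # entropy "expert" term
    return logsumexp(terms)

# Variant E style (illustrative): entropy only modulates the Hedge learning
# rate; the weights renormalize, so the scored mixture remains a distribution.
def hedge_update(weights, losses, entropy, eta0=1.0):
    eta = eta0 / (1.0 + entropy)  # hypothetical modulation
    raw = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    z = sum(raw)
    return [r / z for r in raw]
```

In the E-style update the weights renormalize every step, so entropy can only reweight experts; it cannot move probability mass off the simplex, which is consistent with finding 2 above.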

Conclusion

The entropy expert works precisely because it violates Condition 2 — it shapes the scored distribution in a way that no properly normalized replacement can replicate. Every normalized variant I tested regresses by ~0.086 BPB, collapsing to the same performance as no mixer at all.

I am closing this PR and my other mixer-based submissions (#953, #967). My next submission will use standard F.cross_entropy scoring with properly normalized probabilities.

Thank you for formalizing the conditions — they provide exactly the clarity the community needed.

@dexhunter
Author

dexhunter commented Mar 28, 2026

Closing: the LogisticContextMixer's entropy expert violates Condition 2 from the normalization criteria recommended by @valerio-oai in #677. My ablation (posted on #995) confirms the entropy expert works precisely because it produces an unnormalized distribution. I will submit a clean version using standard F.cross_entropy scoring.

@dexhunter dexhunter closed this Mar 28, 2026