Record: 1.0362 BPB — SGD Momentum 0.95 TTT + HedgeMixer + Per-Layer LR#995
dexhunter wants to merge 1 commit into openai:main from
Conversation
Built on PR openai#720 by @agalimova. Key improvement: momentum 0.95 (vs 0.9) reduces variance and improves the mean by 0.009 BPB.

3-seed results:

- Seed 1337: 1.0302 BPB (513s eval)
- Seed 42: 1.0365 BPB (533s eval)
- Seed 2025: 1.0419 BPB (539s eval)
- Mean: 1.0362 ± 0.006

Validated via a comprehensive hyperparameter sweep:

- LR: 0.001 / 0.002 / 0.003 → 0.002 optimal
- Freeze: 0 / 1 / 2 → 0 optimal
- Epochs: 3 / 4 / 5 → 4 optimal
- Per-layer LR: 2x / 3x / 4x proj → 3x optimal
- Momentum: 0.9 / 0.95 → 0.95 optimal
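The optimizer recipe above (SGD, momentum 0.95, base LR 0.002, a 3x LR multiplier on projection layers) can be sketched in a few lines. This is an illustrative, dependency-free sketch; the function and parameter names are mine, not the PR's actual code.

```python
def sgd_momentum_step(params, grads, vel, base_lr=0.002, momentum=0.95, lr_mult=None):
    """One PyTorch-style SGD-with-momentum step per named parameter:
    v <- momentum * v + g;  p <- p - lr * v.
    lr_mult maps parameter names to per-layer LR multipliers (e.g. 3x on proj)."""
    lr_mult = lr_mult or {}
    for name, p in params.items():
        lr = base_lr * lr_mult.get(name, 1.0)
        vel[name] = momentum * vel[name] + grads[name]
        params[name] = p - lr * vel[name]
    return params, vel

# Toy usage: the projection parameter gets a 3x learning rate.
params = {"attn.w": 1.0, "proj.w": 1.0}
grads = {"attn.w": 0.5, "proj.w": 0.5}
vel = {k: 0.0 for k in params}
params, vel = sgd_momentum_step(params, grads, vel, lr_mult={"proj.w": 3.0})
```

With zero initial velocity the first step is just `p - lr * g`, so the 3x multiplier directly triples the projection layer's effective step size.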
FYI, the entropy expert in LogisticContextMixer fails Condition 2. It is not …
- slope 0.75 + LR 0.027 + warmdown 3700 (PR openai#977)
- No SWA with QAT (PR openai#989)
- QAT from 50% + range fix [-31, 31]
- mHC 22-param residual mixing (PR openai#928)
- VE128 + no gated_attn + no value_residual (PR openai#549)
- LZMA preset 7 compression (PR openai#999)
- Muon TTT with NS3 (PR openai#999)
- Entropy-adaptive TTT epochs 2/3/4 (PR openai#999)
- Per-layer TTT LR (PR openai#995)
- TTT momentum 0.95 (PR openai#995)
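The entropy-adaptive TTT epochs item in the list above amounts to bucketing each document's measured entropy into an epoch count of 2, 3, or 4. A minimal sketch of that idea follows; the thresholds and function name are hypothetical, not the values used in PR openai#999.

```python
def ttt_epochs(bits_per_byte, low=0.9, high=1.1):
    """Map a document's entropy estimate (in bits per byte) to a TTT epoch
    count in {2, 3, 4}. The low/high thresholds are illustrative placeholders."""
    if bits_per_byte < low:
        return 2   # low-entropy (easy) text: fewer test-time-training epochs
    if bits_per_byte < high:
        return 3
    return 4       # high-entropy (hard) text: more test-time-training epochs
```

The intuition is to spend more test-time-training compute where the model is most uncertain, and save wall-clock time on predictable text.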
@NoesisGenesis: you're right. The entropy expert violates Condition 2. I independently identified this through my own audit and ran a thorough ablation to understand the mechanism. Sharing the results here since they may be useful to the community.

Ablation Matrix
Key Findings
Conclusion

The entropy expert works precisely because it violates Condition 2: it shapes the scored distribution in a way that no properly normalized replacement can replicate. Every normalized variant I tested regresses by ~0.086 BPB, collapsing to the same performance as no mixer at all.

I am closing this PR and my other mixer-based submissions (#953, #967). My next submission will use standard F.cross_entropy scoring with properly normalized probabilities. Thank you for formalizing the conditions; they provide exactly the clarity the community needed.
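The exact statement of Condition 2 lives in #677; assuming it requires the scored per-position distribution to be a proper probability distribution (nonnegative, summing to 1), a minimal self-audit check looks like the sketch below. The helper names are mine, and the "bonus" term is a made-up stand-in for an unnormalized expert contribution.

```python
import math

def satisfies_condition2(probs, tol=1e-6):
    """Check that each row of probs is a valid probability distribution:
    all entries nonnegative and each row summing to 1 (within tol)."""
    for row in probs:
        if any(p < 0 for p in row):
            return False
        if abs(sum(row) - 1.0) > tol:
            return False
    return True

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# A softmax output passes; adding an additive unnormalized "expert bonus"
# pushes the row sum above 1 and fails the check.
normalized = [softmax([1.0, 2.0, 3.0])]
bonused = [[p + 0.1 for p in normalized[0]]]  # row now sums to 1.3
```

Running this check against a mixer's output before scoring would catch the violation automatically instead of relying on manual audits.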
Closing: the LogisticContextMixer's entropy expert violates Condition 2 from the normalization criteria recommended by @valerio-oai in #677. My ablation (posted on #995) confirms the entropy expert works precisely because it produces an unnormalized distribution. I will submit a clean version using standard F.cross_entropy scoring.
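For reference, BPB under standard cross-entropy scoring (the F.cross_entropy route mentioned above) is just the mean negative log-likelihood of the byte targets converted from nats to bits. A dependency-free sketch, with a log-sum-exp for numerical stability:

```python
import math

def bits_per_byte(logits, targets):
    """Mean cross-entropy over byte targets, in bits per byte.
    Matches F.cross_entropy(logits, targets) / math.log(2) in PyTorch."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)
        logz = m + math.log(sum(math.exp(x - m) for x in row))  # log-sum-exp
        total += logz - row[t]  # -log p(target byte)
    return total / (len(targets) * math.log(2.0))

# Sanity check: uniform logits over 256 byte values give exactly 8 bits/byte.
uniform = [[0.0] * 256 for _ in range(4)]
bpb = bits_per_byte(uniform, [10, 20, 30, 40])
```

Because the log-softmax inside this formula is normalized by construction, any scorer built on it satisfies Condition 2 automatically.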
Built on PR #720 by @agalimova. Two key improvements to the TTT recipe:
3-Seed Results
Ablation
Validated via 15+ single-knob sweeps: LR (0.001–0.003), momentum (0.9–0.97), freeze depth (0–2), epochs (3–5), per-layer LR mult (2x–4x), chunk size (24K–32K), trigram hash (64K–128K).
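The single-knob protocol above (vary exactly one hyperparameter at a time around a fixed baseline) can be sketched as a small config generator. The baseline values mirror the sweep results reported in this PR; the helper name and sweep dict are illustrative.

```python
BASELINE = {"lr": 0.002, "momentum": 0.95, "freeze": 0, "epochs": 4, "proj_lr_mult": 3}

def single_knob_configs(baseline, sweeps):
    """Yield configs that differ from the baseline in exactly one knob,
    skipping the baseline value itself (it is covered by the baseline run)."""
    for knob, values in sweeps.items():
        for v in values:
            if v != baseline[knob]:
                cfg = dict(baseline)
                cfg[knob] = v
                yield cfg

# e.g. sweeping LR and momentum around the baseline yields 3 off-baseline runs.
sweeps = {"lr": [0.001, 0.002, 0.003], "momentum": [0.9, 0.95]}
configs = list(single_knob_configs(BASELINE, sweeps))
```

Single-knob sweeps keep attribution clean: any BPB delta versus the baseline run is explained by the one changed hyperparameter.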
Run