
Record: 1.0362 BPB — SGD Momentum 0.95 TTT + HedgeMixer + Per-Layer LR #995

Closed
dexhunter wants to merge 1 commit into openai:main from dexhunter:submission/2026-03-27-sgd-momentum95-1.0362

Conversation

@dexhunter

Built on PR #720 by @agalimova. Two key improvements to the TTT recipe:

  1. SGD optimizer (lr=0.002, momentum=0.95) replaces AdamW — accounts for -0.036 BPB
  2. Per-layer LR groups (3x output projections, 0.5x input) + cosine schedule + 4 epochs + zero frozen blocks
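The two changes above can be sketched in plain Python to make the arithmetic explicit. This is illustrative only: the actual run uses `torch.optim.SGD` with parameter groups inside `train_gpt.py`, and the layer-kind labels here are assumptions, not the repo's real parameter names.

```python
# Illustrative sketch of the TTT optimizer settings described above.
BASE_LR = 0.002
LAYER_LR_MULT = {"out_proj": 3.0, "in_proj": 0.5}  # 3x output projections, 0.5x input

def layer_lr(kind, base_lr=BASE_LR):
    """Per-layer learning rate: base LR scaled by the group multiplier (1x default)."""
    return LAYER_LR_MULT.get(kind, 1.0) * base_lr

def sgd_momentum_step(param, grad, velocity, lr, momentum=0.95):
    """PyTorch-style SGD with momentum: v <- momentum*v + g; p <- p - lr*v."""
    velocity = momentum * velocity + grad
    return param - lr * velocity, velocity
```

With momentum 0.95 the velocity buffer retains 95% of its previous value each step, smoothing per-chunk gradients more aggressively than the 0.9 default; per the ablation below, this lowers both the mean and the seed-to-seed variance.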

3-Seed Results

| Seed | TTT BPB | Eval time | Artifact |
|------|---------|-----------|----------|
| 1337 | 1.0302 | 513s | 15.57MB |
| 42 | 1.0365 | 533s | 15.67MB |
| 2025 | 1.0419 | 539s | 15.15MB |
| **Mean** | **1.0362** | | |

Ablation

| Config | Mean BPB | Notes |
|--------|----------|-------|
| This (SGD m=0.95 + per-layer LR) | 1.0362 | Best |
| SGD m=0.9 + per-layer LR | 1.0450 | Higher variance |
| AdamW + per-layer LR | 1.0722 | First submission |
| No mixer (SGD TTT only) | ~1.156 | Mixer is essential |
| No TTT (sliding window) | ~1.121 | Neural-only baseline |

Validated via 15+ single-knob sweeps: LR (0.001–0.003), momentum (0.9–0.97), freeze depth (0–2), epochs (3–5), per-layer LR mult (2x–4x), chunk size (24K–32K), trigram hash (64K–128K).

Run

```sh
export NCCL_NET=Socket SKIP_SLIDING=1
export TTT_OPTIMIZER=sgd TTT_LR=0.002 TTT_MOMENTUM=0.95
SEED=1337 torchrun --nproc_per_node=8 train_gpt.py
```

Compliance

- 3 seeds on 8xH100 SXM, all train ≤600s, eval ≤600s, artifact ≤16MB
- Score-first legal TTT + backward-looking HedgeMixer
- No external data access during eval

Test Plan

- 3-seed verification (1337, 42, 2025)
- 15+ hyperparameter configs swept to confirm optimality
- Ablation: no-mixer, no-TTT, AdamW baselines measured
- All seeds emit final_int6_ttt_exact metrics

Built on PR openai#720 by @agalimova. Key improvement: momentum 0.95 (vs 0.9)
reduces variance and improves mean by 0.009 BPB.

3-seed results:
  Seed 1337: 1.0302 BPB (513s eval)
  Seed   42: 1.0365 BPB (533s eval)
  Seed 2025: 1.0419 BPB (539s eval)
  Mean:      1.0362 ± 0.006

Validated via comprehensive hyperparameter sweep:
  LR: 0.001/0.002/0.003 → 0.002 optimal
  Freeze: 0/1/2 → 0 optimal
  Epochs: 3/4/5 → 4 optimal
  Per-layer LR: 2x/3x/4x proj → 3x optimal
  Momentum: 0.9/0.95 → 0.95 optimal
@NoesisGenesis

FYI, the entropy expert in LogisticContextMixer fails condition 2. It is not -log q_t(x_t) for any normalized distribution q_t over Σ, but a scalar functional of the neural distribution itself. So once it is mixed in as an expert, the resulting object is not, in general, a probability distribution over Σ.
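A toy numeric check makes the failure concrete (illustrative only, not the actual LogisticContextMixer code): any convex mixture of proper experts, each a normalized distribution over Σ, still sums to 1, whereas a scalar entropy functional broadcast across the alphabet breaks that invariant.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Two proper experts over a toy 4-symbol alphabet: each sums to 1.
neural = softmax([2.0, 0.5, -1.0, 0.1])
unigram = [0.25] * 4

# The "entropy expert": a scalar functional of the neural distribution,
# broadcast across the alphabet. It sums to 4*H, not 1.
H = -sum(p * math.log(p) for p in neural)
entropy_expert = [H] * 4

w = [0.6, 0.3, 0.1]  # mixture weights
mixed = [w[0] * n + w[1] * u + w[2] * e
         for n, u, e in zip(neural, unigram, entropy_expert)]
# sum(mixed) = 0.9 + 0.1 * (4 * H), which equals 1 only if H happens to be 0.25
```

Here the mixture's total mass depends on H, i.e. on the neural model's own confidence, which is exactly the Condition 2 violation described above.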

aryanbhosale added a commit to aryanbhosale/parameter-golf that referenced this pull request Mar 28, 2026
slope 0.75 + LR 0.027 + warmdown 3700 (PR openai#977)
No SWA with QAT (PR openai#989)
QAT from 50% + range fix [-31,31]
mHC 22-param residual mixing (PR openai#928)
VE128 + no gated_attn + no value_residual (PR openai#549)
LZMA preset 7 compression (PR openai#999)
Muon TTT with NS3 (PR openai#999)
Entropy-adaptive TTT epochs 2/3/4 (PR openai#999)
Per-layer TTT LR (PR openai#995)
TTT momentum 0.95 (PR openai#995)
@dexhunter
Author

dexhunter commented Mar 28, 2026

@NoesisGenesis — you're right. The entropy expert violates Condition 2. I independently identified this through my own audit and ran a thorough ablation to understand the mechanism. Sharing the results here since they may be useful to the community.

Ablation Matrix

| Variant | Description | Scored dist sums to 1? | val_bpb | Δ vs A |
|---------|-------------|------------------------|---------|--------|
| A | Original (5 experts incl. entropy) | No (~1.057) | 1.0699 | (baseline) |
| B | 4-expert (neural+uni+bi+tri), no entropy | Yes | 1.1563 | +0.086 |
| C | 4-expert (neural+uni+bi+entropy), no trigram | No | 1.0801 | +0.010 |
| D | Mixer off entirely (raw neural CE) | N/A (standard softmax) | 1.1563 | +0.086 |
| E | 4-expert scored + entropy as Hedge eta modulator only | Yes | 1.1562 | +0.086 |

Key Findings

  1. B ≈ D ≈ E — a properly normalized 4-expert mixture performs identically to no mixer at all. The n-gram experts (unigram, bigram, trigram) contribute nothing to scoring on their own.
  2. Entropy as a control signal (E) has zero effect — modulating the Hedge learning rate with entropy is indistinguishable from removing it entirely.
  3. The trigram expert is a small contributor — removing only the trigram (C) costs 0.010 BPB. The entropy expert captures most of the mixer benefit.
  4. The mechanism is direct mixing, not information — entropy's value comes entirely from being in the logsumexp scoring path, acting as a confidence-dependent score modifier. Used only as a control signal, it has zero measurable effect.
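For concreteness, here is a hedged sketch of the A-vs-E distinction (function names and the modulation form are assumptions; the real mixer differs in detail). In A-style scoring, the entropy term sits inside the logsumexp over expert log-scores and directly shifts the combined score; in E-style use, entropy only scales the Hedge learning rate, and the scored mixture stays on the simplex.

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Variant A style (illustrative): entropy enters the scoring path as an
# extra term inside the logsumexp, acting as a confidence-dependent modifier.
def score_A(expert_logscores, log_weights, entropy, alpha=1.0):
    terms = [lw + s for lw, s in zip(log_weights, expert_logscores)]
    terms.append(math.log(0.01) + alpha * entropy)  # entropy "expert" term
    return logsumexp(terms)

# Variant E style (illustrative): entropy only modulates the Hedge learning
# rate; the weights renormalize, so the scored mixture remains a distribution.
def hedge_update(weights, losses, entropy, eta0=1.0):
    eta = eta0 / (1.0 + entropy)  # hypothetical modulation
    raw = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    z = sum(raw)
    return [r / z for r in raw]
```

In the E-style update the weights renormalize every step, so entropy can only reweight experts; it cannot move probability mass off the simplex, which is consistent with finding 2 above.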

Conclusion

The entropy expert works precisely because it violates Condition 2 — it shapes the scored distribution in a way that no properly normalized replacement can replicate. Every normalized variant I tested regresses by ~0.086 BPB, collapsing to the same performance as no mixer at all.

I am closing this PR and my other mixer-based submissions (#953, #967). My next submission will use standard F.cross_entropy scoring with properly normalized probabilities.

Thank you for formalizing the conditions — they provide exactly the clarity the community needed.

@dexhunter
Author

dexhunter commented Mar 28, 2026

Closing: the LogisticContextMixer's entropy expert violates Condition 2 from the normalization criteria recommended by @valerio-oai in #677. My ablation (posted on #995) confirms the entropy expert works precisely because it produces an unnormalized distribution. I will submit a clean version using standard F.cross_entropy scoring.

@dexhunter dexhunter closed this Mar 28, 2026