Record: SP8192 + Headwise Gated Attention + Legal TTT (1.0805 BPB, 3-seed) #2005
jamesEmerson112 wants to merge 2 commits into openai:main
Conversation
…al_bpb=1.0511)
3-seed mean: 1.0511 BPB (std 0.0008), 8xH100 SXM
Seeds: 42 (1.0517), 1337 (1.0513), 2025 (1.0502)
All artifacts under 16 MB, train <600s, eval <600s
Techniques: Small Batch (ga=1) + EMA=0.990 + Headwise Gated Attention + PreQuantTTT 21ep
Base stack: @bigbag PR openai#1493 (FA3, depth recurrence, parallel residuals, XSA, MuonEq-R, GPTQ int6+brotli, score-first TTT)
- Rename folder: remove PreQuantTTT from name
- Update to C6 3-seed data (mean 1.0805 BPB)
- Remove PreQuantTTT from train_gpt.py, submission.json, README
- Mark as non-record, fully legal submission
- compliance.no_pre_quant_ttt: true

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Credits

This submission builds on the work of many contributors to the Parameter Golf community:
Acknowledgements
PR #2005 Supplement — Expanded Research Contributions

val_bpb = 1.0805 (3-seed mean, std 0.0012) | ~15.70 MB | 8xH100 SXM

Research Contributions

1. Headwise Gated Attention (Novel Architecture)

A post-attention sigmoid gate applied per head, after FlashAttention-3 + XSA compute the attention output. The learned gate modulates each head's contribution before the output projection.
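A minimal PyTorch sketch of the mechanism is below. It is illustrative only: module and parameter names are our own, grouped-query attention (8H/4KV) is omitted for brevity, and `F.scaled_dot_product_attention` stands in for the FA3 + XSA attention compute used in the actual stack.

```python
# Illustrative sketch of headwise gated attention (not the submission's train_gpt.py).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadwiseGatedAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        # Q projection widened by n_heads extra outputs: one gate logit per head.
        self.qg_proj = nn.Linear(d_model, d_model + n_heads, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        qg = self.qg_proj(x)
        q, gate_logits = qg.split([C, self.n_heads], dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # Stand-in for the FA3 + XSA attention compute.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Post-attention sigmoid gate: one scalar per head, per token.
        gate = torch.sigmoid(gate_logits).transpose(1, 2).unsqueeze(-1)  # (B, H, T, 1)
        out = (out * gate).transpose(1, 2).reshape(B, T, C)
        return self.o_proj(out)

# quick shape check
x = torch.randn(2, 16, 512)
print(HeadwiseGatedAttention()(x).shape)  # torch.Size([2, 16, 512])
```

Because the gate adds only on the order of d_model × n_heads parameters per attention layer (a few tens of thousands in total at this model size), it is essentially free under the 16 MB budget, in contrast to the elementwise variant discussed below.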
Inspired by NeurIPS 2025 Best Paper (arxiv:2505.06708). Consistent improvement across scales:
The -0.0005 BPB improvement is preserved exactly when scaling from 2×H100 to 8×H100, confirming the technique's robustness. We also tested elementwise gating (1 gate per dim per head, +2.36M params). It achieved slightly better BPB (1.2602 vs 1.2653 on SP1024) but exceeded the 16 MB budget (17.87 MB). Headwise is the Pareto-optimal choice: nearly free parameters, fits under budget, and provides consistent improvement.

2. 29-Paper Systematic Survey

Surveyed papers from NeurIPS 2024-2025, ICML 2025, ICLR 2025, and ACL 2025 covering attention modifications, normalization strategies, optimizer scheduling, data selection, structured layers, and compression techniques. Each paper was assessed for Parameter Golf feasibility (16 MB / 10-min / 36M-param constraints) and mapped to PG leaderboard presence.

Key finding: most techniques published for 125M+ parameter models do not transfer to the 36M regime. Of 10 papers tested experimentally, 8 produced negative or null results. Only 2 showed gains on the V2 stack, and those gains did not survive the jump to 8×H100.

3. EMA Decay Scaling Law at Short Training Durations

Discovered that the optimal EMA decay shifts dramatically lower when training steps are limited. The rank 1 default (0.9965) is suboptimal for short runs (~1,000-4,500 steps). 2×H100 EMA sweep (C6 base, ~1,030 steps):
Gains are monotonic from 0.995 to 0.990 — more aggressive averaging helps when the training window is short. But EMA sensitivity is extreme: 0.997 is already worse than the default, and 0.999 is catastrophic.

Critical caveat — this does NOT transfer to 8×H100: with only ~4,486 steps on 8×H100, aggressive EMA averages too few checkpoints. The optimal decay depends on the number of training steps, not just the training duration.
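As a rough intuition for why the optimal decay must track the step count (a back-of-envelope framing of our own, not an analysis from the submission): an EMA with decay d averages roughly the last 1/(1-d) checkpoints.

```python
# Effective EMA averaging horizon ~= 1 / (1 - decay) optimizer steps.
# This framing is our own illustration; the step counts in the comment below
# are the ones quoted in this PR.
for decay in (0.999, 0.9965, 0.995, 0.990):
    horizon = 1.0 / (1.0 - decay)
    print(f"decay={decay:<6} -> averages roughly the last {horizon:5.0f} checkpoints")

# decay=0.990 averages ~100 checkpoints and 0.9965 averages ~285; what fraction
# of the run that covers depends on the total step count (~1,030 steps on 2xH100
# vs ~4,486 on 8xH100), which is why the 2xH100 sweep result does not carry over.
```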
4. 3×3 Factorial Technique Interaction Study

Systematic 9-run factorial experiment on the rank 1 stack to isolate technique interactions. Three factors, three levels each:
Factor matrix (TTT BPB, 2×H100):
Key interactions discovered:
This factorial design revealed that technique interactions are non-trivial — you cannot predict combined performance from individual ablations.

5. Small Batch Size for Short Wall-Clock Training

Tested reducing the effective batch size to get more optimizer updates within the fixed 10-minute window (inspired by Liao et al., NeurIPS 2024). 2×H100 results (C6 base):
4× smaller batch → 3.3× more steps → -0.020 BPB improvement. This was the largest single-technique gain we found on the V2 stack. But it does NOT transfer to 8×H100:
Despite achieving 13,146 steps on 8×H100 (where EMA=0.990 should help), the smaller batch size degrades quality more than the extra steps help.
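For context, the back-of-envelope step-count arithmetic behind this trade-off is sketched below. The step times are assumptions chosen to roughly reproduce the step counts quoted in this PR; only the qualitative relationship is the point.

```python
# Back-of-envelope arithmetic for optimizer steps within the fixed 10-minute
# window. Step times below are assumptions, not measurements from the submission.

WINDOW_S = 600.0  # 10-minute training budget

def steps_in_window(step_time_s: float) -> int:
    return int(WINDOW_S // step_time_s)

# ga=4 -> ga=1 makes each optimizer step ~4x smaller, but fixed per-step overhead
# keeps the wall-clock speedup below 4x, hence ~3.3x more steps rather than 4x.
print(steps_in_window(0.58))   # ~1,030 steps  (ga=4, 2xH100-like step time)
print(steps_in_window(0.175))  # ~3,430 steps  (ga=1, 2xH100-like step time)
```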
6. Technique Transfer Failure Across GPU Counts

Perhaps the most important meta-finding: hyperparameter improvements on 2×H100 do not reliably transfer to 8×H100.

Architectural changes (headwise gate) transfer perfectly — the delta is preserved exactly across scales. Training hyperparameters (EMA decay, batch size) do not transfer because they depend on the number of training steps, which changes with GPU count. On 8×H100, the model sees ~4× more tokens per step, so the total step count drops from ~3,000-4,000 to ~1,000-4,500. This shifts the optimal EMA and batch-size trade-offs.

Implication: for Parameter Golf, always validate hyperparameter tuning at competition scale (8×H100). Architecture changes can be safely prototyped on fewer GPUs.

7. GPTQ Compression Analysis

Systematic comparison of our GPTQ implementation vs leaderboard leaders revealed a 5× quality gap (+0.05 BPB for us vs +0.01 BPB for Kevin Clark, rank 5, and dexhunter, rank 7). Root cause analysis identified 5 compounding factors:
The leaderboard doesn't just "use GPTQ" — they use GPTQ as the final step of a quantization-aware pipeline (WD tuning → EMA → QAT → GPTQ → brotli). We're only doing the last two steps.
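For readers unfamiliar with those last two steps, the sketch below shows their shape using plain round-to-nearest int6 quantization followed by brotli. It is not the submission's GPTQ code: real GPTQ additionally compensates rounding error with second-order (Hessian) information, and the packing here is deliberately naive.

```python
# Illustrative "quantize then brotli" sketch (round-to-nearest int6, per output
# channel). Not GPTQ: no Hessian-based error compensation; 6-bit values are
# simply stored in int8 containers and brotli squeezes out the redundancy.
import brotli
import numpy as np

def quantize_int6_per_channel(w: np.ndarray):
    qmax = 31  # symmetric 6-bit range
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale.astype(np.float16)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(512, 2048)).astype(np.float32)  # toy weight matrix
q, scale = quantize_int6_per_channel(w)
packed = brotli.compress(q.tobytes() + scale.tobytes(), quality=11)
print(f"fp32 {w.nbytes / 1e6:.2f} MB -> int6+brotli {len(packed) / 1e6:.2f} MB")
```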
Techniques That Failed (Expanded)

Tested on the V2 rank 1 stack (36M params, SP8192, 10-min wall clock). All produced negative results.

Meta-finding: 8 of 10 tested papers produced negative results at the 36M-parameter scale. The 36M / 16 MB / 10-min constraint regime fundamentally changes which optimizations matter. Techniques designed for 125M+ parameter models with large compute budgets are not "free gains" at small scale — they often interact negatively with the aggressive training stack (MuonEq-R, depth recurrence, parallel residuals, XSA) that already exists.

Experiment Scale
BPB Progression
2×H100 → 8×H100 Scaling

Consistent improvement across all V2 configurations:
Technique deltas are preserved across scales for architectural changes. The 4× increase in GPU count (2×H100 → 8×H100) yields a consistent ~0.083 BPB improvement.

3-Seed Reproducibility
All seeds: artifact under 16 MB, training under 600s, eval under 600s.
Run ID Reference

Internal run IDs used throughout this document. Each describes a specific configuration.
Base stack ("@bigbag's stack" / "rank 1 stack"): SP8192 vocabulary, 11L×512d×8H/4KV, 4×MLP with LeakyReLU(0.5)², 3-layer depth recurrence (layers 3-4-5 looped 2×), parallel residuals (layers 7+), sigmoid skip gates, partial RoPE (16/64 dims), XSA on all layers, QK-Gain 5.25, MuonEq-R optimizer, EMA (0.9965), GPTQ int6+brotli, score-first TTT. From @bigbag PR #1493.
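For quick reference, the same configuration as a plain Python dict. Field names are our own shorthand for this summary and do not correspond to variables in train_gpt.py.

```python
# Illustrative summary of the base stack described above (shorthand field names).
base_stack = dict(
    vocab_size=8192,                       # SP8192 SentencePiece vocabulary
    n_layers=11, d_model=512,
    n_heads=8, n_kv_heads=4,               # grouped-query attention (8H/4KV)
    mlp_ratio=4,                           # 4x MLP, LeakyReLU(0.5)^2 activation
    depth_recurrence=dict(layers=(3, 4, 5), loops=2),
    parallel_residuals_from_layer=7,
    sigmoid_skip_gates=True,
    rope_dims=16,                          # partial RoPE: 16 of 64 head dims
    xsa="all layers",
    qk_gain=5.25,
    optimizer="MuonEq-R",
    ema_decay=0.9965,
    compression="GPTQ int6 + brotli",
    ttt="score-first TTT",
)
```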
Record: SP8192 + Full Stack + Headwise Gated Attention + Legal TTT
val_bpb = 1.0805 (3-seed mean, std 0.0012) | ~15.74 MB | 8xH100 SXM
Non-record submission documenting a novel architecture modification (headwise gated attention) and systematic ablation study across 40+ experiments.
3-Seed Results
Novel Contribution: Headwise Gated Attention
Post-attention sigmoid gate applied per head, after FA3+XSA. The Q projection is widened by gate_dim, and the gate modulates each head's contribution before the output projection. ~50K extra params, zero latency cost, consistent -0.0005 BPB. Inspired by NeurIPS 2025 Best Paper (arxiv:2505.06708).

Compliance
Research Contributions
Base Stack
Built on @bigbag's stack (PR #1493) with @clarkkev's SP8192/GPTQ/MuonEq-R (PR #1394), @dexhunter's depth recurrence + legal TTT (PR #1331, #1413), and community contributions.
Reproduction