Record: SP8192 + Headwise Gated Attention + Legal TTT (1.0805 BPB, 3-seed) #2005
jamesEmerson112 wants to merge 2 commits into openai:main
Conversation
…al_bpb=1.0511)
3-seed mean: 1.0511 BPB (std 0.0008), 8xH100 SXM
Seeds: 42 (1.0517), 1337 (1.0513), 2025 (1.0502)
All artifacts under 16 MB, train <600s, eval <600s
Techniques: Small Batch (ga=1) + EMA=0.990 + Headwise Gated Attention + PreQuantTTT 21ep
Base stack: @bigbag PR openai#1493 (FA3, depth recurrence, parallel residuals, XSA, MuonEq-R, GPTQ int6+brotli, score-first TTT)
- Rename folder: remove PreQuantTTT from name
- Update to C6 3-seed data (mean 1.0805 BPB)
- Remove PreQuantTTT from train_gpt.py, submission.json, README
- Mark as non-record, fully legal submission
- compliance.no_pre_quant_ttt: true

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Credits

This submission builds on the work of many contributors to the Parameter Golf community:
Acknowledgements
PR #2005 Supplement — Expanded Research Contributions

val_bpb = 1.0805 (3-seed mean, std 0.0012) | ~15.70 MB | 8xH100 SXM

Research Contributions

1. Headwise Gated Attention (Novel Architecture)

A post-attention sigmoid gate applied per head, after FlashAttention-3 + XSA compute the attention output. The learned gate modulates each head's contribution before the output projection.
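A minimal PyTorch sketch of the mechanism is below. It is illustrative only: module and parameter names are our own, grouped-query attention (8H/4KV) is omitted for brevity, and `F.scaled_dot_product_attention` stands in for the FA3 + XSA attention compute used in the actual stack.

```python
# Illustrative sketch of headwise gated attention (not the submission's train_gpt.py).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadwiseGatedAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        # Q projection widened by n_heads extra outputs: one gate logit per head.
        self.qg_proj = nn.Linear(d_model, d_model + n_heads, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        qg = self.qg_proj(x)
        q, gate_logits = qg.split([C, self.n_heads], dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # Stand-in for the FA3 + XSA attention compute.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Post-attention sigmoid gate: one scalar per head, per token.
        gate = torch.sigmoid(gate_logits).transpose(1, 2).unsqueeze(-1)  # (B, H, T, 1)
        out = (out * gate).transpose(1, 2).reshape(B, T, C)
        return self.o_proj(out)

# quick shape check
x = torch.randn(2, 16, 512)
print(HeadwiseGatedAttention()(x).shape)  # torch.Size([2, 16, 512])
```

Because the gate adds only on the order of d_model × n_heads parameters per attention layer (a few tens of thousands in total at this model size), it is essentially free under the 16 MB budget, in contrast to the elementwise variant discussed below.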
Inspired by NeurIPS 2025 Best Paper (arxiv:2505.06708). Consistent improvement across scales:
The -0.0005 BPB improvement is preserved exactly when scaling from 2×H100 to 8×H100, confirming the technique's robustness. We also tested elementwise gating (1 gate per dim per head, +2.36M params). It achieved slightly better BPB (1.2602 vs 1.2653 on SP1024) but exceeded the 16 MB budget (17.87 MB). Headwise is the Pareto-optimal choice: nearly free parameters, fits under budget, and provides consistent improvement.

2. 29-Paper Systematic Survey

Surveyed papers from NeurIPS 2024-2025, ICML 2025, ICLR 2025, and ACL 2025 covering attention modifications, normalization strategies, optimizer scheduling, data selection, structured layers, and compression techniques. Each paper was assessed for Parameter Golf feasibility (16 MB / 10-min / 36M-param constraints) and mapped to PG leaderboard presence.

Key finding: most techniques published for 125M+ parameter models do not transfer to the 36M regime. Of 10 papers tested experimentally, 8 produced negative or null results. Only 2 showed gains on the V2 stack, and those gains did not survive the jump to 8×H100.

3. EMA Decay Scaling Law at Short Training Durations

Discovered that the optimal EMA decay shifts dramatically lower when training steps are limited. The rank 1 default (0.9965) is suboptimal for short runs (~1,000-4,500 steps). 2×H100 EMA sweep (C6 base, ~1,030 steps):
Gains are monotonic from 0.995 to 0.990 — more aggressive averaging helps when the training window is short. But EMA sensitivity is extreme: 0.997 is already worse than the default, and 0.999 is catastrophic.

Critical caveat — this does NOT transfer to 8×H100: with only ~4,486 steps on 8×H100, aggressive EMA averages too few checkpoints. The optimal decay depends on the number of training steps, not just the training duration.
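As a rough intuition for why the optimal decay must track the step count (a back-of-envelope framing of our own, not an analysis from the submission): an EMA with decay d averages roughly the last 1/(1-d) checkpoints.

```python
# Effective EMA averaging horizon ~= 1 / (1 - decay) optimizer steps.
# This framing is our own illustration; the step counts in the comment below
# are the ones quoted in this PR.
for decay in (0.999, 0.9965, 0.995, 0.990):
    horizon = 1.0 / (1.0 - decay)
    print(f"decay={decay:<6} -> averages roughly the last {horizon:5.0f} checkpoints")

# decay=0.990 averages ~100 checkpoints and 0.9965 averages ~285; what fraction
# of the run that covers depends on the total step count (~1,030 steps on 2xH100
# vs ~4,486 on 8xH100), which is why the 2xH100 sweep result does not carry over.
```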
4. 3×3 Factorial Technique Interaction Study

Systematic 9-run factorial experiment on the rank 1 stack to isolate technique interactions. Three factors, three levels each:
Factor matrix (TTT BPB, 2×H100):
Key interactions discovered:
This factorial design revealed that technique interactions are non-trivial — you cannot predict combined performance from individual ablations.

5. Small Batch Size for Short Wall-Clock Training

Tested reducing the effective batch size to get more optimizer updates within the fixed 10-minute window (inspired by Liao et al., NeurIPS 2024). 2×H100 results (C6 base):
4× smaller batch → 3.3× more steps → -0.020 BPB improvement. This was the largest single-technique gain we found on the V2 stack. But it does NOT transfer to 8×H100:
Despite achieving 13,146 steps on 8×H100 (where EMA=0.990 should help), the smaller batch size degrades quality more than the extra steps help.
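For context, the back-of-envelope step-count arithmetic behind this trade-off is sketched below. The step times are assumptions chosen to roughly reproduce the step counts quoted in this PR; only the qualitative relationship is the point.

```python
# Back-of-envelope arithmetic for optimizer steps within the fixed 10-minute
# window. Step times below are assumptions, not measurements from the submission.

WINDOW_S = 600.0  # 10-minute training budget

def steps_in_window(step_time_s: float) -> int:
    return int(WINDOW_S // step_time_s)

# ga=4 -> ga=1 makes each optimizer step ~4x smaller, but fixed per-step overhead
# keeps the wall-clock speedup below 4x, hence ~3.3x more steps rather than 4x.
print(steps_in_window(0.58))   # ~1,030 steps  (ga=4, 2xH100-like step time)
print(steps_in_window(0.175))  # ~3,430 steps  (ga=1, 2xH100-like step time)
```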
6. Technique Transfer Failure Across GPU Counts

Perhaps the most important meta-finding: hyperparameter improvements on 2×H100 do not reliably transfer to 8×H100.

Architectural changes (headwise gate) transfer perfectly — the delta is preserved exactly across scales. Training hyperparameters (EMA decay, batch size) do not transfer because they depend on the number of training steps, which changes with GPU count. On 8×H100, the model sees ~4× more tokens per step, so the total step count drops from ~3,000-4,000 to ~1,000-4,500. This shifts the optimal EMA and batch-size trade-offs.

Implication: for Parameter Golf, always validate hyperparameter tuning at competition scale (8×H100). Architecture changes can be safely prototyped on fewer GPUs.

7. GPTQ Compression Analysis

Systematic comparison of our GPTQ implementation vs leaderboard leaders revealed a 5× quality gap (+0.05 BPB for us vs +0.01 BPB for Kevin Clark, rank 5, and dexhunter, rank 7). Root cause analysis identified 5 compounding factors:
The leaderboard doesn't just "use GPTQ" — they use GPTQ as the final step of a quantization-aware pipeline (WD tuning → EMA → QAT → GPTQ → brotli). We're only doing the last two steps.
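For readers unfamiliar with those last two steps, the sketch below shows their shape using plain round-to-nearest int6 quantization followed by brotli. It is not the submission's GPTQ code: real GPTQ additionally compensates rounding error with second-order (Hessian) information, and the packing here is deliberately naive.

```python
# Illustrative "quantize then brotli" sketch (round-to-nearest int6, per output
# channel). Not GPTQ: no Hessian-based error compensation; 6-bit values are
# simply stored in int8 containers and brotli squeezes out the redundancy.
import brotli
import numpy as np

def quantize_int6_per_channel(w: np.ndarray):
    qmax = 31  # symmetric 6-bit range
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale.astype(np.float16)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(512, 2048)).astype(np.float32)  # toy weight matrix
q, scale = quantize_int6_per_channel(w)
packed = brotli.compress(q.tobytes() + scale.tobytes(), quality=11)
print(f"fp32 {w.nbytes / 1e6:.2f} MB -> int6+brotli {len(packed) / 1e6:.2f} MB")
```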
Techniques That Failed (Expanded)

Tested on the V2 rank 1 stack (36M params, SP8192, 10-min wall clock). All produced negative results.

Meta-finding: 8 of 10 tested papers produced negative results at the 36M-parameter scale. The 36M / 16 MB / 10-min constraint regime fundamentally changes which optimizations matter. Techniques designed for 125M+ parameter models with large compute budgets are not "free gains" at small scale — they often interact negatively with the aggressive training stack (MuonEq-R, depth recurrence, parallel residuals, XSA) that already exists.

Experiment Scale
BPB Progression
2×H100 → 8×H100 Scaling

Consistent improvement across all V2 configurations:
Technique deltas are preserved across scales for architectural changes. The 4× increase in GPU count (2×H100 → 8×H100) yields a consistent ~0.083 BPB improvement.

3-Seed Reproducibility
All seeds: artifact under 16 MB, training under 600s, eval under 600s.
Run ID Reference

Internal run IDs used throughout this document. Each describes a specific configuration.
Base stack ("@bigbag's stack" / "rank 1 stack"): SP8192 vocabulary, 11L×512d×8H/4KV, 4×MLP with LeakyReLU(0.5)², 3-layer depth recurrence (layers 3-4-5 looped 2×), parallel residuals (layers 7+), sigmoid skip gates, partial RoPE (16/64 dims), XSA on all layers, QK-Gain 5.25, MuonEq-R optimizer, EMA (0.9965), GPTQ int6+brotli, score-first TTT. From @bigbag PR #1493.
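For quick reference, the same configuration as a plain Python dict. Field names are our own shorthand for this summary and do not correspond to variables in train_gpt.py.

```python
# Illustrative summary of the base stack described above (shorthand field names).
base_stack = dict(
    vocab_size=8192,                       # SP8192 SentencePiece vocabulary
    n_layers=11, d_model=512,
    n_heads=8, n_kv_heads=4,               # grouped-query attention (8H/4KV)
    mlp_ratio=4,                           # 4x MLP, LeakyReLU(0.5)^2 activation
    depth_recurrence=dict(layers=(3, 4, 5), loops=2),
    parallel_residuals_from_layer=7,
    sigmoid_skip_gates=True,
    rope_dims=16,                          # partial RoPE: 16 of 64 head dims
    xsa="all layers",
    qk_gain=5.25,
    optimizer="MuonEq-R",
    ema_decay=0.9965,
    compression="GPTQ int6 + brotli",
    ttt="score-first TTT",
)
```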
Record: SP8192 + Full Stack + Headwise Gated Attention + Legal TTT
val_bpb = 1.0805 (3-seed mean, std 0.0012) | ~15.74 MB | 8xH100 SXM
Non-record submission documenting a novel architecture modification (headwise gated attention) and systematic ablation study across 40+ experiments.
3-Seed Results
Novel Contribution: Headwise Gated Attention
Post-attention sigmoid gate applied per head, after FA3+XSA. The Q projection is widened by gate_dim, and the gate modulates each head's contribution before the output projection. ~50K extra params, zero latency cost, consistent -0.0005 BPB. Inspired by NeurIPS 2025 Best Paper (arxiv:2505.06708).

Compliance
Research Contributions
Base Stack
Built on @bigbag's stack (PR #1493) with @clarkkev's SP8192/GPTQ/MuonEq-R (PR #1394), @dexhunter's depth recurrence + legal TTT (PR #1331, #1413), and community contributions.
Reproduction