
Add 11L Shared Sparse Sidecar + EMA + AdamW TTT (1.0916 mean) #555

Closed
ymrohit wants to merge 1 commit into openai:main from ymrohit:ymrohit-sidecar-record

Conversation


@ymrohit ymrohit commented Mar 23, 2026

Summary

This PR submits a derivative of @sjp611's 1.1027 AdamW-TTT 11-layer stack with a new shared sparse sidecar added on top of the late layers.

Main result:

  • 3-seed mean val_bpb: 1.09161722
  • best seed: 1.09009466
  • all three seeds are legal under the decimal 16,000,000-byte cap

What Is New

The main contribution is a late shared sparse sidecar:

  • injected only in late layers
  • shared across multiple insertion sites
  • conditioned per site with learned site embeddings
  • modulated per site with learned residual scales
  • implemented as gate -> value -> depthwise conv -> proj

This adds a late local-refinement path without paying for another full transformer block at every site.
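The shared-sidecar idea above can be sketched in PyTorch. This is a minimal illustrative sketch, not the PR's actual code: the module name, dimensions, and forward details are assumptions; only the overall shape (one module shared across sites, per-site embeddings and residual scales, a gate -> value -> depthwise conv -> proj path) follows the description.

```python
import torch
import torch.nn as nn

class SharedSparseSidecar(nn.Module):
    """Hypothetical sketch of a sidecar shared across several late insertion sites."""

    def __init__(self, d_model: int, hidden: int = 48, n_sites: int = 3, kernel: int = 3):
        super().__init__()
        # Per-site conditioning: learned embedding added to the input at each site.
        self.site_emb = nn.Embedding(n_sites, d_model)
        # Per-site residual scale: how strongly each site's output is mixed back in.
        self.site_scale = nn.Parameter(torch.full((n_sites,), 0.1))
        self.gate = nn.Linear(d_model, hidden)
        self.value = nn.Linear(d_model, hidden)
        # Depthwise (groups=channels) causal conv for cheap local refinement.
        self.dwconv = nn.Conv1d(hidden, hidden, kernel, padding=kernel - 1, groups=hidden)
        self.proj = nn.Linear(hidden, d_model)

    def forward(self, x: torch.Tensor, site: int) -> torch.Tensor:
        # x: (batch, seq, d_model); site: index of the insertion site.
        h = x + self.site_emb.weight[site]            # condition on the site
        u = torch.sigmoid(self.gate(h)) * self.value(h)  # gate -> value
        u = u.transpose(1, 2)                          # (batch, hidden, seq)
        u = self.dwconv(u)[..., : x.size(1)]           # trim padding -> causal
        u = self.proj(u.transpose(1, 2))               # back to d_model
        return x + self.site_scale[site] * u           # scaled residual
```

Because one `SharedSparseSidecar` instance is reused at every late layer, the parameter cost is paid once; only the tiny embedding and scale tables grow with the number of sites.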

Evidence

Official cloud result

| Seed | Steps | val_loss | val_bpb | Total bytes |
|------|-------|----------|---------|-------------|
| 13   | 5965  | 1.84057430 | 1.09009466 | 15,973,374 |
| 1337 | 5966  | 1.84364647 | 1.09191418 | 15,986,252 |
| 1111 | 5961  | 1.84521443 | 1.09284281 | 15,868,806 |
| Mean |       | 1.84314507 | 1.09161722 | |
| Std  |       | 0.00236    | 0.00140    | |
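As a sanity check on the table, the val_loss to val_bpb ratio is identical across all three seeds, consistent with a fixed nats-to-bits-per-byte conversion. The bytes-per-token factor below is inferred from the numbers themselves, not stated in the PR:

```python
import math

# (val_loss, val_bpb) per seed, copied from the table above.
rows = [
    (1.84057430, 1.09009466),  # seed 13
    (1.84364647, 1.09191418),  # seed 1337
    (1.84521443, 1.09284281),  # seed 1111
]

# If val_bpb = val_loss * log2(e) / bytes_per_token, the implied
# bytes-per-token factor should be constant across seeds.
factors = [loss * math.log2(math.e) / bpb for loss, bpb in rows]
print(factors)  # each factor is ~2.436 bytes per token
```

The factors agree to five decimal places, so the per-seed bpb numbers are internally consistent with a single conversion constant.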

Matched local sidecar ablation on donor family

This is not the official leaderboard metric, but it is the cleanest directional ablation for the sidecar idea itself:

| Variant | final int6 roundtrip val_bpb |
|---------|------------------------------|
| Base donor | 1.09458804 |
| Sidecar64  | 1.07249022 |

Legalization Work

The first sidecar-on-donor version was too close to the 16MB cap. The final legal version trimmed:

  • SPARSE_HIDDEN_DIM: 64 -> 48
  • BIGRAM_DIM: 128 -> 96
  • MAX_WALLCLOCK_SECONDS: 600 -> 596

This preserved the architecture while making all three cloud runs legal.
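In code, the trims amount to three constant changes. The constant names are taken from the PR description; the surrounding config file and its other contents are assumed:

```python
# Final cloud-legal settings (previous values in comments).
SPARSE_HIDDEN_DIM = 48         # was 64
BIGRAM_DIM = 96                # was 128
MAX_WALLCLOCK_SECONDS = 596    # was 600
```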

Attribution

This submission is a derivative work and should be credited that way:

  • direct donor: @sjp611 AdamW-TTT 11-layer stack
  • donor lineage: PR #398
  • new contribution in this PR: shared sparse sidecar architecture, late-site conditioning, and cloud-legal deployment of that architecture

@ymrohit ymrohit force-pushed the ymrohit-sidecar-record branch 2 times, most recently from 6eba720 to 32e65e8 on March 23, 2026 17:26
@ymrohit ymrohit force-pushed the ymrohit-sidecar-record branch from 32e65e8 to f231fd5 on March 23, 2026 17:29
teddyoweh pushed a commit to teddyoweh/parameter-golf that referenced this pull request Mar 23, 2026
Enhanced test-time training on ymrohit's shared sparse sidecar architecture (PR openai#555).
Key changes: cosine LR schedule (0.0005→0.00002), 1-epoch warmup, WD=0.01, 20 epochs.

H200 results (USE_COMPILE=0, ~2400 steps):
- Post-TTT sliding window BPB: 1.1014
- TTT improvement: 0.0342 BPB over flat-LR 10-epoch baseline

Expected to be significantly better on H100 with torch.compile (~5900 steps).
teddyoweh pushed a commit to teddyoweh/parameter-golf that referenced this pull request Mar 24, 2026
3-seed validation on exact competition hardware (8xH100 80GB SXM):
- Seed 13:   1.0703 BPB (5627 steps)
- Seed 1111: 1.0687 BPB (5613 steps)
- Seed 1337: 1.0704 BPB (5609 steps)
- Mean:      1.0698 BPB

Beats PR openai#555 (1.0916) by 0.0218 BPB (2.0%).
Beats merged openai#1 (1.1233) by 0.0535 BPB (4.8%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
teddyoweh pushed a commit to teddyoweh/parameter-golf that referenced this pull request Mar 24, 2026
Replaces H200 preliminary results with definitive 8xH100 80GB results
using torch.compile. 3-seed validation (13, 1111, 1337):

- Mean sliding window BPB (s=64): 1.0698 (vs PR openai#555's 1.0916)
- Improvement: 0.0218 BPB (2.0%)
- Std dev: 0.00093 (extremely tight)
- All seeds under 16MB
- 5609-5627 training steps at ~106ms/step

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@valerio-oai (Contributor) commented:

As far as I can tell, this proposed TTT scheme trains on the validation set: it reports the score on a document after the weights have adapted to that same document, which renders it unsound for the purposes of this competition.
