Add 11L Shared Sparse Sidecar + EMA + AdamW TTT (1.0916 mean) #555
Closed
ymrohit wants to merge 1 commit into openai:main from
Conversation
Force-pushed from 6eba720 to 32e65e8
Force-pushed from 32e65e8 to f231fd5
teddyoweh pushed a commit to teddyoweh/parameter-golf that referenced this pull request on Mar 23, 2026:

Enhanced test-time training on ymrohit's shared sparse sidecar architecture (PR openai#555). Key changes: cosine LR schedule (0.0005 -> 0.00002), 1-epoch warmup, WD=0.01, 20 epochs.

H200 results (USE_COMPILE=0, ~2400 steps):
- Post-TTT sliding window BPB: 1.1014
- TTT improvement: 0.0342 BPB over the flat-LR 10-epoch baseline

Expected to be significantly better on H100 with torch.compile (~5900 steps).
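The schedule described in that commit (cosine decay from 0.0005 to 0.00002 with a one-epoch warmup) can be sketched as a plain step-indexed function. Only the endpoint values and epoch counts come from the commit message; the helper name `ttt_lr` and the exact warmup shape are assumptions.

```python
import math

def ttt_lr(step, total_steps, warmup_steps, lr_max=5e-4, lr_min=2e-5):
    """Cosine LR decay from lr_max to lr_min with a linear warmup.

    Hypothetical sketch: only the endpoints (0.0005 -> 0.00002) and the
    1-epoch warmup are taken from the commit message.
    """
    if step < warmup_steps:
        # linear ramp over the first epoch's worth of steps
        return lr_max * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

At 20 epochs over the ~2400-step H200 run, one epoch is roughly 120 steps, so `warmup_steps=120, total_steps=2400` would reproduce the stated setup.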
teddyoweh pushed a commit to teddyoweh/parameter-golf that referenced this pull request on Mar 24, 2026:

3-seed validation on exact competition hardware (8x H100 80GB SXM):
- Seed 13: 1.0703 BPB (5627 steps)
- Seed 1111: 1.0687 BPB (5613 steps)
- Seed 1337: 1.0704 BPB (5609 steps)
- Mean: 1.0698 BPB

Beats PR openai#555 (1.0916) by 0.0218 BPB (2.0%). Beats merged openai#1 (1.1233) by 0.0535 BPB (4.8%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
teddyoweh pushed a commit to teddyoweh/parameter-golf that referenced this pull request on Mar 24, 2026:

Replaces H200 preliminary results with definitive 8x H100 80GB results using torch.compile. 3-seed validation (13, 1111, 1337):
- Mean sliding window BPB (s=64): 1.0698 (vs PR openai#555's 1.0916)
- Improvement: 0.0218 BPB (2.0%)
- Std dev: 0.00093 (extremely tight)
- All seeds under 16MB
- 5609-5627 training steps at ~106ms/step

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

As far as I can tell, this proposed TTT scheme trains on the validation set: it reports the score on a document after the weights have already adapted to that document, rendering it unsound for the purposes of this competition.
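The ordering concern raised above can be made concrete with a toy sketch (all names and the toy adapt/score helpers are hypothetical, not the PR's code): the unsound protocol adapts to a document and then scores it, while a sound per-document protocol scores each document before its own adaptation step.

```python
def adapt(model, doc):
    """Toy 'training': the model remembers every byte it has seen."""
    return model | set(doc)

def score(model, doc):
    """Toy 'loss': fraction of bytes the model has never seen."""
    return sum(b not in model for b in doc) / len(doc)

def unsound_eval(model, docs):
    # Leaks: each doc is scored AFTER the weights adapted to it.
    scores = []
    for doc in docs:
        model = adapt(model, doc)         # test-time training step
        scores.append(score(model, doc))  # evaluated post-adaptation
    return scores

def sound_eval(model, docs):
    # Each doc is scored before its own adaptation step; earlier docs
    # may still inform the weights, but never the doc being scored.
    scores = []
    for doc in docs:
        scores.append(score(model, doc))  # evaluated pre-adaptation
        model = adapt(model, doc)
    return scores
```

On `docs = ["aaaa", "bbbb"]` with an empty starting model, the unsound protocol reports zero loss on every document while the sound one reports the true surprise, which is the flavor of gap the comment is pointing at.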
Summary

This PR submits a derivative of @sjp611's 1.1027 AdamW-TTT 11-layer stack with a new shared sparse sidecar added on top of the late layers.

Main result: 1.0916 mean BPB, under the 16,000,000-byte cap.

What Is New

The main contribution is a late shared sparse sidecar:

gate -> value -> depthwise conv -> proj

This adds a late local-refinement path without paying for another full transformer block at every site.
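The gate -> value -> depthwise conv -> proj path can be sketched as a small PyTorch module. Only the operator order comes from the PR; the dimensions, sigmoid gate, causal kernel, and residual placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseSidecar(nn.Module):
    """Hypothetical sketch of a shared sparse sidecar:
    gate -> value -> depthwise conv -> proj, with a residual connection.
    Widths and kernel size are illustrative, not the PR's values."""

    def __init__(self, d_model=256, hidden=48, kernel=3):
        super().__init__()
        self.gate = nn.Linear(d_model, hidden)
        self.value = nn.Linear(d_model, hidden)
        # depthwise (groups=channels) conv over the sequence dimension,
        # left-padded so the output can be trimmed to be causal
        self.dw = nn.Conv1d(hidden, hidden, kernel, groups=hidden,
                            padding=kernel - 1)
        self.proj = nn.Linear(hidden, d_model)

    def forward(self, x):                                  # x: (B, T, D)
        h = torch.sigmoid(self.gate(x)) * self.value(x)    # gated values
        h = self.dw(h.transpose(1, 2))[..., :x.size(1)]    # causal trim
        return x + self.proj(h.transpose(1, 2))            # refinement
```

Because the sidecar is shared, a single instance would be invoked from each of the late layers, so its parameter cost is paid once rather than per site.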
Evidence

Official cloud result: 1.0916 mean BPB.

Matched local sidecar ablation on donor family: this is not the official leaderboard metric, but it is the cleanest directional ablation for the sidecar idea itself.
Legalization Work

The first sidecar-on-donor version was too close to the 16MB cap. The final legal version trimmed:

- SPARSE_HIDDEN_DIM: 64 -> 48
- BIGRAM_DIM: 128 -> 96
- MAX_WALLCLOCK_SECONDS: 600 -> 596

This preserved the architecture while making all three cloud runs legal.
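The trimmed constants might appear in the submission's config as plain module-level values; the constant names are as written in the PR, but the surrounding file layout is an assumption.

```python
# Final legal configuration after trimming to fit the 16,000,000-byte cap
# (old values kept as comments for reference).
SPARSE_HIDDEN_DIM = 48        # was 64: sidecar bottleneck width
BIGRAM_DIM = 96               # was 128: bigram embedding width
MAX_WALLCLOCK_SECONDS = 596   # was 600: headroom under the time limit
```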
Attribution

This submission is a derivative work and should be credited that way:

- @sjp611's AdamW-TTT 11-layer stack (PR #398)