
Add 11L Shared Sparse Sidecar + EMA + AdamW TTT (1.0916 mean) #555

Closed
ymrohit wants to merge 1 commit into openai:main from ymrohit:ymrohit-sidecar-record

Conversation


@ymrohit ymrohit commented Mar 23, 2026

Summary

This PR submits a derivative of @sjp611's 1.1027 AdamW-TTT 11-layer stack with a new shared sparse sidecar added on top of the late layers.

Main result:

  • 3-seed mean val_bpb: 1.09161722
  • best seed: 1.09009466
  • all three seeds are legal under the decimal 16,000,000-byte cap

What Is New

The main contribution is a late shared sparse sidecar:

  • injected only in late layers
  • shared across multiple insertion sites
  • conditioned per site with learned site embeddings
  • modulated per site with learned residual scales
  • implemented as gate -> value -> depthwise conv -> proj

This adds a late local-refinement path without paying for another full transformer block at every site.
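The shared-sidecar idea above can be sketched in PyTorch. This is a minimal illustrative sketch, not the PR's actual code: the module name, dimensions, and forward details are assumptions; only the overall shape (one module shared across sites, per-site embeddings and residual scales, a gate -> value -> depthwise conv -> proj path) follows the description.

```python
import torch
import torch.nn as nn

class SharedSparseSidecar(nn.Module):
    """Hypothetical sketch of a sidecar shared across several late insertion sites."""

    def __init__(self, d_model: int, hidden: int = 48, n_sites: int = 3, kernel: int = 3):
        super().__init__()
        # Per-site conditioning: learned embedding added to the input at each site.
        self.site_emb = nn.Embedding(n_sites, d_model)
        # Per-site residual scale: how strongly each site's output is mixed back in.
        self.site_scale = nn.Parameter(torch.full((n_sites,), 0.1))
        self.gate = nn.Linear(d_model, hidden)
        self.value = nn.Linear(d_model, hidden)
        # Depthwise (groups=channels) causal conv for cheap local refinement.
        self.dwconv = nn.Conv1d(hidden, hidden, kernel, padding=kernel - 1, groups=hidden)
        self.proj = nn.Linear(hidden, d_model)

    def forward(self, x: torch.Tensor, site: int) -> torch.Tensor:
        # x: (batch, seq, d_model); site: index of the insertion site.
        h = x + self.site_emb.weight[site]            # condition on the site
        u = torch.sigmoid(self.gate(h)) * self.value(h)  # gate -> value
        u = u.transpose(1, 2)                          # (batch, hidden, seq)
        u = self.dwconv(u)[..., : x.size(1)]           # trim padding -> causal
        u = self.proj(u.transpose(1, 2))               # back to d_model
        return x + self.site_scale[site] * u           # scaled residual
```

Because one `SharedSparseSidecar` instance is reused at every late layer, the parameter cost is paid once; only the tiny embedding and scale tables grow with the number of sites.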

Evidence

Official cloud result

| Seed | Steps | val_loss | val_bpb | Total bytes |
|------|-------|----------|---------|-------------|
| 13   | 5965  | 1.84057430 | 1.09009466 | 15,973,374 |
| 1337 | 5966  | 1.84364647 | 1.09191418 | 15,986,252 |
| 1111 | 5961  | 1.84521443 | 1.09284281 | 15,868,806 |
| Mean |       | 1.84314507 | 1.09161722 | |
| Std  |       | 0.00236    | 0.00140    | |
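As a sanity check on the table, the val_loss to val_bpb ratio is identical across all three seeds, consistent with a fixed nats-to-bits-per-byte conversion. The bytes-per-token factor below is inferred from the numbers themselves, not stated in the PR:

```python
import math

# (val_loss, val_bpb) per seed, copied from the table above.
rows = [
    (1.84057430, 1.09009466),  # seed 13
    (1.84364647, 1.09191418),  # seed 1337
    (1.84521443, 1.09284281),  # seed 1111
]

# If val_bpb = val_loss * log2(e) / bytes_per_token, the implied
# bytes-per-token factor should be constant across seeds.
factors = [loss * math.log2(math.e) / bpb for loss, bpb in rows]
print(factors)  # each factor is ~2.436 bytes per token
```

The factors agree to five decimal places, so the per-seed bpb numbers are internally consistent with a single conversion constant.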

Matched local sidecar ablation on donor family

This is not the official leaderboard metric, but it is the cleanest directional ablation for the sidecar idea itself:

| Variant | final int6 roundtrip val_bpb |
|---------|------------------------------|
| Base donor | 1.09458804 |
| Sidecar64  | 1.07249022 |

Legalization Work

The first sidecar-on-donor version was too close to the 16MB cap. The final legal version trimmed:

  • SPARSE_HIDDEN_DIM: 64 -> 48
  • BIGRAM_DIM: 128 -> 96
  • MAX_WALLCLOCK_SECONDS: 600 -> 596

This preserved the architecture while making all three cloud runs legal.
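In code, the trims amount to three constant changes. The constant names are taken from the PR description; the surrounding config file and its other contents are assumed:

```python
# Final cloud-legal settings (previous values in comments).
SPARSE_HIDDEN_DIM = 48         # was 64
BIGRAM_DIM = 96                # was 128
MAX_WALLCLOCK_SECONDS = 596    # was 600
```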

Attribution

This submission is a derivative work and should be credited that way:

  • direct donor: @sjp611 AdamW-TTT 11-layer stack
  • donor lineage: PR #398
  • new contribution in this PR: shared sparse sidecar architecture, late-site conditioning, and cloud-legal deployment of that architecture

@ymrohit ymrohit force-pushed the ymrohit-sidecar-record branch 2 times, most recently from 6eba720 to 32e65e8 on March 23, 2026 17:26
@ymrohit ymrohit force-pushed the ymrohit-sidecar-record branch from 32e65e8 to f231fd5 on March 23, 2026 17:29
teddyoweh pushed a commit to teddyoweh/parameter-golf that referenced this pull request Mar 23, 2026
Enhanced test-time training on ymrohit's shared sparse sidecar architecture (PR openai#555).
Key changes: cosine LR schedule (0.0005→0.00002), 1-epoch warmup, WD=0.01, 20 epochs.

H200 results (USE_COMPILE=0, ~2400 steps):
- Post-TTT sliding window BPB: 1.1014
- TTT improvement: 0.0342 BPB over flat-LR 10-epoch baseline

Expected to be significantly better on H100 with torch.compile (~5900 steps).
teddyoweh pushed a commit to teddyoweh/parameter-golf that referenced this pull request Mar 24, 2026
3-seed validation on exact competition hardware (8xH100 80GB SXM):
- Seed 13:   1.0703 BPB (5627 steps)
- Seed 1111: 1.0687 BPB (5613 steps)
- Seed 1337: 1.0704 BPB (5609 steps)
- Mean:      1.0698 BPB

Beats PR openai#555 (1.0916) by 0.0218 BPB (2.0%).
Beats merged openai#1 (1.1233) by 0.0535 BPB (4.8%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
teddyoweh pushed a commit to teddyoweh/parameter-golf that referenced this pull request Mar 24, 2026
Replaces H200 preliminary results with definitive 8xH100 80GB results
using torch.compile. 3-seed validation (13, 1111, 1337):

- Mean sliding window BPB (s=64): 1.0698 (vs PR openai#555's 1.0916)
- Improvement: 0.0218 BPB (2.0%)
- Std dev: 0.00093 (extremely tight)
- All seeds under 16MB
- 5609-5627 training steps at ~106ms/step

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@valerio-oai (Contributor) commented:

As far as I can tell, this proposed TTT scheme trains on the validation set: it reports the score on a document after the weights have adapted to that same document, which renders it unsound for the purposes of this competition.
