[Non-record] XSA + EMA + TTT: Negative interaction study (val_bpb=1.1436) #303

Open
sseanliu wants to merge 1 commit into openai:main from sseanliu:submission/xsa-ema-ttt

@sseanliu

Summary

Non-record submission testing TTT on top of the XSA+EMA base (PR #287). Key finding: adding TTT raises val_bpb by 0.016 (1.1280 to 1.1436), so it hurts rather than helps.

Results

| Configuration | val_bpb |
|---|---|
| XSA + EMA (PR #287, no TTT) | 1.1280 |
| XSA + EMA + TTT (this PR) | 1.1436 (+0.016, worse) |
| SmearGate + TTT (PR #254) | 1.1313 |

TTT makes the XSA+EMA model worse, confirming the mechanism redundancy pattern from #290 and #296.

Why TTT hurts XSA models

XSA and TTT both target local context modeling. XSA removes self-information from attention outputs; TTT adapts weights to local validation patterns. Stacking them double-counts the same signal, while TTT's SGD updates disrupt the smooth EMA weight landscape.
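To make the interaction concrete, here is a minimal toy sketch of the two mechanisms acting on the same weights: an EMA average built during training, then a score-then-adapt SGD pass at evaluation time. This is not the code from this PR. The 1-D least-squares model and the names `ema_update`, `ttt_eval`, `decay`, and `lr` are all hypothetical, and the toy does not reproduce the regression reported above. It only shows where TTT's updates move weights away from the EMA solution.

```python
# Toy illustration only: a 1-D least-squares "model" y = w * x.
# Every name here is hypothetical and not taken from the PR's training script.

def ema_update(ema_w, w, decay=0.99):
    """One EMA step: blend the running average toward the current weight."""
    return decay * ema_w + (1.0 - decay) * w

def loss_and_grad(w, chunk):
    """Mean squared error of y = w * x on a chunk, and its gradient w.r.t. w."""
    n = len(chunk)
    loss = sum((w * x - y) ** 2 for x, y in chunk) / n
    grad = sum(2.0 * (w * x - y) * x for x, y in chunk) / n
    return loss, grad

def ttt_eval(w, chunks, lr=0.0):
    """Score each chunk with the current weight, then take one SGD step on it.

    With lr=0.0 this is plain evaluation with frozen weights.
    """
    total = 0.0
    for chunk in chunks:
        loss, grad = loss_and_grad(w, chunk)  # score before adapting
        total += loss
        w = w - lr * grad                     # adapt on the chunk just scored
    return total / len(chunks), w

# Training-time trajectory of a noisy weight; the EMA smooths it.
ema_w = 0.0
for w_t in [1.8, 2.2, 1.9, 2.1, 2.0]:
    ema_w = ema_update(ema_w, w_t, decay=0.5)

# Two made-up "validation chunks" of (x, y) pairs.
chunks = [[(1.0, 2.0), (2.0, 4.0)], [(1.0, 2.1), (3.0, 5.9)]]
frozen_loss, frozen_w = ttt_eval(ema_w, chunks, lr=0.0)
adapted_loss, adapted_w = ttt_eval(ema_w, chunks, lr=0.05)
print(f"frozen loss={frozen_loss:.4f}  adapted loss={adapted_loss:.4f}")
print(f"weight: EMA {ema_w:.4f} -> after TTT {adapted_w:.4f}")
```

Setting `lr=0.0` gives the no-TTT baseline from the same function, so the two configurations differ only in whether the evaluation pass is allowed to move the weights.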

Reproducibility (2 seeds)

| Seed | val_bpb |
|---|---|
| 1337 | 1.1436 |
| 42 | 1.1441 |
| Mean | 1.1439 |
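As a quick sanity check on the reported numbers, the mean over seeds and the gap to the no-TTT baseline can be recomputed from the values quoted in this description. The variable names are illustrative only.

```python
# Values copied from the tables in this PR description.
baseline_bpb = 1.1280                      # XSA + EMA (PR #287, no TTT)
seed_bpb = {1337: 1.1436, 42: 1.1441}      # XSA + EMA + TTT, per seed

mean_bpb = sum(seed_bpb.values()) / len(seed_bpb)
delta_headline = seed_bpb[1337] - baseline_bpb   # gap for the headline seed
delta_mean = mean_bpb - baseline_bpb             # gap for the two-seed mean
spread = max(seed_bpb.values()) - min(seed_bpb.values())

print(f"mean val_bpb       = {mean_bpb:.5f}")
print(f"delta (seed 1337)  = {delta_headline:+.4f}")
print(f"delta (seed mean)  = {delta_mean:+.4f}")
print(f"seed-to-seed range = {spread:.4f}")
```

The seed-to-seed range is roughly thirty times smaller than the gap to the baseline, so with these two seeds the regression does not look like seed noise.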

Artifact: 15.3MB. Training: 6,001 steps @ 100ms/step (about 600s total). TTT: 67s. Used FA2 (not FA3).

See README for full analysis.

mohosy pushed a commit to mohosy/parameter-golf that referenced this pull request Mar 21, 2026
…, clean up script

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mohosy

mohosy commented Mar 21, 2026

this is really good data, was literally about to stack ttt on ema+xsa and you saved me the compute lol. do you think there's any eval-time trick that does work with ema, or is it just fundamentally incompatible with adaptation?
