
Record: Gated XSA + LQER top-1 + strict token-only n-gram TTT (val_bpb: 1.047) #2018

Open
simon-marcus wants to merge 3 commits into openai:main from simon-marcus:submission/gatedxsa-lqertop1-intimer

Conversation


@simon-marcus commented Apr 30, 2026

Record: Gated XSA + LQER top-1 + strict token-only n-gram TTT (val_bpb: 1.047)

val_bpb: 1.04722074 (3-seed mean, std 0.00104816) | max artifact: 15,996,490 bytes | 8xH100 SXM | strict in-timer TTT eval

Improvement vs merged PR #1855 SOTA (1.06107587 BPB): -0.01385513 BPB / -0.00960 nats per byte, clearing the README's 0.005-nats record threshold by about 1.92x.
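For reference, the threshold arithmetic works out as follows (a quick check in Python using the numbers above; not part of the submission artifacts):

```python
import math

sota, this_pr = 1.06107587, 1.04722074   # merged PR #1855 SOTA vs. this PR
delta_bpb = this_pr - sota               # -0.01385513 bits per byte
delta_nats = delta_bpb * math.log(2)     # ~ -0.00960 nats per byte
print(abs(delta_nats) / 0.005)           # ~1.92x the 0.005-nat record bar
```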

| Metric | Seed 42 | Seed 1337 | Seed 2026 | 3-seed |
| --- | --- | --- | --- | --- |
| Stop step | 4,914 | 4,926 | 4,916 | 4,918.7 (mean) |
| Train time | 596.127 s | 596.167 s | 596.080 s | 596.125 s (mean) |
| Pre-quant BPB | 1.04930686 | 1.05124428 | 1.05029930 | 1.05028348 (mean) |
| Quantized BPB | 1.05773513 | 1.05990331 | 1.05886641 | 1.05883495 (mean) |
| Post-TTT BPB | 1.04616727 | 1.04826351 | 1.04723144 | 1.04722074 (mean) |
| Eval time | 471.457 s | 465.480 s | 463.281 s | 466.739 s (mean) |
| Artifact bytes | 15,995,574 | 15,992,746 | 15,996,490 | 15,996,490 (max) |

All eval times reported above include the n-gram hint precompute inside the measured TTT eval timer (NGRAM_HINT_PRECOMPUTE_OUTSIDE=0).

Summary

This submission builds on the PR #1967 / CaseOps lineage, applying a training-time attention change plus a conservative eval-time n-gram path:

  1. Gated XSA. Each attention layer gets a learned per-head scalar xsa_alpha; the existing XSA subtraction coefficient is multiplied by tanh(xsa_alpha). The gate is zero-initialized, so the model starts as a strict superset of the base stack (a sketch follows this list).
  2. LQER top-1. LQER_TOP_K=1 keeps only the best LQER correction tensor, which saves artifact bytes versus the top-3 setting and was a favorable knob in the PR #1948 lineage (Record: Leaky ReLU Slope + GPTQ Reverse-Cholesky Speedup + PR #1938, val_bpb 1.06242).
  3. Strict token-only n-gram tilt. In response to the current-token class-routing concern, this update adopts the conservative workaround from PR #1514 (Record: SP8192 + Muon 0.97 + Legal Score-First TTT, val_bpb 1.07983 3-seed mean): disable the within-word and word-level experts and retain only the token-16 expert. The token hint is emitted from token_context_hash(st) over prefix state before the current token is pushed into the online state.
  4. In-timer hint precompute. The n-gram hint pass is included in the final eval timer (NGRAM_HINT_PRECOMPUTE_OUTSIDE=0). A token-only native fast path keeps the full eval under the 10-minute cap.
  5. Cheaper phased TTT. The final eval uses one score-first global TTT phase over a 1,000-document prefix, then scores the remaining stream with the adapted global model plus per-document LoRA TTT.
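A minimal sketch of the gated-XSA arithmetic in item 1, assuming an attention layer that already carries a scalar XSA subtraction coefficient. The class name and base_xsa_coef are illustrative only; the real wiring lives in train_gpt.py:

```python
import torch
import torch.nn as nn

class GatedXSACoef(nn.Module):
    """Per-head tanh gate on the XSA subtraction coefficient (sketch only)."""

    def __init__(self, num_heads: int, base_xsa_coef: float):
        super().__init__()
        self.base_xsa_coef = base_xsa_coef
        # Zero-initialized gate: tanh(0) = 0, so the gated subtraction starts
        # switched off and only ramps up as xsa_alpha is learned.
        self.xsa_alpha = nn.Parameter(torch.zeros(num_heads))

    def forward(self) -> torch.Tensor:
        # Learned per-head coefficient, bounded in (-|base|, |base|).
        return self.base_xsa_coef * torch.tanh(self.xsa_alpha)
```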

What did not work: Skylight/NorMuon was tested but is disabled in this submission (SKYLIGHT_MUON=0) because it destabilized this stack.

Compliance notes

  • Artifact size: max artifact is 15,996,490 bytes, under the decimal 16,000,000-byte cap.
  • Training budget: all three seeds stop on the 600-second wallclock cap at about 596.1 s.
  • Eval budget: all three token-only final TTT evals are under 600 s. The n-gram hint precompute is included in that timer.
  • Score-first TTT: the phased TTT path scores validation tokens before using them for global or LoRA updates. The global phase only trains on already-scored prefix documents.
  • Token-only n-gram tilt: the tilt applies a closed-form renormalized one-token boost, p'(a) = exp(beta * 1[a=h]) p(a) / Z, where Z = 1 + p(h)(exp(beta)-1). Hints are generated left-to-right from prefix token state (a worked check follows this list).
  • No within-word or word-level experts: the final logs show token_gate=628130 within_gate=0 word_gate=0 agree2plus=0 for every seed.
  • Gate population diagnostic: token_only_fast_evals/token_only_gate_population.json reproduces the production hint pass and reports the same token_gate=628130, with within_gate=0 and word_gate=0.
  • Dataset/tokenizer: uses the CaseOps SP8192 lossless-caps tokenizer and byte-sidecar BPB accounting from the CaseOps lineage. The 80 training shards match the merged CaseOps leader's prepare_caseops_data.py default val_docs=10000 output byte-for-byte. Evaluation uses the full CaseOps validation shard/sidecar reported by the leaderboard lineage (val_tokens: 47851520). See DATASET_AUDIT.md.
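The closed-form tilt in the token-only bullet renormalizes exactly. A self-contained check (pure illustration; the production path is the native fast path in online_ngram_tilt.py):

```python
import math

def token_tilt(p: dict, h, beta: float) -> dict:
    """One-token boost p'(a) = exp(beta * [a == h]) * p(a) / Z, closed form."""
    Z = 1.0 + p[h] * (math.exp(beta) - 1.0)   # only p(h) moves the normalizer
    return {a: (math.exp(beta) if a == h else 1.0) * pa / Z
            for a, pa in p.items()}

p = {"a": 0.7, "b": 0.2, "c": 0.1}
q = token_tilt(p, "b", beta=2.625)            # TOKEN_BOOST used in this run
assert abs(sum(q.values()) - 1.0) < 1e-12     # Z renormalizes in O(1)
```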

Key settings

| Setting | Value |
| --- | --- |
| Base stack | PR #1967 V21 + LeakyReLU 0.3 + n-gram tilt lineage |
| Model | 11 layers, 512 dim, 8 heads / 4 KV heads |
| Tokenizer | SP8192 lossless-caps CaseOps v1 reserved |
| Eval sequence length | 2560 |
| TTT mask | no_qv |
| TTT LoRA rank | 80 |
| TTT local LR mult | 0.75 |
| QK gain init | 5.25 |
| Matrix LR | 0.026 |
| Min LR | 0.1 |
| LQER | rank 4, asymmetric, top-1 |
| N-gram precompute | inside timer (NGRAM_HINT_PRECOMPUTE_OUTSIDE=0) |
| N-gram expert | token-16 only |
| Within/word experts | disabled (WITHIN_BOOST=0, WORD_BOOST=0) |
| Phased TTT | 1 phase, 1,000 prefix docs |
| Gated XSA | enabled |
| Skylight Muon | disabled |

Reproducing

Install Python dependencies from requirements.txt, install FlashAttention 3 as described there, and install the lrzip system package before launching the run. The script itself does not install packages or make network calls during training/evaluation.

```
SEED=42 \
NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 \
DATA_PATH=./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
CASEOPS_ENABLED=1 VOCAB_SIZE=8192 ITERATIONS=20000 MAX_WALLCLOCK_SECONDS=600 \
TTT_ENABLED=1 PHASED_TTT_ENABLED=1 PHASED_TTT_NUM_PHASES=1 PHASED_TTT_PREFIX_DOCS=1000 \
TTT_LORA_RANK=80 TTT_MASK=no_qv TTT_Q_LORA=0 TTT_V_LORA=0 \
TTT_LOCAL_LR_MULT=0.75 EVAL_SEQ_LEN=2560 TTT_EVAL_SEQ_LEN=2560 \
QK_GAIN_INIT=5.25 \
MATRIX_LR=0.026 MIN_LR=0.1 EMBED_BITS=7 GRAD_CLIP_NORM=0.3 \
MATRIX_CLIP_SIGMAS=12.85 ATTN_CLIP_SIGMAS=13.0 MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 \
FUSED_CE_ENABLED=1 SMEAR_GATE_ENABLED=1 GATE_WINDOW=12 \
SPARSE_ATTN_GATE_ENABLED=1 LQER_ENABLED=1 LQER_RANK=4 LQER_TOP_K=1 \
LQER_GROUP_SIZE=64 LQER_ASYM_ENABLED=1 LQER_ASYM_GROUP=64 \
AWQ_LITE_ENABLED=1 ASYM_LOGIT_RESCALE=1 NGRAM_TILT_ENABLED=1 \
TOKEN_ORDER=16 TOKEN_THRESHOLD=0.800 TOKEN_BOOST=2.625 \
WITHIN_TAU=999 WITHIN_BOOST=0 WORD_TAU=999 WORD_BOOST=0 AGREE_ADD_BOOST=0 \
GATED_XSA=1 SKYLIGHT_MUON=0 \
GPTQ_RESERVE_SECONDS=4.0 GPTQ_CALIBRATION_BATCHES=16 \
COMPRESSOR=pergroup \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Files

  • train_gpt.py - complete training/eval script.
  • online_ngram_tilt.py, online_ngram_state.c - token-only n-gram hint/tilt helpers from the PR #1967 lineage (Record: V21 + N-gram Tilt + LeakyReLU 0.3, val_bpb 1.05851 3-seed mean), with the conservative fast path.
  • prepare_caseops_data.py, lossless_caps.py - CaseOps dataset preparation helpers.
  • DATASET_AUDIT.md, dataset_verification/ - dataset construction audit and verification logs.
  • tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model - tokenizer model.
  • train_seed42.log, train_seed1337.log, train_seed2026.log - original full per-seed training logs for the saved artifacts.
  • token_only_fast_evals/ - eval-only replay logs from the saved artifacts using the conservative token-only n-gram path.
  • submission.json - structured metadata for the token-only 3-seed result.

Credits

This work is a small stack on top of a long public lineage.

@andrewbaggio1

so cool

@romeerp (Contributor) commented Apr 30, 2026

wow, did you get the ~.01 prequant bpb improvement from just adding gated xsa?

@andrewbaggio1

> wow, did you get the ~.01 prequant bpb improvement from just adding gated xsa?

no it's from precomputing n gram tilt inside the timer

@aquariouseworkman (Contributor)

> wow, did you get the ~.01 prequant bpb improvement from just adding gated xsa?
>
> no it's from precomputing n gram tilt inside the timer

[image]

@leon2k2k2k

See #1967 (comment) — same C1 (causality) issue
applies here; this PR ships byte-identical online_ngram_state.c + online_ngram_tilt.py.

simon-marcus changed the title from "Record: Gated XSA + LQER top-1 + strict in-timer n-gram TTT (val_bpb: 1.046)" to "Record: Gated XSA + LQER top-1 + strict token-only n-gram TTT (val_bpb: 1.047)" on May 1, 2026
@simon-marcus (Author)

> See #1967 (comment) — same C1 (causality) issue applies here; this PR ships byte-identical online_ngram_state.c + online_ngram_tilt.py.

Thanks @leon2k2k2k, this was a useful catch. I agree with the core concern: the within-word / word-level expert gates in the PR #1967 lineage inspect properties of the realized tokens[i], so I do not want to rely on those paths for #2018.

I updated #2018 to adopt the conservative workaround you pointed to from the legal, merged precedent in #1514: disable the within-word and word-level experts and retain only the token-16 expert. In the updated code, the token-only fast path emits the hint from token_context_hash(st) over prefix token state, then updates the online table with the current token afterward via token_push.

The updated 3-seed result is a slight regression from my original submission (1.046 mean is now 1.047):

| Seed | val_bpb | eval_time |
| --- | --- | --- |
| 42 | 1.04616727 | 471.457 s |
| 1337 | 1.04826351 | 465.480 s |
| 2026 | 1.04723144 | 463.281 s |
| Mean | 1.04722074 | 466.739 s |

The logs now show, for every seed:

ngram_tilt:hints total=47851520 gated=628130 token_gate=628130 within_gate=0 word_gate=0 agree2plus=0

I updated the PR title/body, README, submission.json, defaults, and logs accordingly.
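To make the ordering concrete for readers following the C1 thread, here is a toy model of hint-before-push, with a plain dict standing in for the online table. token_context_hash / token_push are the real helpers in online_ngram_state.c; this is not their implementation, and order 1 here stands in for the production TOKEN_ORDER=16:

```python
def hint_stream(tokens, order=1):
    # Hint from prefix context only, then push the current token.
    table, ctx, hints = {}, (), []
    for tok in tokens:
        hints.append(table.get(ctx))   # hint cannot see the current token
        table[ctx] = tok               # "push": table updated after hinting
        ctx = (ctx + (tok,))[-order:]
    return hints

# The repeating tail is predicted purely from already-scored context.
assert hint_stream([1, 2, 1, 2, 1])[-2:] == [2, 1]
```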

@leon2k2k2k

@simon-marcus thanks! Just curious, your amazing pre-quant, is that just training NN with no n-gram?

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
…ixer

Self-contained reference for byte-level NN scoring without the C1/C2 leak
in PR openai#2039 / openai#1967 / openai#2018 / openai#2041. Shows ~-0.097 BPB legitimate gain on
spec 250 seed_0 (1M val tokens), independent of include_space leak.

Files: README, proper_ppm_mixer_rigorous.py (canonical), byte_bpb_proper.py
(NN-only baseline), show_big_gains.py (inspection), test_byte0_3way.py
(5-config leak validation).
@sharpobject

the first 40k documents of your training data are the same as the last 40k documents of your validation data

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since
2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors:
openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
  - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
  - LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118
    (current claimed frontier 1.04350), plus siblings.
  - INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
  - Every shipped prepare_caseops_data.py is byte-identical:
    SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
  - NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
  - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1
    HF dataset whose manifest pins docs_val=50000, docs_train=8181945,
    sums match → CLEAN by construction
  - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
  - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
  - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py
    default invocation
  - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset
  - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN.
The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is
inflated by val memorization; spec 301 was designed to measure how much
remains under clean data.

Files:
  caseops-memory-leakage/README.md       — overview, methodology, takeaways
  caseops-memory-leakage/verdicts.md     — 34-row master table with evidence
  caseops-memory-leakage/family-tree.md  — ASCII trees with [C]/[L] annotations