Record: Gated XSA + LQER top-1 + strict token-only n-gram TTT (val_bpb: 1.047) #2018
simon-marcus wants to merge 3 commits into openai:main
Conversation
so cool

wow, did you get the ~.01 prequant bpb improvement from just adding gated xsa?

no, it's from precomputing the n-gram tilt inside the timer
See #1967 (comment) — same C1 (causality) issue |
Thanks @leon2k2k2k, this was a useful catch. I agree with the core concern: the within-word / word-level expert gates in the PR #1967 lineage inspect properties of the realized token at the current position, which is exactly the C1 causality issue. I updated #2018 to adopt the conservative workaround you pointed to from the legal, merged precedent in #1514: disable the within-word and word-level experts and retain only the token-16 expert. In the updated code, the token-only fast path emits the hint from token_context_hash(st) over the prefix state, before the current token is pushed into the online state. The updated 3-seed result is a slight regression from my original submission (the 1.046 mean is now 1.047).

The logs now show, for every seed: token_gate=628130 within_gate=0 word_gate=0 agree2plus=0. I updated the PR title/body, README, and the per-seed logs and metadata accordingly.
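A minimal sketch of the causal ordering described above (emit the hint from the prefix state, then push the current token into the online state); the names causal_hints and hint_table are illustrative, not the submission's actual API:

```python
def causal_hints(tokens, hint_table, ctx_len=3):
    # Emit each position's hint from the prefix state BEFORE the current
    # token is pushed, so a hint never depends on the token it will score.
    hints, state = [], ()
    for tok in tokens:
        hints.append(hint_table.get(state))   # prefix-only lookup
        state = (state + (tok,))[-ctx_len:]   # then update the online state
    return hints
```

The first position is hinted from the empty prefix; a gate that inspects tok before the append is precisely the C1 violation being removed here.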
@simon-marcus thanks! Just curious: your amazing pre-quant number, is that just from training the NN with no n-gram?
…ixer: Self-contained reference for byte-level NN scoring without the C1/C2 leak in PR openai#2039 / openai#1967 / openai#2018 / openai#2041. Shows a ~-0.097 BPB legitimate gain on spec 250 seed_0 (1M val tokens), independent of the include_space leak.

Files: README, proper_ppm_mixer_rigorous.py (canonical), byte_bpb_proper.py (NN-only baseline), show_big_gains.py (inspection), test_byte0_3way.py (5-config leak validation).
the first 40k documents of your training data are the same as the last 40k documents of your validation data |
Audits every CaseOps-lineage record-track PR (merged + unmerged) since 2026-04-18 for whether val docs are also in the training set. Working set: 34 PRs (31 from the chronological seed list + 3 discovered ancestors: openai#1908, openai#1923, openai#2007). Boundary nodes: openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
- CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
- LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118 (current claimed frontier, 1.04350), plus siblings
- INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
- Every shipped prepare_caseops_data.py is byte-identical: SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
- NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
- cached_challenge_fineweb.py downloads from the romeerp/parameter-golf-caseops-v1 HF dataset, whose manifest pins docs_val=50000 and docs_train=8181945; the sums match → CLEAN by construction
- PR openai#2018's DATASET_AUDIT.md is a gold-standard explicit leak description
- PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
- Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py default invocation
- Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to the HF dataset
- Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN. The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is inflated by val memorization; spec 301 was designed to measure how much remains under clean data.
Files:
- caseops-memory-leakage/README.md — overview, methodology, takeaways
- caseops-memory-leakage/verdicts.md — 34-row master table with evidence
- caseops-memory-leakage/family-tree.md — ASCII trees with [C]/[L] annotations
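The audit's core check (whether any validation documents also appear in the training set) can be approximated with a hash-set intersection. A minimal sketch over in-memory document lists; the function names are illustrative, not the audit's actual code:

```python
import hashlib

def doc_hash(doc: str) -> str:
    # Content hash of a whitespace-normalized document
    return hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()

def leak_report(train_docs, val_docs):
    # Count validation documents whose exact content also appears in training
    train_h = {doc_hash(d) for d in train_docs}
    val_h = {doc_hash(d) for d in val_docs}
    shared = train_h & val_h
    return {"val_docs": len(val_h),
            "leaked": len(shared),
            "leak_frac": len(shared) / max(1, len(val_h))}
```

Under this check, a CLEAN build should report leak_frac == 0.0, while a build that trained on part of the validation shard reports a large nonzero fraction.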

Record: Gated XSA + LQER top-1 + strict token-only n-gram TTT (val_bpb: 1.047)
val_bpb: 1.04722074 (3-seed mean, std 0.00104816) | max artifact: 15,996,490 bytes | 8xH100 SXM | strict in-timer TTT eval
Improvement vs merged PR #1855 SOTA (1.06107587 BPB): -0.01385513 BPB / -0.00960 nats per byte, clearing the README's 0.005-nats record threshold by about 1.92x.
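The BPB-to-nats conversion in the delta above is just a factor of ln 2; a quick sanity check of the reported numbers:

```python
import math

delta_bpb = 1.06107587 - 1.04722074    # merged SOTA minus this submission
delta_nats = delta_bpb * math.log(2)   # bits/byte -> nats/byte

print(round(delta_bpb, 8))             # BPB improvement
print(round(delta_nats, 5))            # nats-per-byte improvement
print(round(delta_nats / 0.005, 2))    # multiples of the 0.005-nats threshold
```

This reproduces the -0.01385513 BPB / -0.00960 nats figures and the roughly 1.92x margin over the record threshold.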
All reported eval time above includes the n-gram hint precompute inside the measured TTT eval timer (NGRAM_HINT_PRECOMPUTE_OUTSIDE=0).

Summary
This submission picks up on the PR #1967 / CaseOps lineage and then applies a training-time attention change plus a conservative eval-time n-gram path:
- Gated XSA: a learned scalar xsa_alpha; the existing XSA subtraction coefficient is multiplied by tanh(xsa_alpha). The gate is zero-initialized, so the model starts as a strict superset of the base stack.
- LQER_TOP_K=1 keeps only the best LQER correction tensor. This saves artifact bytes versus the top-3 setting and was a favorable knob in the #1948 lineage (Record: Leaky ReLU Slope + GPTQ Reverse-Cholesky Speedup + PR #1938, val_bpb = 1.06242).
- Strict token-only n-gram TTT: hints are emitted from token_context_hash(st) over the prefix state, before the current token is pushed into the online state.
- The n-gram hint precompute runs inside the measured TTT eval timer (NGRAM_HINT_PRECOMPUTE_OUTSIDE=0). A token-only native fast path keeps the full eval under the 10-minute cap.

What did not work: Skylight/NorMuon was tested but is disabled in this submission (SKYLIGHT_MUON=0) because it destabilized this stack.
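The zero-initialized XSA gate described above can be sketched with plain scalars; this is a hypothetical shape (names and signature are illustrative, not the submission's train_gpt.py code):

```python
import math

class GatedXSA:
    # Hypothetical sketch: the existing XSA subtraction coefficient is
    # multiplied by tanh(xsa_alpha). xsa_alpha is zero-initialized, so
    # tanh(0) = 0 and the subtraction is gated fully off at init.
    def __init__(self, base_coeff: float):
        self.base_coeff = base_coeff   # existing XSA subtraction coefficient
        self.xsa_alpha = 0.0           # learned scalar gate, zero-init

    def apply(self, attn_out: float, xsa_term: float) -> float:
        gate = math.tanh(self.xsa_alpha)
        return attn_out - self.base_coeff * gate * xsa_term
```

Training can then move xsa_alpha to blend the XSA term back in, with tanh bounding the effective multiplier in (-1, 1).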
Compliance notes

- Hint tilt: p'(a) = exp(beta * 1[a=h]) p(a) / Z, where Z = 1 + p(h)(exp(beta) - 1). Hints are generated left-to-right from prefix token state.
- The production logs report token_gate=628130 within_gate=0 word_gate=0 agree2plus=0 for every seed.
- token_only_fast_evals/token_only_gate_population.json reproduces the production hint pass and reports the same token_gate=628130, with within_gate=0 and word_gate=0.
- Training data matches the prepare_caseops_data.py default val_docs=10000 output byte-for-byte. Evaluation uses the full CaseOps validation shard/sidecar reported by the leaderboard lineage (val_tokens: 47851520). See DATASET_AUDIT.md.
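Because Z has the closed form above, the tilt needs no renormalizing sum over the vocabulary: Z is computed from p(h) alone. A minimal sketch over a plain probability list (not the submission's actual tensors):

```python
import math

def hint_tilt(p, h, beta):
    # p'(a) = exp(beta * 1[a == h]) * p(a) / Z,
    # with Z = 1 + p(h) * (exp(beta) - 1), which keeps p' a distribution.
    Z = 1.0 + p[h] * (math.exp(beta) - 1.0)
    return [(math.exp(beta) if a == h else 1.0) * pa / Z
            for a, pa in enumerate(p)]
```

beta = 0 is the identity, and beta > 0 boosts only the hinted token h while uniformly downweighting the rest.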
Key settings

- no_qv (long-context/no-QV TTT from the lineage)
- n-gram hint precompute inside the measured timer (NGRAM_HINT_PRECOMPUTE_OUTSIDE=0)
- within-word and word-level experts disabled (WITHIN_BOOST=0, WORD_BOOST=0)

Reproducing
Install Python dependencies from requirements.txt, install FlashAttention 3 as described there, and install the lrzip system package before launching the run. The script itself does not install packages or make network calls during training/evaluation.

Files
- train_gpt.py - complete training/eval script.
- online_ngram_tilt.py, online_ngram_state.c - token-only n-gram hint/tilt helper from the #1967 lineage (Record: V21 + N-gram Tilt + LeakyReLU 0.3 — val_bpb 1.05851, 3-seed mean) with the conservative fast path.
- prepare_caseops_data.py, lossless_caps.py - CaseOps dataset preparation helpers.
- DATASET_AUDIT.md, dataset_verification/ - dataset construction audit and verification logs.
- tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model - tokenizer model.
- train_seed42.log, train_seed1337.log, train_seed2026.log - original full per-seed training logs for the saved artifacts.
- token_only_fast_evals/ - eval-only replay logs from the saved artifacts using the conservative token-only n-gram path.
- submission.json - structured metadata for the token-only 3-seed result.

Credits
This work is a small stack on top of a long public lineage:
- @ndokutovich for the V21 + LeakyReLU 0.3 + closed-form n-gram tilt stack.
- @andrewbaggio1 for the long-context/no-QV TTT and QK-gain settings.
- @alertcat for the V21/AWQ-lite/asymmetric-logit-rescale base.
- @TimS-ml and @lijuncheng16 for the LQER-top-k sweep and LeakyReLU work.
- @codemath3000 for the conservative token-only n-gram workaround precedent.
- @AnirudhRahul for the online n-gram augmentation lineage.
- @romeerp, @dexhunter, @aquariouseworkman, @codemath3000, and others for the SP8192 lossless-caps tokenizer, byte-sidecar BPB accounting, and score-first phased TTT.