
Record: Gated XSA + LQER top-1 + strict token-only n-gram TTT (val_bpb: 1.047) #2018

Open
simon-marcus wants to merge 3 commits into openai:main from simon-marcus:submission/gatedxsa-lqertop1-intimer

Conversation


@simon-marcus commented Apr 30, 2026

Record: Gated XSA + LQER top-1 + strict token-only n-gram TTT (val_bpb: 1.047)

val_bpb: 1.04722074 (3-seed mean, std 0.00104816) | max artifact: 15,996,490 bytes | 8xH100 SXM | strict in-timer TTT eval

Improvement vs merged PR #1855 SOTA (1.06107587 BPB): -0.01385513 BPB / -0.00960 nats per byte, clearing the README's 0.005-nats record threshold by about 1.92x.
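For reference, the threshold arithmetic works out as follows (a quick check in Python using the numbers above; not part of the submission artifacts):

```python
import math

sota, this_pr = 1.06107587, 1.04722074   # merged PR #1855 SOTA vs. this PR
delta_bpb = this_pr - sota               # -0.01385513 bits per byte
delta_nats = delta_bpb * math.log(2)     # ~ -0.00960 nats per byte
print(abs(delta_nats) / 0.005)           # ~1.92x the 0.005-nat record bar
```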

| Metric | Seed 42 | Seed 1337 | Seed 2026 | 3-seed |
| --- | --- | --- | --- | --- |
| Stop step | 4,914 | 4,926 | 4,916 | 4,918.7 (mean) |
| Train time | 596.127 s | 596.167 s | 596.080 s | 596.125 s (mean) |
| Pre-quant BPB | 1.04930686 | 1.05124428 | 1.05029930 | 1.05028348 (mean) |
| Quantized BPB | 1.05773513 | 1.05990331 | 1.05886641 | 1.05883495 (mean) |
| Post-TTT BPB | 1.04616727 | 1.04826351 | 1.04723144 | 1.04722074 (mean) |
| Eval time | 471.457 s | 465.480 s | 463.281 s | 466.739 s (mean) |
| Artifact bytes | 15,995,574 | 15,992,746 | 15,996,490 | 15,996,490 (max) |

All eval times reported above include the n-gram hint precompute inside the measured TTT eval timer (NGRAM_HINT_PRECOMPUTE_OUTSIDE=0).

Summary

This submission builds on the PR #1967 / CaseOps lineage, applying a training-time attention change plus a conservative eval-time n-gram path:

  1. Gated XSA. Each attention layer gets a learned per-head scalar xsa_alpha; the existing XSA subtraction coefficient is multiplied by tanh(xsa_alpha). The gate is zero-initialized, so the model starts as a strict superset of the base stack (a sketch follows this list).
  2. LQER top-1. LQER_TOP_K=1 keeps only the best LQER correction tensor, which saves artifact bytes versus the top-3 setting and was a favorable knob in the PR #1948 lineage (Record: Leaky ReLU Slope + GPTQ Reverse-Cholesky Speedup + PR #1938, val_bpb 1.06242).
  3. Strict token-only n-gram tilt. In response to the current-token class-routing concern, this update adopts the conservative workaround from PR #1514 (Record: SP8192 + Muon 0.97 + Legal Score-First TTT, val_bpb 1.07983 3-seed mean): disable the within-word and word-level experts and retain only the token-16 expert. The token hint is emitted from token_context_hash(st) over prefix state before the current token is pushed into the online state.
  4. In-timer hint precompute. The n-gram hint pass is included in the final eval timer (NGRAM_HINT_PRECOMPUTE_OUTSIDE=0). A token-only native fast path keeps the full eval under the 10-minute cap.
  5. Cheaper phased TTT. The final eval uses one score-first global TTT phase over a 1,000-document prefix, then scores the remaining stream with the adapted global model plus per-document LoRA TTT.
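A minimal sketch of the gated-XSA arithmetic in item 1, assuming an attention layer that already carries a scalar XSA subtraction coefficient. The class name and base_xsa_coef are illustrative only; the real wiring lives in train_gpt.py:

```python
import torch
import torch.nn as nn

class GatedXSACoef(nn.Module):
    """Per-head tanh gate on the XSA subtraction coefficient (sketch only)."""

    def __init__(self, num_heads: int, base_xsa_coef: float):
        super().__init__()
        self.base_xsa_coef = base_xsa_coef
        # Zero-initialized gate: tanh(0) = 0, so the gated subtraction starts
        # switched off and only ramps up as xsa_alpha is learned.
        self.xsa_alpha = nn.Parameter(torch.zeros(num_heads))

    def forward(self) -> torch.Tensor:
        # Learned per-head coefficient, bounded in (-|base|, |base|).
        return self.base_xsa_coef * torch.tanh(self.xsa_alpha)
```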

What did not work: Skylight/NorMuon was tested but is disabled in this submission (SKYLIGHT_MUON=0) because it destabilized this stack.

Compliance notes

  • Artifact size: max artifact is 15,996,490 bytes, under the decimal 16,000,000-byte cap.
  • Training budget: all three seeds stop on the 600-second wallclock cap at about 596.1 s.
  • Eval budget: all three token-only final TTT evals are under 600 s. The n-gram hint precompute is included in that timer.
  • Score-first TTT: the phased TTT path scores validation tokens before using them for global or LoRA updates. The global phase only trains on already-scored prefix documents.
  • Token-only n-gram tilt: the tilt applies a closed-form renormalized one-token boost, p'(a) = exp(beta * 1[a=h]) p(a) / Z, where Z = 1 + p(h)(exp(beta)-1). Hints are generated left-to-right from prefix token state (a worked check follows this list).
  • No within-word or word-level experts: the final logs show token_gate=628130 within_gate=0 word_gate=0 agree2plus=0 for every seed.
  • Gate population diagnostic: token_only_fast_evals/token_only_gate_population.json reproduces the production hint pass and reports the same token_gate=628130, with within_gate=0 and word_gate=0.
  • Dataset/tokenizer: uses the CaseOps SP8192 lossless-caps tokenizer and byte-sidecar BPB accounting from the CaseOps lineage. The 80 training shards match the merged CaseOps leader's prepare_caseops_data.py default val_docs=10000 output byte-for-byte. Evaluation uses the full CaseOps validation shard/sidecar reported by the leaderboard lineage (val_tokens: 47851520). See DATASET_AUDIT.md.
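The closed-form tilt in the token-only bullet renormalizes exactly. A self-contained check (pure illustration; the production path is the native fast path in online_ngram_tilt.py):

```python
import math

def token_tilt(p: dict, h, beta: float) -> dict:
    """One-token boost p'(a) = exp(beta * [a == h]) * p(a) / Z, closed form."""
    Z = 1.0 + p[h] * (math.exp(beta) - 1.0)   # only p(h) moves the normalizer
    return {a: (math.exp(beta) if a == h else 1.0) * pa / Z
            for a, pa in p.items()}

p = {"a": 0.7, "b": 0.2, "c": 0.1}
q = token_tilt(p, "b", beta=2.625)            # TOKEN_BOOST used in this run
assert abs(sum(q.values()) - 1.0) < 1e-12     # Z renormalizes in O(1)
```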

Key settings

| Setting | Value |
| --- | --- |
| Base stack | PR #1967 V21 + LeakyReLU 0.3 + n-gram tilt lineage |
| Model | 11 layers, 512 dim, 8 heads / 4 KV heads |
| Tokenizer | SP8192 lossless-caps CaseOps v1 reserved |
| Eval sequence length | 2560 |
| TTT mask | no_qv |
| TTT LoRA rank | 80 |
| TTT local LR mult | 0.75 |
| QK gain init | 5.25 |
| Matrix LR | 0.026 |
| Min LR | 0.1 |
| LQER | rank 4, asymmetric, top-1 |
| N-gram precompute | inside timer (NGRAM_HINT_PRECOMPUTE_OUTSIDE=0) |
| N-gram expert | token-16 only |
| Within/word experts | disabled (WITHIN_BOOST=0, WORD_BOOST=0) |
| Phased TTT | 1 phase, 1,000 prefix docs |
| Gated XSA | enabled |
| Skylight Muon | disabled |

Reproducing

Install Python dependencies from requirements.txt, install FlashAttention 3 as described there, and install the lrzip system package before launching the run. The script itself does not install packages or make network calls during training/evaluation.

```
SEED=42 \
NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 \
DATA_PATH=./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
CASEOPS_ENABLED=1 VOCAB_SIZE=8192 ITERATIONS=20000 MAX_WALLCLOCK_SECONDS=600 \
TTT_ENABLED=1 PHASED_TTT_ENABLED=1 PHASED_TTT_NUM_PHASES=1 PHASED_TTT_PREFIX_DOCS=1000 \
TTT_LORA_RANK=80 TTT_MASK=no_qv TTT_Q_LORA=0 TTT_V_LORA=0 \
TTT_LOCAL_LR_MULT=0.75 EVAL_SEQ_LEN=2560 TTT_EVAL_SEQ_LEN=2560 \
QK_GAIN_INIT=5.25 \
MATRIX_LR=0.026 MIN_LR=0.1 EMBED_BITS=7 GRAD_CLIP_NORM=0.3 \
MATRIX_CLIP_SIGMAS=12.85 ATTN_CLIP_SIGMAS=13.0 MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 \
FUSED_CE_ENABLED=1 SMEAR_GATE_ENABLED=1 GATE_WINDOW=12 \
SPARSE_ATTN_GATE_ENABLED=1 LQER_ENABLED=1 LQER_RANK=4 LQER_TOP_K=1 \
LQER_GROUP_SIZE=64 LQER_ASYM_ENABLED=1 LQER_ASYM_GROUP=64 \
AWQ_LITE_ENABLED=1 ASYM_LOGIT_RESCALE=1 NGRAM_TILT_ENABLED=1 \
TOKEN_ORDER=16 TOKEN_THRESHOLD=0.800 TOKEN_BOOST=2.625 \
WITHIN_TAU=999 WITHIN_BOOST=0 WORD_TAU=999 WORD_BOOST=0 AGREE_ADD_BOOST=0 \
GATED_XSA=1 SKYLIGHT_MUON=0 \
GPTQ_RESERVE_SECONDS=4.0 GPTQ_CALIBRATION_BATCHES=16 \
COMPRESSOR=pergroup \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Files

  • train_gpt.py - complete training/eval script.
  • online_ngram_tilt.py, online_ngram_state.c - token-only n-gram hint/tilt helpers from the PR #1967 lineage (Record: V21 + N-gram Tilt + LeakyReLU 0.3, val_bpb 1.05851 3-seed mean), with the conservative fast path.
  • prepare_caseops_data.py, lossless_caps.py - CaseOps dataset preparation helpers.
  • DATASET_AUDIT.md, dataset_verification/ - dataset construction audit and verification logs.
  • tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model - tokenizer model.
  • train_seed42.log, train_seed1337.log, train_seed2026.log - original full per-seed training logs for the saved artifacts.
  • token_only_fast_evals/ - eval-only replay logs from the saved artifacts using the conservative token-only n-gram path.
  • submission.json - structured metadata for the token-only 3-seed result.

Credits

This work is a small stack on top of a long public lineage.

@andrewbaggio1

so cool

@romeerp (Contributor) commented Apr 30, 2026

wow, did you get the ~.01 prequant bpb improvement from just adding gated xsa?

@andrewbaggio1

> wow, did you get the ~.01 prequant bpb improvement from just adding gated xsa?

no it's from precomputing n gram tilt inside the timer

@aquariouseworkman (Contributor)

> wow, did you get the ~.01 prequant bpb improvement from just adding gated xsa?
>
> no it's from precomputing n gram tilt inside the timer

[image]

@leon2k2k2k

See #1967 (comment) — same C1 (causality) issue
applies here; this PR ships byte-identical online_ngram_state.c + online_ngram_tilt.py.

simon-marcus changed the title from "Record: Gated XSA + LQER top-1 + strict in-timer n-gram TTT (val_bpb: 1.046)" to "Record: Gated XSA + LQER top-1 + strict token-only n-gram TTT (val_bpb: 1.047)" on May 1, 2026
@simon-marcus (Author)

> See #1967 (comment) — same C1 (causality) issue applies here; this PR ships byte-identical online_ngram_state.c + online_ngram_tilt.py.

Thanks @leon2k2k2k, this was a useful catch. I agree with the core concern: the within-word / word-level expert gates in the PR #1967 lineage inspect properties of the realized tokens[i], so I do not want to rely on those paths for #2018.

I updated #2018 to adopt the conservative workaround you pointed to from the legal, merged precedent in #1514: disable the within-word and word-level experts and retain only the token-16 expert. In the updated code, the token-only fast path emits the hint from token_context_hash(st) over prefix token state, then updates the online table with the current token afterward via token_push.

The updated 3-seed result is a slight regression from my original submission (1.046 mean is now 1.047):

| Seed | val_bpb | eval_time |
| --- | --- | --- |
| 42 | 1.04616727 | 471.457 s |
| 1337 | 1.04826351 | 465.480 s |
| 2026 | 1.04723144 | 463.281 s |
| Mean | 1.04722074 | 466.739 s |

The logs now show, for every seed:

ngram_tilt:hints total=47851520 gated=628130 token_gate=628130 within_gate=0 word_gate=0 agree2plus=0

I updated the PR title/body, README, submission.json, defaults, and logs accordingly.
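To make the ordering concrete for readers following the C1 thread, here is a toy model of hint-before-push, with a plain dict standing in for the online table. token_context_hash / token_push are the real helpers in online_ngram_state.c; this is not their implementation, and order 1 here stands in for the production TOKEN_ORDER=16:

```python
def hint_stream(tokens, order=1):
    # Hint from prefix context only, then push the current token.
    table, ctx, hints = {}, (), []
    for tok in tokens:
        hints.append(table.get(ctx))   # hint cannot see the current token
        table[ctx] = tok               # "push": table updated after hinting
        ctx = (ctx + (tok,))[-order:]
    return hints

# The repeating tail is predicted purely from already-scored context.
assert hint_stream([1, 2, 1, 2, 1])[-2:] == [2, 1]
```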

@leon2k2k2k

@simon-marcus thanks! Just curious, your amazing pre-quant, is that just training NN with no n-gram?

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
…ixer

Self-contained reference for byte-level NN scoring without the C1/C2 leak
in PR openai#2039 / openai#1967 / openai#2018 / openai#2041. Shows ~-0.097 BPB legitimate gain on
spec 250 seed_0 (1M val tokens), independent of include_space leak.

Files: README, proper_ppm_mixer_rigorous.py (canonical), byte_bpb_proper.py
(NN-only baseline), show_big_gains.py (inspection), test_byte0_3way.py
(5-config leak validation).
@sharpobject

the first 40k documents of your training data are the same as the last 40k documents of your validation data

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since
2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors:
openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
  - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
  - LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118
    (current claimed frontier 1.04350), plus siblings.
  - INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
  - Every shipped prepare_caseops_data.py is byte-identical:
    SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
  - NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
  - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1
    HF dataset whose manifest pins docs_val=50000, docs_train=8181945,
    sums match → CLEAN by construction
  - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
  - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
  - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py
    default invocation
  - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset
  - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN.
The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is
inflated by val memorization; spec 301 was designed to measure how much
remains under clean data.

Files:
  caseops-memory-leakage/README.md       — overview, methodology, takeaways
  caseops-memory-leakage/verdicts.md     — 34-row master table with evidence
  caseops-memory-leakage/family-tree.md  — ASCII trees with [C]/[L] annotations