
Train gpt 0427 - 1.078bpb #1867

Open

lijuncheng16 wants to merge 12 commits into openai:main from lijuncheng16:train-gpt-0427

Conversation

@lijuncheng16

| Feature | 0409 | 0427 | Notes |
| --- | --- | --- | --- |
| SMT (Sparse Matrix Tuning) | — | enabled by default (block_size=64, keep_frac=0.25, skip_embed=1) | Your own technique (per memory: SMT author / arXiv 2405.15525). Significant new lever for matrix-param updates. |
| XSA-all | XSA on a subset of layers | XSA_LAST_N=11 (= all layers) | Earlier records used XSA on a subset; 0427 makes it whole-stack. |
| GPTQ grouped quantization | row-wise | GPTQ_GROUP_SIZE=64 (grouped) | Finer-grained quant ⇒ smaller error per group (see the sketch after this table). |
| ETLB | — | infrastructure added (off by default: ETLB_ENABLED=0, lr=0.05, steps=5, clip=3) | Eval-time logit-bias hook, ready to flip on. |
| QK_GAIN_INIT | 5.25 | 5.0 | Slight retune; 0409's 5.25 was at the high end of the monotonic-improvement sweep. |
| TTT | enabled by default in the record | disabled (TTT_ENABLED=0) | TTT moved to opt-in; matches the simplification pattern (off by default, override per run). |
| Architecture / depth recurrence / parallel residuals / SP8192 / SDClip / MuonEq-R / WD / MLR / EMA / warmdown | same | same | The architectural & HP backbone is preserved. |
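To make the GPTQ row concrete, here is a minimal sketch of what per-group scales buy over row-wise ones. This is plain round-to-nearest, not the full GPTQ error-compensation loop, and the 4-bit width and symmetric rounding are assumptions, not values from this PR:

```python
import torch

def quant_dequant_grouped(w: torch.Tensor, group_size: int = 64, bits: int = 4):
    # One scale per `group_size` contiguous weights (the grouped path)
    # instead of one per row: an outlier only inflates the rounding
    # error of its own 64-weight group. Assumes numel divides evenly.
    qmax = 2 ** (bits - 1) - 1
    g = w.reshape(-1, group_size)
    scale = g.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    q = (g / scale).round().clamp(-qmax - 1, qmax)
    return (q * scale).reshape(w.shape)
```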

lijuncheng16 and others added 6 commits April 26, 2026 23:05
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lines 306 and 603 nested double-quoted strings inside a double-quoted
f-string, which the parser rejects before PEP 701 (Python 3.12).
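The affected lines aren't reproduced here; a representative instance of the failure and the fix:

```python
metrics = {"val_bpb": 1.078}
# Pre-3.12 parsers reject reusing the f-string's own quote character:
#   print(f"val_bpb={metrics["val_bpb"]}")   # SyntaxError before PEP 701
# Switching the inner quotes parses on all supported versions:
print(f"val_bpb={metrics['val_bpb']}")
```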

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wrap the FA3 import in try/except. The fallback transposes between FA's
(B,T,H,D) layout and SDPA's (B,H,T,D) and expands K/V for GQA so older
torch versions without native GQA still work. Slower than FA3 — only for
unblocking dev when FA3 isn't built.
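A sketch of the shape-juggling this describes. The FA3 entry point and its return signature vary between releases, so treat the import and call below as assumptions rather than the script's exact code:

```python
import torch
import torch.nn.functional as F

try:
    from flash_attn_interface import flash_attn_func  # FA3 (Hopper build)
    HAVE_FA3 = True
except ImportError:
    HAVE_FA3 = False

def attn(q, k, v, causal=True):
    # q, k, v arrive in flash-attn's (B, T, H, D) layout; K/V may carry
    # fewer heads than Q under GQA.
    if HAVE_FA3:
        return flash_attn_func(q, k, v, causal=causal)
    # SDPA wants (B, H, T, D); older torch also lacks native GQA, so
    # repeat the K/V heads up to the Q head count before calling it.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    if k.size(1) != q.size(1):
        rep = q.size(1) // k.size(1)
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    return out.transpose(1, 2)  # back to (B, T, H, D)
```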

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spins up 3 tmux sessions, each running train_gpt_0427.py on its own
GPU with a different seed and a unique RUN_ID. Defaults: GPUs 0,1,2,
seeds 1337-1339, MAX_WALLCLOCK_SECONDS=4800 (8x the 600s 8xH100 budget,
to roughly step-match on 1xH100). Includes pre-flight checks for venv,
dataset shards, tokenizer; uses python -u + PYTHONUNBUFFERED=1 so log
output flushes through tee in real time.

Configurable via env: VENV, REPO, SCRIPT, SEEDS_OVERRIDE, GPUS_OVERRIDE,
MAX_WALLCLOCK_SECONDS, VOCAB_SIZE, EXTRA_ENV.
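The launcher itself is a shell script; this Python rendering of the same pattern (one detached tmux session per GPU/seed pair) is only illustrative, and the session and log naming here is invented:

```python
import os
import subprocess

gpus = os.environ.get("GPUS_OVERRIDE", "0,1,2").split(",")
seeds = os.environ.get("SEEDS_OVERRIDE", "1337,1338,1339").split(",")
script = os.environ.get("SCRIPT", "train_gpt_0427.py")

os.makedirs("logs", exist_ok=True)
for gpu, seed in zip(gpus, seeds):
    run_id = f"0427_g{gpu}_s{seed}"
    inner = (
        f"CUDA_VISIBLE_DEVICES={gpu} SEED={seed} RUN_ID={run_id} "
        f"MAX_WALLCLOCK_SECONDS=4800 PYTHONUNBUFFERED=1 "
        f"python -u {script} 2>&1 | tee logs/{run_id}.log"
    )
    # one detached tmux session per GPU so runs survive SSH disconnects
    subprocess.run(["tmux", "new-session", "-d", "-s", f"gpt_{run_id}", inner],
                   check=True)
```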

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stack S9 variant alongside 0427: bank-mode weight storage (qo_bank,
kv_bank, mlp_up_bank, mlp_down_bank), Polar-Express Newton-Schulz
coefficients for Muon, fused Triton softcapped CE, Phased LoRA TTT,
global SGD post-quant repair. Has its own 3-tier flash-attn fallback
(FA3 -> FA2 -> SDPA) so no hand-patch is needed.
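Of the pieces listed, the Newton-Schulz step is the easiest to show. The quintic below uses the coefficients from the public Muon reference implementation; the Polar-Express variant replaces the single triple with a tuned per-iteration schedule, which is not reproduced here:

```python
import torch

@torch.no_grad()
def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that approximately orthogonalizes G,
    # i.e. drives its singular values toward 1, as Muon requires.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    if G.size(0) > G.size(1):
        X = X.T
    X = X / (X.norm() + 1e-7)  # spectral norm <= Frobenius norm <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X
```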

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sibling of run_3seeds.sh, defaults to train_gpt_s9.py and uses session
prefix "s9_" + run-id prefix "s9" so it can run alongside the 0427
launcher without colliding (different tmux session names, different log
filenames). Same configurable env vars (GPUS_OVERRIDE, SEEDS_OVERRIDE,
MAX_WALLCLOCK_SECONDS, VOCAB_SIZE, EXTRA_ENV, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lijuncheng16
Author

Initial submission; details to come.

Reproduces the 2026-04-09 record (SP8192 + 3-Layer Recurrence + Parallel
Residuals + QK-Gain 5.25 + Legal TTT, val_bpb=1.0810 3-seed mean). Points
at the LZMA-compressed code wrapper inside the record folder, defaults to
seeds 42/314/999 (matching the record), and sets the record's documented
env overrides (QK_GAIN_INIT=5.25, TTT_ENABLED=1, TTT_LR=0.005, TTT_EPOCHS=3).
Session prefix r0409_ so it can run alongside the 0427 and S9 launchers.
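The documented record overrides, collected in one place (values straight from the text above; how the launcher actually exports them may differ):

```python
import os

os.environ.update({
    "QK_GAIN_INIT": "5.25",
    "TTT_ENABLED": "1",
    "TTT_LR": "0.005",
    "TTT_EPOCHS": "3",
})
```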

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lijuncheng16 changed the title from "Train gpt 0427 - 1.708bpb" to "Train gpt 0427 - 1.078bpb" on Apr 27, 2026
lijuncheng16 and others added 3 commits April 28, 2026 11:30
Three scripts for preparing the lossless-caps caseops dataset:
- lossless_caps.py — case encoding/decoding logic
- prepare_caseops_data.py — dataset preparation pipeline
- retokenize_corpus.py — re-tokenization helper

Used by the train_gpt_s9_caseops_lqer.py training variant.
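The commit doesn't show the encoding itself; a guess at the general shape of a lossless case scheme, with the marker character and all details invented:

```python
# Hypothetical sketch: strip casing from the corpus but emit an inline
# marker so the original text is exactly recoverable after decoding.
CAP = "\x14"  # invented "next char is uppercase" marker

def encode_caps(text: str) -> str:
    return "".join(CAP + ch.lower() if ch.isupper() else ch for ch in text)

def decode_caps(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        if text[i] == CAP:
            out.append(text[i + 1].upper())
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

assert decode_caps(encode_caps("Hello World")) == "Hello World"
```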

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
S9 stack extended with caseops dataset support and LQER (Low-rank
Quantization Error Rescue). 4487 lines vs train_gpt_s9.py's 4363.
This is the script used in PR openai#1851 stage 1/2 ablations (cells A0–F4
in stage 1, Z0/P*/Q*/R* in stage 2).
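LQER as usually formulated reconstructs the quantization error with a low-rank factor pair; a minimal sketch under that assumption (the rank, the `quantize` callable, and the function name are all illustrative):

```python
import torch

def lqer_factors(W: torch.Tensor, quantize, rank: int = 32):
    # Approximate W ~= quantize(W) + U @ V, where U, V come from a
    # truncated SVD of the quantization error.
    Wq = quantize(W)                      # any quantizer, e.g. a GPTQ output
    E = (W - Wq).float()                  # quantization error to rescue
    U, S, Vh = torch.linalg.svd(E, full_matrices=False)
    U = U[:, :rank] * S[:rank]            # fold singular values into U
    V = Vh[:rank]
    return Wq, U, V                       # forward pass uses Wq + U @ V
```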

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5252-line training script reproducing PR openai#1851's stack with extensive
inline annotations (Chinese comments). Mandatory FA3 import (no SDPA fallback)
and direct Triton kernel use. Sibling to the train_gpt_s9*.py variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lijuncheng16 force-pushed the train-gpt-0427 branch 2 times, most recently from 8574bf5 to a2cb6a3, on April 28, 2026 22:26
lijuncheng16 and others added 2 commits April 29, 2026 10:37
…t w/ GPTQ v2

3143-line condensed version of train_gpt_s0_pr1851_mod.py (no inline
annotations, GPTQ v2 path). Same mandatory FA3 + Triton dependency as
the annotated sibling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same 3143-line code as v2; only Hyperparameters defaults changed to
match the PR openai#1851 stack tuning observed in stage-1/2 ablations:
  SEED=42  MIN_LR=0.1  TTT_BATCH_SIZE=16  PHASED_TTT_NUM_PHASES=3
  GPTQ_RESERVE_SECONDS=16  EMBED_BITS=7  EMBED_CLIP_SIGMAS=15
  MLP_CLIP_SIGMAS=12  SMEAR_GATE_ENABLED=1  GATED_ATTN_QUANT_GATE=1
  SPARSE_ATTN_GATE_ENABLED=1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lijuncheng16 added a commit to lijuncheng16/parameter-golf that referenced this pull request Apr 30, 2026
Source file behind PR openai#1867 (lijuncheng16). The Sparse Matrix Tuning
implementation referenced in the parameter-golf notes blog (TTT section,
T4): _smt_select_masks() runs once on chunk 0 to pick top-K (default
keep_frac=0.25) 64x64 gradient blocks per matrix; the resulting binary
masks are then frozen for the rest of TTT and used to zero gradients
outside the kept blocks during each TTT step. The chunk-0 gradient
signal turned out to be too unstable to base a frozen mask on, so SMT
underperformed full LoRA TTT end-to-end.
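A sketch of the block selection as described, under the simplest reading: score each 64x64 block of a weight's chunk-0 gradient by magnitude and keep the top keep_frac fraction per matrix. The real _smt_select_masks() in the source file may differ in detail:

```python
import torch

def smt_select_mask(grad: torch.Tensor, block: int = 64, keep_frac: float = 0.25):
    H, W = grad.shape  # assumes both dims divide evenly by `block`
    bh, bw = H // block, W // block
    # per-block L1 score of the chunk-0 gradient
    scores = grad.abs().reshape(bh, block, bw, block).sum(dim=(1, 3))
    k = max(1, int(keep_frac * bh * bw))
    keep = torch.zeros(bh * bw, dtype=torch.bool)
    keep[scores.flatten().topk(k).indices] = True
    # expand block decisions back to an elementwise mask, frozen thereafter
    mask = keep.reshape(bh, 1, bw, 1).expand(bh, block, bw, block)
    return mask.reshape(H, W)

# during each TTT step: param.grad *= mask  (zero grads outside kept blocks)
```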

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>