Train gpt 0427 - 1.078 bpb #1867
Open
lijuncheng16 wants to merge 12 commits into openai:main
Conversation
lijuncheng16 commented on Apr 27, 2026
| Feature | 0409 | 0427 | Notes |
|---|---|---|---|
| SMT (Sparse Matrix Tuning) | — | enabled by default (block_size=64, keep_frac=0.25, skip_embed=1) | The author's own technique (SMT, arXiv:2405.15525). A significant new lever for matrix-param updates. |
| XSA-all | — | XSA_LAST_N=11 (= all layers) | Earlier records used XSA on a subset; 0427 makes it whole-stack. |
| GPTQ grouped quantization | row-wise | GPTQ_GROUP_SIZE=64 (grouped) | Finer-grained quant ⇒ smaller error per group. |
| ETLB infrastructure | — | added (off by default: ETLB_ENABLED=0, lr=0.05, steps=5, clip=3) | Eval-time logit bias hook, ready to flip on. |
| QK_GAIN_INIT | 5.0 | 5.25 | Slight retune; the 0409 record's 5.25 override sat at the high end of the monotonic-improvement sweep, and 0427 bakes it in as the default. |
| TTT default | enabled in record | disabled (TTT_ENABLED=0) | TTT moved to opt-in; matches the simplification pattern (off-by-default, override per run). |
| Architecture / depth recurrence / parallel residuals / SP8192 / SDClip / MuonEq-R / WD / MLR / EMA / warmdown | same | same | The architectural & HP backbone is preserved. |
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
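The off-by-default, override-per-run pattern in the table comes down to env-driven defaults. A minimal sketch of how such knobs could be read (the `env_int`/`env_float` helpers and the `SMT_*` variable names are assumptions; `XSA_LAST_N`, `GPTQ_GROUP_SIZE`, `ETLB_ENABLED`, `QK_GAIN_INIT`, and `TTT_ENABLED` come straight from the table):

```python
import os

def env_int(name: str, default: int) -> int:
    # Hypothetical helper: baked-in default, overridable per run.
    return int(os.environ.get(name, default))

def env_float(name: str, default: float) -> float:
    return float(os.environ.get(name, default))

# Defaults matching the 0427 column above.
SMT_BLOCK_SIZE  = env_int("SMT_BLOCK_SIZE", 64)     # env name assumed
SMT_KEEP_FRAC   = env_float("SMT_KEEP_FRAC", 0.25)  # env name assumed
SMT_SKIP_EMBED  = env_int("SMT_SKIP_EMBED", 1)      # env name assumed
XSA_LAST_N      = env_int("XSA_LAST_N", 11)         # = all layers
GPTQ_GROUP_SIZE = env_int("GPTQ_GROUP_SIZE", 64)
ETLB_ENABLED    = env_int("ETLB_ENABLED", 0)        # infra only, off by default
QK_GAIN_INIT    = env_float("QK_GAIN_INIT", 5.25)
TTT_ENABLED     = env_int("TTT_ENABLED", 0)         # opt-in as of 0427
```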
Lines 306 and 603 used double-quoted strings inside an f-string, which the parser rejects before PEP 701 (Python 3.12). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
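For context, this is the class of failure (illustrative snippet, not the actual lines 306/603):

```python
# Before Python 3.12 (PEP 701), an f-string cannot reuse the enclosing
# quote character inside a replacement field:
#   msg = f"loss={metrics["loss"]:.4f}"   # SyntaxError on <= 3.11
# The fix is to switch the inner quotes:
metrics = {"loss": 1.078}
msg = f"loss={metrics['loss']:.4f}"  # parses on every supported version
print(msg)
```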
Wrap the FA3 import in try/except. The fallback transposes between FA's (B,T,H,D) layout and SDPA's (B,H,T,D) and expands K/V for GQA so older torch versions without native GQA still work. Slower than FA3 — only for unblocking dev when FA3 isn't built. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
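A sketch of the fallback's shape, assuming FA3's Hopper build exposes `flash_attn_interface.flash_attn_func` (module path and return convention vary across flash-attn versions, so treat the import as illustrative, not the patch itself):

```python
import torch
import torch.nn.functional as F

try:
    from flash_attn_interface import flash_attn_func  # FA3 Hopper build
    HAVE_FA3 = True
except ImportError:
    HAVE_FA3 = False

def attn(q, k, v):
    # FA layout: q is (B, T, Hq, D); k, v are (B, T, Hkv, D).
    if HAVE_FA3:
        out = flash_attn_func(q, k, v, causal=True)
        return out[0] if isinstance(out, tuple) else out  # some builds return (out, lse)
    # SDPA fallback: transpose to (B, H, T, D) and expand K/V heads so
    # GQA works on torch builds without native GQA support. Slower than FA3.
    rep = q.shape[2] // k.shape[2]
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    k = k.repeat_interleave(rep, dim=1)
    v = v.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2)  # back to (B, T, H, D)
```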
Spins up 3 tmux sessions, each running train_gpt_0427.py on its own GPU with a different seed and a unique RUN_ID. Defaults: GPUs 0,1,2, seeds 1337-1339, MAX_WALLCLOCK_SECONDS=4800 (8x the 600s 8xH100 budget, to roughly step-match on 1xH100). Includes pre-flight checks for venv, dataset shards, tokenizer; uses python -u + PYTHONUNBUFFERED=1 so log output flushes through tee in real time. Configurable via env: VENV, REPO, SCRIPT, SEEDS_OVERRIDE, GPUS_OVERRIDE, MAX_WALLCLOCK_SECONDS, VOCAB_SIZE, EXTRA_ENV. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
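A Python rendering of the launch loop (the actual launcher is a shell script; the `SEED`/`RUN_ID` plumbing and log naming here are illustrative):

```python
import os
import subprocess

# Defaults mirror the documented ones: GPUs 0,1,2 and seeds 1337-1339,
# overridable via GPUS_OVERRIDE / SEEDS_OVERRIDE.
GPUS = os.environ.get("GPUS_OVERRIDE", "0,1,2").split(",")
SEEDS = os.environ.get("SEEDS_OVERRIDE", "1337,1338,1339").split(",")
SCRIPT = os.environ.get("SCRIPT", "train_gpt_0427.py")

for gpu, seed in zip(GPUS, SEEDS):
    run_id = f"0427_gpu{gpu}_seed{seed}"  # illustrative RUN_ID format
    # python -u plus PYTHONUNBUFFERED=1 so output flushes through tee live.
    cmd = (
        f"CUDA_VISIBLE_DEVICES={gpu} SEED={seed} RUN_ID={run_id} "
        f"PYTHONUNBUFFERED=1 python -u {SCRIPT} 2>&1 | tee {run_id}.log"
    )
    # One detached tmux session per GPU/seed pair.
    subprocess.run(["tmux", "new-session", "-d", "-s", run_id, cmd], check=True)
```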
Stack S9 variant alongside 0427: bank-mode weight storage (qo_bank, kv_bank, mlp_up_bank, mlp_down_bank), Polar-Express Newton-Schulz coefficients for Muon, fused Triton softcapped CE, Phased LoRA TTT, global SGD post-quant repair. Has its own 3-tier flash-attn fallback (FA3 -> FA2 -> SDPA) so no hand-patch is needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
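Of those pieces, the softcapped CE is the easiest to pin down with a non-fused reference. Roughly what the Triton kernel computes, as a sketch (the cap value of 30.0 is a placeholder, not the script's setting):

```python
import torch
import torch.nn.functional as F

def softcapped_ce(logits, targets, cap=30.0):
    # Softcap: squash logits into (-cap, cap) via cap * tanh(x / cap)
    # before the usual cross-entropy. The fused Triton kernel avoids
    # materializing the capped logits; this reference does not.
    capped = cap * torch.tanh(logits / cap)
    return F.cross_entropy(capped, targets)

logits = torch.randn(8, 50304)           # (batch, vocab)
targets = torch.randint(0, 50304, (8,))
print(softcapped_ce(logits, targets))
```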
Sibling of run_3seeds.sh, defaults to train_gpt_s9.py and uses session prefix "s9_" + run-id prefix "s9" so it can run alongside the 0427 launcher without colliding (different tmux session names, different log filenames). Same configurable env vars (GPUS_OVERRIDE, SEEDS_OVERRIDE, MAX_WALLCLOCK_SECONDS, VOCAB_SIZE, EXTRA_ENV, etc.). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
initial submission. details to come
Reproduces the 2026-04-09 record (SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT, val_bpb=1.0810 3-seed mean). Points at the LZMA-compressed code wrapper inside the record folder, defaults to seeds 42/314/999 (matching the record), and sets the record's documented env overrides (QK_GAIN_INIT=5.25, TTT_ENABLED=1, TTT_LR=0.005, TTT_EPOCHS=3). Session prefix r0409_ so it can run alongside the 0427 and S9 launchers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three scripts for preparing the lossless-caps caseops dataset:
- lossless_caps.py — case encoding/decoding logic
- prepare_caseops_data.py — dataset preparation pipeline
- retokenize_corpus.py — re-tokenization helper

Used by the train_gpt_s9_caseops_lqer.py training variant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
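The case-encoding idea, as a rough sketch (the actual lossless_caps.py scheme may differ; the marker characters are hypothetical, and this toy version is only lossless for lowercase, Capitalized, and ALL-CAPS words):

```python
CAP = "\x14"    # hypothetical marker: next word is Capitalized
UPPER = "\x15"  # hypothetical marker: next word is ALL-CAPS

def encode(text: str) -> str:
    # Strip case from the text, emitting markers so it can be restored.
    out = []
    for w in text.split(" "):
        if w.isupper() and len(w) > 1:
            out.append(UPPER + w.lower())
        elif w[:1].isupper():
            out.append(CAP + w.lower())
        else:
            out.append(w)
    return " ".join(out)

def decode(text: str) -> str:
    # Reapply the recorded case ops.
    out = []
    for w in text.split(" "):
        if w.startswith(UPPER):
            out.append(w[1:].upper())
        elif w.startswith(CAP):
            out.append(w[1:].capitalize())
        else:
            out.append(w)
    return " ".join(out)

s = "NASA Launched a rocket"
assert decode(encode(s)) == s  # round-trip is lossless for this scheme
```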
S9 stack extended with caseops dataset support and LQER (Low-rank Quantization Error Rescue). 4487 lines vs train_gpt_s9.py's 4363. This is the script used in PR openai#1851 stage 1/2 ablations (cells A0–F4 in stage 1, Z0/P*/Q*/R* in stage 2). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
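LQER's core move, sketched per the general low-rank quantization-error-reconstruction recipe (the rank and the toy quantizer are placeholders, not the script's settings):

```python
import torch

def lqer_decompose(W: torch.Tensor, quantize, rank: int = 32):
    # Quantize the weight, then capture the quantization error
    # E = W - Q with a rank-r SVD so the model reconstructs
    # W ~= Q + A @ B at low-bit storage cost.
    Q = quantize(W)
    E = W - Q
    U, S, Vh = torch.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # (out, r)
    B = Vh[:rank, :]            # (r, in)
    return Q, A, B

def fake_quant(W, bits=4):
    # Toy symmetric round-to-nearest quantizer as a stand-in.
    scale = W.abs().max() / (2 ** (bits - 1) - 1)
    return (W / scale).round().clamp(-8, 7) * scale

W = torch.randn(256, 256)
Q, A, B = lqer_decompose(W, fake_quant)
print((W - (Q + A @ B)).norm() / W.norm())  # residual error after rescue
```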
5252-line training script reproducing PR openai#1851's stack with extensive inline annotations (comments in Chinese). Mandatory FA3 import (no SDPA fallback) and direct Triton kernel use. Sibling to the train_gpt_s9*.py variants. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 8574bf5 to a2cb6a3
…t w/ GPTQ v2: 3143-line condensed version of train_gpt_s0_pr1851_mod.py (no inline annotations, GPTQ v2 path). Same mandatory FA3 + Triton dependency as the annotated sibling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same 3143-line code as v2; only Hyperparameters defaults changed to match the PR openai#1851 stack tuning observed in stage-1/2 ablations: SEED=42 MIN_LR=0.1 TTT_BATCH_SIZE=16 PHASED_TTT_NUM_PHASES=3 GPTQ_RESERVE_SECONDS=16 EMBED_BITS=7 EMBED_CLIP_SIGMAS=15 MLP_CLIP_SIGMAS=12 SMEAR_GATE_ENABLED=1 GATED_ATTN_QUANT_GATE=1 SPARSE_ATTN_GATE_ENABLED=1 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lijuncheng16 added a commit to lijuncheng16/parameter-golf that referenced this pull request on Apr 30, 2026
Source file behind PR openai#1867 (lijuncheng16). The Sparse Matrix Tuning implementation referenced in the parameter-golf notes blog (TTT section, T4): _smt_select_masks() runs once on chunk 0 to pick top-K (default keep_frac=0.25) 64x64 gradient blocks per matrix; the resulting binary masks are then frozen for the rest of TTT and used to zero gradients outside the kept blocks during each TTT step. The chunk-0 gradient signal turned out to be too unstable to base a frozen mask on, so SMT underperformed full LoRA TTT end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
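Roughly what that selection step does, assuming 64-divisible matrix shapes (names and the scoring rule are illustrative; the real `_smt_select_masks()` is in the linked file):

```python
import torch

def smt_select_masks(params, keep_frac=0.25, block=64):
    # Run once on chunk 0: score each 64x64 block of every matrix grad
    # by mean |grad|, keep the top keep_frac of blocks, and freeze the
    # resulting binary masks for the rest of TTT.
    masks = {}
    for name, p in params.items():
        g = p.grad.abs()
        O, I = g.shape
        scores = g.reshape(O // block, block, I // block, block).mean(dim=(1, 3))
        k = max(1, int(keep_frac * scores.numel()))
        thresh = scores.flatten().topk(k).values.min()
        keep = (scores >= thresh).float()  # (O/block, I/block)
        # Upsample the block decisions back to an elementwise mask.
        masks[name] = keep.repeat_interleave(block, 0).repeat_interleave(block, 1)
    return masks

# During each subsequent TTT step, gradients outside the kept blocks
# are zeroed before the update:  p.grad.mul_(masks[name])
```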