
Train gpt 0427 - 1.078bpb #1867

Open

lijuncheng16 wants to merge 12 commits into openai:main from lijuncheng16:train-gpt-0427

Conversation

@lijuncheng16

| Feature | 0409 | 0427 | Notes |
| --- | --- | --- | --- |
| SMT (Sparse Matrix Tuning) | — | enabled by default (block_size=64, keep_frac=0.25, skip_embed=1) | Your own technique (per memory: SMT author / arXiv 2405.15525). Significant new lever for matrix-param updates. |
| XSA-all | XSA on a subset of layers | XSA_LAST_N=11 (= all layers) | Earlier records used XSA on a subset; 0427 makes it whole-stack. |
| GPTQ grouped quantization | row-wise | GPTQ_GROUP_SIZE=64 (grouped) | Finer-grained quant ⇒ smaller error per group (see the sketch after this table). |
| ETLB | — | infrastructure added (off by default: ETLB_ENABLED=0, lr=0.05, steps=5, clip=3) | Eval-time logit-bias hook, ready to flip on. |
| QK_GAIN_INIT | 5.25 | 5.0 | Slight retune; 0409's 5.25 was at the high end of the monotonic-improvement sweep. |
| TTT | enabled by default in the record | disabled (TTT_ENABLED=0) | TTT moved to opt-in; matches the simplification pattern (off by default, override per run). |
| Architecture / depth recurrence / parallel residuals / SP8192 / SDClip / MuonEq-R / WD / MLR / EMA / warmdown | same | same | The architectural & HP backbone is preserved. |
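To make the GPTQ row concrete, here is a minimal sketch of what per-group scales buy over row-wise ones. This is plain round-to-nearest, not the full GPTQ error-compensation loop, and the 4-bit width and symmetric rounding are assumptions, not values from this PR:

```python
import torch

def quant_dequant_grouped(w: torch.Tensor, group_size: int = 64, bits: int = 4):
    # One scale per `group_size` contiguous weights (the grouped path)
    # instead of one per row: an outlier only inflates the rounding
    # error of its own 64-weight group. Assumes numel divides evenly.
    qmax = 2 ** (bits - 1) - 1
    g = w.reshape(-1, group_size)
    scale = g.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    q = (g / scale).round().clamp(-qmax - 1, qmax)
    return (q * scale).reshape(w.shape)
```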

lijuncheng16 and others added 6 commits April 26, 2026 23:05
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lines 306 and 603 nested double-quoted strings inside a double-quoted
f-string, which the parser rejects before PEP 701 (Python 3.12).
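The affected lines aren't reproduced here; a representative instance of the failure and the fix:

```python
metrics = {"val_bpb": 1.078}
# Pre-3.12 parsers reject reusing the f-string's own quote character:
#   print(f"val_bpb={metrics["val_bpb"]}")   # SyntaxError before PEP 701
# Switching the inner quotes parses on all supported versions:
print(f"val_bpb={metrics['val_bpb']}")
```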

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wrap the FA3 import in try/except. The fallback transposes between FA's
(B,T,H,D) layout and SDPA's (B,H,T,D) and expands K/V for GQA so older
torch versions without native GQA still work. Slower than FA3 — only for
unblocking dev when FA3 isn't built.
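A sketch of the shape-juggling this describes. The FA3 entry point and its return signature vary between releases, so treat the import and call below as assumptions rather than the script's exact code:

```python
import torch
import torch.nn.functional as F

try:
    from flash_attn_interface import flash_attn_func  # FA3 (Hopper build)
    HAVE_FA3 = True
except ImportError:
    HAVE_FA3 = False

def attn(q, k, v, causal=True):
    # q, k, v arrive in flash-attn's (B, T, H, D) layout; K/V may carry
    # fewer heads than Q under GQA.
    if HAVE_FA3:
        return flash_attn_func(q, k, v, causal=causal)
    # SDPA wants (B, H, T, D); older torch also lacks native GQA, so
    # repeat the K/V heads up to the Q head count before calling it.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    if k.size(1) != q.size(1):
        rep = q.size(1) // k.size(1)
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    return out.transpose(1, 2)  # back to (B, T, H, D)
```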

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spins up 3 tmux sessions, each running train_gpt_0427.py on its own
GPU with a different seed and a unique RUN_ID. Defaults: GPUs 0,1,2,
seeds 1337-1339, MAX_WALLCLOCK_SECONDS=4800 (8x the 600s 8xH100 budget,
to roughly step-match on 1xH100). Includes pre-flight checks for venv,
dataset shards, tokenizer; uses python -u + PYTHONUNBUFFERED=1 so log
output flushes through tee in real time.

Configurable via env: VENV, REPO, SCRIPT, SEEDS_OVERRIDE, GPUS_OVERRIDE,
MAX_WALLCLOCK_SECONDS, VOCAB_SIZE, EXTRA_ENV.
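The launcher itself is a shell script; this Python rendering of the same pattern (one detached tmux session per GPU/seed pair) is only illustrative, and the session and log naming here is invented:

```python
import os
import subprocess

gpus = os.environ.get("GPUS_OVERRIDE", "0,1,2").split(",")
seeds = os.environ.get("SEEDS_OVERRIDE", "1337,1338,1339").split(",")
script = os.environ.get("SCRIPT", "train_gpt_0427.py")

os.makedirs("logs", exist_ok=True)
for gpu, seed in zip(gpus, seeds):
    run_id = f"0427_g{gpu}_s{seed}"
    inner = (
        f"CUDA_VISIBLE_DEVICES={gpu} SEED={seed} RUN_ID={run_id} "
        f"MAX_WALLCLOCK_SECONDS=4800 PYTHONUNBUFFERED=1 "
        f"python -u {script} 2>&1 | tee logs/{run_id}.log"
    )
    # one detached tmux session per GPU so runs survive SSH disconnects
    subprocess.run(["tmux", "new-session", "-d", "-s", f"gpt_{run_id}", inner],
                   check=True)
```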

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stack S9 variant alongside 0427: bank-mode weight storage (qo_bank,
kv_bank, mlp_up_bank, mlp_down_bank), Polar-Express Newton-Schulz
coefficients for Muon, fused Triton softcapped CE, Phased LoRA TTT,
global SGD post-quant repair. Has its own 3-tier flash-attn fallback
(FA3 -> FA2 -> SDPA) so no hand-patch is needed.
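Of the pieces listed, the Newton-Schulz step is the easiest to show. The quintic below uses the coefficients from the public Muon reference implementation; the Polar-Express variant replaces the single triple with a tuned per-iteration schedule, which is not reproduced here:

```python
import torch

@torch.no_grad()
def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that approximately orthogonalizes G,
    # i.e. drives its singular values toward 1, as Muon requires.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    if G.size(0) > G.size(1):
        X = X.T
    X = X / (X.norm() + 1e-7)  # spectral norm <= Frobenius norm <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X
```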

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sibling of run_3seeds.sh, defaults to train_gpt_s9.py and uses session
prefix "s9_" + run-id prefix "s9" so it can run alongside the 0427
launcher without colliding (different tmux session names, different log
filenames). Same configurable env vars (GPUS_OVERRIDE, SEEDS_OVERRIDE,
MAX_WALLCLOCK_SECONDS, VOCAB_SIZE, EXTRA_ENV, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lijuncheng16
Author

Initial submission; details to come.

Reproduces the 2026-04-09 record (SP8192 + 3-Layer Recurrence + Parallel
Residuals + QK-Gain 5.25 + Legal TTT, val_bpb=1.0810 3-seed mean). Points
at the LZMA-compressed code wrapper inside the record folder, defaults to
seeds 42/314/999 (matching the record), and sets the record's documented
env overrides (QK_GAIN_INIT=5.25, TTT_ENABLED=1, TTT_LR=0.005, TTT_EPOCHS=3).
Session prefix r0409_ so it can run alongside the 0427 and S9 launchers.
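The documented record overrides, collected in one place (values straight from the text above; how the launcher actually exports them may differ):

```python
import os

os.environ.update({
    "QK_GAIN_INIT": "5.25",
    "TTT_ENABLED": "1",
    "TTT_LR": "0.005",
    "TTT_EPOCHS": "3",
})
```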

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lijuncheng16 changed the title from "Train gpt 0427 - 1.708bpb" to "Train gpt 0427 - 1.078bpb" on Apr 27, 2026
lijuncheng16 and others added 3 commits April 28, 2026 11:30
Three scripts for preparing the lossless-caps caseops dataset:
- lossless_caps.py — case encoding/decoding logic
- prepare_caseops_data.py — dataset preparation pipeline
- retokenize_corpus.py — re-tokenization helper

Used by the train_gpt_s9_caseops_lqer.py training variant.
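The commit doesn't show the encoding itself; a guess at the general shape of a lossless case scheme, with the marker character and all details invented:

```python
# Hypothetical sketch: strip casing from the corpus but emit an inline
# marker so the original text is exactly recoverable after decoding.
CAP = "\x14"  # invented "next char is uppercase" marker

def encode_caps(text: str) -> str:
    return "".join(CAP + ch.lower() if ch.isupper() else ch for ch in text)

def decode_caps(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        if text[i] == CAP:
            out.append(text[i + 1].upper())
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

assert decode_caps(encode_caps("Hello World")) == "Hello World"
```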

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
S9 stack extended with caseops dataset support and LQER (Low-rank
Quantization Error Rescue). 4487 lines vs train_gpt_s9.py's 4363.
This is the script used in PR openai#1851 stage 1/2 ablations (cells A0–F4
in stage 1, Z0/P*/Q*/R* in stage 2).
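LQER as usually formulated reconstructs the quantization error with a low-rank factor pair; a minimal sketch under that assumption (the rank, the `quantize` callable, and the function name are all illustrative):

```python
import torch

def lqer_factors(W: torch.Tensor, quantize, rank: int = 32):
    # Approximate W ~= quantize(W) + U @ V, where U, V come from a
    # truncated SVD of the quantization error.
    Wq = quantize(W)                      # any quantizer, e.g. a GPTQ output
    E = (W - Wq).float()                  # quantization error to rescue
    U, S, Vh = torch.linalg.svd(E, full_matrices=False)
    U = U[:, :rank] * S[:rank]            # fold singular values into U
    V = Vh[:rank]
    return Wq, U, V                       # forward pass uses Wq + U @ V
```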

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5252-line training script reproducing PR openai#1851's stack with extensive
inline annotations (Chinese comments). Mandatory FA3 import (no SDPA fallback)
and direct Triton kernel use. Sibling to the train_gpt_s9*.py variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lijuncheng16 force-pushed the train-gpt-0427 branch 2 times, most recently from 8574bf5 to a2cb6a3, on April 28, 2026 22:26
lijuncheng16 and others added 2 commits April 29, 2026 10:37
…t w/ GPTQ v2

3143-line condensed version of train_gpt_s0_pr1851_mod.py (no inline
annotations, GPTQ v2 path). Same mandatory FA3 + Triton dependency as
the annotated sibling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same 3143-line code as v2; only Hyperparameters defaults changed to
match the PR openai#1851 stack tuning observed in stage-1/2 ablations:
  SEED=42  MIN_LR=0.1  TTT_BATCH_SIZE=16  PHASED_TTT_NUM_PHASES=3
  GPTQ_RESERVE_SECONDS=16  EMBED_BITS=7  EMBED_CLIP_SIGMAS=15
  MLP_CLIP_SIGMAS=12  SMEAR_GATE_ENABLED=1  GATED_ATTN_QUANT_GATE=1
  SPARSE_ATTN_GATE_ENABLED=1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lijuncheng16 added a commit to lijuncheng16/parameter-golf that referenced this pull request Apr 30, 2026
Source file behind PR openai#1867 (lijuncheng16). The Sparse Matrix Tuning
implementation referenced in the parameter-golf notes blog (TTT section,
T4): _smt_select_masks() runs once on chunk 0 to pick top-K (default
keep_frac=0.25) 64x64 gradient blocks per matrix; the resulting binary
masks are then frozen for the rest of TTT and used to zero gradients
outside the kept blocks during each TTT step. The chunk-0 gradient
signal turned out to be too unstable to base a frozen mask on, so SMT
underperformed full LoRA TTT end-to-end.
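A sketch of the block selection as described, under the simplest reading: score each 64x64 block of a weight's chunk-0 gradient by magnitude and keep the top keep_frac fraction per matrix. The real _smt_select_masks() in the source file may differ in detail:

```python
import torch

def smt_select_mask(grad: torch.Tensor, block: int = 64, keep_frac: float = 0.25):
    H, W = grad.shape  # assumes both dims divide evenly by `block`
    bh, bw = H // block, W // block
    # per-block L1 score of the chunk-0 gradient
    scores = grad.abs().reshape(bh, block, bw, block).sum(dim=(1, 3))
    k = max(1, int(keep_frac * bh * bw))
    keep = torch.zeros(bh * bw, dtype=torch.bool)
    keep[scores.flatten().topk(k).indices] = True
    # expand block decisions back to an elementwise mask, frozen thereafter
    mask = keep.reshape(bh, 1, bw, 1).expand(bh, block, bw, block)
    return mask.reshape(H, W)

# during each TTT step: param.grad *= mask  (zero grads outside kept blocks)
```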

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>