
11L Int6 QAT + Per-Dim SmearGate + SWA: 1.1480 BPB (3-seed mean)#194

Open
baudrillardsgh0st wants to merge 4 commits into openai:main from baudrillardsgh0st:submit/11L-smearv2-swa

Conversation


baudrillardsgh0st commented Mar 20, 2026

Summary

  • Mean val_bpb: 1.1480 (3 seeds, self-verified)
  • Best single seed: 1.1453 (seed 1337)
  • Artifact: 15.33 MiB (int6-in-int8 + zstd-22)
  • 11 layers, 512 dim, 8 heads, 4 KV heads, MLP 3x, 26.5M params

Statistical Significance

Improvement over SOTA (1.1748): 0.0268 BPB (required: ≥0.005)

| Seed | val_loss | val_bpb | Steps | ms/step | SWA ckpts |
|------|----------|---------|-------|---------|-----------|
| 1337 | 1.9339   | 1.1453  | 8052  | 74.49   | 30        |
| 7    | 1.9400   | 1.1490  | 8040  | 74.34   | 29        |
| 42   | 1.9413   | 1.1498  | 7772  | 77.17   | 27        |

Mean: 1.1480 | Std: 0.0024 | t = 16.04 (df = 2) | p < 0.005

One-sided t-test: H₀ that the improvement is < 0.005 BPB is rejected at p < 0.005 (t = 16.04 exceeds the df = 2 one-sided critical value of 9.92 for p = 0.005).
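
The t-statistic can be reproduced from the per-seed numbers in the table above (a minimal sketch; the reported 16.04 presumably comes from higher-precision log values than the 4-decimal figures here):

```python
import math
import statistics

sota, required = 1.1748, 0.005
seed_bpb = [1.1453, 1.1490, 1.1498]          # per-seed val_bpb from the table above
improvements = [sota - b for b in seed_bpb]
mean_imp = statistics.mean(improvements)      # ~0.0268
sd = statistics.stdev(improvements)           # sample std, df = 2
t = (mean_imp - required) / (sd / math.sqrt(len(seed_bpb)))
# t ~ 15.7 with these rounded inputs, far past the one-sided df=2
# critical value of 6.96 at p = 0.01.
```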

Key Techniques

  1. Per-Dimension SmearGate: Learned sigmoid(Parameter(dim)) gate blending current and previous token embeddings. More expressive than scalar gating — each embedding dimension gets its own blend ratio. Zero-initialized, 512 params.

  2. Stochastic Weight Averaging (SWA): Checkpoints are averaged every 50 steps over the last 50% of training (~29-30 checkpoints per run). Produces smoother weight distributions that are more robust to int6 quantization. Pre-quant BPB 1.1666 vs. post-quant 1.1453, a gap of only 0.021 BPB.

  3. Int6 QAT with STE: Fake int6 quantization during the forward pass, with a straight-through estimator (STE) for gradients. Per-row symmetric scales with 6-bit clipping. Int6 values are stored in int8 containers, which zstd-22 then compresses by ~35%.

  4. High Muon weight decay (0.038): Keeps weights small for better int6 quantization fidelity and generalization.

  5. FP16 tied embedding passthrough: Embedding/unembedding kept in fp16 to avoid compounding quantization error.
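
A minimal sketch of the per-dimension SmearGate in technique 1 (the convex-blend form and the roll-based shift are assumptions; the PR only specifies a zero-initialized `sigmoid(Parameter(dim))` gate over current and previous token embeddings):

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Blend each token's embedding with its predecessor, per dimension."""

    def __init__(self, dim: int):
        super().__init__()
        # Zero-initialized logits, one per embedding dimension
        # (512 params at dim=512); sigmoid(0) = 0.5 at init.
        self.gate = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0, :] = 0.0           # first token has no predecessor
        g = torch.sigmoid(self.gate)  # per-dim blend ratio in (0, 1)
        return (1 - g) * x + g * prev
```

A scalar gate would be `nn.Parameter(torch.zeros(1))` instead; the per-dimension variant lets each embedding dimension learn its own blend ratio, which is the extra expressiveness the summary claims.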
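
Technique 2's running checkpoint average can be sketched as an incremental mean over parameters (a hypothetical helper; `torch.optim.swa_utils.AveragedModel` provides equivalent behavior out of the box):

```python
import copy
import torch

@torch.no_grad()
def swa_update(swa_model: torch.nn.Module, model: torch.nn.Module,
               n_averaged: int) -> int:
    """Fold the current weights into the running average:
    swa <- swa + (w - swa) / (n_averaged + 1)."""
    for p_swa, p in zip(swa_model.parameters(), model.parameters()):
        p_swa.add_(p.detach() - p_swa, alpha=1.0 / (n_averaged + 1))
    return n_averaged + 1

# Usage sketch (names hypothetical), matching SWA_START_FRAC=0.5 and
# SWA_EVERY=50 from the run command:
#   swa_model, n = copy.deepcopy(model), 0
#   if step >= total_steps // 2 and step % 50 == 0:
#       n = swa_update(swa_model, model, n)
```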
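
Technique 3's fake quantization can be sketched as follows, assuming per-row symmetric scales (as stated) and a standard signed-int6 clip range:

```python
import torch

def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    """Per-row symmetric fake int6 quantization with a straight-through
    estimator: the forward pass sees quantized weights, the backward
    pass treats the op as identity."""
    qmax = 2 ** (6 - 1) - 1                                    # 31
    scale = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax)             # in [-32, 31]
    w_q = q * scale
    # STE: the value of w_q, the gradient of w.
    return w + (w_q - w).detach()
```

For the artifact, `q` fits in an int8 tensor; the unused dynamic range of the int8 container is what zstd-22 then squeezes out (~35% compression per the summary above).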

Training Config

| Parameter | Value |
|-----------|-------|
| Layers | 11 |
| Matrix LR | 0.02 |
| Scalar LR | 0.02 |
| Tied Embed LR | 0.03 |
| Muon Momentum | 0.99 (warmup 0.92→0.99 over 1500 steps) |
| Muon Weight Decay | 0.038 |
| Warmdown Steps | 3000 |
| QAT Bits | 6 |
| SWA | Every 50 steps |
| Batch tokens | 524,288 |
| Seq len | 2048 |

Run command

NCCL_NVLS_ENABLE=0 \
VOCAB_SIZE=1024 NUM_HEADS=8 NUM_KV_HEADS=4 MODEL_DIM=512 \
NUM_LAYERS=11 MLP_MULT=3 \
QAT=1 QUANT_BITS=6 FP16_EMBED=1 LATE_K_LAYERS=0 \
EVAL_STRIDE=64 EVAL_BATCH_SEQS=32 \
MATRIX_LR=0.02 SCALAR_LR=0.02 TIED_EMBED_LR=0.03 \
MUON_MOMENTUM=0.99 MUON_WEIGHT_DECAY=0.038 \
SMEAR_GATE=1 BIGRAM_HASH=0 \
TRAIN_SEQ_LEN=2048 TRAIN_BATCH_TOKENS=524288 \
WARMDOWN_STEPS=3000 MAX_WALLCLOCK_SECONDS=600 \
SWA_ENABLED=1 SWA_START_FRAC=0.5 SWA_EVERY=50 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Key improvements over prior submission (openai#192, 1.1502):
- Per-dimension SmearGate (sigmoid(Parameter(dim))) vs scalar gate
- Stochastic Weight Averaging every 50 steps over last 50% of training
- Result: 1.1453 BPB, beating current SOTA (1.1458)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@baudrillardsgh0st baudrillardsgh0st changed the title 11L Int6 QAT + Per-Dim SmearGate + SWA: 1.1453 BPB 11L Int6 QAT + Per-Dim SmearGate + SWA: 1.1480 BPB (3-seed mean) Mar 20, 2026
Jackson and others added 2 commits March 20, 2026 05:47
Seeds 1337 (1.1453), 7 (1.1490), 42 (1.1498) — mean 1.1480.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
train_seed1337.log (BPB 1.1453), train_seed7.log (BPB 1.1490),
train_seed42.log (BPB 1.1498). Mean: 1.1480.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 20, 2026
Downloaded train_gpt.py and README from the top open PRs on openai/parameter-golf:
- PR openai#198 (1.1318): 11L Int6 + WD + SWA + FA3 + SmearGate + BigramHash
- PR openai#194 (1.1480): 11L Int6 QAT + SmearGate + SWA
- PR openai#206 (1.1507): 9L Int6 STE + SmearGate + OrthoInit + U-Net skips

Updated program.md to point agent at PR openai#198 as the new starting base,
with detailed technique breakdown and strategy to beat 1.1318.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove Python source contamination from train_seed1337.log (was 2264 lines, now 256)
- Fix torchrun warning/docstring interleaving in seed42 and seed7 logs
- Remove duplicate train.log (was identical to train_seed1337.log)
- Upgrade README: add ablation table, pre/post quant breakdown, p-value,
  expanded technique descriptions with sweep ranges, attribution, file manifest

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
