Record: Int6 STE + SmearGate + Seq2048 + OrthoInit + RoPE50K + SWA/100 (mean val_bpb=1.1507) #206

Open
dexhunter wants to merge 2 commits into openai:main from dexhunter:weco-step31-swa100-1.1494
Conversation

@dexhunter

Summary

Mean val_bpb = 1.1507 (3-seed verified, p<0.001), beating merged SOTA (1.1748) by 0.024.

Evolved over 31 AIDE2 optimization steps from baseline 1.1607 on 8xH100.

| Seed | val_bpb | Steps | ms/step | Artifact (bytes) |
|------|---------|-------|---------|------------------|
| 1337 | 1.15022 | 10613 | 56.53 | 14,555,057 |
| 42 | 1.15095 | 10610 | 56.53 | 14,791,593 |
| 7 | 1.15099 | 10610 | 56.53 | 14,562,412 |
| **Mean** | **1.15072** | | | |

Technique Stack

  1. Int6 STE — Fake int6 quantization every forward pass with STE gradient bypass
  2. NorMuon + WD=0.02 — Row-normalized Newton-Schulz with decoupled weight decay
  3. 3x MLP (1536 hidden) — Wider MLP enabled by int6 compression
  4. SmearGate — Learned gate blending token embeddings with predecessors (~512 params)
  5. Orthogonal Init — OrthoInit on all non-zero-init linear layers
  6. Seq2048 + RoPE Base 50K — 2x training context with adjusted RoPE
  7. SWA every 100 steps — More frequent checkpoint averaging during warmdown
  8. FP16 tied embedding — Embedding never quantized
  10. Sliding window eval (stride=64) — Every token scored with ~1984 tokens of context
  10. Zstd-22 — Better compression than zlib
  11. U-Net skip connections — Encoder-decoder with learnable skip weights
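
The Int6 STE item above admits a compact sketch: snap weights onto a 6-bit grid on the forward pass while letting gradients bypass the rounding. This is a hypothetical PyTorch reconstruction — the scale choice and clamp range are assumptions, not taken from the PR's train_gpt.py:

```python
import torch

def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    """Fake int6 quantization with a straight-through estimator (STE).

    Forward: snap weights onto a symmetric 6-bit grid (levels -32..31).
    Backward: the detach trick makes the rounding invisible to autograd,
    so gradients pass through as if no quantization had happened.
    Hypothetical sketch; the PR's actual scale/clamp choices may differ.
    """
    scale = w.abs().max().clamp(min=1e-8) / 31.0
    q = (w / scale).round().clamp(-32, 31) * scale
    return w + (q - w).detach()  # forward = q, backward = identity
```

Applied every forward pass, this trains the network to tolerate the quantization grid, which is what allows the wider MLP to still fit in the artifact budget.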

Architecture

  • 9 layers, 512 dim, 8 heads, 4 KV heads (GQA)
  • Vocab 1024 (SentencePiece BPE), seq len 2048, tied embeddings
  • relu² activation, RoPE, logit softcapping (30.0)
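
The logit softcapping noted above (cap 30.0) is a scaled tanh that smoothly bounds logits while staying near-identity for small values. A minimal sketch — the function name and call site are illustrative, not from the PR:

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Scaled tanh keeps logits in (-cap, cap) while remaining near-linear
    # for small values; cap=30.0 matches the architecture note above.
    return cap * torch.tanh(logits / cap)
```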

Submission checklist

  • 3-seed verification (mean val_bpb=1.1507)
  • All artifacts < 16MB (max 14.79MB, 1.2MB headroom)
  • Wallclock < 600s on 8xH100
  • Train logs included (3 seeds)
  • Reproducible train_gpt.py included
  • submission.json with metadata

…SWA/100 (val_bpb=1.1507)

3-seed verified mean val_bpb=1.1507 (sliding window, stride=64).
Seeds: 1337=1.1502, 42=1.1509, 7=1.1510. All artifacts under 16MB.
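
The stride-64 sliding-window evaluation can be sketched as span bookkeeping: windows advance by the stride, and each window scores only the tokens not yet covered, so after the first window every scored token has roughly 2048 − 64 = 1984 tokens of left context. A pure-Python reconstruction under those assumptions:

```python
def sliding_window_spans(n_tokens: int, window: int = 2048, stride: int = 64):
    """Plan evaluation windows so every token is scored exactly once.

    Returns (win_start, score_start, win_end) triples: the model runs on
    tokens [win_start, win_end) but only positions [score_start, win_end)
    count toward val_bpb. After the first window, each scored token sees
    window - stride (= 1984 here) tokens of left context.
    Hypothetical reconstruction of the stride-64 eval described above.
    """
    spans = []
    win_start, scored_to = 0, 0
    while scored_to < n_tokens:
        win_end = min(win_start + window, n_tokens)
        spans.append((win_start, scored_to, win_end))
        scored_to = win_end
        win_start += stride
    return spans
```

The small stride trades many extra forward passes for near-full context on every scored token, which is an eval-time-only cost and does not affect the training wallclock budget.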

Technique stack evolved over 31 AIDE2 optimization steps:
- Int6 STE quantization-aware training (near-zero quant penalty)
- NorMuon optimizer with decoupled weight decay (0.02)
- 3x MLP width (1536 hidden)
- SmearGate: learned embedding-level context blending
- Orthogonal initialization for all linear layers
- Sequence length 2048 with RoPE base 50K
- SWA every 100 steps during warmdown
- FP16 tied embedding passthrough
- Sliding window eval (stride=64)
- Zstd-22 compression
- U-Net skip connections
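
SmearGate, listed above, can be read as a per-channel learned gate that mixes each token's embedding with its predecessor's; at model dim 512 that gives ~512 parameters, consistent with the count stated in the summary. A hypothetical PyTorch sketch — the gating form and initialization are assumptions:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Blend each token embedding with its predecessor via a learned
    per-channel gate (dim parameters). Initialized near "no smear" so
    training starts from the ungated baseline. Hypothetical sketch
    reconstructed from the PR's one-line description."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Parameter(torch.full((dim,), -4.0))  # sigmoid ≈ 0.018

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); shift right so position t sees token t-1
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        g = torch.sigmoid(self.gate)
        return (1 - g) * x + g * prev
```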
The leaderboard expects val_bpb, val_loss, bytes_total, and bytes_code
at top level; our submission used mean_val_bpb, artifact_bytes, etc.
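
Given the schema mismatch noted above, a corrected submission.json would carry the leaderboard's expected keys at top level. A sketch using this PR's reported numbers where available — val_loss and bytes_code are placeholders, since the PR text does not state them:

```python
import json

# Sketch of a submission.json matching the expected leaderboard schema.
# val_bpb and bytes_total come from this PR's report; val_loss and
# bytes_code are placeholders because the PR text does not state them.
submission = {
    "val_bpb": 1.1507,          # 3-seed mean reported above
    "val_loss": None,           # placeholder: not reported in this PR
    "bytes_total": 14_791_593,  # largest artifact reported above
    "bytes_code": None,         # placeholder: not reported in this PR
}
print(json.dumps(submission, indent=2))
```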
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 20, 2026
Downloaded train_gpt.py and README from the top open PRs on openai/parameter-golf:
- PR openai#198 (1.1318): 11L Int6 + WD + SWA + FA3 + SmearGate + BigramHash
- PR openai#194 (1.1480): 11L Int6 QAT + SmearGate + SWA
- PR openai#206 (1.1507): 9L Int6 STE + SmearGate + OrthoInit + U-Net skips

Updated program.md to point agent at PR openai#198 as the new starting base,
with detailed technique breakdown and strategy to beat 1.1318.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>