
Record: 10L Int5-MLP + SmearGate + BigramHash + Late QAT (val_bpb=1.1628)#286

Open
chris-buckley wants to merge 1 commit into openai:main from chris-buckley:10l-int5mlp-smearbigram-lateqat

Conversation

@chris-buckley

Summary

Mixed-precision int5/int6 export trades per-weight precision for an extra transformer layer: MLP weights go int5 while attention stays int6, buying enough artifact budget for a 10-layer ReLU² model under the 16 MB cap. SmearGate and BigramHash inject cheap token-pair context without learned parameters, and late QAT (kicking in at 85% wallclock) avoids the training instability of always-on STE while still closing most of the quantization gap.
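For readers unfamiliar with the export trick, here is a minimal per-tensor symmetric quantizer illustrating the int5-MLP / int6-attention split. This is a sketch only — the actual export code and scaling scheme live in `train_gpt.py` and may differ (per-channel scales, different rounding, etc.); all names below are hypothetical.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int) -> tuple[np.ndarray, float]:
    """Per-tensor symmetric quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1          # 15 for int5, 31 for int6
    scale = float(np.abs(w).max()) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w_mlp = rng.standard_normal((256, 256)).astype(np.float32)   # stand-in MLP weight
w_attn = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in attention weight

q_mlp, s_mlp = quantize_symmetric(w_mlp, bits=5)    # MLP weights -> int5
q_attn, s_attn = quantize_symmetric(w_attn, bits=6) # attention weights -> int6

# int5 roughly doubles the quantization step vs int6, so its
# reconstruction error is larger -- the price paid for the 10th layer.
err5 = float(np.abs(q_mlp * s_mlp - w_mlp).mean())
err6 = float(np.abs(q_attn * s_attn - w_attn).mean())
```

Packing 5-bit values instead of 6-bit ones saves about 1/6 of the MLP weight bytes, which is where the artifact budget for the extra layer comes from.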

Technique Stack

  1. Mixed int5 MLP / int6 attention export — buys artifact budget for a 10th layer
  2. SmearGate — gated residual smearing for cheap inter-token mixing
  3. BigramHash — 4096-bucket bigram embedding (dim 128) for token-pair context without a full bigram table
  4. Orthogonal init + muP-style output projection scaling — stable training at depth 10
  5. Decoupled Muon weight decay (0.04) — Muon optimizer with weight decay decoupled from the update
  6. SWA during warmdown — averaging 15 checkpoints every 50 steps starting at 50% of training
  7. Late QAT at 85% wallclock — quantization-aware fine-tuning only in the final phase, not always-on STE
  8. Sliding-window eval (stride 64, full-tail handling) — proper long-context evaluation
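The BigramHash idea (item 3) can be sketched as follows: hash each (previous, current) token pair into one of 4096 buckets and look up a dim-128 vector, instead of storing a full vocab² bigram table. The hash function and mixing constant below are arbitrary illustrations, not taken from the PR; the bucket table here is random, standing in for whatever the model actually uses.

```python
import numpy as np

N_BUCKETS, DIM = 4096, 128  # values from the technique list

# Stand-in bucket table (random here, purely for illustration).
rng = np.random.default_rng(1337)
bigram_table = rng.standard_normal((N_BUCKETS, DIM)).astype(np.float32)

def bigram_bucket(prev_tok: int, cur_tok: int, vocab: int = 1024) -> int:
    """Hash a (prev, cur) token pair into one of N_BUCKETS buckets.
    The multiplicative mixing constant is an arbitrary choice."""
    return (prev_tok * vocab + cur_tok) * 2654435761 % N_BUCKETS

def bigram_features(tokens: list[int]) -> np.ndarray:
    """Per-position bigram vector; position 0 has no predecessor, so zeros."""
    out = np.zeros((len(tokens), DIM), dtype=np.float32)
    for i in range(1, len(tokens)):
        out[i] = bigram_table[bigram_bucket(tokens[i - 1], tokens[i])]
    return out

feats = bigram_features([3, 17, 17, 512])  # shape (4, 128)
```

With a 1024-token vocab, a full bigram table would need 1024² = 1,048,576 rows; 4096 hashed buckets cover the same pairs with collisions, at 1/256 of the storage.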

Metrics

| Metric | Value |
| --- | --- |
| val_bpb | 1.1628 |
| val_loss | 1.9634 |
| pre-quant val_bpb | 1.1907 |
| pre-quant val_loss | 2.0105 |
| artifact size | 15,481,841 bytes (model: 15,425,120 + code: 56,721) |
| steps | 4,354 |
| wallclock | 602 s |
| eval time | 171 s |
| hardware | 8×H100 SXM |
| seed | 1337 |
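As a sanity check on the table: assuming val_bpb is computed as (loss converted to bits per token) divided by the dataset's average bytes per token, the reported numbers imply roughly 2.44 bytes per token for the 1024-entry BPE tokenizer. This derivation is an assumption about the metric's definition, not taken from the training code.

```python
import math

val_loss_nats = 1.9634  # nats/token, from the metrics table
val_bpb = 1.1628        # bits/byte, from the metrics table

bits_per_token = val_loss_nats / math.log(2)        # ~2.833 bits/token
implied_bytes_per_token = bits_per_token / val_bpb  # ~2.436 bytes/token
```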

Reproduction

# install the decompression dependency once, before the run;
# chaining it with && would stop the env-var prefix from reaching torchrun
pip install zstandard

RUN_ID=10l_int5mlp_smearbigram_lateqat_seed1337 \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 \
./records/track_10min_16mb/2026-03-20_10L_Int5MLP_SmearBigram_LateQAT/train_gpt.py

Three-seed sweep:

for SEED in 1337 42 7; do
  RUN_ID=10l_int5mlp_smearbigram_lateqat_seed${SEED} \
  DATA_PATH=./data/datasets/fineweb10B_sp1024 \
  TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
  VOCAB_SIZE=1024 \
  SEED=${SEED} \
  torchrun --standalone --nproc_per_node=8 \
  ./records/track_10min_16mb/2026-03-20_10L_Int5MLP_SmearBigram_LateQAT/train_gpt.py
done

Status

This is a single-seed result (seed 1337). It does not beat the current best MLP3x submission (val_bpb=1.1598). The technique stack is complete and the run is reproducible, but seeds 42 and 7 still need to be run for statistical significance before this qualifies as a proper record claim.

I'm posting this as a record contribution to document the mixed int5/int6 + late-QAT approach. If the multi-seed results hold up or improve, I'll update this PR.

