
Record: 10L Int5-MLP + SmearGate + BigramHash + Late QAT (val_bpb=1.1628)#286

Open
chris-buckley wants to merge 1 commit into openai:main from chris-buckley:10l-int5mlp-smearbigram-lateqat

Conversation

@chris-buckley

Summary

Mixed-precision int5/int6 export trades per-weight precision for an extra transformer layer: MLP weights go int5 while attention stays int6, buying enough artifact budget for a 10-layer ReLU² model under the 16 MB cap. SmearGate and BigramHash inject cheap token-pair context without learned parameters, and late QAT (kicking in at 85% wallclock) avoids the training instability of always-on STE while still closing most of the quantization gap.
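For readers unfamiliar with the export trick, here is a minimal per-tensor symmetric quantizer illustrating the int5-MLP / int6-attention split. This is a sketch only — the actual export code and scaling scheme live in `train_gpt.py` and may differ (per-channel scales, different rounding, etc.); all names below are hypothetical.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int) -> tuple[np.ndarray, float]:
    """Per-tensor symmetric quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1          # 15 for int5, 31 for int6
    scale = float(np.abs(w).max()) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w_mlp = rng.standard_normal((256, 256)).astype(np.float32)   # stand-in MLP weight
w_attn = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in attention weight

q_mlp, s_mlp = quantize_symmetric(w_mlp, bits=5)    # MLP weights -> int5
q_attn, s_attn = quantize_symmetric(w_attn, bits=6) # attention weights -> int6

# int5 roughly doubles the quantization step vs int6, so its
# reconstruction error is larger -- the price paid for the 10th layer.
err5 = float(np.abs(q_mlp * s_mlp - w_mlp).mean())
err6 = float(np.abs(q_attn * s_attn - w_attn).mean())
```

Packing 5-bit values instead of 6-bit ones saves about 1/6 of the MLP weight bytes, which is where the artifact budget for the extra layer comes from.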

Technique Stack

  1. Mixed int5 MLP / int6 attention export — buys artifact budget for a 10th layer
  2. SmearGate — gated residual smearing for cheap inter-token mixing
  3. BigramHash — 4096-bucket bigram embedding (dim 128) for token-pair context without a full bigram table
  4. Orthogonal init + muP-style output projection scaling — stable training at depth 10
  5. Decoupled Muon weight decay (0.04) — Muon optimizer with weight decay decoupled from the update
  6. SWA during warmdown — averaging 15 checkpoints every 50 steps starting at 50% of training
  7. Late QAT at 85% wallclock — quantization-aware fine-tuning only in the final phase, not always-on STE
  8. Sliding-window eval (stride 64, full-tail handling) — proper long-context evaluation
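The BigramHash idea (item 3) can be sketched as follows: hash each (previous, current) token pair into one of 4096 buckets and look up a dim-128 vector, instead of storing a full vocab² bigram table. The hash function and mixing constant below are arbitrary illustrations, not taken from the PR; the bucket table here is random, standing in for whatever the model actually uses.

```python
import numpy as np

N_BUCKETS, DIM = 4096, 128  # values from the technique list

# Stand-in bucket table (random here, purely for illustration).
rng = np.random.default_rng(1337)
bigram_table = rng.standard_normal((N_BUCKETS, DIM)).astype(np.float32)

def bigram_bucket(prev_tok: int, cur_tok: int, vocab: int = 1024) -> int:
    """Hash a (prev, cur) token pair into one of N_BUCKETS buckets.
    The multiplicative mixing constant is an arbitrary choice."""
    return (prev_tok * vocab + cur_tok) * 2654435761 % N_BUCKETS

def bigram_features(tokens: list[int]) -> np.ndarray:
    """Per-position bigram vector; position 0 has no predecessor, so zeros."""
    out = np.zeros((len(tokens), DIM), dtype=np.float32)
    for i in range(1, len(tokens)):
        out[i] = bigram_table[bigram_bucket(tokens[i - 1], tokens[i])]
    return out

feats = bigram_features([3, 17, 17, 512])  # shape (4, 128)
```

With a 1024-token vocab, a full bigram table would need 1024² = 1,048,576 rows; 4096 hashed buckets cover the same pairs with collisions, at 1/256 of the storage.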

Metrics

| Metric | Value |
| --- | --- |
| val_bpb | 1.1628 |
| val_loss | 1.9634 |
| pre-quant val_bpb | 1.1907 |
| pre-quant val_loss | 2.0105 |
| artifact size | 15,481,841 bytes (model: 15,425,120 + code: 56,721) |
| steps | 4,354 |
| wallclock | 602 s |
| eval time | 171 s |
| hardware | 8×H100 SXM |
| seed | 1337 |
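As a sanity check on the table: assuming val_bpb is computed as (loss converted to bits per token) divided by the dataset's average bytes per token, the reported numbers imply roughly 2.44 bytes per token for the 1024-entry BPE tokenizer. This derivation is an assumption about the metric's definition, not taken from the training code.

```python
import math

val_loss_nats = 1.9634  # nats/token, from the metrics table
val_bpb = 1.1628        # bits/byte, from the metrics table

bits_per_token = val_loss_nats / math.log(2)        # ~2.833 bits/token
implied_bytes_per_token = bits_per_token / val_bpb  # ~2.436 bytes/token
```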

Reproduction

# install the decompression dependency once, before the run;
# chaining it with && would stop the env-var prefix from reaching torchrun
pip install zstandard

RUN_ID=10l_int5mlp_smearbigram_lateqat_seed1337 \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 \
./records/track_10min_16mb/2026-03-20_10L_Int5MLP_SmearBigram_LateQAT/train_gpt.py

Three-seed sweep:

for SEED in 1337 42 7; do
  RUN_ID=10l_int5mlp_smearbigram_lateqat_seed${SEED} \
  DATA_PATH=./data/datasets/fineweb10B_sp1024 \
  TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
  VOCAB_SIZE=1024 \
  SEED=${SEED} \
  torchrun --standalone --nproc_per_node=8 \
  ./records/track_10min_16mb/2026-03-20_10L_Int5MLP_SmearBigram_LateQAT/train_gpt.py
done

Status

This is a single-seed result (seed 1337). It does not beat the current best MLP3x submission (val_bpb=1.1598). The technique stack is complete and the run is reproducible, but seeds 42 and 7 still need to be run for statistical significance before this qualifies as a proper record claim.

I'm posting this as a record contribution to document the mixed int5/int6 + late-QAT approach. If the multi-seed results hold up or improve, I'll update this PR.

