Record: 8192 Vocab Size, NorMuon, Selective Quantization; 1.186 val_bpb by mtybadger · Pull Request #78 · openai/parameter-golf

mtybadger · 2026-03-19T12:14:57Z

2026-03-19: Vocab Size, NorMuon, Selective Quantization; 1.186 val_bpb

Day 1! This record contains three main new ideas, as well as some tweaks to the baseline, particularly vocab size. I had several ideas I wanted to try today, and these are the ones that worked - I want to chase further on quantization in the coming days.

Changes in this model:

Vocab size 1024 -> 8192
New "sp8192" tokenizer trained using

./data/download_hf_docs_and_tokenize.py   --output-root ./data   --tokenizer-config ./data/tokenizer_specs.json --max-train-tokens 8000000000 --tokenizer-train-docs 100000

with this tokenizer_spec:

{
  "tokenizers": [
    {
      "name": "sp_bpe_1024",
      "dataset_suffix": "sp1024",
      "vocab_size": 1024
    },
    {
      "name": "sp_bpe_8192",
      "dataset_suffix": "sp8192",
      "vocab_size": 8192
    }
  ]
}

This gives a 50/50 val/train split. Tokenizers and data for sp1024, sp2048, sp4096 and sp8192 are publicly available on my Hugging Face.

NorMuon implementation from the original paper, popularized by modded-nanogpt, replacing Muon
Selective Quantization: the weights are quantized to int6, while the embeddings are kept at int8. Not sure if this is optimal and have seen plenty of weird behaviour from this, but I think it's in the right direction; I think being precise about precision will be really key to this challenge and I want to dig into it more. From now on there will be a lot of trading off precision between areas of the model!
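To make the w6e8 idea concrete, here is a minimal sketch of selective quantization: symmetric per-row rounding, with the bit width chosen by parameter name. It is not the code in train_gpt.py. The parameter names, the `"embed"` name test, and the per-row scaling scheme are all assumptions made for illustration.

```python
import random

def quantize_row(row, bits):
    """Symmetric per-row quantization to signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6, 127 for int8
    scale = max(abs(x) for x in row) / qmax or 1.0  # all-zero row: fall back to 1.0
    q = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return q, scale

def dequantize_row(q, scale):
    """Map the stored integers back to floats."""
    return [v * scale for v in q]

def bits_for(name, weight_bits=6, embed_bits=8):
    """Hypothetical selection rule: embeddings keep more precision."""
    return embed_bits if "embed" in name else weight_bits

def roundtrip_error(row, bits):
    """Largest absolute error after quantize -> dequantize."""
    q, scale = quantize_row(row, bits)
    return max(abs(a - b) for a, b in zip(row, dequantize_row(q, scale)))

random.seed(0)
# Hypothetical parameter names; each "tensor" is a single row for brevity.
params = {
    "tok_embed.weight": [random.gauss(0.0, 0.02) for _ in range(64)],
    "blocks.0.mlp.weight": [random.gauss(0.0, 0.02) for _ in range(64)],
}
for name, row in params.items():
    bits = bits_for(name)
    print(f"{name}: int{bits}, max roundtrip error {roundtrip_error(row, bits):.6f}")
```

In this scheme the worst-case error per row is half the scale, so moving from int8 to int6 roughly quadruples it (qmax drops from 127 to 31). That is one way to see why the choice of which tensors get fewer bits matters.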

Configuration:

All hyperparams as in default NaiveBaseline except VOCAB_SIZE, TRAIN_SEQ_LEN, WARMDOWN_ITERS and NUM_LAYERS; unfortunately to get the increased vocab size we have to sacrifice a layer. I'm sure there's a better architectural setup here, but I don't know if it's recurrence.
Tested on Hyperbolic Labs 8xH100 node with SXM5; reproduced baseline with step_avg:43.67ms and final_int8_zlib_roundtrip_exact val_bpb:1.22731147 immediately before.

Command:

NCCL_IB_DISABLE=1 \
RUN_ID=verify_sp8192_w6e8_8gpu \
DATA_PATH=./data/datasets/fineweb10B_sp8192 \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
VOCAB_SIZE=8192 \
MAX_WALLCLOCK_SECONDS=600 \
WEIGHT_QUANTIZATION_BITS=6 \
EMBED_QUANTIZATION_BITS=8 \
WARMDOWN_ITERS=3000 \
TRAIN_SEQ_LEN=4096 \
NUM_LAYERS=8 \
torchrun --standalone --nproc_per_node=8 ./records/track_10min_16mb/2026-03-19_VocabSize_NorMuon_SelectiveQuant/train_gpt.py

Key metrics (from train.log):

Timed training stopped at 9359/20000 steps due to the wallclock cap.
Pre-quant eval at stop: val_loss:3.0261, val_bpb:1.1717
Post-quant roundtrip eval: val_loss:3.06233041, val_bpb:1.18576208
Train time: 600075ms (step_avg:64.12ms)
Serialized model w6e8+zlib: 14743224 bytes
Code size: 53612 bytes
Total submission size w6e8+zlib: 14796836 bytes
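The key metrics above can be cross-checked with a few lines of arithmetic. The sketch below assumes that val_loss is mean cross-entropy in nats per token and that val_bpb is bits per byte of validation text; neither definition is spelled out in this description. Under those assumptions the two evals should imply the same bytes-per-token ratio.

```python
import math

# Figures copied from the key metrics list.
model_bytes = 14743224
code_bytes = 53612
train_ms = 600075
steps = 9359

total_bytes = model_bytes + code_bytes   # 14796836, matches the reported total
step_avg_ms = train_ms / steps           # average wallclock per step

def implied_bytes_per_token(val_loss_nats, val_bpb):
    """Bytes of text per token implied by a (loss, bpb) pair.

    Assumes bpb = (loss / ln 2) / bytes_per_token.
    """
    bits_per_token = val_loss_nats / math.log(2)
    return bits_per_token / val_bpb

pre_quant = implied_bytes_per_token(3.0261, 1.1717)
post_quant = implied_bytes_per_token(3.06233041, 1.18576208)

print("total bytes:", total_bytes)
print("step_avg ms:", round(step_avg_ms, 2))
print("bytes/token (pre, post):", round(pre_quant, 3), round(post_quant, 3))
```

Both evals imply the same ratio to within rounding, which is what one would expect if the tokenizer and validation set are identical between the pre-quant and post-quant evaluations.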

Training volume:

Global batch: 524288 tokens/step
Total train tokens seen: 7224688640
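A quick arithmetic check relates the training volume to the step count. The reported total is exactly 13780 steps at 524288 tokens/step, while the 9359 steps at which training stopped would correspond to 4906811392 tokens at that batch size. The source does not say whether the total was carried over from a longer run or whether the effective batch differed. The sequences-per-step figure assumes each step packs the global batch into sequences of TRAIN_SEQ_LEN=4096.

```python
GLOBAL_BATCH_TOKENS = 524288       # tokens per optimizer step, as reported
REPORTED_TOTAL_TOKENS = 7224688640
STEPS_AT_STOP = 9359               # from the key metrics
TRAIN_SEQ_LEN = 4096               # from the command

# How many full steps does the reported total correspond to?
steps_implied, remainder = divmod(REPORTED_TOTAL_TOKENS, GLOBAL_BATCH_TOKENS)

# Tokens consumed if every one of the 9359 steps used the full global batch.
tokens_at_stop = STEPS_AT_STOP * GLOBAL_BATCH_TOKENS

# Sequences per step, assuming the batch is split into fixed-length sequences.
seqs_per_step = GLOBAL_BATCH_TOKENS // TRAIN_SEQ_LEN

print(steps_implied, remainder, tokens_at_stop, seqs_per_step)
```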

Included files:

train_gpt.py (code snapshot used for the run)
train.log (exact remote training log)
submission.json (leaderboard metadata)

openai#77, openai#78) Analyzed techniques, ablations, and individual BPB contributions. Key finding: sliding window eval (~0.034) and int6+wider MLP (~0.029) are the dominant validated techniques. Several promising combinations remain untested across submissions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Major improvements based on competition intelligence (day 2 PRs):
1. Sliding window eval (stride=256): overlapping windows give each token more context. Free ~0.03 bpb improvement, zero artifact cost. Based on PRs openai#70, openai#77, openai#65.
2. Int6 quantization: configurable WEIGHT_QUANT_BITS (default 6) and EMBED_QUANT_BITS (default 8). Saves ~25% artifact space vs int8, allowing bigger models. Based on PRs openai#78, openai#70.
3. MLP 3x expansion: MLP_MULT_NUM=3 (up from 8/3). Wider MLP gives ~0.019 bpb improvement. Based on PRs openai#70, openai#66.
4. Default dim=512 with LR=0.03 (best config from experiments).
5. forward_logits() helper for sliding window (avoids model.forward, which returns loss, not logits).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
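The sliding window eval mentioned in that commit message can be sketched as an index schedule: each window is a full forward pass, but only the tokens not already scored by an earlier window contribute to the loss. This is a generic illustration, not code from any of the referenced PRs. The function name and the `(start, end, score_from)` tuple layout are hypothetical.

```python
def sliding_windows(n_tokens, window, stride):
    """Yield (start, end, score_from) for overlapping evaluation windows.

    The model would see tokens[start:end]; only positions in
    [score_from, end) are scored, so each position is scored exactly once.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must satisfy 0 < stride <= window")
    start = 0
    scored_up_to = 0
    while scored_up_to < n_tokens:
        end = min(start + window, n_tokens)
        yield start, end, scored_up_to
        scored_up_to = end
        if end == n_tokens:
            break
        start += stride

# Small example: 10 tokens, window of 4, stride of 2.
for start, end, score_from in sliding_windows(10, 4, 2):
    context = score_from - start   # tokens of context before the first scored one
    print(f"window [{start}, {end}) scores [{score_from}, {end}) with context {context}")
```

After the first window, every scored position has at least `window - stride` tokens of left context, at the cost of roughly `window / stride` times as many forward passes. A smaller stride therefore costs more evaluation compute but adds nothing to the artifact size.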

record

42de836

jordankzf mentioned this pull request Mar 19, 2026

Unofficial Leaderboard #83

Open

mtybadger wants to merge 1 commit into openai:main from mtybadger:spruce/3-19
