
Record: 8192 Vocab Size, NorMuon, Selective Quantization; 1.186 val_bpb #78

Open
mtybadger wants to merge 1 commit into openai:main from mtybadger:spruce/3-19

@mtybadger

2026-03-19: Vocab Size, NorMuon, Selective Quantization; 1.186 val_bpb

Day 1! This record contains three main new ideas, as well as some tweaks to the baseline, particularly vocab size. I had several ideas I wanted to try today, and these are the ones that worked. I want to dig further into quantization in the coming days.

Changes in this model:

  • Vocab size 1024 -> 8192
  • New "sp8192" tokenizer trained using
./data/download_hf_docs_and_tokenize.py   --output-root ./data   --tokenizer-config ./data/tokenizer_specs.json --max-train-tokens 8000000000 --tokenizer-train-docs 100000

with this tokenizer_spec:

{
  "tokenizers": [
    {
      "name": "sp_bpe_1024",
      "dataset_suffix": "sp1024",
      "vocab_size": 1024
    },
    {
      "name": "sp_bpe_8192",
      "dataset_suffix": "sp8192",
      "vocab_size": 8192
    }
  ]
}

with a 50/50 val/train split as a result. Tokenizers for sp1024, sp2048, sp4096 and sp8192, together with the tokenized data, are publicly available on my Hugging Face.

  • NorMuon implementation from the original paper, popularized by modded-nanogpt, replacing Muon
  • Selective Quantization: the weights are quantized to int6, while the embeddings are kept at int8. Not sure if this is optimal and have seen plenty of weird behaviour from this, but I think it's in the right direction; I think being precise about precision will be really key to this challenge and I want to dig into it more. From now on there will be a lot of trading off precision between areas of the model!
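To make the selective quantization idea concrete, here is a minimal sketch, not the code in train_gpt.py. It assumes symmetric per-tensor quantization and a hypothetical naming rule (any parameter whose name contains "embed" counts as an embedding); the parameter names and the helper functions are invented for illustration. Only the 6-bit/8-bit split comes from this PR.

```python
import random


def quantize_symmetric(values, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1  # 31 for int6, 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest level and clamp into [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to floats."""
    return [x * scale for x in q]


def bits_for(name, weight_bits=6, embed_bits=8):
    """Pick a bit width per tensor (hypothetical rule based on the name)."""
    return embed_bits if "embed" in name else weight_bits


def max_error(values, bits):
    """Largest absolute round-trip error for one tensor."""
    q, scale = quantize_symmetric(values, bits)
    return max(abs(a - b) for a, b in zip(values, dequantize(q, scale)))


random.seed(0)
# Toy tensors with made-up names; real tensors would be torch parameters.
params = {
    "tok_embed.weight": [random.gauss(0.0, 0.02) for _ in range(1024)],
    "blocks.0.mlp.fc.weight": [random.gauss(0.0, 0.02) for _ in range(1024)],
}
for name, values in params.items():
    bits = bits_for(name)
    print(f"{name}: int{bits}, max round-trip error {max_error(values, bits):.6f}")
```

With round-to-nearest, the worst-case error per tensor is half a quantization step, so int6 tolerates roughly four times the error of int8 for the same value range. That is the precision being traded for artifact size.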

Configuration:

  • All hyperparams as in default NaiveBaseline except VOCAB_SIZE, TRAIN_SEQ_LEN, WARMDOWN_ITERS and NUM_LAYERS; unfortunately to get the increased vocab size we have to sacrifice a layer. I'm sure there's a better architectural setup here, but I don't know if it's recurrence.
  • Tested on Hyperbolic Labs 8xH100 node with SXM5; reproduced baseline with step_avg:43.67ms and final_int8_zlib_roundtrip_exact val_bpb:1.22731147 immediately before.

Command:

NCCL_IB_DISABLE=1 \
RUN_ID=verify_sp8192_w6e8_8gpu \
DATA_PATH=./data/datasets/fineweb10B_sp8192 \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
VOCAB_SIZE=8192 \
MAX_WALLCLOCK_SECONDS=600 \
WEIGHT_QUANTIZATION_BITS=6 \
EMBED_QUANTIZATION_BITS=8 \
WARMDOWN_ITERS=3000 \
TRAIN_SEQ_LEN=4096 \
NUM_LAYERS=8 \
torchrun --standalone --nproc_per_node=8 ./records/track_10min_16mb/2026-03-19_VocabSize_NorMuon_SelectiveQuant/train_gpt.py

Key metrics (from train.log):

  • Timed training stopped at 9359/20000 steps due to the wallclock cap.
  • Pre-quant eval at stop: val_loss:3.0261, val_bpb:1.1717
  • Post-quant roundtrip eval: val_loss:3.06233041, val_bpb:1.18576208
  • Train time: 600075ms (step_avg:64.12ms)
  • Serialized model w6e8+zlib: 14743224 bytes
  • Code size: 53612 bytes
  • Total submission size w6e8+zlib: 14796836 bytes
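The figures above can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted in this description, plus two assumptions: that the "16mb" in the track name means 16,000,000 bytes, and that val_bpb is computed as (val_loss / ln 2) × tokens-per-byte. Neither assumption is confirmed here.

```python
import math

# Numbers copied from this PR description.
MODEL_BYTES = 14743224
CODE_BYTES = 53612
REPORTED_TOTAL = 14796836
PRE_QUANT = {"val_loss": 3.0261, "val_bpb": 1.1717}
POST_QUANT = {"val_loss": 3.06233041, "val_bpb": 1.18576208}
BASELINE_BPB = 1.22731147

# Assumption: the track's size cap is decimal 16 MB.
CAP_BYTES = 16_000_000

total_bytes = MODEL_BYTES + CODE_BYTES
headroom = CAP_BYTES - total_bytes

# Cost of the int6/int8 round trip, and gain over the reproduced baseline.
quant_cost_bpb = POST_QUANT["val_bpb"] - PRE_QUANT["val_bpb"]
gain_vs_baseline = BASELINE_BPB - POST_QUANT["val_bpb"]


def implied_bytes_per_token(metrics):
    """Bytes per token implied by a (val_loss, val_bpb) pair.

    Assumes val_bpb = (val_loss / ln 2) * tokens_per_byte.
    """
    return metrics["val_loss"] / (math.log(2) * metrics["val_bpb"])


print("total bytes matches report:", total_bytes == REPORTED_TOTAL)
print("headroom under assumed cap:", headroom)
print(f"quantization cost: {quant_cost_bpb:.5f} bpb")
print(f"gain vs baseline:  {gain_vs_baseline:.5f} bpb")
print(f"implied bytes/token: {implied_bytes_per_token(POST_QUANT):.3f}")
```

Under these assumptions the pre-quant and post-quant pairs imply nearly the same bytes-per-token figure, which is a quick sanity check that both evals ran over the same tokenized data.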

Training volume:

  • Global batch: 524288 tokens/step
  • Total train tokens seen: 4906811392 (9359 steps × 524288 tokens/step)

Included files:

  • train_gpt.py (code snapshot used for the run)
  • train.log (exact remote training log)
  • submission.json (leaderboard metadata)

phaesoo added a commit to phaesoo/parameter-golf that referenced this pull request Mar 19, 2026
openai#77, openai#78)

Analyzed techniques, ablations, and individual BPB contributions.
Key finding: sliding window eval (~0.034) and int6+wider MLP (~0.029)
are the dominant validated techniques. Several promising combinations
remain untested across submissions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
xskuy pushed a commit to xskuy/parameter-golf that referenced this pull request Mar 19, 2026
Major improvements based on competition intelligence (day 2 PRs):

1. Sliding window eval (stride=256): overlapping windows give each token
   more context. Free ~0.03 bpb improvement, zero artifact cost.
   Based on PRs openai#70, openai#77, openai#65.

2. Int6 quantization: configurable WEIGHT_QUANT_BITS (default 6) and
   EMBED_QUANT_BITS (default 8). Saves ~25% artifact space vs int8,
   allowing bigger models. Based on PRs openai#78, openai#70.

3. MLP 3x expansion: MLP_MULT_NUM=3 (up from 8/3). Wider MLP gives
   ~0.019 bpb improvement. Based on PRs openai#70, openai#66.

4. Default dim=512 with LR=0.03 (best config from experiments).

5. forward_logits() helper for sliding window (avoids model.forward
   which returns loss, not logits).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
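
For readers unfamiliar with the sliding window eval mentioned in the commit above, here is a minimal sketch of the window planning only. It is not taken from any of the referenced PRs: the function name and the (start, end, score_from) layout are invented for illustration, and it ignores the one-token shift between inputs and targets.

```python
def sliding_windows(num_tokens, seq_len, stride):
    """Yield (start, end, score_from) triples for overlapping eval windows.

    The model would see tokens[start:end], but only positions
    score_from..end-1 count toward the loss, so each token is scored
    exactly once, with earlier tokens in the window serving as context.
    """
    if not 0 < stride <= seq_len:
        raise ValueError("stride must be in (0, seq_len]")
    start = 0
    scored_upto = 0
    while scored_upto < num_tokens:
        end = min(start + seq_len, num_tokens)
        yield start, end, scored_upto
        scored_upto = end
        start += stride


# Toy example: 10 tokens, windows of 4, stride 2.
for window in sliding_windows(10, 4, 2):
    print(window)
```

After the first window, every scored token has at least seq_len - stride tokens of context, at the cost of roughly seq_len / stride times more forward passes. Setting stride equal to seq_len recovers plain non-overlapping evaluation.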