Record: MLP3x + Int8 Tok Emb + Grouped LZMA + Sliding Window (val_bpb=1.1623) #160
ChaseWNorton wants to merge 1 commit into openai:main from
MLP3x + Int8 Tok Emb + Grouped LZMA + Sliding Window
This submission starts from the public 10-minute baseline family and makes three main changes:

- The MLP expansion factor is raised from 2x to 3x.
- Evaluation uses a sliding window with seq_len=2048.
- The token embedding is stored at int8 with grouped LZMA compression in the exported artifact.

The model was trained for the official 600s wallclock limit on 8x H100 SXM, then repacked into a submission-valid artifact under the 16,000,000-byte limit.

Model
```
VOCAB_SIZE=1024
NUM_LAYERS=9
MODEL_DIM=512
NUM_HEADS=8
NUM_KV_HEADS=4
MLP_MULT=3
TIE_EMBEDDINGS=1
```

This keeps the backbone close to the baseline while spending more of the parameter budget on the MLP.
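As a rough cross-check on where that budget goes, a back-of-the-envelope parameter count from the configuration above (an estimate only: it ignores norm scales and other small scalar parameters, so the script's exact count will differ slightly):

```python
# Back-of-the-envelope parameter count for the configuration above.
VOCAB_SIZE, NUM_LAYERS, MODEL_DIM = 1024, 9, 512
NUM_HEADS, NUM_KV_HEADS, MLP_MULT = 8, 4, 3
head_dim = MODEL_DIM // NUM_HEADS  # 64

embed = VOCAB_SIZE * MODEL_DIM                     # tied with the LM head
q_proj = MODEL_DIM * NUM_HEADS * head_dim          # full-width queries
kv_proj = 2 * MODEL_DIM * NUM_KV_HEADS * head_dim  # grouped K/V (GQA)
o_proj = MODEL_DIM * MODEL_DIM
mlp = 2 * MODEL_DIM * (MLP_MULT * MODEL_DIM)       # up + down projections

total = embed + NUM_LAYERS * (q_proj + kv_proj + o_proj + mlp)
print(f"{total:,}")  # 21,757,952
```

At fp32 that is roughly 87 MB, in the same ballpark as the 86099351-byte raw checkpoint logged below.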
Training Setup
The timed run used:
```
TRAIN_BATCH_TOKENS=786432
TRAIN_SEQ_LEN=2048
ITERATIONS=20000
MAX_WALLCLOCK_SECONDS=600
WARMUP_STEPS=20
WARMDOWN_ITERS=3000
TIED_EMBED_LR=0.03
MATRIX_LR=0.02
SCALAR_LR=0.02
MUON_MOMENTUM=0.99
```

Logged optimizer summary:

```
tie_embeddings:True embed_lr:0.03 head_lr:0.0 matrix_lr:0.02 scalar_lr:0.02
```

Logged attention summary:

```
attention_mode:gqa num_heads:8 num_kv_heads:4
```

The script includes QAT support, but this specific timed run stopped before QAT activation:

```
qat_enabled:True qat_start_frac:0.500 qat_start_step:10000
```

The run ended at step 7534, before the QAT start step, so the final reported result comes from post-training repacking of the timed checkpoint rather than from a checkpoint that had entered the QAT phase.
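The schedule arithmetic implied by these settings can be sketched as follows. The QAT start step follows directly from the logged fraction; the trapezoidal LR shape is an assumption (a common choice in this baseline family), not something the logs above state explicitly:

```python
ITERATIONS, WARMUP_STEPS, WARMDOWN_ITERS = 20000, 20, 3000
QAT_START_FRAC, STOP_STEP = 0.5, 7534

# QAT would have switched on halfway through the planned 20000 steps,
# but the wallclock cap stopped the run first.
qat_start_step = int(QAT_START_FRAC * ITERATIONS)
print(qat_start_step, STOP_STEP < qat_start_step)  # 10000 True

def lr_scale(step: int) -> float:
    """Assumed trapezoidal multiplier: linear warmup, flat, linear warmdown."""
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS
    if step >= ITERATIONS - WARMDOWN_ITERS:
        return (ITERATIONS - step) / WARMDOWN_ITERS
    return 1.0

print(lr_scale(0), lr_scale(10000), lr_scale(18500))  # 0.05 1.0 0.5
```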
Timed Training Result
The official training run stopped at the wallclock cap:
```
step:7534/20000 train_time:600120ms step_avg:79.65ms
```

Validation at stop:

```
val_loss=1.9844 val_bpb=1.1753
```

Other logged details:

```
16738 MiB
16944 MiB
86099351 bytes
```

Export / Compression
The first export evaluated from the timed run used:
- fp16 token embedding passthrough
- fp16 passthrough for the last two c_k weights
- zlib compression

That version evaluated very well but was not submission-valid because it was over the size cap:

```
16639274 bytes  val_bpb=1.18101095 (standard)  val_bpb=1.16018011 (sliding window)
```

The final submission-valid repack uses:

- QGv3 serialization
- lzma compression
- tok_emb.weight and the fp16 passthrough tensors

This change was enough to get back under the limit while preserving almost all of the sliding-window gain.
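A minimal sketch of the repack idea: group-wise low-bit quantization of a weight tensor, then LZMA over the packed codes. This is an illustrative stand-in, not the actual QGv3 serializer; the group size, min/max rounding scheme, and one-code-per-byte layout are all assumptions:

```python
import lzma
import random
import struct

GROUP = 64               # assumed quantization group size
BITS = 6                 # 6-bit codes for non-embedding tensors (per the notes)
LEVELS = (1 << BITS) - 1

def quantize_groups(w):
    """Group-wise min/max quantization to integer codes plus per-group (lo, scale)."""
    codes, scales = [], []
    for i in range(0, len(w), GROUP):
        g = w[i:i + GROUP]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / LEVELS or 1.0
        scales.append((lo, scale))
        codes.extend(round((x - lo) / scale) for x in g)
    return codes, scales

random.seed(0)
weight = [random.gauss(0.0, 0.02) for _ in range(256 * 1024)]

codes, scales = quantize_groups(weight)
raw_fp32 = struct.pack(f"{len(weight)}f", *weight)
packed = lzma.compress(bytes(codes), preset=6)  # one byte per 6-bit code

print(len(raw_fp32), len(packed))  # packed is far smaller than the fp32 bytes
```

Bit-packing the codes (6 bits each) before compression would shrink the artifact further; this sketch keeps one code per byte and lets LZMA's entropy coder absorb the slack.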
Final Submission Artifact
```
final_model.mixed_tok8_lzma.ptz  lzma  15845980  64924159  10904
```

This is submission-valid under the 16,000,000-byte cap.

Final Scores
Exact post-pack scores for the final under-cap artifact:
standard eval:
```
val_loss=1.99867543 val_bpb=1.18372817
```

sliding-window eval with seq_len=2048, stride=256:

```
val_loss=1.96250243 val_bpb=1.16230441
```

The submission score is therefore:

```
1.16230441 val_bpb
```

Notes
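One way to sanity-check the paired numbers above: assuming the usual conversion val_bpb = (val_loss / ln 2) / bytes_per_token (an assumption about how the eval harness computes bpb, not something stated in the logs), both logged pairs should imply the same bytes-per-token ratio for the validation set:

```python
import math

# Implied bytes-per-token ratio from each logged (val_loss, val_bpb) pair.
pairs = [(1.99867543, 1.18372817),   # standard eval
         (1.96250243, 1.16230441)]   # sliding-window eval
ratios = [loss / (math.log(2) * bpb) for loss, bpb in pairs]
print([round(r, 4) for r in ratios])  # both ≈ 2.4359
```

The two ratios agree to about four decimal places, which is consistent with both scores being computed over the same validation bytes.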
- The token embedding is kept at 8-bit while quantizing the rest of the model to 6-bit.
- The final artifact is a plain torch.save artifact.
- Weights are restored to fp32 to match the trained checkpoint behavior after reload.

Included Files
- README.md - this writeup
- submission.json - submission metadata
- train.log - exact log from the timed 8x H100 SXM run
- train_gpt.py - code snapshot for the submission artifact and evaluation path
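For reference, the sliding-window scoring scheme implied by seq_len=2048 / stride=256 can be sketched as below. The exact mechanics in train_gpt.py may differ; this assumes each window after the first scores only its final stride positions, so every scored token sees long left context:

```python
def window_spans(n_tokens, seq_len=2048, stride=256):
    """Yield (window_start, window_end, first_scored_offset) for each window."""
    start = 0
    while start + seq_len <= n_tokens:
        # First window scores everything; later windows score only the
        # final `stride` positions, which have full left context.
        scored_from = 0 if start == 0 else seq_len - stride
        yield start, start + seq_len, scored_from
        start += stride

spans = list(window_spans(4096))
scored = sum(end - (start + off) for start, end, off in spans)
print(len(spans), scored)  # 9 windows, all 4096 tokens scored exactly once
```

Compared with chunked evaluation at stride equal to seq_len, most scored tokens see far more context, which is where the sliding-window bpb gain over the standard eval comes from.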