
New SOTA: 1.12676 BPB - 11L XSA-all(11) + GPTQ-lite + EMA + Late QAT #478

Open
gowtham0992 wants to merge 1 commit into openai:main from
gowtham0992:submission/11L-XSA-all-GPTQ-lite-EMA-LateQAT

Conversation

@gowtham0992

New SOTA Record: val_bpb 1.12676 (3-seed mean)

Beats the current SOTA (1.14276) by 0.016 BPB.

3-Seed Results (8xH100 SXM, 600s)

| Seed | BPB     | Size     |
|------|---------|----------|
| 42   | 1.12713 | 15.64 MB |
| 1337 | 1.12648 | 15.62 MB |
| 2024 | 1.12667 | 15.88 MB |
| Mean | 1.12676 | ~15.7 MB |

Key Techniques

  • XSA on ALL 11 layers
  • GPTQ-lite optimal clip percentile search
  • EMA(0.997) + Tight SWA
  • Late QAT int6-all at LR scale < 0.15
  • Raw binary + zstd22 serialization
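For concreteness, the EMA component above can be sketched as follows. This is a minimal illustration using the 0.997 decay from the list; the `EMA` class is a hypothetical helper, not the submission's actual code, and a real trainer would track tensors and swap the shadow weights in for evaluation and serialization.

```python
class EMA:
    """Exponential moving average of parameter values (decay 0.997, per the PR).

    Minimal sketch: `params` is any flat sequence of floats; real trainers
    apply the same update elementwise to each parameter tensor.
    """

    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)  # copy of the initial parameter values

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        d = self.decay
        self.shadow = [d * s + (1.0 - d) * p for s, p in zip(self.shadow, params)]
```

The "Tight SWA" half of that bullet would additionally average a few checkpoints near the end of training on top of the EMA weights; that part is not shown here.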

Dependencies

  • zstandard, flash_attn_3 (see requirements.txt)

Verified on RunPod 8xH100 SXM (official template): 1.12753 BPB

See README.md for full details.

@mohosy

mohosy commented Mar 23, 2026

xsa on all 11 layers is bold, most ppl only do the last 3 or 4. does it actually help on the early layers too, or is it just not hurting? also, the gptq-lite clip search is a nice touch, i haven't seen anyone else do that yet

@gowtham0992
Author

> xsa on all 11 layers is bold, most ppl only do the last 3 or 4. does it actually help on the early layers too, or is it just not hurting? also, the gptq-lite clip search is a nice touch, i haven't seen anyone else do that yet

thanks! yeah, xsa on all layers actually helps, it's not just "not hurting". here's the ablation:

XSA-all(11): 1.12676 BPB, 6764 steps, 88.7 ms/step
XSA(4), last layers only: 1.13266 BPB, 6998 steps, 85.7 ms/step

so it's a ~0.006 BPB win even though it's 3 ms/step slower and we lose ~230 steps. early layers tend to repeat self-value patterns; xsa forces them to actually encode new info. at 11 layers x 512 dim, every layer counts

gptq-lite is basically free: try 5 clip percentiles per row, pick the one with min MSE. adds ~2 seconds to the save and recovers ~0.0006 BPB of the quant gap.
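The clip search described above can be sketched like this. This is an illustration only, assuming symmetric per-row int6 quantization; the function name and the exact percentile grid are hypothetical, not the submission's code.

```python
import numpy as np


def best_clip_quantize(row, bits=6, percentiles=(99.0, 99.5, 99.9, 99.99, 100.0)):
    """Per-row clip search: for each candidate clip percentile, symmetric-quantize
    the row and keep the (quantized row, scale) with minimum reconstruction MSE.

    Sketch of the 'gptq-lite' idea described in the thread; the percentile grid
    is an assumption.
    """
    levels = 2 ** (bits - 1) - 1  # int6 symmetric -> values in [-31, 31]
    best = None
    for p in percentiles:
        clip = np.percentile(np.abs(row), p)
        if clip == 0:
            continue
        scale = clip / levels
        q = np.clip(np.round(row / scale), -levels, levels)
        mse = np.mean((q * scale - row) ** 2)
        if best is None or mse < best[0]:
            best = (mse, q.astype(np.int8), scale)
    return best[1], best[2]
```

Clipping below the row's max trades a little clipping error on outliers for finer resolution on the bulk of the weights, which is why a small grid search over percentiles can beat always using the full range.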
