New SOTA: 1.12676 BPB - 11L XSA-all(11) + GPTQ-lite + EMA + Late QAT #478
gowtham0992 wants to merge 1 commit into openai:main
Conversation
XSA on all 11 layers is bold; most people only do the last 3 or 4. Does it actually help on the early layers too, or is it just not hurting? Also, the GPTQ-lite clip search is a nice touch, I haven't seen anyone else do that yet.
Thanks! Yeah, XSA on all layers actually helps, it's not just "not hurting". The ablation:

- XSA-all(11): 1.12676 BPB, 6764 steps, 88.7 ms/step
- XSA(4), last 4 layers only: 1.13266 BPB, 6998 steps, 85.7 ms/step

So it's a ~0.006 BPB win even though it's ~3 ms/step slower and we lose ~230 steps. Early layers tend to repeat self-value patterns; XSA forces them to actually encode new information. At 11L / 512d, every layer counts.

GPTQ-lite is basically free: try 5 clip percentiles per row and pick the one with minimum MSE. It adds about 2 seconds to the save and recovers ~0.0006 BPB of the quantization gap.
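The per-row clip search described above ("5 clip percentiles per row, pick min MSE") can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: it assumes symmetric per-row int8 quantization, and the function name, percentile grid, and helper structure are all hypothetical.

```python
import numpy as np

# Hypothetical sketch of a GPTQ-lite-style per-row clip search:
# for each candidate clip percentile, quantize the row symmetrically
# to int8 and keep the clip with the lowest reconstruction MSE.
def clip_search_quantize(w_row, percentiles=(99.0, 99.5, 99.9, 99.99, 100.0), levels=127):
    best = None
    for p in percentiles:
        clip = np.percentile(np.abs(w_row), p)  # candidate clip threshold
        if clip == 0.0:
            continue  # all-zero row under this clip; nothing to quantize
        scale = clip / levels
        q = np.clip(np.round(w_row / scale), -levels, levels)
        deq = q * scale                          # dequantized reconstruction
        mse = float(np.mean((w_row - deq) ** 2))
        if best is None or mse < best[0]:
            best = (mse, q.astype(np.int8), scale)
    return best  # (mse, quantized int8 row, per-row scale)

# Usage: quantize one 512-wide weight row.
row = np.random.randn(512).astype(np.float32)
mse, q, scale = clip_search_quantize(row)
```

Since the search only adds a handful of quantize/dequantize passes per row, it is cheap at save time, which matches the "adds like 2 seconds to save" observation; clipping outliers before scaling is what buys back accuracy over naive full-range quantization.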
New SOTA Record: val_bpb 1.12676 (3-seed mean)
Beats the current SOTA (1.14276 BPB) by 0.016 BPB.
3-Seed Results (8xH100 SXM, 600s)
Key Techniques

- XSA on all 11 layers (vs. last-4-layers baseline)
- GPTQ-lite: per-row clip search over 5 percentiles, minimum-MSE selection
- EMA
- Late QAT (quantization-aware training late in the run)
Dependencies
zstandard, flash_attn_3 (see requirements.txt)

Verified on RunPod 8xH100 SXM (official template): 1.12753 BPB
See README.md for full details.