Record: 0.6864 BPB — K-LoRA + Min-NLL + FlashAttention-3 by bigbag · Pull Request #614 · openai/parameter-golf

bigbag · 2026-03-24T13:18:37Z

Summary

val_bpb: 0.6864 (seed 42, 8xH100 SXM)
Artifact: 15.53 MB (97.1% of 16MB limit)
Training: 600s (7313 steps at 82ms/step)
Eval: 588s (within 600s budget)

Three Innovations

1. K-Projection LoRA (from PR #611)

LoRA on K projections (not just Q/V) with 0.3x LR multiplier. Adapting K alongside Q/V gives more expressive per-document specialization at marginal compute cost.

2. Min-NLL Epoch Selection (from PR #611)

Track minimum average NLL per document across all TTT epochs, use best epoch's scores. Prevents late-epoch overfitting — safely run 6 epochs without any document degrading.

3. FlashAttention-3 (our addition)

flash_attn_func for causal attention + Rotary cache .clone() fix for CUDA graph compatibility. ~3% speed boost.

LoRA TTT Details

Rank-8 Q/K/V LoRA + rank-16 LM-head LoRA
Per-block bias tuning, per-document reset at BOS boundaries
Batched 64 docs/GPU, Adam lr=0.01, 6 epochs, per-step cosine LR
Per-layer LR: LM-head 2x, V 1.5x, Q 0.5x, K 0.3x, bias 3x
Temperature rescaling T=0.98, wall-clock deadline 550s

Results

Pre-quant BPB: 1.1624
Post-quant BPB: 1.1755 (quant gap: 0.013)
Post-TTT BPB: 0.6864 (TTT gain: 0.489)
Eval breakdown: short docs 27s + long docs 340s + scoring 221s = 588s

Based On

PR Record: Chimera TTT — K-Projection LoRA + Min-NLL (0.5601 BPB, 3-seed mean) #611 (Chimera TTT, K-LoRA + Min-NLL)
PR Record: DeepQuant V10b — 11L INT6 + 8ep LoRA TTT (val_bpb=0.6430) #596 (DeepQuant V10b, LoRA TTT architecture)

Test plan

Artifact under 16MB (15.53MB)
Training under 600s (600s)
Eval under 600s (588s)

Built on PR openai#611 (Chimera TTT) with our FlashAttention-3 addition. K-Projection LoRA: LoRA on Q/K/V (not just Q/V), K at 0.3x LR. Min-NLL epoch selection: track best epoch per doc, prevents overfitting. 6 TTT epochs within 600s eval budget (588s actual). Architecture: 10L 512d GQA 8/4, EMA 0.999, SWA, compiled Muon, train_seq_len=1024, int6+zstd-22. 7313 steps at 82ms/step. Result: pre-quant 1.1624, post-quant 1.1755, post-TTT 0.6864. Artifact 15.53MB, eval 588s. Seed 42. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

valerio-oai · 2026-03-24T15:25:51Z

The min-NLL scheme leaks information. This is the same as training on the val set, and is therefore disallowed.

valerio-oai closed this Mar 24, 2026

notapplica mentioned this pull request Mar 24, 2026

Parameter Golf Live AI Commentary + Analysis / Ideas | every 10 minutes #140

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: 0.6864 BPB — K-LoRA + Min-NLL + FlashAttention-3#614

Record: 0.6864 BPB — K-LoRA + Min-NLL + FlashAttention-3#614
bigbag wants to merge 1 commit intoopenai:mainfrom
bigbag:submission/klora-minnll-0.6864

bigbag commented Mar 24, 2026 •

edited

Loading

Uh oh!

valerio-oai commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bigbag commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Three Innovations

1. K-Projection LoRA (from PR #611)

2. Min-NLL Epoch Selection (from PR #611)

3. FlashAttention-3 (our addition)

LoRA TTT Details

Results

Based On

Test plan

Uh oh!

valerio-oai commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

bigbag commented Mar 24, 2026 •

edited

Loading