Skip to content

Record: 0.6864 BPB — K-LoRA + Min-NLL + FlashAttention-3#614

Closed
bigbag wants to merge 1 commit intoopenai:mainfrom
bigbag:submission/klora-minnll-0.6864
Closed

Record: 0.6864 BPB — K-LoRA + Min-NLL + FlashAttention-3#614
bigbag wants to merge 1 commit intoopenai:mainfrom
bigbag:submission/klora-minnll-0.6864

Conversation

@bigbag
Copy link

@bigbag bigbag commented Mar 24, 2026

Summary

  • val_bpb: 0.6864 (seed 42, 8xH100 SXM)
  • Artifact: 15.53 MB (97.1% of 16MB limit)
  • Training: 600s (7313 steps at 82ms/step)
  • Eval: 588s (within 600s budget)

Three Innovations

1. K-Projection LoRA (from PR #611)

LoRA on K projections (not just Q/V) with 0.3x LR multiplier. Adapting K alongside Q/V gives more expressive per-document specialization at marginal compute cost.

2. Min-NLL Epoch Selection (from PR #611)

Track minimum average NLL per document across all TTT epochs, use best epoch's scores. Prevents late-epoch overfitting — safely run 6 epochs without any document degrading.

3. FlashAttention-3 (our addition)

flash_attn_func for causal attention + Rotary cache .clone() fix for CUDA graph compatibility. ~3% speed boost.

LoRA TTT Details

  • Rank-8 Q/K/V LoRA + rank-16 LM-head LoRA
  • Per-block bias tuning, per-document reset at BOS boundaries
  • Batched 64 docs/GPU, Adam lr=0.01, 6 epochs, per-step cosine LR
  • Per-layer LR: LM-head 2x, V 1.5x, Q 0.5x, K 0.3x, bias 3x
  • Temperature rescaling T=0.98, wall-clock deadline 550s

Results

  • Pre-quant BPB: 1.1624
  • Post-quant BPB: 1.1755 (quant gap: 0.013)
  • Post-TTT BPB: 0.6864 (TTT gain: 0.489)
  • Eval breakdown: short docs 27s + long docs 340s + scoring 221s = 588s

Based On

Test plan

  • Artifact under 16MB (15.53MB)
  • Training under 600s (600s)
  • Eval under 600s (588s)

Built on PR openai#611 (Chimera TTT) with our FlashAttention-3 addition.

K-Projection LoRA: LoRA on Q/K/V (not just Q/V), K at 0.3x LR.
Min-NLL epoch selection: track best epoch per doc, prevents overfitting.
6 TTT epochs within 600s eval budget (588s actual).

Architecture: 10L 512d GQA 8/4, EMA 0.999, SWA, compiled Muon,
train_seq_len=1024, int6+zstd-22. 7313 steps at 82ms/step.

Result: pre-quant 1.1624, post-quant 1.1755, post-TTT 0.6864.
Artifact 15.53MB, eval 588s. Seed 42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@valerio-oai
Copy link
Contributor

The min-NLL scheme leaks information. This is the same as training on the val set, and is therefore disallowed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants