fix(attention): disable Track E v2 — 50%+ perf regression vs cuBLAS#357

Merged
github-actions[bot] merged 1 commit into main from fix/track-e-v2-perf-regression on May 21, 2026

kekzl (Owner) commented May 21, 2026

Urgent perf hotfix

PR #356 shipped Track E v2 as the default prefill path. It's correct on all 6 production models (the original goal) but catastrophically slower than cuBLAS:

| Sequence length | Track E v2 (tok/s) | cuBLAS (tok/s) | Δ |
|---|---|---|---|
| 512 | 10295 | 12225 | -15.8% |
| 4096 | 4906 | 9905 | -50.5% |
| 8192 | 3095 | 8199 | -62.2% |

A/B measured on identical imp:test image, Qwen3-8B Q8_0, 3 reps each. Production has been running with this regression since #356 merged ~1h ago.
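The Δ column can be reproduced from the two throughput columns. The helper below is a minimal sketch of that arithmetic; the function name `relative_delta_percent` is an illustration and does not come from the repository.

```cpp
#include <cmath>

// Relative throughput change of a candidate path against a baseline, in percent.
// Negative values mean the candidate is slower than the baseline.
// Hypothetical helper for illustration only; not part of the PR.
double relative_delta_percent(double candidate_tok_s, double baseline_tok_s) {
    return (candidate_tok_s / baseline_tok_s - 1.0) * 100.0;
}
```

For example, `relative_delta_percent(4906.0, 9905.0)` evaluates to roughly -50.5, which agrees with the pp4096 row. The other rows agree with the reported figures to within rounding of the published tok/s values.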

Root cause

v2's PV phase uses a scalar FP32 loop (V loaded FP16, upcast to FP32 inline, O accumulates FP32). This sidesteps the FP16 P-fragment numerical issue that bit v1 — but throws away tensor cores for PV.

cuBLAS uses tensor cores for both QKᵀ and PV. v2 uses tensor cores only for QKᵀ. The asymmetric loss of TC throughput on PV is the bottleneck.
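To make the cost of the scalar PV phase concrete, here is a host-side C++ sketch of the same access pattern for a single query row. It is not the kernel from the PR: the function name and layout are assumptions, and V is held as `float` for portability, whereas the kernel described above loads V as FP16 and upcasts it inline.

```cpp
#include <cstddef>
#include <vector>

// One query row of O = P * V, accumulated in FP32 with scalar multiply-adds.
// p: kv_len attention probabilities for this row.
// v: kv_len x head_dim values, row-major.
// Assumption: v is float here; in the kernel described above it is FP16,
// upcast to FP32 before each multiply.
std::vector<float> pv_row_scalar(const std::vector<float>& p,
                                 const std::vector<float>& v,
                                 std::size_t head_dim) {
    std::vector<float> o(head_dim, 0.0f);  // FP32 accumulator for the output row
    for (std::size_t j = 0; j < p.size(); ++j) {
        for (std::size_t d = 0; d < head_dim; ++d) {
            o[d] += p[j] * v[j * head_dim + d];  // one scalar multiply-add per element
        }
    }
    return o;
}
```

Each output row costs kv_len × head_dim scalar multiply-adds, so the work grows with sequence length and none of it is issued as tensor-core tile operations. That is consistent with the regression widening from pp512 to pp8192.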

What this PR does

Adds an unconditional `return false;` at the launcher entry. The dispatcher then falls back to cuBLAS / FMHA, as it did before #350. A comment in the code documents the perf numbers and the path to a future re-enable.
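The disable mechanism can be sketched as follows. This is a simplified illustration under assumed names: `launch_track_e_v2` and `dispatch_prefill_attention` are hypothetical stand-ins, and the real launcher and dispatcher signatures are not shown in this PR description.

```cpp
#include <string>

// Hypothetical stand-in for the Track E v2 launcher. It returns true only if
// it handled the attention call. The unconditional early return disables it.
bool launch_track_e_v2(int seq_len) {
    (void)seq_len;
    return false;  // disabled: slower than cuBLAS at every measured length
    // The kernel launch would follow here once the path is re-enabled.
}

// Hypothetical dispatcher. It still prefers Track E, but because the launcher
// reports "not handled", every call takes the fallback path.
std::string dispatch_prefill_attention(int seq_len) {
    if (launch_track_e_v2(seq_len)) {
        return "track_e_v2";
    }
    return "cublas_or_fmha";
}
```

Gating at the launcher rather than in the dispatcher keeps the dispatcher and the public header untouched, so re-enabling the path later only requires removing the early return.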

What stays

  • Track E v2 kernel code stays in tree (700 LOC at src/compute/attention_tiled_streaming.cu)
  • Layout probe from PR #355 (track-e/v2-groundwork: layout probe + bug analysis) stays as the regression gate
  • Public header unchanged
  • Dispatcher unchanged (Track E preferred → falls to cuBLAS via false return)

Next optimisation pass

Replace v2's scalar FP32 PV with WMMA m16n8k16 PV using FP32-accumulated P repack. The key insight: P doesn't need to be stored to smem at all — keep it in registers in FP32, pack to FP16 just before each mma call. v1 stored P to smem then re-loaded as FP16 fragments (corrupting precision). v2 went scalar to avoid that. A third design: in-register FP32 P, pack-and-mma per WMMA tile.

Worth a focused 1-day experiment with the layout probe as a regression gate.

Verification

  • Build clean
  • cuBLAS prefill perf restored (pp4096 = 9558 tok/s with the disable in place, within about 3.5% of the cuBLAS-only A/B baseline of 9905 tok/s)
  • verify-fast pre-push hook green

🤖 Generated with Claude Code

PR #356 shipped Track E v2 (WMMA-based, correct on all 6 models) but
A/B perf measurement on Qwen3-8B Q8_0 shows scalar FP32 PV in v2 is
much slower than cuBLAS's tensor-core PV path:

  pp512:  v2 = 10295 tok/s  vs cuBLAS = 12225 (-15.8%)
  pp4096: v2 =  4906 tok/s  vs cuBLAS =  9905 (-50.5%)
  pp8192: v2 =  3095 tok/s  vs cuBLAS =  8199 (-62.2%)

Production was running with this regression since #356 merged.

This commit re-disables Track E at the launcher entry (early return false).
Dispatcher falls back to cuBLAS / FMHA as before #350. No correctness
impact — Track E v2 is correct, just too slow to ship.

Kernel stays in tree. Next optimisation pass: replace the scalar FP32
PV with WMMA m16n8k16 PV using careful FP16 P-frag handling (the issue
that bit v1, which v2 sidestepped by going scalar). Worth retrying with
the layout probe (#355) as a regression gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot enabled auto-merge (squash) May 21, 2026 21:49
@github-actions github-actions Bot merged commit ef2a8d5 into main May 21, 2026
3 checks passed