fix(attention): disable Track E v2 — 50%+ perf regression vs cuBLAS by kekzl · Pull Request #357 · kekzl/imp

kekzl · 2026-05-21T21:49:15Z

Urgent perf hotfix

PR #356 shipped Track E v2 as the default prefill path. It's correct on all 6 production models (the original goal) but 15.8% to 62.2% slower than cuBLAS on prefill, with the gap widening as sequence length grows:

| Sequence length (tokens) | Track E v2 (tok/s) | cuBLAS (tok/s) | Δ |
|---|---|---|---|
| 512 | 10295 | 12225 | -15.8% |
| 4096 | 4906 | 9905 | -50.5% |
| 8192 | 3095 | 8199 | -62.2% |

A/B measured on the same imp:test image with Qwen3-8B Q8_0, 3 reps each. Production has been running with this regression since #356 merged ~1h ago.

Root cause

v2's PV phase uses a scalar FP32 loop (V loaded FP16, upcast to FP32 inline, O accumulates FP32). This sidesteps the FP16 P-fragment numerical issue that bit v1 — but throws away tensor cores for PV.

cuBLAS uses tensor cores for both QKᵀ and PV. v2 uses tensor cores only for QKᵀ. The asymmetric loss of TC throughput on PV is the bottleneck.
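To make the cost concrete, here is a host-side C++ sketch of what a scalar FP32 PV phase computes. It is not the kernel code: the function name, the row-major layout, and the use of plain `float` for V (standing in for FP16 loaded and upcast inline) are all assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

// Scalar reference for the PV phase: O (rows x head_dim) += P (rows x keys) * V (keys x head_dim).
// All matrices are row-major. Plain float stands in for V, which the kernel
// described above loads as FP16 and upcasts to FP32 inline.
// Hypothetical helper for illustration only.
inline void pv_scalar_fp32(const std::vector<float>& P, const std::vector<float>& V,
                           std::vector<float>& O,
                           std::size_t rows, std::size_t keys, std::size_t head_dim) {
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < keys; ++j) {
            const float p = P[i * keys + j];
            for (std::size_t d = 0; d < head_dim; ++d) {
                // One scalar multiply-add per (i, j, d) triple.
                O[i * head_dim + d] += p * V[j * head_dim + d];
            }
        }
    }
}
```

The loop issues `rows * keys * head_dim` scalar multiply-adds. For comparison, a single m16n8k16 tensor-core instruction covers 16 × 8 × 16 = 2048 multiply-adds per warp, which is the throughput the scalar path gives up.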

What this PR does

Adds an unconditional `return false;` at the launcher entry. The dispatcher then falls back to cuBLAS / FMHA, as it did before #350. A comment in the code documents the perf numbers and the path to a future re-enable.
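The shape of the change can be sketched as below. Every name here (`PrefillBackend`, `try_launch_track_e_v2`, `dispatch_prefill`) is hypothetical; the real launcher and dispatcher signatures are not shown in this PR description.

```cpp
// Hypothetical backend tag, used only to make the fallback observable.
enum class PrefillBackend { TrackE, CublasOrFmha };

// Hypothetical launcher entry. Returning false tells the caller that the
// Track E path did not handle the request.
inline bool try_launch_track_e_v2() {
    // Disabled: v2 measured 15.8% to 62.2% slower than cuBLAS on prefill.
    // Re-enable once the PV phase runs on tensor cores.
    return false;
}

// Hypothetical dispatcher. It is unchanged by the hotfix: Track E is still
// tried first, and the false return routes the call to cuBLAS / FMHA.
inline PrefillBackend dispatch_prefill() {
    if (try_launch_track_e_v2()) {
        return PrefillBackend::TrackE;
    }
    return PrefillBackend::CublasOrFmha;
}
```

Keeping the disable inside the launcher means a future re-enable only has to touch one function, with no dispatcher or header changes.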

What stays

- Track E v2 kernel code stays in tree (700 LOC at src/compute/attention_tiled_streaming.cu)
- Layout probe from PR #355 (track-e/v2-groundwork: layout probe + bug analysis) stays as the regression gate
- Public header unchanged
- Dispatcher unchanged (Track E preferred → falls to cuBLAS via false return)

Next optimisation pass

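Replace v2's scalar FP32 PV with a tensor-core m16n8k16 PV that repacks an FP32-accumulated P. (m16n8k16 is a PTX mma.sync shape, not one of the WMMA API's FP16 tiles, which are 16x16x16, 32x8x16 and 8x32x16, so "WMMA" here means warp-level tensor-core MMA in the general sense.) The key insight: P doesn't need to be stored to smem at all. Keep it in registers in FP32 and pack to FP16 just before each mma call. v1 stored P to smem, then re-loaded it as FP16 fragments (corrupting precision); v2 went scalar to avoid that. The third design is in-register FP32 P with pack-and-mma per m16n8k16 tile.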

Worth a focused 1-day experiment with the layout probe as a regression gate.
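The pack step could look like the host-side sketch below. It only illustrates the idea of holding P in FP32 and rounding to FP16 at the last moment, two values per 32-bit register, as the FP16 A operand of m16n8k16 expects. The function names are hypothetical, and the converter is deliberately simplified: it assumes 0 <= p <= 1 (softmax probabilities) and flushes values below 2^-14 to zero, whereas a real kernel would use the hardware FP32 to FP16 conversion, which keeps subnormals.

```cpp
#include <cstdint>
#include <cstring>

// Simplified FP32 -> FP16 bit conversion with round-to-nearest-even.
// Assumptions (illustration only): 0 <= p <= 1; values below the smallest
// normal half (2^-14) are flushed to zero; values >= 1 clamp to 1.0.
inline std::uint16_t prob_to_half_bits(float p) {
    std::uint32_t bits;
    std::memcpy(&bits, &p, sizeof bits);
    const std::uint32_t exp32 = (bits >> 23) & 0xFFu;   // biased FP32 exponent
    if (exp32 < 113u) return 0;                         // below 2^-14: flush to zero
    if (exp32 >= 127u) return 0x3C00u;                  // p >= 1: clamp to 1.0
    const std::uint32_t mant32 = bits & 0x7FFFFFu;
    // Rebias the exponent (127 -> 15) and keep the top 10 mantissa bits.
    std::uint32_t half = ((exp32 - 112u) << 10) | (mant32 >> 13);
    const std::uint32_t dropped = mant32 & 0x1FFFu;     // the 13 discarded bits
    if (dropped > 0x1000u || (dropped == 0x1000u && (half & 1u))) {
        ++half;                                         // a carry may bump the exponent
    }
    return static_cast<std::uint16_t>(half);
}

// Pack two adjacent FP32 P values into one 32-bit register image,
// first element in the low 16 bits (the same order a CUDA half2 uses).
inline std::uint32_t pack_p_pair(float p_first, float p_second) {
    return static_cast<std::uint32_t>(prob_to_half_bits(p_first)) |
           (static_cast<std::uint32_t>(prob_to_half_bits(p_second)) << 16);
}
```

Because P is rounded exactly once, immediately before the multiply, it never takes the smem round trip through FP16 fragments that the v1 design used.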

Verification

- Build clean
- cuBLAS prefill perf restored (pp4096 = 9558 tok/s with the disable, within about 3.5% of the cuBLAS-only A/B baseline of 9905)
- verify-fast pre-push hook green

🤖 Generated with Claude Code

PR #356 shipped Track E v2 (WMMA-based, correct on all 6 models) but A/B perf measurement on Qwen3-8B Q8_0 shows scalar FP32 PV in v2 is much slower than cuBLAS's tensor-core PV path:

pp512: v2 = 10295 tok/s vs cuBLAS = 12225 (-15.8%)
pp4096: v2 = 4906 tok/s vs cuBLAS = 9905 (-50.5%)
pp8192: v2 = 3095 tok/s vs cuBLAS = 8199 (-62.2%)

Production was running with this regression since #356 merged.

This commit re-disables Track E at the launcher entry (early return false). Dispatcher falls back to cuBLAS / FMHA as before #350.

No correctness impact: Track E v2 is correct, just too slow to ship. Kernel stays in tree.

Next optimisation pass: replace the scalar FP32 PV with WMMA m16n8k16 PV using careful FP16 P-frag handling (the issue that bit v1, which v2 sidestepped by going scalar). Worth retrying with the layout probe (#355) as a regression gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot enabled auto-merge (squash) May 21, 2026 21:49

github-actions Bot merged commit ef2a8d5 into main May 21, 2026
3 checks passed

github-actions[bot] merged 1 commit into main from fix/track-e-v2-perf-regression
