fix(attention): disable Track E v2 — 50%+ perf regression vs cuBLAS #357
Merged
PR #356 shipped Track E v2 (WMMA-based, correct on all 6 models), but A/B perf measurement on Qwen3-8B Q8_0 shows that the scalar FP32 PV in v2 is much slower than cuBLAS's tensor-core PV path:

  pp512:  v2 = 10295 tok/s vs cuBLAS = 12225 (-15.8%)
  pp4096: v2 =  4906 tok/s vs cuBLAS =  9905 (-50.5%)
  pp8192: v2 =  3095 tok/s vs cuBLAS =  8199 (-62.2%)

Production has been running with this regression since #356 merged.

This commit re-disables Track E at the launcher entry (early return false). The dispatcher falls back to cuBLAS / FMHA as before #350. No correctness impact: Track E v2 is correct, just too slow to ship. The kernel stays in tree.

Next optimisation pass: replace the scalar FP32 PV with WMMA m16n8k16 PV using careful FP16 P-frag handling (the issue that bit v1, which v2 sidestepped by going scalar). Worth retrying with the layout probe (#355) as a regression gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Urgent perf hotfix
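PR #356 shipped Track E v2 as the default prefill path. It is correct on all 6 production models (the original goal) but 15.8% to 62.2% slower than cuBLAS, with the gap widening as the prompt grows:
pp512: v2 = 10295 tok/s vs cuBLAS = 12225 tok/s (-15.8%)
pp4096: v2 = 4906 tok/s vs cuBLAS = 9905 tok/s (-50.5%)
pp8192: v2 = 3095 tok/s vs cuBLAS = 8199 tok/s (-62.2%)
A/B measured on an identical imp:test image, Qwen3-8B Q8_0, 3 reps each. Production has been running with this regression since #356 merged ~1h ago.
Root cause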
v2's PV phase uses a scalar FP32 loop (V loaded FP16, upcast to FP32 inline, O accumulates FP32). This sidesteps the FP16 P-fragment numerical issue that bit v1 — but throws away tensor cores for PV.
cuBLAS uses tensor cores for both QKᵀ and PV. v2 uses tensor cores only for QKᵀ. The asymmetric loss of TC throughput on PV is the bottleneck.
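To make the shape of that PV phase concrete, here is a minimal CPU-side reference of a scalar FP32 PV accumulation for a single query row. It is only a sketch of the arithmetic described above, not the kernel in the tree: the names `half_bits_to_float` and `pv_scalar_fp32`, the row-major V layout, and the use of raw 16-bit words to stand in for FP16 storage are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Decode IEEE 754 binary16 bits into an FP32 value.
// Handles zeros, subnormals, normals, and Inf/NaN.
inline float half_bits_to_float(std::uint16_t h) {
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    const std::uint32_t exp = (h >> 10) & 0x1Fu;
    std::uint32_t man = h & 0x3FFu;
    std::uint32_t bits;
    if (exp == 0) {
        if (man == 0) {
            bits = sign;  // signed zero
        } else {
            // Subnormal: shift the mantissa up until the implicit bit appears.
            std::uint32_t shift = 0;
            while ((man & 0x400u) == 0) { man <<= 1; ++shift; }
            man &= 0x3FFu;
            bits = sign | ((113u - shift) << 23) | (man << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13);  // Inf or NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13);  // rebias 15 -> 127
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical reference: o[d] += sum_j p[j] * float(v[j][d]) for one query row.
// V is stored as FP16 bits (row-major, kv_len x head_dim) and upcast inline;
// the output accumulator stays in FP32 throughout.
inline void pv_scalar_fp32(const float* p, const std::uint16_t* v_fp16,
                           float* o, int kv_len, int head_dim) {
    for (int j = 0; j < kv_len; ++j) {
        const float pj = p[j];
        for (int d = 0; d < head_dim; ++d) {
            o[d] += pj * half_bits_to_float(v_fp16[j * head_dim + d]);
        }
    }
}
```

Every multiply-add here is an ordinary scalar FP32 operation, which is the work that a tensor-core PV path would instead issue as matrix-multiply instructions.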
What this PR does
Adds an unconditional return false; at the launcher entry. The dispatcher falls back to cuBLAS / FMHA, as it did before #350. A comment in the code documents the perf numbers and the path to a future re-enable.
What stays
The kernel stays in tree (src/compute/attention_tiled_streaming.cu)
Next optimisation pass
Replace v2's scalar FP32 PV with WMMA m16n8k16 PV using FP32-accumulated P repack. The key insight: P doesn't need to be stored to smem at all — keep it in registers in FP32, pack to FP16 just before each mma call. v1 stored P to smem then re-loaded as FP16 fragments (corrupting precision). v2 went scalar to avoid that. A third design: in-register FP32 P, pack-and-mma per WMMA tile.
Worth a focused 1-day experiment with the layout probe as a regression gate.
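The pack step at the heart of that third design can be modelled on the CPU as below. This is a sketch under stated assumptions, not the planned kernel: `float_to_half_bits`, `pack_half2`, and `pack_p_fragment` are hypothetical names, the eight-values-per-thread fragment size is chosen for illustration, and the mapping of values to threads inside a real mma fragment is not modelled. On the device, the CUDA intrinsic `__floats2half2_rn` from `cuda_fp16.h` would probably take the place of the hand-written conversion.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Convert FP32 to IEEE 754 binary16 bits with round-to-nearest-even.
inline std::uint16_t float_to_half_bits(float value) {
    std::uint32_t x;
    std::memcpy(&x, &value, sizeof x);
    const std::uint32_t sign = (x >> 16) & 0x8000u;
    x &= 0x7FFFFFFFu;
    std::uint32_t out;
    if (x >= 0x47800000u) {
        // |value| >= 65536, Inf, or NaN: emit Inf, or a quiet NaN for NaN input.
        out = (x > 0x7F800000u) ? 0x7E00u : 0x7C00u;
    } else if (x < 0x38800000u) {
        // Result is an FP16 subnormal or zero. Adding 0.5f lets the FP32 adder
        // do the rounding and leaves the 10 result bits at the bottom.
        float f;
        std::memcpy(&f, &x, sizeof f);
        f += 0.5f;
        std::memcpy(&x, &f, sizeof x);
        out = x - 0x3F000000u;
    } else {
        // Normal range: rebias the exponent and round to nearest even.
        const std::uint32_t mant_odd = (x >> 13) & 1u;
        x += 0xC8000FFFu;  // exponent bias 127 -> 15, plus rounding bias 0xFFF
        x += mant_odd;
        out = x >> 13;
    }
    return static_cast<std::uint16_t>(sign | out);
}

// Pack two FP32 values into one 32-bit word: lo in bits 0-15, hi in bits 16-31.
inline std::uint32_t pack_half2(float lo, float hi) {
    return static_cast<std::uint32_t>(float_to_half_bits(lo)) |
           (static_cast<std::uint32_t>(float_to_half_bits(hi)) << 16);
}

// Hypothetical per-thread fragment: P stays in FP32 until this call, then
// eight probabilities become four packed FP16x2 words for the PV multiply.
inline std::array<std::uint32_t, 4> pack_p_fragment(
        const std::array<float, 8>& p_fp32) {
    std::array<std::uint32_t, 4> regs{};
    for (std::size_t i = 0; i < regs.size(); ++i) {
        regs[i] = pack_half2(p_fp32[2 * i], p_fp32[2 * i + 1]);
    }
    return regs;
}
```

Because the conversion happens once, immediately before the multiply, P is rounded to FP16 exactly one time and never passes through shared memory.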
Verification
🤖 Generated with Claude Code