perf(attention): Track E warp-spec 4+4 + perf validation report#351
Merged
Conversation
A/B on identical imp:test image (Qwen3-8B-Q8_0): - cuBLAS-only: pp512 = 12100 tok/s, tg128 = 154.65 tok/s - Track E: pp512 = 12724 tok/s, tg128 = 154.74 tok/s - Δ: +5.2% pp512, decode unchanged Track E's projected 3-5× attention-kernel speedup (from Säule 3 gating bench) translates to only +5.2% total prefill because attention is a small fraction of total prefill time (QKV proj + FFN dominate). The old 2026-05-14 baseline (pp512=13446) was from a different build/ environment and is superseded by this fresh measurement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real perf-bench A/B across 2 models × 3 seq lengths on RTX 5090. Track E gives +5-15% end-to-end prefill, growing with seq length (O(n²) attention) and with weight-format weight (NVFP4 weights amplify attention's share). | Model | seq | Track E | cuBLAS | Δ | |---|---:|---:|---:|---:| | Q8_0 | 512 | 12724 | 12100 | +5.2% | | Q8_0 | 4096 | 10830 | 9995 | +8.4% | | Q8_0 | 8192 | 9413 | 8216 | +14.6% | | NVFP4 | 4096 | 31925 | 28458 | +12.2% | | NVFP4 | 8192 | 31778 | 28384 | +12.0% | Gating bench projected 3-5× attention-kernel speedup; nsys profile confirms attention is 2.3% of pp512 prefill time on Q8_0, so Amdahl's law caps the end-to-end gain. Report quantifies all this. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single producer warp was bottlenecked on cp.async issue rate at long seq lengths. 4 producer warps × 4 consumer warps quadruples load throughput while still providing exactly one mma warp per Br/16 row-tile at Br=64. mbar counts adjusted: QKt_done and V_consumed init count 7→4. pp8192 Qwen3-8B Q8_0: 9413 → 9622 tok/s (+2.2%, avg of 2 runs) Correctness: all 9 TrackE_* tests pass unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #350 (Track E base kernel). Three commits stranded after the squash-merge:
96221a0perf(baseline) — refreshtests/perf_baseline.jsonwith Track E numbers, supersedes the 2026-05-14 baseline that was from a different build/env9044752docs(track-e) — comprehensive A/B perf validation report covering 2 models × 3 seq lengths9d8e74fperf(attention) — warp-spec change from 1 producer + 7 consumers to 4 producer + 4 consumer, +2.2% additional pp8192 gainValidated perf (A/B on identical image)
Speedup grows with seq length (O(n²) attention) and with weight-format leanness (NVFP4 amplifies attention's share). Decode (
tg128) unchanged across all configs.Why 4+4 over the originally-shipped 1+7
The single producer warp doing all
cp.asyncwas load-bandwidth bottlenecked at long sequences. Splitting load work across 4 producer warps (128 cp.async-issuing threads vs 32) reclaims throughput. At Br=64 only 4 consumer warps are mma-active anyway (4 row-tiles of 16 rows = 64), so dropping from 7 to 4 consumers loses nothing on the compute side.Sweet spot confirmed empirically: tested 2+6 (slower by -0.4%) and Br=128 (slower by -5.3% from reduced occupancy). 4+4 is the local optimum.
Other optimizations tried + reverted (not committed)
paged_kv_gather_nvfp4_to_fp16is only 0.5% of pp4096 NVFP4 time, so the gather isn't the lever; new kernel would need fused gather+NVFP4-attention (1-2 wk project) for ~+6-8% projectednsys profile reference
Q8_0 pp512 (post Track E + 4+4): attention is 2.3% of total prefill, dequant_q8_0 is 21.5%, CUTLASS FP16 GEMM (FFN) is 14.9%.
NVFP4 pp4096 (with
--kv-nvfp4): attention is 18.2% of total, CUTLASS NVFP4 GEMM is 49.8%,paged_kv_gather_nvfp4_to_fp16is 0.5%.Track E owns the attention slice. Further e2e wins are in dequant/GEMM territory (out of Track E scope).
Test plan
verify-fast(pre-push hook) greenTrackE_*correctness tests PASS (5 hd=128 + 1 hd=256 + 3 features; 1 hd=512 SKIP)test-attentionsuite: 129 PASS, 0 FAIL, 3 pre-existing skipsmake verify(long suite) — to be run after merge🤖 Generated with Claude Code