perf(attention): Track E warp-spec 4+4 + perf validation report by kekzl · Pull Request #351 · kekzl/imp

kekzl · 2026-05-21T12:17:02Z

Summary

Follow-up to #350 (Track E base kernel). Three commits stranded after the squash-merge:

96221a0 perf(baseline) — refresh tests/perf_baseline.json with Track E numbers, supersedes the 2026-05-14 baseline that was from a different build/env
9044752 docs(track-e) — comprehensive A/B perf validation report covering 2 models × 3 seq lengths
9d8e74f perf(attention) — warp-spec change from 1 producer + 7 consumers to 4 producer + 4 consumer, +2.2% additional pp8192 gain

Validated perf (A/B on identical image)

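| Model | seq | Track E (4+4), tok/s | cuBLAS, tok/s | Δ |
|---|---:|---:|---:|---:|
| Qwen3-8B Q8_0 | 512 | 12724 | 12100 | +5.2% |
| Qwen3-8B Q8_0 | 4096 | 10830 | 9995 | +8.4% |
| Qwen3-8B Q8_0 | 8192 | 9622 | 8216 | +17.1% |
| Qwen3-8B NVFP4 | 4096 | 31925 | 28458 | +12.2% |
| Qwen3-8B NVFP4 | 8192 | 31778 | 28384 | +12.0% |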

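On Q8_0 the speedup grows with sequence length (attention is O(n²), so its share of prefill time grows), from +5.2% at 512 to +17.1% at 8192. At 4096 the gain is larger on NVFP4 than on Q8_0 because the leaner weight format leaves attention a larger share of prefill time; on NVFP4 the gain is flat at about +12% between 4096 and 8192. Decode (tg128) is unchanged across all configs.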
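As a quick sanity check, the Δ column can be recomputed from the two throughput columns. The sketch below is illustrative only: the `RESULTS` dictionary and the `delta_percent` helper are names made up for this example, and the numbers are copied from the table above.

```python
# Prefill throughput in tok/s, copied from the table: (Track E 4+4, cuBLAS).
# RESULTS and delta_percent are hypothetical names used only for this check.
RESULTS = {
    ("Q8_0", 512): (12724, 12100),
    ("Q8_0", 4096): (10830, 9995),
    ("Q8_0", 8192): (9622, 8216),
    ("NVFP4", 4096): (31925, 28458),
    ("NVFP4", 8192): (31778, 28384),
}


def delta_percent(track_e: float, cublas: float) -> float:
    """Relative throughput gain of Track E over the cuBLAS path, in percent."""
    return (track_e / cublas - 1.0) * 100.0


if __name__ == "__main__":
    for (fmt, seq), (track_e, cublas) in RESULTS.items():
        print(f"{fmt:6s} pp{seq:<5d} {delta_percent(track_e, cublas):+.1f}%")
```

The same helper applied to the pre-4+4 pp8192 number (9413 tok/s) gives the +14.6% reported in the validation-report commit.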

Why 4+4 over the originally-shipped 1+7

The single producer warp doing all cp.async was load-bandwidth bottlenecked at long sequences. Splitting load work across 4 producer warps (128 cp.async-issuing threads vs 32) reclaims throughput. At Br=64 only 4 consumer warps are mma-active anyway (4 row-tiles of 16 rows = 64), so dropping from 7 to 4 consumers loses nothing on the compute side.

Sweet spot confirmed empirically: 2+6 was 0.4% slower than 4+4, and Br=128 was 5.3% slower because of reduced occupancy. Among the configurations tested, 4+4 is the local optimum.
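The warp-count argument above is simple arithmetic, sketched here for the three splits mentioned. This is not code from the kernel: `warp_split` is a hypothetical helper, and it assumes one consumer warp per 16-row mma tile (as the description states for Br=64) and 32 threads per warp.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs
MMA_ROWS = 16   # rows covered by one mma row-tile (the m in m16n8k16)


def warp_split(producers: int, consumers: int, br: int) -> dict:
    """Summarise a producer/consumer warp split for a query tile of br rows.

    Assumption for this sketch: each row-tile is handled by exactly one
    consumer warp, so consumers beyond br // 16 have no mma work.
    """
    row_tiles = br // MMA_ROWS
    return {
        "load_threads": producers * WARP_SIZE,
        "row_tiles": row_tiles,
        "active_consumers": min(consumers, row_tiles),
        "idle_consumers": max(consumers - row_tiles, 0),
    }


if __name__ == "__main__":
    for producers, consumers in [(1, 7), (2, 6), (4, 4)]:
        print(f"{producers}+{consumers}:", warp_split(producers, consumers, br=64))
```

Under these assumptions, 1+7 leaves three consumer warps without a row-tile at Br=64, while 4+4 has none idle and four times as many threads issuing loads.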

Other optimizations tried + reverted (not committed)

ldmatrix x4 → x2 — no signal above 1.5% bench noise
L2-persist hint for Q tile — no signal
2+6 warp-spec — −0.4% vs 4+4
Br=128 with 4 consumers × 2 row-tiles each — −5.3% (occupancy drop)
NVFP4-KV inner-loop — HOLD; nsys shows paged_kv_gather_nvfp4_to_fp16 is only 0.5% of pp4096 NVFP4 time, so the gather isn't the lever; new kernel would need fused gather+NVFP4-attention (1-2 wk project) for ~+6-8% projected

nsys profile reference

Q8_0 pp512 (post Track E + 4+4): attention is 2.3% of total prefill, dequant_q8_0 is 21.5%, CUTLASS FP16 GEMM (FFN) is 14.9%.

NVFP4 pp4096 (with --kv-nvfp4): attention is 18.2% of total, CUTLASS NVFP4 GEMM is 49.8%, paged_kv_gather_nvfp4_to_fp16 is 0.5%.

Track E owns the attention slice. Further e2e wins are in dequant/GEMM territory (out of Track E scope).
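To make the Amdahl's-law point concrete, the sketch below backs out the attention-kernel speedup implied by the Q8_0 pp512 numbers, and the ceiling on any further end-to-end gain from attention alone. Both function names are hypothetical. The calculation assumes that everything outside attention takes the same time before and after Track E, and that the nsys share measured under profiling carries over to the benchmark run; neither is confirmed by the source.

```python
def implied_kernel_speedup(attn_share_after: float, e2e_gain: float) -> float:
    """Attention-kernel speedup implied by an end-to-end throughput gain.

    attn_share_after: attention's fraction of prefill time AFTER the change.
    e2e_gain: end-to-end throughput ratio new/old (about 1.052 for +5.2%).
    Assumes non-attention time is identical before and after.
    """
    other = 1.0 - attn_share_after   # non-attention time, in units of T_after
    attn_before = e2e_gain - other   # since T_before = e2e_gain * T_after
    return attn_before / attn_share_after


def e2e_ceiling(attn_share_after: float) -> float:
    """End-to-end throughput ratio if the remaining attention time went to zero."""
    return 1.0 / (1.0 - attn_share_after)


if __name__ == "__main__":
    # Q8_0 pp512: attention is 2.3% of prefill after Track E; 12724 vs 12100 tok/s.
    print(f"implied kernel speedup: {implied_kernel_speedup(0.023, 12724 / 12100):.2f}x")
    print(f"Q8_0 pp512 ceiling:     {e2e_ceiling(0.023):.3f}x")
    print(f"NVFP4 pp4096 ceiling:   {e2e_ceiling(0.182):.3f}x")
```

With these inputs the implied kernel speedup is roughly 3.2×, inside the 3-5× range projected by the gating bench, and removing the remaining attention time entirely would add only about 2.4% on Q8_0 pp512.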

Test plan

verify-fast (pre-push hook) green
All 9 TrackE_* correctness tests PASS (5 hd=128 + 1 hd=256 + 3 features; 1 hd=512 SKIP)
Full test-attention suite: 129 PASS, 0 FAIL, 3 pre-existing skips
A/B perf on 2 models × 3 seq lengths documented
make verify (long suite) — to be run after merge

🤖 Generated with Claude Code

A/B on identical imp:test image (Qwen3-8B-Q8_0):
- cuBLAS-only: pp512 = 12100 tok/s, tg128 = 154.65 tok/s
- Track E: pp512 = 12724 tok/s, tg128 = 154.74 tok/s
- Δ: +5.2% pp512, decode unchanged

Track E's projected 3-5× attention-kernel speedup (from Säule 3 gating bench) translates to only +5.2% total prefill because attention is a small fraction of total prefill time (QKV proj + FFN dominate).

The old 2026-05-14 baseline (pp512=13446) was from a different build/environment and is superseded by this fresh measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Real perf-bench A/B across 2 models × 3 seq lengths on RTX 5090. Track E gives +5-15% end-to-end prefill, growing with seq length (O(n²) attention) and with leaner weight formats (NVFP4 weights amplify attention's share).

| Model | seq | Track E | cuBLAS | Δ |
|---|---:|---:|---:|---:|
| Q8_0 | 512 | 12724 | 12100 | +5.2% |
| Q8_0 | 4096 | 10830 | 9995 | +8.4% |
| Q8_0 | 8192 | 9413 | 8216 | +14.6% |
| NVFP4 | 4096 | 31925 | 28458 | +12.2% |
| NVFP4 | 8192 | 31778 | 28384 | +12.0% |

Gating bench projected 3-5× attention-kernel speedup; nsys profile confirms attention is 2.3% of pp512 prefill time on Q8_0, so Amdahl's law caps the end-to-end gain. The report quantifies all of this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A single producer warp was bottlenecked on cp.async issue rate at long seq lengths. 4 producer warps + 4 consumer warps quadruples the number of cp.async-issuing threads (32 → 128) while still providing exactly one mma warp per Br/16 row-tile at Br=64.

mbar counts adjusted: QKt_done and V_consumed init count 7→4.

pp8192 Qwen3-8B Q8_0: 9413 → 9622 tok/s (+2.2%, avg of 2 runs)

Correctness: all 9 TrackE_* tests pass unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kekzl and others added 3 commits May 21, 2026 14:16

github-actions Bot enabled auto-merge (squash) May 21, 2026 12:17

github-actions Bot merged commit b263e8d into main May 21, 2026
3 checks passed

This was referenced May 21, 2026

fix(attention): Track E PV repack — m16n8k16 A-operand layout #352

Merged

fix(attention): Track E PV repack — m16n8k16 A-operand layout (re-enables Track E) #353

Merged

fix(attention): disable Track E — multi-model degeneration (URGENT) #354

Merged

github-actions[bot] merged 3 commits into main from feat/track-e-warp-spec-4-4
