perf(gdn): fuse alpha+beta into single GEMV for decode #153
Merged
Two tiny [d_model, n_heads] matvecs sharing the same input `no` are
replaced with one matvec against an interleaved [d_model, 2*n_heads]
packed weight built at load time. Output for n=1 becomes
[α₀..α_{H-1}, β₀..β_{H-1}] contiguous in ssm_dt_buf_, so alpha_proj_out
and beta_proj_out fall out as trivial offset views — no deinterleave
kernel needed. Prefill (n>1) keeps the original 2-call dispatcher.
Why: nsys attribution on Qwen3.6-NVFP4 (PR #151's findings) showed 5
FP16 internal projections per GDN layer × 60 layers accounting for
~21.7 % of decode kernel time. With CUDA Graphs ON (default), launch
overhead is partly amortised, so the actual win is bounded: a single
fusion recovers ~1/5 of that path's launches.
Filter: only fires when both gdn_alpha and gdn_beta are FP16/BF16
on-device with matching shapes, i.e. the NVFP4-prequant Qwen3.6 /
Qwen3.5 GDN models that keep SSM internals at FP16. GGUF Q*_K / MXFP4
paths keep the original raw-quant 2-call dispatcher (which they have
to use anyway for correct dequant dispatch).
Bench (RTX 5090, identical seed and prompt, 256-token decode):
| Model | clean main | + fusion | Δ |
|------------------------|-----------:|---------:|------------------------------------:|
| Qwen3.6-35B-A3B-NVFP4 | 171.20 | 173.28 | +1.2 % |
| Qwen3.5-4B-GDN Q8_0 | 193.45 | 193.83 | n/a |
| Qwen3-4B Q8_0 (dense) | n/a | n/a | fusion inactive; verify-fast ± noise |
Q8_0 GDN unchanged because gdn_alpha/beta upload as raw Q8_0 → filter
skips the fusion → original two-call path. No regression.
Out-of-scope (separate larger PR):
- 4-way fusion (ssm_in + gdn_gate + alpha + beta) — needs storage
planner + Phase 3 NVFP4 cache co-ordination because ssm_in/gdn_gate
are larger, have different output dims, and ssm_in's output is
already sliced into z/xBC/dt downstream.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Two tiny `[d_model, n_heads]` matvecs (`gdn_alpha` and `gdn_beta`) that share the same input `no` are replaced with one matvec against an interleaved `[d_model, 2*n_heads]` packed weight built at load time. For `n=1` decode the output is `[α₀..α_{H-1}, β₀..β_{H-1}]` contiguous in `ssm_dt_buf_`, so `alpha_proj_out` and `beta_proj_out` are trivial offset views; no deinterleave kernel needed.

Prefill (`n>1`) and any non-FP16/BF16 weight (Q*_K, MXFP4, NVFP4 raw-quant) keep the original 2-call dispatcher untouched.
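A minimal sketch of the load-time packing and the decode-time offset views, under stated assumptions: row-major `[d_model, out]` weight layout, and plain `float` standing in for the on-device FP16/BF16 storage. `pack_alpha_beta` and `decode_alpha_beta` are illustrative names, not this repo's functions, and the naive inner loop stands in for the engine's GEMV:

```cpp
#include <cstddef>
#include <vector>

// Load time: copy gdn_alpha's H output columns into slots [0, H) and
// gdn_beta's into [H, 2H) of each row of one [d_model, 2H] buffer, so a
// single matvec yields [alpha_0..alpha_{H-1}, beta_0..beta_{H-1}] contiguously.
std::vector<float> pack_alpha_beta(const std::vector<float>& w_alpha, // [d_model, H]
                                   const std::vector<float>& w_beta,  // [d_model, H]
                                   std::size_t d_model, std::size_t H) {
    std::vector<float> packed(d_model * 2 * H);
    for (std::size_t r = 0; r < d_model; ++r)
        for (std::size_t h = 0; h < H; ++h) {
            packed[r * 2 * H + h]     = w_alpha[r * H + h];
            packed[r * 2 * H + H + h] = w_beta[r * H + h];
        }
    return packed;
}

// Decode (n = 1): one matvec instead of two, then two offset views into the
// output buffer — no deinterleave pass.
void decode_alpha_beta(const float* no, const float* packed, float* ssm_dt_buf,
                       std::size_t d_model, std::size_t H) {
    for (std::size_t j = 0; j < 2 * H; ++j) {   // out[j] = sum_r no[r] * W[r][j]
        float acc = 0.0f;
        for (std::size_t r = 0; r < d_model; ++r)
            acc += no[r] * packed[r * 2 * H + j];
        ssm_dt_buf[j] = acc;
    }
    const float* alpha_proj_out = ssm_dt_buf;      // first H values
    const float* beta_proj_out  = ssm_dt_buf + H;  // next H values
    (void)alpha_proj_out; (void)beta_proj_out;     // consumed downstream
}
```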
Why this PR

PR #151's nsys findings flagged 5 FP16 internal projections per GDN layer × 60 layers as accounting for ~21.7 % of Qwen3.6-NVFP4 decode kernel time. With CUDA Graphs ON (default), launch overhead is largely amortised, so the practical headroom from fusing tiny launches is bounded. This is the smallest possible first cut: alpha and beta share an input, the tensors are small enough that interleave-into-one-buffer-and-take-views is the entire change, and nothing touches the storage planner.
Bench
RTX 5090, identical seed/prompt, 256-token decode, graphs ON:

| Model | clean main | + fusion | Δ |
|------------------------|-----------:|---------:|------------------------------------:|
| Qwen3.6-35B-A3B-NVFP4 | 171.20 | 173.28 | +1.2 % |
| Qwen3.5-4B-GDN Q8_0 | 193.45 | 193.83 | n/a |
| Qwen3-4B Q8_0 (dense) | n/a | n/a | fusion inactive; verify-fast ± noise |

`make verify-fast`: PASS (decode/prefill within thresholds, graphs gate 1.62×, smoke coherent).

Filter
Fusion only fires when:
- `gdn_alpha` and `gdn_beta` are on-device
- `qtype == F16 || qtype == BF16`
- shapes match `[d_model, n_heads]`

Q8_0 / Q4_K / MXFP4 GDN models stay on the raw-quant 2-call path (which they have to use anyway for correct dequant dispatch). NVFP4-prequant Qwen3.6 / Qwen3.5 hybrid paths see the fusion fire because SSM-internal projections are kept FP16 (they're excluded from the NVFP4 cache for accuracy; see weight_map.cpp:842).
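A minimal sketch of this eligibility gate, in the same hedged spirit as above; `QType`, `WeightView`, and `can_fuse_alpha_beta` are illustrative stand-ins, not the engine's actual API:

```cpp
#include <cstddef>

enum class QType { F16, BF16, Q8_0, Q4_K, MXFP4, NVFP4 };

// Hypothetical view of a loaded weight: quant type, shape, and placement.
struct WeightView {
    QType qtype;
    std::size_t d_model = 0, n_heads = 0;
    bool on_device = false;
};

// Fusion only fires for matching on-device F16/BF16 [d_model, n_heads] pairs;
// everything else falls through to the raw-quant 2-call dispatcher.
bool can_fuse_alpha_beta(const WeightView& a, const WeightView& b) {
    auto fp16_like = [](QType q) { return q == QType::F16 || q == QType::BF16; };
    return a.on_device && b.on_device &&
           fp16_like(a.qtype) && fp16_like(b.qtype) &&
           a.d_model == b.d_model && a.n_heads == b.n_heads;
}
```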
VRAM cost

Per layer: 2 × d_model × n_heads × 2 bytes for the packed buffer (originals stay live for the prefill path). Qwen3.6-35B: 2048 × 16 × 4 = 128 KB per layer, × 60 GDN layers ≈ 7.7 MB total. Negligible.

Out of scope (separate PR)
4-way fusion (`ssm_in + gdn_gate + alpha + beta`) closes more of the 21.7 % path but needs storage planner + NVFP4 cache co-ordination, because `ssm_in` and `gdn_gate` are bigger, have different output dims, and `ssm_in`'s output is already sliced into z/xBC/dt downstream. Treating that as a follow-up keeps this PR small and reviewable.

Test plan
`make verify-fast` (build + tests + perf + smoke + graphs gate)

🤖 Generated with Claude Code