perf(gdn): fuse alpha+beta into single GEMV for decode #153
Merged
Two tiny [d_model, n_heads] matvecs sharing the same input `no` are
replaced with one matvec against an interleaved [d_model, 2*n_heads]
packed weight built at load time. Output for n=1 becomes
[α₀..α_{H-1}, β₀..β_{H-1}] contiguous in ssm_dt_buf_, so alpha_proj_out
and beta_proj_out fall out as trivial offset views — no deinterleave
kernel needed. Prefill (n>1) keeps the original 2-call dispatcher.
Why: nsys attribution on Qwen3.6-NVFP4 (PR #151's findings) showed 5
FP16 internal projections per GDN layer × 60 layers accounting for
~21.7 % of decode kernel time. With CUDA Graphs ON (default), launch
overhead is partly amortised, so the actual win is bounded: a single
fusion recovers ~1/5 of that path's launches.
Filter: only fires when both gdn_alpha and gdn_beta are FP16/BF16
on-device with matching shapes, i.e. the NVFP4-prequant Qwen3.6 /
Qwen3.5 GDN models that keep SSM internals at FP16. GGUF Q*_K / MXFP4
paths keep the original raw-quant 2-call dispatcher (which they have
to use anyway for correct dequant dispatch).
Bench (RTX 5090, identical seed and prompt, 256-token decode):
| Model | clean main | + fusion | Δ |
|------------------------|-----------:|---------:|------------------------------------:|
| Qwen3.6-35B-A3B-NVFP4 | 171.20 | 173.28 | +1.2 % |
| Qwen3.5-4B-GDN Q8_0 | 193.45 | 193.83 | n/a |
| Qwen3-4B Q8_0 (dense) | n/a | n/a | fusion inactive; verify-fast ± noise |
Q8_0 GDN unchanged because gdn_alpha/beta upload as raw Q8_0 → filter
skips the fusion → original two-call path. No regression.
Out-of-scope (separate larger PR):
- 4-way fusion (ssm_in + gdn_gate + alpha + beta) — needs storage
planner + Phase 3 NVFP4 cache co-ordination because ssm_in/gdn_gate
are larger, have different output dims, and ssm_in's output is
already sliced into z/xBC/dt downstream.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Two tiny `[d_model, n_heads]` matvecs (`gdn_alpha` and `gdn_beta`) that share the same input `no` are replaced with one matvec against an interleaved `[d_model, 2*n_heads]` packed weight built at load time. For `n=1` decode the output is `[α₀..α_{H-1}, β₀..β_{H-1}]` contiguous in `ssm_dt_buf_`, so `alpha_proj_out` and `beta_proj_out` are trivial offset views; no deinterleave kernel needed.

Prefill (`n>1`) and any non-FP16/BF16 weight (Q*_K, MXFP4, NVFP4 raw-quant) keep the original 2-call dispatcher untouched.
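A minimal sketch of the load-time packing and the decode-time offset views, under stated assumptions: row-major `[d_model, out]` weight layout, and plain `float` standing in for the on-device FP16/BF16 storage. `pack_alpha_beta` and `decode_alpha_beta` are illustrative names, not this repo's functions, and the naive inner loop stands in for the engine's GEMV:

```cpp
#include <cstddef>
#include <vector>

// Load time: copy gdn_alpha's H output columns into slots [0, H) and
// gdn_beta's into [H, 2H) of each row of one [d_model, 2H] buffer, so a
// single matvec yields [alpha_0..alpha_{H-1}, beta_0..beta_{H-1}] contiguously.
std::vector<float> pack_alpha_beta(const std::vector<float>& w_alpha, // [d_model, H]
                                   const std::vector<float>& w_beta,  // [d_model, H]
                                   std::size_t d_model, std::size_t H) {
    std::vector<float> packed(d_model * 2 * H);
    for (std::size_t r = 0; r < d_model; ++r)
        for (std::size_t h = 0; h < H; ++h) {
            packed[r * 2 * H + h]     = w_alpha[r * H + h];
            packed[r * 2 * H + H + h] = w_beta[r * H + h];
        }
    return packed;
}

// Decode (n = 1): one matvec instead of two, then two offset views into the
// output buffer — no deinterleave pass.
void decode_alpha_beta(const float* no, const float* packed, float* ssm_dt_buf,
                       std::size_t d_model, std::size_t H) {
    for (std::size_t j = 0; j < 2 * H; ++j) {   // out[j] = sum_r no[r] * W[r][j]
        float acc = 0.0f;
        for (std::size_t r = 0; r < d_model; ++r)
            acc += no[r] * packed[r * 2 * H + j];
        ssm_dt_buf[j] = acc;
    }
    const float* alpha_proj_out = ssm_dt_buf;      // first H values
    const float* beta_proj_out  = ssm_dt_buf + H;  // next H values
    (void)alpha_proj_out; (void)beta_proj_out;     // consumed downstream
}
```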
Why this PR

PR #151's nsys findings flagged 5 FP16 internal projections per GDN layer × 60 layers as accounting for ~21.7 % of Qwen3.6-NVFP4 decode kernel time. With CUDA Graphs ON (default), launch overhead is largely amortised, so the practical headroom from fusing tiny launches is bounded. This is the smallest possible first cut: alpha and beta share an input, the tensors are small enough that interleave-into-one-buffer-and-take-views is the entire change, and nothing touches the storage planner.
Bench
RTX 5090, identical seed/prompt, 256-token decode, graphs ON:

| Model | clean main | + fusion | Δ |
|------------------------|-----------:|---------:|------------------------------------:|
| Qwen3.6-35B-A3B-NVFP4 | 171.20 | 173.28 | +1.2 % |
| Qwen3.5-4B-GDN Q8_0 | 193.45 | 193.83 | n/a |
| Qwen3-4B Q8_0 (dense) | n/a | n/a | fusion inactive; verify-fast ± noise |

`make verify-fast`: PASS (decode/prefill within thresholds, graphs gate 1.62×, smoke coherent).

Filter
Fusion only fires when:
- `gdn_alpha` and `gdn_beta` are on-device
- `qtype == F16 || qtype == BF16`
- shapes match `[d_model, n_heads]`

Q8_0 / Q4_K / MXFP4 GDN models stay on the raw-quant 2-call path (which they have to use anyway for correct dequant dispatch). NVFP4-prequant Qwen3.6 / Qwen3.5 hybrid paths see the fusion fire because SSM-internal projections are kept FP16 (they're excluded from the NVFP4 cache for accuracy; see weight_map.cpp:842).
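A minimal sketch of this eligibility gate, in the same hedged spirit as above; `QType`, `WeightView`, and `can_fuse_alpha_beta` are illustrative stand-ins, not the engine's actual API:

```cpp
#include <cstddef>

enum class QType { F16, BF16, Q8_0, Q4_K, MXFP4, NVFP4 };

// Hypothetical view of a loaded weight: quant type, shape, and placement.
struct WeightView {
    QType qtype;
    std::size_t d_model = 0, n_heads = 0;
    bool on_device = false;
};

// Fusion only fires for matching on-device F16/BF16 [d_model, n_heads] pairs;
// everything else falls through to the raw-quant 2-call dispatcher.
bool can_fuse_alpha_beta(const WeightView& a, const WeightView& b) {
    auto fp16_like = [](QType q) { return q == QType::F16 || q == QType::BF16; };
    return a.on_device && b.on_device &&
           fp16_like(a.qtype) && fp16_like(b.qtype) &&
           a.d_model == b.d_model && a.n_heads == b.n_heads;
}
```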
VRAM cost

Per layer: 2 × d_model × n_heads × 2 bytes for the packed buffer (originals stay live for the prefill path). Qwen3.6-35B: 2048 × 16 × 4 = 128 KB per layer, × 60 GDN layers ≈ 7.7 MB total. Negligible.

Out of scope (separate PR)
4-way fusion (`ssm_in + gdn_gate + alpha + beta`) closes more of the 21.7 % path but needs storage planner + NVFP4 cache co-ordination, because `ssm_in` and `gdn_gate` are bigger, have different output dims, and `ssm_in`'s output is already sliced into z/xBC/dt downstream. Treating that as a follow-up keeps this PR small and reviewable.

Test plan
`make verify-fast` (build + tests + perf + smoke + graphs gate)

🤖 Generated with Claude Code