
perf(gdn): fuse alpha+beta into single GEMV for decode #153

Merged
github-actions[bot] merged 1 commit into main from perf/gdn-alpha-beta-fuse
May 9, 2026
Conversation

@kekzl (Owner) commented May 9, 2026

Summary

Two tiny [d_model, n_heads] matvecs (gdn_alpha and gdn_beta) that share the same input `no` are replaced with one matvec against an interleaved [d_model, 2*n_heads] packed weight built at load time. For n=1 decode the output is [α₀..α_{H-1}, β₀..β_{H-1}] contiguous in ssm_dt_buf_, so alpha_proj_out and beta_proj_out are trivial offset views — no deinterleave kernel is needed.

Prefill (n>1) and any non-FP16/BF16 weight (Q*_K, MXFP4, NVFP4 raw-quant) keep the original 2-call dispatcher untouched.
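
To make the mechanics concrete, here is a minimal host-side sketch of the pack-and-view scheme. All names (`pack_alpha_beta`, `gemv_f16`, `alpha_beta_decode`) and the row-major layout are assumptions for illustration, not the repo's actual API:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using half_t = uint16_t;  // raw FP16/BF16 bits; packing needs no dequant

// Load time: concatenate the two [d_model, n_heads] weights along the
// output (head) dimension into one [d_model, 2*n_heads] buffer, so a
// single GEMV writes [alpha_0..alpha_{H-1}, beta_0..beta_{H-1}].
std::vector<half_t> pack_alpha_beta(const half_t* w_alpha,
                                    const half_t* w_beta,
                                    size_t d_model, size_t n_heads) {
    std::vector<half_t> packed(d_model * 2 * n_heads);
    for (size_t r = 0; r < d_model; ++r) {
        for (size_t h = 0; h < n_heads; ++h) {
            packed[r * 2 * n_heads + h]           = w_alpha[r * n_heads + h];
            packed[r * 2 * n_heads + n_heads + h] = w_beta [r * n_heads + h];
        }
    }
    return packed;
}

// Assumed engine kernel: y[cols] = x[rows] . W[rows, cols].
void gemv_f16(float* y, const half_t* W, const float* x,
              size_t rows, size_t cols);

// Decode (n == 1): one fused GEMV into ssm_dt_buf_, then two offset
// views instead of two separate kernel launches.
void alpha_beta_decode(float* ssm_dt_buf, const half_t* packed,
                       const float* no, size_t d_model, size_t n_heads) {
    gemv_f16(ssm_dt_buf, packed, no, d_model, 2 * n_heads);
    const float* alpha_proj_out = ssm_dt_buf;            // heads [0, H)
    const float* beta_proj_out  = ssm_dt_buf + n_heads;  // heads [H, 2H)
    (void)alpha_proj_out; (void)beta_proj_out;           // consumed downstream
}
```

Because the pack concatenates along the output dimension, the fused GEMV's output layout already matches what the two separate calls produced back to back, which is why no deinterleave pass is required.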

Why this PR

PR #151's nsys findings flagged the 5 FP16 internal projections per GDN layer × 60 layers as dominating ~21.7 % of Qwen3.6-NVFP4 decode kernel time. With CUDA Graphs ON (default), launch overhead is largely amortised, so the practical headroom from fusing tiny launches is bounded. This is the smallest possible first cut: alpha and beta share an input, and they are tiny enough that interleave-into-one-buffer-and-take-views is the entire change — nothing touches the storage planner.

Bench

RTX 5090, identical seed/prompt, 256-token decode, graphs ON:

| Model                  | clean main | + fusion | Δ                                         |
|------------------------|-----------:|---------:|-------------------------------------------|
| Qwen3.6-35B-A3B-NVFP4  | 171.20     | 173.28   | +1.2 %                                    |
| Qwen3.5-4B-GDN Q8_0    | 193.45     | 193.83   | unchanged (fusion inactive — Q8_0 weights) |

make verify-fast: PASS (decode/prefill within thresholds, graphs gate 1.62×, smoke coherent).

Filter

Fusion only fires when:

  • both gdn_alpha and gdn_beta are on-device
  • both have qtype == F16 || qtype == BF16
  • shapes match [d_model, n_heads]

Q8_0 / Q4_K / MXFP4 GDN models stay on the raw-quant 2-call path (which they have to use anyway for correct dequant dispatch). NVFP4-prequant Qwen3.6 / Qwen3.5 hybrid paths see the fusion fire because SSM-internal projections are kept FP16 (they're excluded from the NVFP4 cache for accuracy — see weight_map.cpp:842).
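
A sketch of that gate, with hypothetical field names and qtype ids standing in for whatever the engine's tensor descriptor actually exposes:

```cpp
// Hypothetical shape of the fusion gate; field names and the qtype id
// values are placeholders, not the engine's real definitions.
struct TensorDesc {
    int  qtype;       // quantization type id
    bool on_device;   // weight resident on the GPU
    int  rows, cols;  // logical shape
};
constexpr int F16 = 0, BF16 = 1;  // placeholder ids

bool can_fuse_alpha_beta(const TensorDesc& alpha, const TensorDesc& beta,
                         int d_model, int n_heads) {
    auto eligible = [&](const TensorDesc& t) {
        return t.on_device
            && (t.qtype == F16 || t.qtype == BF16)
            && t.rows == d_model && t.cols == n_heads;
    };
    // Anything else (Q8_0, Q4_K, MXFP4, raw NVFP4) falls through to the
    // original two-call raw-quant dispatcher.
    return eligible(alpha) && eligible(beta);
}
```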

VRAM cost

Per layer: 2 × d_model × n_heads × 2 bytes for the packed buffer (the originals stay live for the prefill path). Qwen3.6-35B: 2048 × 16 × 4 bytes = 128 KB per layer, × 60 GDN layers ≈ 7.5 MB total. Negligible.

Out of scope (separate PR)

4-way fusion (ssm_in + gdn_gate + alpha + beta) — closes more of the 21.7 % path but needs storage planner + NVFP4 cache co-ordination because ssm_in and gdn_gate are bigger, have different output dims, and ssm_in's output is already sliced into z/xBC/dt downstream. Treating that as a follow-up keeps this PR small and reviewable.

Test plan

  • make verify-fast (build + tests + perf + smoke + graphs gate)
  • Qwen3.6-NVFP4 coherent decode ("capital of France" smoke)
  • Qwen3.5-GDN Q8_0 coherent decode (Babbage essay)
  • Bench A/B clean-main vs patched on Qwen3.6-NVFP4 + Qwen3.5-GDN

🤖 Generated with Claude Code

Two tiny [d_model, n_heads] matvecs sharing the same input `no` are
replaced with one matvec against an interleaved [d_model, 2*n_heads]
packed weight built at load time. Output for n=1 becomes
[α₀..α_{H-1}, β₀..β_{H-1}] contiguous in ssm_dt_buf_, so alpha_proj_out
and beta_proj_out fall out as trivial offset views — no deinterleave
kernel needed. Prefill (n>1) keeps the original 2-call dispatcher.

Why: nsys attribution on Qwen3.6-NVFP4 (PR #151's findings) showed 5
FP16 internal projections per GDN layer × 60 layers dominating ~21.7 %
of decode kernel time. With CUDA Graphs ON (default), launch overhead
is partly amortised, so the actual win is bounded — a single fusion
recovers ~1/5 of that path's launches.

Filter: only fires when both gdn_alpha and gdn_beta are FP16 / BF16
on-device with matching shapes — i.e. NVFP4-prequant Qwen3.6 / Qwen3.5
GDN models that keep SSM internals at FP16. GGUF Q*_K / MXFP4 paths
keep the original raw-quant 2-call dispatcher (which they have to use
anyway for correct dispatch).

Bench (this RTX 5090, identical seed, prompt, 256-token decode):

| Model                  | clean main | + fusion | Δ                                     |
|------------------------|-----------:|---------:|--------------------------------------:|
| Qwen3.6-35B-A3B-NVFP4  | 171.20     | 173.28   | +1.2 %                                |
| Qwen3.5-4B-GDN Q8_0    | 193.45     | 193.83   | n/a                                   |
| Qwen3-4B Q8_0 (dense)  | —          | —        | verify-fast ± noise (fusion inactive) |

Q8_0 GDN unchanged because gdn_alpha/beta upload as raw Q8_0 → filter
skips the fusion → original two-call path. No regression.

Out-of-scope (separate larger PR):
- 4-way fusion (ssm_in + gdn_gate + alpha + beta) — needs storage
  planner + Phase 3 NVFP4 cache co-ordination because ssm_in/gdn_gate
  are larger, have different output dims, and ssm_in's output is
  already sliced into z/xBC/dt downstream.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions bot enabled auto-merge (squash) May 9, 2026 18:30
@github-actions bot merged commit 313ef89 into main May 9, 2026
3 checks passed
@kekzl kekzl deleted the perf/gdn-alpha-beta-fuse branch May 9, 2026 22:36
