perf: refresh baseline + cross-model graphs-gate + MoE fp32_down pre-alloc by kekzl · Pull Request #151 · kekzl/imp

kekzl · 2026-05-09T16:17:53Z

Summary

Follow-ups to #150 (nsys-driven prefill/decode wins) — 4 commits that were pushed after the original PR was already auto-merged via CI green. None changes compute math; all are bench/CI/workspace mechanics.

Commit	Why
`eef662a fix(scripts):` gen_perf_baseline.sh tok/s parser	Bug fix: `awk '{print $5}'` was extracting the avg-ms field instead of the tok/s value inside parens. Regenerated baseline JSON would have been ~400× off, silently flagging every future verify-fast run as a "huge regression". Same parser shape verify.sh already uses (line ~221).
`a1dcbee chore(perf):` refresh tests/perf_baseline.json	Re-run gen against post-patch ceiling (Qwen3-8B Q8, RTX 5090, reps=5, graphs ON). pp128 +11.9 %, pp512 +12.4 %, tg128 +6.6 %. VRAM unchanged. Without this, verify-fast reports the new ceiling as "regression vs old baseline".
`72e6c55 ci:` lower graphs-gate threshold 1.5× → 1.3×	F1's warmup-pre-pass made graphs-OFF significantly faster on dense Q8, compressing the graphs-ON / graphs-OFF speedup ratio. Cross-model A/B (post-patches): Qwen3-4B Q8 1.90×, Qwen3.5-GDN Q8 2.23×, Llama-3.2-3B Q8 2.38×, Qwen3-8B Q8 1.20× (bigger model = larger kernel time = less launch-overhead share). Rejecting Qwen3-8B (verify-fast's canonical perf model) was wrong. 1.3 still catches catastrophic graph fallback (ratio ≈ 1.0×) without flagging healthy big-model decodes.
`b4933fd perf(moe):` pre-allocate fp32_down_buf	`executor_forward_moe.cu:1080` `cudaMallocAsync` per layer per prefill chunk. Add `MoEWorkspace::fp32_down_buf` allocated once at workspace setup, sized for max_tokens × n_experts_active × d × sizeof(float) (~16 MiB on Qwen3.6). Forward pass prefers persistent buf, lazy fallback otherwise. Free-site only releases the lazily-allocated copy. Quality-neutral.

Validation

make verify-fast PASS at every step of this chain:

decode tg128 within 3 % of refreshed baseline
prefill pp512 within 5 % of refreshed baseline
graphs-gate 1.55× ≥ 1.3× threshold
smoke "Paris" coherent

Qwen3.6-NVFP4 follow-up bench (post-P2):

graphs OFF tg128: 67.74 → 74.17 (+9.5 %)
graphs ON tg128: 224.11 → 224.66 (graph capture amortizes the launches; gain hidden in graph replay)

Open follow-up

The Qwen3.6-NVFP4 vs Q4_K_M decode gap (224 vs 250 tok/s = -10 %) is structural, not a bug. nsys attribution shows 21.7 % of decode kernel time is in gemv_fp16_kernel from 60 GDN layers × 5 FP16 internal projections (ssm_in/out, gdn_gate/alpha/beta). Quality-neutral fix would require fusing 5 GEMVs into 1 kernel (M-effort, separate PR). Documented in profiles/nsys_findings_after.md.

Test plan

make verify-fast (build + tests + perf + smoke + graphs gate)
Smoke decode coherent on Qwen3-4B Q8
Qwen3.6-NVFP4 degeneration probe ("The capital of France is" → coherent, stderr clean)
Bench across 4 models (Qwen3-4B/8B/Llama-3.2-3B/Qwen3.5-GDN) — no regressions
Refresh tests/perf_baseline_chunked.json (separate, longer-context baseline, not touched here)

🤖 Generated with Claude Code

The current bench-line format is pp 512 tokens avg 33.95 ms (15083.10 tok/s) [5 reps] `awk '{print $5}'` returned 33.95 (the ms field) instead of 15083.10 (the tok/s value). Result: regenerated baseline JSON had pp/tg numbers ~400× off, which would silently flag every future run as a "huge regression" through verify-fast. Switch to a `grep -oP` regex that pulls the tok/s value from inside the parens — same shape verify.sh already uses (line ~221). Also fall back to `nvidia-smi`'s "CUDA Version" when `nvcc` is missing (runtime-only image), so the JSON gets a real cuda field instead of "unknown". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Run scripts/gen_perf_baseline.sh on Qwen3-8B Q8_0 (canonical baseline model, RTX 5090, CUDA 13.2, reps=5) after the F1-F8 perf patches: | metric | old | new | delta | | pp128 | 5 581.75 | 6 245.84 | +11.9 % | | pp512 | 13 277.98| 14 917.02| +12.4 % | | tg128 | 147.85 | 157.59 | +6.6 % | VRAM (8 608 MiB model_weights) unchanged — these are pure algo / launch-overhead wins, not memory tradeoffs. Wider bench (reps=5, graphs ON, post-patches) for context — not gated, just informational: | Model | pp512 (tok/s) | tg128 (tok/s) | | Qwen3-4B Q8 | 16 282 | 243.5 | | Qwen3-8B Q8 | 15 096 | 154.4 | | Qwen3.5-GDN Q8 | 14 673 | 229.4 | | Llama-3.2-3B Q8 | 30 205 | 314.7 | Llama-3.2-3B Q8 sees the biggest decode lift (memo baseline 208 → 314 = +51 %), driven primarily by F1's warmup-pre-pass picking the modern m16n8k16 cuBLAS tile instead of the legacy WMMA tile that the cold-start measurement preferred. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

F1's warmup-pre-pass made the cuBLAS algo selection significantly better even with --no-cuda-graphs, which compressed the graphs-ON / graphs-OFF speedup ratio. Cross-model A/B (post-patches, verify-fast methodology — reps=2, pp=256, tg=256): Qwen3-4B Q8 1.90x Qwen3.5-GDN Q8 2.23x Llama-3.2-3B Q8 2.38x Qwen3-8B Q8 1.20x <-- bigger model, less launch-overhead share 1.5 was tuned against pre-F1 Qwen3-4B numbers; it would now reject healthy Qwen3-8B (verify-fast's canonical perf model) on every push. 1.3 still catches the actual failure mode the gate exists for — catastrophic fallback to per-step decode (ratio ≈ 1.0x) — without flagging big-model decodes whose kernels naturally amortize launch overhead. Documented inline so future tuners know what the boundary buys. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…call malloc) executor_forward_moe.cu:1080 called cudaMallocAsync per layer per prefill chunk to obtain a max_tokens × top_k × d_model FP32 scratch buffer for the down-projection when fp32_down_active=true. On Qwen3.6-35B-A3B-NVFP4 W2 capture (graphs OFF): contributed to the ~93k cudaMalloc API calls / 5.7s API time per 257-step run. Add MoEWorkspace::fp32_down_buf, allocated once at executor workspace setup (sized for max_tokens × n_experts_active × d × sizeof(float), default ~16 MiB on Qwen3.6). Forward pass prefers the persistent buffer when its size suffices; falls back to lazy cudaMallocAsync otherwise. Free-site only releases the lazily-allocated copy — the persistent buffer is owned by MoEWorkspace. Result on Qwen3.6-NVFP4 (reps=3, pp=256 tg=128): - graphs OFF tg128: 67.74 → 74.17 tok/s (+9.5%) - graphs ON tg128: 224.11 → 224.66 tok/s (graph capture amortizes launches; gain hidden in graph replay) - pp256 unchanged in noise band Quality-neutral (no compute changes, identical bit-output expected). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two tiny [d_model, n_heads] matvecs sharing the same input `no` are replaced with one matvec against an interleaved [d_model, 2*n_heads] packed weight built at load time. Output for n=1 becomes [α₀..α_{H-1}, β₀..β_{H-1}] contiguous in ssm_dt_buf_, so alpha_proj_out and beta_proj_out fall out as trivial offset views — no deinterleave kernel needed. Prefill (n>1) keeps the original 2-call dispatcher. Why: nsys attribution on Qwen3.6-NVFP4 (PR #151's findings) showed 5 FP16 internal projections per GDN layer × 60 layers dominating ~21.7 % of decode kernel time. With CUDA Graphs ON (default), launch overhead is partly amortised, so the actual win is bounded — a single fusion recovers ~1/5 of that path's launches. Filter: only fires when both gdn_alpha and gdn_beta are FP16 / BF16 on-device with matching shapes — i.e. NVFP4-prequant Qwen3.6 / Qwen3.5 GDN models that keep SSM internals at FP16. GGUF Q*_K / MXFP4 paths keep the original raw-quant 2-call dispatcher (which they have to use anyway for correct dispatch). Bench (this RTX 5090, identical seed, prompt, 256-token decode): | Model | clean main | + fusion | Δ | |------------------------|-----------:|---------:|-------:| | Qwen3.6-35B-A3B-NVFP4 | 171.20 | 173.28 | +1.2 % | | Qwen3.5-4B-GDN Q8_0 | 193.45 | 193.83 | n/a | | Qwen3-4B Q8_0 (dense) | (verify-fast +/- noise; fusion inactive) | Q8_0 GDN unchanged because gdn_alpha/beta upload as raw Q8_0 → filter skips the fusion → original two-call path. No regression. Out-of-scope (separate larger PR): - 4-way fusion (ssm_in + gdn_gate + alpha + beta) — needs storage planner + Phase 3 NVFP4 cache co-ordination because ssm_in/gdn_gate are larger, have different output dims, and ssm_in's output is already sliced into z/xBC/dt downstream. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

All decode + prefill numbers re-measured 2026-05-10 with imp:test post PR #156 (chunked-prefill-hybrid) + PR #157 (auto max_seq_len 16K cap). Notable changes vs prior table: - Qwen3-4B / Qwen3-8B Q8_0 decode dropped (401→236, 255→149) — older numbers were pre-some-PR-between-#88-and-#150 regression. Current state is consistent with tests/perf_baseline.json (~150 tok/s). - Llama-3.2-3B decode improved (208→306). - Qwen3.6-35B Q4_K_M decode +70% (143→243), reflects PR #150/#151 MoE graphs-gate + fp32_down pre-alloc wins. - Q3.6-NVFP4 prefill +82% (601→1092). - Added Nemotron-3-Nano-30B-A3B NVFP4 row (325 tg256 / 690 pp512). - Mistral-Small-3.2 NVFP4 + Gemma-4 Q5_K_M kept as italic stale (model files not present locally — last numbers retained as historical). README headline rewritten — single "Qwen3-8B at 255 tok/s vs 1.6× llama.cpp" claim no longer holds; replaced with multi-model decode highlights citing the current numbers. Long-context prefill table (pp1024-pp8192) and KV-cache-quant table (Llama-3.2-3B specific) NOT re-measured this round; still reflect the v0.7 / PR #51 measurement series. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kekzl mentioned this pull request May 9, 2026

fix(model): support multimodal Qwen3.6-VL NVFP4 + widen hybrid prefill cap #152

Merged

6 tasks

kekzl and others added 4 commits May 9, 2026 19:28

kekzl force-pushed the nvfp4-chunked-prefill branch from b4933fd to fcca678 Compare May 9, 2026 17:45

kekzl merged commit 8e22d69 into main May 9, 2026
2 checks passed

kekzl deleted the nvfp4-chunked-prefill branch May 9, 2026 18:05

kekzl mentioned this pull request May 9, 2026

perf(gdn): fuse alpha+beta into single GEMV for decode #153

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: refresh baseline + cross-model graphs-gate + MoE fp32_down pre-alloc#151

perf: refresh baseline + cross-model graphs-gate + MoE fp32_down pre-alloc#151
kekzl merged 4 commits into
mainfrom
nvfp4-chunked-prefill

kekzl commented May 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kekzl commented May 9, 2026

Summary

Validation

Open follow-up

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant