perf: refresh baseline + cross-model graphs-gate + MoE fp32_down pre-alloc #151
The current bench-line format is
pp 512 tokens avg 33.95 ms (15083.10 tok/s) [5 reps]
`awk '{print $5}'` returned 33.95 (the ms field) instead of 15083.10
(the tok/s value). Result: regenerated baseline JSON had pp/tg numbers
~400× off, which would silently flag every future run as a "huge
regression" through verify-fast.
Switch to a `grep -oP` regex that pulls the tok/s value from inside
the parens — same shape verify.sh already uses (line ~221). Also
fall back to `nvidia-smi`'s "CUDA Version" when `nvcc` is missing
(runtime-only image), so the JSON gets a real cuda field instead of
"unknown".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
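For illustration, here is the same extraction expressed as a standalone C++ sketch (the actual fix is a `grep -oP` one-liner inside gen_perf_baseline.sh, a shell script; this just shows why whitespace-field splitting grabs the wrong number from that bench line):

```cpp
// Sketch: field 5 of the bench line is the avg-ms value; the tok/s value
// lives inside the parens, so it has to be matched by shape, not position.
#include <cstdio>
#include <regex>
#include <string>

int main() {
    std::string line = "pp 512 tokens avg 33.95 ms (15083.10 tok/s) [5 reps]";
    std::smatch m;
    // Whitespace splitting would yield 33.95; anchoring on "(<float> tok/s)"
    // yields 15083.10 regardless of how many fields precede it.
    if (std::regex_search(line, m, std::regex(R"(\(([0-9.]+) tok/s\))")))
        std::printf("%s\n", m[1].str().c_str());  // prints 15083.10
    return 0;
}
```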
Run scripts/gen_perf_baseline.sh on Qwen3-8B Q8_0 (canonical baseline model, RTX 5090, CUDA 13.2, reps=5) after the F1-F8 perf patches:

| metric | old (tok/s) | new (tok/s) | delta |
|--------|------------:|------------:|-------:|
| pp128 | 5 581.75 | 6 245.84 | +11.9 % |
| pp512 | 13 277.98 | 14 917.02 | +12.4 % |
| tg128 | 147.85 | 157.59 | +6.6 % |

VRAM (8 608 MiB model_weights) unchanged — these are pure algo / launch-overhead wins, not memory tradeoffs.

Wider bench (reps=5, graphs ON, post-patches) for context — not gated, just informational:

| Model | pp512 (tok/s) | tg128 (tok/s) |
|-------|--------------:|--------------:|
| Qwen3-4B Q8 | 16 282 | 243.5 |
| Qwen3-8B Q8 | 15 096 | 154.4 |
| Qwen3.5-GDN Q8 | 14 673 | 229.4 |
| Llama-3.2-3B Q8 | 30 205 | 314.7 |

Llama-3.2-3B Q8 sees the biggest decode lift (memo baseline 208 → 314 = +51 %), driven primarily by F1's warmup-pre-pass picking the modern m16n8k16 cuBLAS tile instead of the legacy WMMA tile that the cold-start measurement preferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
F1's warmup-pre-pass made the cuBLAS algo selection significantly better even with --no-cuda-graphs, which compressed the graphs-ON / graphs-OFF speedup ratio. Cross-model A/B (post-patches, verify-fast methodology: reps=2, pp=256, tg=256):

| Model | graphs ON/OFF ratio |
|-------|--------------------:|
| Qwen3-4B Q8 | 1.90× |
| Qwen3.5-GDN Q8 | 2.23× |
| Llama-3.2-3B Q8 | 2.38× |
| Qwen3-8B Q8 | 1.20× (bigger model, less launch-overhead share) |

The old 1.5× threshold was tuned against pre-F1 Qwen3-4B numbers; it would now reject healthy Qwen3-8B (verify-fast's canonical perf model) on every push. 1.3× still catches the actual failure mode the gate exists for, catastrophic fallback to per-step decode (ratio ≈ 1.0×), without flagging big-model decodes whose kernels naturally amortize launch overhead. Documented inline so future tuners know what the boundary buys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
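For context on what the gate measures, here is a minimal sketch of the two decode modes using the standard CUDA Graphs capture API (not code from this repo; `decode_step` is a hypothetical stand-in for the many kernels a real decode step launches):

```cuda
#include <cuda_runtime.h>

__global__ void decode_step(float* state) { /* real decode work goes here */ }

// graphs OFF: every step pays per-kernel CPU-side launch overhead.
void decode_graphs_off(float* state, int steps, cudaStream_t s) {
    for (int i = 0; i < steps; ++i)
        decode_step<<<1, 256, 0, s>>>(state);
}

// graphs ON: capture one step once, then replay it; per-step launch cost is
// a single cudaGraphLaunch no matter how many kernels were captured.
void decode_graphs_on(float* state, int steps, cudaStream_t s) {
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    decode_step<<<1, 256, 0, s>>>(state);  // recorded into the graph, not run
    cudaStreamEndCapture(s, &graph);
    cudaGraphInstantiate(&exec, graph, 0);
    for (int i = 0; i < steps; ++i)
        cudaGraphLaunch(exec, s);
    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
}
```

If graph replay silently breaks and decode falls back to per-step launches, the two paths cost the same and the measured ratio collapses toward 1.0×; the 1.3× floor exists to catch exactly that, while tolerating big models whose kernel time dominates launch overhead.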
perf(moe): pre-allocate fp32_down_buf (…call malloc)
executor_forward_moe.cu:1080 called cudaMallocAsync per layer per
prefill chunk to obtain a max_tokens × top_k × d_model FP32 scratch
buffer for the down-projection when fp32_down_active=true. On
Qwen3.6-35B-A3B-NVFP4 W2 capture (graphs OFF), this contributed to the
~93k cudaMalloc API calls / 5.7 s of API time per 257-step run.
Add MoEWorkspace::fp32_down_buf, allocated once at executor workspace
setup (sized for max_tokens × n_experts_active × d × sizeof(float),
default ~16 MiB on Qwen3.6). Forward pass prefers the persistent
buffer when its size suffices; falls back to lazy cudaMallocAsync
otherwise. The free site only releases the lazily-allocated copy; the
persistent buffer is owned by MoEWorkspace.
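A minimal sketch of the shape of this change, using the names from the commit message (`MoEWorkspace::fp32_down_buf`); the surrounding executor types and the exact call sites are simplified assumptions:

```cuda
// Sketch only: persistent FP32 scratch for the MoE down-projection,
// allocated once at workspace setup instead of per layer per prefill chunk.
#include <cuda_runtime.h>
#include <cstddef>

struct MoEWorkspace {
    float* fp32_down_buf = nullptr;  // persistent scratch, owned by the workspace
    size_t fp32_down_bytes = 0;
};

void moe_workspace_init(MoEWorkspace& ws, size_t max_tokens,
                        size_t n_experts_active, size_t d_model) {
    ws.fp32_down_bytes = max_tokens * n_experts_active * d_model * sizeof(float);
    cudaMalloc(reinterpret_cast<void**>(&ws.fp32_down_buf), ws.fp32_down_bytes);
}

// Forward pass: prefer the persistent buffer; fall back to the old lazy
// cudaMallocAsync path only when the request outgrows it.
float* acquire_fp32_down(MoEWorkspace& ws, size_t bytes, cudaStream_t s,
                         bool* lazily_allocated) {
    if (bytes <= ws.fp32_down_bytes) {
        *lazily_allocated = false;
        return ws.fp32_down_buf;
    }
    float* buf = nullptr;
    cudaMallocAsync(reinterpret_cast<void**>(&buf), bytes, s);
    *lazily_allocated = true;
    return buf;
}

// Free site: release only the lazy copy; the persistent buffer lives on.
void release_fp32_down(float* buf, bool lazily_allocated, cudaStream_t s) {
    if (lazily_allocated) cudaFreeAsync(buf, s);
}
```

A stable pre-allocated pointer also plays well with graph capture, since a device address recorded into a captured graph must stay valid across replays.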
Result on Qwen3.6-NVFP4 (reps=3, pp=256, tg=128):
- graphs OFF tg128: 67.74 → 74.17 tok/s (+9.5%)
- graphs ON tg128: 224.11 → 224.66 tok/s (graph capture amortizes
launches; the gain is hidden inside graph replay)
- pp256: unchanged within the noise band
Quality-neutral (no compute changes; identical bit-output expected).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot pushed a commit that referenced this pull request on May 9, 2026:
Two tiny [d_model, n_heads] matvecs sharing the same input `no` are
replaced with one matvec against an interleaved [d_model, 2*n_heads]
packed weight built at load time. Output for n=1 becomes
[α₀..α_{H-1}, β₀..β_{H-1}] contiguous in ssm_dt_buf_, so alpha_proj_out
and beta_proj_out fall out as trivial offset views — no deinterleave
kernel needed. Prefill (n>1) keeps the original 2-call dispatcher.
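A sketch of the idea with hypothetical helper names (`pack_alpha_beta`, a naive `matvec_fp16` standing in for the repo's GEMV dispatch); the row-major [out, in] weight layout here is an assumption, and the real load-time packing is more involved:

```cuda
// Sketch only. Assumed layout: each weight is [n_heads, d_model] row-major
// (rows = outputs). Packing concatenates alpha's rows then beta's rows, so
// one GEMV over [2*n_heads, d_model] yields alpha then beta contiguously.
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Naive stand-in for the FP16 GEMV: y[r] = sum_c W[r,c] * x[c].
__global__ void matvec_fp16(__half* y, const __half* W, const __half* x,
                            int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float acc = 0.0f;
    for (int c = 0; c < cols; ++c)
        acc += __half2float(W[r * cols + c]) * __half2float(x[c]);
    y[r] = __float2half(acc);
}

// Load time: build the packed weight once (both inputs already FP16 on device).
void pack_alpha_beta(const __half* gdn_alpha, const __half* gdn_beta,
                     __half* packed, int d_model, int n_heads) {
    size_t block = (size_t)n_heads * d_model * sizeof(__half);
    cudaMemcpy(packed, gdn_alpha, block, cudaMemcpyDeviceToDevice);
    cudaMemcpy(packed + (size_t)n_heads * d_model, gdn_beta, block,
               cudaMemcpyDeviceToDevice);
}

// Decode (n == 1): one launch instead of two; alpha/beta fall out as offset
// views into ssm_dt_buf, so no deinterleave kernel is needed.
void alpha_beta_fused(const __half* no, const __half* packed,
                      __half* ssm_dt_buf, int d_model, int n_heads,
                      cudaStream_t s) {
    int rows = 2 * n_heads;
    matvec_fp16<<<(rows + 127) / 128, 128, 0, s>>>(ssm_dt_buf, packed, no,
                                                   rows, d_model);
    const __half* alpha_proj_out = ssm_dt_buf;            // [0, H)
    const __half* beta_proj_out  = ssm_dt_buf + n_heads;  // [H, 2H)
    (void)alpha_proj_out; (void)beta_proj_out;            // consumed downstream
}
```

Packing at load time trades a one-off copy for one fewer launch per layer per decode step, which is exactly where n=1 GEMVs are launch-bound.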
Why: nsys attribution on Qwen3.6-NVFP4 (PR #151's findings) showed 5
FP16 internal projections per GDN layer × 60 layers dominating ~21.7 %
of decode kernel time. With CUDA Graphs ON (default), launch overhead
is partly amortised, so the actual win is bounded — a single fusion
recovers ~1/5 of that path's launches.
Filter: only fires when both gdn_alpha and gdn_beta are FP16 / BF16
on-device with matching shapes — i.e. NVFP4-prequant Qwen3.6 / Qwen3.5
GDN models that keep SSM internals at FP16. GGUF Q*_K / MXFP4 paths
keep the original raw-quant 2-call dispatcher (which they have to use
anyway for correct dispatch).
Bench (same RTX 5090, identical seed and prompt, 256-token decode; tok/s):
| Model | clean main | + fusion | Δ |
|------------------------|-----------:|---------:|-------:|
| Qwen3.6-35B-A3B-NVFP4 | 171.20 | 173.28 | +1.2 % |
| Qwen3.5-4B-GDN Q8_0 | 193.45 | 193.83 | n/a |
| Qwen3-4B Q8_0 (dense) | (verify-fast ± noise; fusion inactive) | | |
Q8_0 GDN unchanged because gdn_alpha/beta upload as raw Q8_0 → filter
skips the fusion → original two-call path. No regression.
Out-of-scope (separate larger PR):
- 4-way fusion (ssm_in + gdn_gate + alpha + beta) — needs storage
planner + Phase 3 NVFP4 cache co-ordination because ssm_in/gdn_gate
are larger, have different output dims, and ssm_in's output is
already sliced into z/xBC/dt downstream.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot pushed a commit that referenced this pull request on May 9, 2026:
All decode + prefill numbers re-measured 2026-05-10 with imp:test post PR #156 (chunked-prefill-hybrid) + PR #157 (auto max_seq_len 16K cap). Notable changes vs the prior table:

- Qwen3-4B / Qwen3-8B Q8_0 decode dropped (401→236, 255→149): the older numbers predate a regression introduced somewhere between #88 and #150. The current state is consistent with tests/perf_baseline.json (~150 tok/s).
- Llama-3.2-3B decode improved (208→306).
- Qwen3.6-35B Q4_K_M decode +70 % (143→243), reflecting PR #150/#151's MoE graphs-gate + fp32_down pre-alloc wins.
- Qwen3.6-NVFP4 prefill +82 % (601→1092).
- Added a Nemotron-3-Nano-30B-A3B NVFP4 row (325 tg256 / 690 pp512).
- Mistral-Small-3.2 NVFP4 + Gemma-4 Q5_K_M kept as italic stale rows (model files not present locally; last numbers retained as historical).

README headline rewritten: the single "Qwen3-8B at 255 tok/s vs 1.6× llama.cpp" claim no longer holds, so it is replaced with multi-model decode highlights citing the current numbers. The long-context prefill table (pp1024-pp8192) and the KV-cache-quant table (Llama-3.2-3B specific) were NOT re-measured this round; they still reflect the v0.7 / PR #51 measurement series.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Follow-ups to #150 (nsys-driven prefill/decode wins): 4 commits pushed after the original PR was already auto-merged on CI green. None of them changes compute math; all are bench/CI/workspace mechanics.
- `eef662a` fix(scripts): gen_perf_baseline.sh tok/s parser. `awk '{print $5}'` was extracting the avg-ms field instead of the tok/s value inside the parens; a regenerated baseline JSON would have been ~400× off, silently flagging every future verify-fast run as a "huge regression". Now uses the same parser shape verify.sh already uses (line ~221).
- `a1dcbee` chore(perf): refresh tests/perf_baseline.json
- `72e6c55` ci: lower graphs-gate threshold 1.5× → 1.3×
- `b4933fd` perf(moe): pre-allocate fp32_down_buf. executor_forward_moe.cu:1080 called cudaMallocAsync per layer per prefill chunk. Adds MoEWorkspace::fp32_down_buf, allocated once at workspace setup and sized for max_tokens × n_experts_active × d × sizeof(float) (~16 MiB on Qwen3.6). The forward pass prefers the persistent buffer, with a lazy fallback otherwise; the free site only releases the lazily-allocated copy. Quality-neutral.

Validation
`make verify-fast` PASS at every step of this chain. Qwen3.6-NVFP4 follow-up bench (post-P2): see the fp32_down_buf numbers above.
Open follow-up
The Qwen3.6-NVFP4 vs Q4_K_M decode gap (224 vs 250 tok/s = -10 %) is structural, not a bug. nsys attribution shows 21.7 % of decode kernel time is in `gemv_fp16_kernel`, from 60 GDN layers × 5 FP16 internal projections (ssm_in/out, gdn_gate/alpha/beta). A quality-neutral fix would require fusing the 5 GEMVs into 1 kernel (M-effort, separate PR). Documented in profiles/nsys_findings_after.md.

Test plan
- `make verify-fast` (build + tests + perf + smoke + graphs gate)
- tests/perf_baseline_chunked.json (separate, longer-context baseline, not touched here)

🤖 Generated with Claude Code