Skip to content

perf: refresh baseline + cross-model graphs-gate + MoE fp32_down pre-alloc#151

Merged
kekzl merged 4 commits into
mainfrom
nvfp4-chunked-prefill
May 9, 2026
Merged

perf: refresh baseline + cross-model graphs-gate + MoE fp32_down pre-alloc#151
kekzl merged 4 commits into
mainfrom
nvfp4-chunked-prefill

Conversation

@kekzl
Copy link
Copy Markdown
Owner

@kekzl kekzl commented May 9, 2026

Summary

Follow-ups to #150 (nsys-driven prefill/decode wins) — 4 commits that were pushed after the original PR was already auto-merged via CI green. None changes compute math; all are bench/CI/workspace mechanics.

Commit Why
eef662a fix(scripts): gen_perf_baseline.sh tok/s parser Bug fix: awk '{print $5}' was extracting the avg-ms field instead of the tok/s value inside parens. Regenerated baseline JSON would have been ~400× off, silently flagging every future verify-fast run as a "huge regression". Same parser shape verify.sh already uses (line ~221).
a1dcbee chore(perf): refresh tests/perf_baseline.json Re-run gen against post-patch ceiling (Qwen3-8B Q8, RTX 5090, reps=5, graphs ON). pp128 +11.9 %, pp512 +12.4 %, tg128 +6.6 %. VRAM unchanged. Without this, verify-fast reports the new ceiling as "regression vs old baseline".
72e6c55 ci: lower graphs-gate threshold 1.5× → 1.3× F1's warmup-pre-pass made graphs-OFF significantly faster on dense Q8, compressing the graphs-ON / graphs-OFF speedup ratio. Cross-model A/B (post-patches): Qwen3-4B Q8 1.90×, Qwen3.5-GDN Q8 2.23×, Llama-3.2-3B Q8 2.38×, Qwen3-8B Q8 1.20× (bigger model = larger kernel time = less launch-overhead share). Rejecting Qwen3-8B (verify-fast's canonical perf model) was wrong. 1.3 still catches catastrophic graph fallback (ratio ≈ 1.0×) without flagging healthy big-model decodes.
b4933fd perf(moe): pre-allocate fp32_down_buf executor_forward_moe.cu:1080 cudaMallocAsync per layer per prefill chunk. Add MoEWorkspace::fp32_down_buf allocated once at workspace setup, sized for max_tokens × n_experts_active × d × sizeof(float) (~16 MiB on Qwen3.6). Forward pass prefers persistent buf, lazy fallback otherwise. Free-site only releases the lazily-allocated copy. Quality-neutral.

Validation

make verify-fast PASS at every step of this chain:

  • decode tg128 within 3 % of refreshed baseline
  • prefill pp512 within 5 % of refreshed baseline
  • graphs-gate 1.55× ≥ 1.3× threshold
  • smoke "Paris" coherent

Qwen3.6-NVFP4 follow-up bench (post-P2):

  • graphs OFF tg128: 67.74 → 74.17 (+9.5 %)
  • graphs ON tg128: 224.11 → 224.66 (graph capture amortizes the launches; gain hidden in graph replay)

Open follow-up

The Qwen3.6-NVFP4 vs Q4_K_M decode gap (224 vs 250 tok/s = -10 %) is structural, not a bug. nsys attribution shows 21.7 % of decode kernel time is in gemv_fp16_kernel from 60 GDN layers × 5 FP16 internal projections (ssm_in/out, gdn_gate/alpha/beta). Quality-neutral fix would require fusing 5 GEMVs into 1 kernel (M-effort, separate PR). Documented in profiles/nsys_findings_after.md.

Test plan

  • make verify-fast (build + tests + perf + smoke + graphs gate)
  • Smoke decode coherent on Qwen3-4B Q8
  • Qwen3.6-NVFP4 degeneration probe ("The capital of France is" → coherent, stderr clean)
  • Bench across 4 models (Qwen3-4B/8B/Llama-3.2-3B/Qwen3.5-GDN) — no regressions
  • Refresh tests/perf_baseline_chunked.json (separate, longer-context baseline, not touched here)

🤖 Generated with Claude Code

kekzl and others added 4 commits May 9, 2026 19:28
The current bench-line format is

  pp   512 tokens  avg    33.95 ms  (15083.10 tok/s)  [5 reps]

`awk '{print $5}'` returned 33.95 (the ms field) instead of 15083.10
(the tok/s value). Result: regenerated baseline JSON had pp/tg numbers
~400× off, which would silently flag every future run as a "huge
regression" through verify-fast.

Switch to a `grep -oP` regex that pulls the tok/s value from inside
the parens — same shape verify.sh already uses (line ~221). Also
fall back to `nvidia-smi`'s "CUDA Version" when `nvcc` is missing
(runtime-only image), so the JSON gets a real cuda field instead of
"unknown".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run scripts/gen_perf_baseline.sh on Qwen3-8B Q8_0 (canonical baseline
model, RTX 5090, CUDA 13.2, reps=5) after the F1-F8 perf patches:

| metric | old      | new      | delta    |
| pp128  | 5 581.75 | 6 245.84 | +11.9 %  |
| pp512  | 13 277.98| 14 917.02| +12.4 %  |
| tg128  | 147.85   | 157.59   | +6.6 %   |

VRAM (8 608 MiB model_weights) unchanged — these are pure
algo / launch-overhead wins, not memory tradeoffs.

Wider bench (reps=5, graphs ON, post-patches) for context — not
gated, just informational:

| Model            | pp512 (tok/s) | tg128 (tok/s) |
| Qwen3-4B Q8      | 16 282        | 243.5         |
| Qwen3-8B Q8      | 15 096        | 154.4         |
| Qwen3.5-GDN Q8   | 14 673        | 229.4         |
| Llama-3.2-3B Q8  | 30 205        | 314.7         |

Llama-3.2-3B Q8 sees the biggest decode lift (memo baseline 208 →
314 = +51 %), driven primarily by F1's warmup-pre-pass picking the
modern m16n8k16 cuBLAS tile instead of the legacy WMMA tile that
the cold-start measurement preferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
F1's warmup-pre-pass made the cuBLAS algo selection significantly
better even with --no-cuda-graphs, which compressed the
graphs-ON / graphs-OFF speedup ratio. Cross-model A/B (post-patches,
verify-fast methodology — reps=2, pp=256, tg=256):

  Qwen3-4B Q8       1.90x
  Qwen3.5-GDN Q8    2.23x
  Llama-3.2-3B Q8   2.38x
  Qwen3-8B Q8       1.20x   <-- bigger model, less launch-overhead share

1.5 was tuned against pre-F1 Qwen3-4B numbers; it would now reject
healthy Qwen3-8B (verify-fast's canonical perf model) on every push.
1.3 still catches the actual failure mode the gate exists for —
catastrophic fallback to per-step decode (ratio ≈ 1.0x) — without
flagging big-model decodes whose kernels naturally amortize launch
overhead.

Documented inline so future tuners know what the boundary buys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…call malloc)

executor_forward_moe.cu:1080 called cudaMallocAsync per layer per
prefill chunk to obtain a max_tokens × top_k × d_model FP32 scratch
buffer for the down-projection when fp32_down_active=true. On
Qwen3.6-35B-A3B-NVFP4 W2 capture (graphs OFF): contributed to the
~93k cudaMalloc API calls / 5.7s API time per 257-step run.

Add MoEWorkspace::fp32_down_buf, allocated once at executor workspace
setup (sized for max_tokens × n_experts_active × d × sizeof(float),
default ~16 MiB on Qwen3.6). Forward pass prefers the persistent
buffer when its size suffices; falls back to lazy cudaMallocAsync
otherwise. Free-site only releases the lazily-allocated copy — the
persistent buffer is owned by MoEWorkspace.

Result on Qwen3.6-NVFP4 (reps=3, pp=256 tg=128):
- graphs OFF tg128: 67.74 → 74.17 tok/s (+9.5%)
- graphs ON  tg128: 224.11 → 224.66 tok/s (graph capture amortizes
                                            launches; gain hidden in graph replay)
- pp256 unchanged in noise band

Quality-neutral (no compute changes, identical bit-output expected).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kekzl kekzl force-pushed the nvfp4-chunked-prefill branch from b4933fd to fcca678 Compare May 9, 2026 17:45
@kekzl kekzl merged commit 8e22d69 into main May 9, 2026
2 checks passed
@kekzl kekzl deleted the nvfp4-chunked-prefill branch May 9, 2026 18:05
github-actions Bot pushed a commit that referenced this pull request May 9, 2026
Two tiny [d_model, n_heads] matvecs sharing the same input `no` are
replaced with one matvec against an interleaved [d_model, 2*n_heads]
packed weight built at load time. Output for n=1 becomes
[α₀..α_{H-1}, β₀..β_{H-1}] contiguous in ssm_dt_buf_, so alpha_proj_out
and beta_proj_out fall out as trivial offset views — no deinterleave
kernel needed. Prefill (n>1) keeps the original 2-call dispatcher.

Why: nsys attribution on Qwen3.6-NVFP4 (PR #151's findings) showed 5
FP16 internal projections per GDN layer × 60 layers dominating ~21.7 %
of decode kernel time. With CUDA Graphs ON (default), launch overhead
is partly amortised, so the actual win is bounded — a single fusion
recovers ~1/5 of that path's launches.

Filter: only fires when both gdn_alpha and gdn_beta are FP16 / BF16
on-device with matching shapes — i.e. NVFP4-prequant Qwen3.6 / Qwen3.5
GDN models that keep SSM internals at FP16. GGUF Q*_K / MXFP4 paths
keep the original raw-quant 2-call dispatcher (which they have to use
anyway for correct dispatch).

Bench (this RTX 5090, identical seed, prompt, 256-token decode):

| Model                  | clean main | + fusion | Δ      |
|------------------------|-----------:|---------:|-------:|
| Qwen3.6-35B-A3B-NVFP4  | 171.20     | 173.28   | +1.2 % |
| Qwen3.5-4B-GDN Q8_0    | 193.45     | 193.83   | n/a   |
| Qwen3-4B Q8_0 (dense)  | (verify-fast +/- noise; fusion inactive) |

Q8_0 GDN unchanged because gdn_alpha/beta upload as raw Q8_0 → filter
skips the fusion → original two-call path. No regression.

Out-of-scope (separate larger PR):
- 4-way fusion (ssm_in + gdn_gate + alpha + beta) — needs storage
  planner + Phase 3 NVFP4 cache co-ordination because ssm_in/gdn_gate
  are larger, have different output dims, and ssm_in's output is
  already sliced into z/xBC/dt downstream.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions Bot pushed a commit that referenced this pull request May 9, 2026
All decode + prefill numbers re-measured 2026-05-10 with imp:test post
PR #156 (chunked-prefill-hybrid) + PR #157 (auto max_seq_len 16K cap).

Notable changes vs prior table:
- Qwen3-4B / Qwen3-8B Q8_0 decode dropped (401→236, 255→149) — older
  numbers were pre-some-PR-between-#88-and-#150 regression. Current
  state is consistent with tests/perf_baseline.json (~150 tok/s).
- Llama-3.2-3B decode improved (208→306).
- Qwen3.6-35B Q4_K_M decode +70% (143→243), reflects PR #150/#151 MoE
  graphs-gate + fp32_down pre-alloc wins.
- Q3.6-NVFP4 prefill +82% (601→1092).
- Added Nemotron-3-Nano-30B-A3B NVFP4 row (325 tg256 / 690 pp512).
- Mistral-Small-3.2 NVFP4 + Gemma-4 Q5_K_M kept as italic stale (model
  files not present locally — last numbers retained as historical).

README headline rewritten — single "Qwen3-8B at 255 tok/s vs 1.6×
llama.cpp" claim no longer holds; replaced with multi-model decode
highlights citing the current numbers.

Long-context prefill table (pp1024-pp8192) and KV-cache-quant table
(Llama-3.2-3B specific) NOT re-measured this round; still reflect the
v0.7 / PR #51 measurement series.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant