-
|
As a comparison, here is the bench output from the exact same system with an RTX 3090 Ti:
bench.sh output (3 warmups + 5 measured per prompt) |
-
|
I'd love some suggestions on tweaks to the compose file that I can test to make better use of the 5090 and to remove some of the workarounds that are in place for the 3090. |
-
|
Massive thanks @apnar: first 5090 (Blackwell sm_120) data point on club-3090. Quick read on what your numbers show, then concrete tweaks to chase.

What the data says (5090 vs 3090 Ti, same system)
~1.6× speedup is roughly the memory-bandwidth ratio (5090's 1.79 TB/s vs 3090 Ti's 1.01 TB/s = 1.77×), which is the right shape for an INT4-quantized 27B that's bandwidth-bound on weight reads. You're getting most of what the hardware promises out of an unmodified-for-Blackwell config, meaning the workarounds we ship for Ampere aren't catastrophically counterproductive on Blackwell, but several of them can be removed or replaced for further gain.

Concrete suggestions in priority order:

Tweak A (likely +5-15%): drop the Ampere-specific Marlin RO-mount
The compose mounts
Try:
# In your override compose, comment out or remove the two RO mounts:
# - ../patches/vllm-marlin-pad/marlin.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/marlin.py:ro
# - ../patches/vllm-marlin-pad/MPLinearKernel.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/MPLinearKernel.py:ro
Boot. If it boots clean and

Tweak B (likely +5-10% accuracy, neutral TPS): swap KV format
Your current config inherits
Try:
- --kv-cache-dtype
- fp8_e4m3
e4m3 has 1 more mantissa bit than e5m2 → less rounding noise on KV writes. On Blackwell you get the accuracy benefit AND the native compute path. Run a verify-stress probe at 60K + 90K needle (you'd need to raise
Note: also worth testing

Tweak C (lots more headroom): raise max-model-len
You have a 48K cap, but the 5090's 32 GB VRAM has ~3 GB of unused headroom (29.8 used / 32 avail) at your current load. The KV pool is 5K-ish tokens' worth bigger than what 48K needs:
Try

Tweak D (probably no-op but worth flagging): Genesis Blackwell auto-detection
Your boot log will show
After your bench, it would be useful to grep the boot log for
docker logs vllm-qwen36-27b 2>&1 | grep -E "is_blackwell|\[OFF\]|\[ON \]" | head -30
Some of those

On the soak-test docker error
Your
In the meantime, the soak-helper update I shipped today (commit f32d8a6) detects "silent-empty" turns (HTTP 200 + 0 completion tokens) automatically, a useful gate even when you can't run it via the docker entrypoint. If you can run the helper directly:
python3 scripts/soak-helper.py summary <turn-log.csv> /tmp/summary.md <boot_vram_mib> 200 0 5
(But you'd need to feed it a turn-log.csv from a manually-orchestrated soak run, which on microk8s means writing a small wrapper. Probably not worth it until I soften soak-test.sh's docker check the same way setup.sh got softened.)

What we'd love to see next
In rough priority for adding to BENCHMARKS as a new "5090 single-card" row:
Genuinely thanks for being the first Blackwell data point on this stack. The 5090 sub-class is where most of the hardware roadmap goes; having someone running it through our verify-stress harness is a real contribution. |
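
To make the Tweak B point about e4m3 carrying one more mantissa bit than e5m2 concrete, here is a minimal Python sketch of the rounding step alone. It is an illustration under stated assumptions, not vLLM's KV-cache code: the helper `round_to_mantissa_bits` and the sample value 1.4 are hypothetical, and exponent range, subnormals, and saturation are deliberately ignored.

```python
import math


def round_to_mantissa_bits(x: float, mantissa_bits: int) -> float:
    """Round x to the nearest value with `mantissa_bits` explicit mantissa bits.

    Simplification: exponent range, subnormals and saturation are ignored,
    so this models only the rounding step, not a full FP8 encoder.
    """
    if x == 0.0:
        return 0.0
    frac, exp = math.frexp(x)         # x == frac * 2**exp, with 0.5 <= |frac| < 1
    steps = 2 ** (mantissa_bits + 1)  # implicit leading bit + explicit bits
    return math.ldexp(round(frac * steps) / steps, exp)


E4M3_MANTISSA_BITS = 3  # fp8 e4m3: 4 exponent bits, 3 mantissa bits
E5M2_MANTISSA_BITS = 2  # fp8 e5m2: 5 exponent bits, 2 mantissa bits

value = 1.4  # arbitrary illustrative value, not taken from a real KV cache
as_e4m3 = round_to_mantissa_bits(value, E4M3_MANTISSA_BITS)
as_e5m2 = round_to_mantissa_bits(value, E5M2_MANTISSA_BITS)

print(f"e4m3: {as_e4m3} (abs error {abs(value - as_e4m3):.3f})")
print(f"e5m2: {as_e5m2} (abs error {abs(value - as_e5m2):.3f})")
```

For this sample value the 3-bit mantissa lands on 1.375 and the 2-bit mantissa on 1.5, so the rounding error is roughly four times smaller in this particular case. In general the extra bit halves the worst-case spacing between representable values; the actual effect on needle recall still has to be measured with verify-stress.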
-
|
Thanks for the suggestions. For A, I'm only running a single card, so I've been using the default docker-compose.yml. The suggestion in A appears to apply only to the dual configurations, as I didn't see those lines in the default one. I made the changes suggested in B and C, and here are the bench results:
bench.sh output (3 warmups + 5 measured per prompt)
Additionally, here is the grep for is_blackwell:
I'll look to give the suggested model a try later, but I'm happy to try any other config tweaks you'd like to suggest. |
-
|
Tried running kaitchup/Qwen3.6-27B-autoround-nvfp4-linearattn-BF16 as suggested. First run:
Then I tried changing the data-type value from float16 to bfloat16, and it died with the following log output: |
-
|
Tried changing Quantization to 'compressed-tensors' which got further but died with the following: |
-
|
I got much further this time with language-model-only. Here was the output:
I tried lowering the max-model-len from 96000 back down to 48000, but it still failed with a similar error. Any thoughts on where I can find a bit more memory? |
-
|
Managed to get it running with the changes you suggested. Actually didn't need to drop MTP. The numbers don't look good, though. Here is the bench output:
These are the command line options I used:
Here are the logs from that run:
As an aside, it might be worth adding the '--root-user-action' flag to the pip command in the entrypoint for the compose files, just to get rid of the initial warning. |
-
|
@apnar, thanks for sticking with the NVFP4 attempt and getting clean bench numbers out the other side. Honest read: 32 TPS is the expected envelope for that quant variant on this hardware right now, not a config issue you can tune around. Worth explaining why before you decide whether to keep it or fall back.

Why the numbers are this slow
Compare against your earlier INT4 AutoRound run on the same 5090:
The kaitchup variant's name is precise:

The CV of 0.1% you observed is the giveaway. That's a textbook bandwidth-bound path: 32 TPS × ~26 GB residency reads per token ≈ 0.83 TB/s sustained, which is 46% of the 5090's 1.79 TB/s peak. Reasonable for a path that's neither using Marlin's packed kernel (which AutoRound INT4 has) nor Blackwell's native FP4 tensor cores (which a hypothetical full-NVFP4 quant would).

For context on what AutoRound INT4 is doing differently: it goes through vLLM's Marlin kernel, which fuses dequant + GEMM in a packed format optimized for consumer GPU bandwidth. The NVFP4 + compressed-tensors loader doesn't have that fusion path yet on consumer Blackwell, partly because of the genesis-vllm-patches#20 SM 12.0 detection bug we filed (Genesis can't fully recognize sm_120 as Blackwell, so platform-specific paths don't engage), and partly because vLLM's NVFP4 support is newer / less optimized than Marlin INT4.

Recommendation: AutoRound INT4 is the daily-driver on Blackwell consumer
For now:
NVFP4 on Hopper (datacenter) is more mature today: the Hopper FP4 path has been in use longer and the loader is better tested on H100/H200. On consumer Blackwell (5090), it's preview-quality.

What would be useful from your side
If you have cycles for it:
Either of those gives you a more useful "5090 owner" data point than the NVFP4 path can today. We'd love to add a 5090 single-card row to BENCHMARKS at the AutoRound INT4 numbers; that's a config we can confidently recommend to other 5090 owners.

Cross-link to genesis bug
Filing reference for #20: the SM 12.0 detection issue is at Sandermage/genesis-vllm-patches#20. Once that lands, several Blackwell-specific paths in Genesis (Marlin tuning, P40 grouping kernel) will start engaging cleanly on your rig, which should help even the AutoRound INT4 path. Worth watching that issue.

Genuinely thanks for being our 5090 cross-rig. The NVFP4 attempt was a worthwhile preview-class data point even though the numbers didn't justify daily-driver use. We're collecting these pin to the upstream tree as the Blackwell consumer envelope sharpens. |
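
The bandwidth estimate in the reply above can be reproduced with a short Python sketch. The inputs (32 TPS, ~26 GB read per token, 1.79 TB/s peak) are the figures quoted in this thread; the function name `sustained_bandwidth_tbps` is hypothetical and the calculation assumes every generated token re-reads the full resident weights once.

```python
PEAK_BANDWIDTH_TBPS = 1.79   # 5090 peak memory bandwidth quoted in the thread
TOKENS_PER_SECOND = 32       # measured NVFP4 decode throughput from the bench
GB_READ_PER_TOKEN = 26       # approximate resident bytes read per token (thread's estimate)


def sustained_bandwidth_tbps(tokens_per_s: float, gb_per_token: float) -> float:
    """Estimate sustained memory traffic in TB/s, assuming one full read per token."""
    return tokens_per_s * gb_per_token / 1000.0


sustained = sustained_bandwidth_tbps(TOKENS_PER_SECOND, GB_READ_PER_TOKEN)
fraction_of_peak = sustained / PEAK_BANDWIDTH_TBPS

print(f"{sustained:.2f} TB/s sustained, {fraction_of_peak:.0%} of peak")
```

Swapping in the TPS and residency of another run gives a quick first check on how close that run sits to the bandwidth ceiling, though it says nothing about which kernel path is responsible for the gap.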
-
|
A small meta-note as we close out today's loop on this thread, @apnar: The back-and-forth here on the 5090 NVFP4 attempts is genuinely high-signal — first Blackwell consumer cross-rig data on club-3090, surfaced the Sandermage/genesis-vllm-patches#20 sm_120 detection bug, gave us real numbers to anchor "what does the AutoRound INT4 path look like on a 5090". All useful. But we noticed the thread has accumulated several rounds of full-container-log dumps + Genesis dispatcher banners + bench output — content that's structurally a bug-tracking conversation, not the kind of open-ended hardware-questions discussion this channel is shaped for. We just sharpened the routing in disc #17 and CONTRIBUTING.md (PR #61) to make the convention explicit going forward:
Future iterations on this NVFP4 / 5090 work — would you mind taking them to a
This thread is fine as the design / hardware-class discussion for what 5090 ownership of club-3090 looks like — high-level "should we ship a 5090 default config?" sort of questions. The structured cross-rig bench data row (AutoRound INT4 daily-driver numbers) belongs in a Numbers from your rig issue when you have it. The bug reports (NVFP4 boot OOMs, slow throughput diagnoses, KV pool admission failures) belong in regular bug-report issues. No judgment on what's already posted — this is a "going forward" convention, not a retroactive cleanup. We're not migrating this thread or asking you to re-file anything you've already shared. Genuinely thanks for the depth of cross-rig data here; it's been load-bearing. |
-
As requested in one of the other discussions, here is some initial output from an RTX 5090. I'm running on microk8s but using a container with the same parameters as:
models/qwen3.6-27b/vllm/compose/docker-compose.yml
Output of scripts/report.sh --full:
club-3090 rig report
Generated: 2026-05-04 17:02:23 UTC
Redacted output (paths, host, user, tokens). Re-run with --no-redact for full data.
System
CPU + RAM
Disk
GPU hardware
NVLink
No NVLink detected (PCIe-only)
Topology
PCIe / GPU topology matrix
Full nvidia-smi
Full nvidia-smi output
Display / desktop state
Container runtime
Stack version
011d4cc (branch: master) git status to inspect) 2db18df (per scripts/setup.sh)
Active container
No vLLM container running. Start one with bash scripts/launch.sh and re-run for the full report.
verify-full.sh output
verify-full output
verify-stress.sh output
verify-stress output (7 boundary checks incl. Cliff 2 needle recall)
soak-test.sh (SOAK_MODE=continuous) output
soak-test stdout (5-session × 5-turn ramping conversation, ~25 min)
bench.sh output
bench output (3 warmups + 5 measured per prompt)
Generated by bash scripts/report.sh. Flags: --verify (verify-full), --stress (verify-stress 7/7 incl. Cliff 2 needles), --soak (SOAK_MODE=continuous, catches Cliff 2b), --bench (canonical TPS), --full (all four, ~35 min). Use --no-redact to disable redaction (internal sharing only).