2x3090 here with NVLink, can't thank you enough #29
Replies: 11 comments 4 replies
-
|
@JusefPol — welcome, and thanks for the kind words. Your config tweaks are exactly right for NVLink, and the bench numbers are the cleanest cross-rig validation we've gotten so far. Why your changes were the correct callsOur
Your numbersThat's +56% narrative / +57% code vs the same On
|
Beta Was this translation helpful? Give feedback.
-
|
Just tested turbo, I don't see any gains, except that I can run max context size with 4 concurrent (which in my mind means 6 concurrent at 200k, I rather let agents manage compactions rather than running full context size, on my experiments on past models, if the context reaches those levels, quality drops like a stone and so does performance, so there is no point, I rather have more agents doing work). Here is the output on turbo with nvlink active: Checks: [launch] running verify-full.sh against the new server (URL=http://localhost:8011, CONTAINER=vllm-qwen36-27b-dual-turbo)... Running FULL functional test against http://localhost:8011 (model=qwen3.6-27b-autoround, container=vllm-qwen36-27b-dual-turbo) [1/8] Server reachable on /v1/models ... All checks passed. Stack is ready for full-functionality use. [launch] done. Endpoint: http://localhost:8011 Performance: ========== NARRATIVE (prompt=65 chars, max_tokens=1000) ========== === measured (5) === === summary [narrative] (n=5) === ========== CODE (prompt=78 chars, max_tokens=800) ========== === measured (5) === === summary [code] (n=5) === === GPU state === With regards to the diff. I use internally tdsproxy and tailscale, so my composes have network flags, labels, etc, but I can create a clean version without it and push it back if you want. Do any of your stress test really push the model? Just tested the verify-stress.sh script. But I saw there are other stress on the folder. Is there there a test that pushes for a long period of time? Last note. to your comment: " if you ever run with --enable-log-stats and capture the spec-decode AL number, I'd love to add an NVLink row to BENCHMARKS.md." Sorry I am not super-experienced on this (proved of my failure for months to make it work reliably hehehe). the --enable-log-stats is an extra parameter on the command correct?, but what do you mean by capture the spec-decode AL number? is it a specific lines that will be on the logs after the option is activated? |
Beta Was this translation helpful? Give feedback.
-
|
I can see your answers are AI generated :-) as is not the first time I faced an AI telling me about the --enable-log-stats command on vLLM. As it happened sometime ago when I tried, there is no parameter called --enable-log-stats. In fact, on vLLM the parameter is the opposite, is --disable-log-stats which has default false. so by default is already giving stats. I will put them here. With dual running, I ran the performance again, to track the log vs the output of the performance, ========== NARRATIVE (prompt=65 chars, max_tokens=1000) ========== === measured (5) === === summary [narrative] (n=5) === ========== CODE (prompt=78 chars, max_tokens=800) ========== === measured (5) === === summary [code] (n=5) === === GPU state === Here is the corresponding log for the commands that the script launched: vllm-qwen36-27b-dual | (APIServer pid=1) INFO: 172.22.0.1:33202 - "POST /v1/chat/completions HTTP/1.1" 200 OK For good measure, I also did the verify stress: Running STRESS / boundary test against http://localhost:8020 (model=qwen3.6-27b-autoround, container=vllm-qwen36-27b) [1/7] Long-context needle small rungs (10K / 30K) ... And here are the corresponding logs, it is difficult to track which one correspond to which, as some of the test generate multiple lines. Using the POST to chat completions should be a reference, but I saw a couple of times a test launched 2 post so you know exactly better than me what each test should generate, here is the logs for the stress test: vllm-qwen36-27b-dual | (APIServer pid=1) INFO: 172.22.0.1:60382 - "POST /v1/chat/completions HTTP/1.1" 200 OK |
Beta Was this translation helpful? Give feedback.
-
|
It was not a critique at all, I always find interesting every use people have for AI, as it helps me learn as well. Here is the same output filtered as per your command, it corresponds with the performance tests and the stress tests: (APIServer pid=1) INFO 05-02 12:00:59 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.59, Accepted throughput: 3.38 tokens/s, Drafted throughput: 6.39 tokens/s, Accepted: 614 tokens, Drafted: 1161 tokens, Per-position acceptance rate: 0.749, 0.499, 0.339, Avg Draft acceptance rate: 52.9% |
Beta Was this translation helpful? Give feedback.
-
|
@JusefPol — appreciate the gracious framing. The Your numbers are substantial — NVLink is doing real workPulling the data into context vs our PCIe baseline:
The interesting finding: your MTP acceptance length and per-position accept rates are lower than ours (2.59 AL vs 3.4, 75/50/34 vs 95/86/71) — so the +60% TPS isn't coming from better spec-decode. It's coming from lower all-reduce overhead via NVLink. Each layer's tensor-parallel sync between cards lands faster, so even with fewer accepted draft tokens per step, you're cycling through more steps per second. That's the NVLink advantage we'd expected (~1.6-1.8× per-stream from H100 SXM data + 3090-NVLink-bridge measurements in the wild). Yours lands at ~1.6× — solidly in the documented range. TTFT advantage too65 ms vs our 145 ms — NVLink reduces the cross-card latency for the first cudagraph capture / spec-verify roundtrip. That ~80 ms saved is most of the difference; for IDE-agent workloads where TTFT is the user-felt latency, that's a real win. What this means for your PR #31Concrete evidence the NVLink variant is worth landing once we can validate the config. Your numbers are the closest-to-canonical we'll get on NVLink hardware. Reviewing PR #31 against the must-fix items I flagged (container name + port collision, header text "DEFAULT for 2× cards", stale "PCIe-only" comment, variant table) — if you can address those plus paste a One follow-up question on the AL gapYour AL=2.59 vs our 3.4 is interesting separately from TPS. Two things could cause it:
If you can share the And the report.sh you might find usefulPull the latest: git pull origin master
bash scripts/report.sh --bench > my-rig-nvlink.mdSingle command captures hardware (incl. PCIe lane width per card, NVLink topology, power caps), OS, container state, AND runs canonical bench. Standardizes cross-rig data so we can add an authoritative NVLink row to HARDWARE.md. Thanks again for the bench + the catch on the hallucination — both improved the project. |
Beta Was this translation helpful? Give feedback.
-
|
@JusefPol — your NVLink config landed via PR #31, now on master as What changed on top of your contributionMostly housekeeping so it could coexist with
Your Ask: re-bench when you get a chanceThe shipped variant uses
I'd also love to know whether Thanks again for the work and for being patient with the back-and-forth in this thread. |
Beta Was this translation helpful? Give feedback.
-
|
Thanks!, I was planning to do the changes myself tomorrow, as I am out with the family for the weekend. Thanks for the support then! I am learning to use github thanks to this project! I am a Solutions Architect rather than a Developer, so I am getting to know these things slowly, but I will get to it as soon as I get some time. For the confirmation for stability, that will take time as I start using more heavily hermes and opencode, I am still configuring them properly (as I struggle due to local LLMs never lasting for too long without making mistakes. With regards to my operating point which is more concurrency rather than focusing on high context. I think is where we are headed honestly, it does not matter if they are releasing 1 million token models if the accuracy and performance drop like a stone when you try to much into one, but having multiple highly specialized agents doing stuff, that makes sense to me, That points to the topic I raised yesterday on Discussion #33, is about looking to that scenario. Maybe hitting 6-8 concurrency with 128k tokens (131072) could be a sweet spot if tool calling and overall quality is maintained, and if I can reach still above 100 tok/s per request, that makes a great point for NVLink don't you think? As I said, I can't follow your speed of answering and doing stuff, due to lack of experience on specific and weird parts of the job (I can mount a computer a thousand times with blind eyes, but coding part haven't done since university... 20 years ago), so Ill catch up, I have family and work as well :D |
Beta Was this translation helpful? Give feedback.
-
|
@JusefPol — no rush at all on testing or learning GitHub. You're contributing real signal already; the dual-nvlink variant we shipped exists because of your config + bench data — that's substantial work, regardless of where you land on the developer ↔ architect spectrum. Your operating-point intuition is correctThe "multi-agent at moderate ctx" pattern is increasingly the winning one for agentic workloads, and not just on hardware-constrained rigs. A few real reasons it beats "one big context":
This is why we ship On 6-8 concurrency × 131K — quick KV math
Even with TQ3, you'd be ~2× over-budget for 6 streams × 131K — works in practice via prefix caching + chunked-prefill paging KV in/out, but you'd hit cliffs more often, and TTFT on cold prompts would spike. The realistic stepping stones are:
NVLink genuinely shines at high concurrencyYou're right that NVLink + many streams compounds. The reason isn't bandwidth (a 27B model's per-layer allreduce is small) — it's latency. Every decoded token triggers an NCCL allreduce; PCIe adds ~30-50µs over NVLink per call. At 4 streams × 100 tok/sec = 400 allreduces/sec → ~12-20ms of overhead/sec NVLink saves you that PCIe doesn't. At 8 streams that doubles. Compounding effect on aggregate throughput. That's why your numbers are higher than dual.yml's published baseline: not just "more bandwidth" — "less latency floor on every step." No timeline pressureTest when you test. The 142K + 4 streams config you already have is documented evidence that NVLink is doing real work; if/when you push to 6 streams or k8v4 we'll have more, but the project doesn't need it to ship the variants. Family + work first. (Cross-linking Discussion #33 since the k8v4 question lives there — your two threads converge on the same operating point.) |
Beta Was this translation helpful? Give feedback.
-
|
Cross-thread heads-up @JusefPol — relevant to your operating-point concerns above. We just reproduced @GuiPerPT's club-3090#41 Cliff 2 OOM on this rig under multi-turn traffic. Both shipping single-card variants (long-text 0.93, long-vision 0.95) fail at ~21-26K accumulated context on Qwen3.6-27B + vLLM 0.20 + Genesis v7.69 + 1× 3090. Stack trace byte-identical to GuiPerPT's. Detailed writeup + recommendations: #41 comment. Why this is relevant to you specifically: your 4-streams × 142K NVLink config on dual card splits the activation budget, so this almost certainly does NOT affect dual-card paths. The single-card recommendation pivot (route hermes/openhands users to dual.yml or llamacpp) reinforces the value of what you're running. The k8v4 stretch we discussed in #33 remains worth pursuing on dual where the activation pressure isn't the binding constraint. We're working on a fix — Codex investigation brief just queued for: precise kernel-allocation math, why P103 doesn't prevent it on real serving, and candidate fix designs (Genesis patch, env var, upstream PR options). Will update once we have a path forward. |
Beta Was this translation helpful? Give feedback.
-
|
OK, here is the new report, I have updated the repo from origin, so the tests are on the latest version as of 1 hour ago. This has been executed with the shipped dual-nvlink.yml on this repo. (with the 2 concurrency and max context). Hope it helps. I will run separately mine for my normal use. Eventually I will create it on my repo as a high-concurrent-nvlink, but after much testing on usage with real world work. IF at any point we get a turboquant 8-4 I will also create that one. By the way, to be clear, is it possible for me to test directly the turboquant k8v4 by just changing the --kv-cache-dtype parameter? should it work like this? I have no hopes of navigating through all the patches and tinkering you can achieve. but if the test is as simple as changing that parameter, then I don't need to wait for an "official" yml to be released. PS: IF is only changing the --kv-cache-dtype parameter I am guessing that the starting point is not the dual.yaml but the dual.turbo.yaml that has all the patches for turboquant isnt it? |
Beta Was this translation helpful? Give feedback.
-
|
@JusefPol — these numbers are excellent, and they're now landed in the canonical BENCHMARKS row (commit 017d0d2). What you measured
That ~57% TPS uplift on TP=2 + NVLink at 262K + 2 streams is the cleanest single demonstration we have of NVLink doing real work on this stack. The mechanism is exactly what we discussed earlier in the thread: per-token NCCL allreduce latency floor drops ~30-50µs going from PCIe to NVLink, and that compounds because every decoded token triggers an allreduce. Your soak passing with 0 MiB growth + 100% TPS retention is the strongest validation we could ask for — first cross-rig confirmation that the dual-nvlink path is multi-turn-cliff-clean. Your two questionsQ1: "Can I test turboquant k8v4 by just changing
|
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Hi @noonghunna . Another one who has discover your gem here. I have struggled for months trying to push the boundaries of my cards for coding assistant and agentic usage (hermes, openclaw) always struggling to make it work. For the first time I get good feelings of the testins I am doing.
I am using the normal dual.yaml as I need "stability" for tool calling, and I don't trust turboquant yet. With a few modifications.
Dropped the context size to 142k and increased parallelism to 4 (I need at least 4 parallel executions so hermes can run itself and its subagents while I can still code with opencode. I use images from time to time, and this is working nicely.
Since I have NVLink, ill put here the changes I've made to make it work, just in case you have comments. Feel free to create an optional (untested) version if you want with my changes. If see any problems in the future I will let you know. I did clone this morning, so I need to use it heavely, but at least is not failing the first tool calling like my attempts with qwen3.6-35B on llama-cpp before.
-- Added - NCCL_P2P_LEVEL=NVL
-- commented out: #- NCCL_P2P_DISABLE=1
-- Remove expandable_segments:True from PYTORCH_CUDA_ALLOC_CONF as it was causing a crash on startup
--removed --disable-custom-all-reduce from the command.
With this the model is running succesfully, here are the results of the benchmark:
========== NARRATIVE (prompt=65 chars, max_tokens=1000) ==========
=== warmups (3) ===
warm-1 wall= 55.34s ttft= 44995ms toks=1000 wall_TPS= 18.07 decode_TPS= 96.67
warm-2 wall= 8.92s ttft= 70ms toks=1000 wall_TPS=112.06 decode_TPS=112.94
warm-3 wall= 9.29s ttft= 70ms toks=1000 wall_TPS=107.64 decode_TPS=108.46
=== measured (5) ===
run-1 wall= 9.49s ttft= 69ms toks=1000 wall_TPS=105.42 decode_TPS=106.19
run-2 wall= 9.36s ttft= 50ms toks=1000 wall_TPS=106.86 decode_TPS=107.43
run-3 wall= 9.48s ttft= 50ms toks=1000 wall_TPS=105.47 decode_TPS=106.03
run-4 wall= 8.85s ttft= 69ms toks= 972 wall_TPS=109.84 decode_TPS=110.70
run-5 wall= 9.02s ttft= 70ms toks=1000 wall_TPS=110.85 decode_TPS=111.72
=== summary [narrative] (n=5) ===
wall_TPS mean= 107.69 std= 2.52 CV= 2.3% min=105.42 max=110.85
decode_TPS mean= 108.42 std= 2.63 CV= 2.4% min=106.03 max=111.72
TTFT mean= 62ms std= 11ms min=50ms max=70ms
========== CODE (prompt=78 chars, max_tokens=800) ==========
=== warmups (3) ===
warm-1 wall= 3.57s ttft= 68ms toks= 492 wall_TPS=137.75 decode_TPS=140.43
warm-2 wall= 2.99s ttft= 70ms toks= 419 wall_TPS=139.94 decode_TPS=143.27
warm-3 wall= 5.59s ttft= 71ms toks= 800 wall_TPS=143.00 decode_TPS=144.84
=== measured (5) ===
run-1 wall= 4.87s ttft= 70ms toks= 681 wall_TPS=139.70 decode_TPS=141.72
run-2 wall= 5.75s ttft= 69ms toks= 800 wall_TPS=139.20 decode_TPS=140.89
run-3 wall= 5.89s ttft= 70ms toks= 800 wall_TPS=135.81 decode_TPS=137.44
run-4 wall= 5.77s ttft= 69ms toks= 800 wall_TPS=138.76 decode_TPS=140.44
run-5 wall= 5.55s ttft= 69ms toks= 769 wall_TPS=138.64 decode_TPS=140.38
=== summary [code] (n=5) ===
wall_TPS mean= 138.42 std= 1.52 CV= 1.1% min=135.81 max=139.70
decode_TPS mean= 140.18 std= 1.62 CV= 1.2% min=137.44 max=141.72
TTFT mean= 69ms std= 1ms min=69ms max=70ms
=== GPU state ===
0, 88 %, 22518 MiB, 24576 MiB, 353.37 W, 55
1, 85 %, 22518 MiB, 24576 MiB, 355.88 W, 51
Again. Thanks for your work. Incredible.
Beta Was this translation helpful? Give feedback.
All reactions