Replies: 7 comments 3 replies
-
Good question, and the intuition is sound: k8v4 (8-bit K, 4-bit V, ~6 bits avg/token) genuinely sits in the middle ground you're describing. We have data on it from earlier rounds, and an explicit decision history worth surfacing.
What we measured
From
So, vs fp8: k8v4 buys you 4.5× the KV pool and +10% concurrency at the cost of −15% per-stream TPS. Vs TQ3: k8v4 keeps +3% narr / +12% code per-stream but loses 2× the KV pool and 45% of the concurrency headroom. Quality-wise, 8-bit K is close to fp16; 3-bit nc has measurable cascade risk on long-context decoding (which is why we bias toward fp8 for IDE-agent workloads even though TQ3 has more pool capacity).
Why we swapped k8v4 → TQ3 on 2026-04-28
That conclusion was framed around raw TPS vs concurrency. Your question reframes it around quality preservation, which is a legitimately different axis we didn't optimize for explicitly.
What's changed since that decision
Two things, neither of them re-tested on dual-card yet:
So the 60/77 numbers above are from the v7.51-stable era. A re-bench against the current substrate would likely move them, possibly meaningfully.
What I'd propose
Ship a third dual variant:
Concrete plan:
Cross-rig opportunity
Saw your PR #31 for the NVLink variant, once we have a
I'll get the compose drafted and benched against the current substrate over the next day or two, and update this thread with numbers. If the middle ground holds up under measurement, it ships.
-
Yes, I'd love to, as I mentioned in the other discussion. I am really interested in pushing parallelism, so if you get a working variant where you see almost no degradation in quality, I will test it as soon as you get it out, in both default and high-concurrency mode.
-
Logical next variant. Sequence makes sense:
Math + operating-point discussion is in the Discussion #29 reply just now (link), so I won't duplicate it here. Short version: the k8v4 sibling is the cleanest next thing to ship; the 6-stream stretch needs ctx trimming to fit the pool.
I won't build ahead of test signal: your 4×142K + NVLink config is one cross-rig data point, and shipping a k8v4 sibling without anyone running it would be premature commitment. When you've had a chance to run dual-nvlink in real opencode/hermes traffic for a week or two and have stability data, that's the right moment to spec the k8v4 variant. Your real workload tells us where the useful operating point actually is, as opposed to where the math says it could theoretically be.
-
I am publishing here my first results from the turboquant_k8v4 variant with a context window of 131072 (128K) and a concurrency of 5 (I had to increase GPU utilization from 0.85 to 0.9 relative to the dual turbo variant). These results come with the NVLink modifications active.
[17:20:34] /project/cpp/grammar_matcher.cc:497: Warning: The matcher has terminated after accepting the stop token, but is trying to accept new token with id 198.
I see one of the tests failed, but there is no specific log in the docker compose output that might indicate what happened. It's too soon to ship this as a PR; let me know if you see anything weird in the results, and I will keep testing for a while with real use-case data.
-
@JusefPol, interesting middle-ground attempt. Quick read on the two anomalies you flagged, plus what I'd want to see before this becomes a shippable variant.
On the
| Failure class | Symptom | What to check |
|---|---|---|
| Silent-empty (#43, #47, #50 class) | HTTP 200 + 0 completion tokens, model "thought" then output nothing | `docker logs ... 2>&1 \| grep -iE "grammar.*reject"` should be empty if not this |
| OOM during prefill at 131K ctx | HTTP 500 with allocator trace | `grep -iE "out of memory\|cudaMalloc"` |
| Cliff 2-class GDN forward OOM | HTTP 500, traces through `chunk_gated_delta_rule_fwd` | Probe 7 of verify-stress at 60K-90K |
| Concurrency 5 contention | One stream's HTTP error during multi-stream test | `grep "Engine.*0 reqs"` near the failure window |
If you can post the specific verify-stress probe number (1/7 through 7/7), or the failing test's name plus the output of `docker logs vllm-qwen36-27b-dual-turbo 2>&1 | tail -100` from around the moment of failure, I can disambiguate. The "no specific log" you ran into might just mean the failure is a soft-empty (HTTP 200 + 0 tokens), which doesn't surface as an error in the engine logs. That is exactly the gap commit f32d8a6 closes for soak-test specifically. Could you re-run after a `git pull` to get the silent-empty detection?
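If it helps to triage a saved log tail before posting it, the grep checks from the table can be approximated in a few lines of Python. This is only a rough sketch: the regexes are copied from the "What to check" column, while the class labels, the `classify_log` name, and the idea of matching `chunk_gated_delta_rule_fwd` directly in the log text are assumptions for illustration, not part of the repo's tooling.

```python
import re

# Regexes taken from the table above; labels are illustrative only.
PATTERNS = [
    ("silent-empty (grammar reject)", re.compile(r"grammar.*reject", re.IGNORECASE)),
    ("OOM during prefill", re.compile(r"out of memory|cudaMalloc", re.IGNORECASE)),
    # Assumption: the GDN forward trace shows up by function name in the log.
    ("Cliff 2-class GDN forward OOM", re.compile(r"chunk_gated_delta_rule_fwd")),
    ("concurrency contention", re.compile(r"Engine.*0 reqs")),
]


def classify_log(text):
    """Return every failure class whose pattern matches at least one log line."""
    lines = text.splitlines()
    hits = []
    for label, pattern in PATTERNS:
        if any(pattern.search(line) for line in lines):
            hits.append(label)
    return hits


if __name__ == "__main__":
    sample = "INFO engine started\nRuntimeError: CUDA out of memory\n"
    print(classify_log(sample))  # ['OOM during prefill']
```

An empty result does not prove the run was clean. A soft-empty failure (HTTP 200 + 0 tokens) may leave nothing in the engine logs at all, so the probe number is still the more useful piece of information.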
On the K8V4 + 131K + concurrency=5 + NVLink config itself
Conceptually this is a sensible middle ground:
| Variant | KV format | Per-token KV bytes (after TP=2) | Concurrency at 262K | Activation budget |
|---|---|---|---|---|
| `dual.yml` | fp8_e5m2 | 1.0 | 2.47× | comfortable |
| Yours: K8V4 + 131K | k8v4 (8-bit K, 4-bit V) | 0.625 | ~5× at 131K | tighter |
| `dual-turbo.yml` | turboquant_3bit_nc | 0.375 | 4.67× at 262K | tightest |
K8V4 is the asymmetric pick from the lucebox-hub family (PR #56 lineage) — 8 bits on K because K is more sensitive to quantization noise than V, 4 bits on V because V tolerates aggressive quant well. Should give you better accuracy on long-ctx tasks than TQ3 with similar KV-pool capacity. So if it survives stress tests cleanly, it's worth shipping as a third variant.
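The asymmetric split is easy to sanity-check with nominal bit widths. The sketch below assumes that fp8 stores K and V at 8 bits each and that turboquant_3bit_nc uses 3 bits for both; it ignores any scale or metadata overhead, so treat it as a rough guide rather than a reproduction of measured pool sizes. The `avg_bits` helper is a hypothetical name.

```python
def avg_bits(k_bits, v_bits):
    """Mean bits per cached element when K and V use different widths."""
    return (k_bits + v_bits) / 2


# Nominal (K bits, V bits) per format; the 3-bit K/V split is an assumption.
FORMATS = {
    "fp8_e5m2": (8, 8),
    "k8v4": (8, 4),
    "turboquant_3bit_nc": (3, 3),
}

if __name__ == "__main__":
    for name, (k, v) in FORMATS.items():
        print(f"{name}: {avg_bits(k, v):.1f} bits avg")
```

For k8v4 this gives 6 bits on average, which is why it is described as sitting between fp8 and the 3-bit format.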
What I'd want to see before merging it into the variant matrix:
- `verify-stress.sh` 7/7 PASS at your 131K config, including the 60K + 90K Cliff 2 needles. That's the gate other dual-* variants pass.
- `SOAK_MODE=continuous` PASS at concurrency=5. Multi-turn at high concurrency is where Cliff 2b shows up (still pending the upstream issue we're tracking; even passing soak today is meaningful).
- `bench.sh` numbers showing TPS isn't a regression vs `dual-turbo.yml` (NVLink active should help here, since PCIe-only dual-turbo bench is your reference).
- The unspecified test failure resolved. Whatever it was, we need to know what you're trading.
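One way to keep track of those four gates while testing is a small checklist like the one below. It is purely illustrative: `GateResults`, its field names, and `unmet_gates` are hypothetical and do not correspond to anything the scripts emit, so the values would have to be filled in by hand from your own runs.

```python
from dataclasses import dataclass


@dataclass
class GateResults:
    stress_probes_passed: int   # verify-stress probes passed, out of 7
    soak_continuous_ok: bool    # SOAK_MODE=continuous at concurrency=5
    tps: float                  # bench TPS for the k8v4 config
    dual_turbo_tps: float       # reference TPS from dual-turbo on the same rig
    failure_explained: bool     # the unspecified test failure is understood


def unmet_gates(r):
    """List the gates that still block the variant, in the order given above."""
    unmet = []
    if r.stress_probes_passed < 7:
        unmet.append("verify-stress 7/7")
    if not r.soak_continuous_ok:
        unmet.append("soak continuous at concurrency=5")
    if r.tps < r.dual_turbo_tps:
        unmet.append("TPS not a regression vs dual-turbo")
    if not r.failure_explained:
        unmet.append("unspecified test failure resolved")
    return unmet


if __name__ == "__main__":
    run = GateResults(6, True, 100.0, 90.0, False)  # made-up numbers
    print(unmet_gates(run))
```

An empty list would mean all four gates are met; anything else names what is still open.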
If you're willing to put together a full `bash scripts/report.sh --full` run and post it as a `numbers-from-your-rig` issue, that'd give us the canonical data shape we use for cross-rig validation. Then we can discuss merging this into a `dual-balanced.yml` or similar third-rung variant.
The NVLink win you reported on the original dual.yml (the +57% TPS finding from your earlier numbers) means K8V4 + NVLink + concurrency=5 is a configuration we don't have a published number for anywhere on this stack — meaningful contribution if it pans out.
TL;DR
- The grammar_matcher warning: benign, ignore.
- The unspecified test failure: need more data, specifically which probe failed and `docker logs` around the failure window. Try `git pull` first; soak-helper now flags silent-empty automatically (f32d8a6).
- The K8V4 + 131K + concurrency=5 + NVLink concept is sensible. Worth bringing to a `numbers-from-your-rig` issue with a full `verify-stress` + soak + bench. If it clears those gates, it's a real third variant the matrix doesn't cover today.
-
Cool, I will run the tests tomorrow. Do you want me to continue here, or move everything to an actual GitHub issue?
-
@JusefPol — continue here for now, please. The signal-to-noise on a research-mode exploration like "TQ k8v4 middle-ground for 2-card NVLink rigs" is best in a discussion thread where the back-and-forth on tuning constants doesn't pollute the issue tracker's "real bugs to fix" channel. Issues work better when you have either (a) a reproducible failure mode with a fix proposal, or (b) a shippable artifact that needs review. If your continued testing lands on:
Useful context for your next round: Sandermage just shipped v7.72.2 on 2026-05-05 (we've absorbed it into club-3090 PR #59, branch
If you re-run k8v4 testing on the v7.72.2-uplift branch, you'll be on a richer set of patches than in your earlier run. Worth a re-bench against your own prior numbers, especially on the grammar-reject side, which v7.72.1 should have improved on multi-tool catalogs.
When you do file the eventual PR (assuming the variant survives soak-continuous, which is the load-bearing gate per `docs/CLIFFS.md` on Cliff 2b), the "Numbers from your rig" issue template is a good first step before opening the PR: it captures rig context cleanly and lands the BENCHMARKS row in a structured way.
-
Just noticed that the dual variant uses plain fp8 and dual turbo uses turboquant 3-bit. Have you already tested, or do you have any plans to try, a version with turboquant_k8v4? It might give just enough room without any real penalty on tool calling and quality.