π§ͺ Dual Carnice-V2-27B (Q8) on 2Γ 3090 β beellama MTP, q8_0 KV over kvarn6, base-parity quality #417
Replies: 2 comments 8 replies
-
|
This is a very useful benchmark for dual-3090 local inference. The q8_0 KV vs KVarN result is especially interesting. It is a good reminder that the best memory strategy depends heavily on the actual hardware envelope. On a tight single-card setup, compression may be worth the compute overhead, but on dual 3090s with enough VRAM headroom, higher-fidelity q8_0 KV can be both faster and cleaner. For agentic workflows, I think the most useful next layer would be measuring not only raw TPS, but workflow-level behavior:
The OpenAI-compatible serving path is also important. It makes this kind of local setup much easier to test with existing agent tools, but the real question is whether the model stays stable across planning, tool use, code edits, and long-context replay. Great work documenting the tradeoff between KV strategy, MTP, quality, and hardware fit. |
Beta Was this translation helpful? Give feedback.
-
|
This is great, Richard β and you ran it exactly right: deterministic-first, paced, retries on, cost-aware. Two caveats to set up an honest comparison, then the numbers. Caveat 1 β model and endpoint differ. You ran Doubao-Seed-2.0-Code (a managed frontier code model); our local baseline is qwen3.6-27b. So gaps fold "different model" + "different deployment" together β useful for "where is each stronger," but not the same-model deployment isolation (for that, a qwen3.6-27b your hub serves would be apples-to-apples). Caveat 2 β reasoning state, which we should have flagged in the recipe (our miss). benchlocal-cli sets thinking per-pack via each pack's Quality β the deterministic five (no-Docker packs):
(Our base qwen3.6-27b is benched as det/sandbox totals β det 64/75 on Three things stand out:
For the next round β run both reasoning arms. Since thinking's effect is pack- and model-specific (in our own tests it's often a wash, sometimes a slight regression on strict-format packs), the clean approach is two arms and report the delta: # all-OFF arm
benchlocal-cli run --endpoint β¦ --model β¦ --api-key β¦ --pack reasonmath-15 --no-thinking --save-json off-reasonmath.json
# all-ON arm
benchlocal-cli run --endpoint β¦ --model β¦ --api-key β¦ --pack reasonmath-15 --enable-thinking --thinking-max-tokens 16384 --save-json on-reasonmath.jsonCaveat 2 still applies β those flags steer a vLLM-style server cleanly, but against Doubao you'd want to set its native reasoning toggle to match each arm (otherwise the flag is a no-op and both arms are identical). Where each is stronger, on this run: Doubao β excellent tool-calling, zero hardware/maintenance, scales on demand. Local qwen3.6-27b β slight dataextract edge, reasoning state fully under your control, full privacy/control, cost is just the power cap. Different strengths, no dominator β the framing you called from the start. Then the heavy packs: |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
We just shipped
beellama/carnice-v2-dual-q8-mtpπ§ͺ β Carnice-V2-27B (@kai-os's Hermes-style agentic SFT of Qwen3.6-27B), Q8_0 weights + embedded MTP head (MTP GGUF by @stuchapin), on beellama.cpp v0.3.2 (@Anbeeld), layer-split across two 3090s. The dual / quality-max follow-through to the single-card Carnice β requested in #403. All credit for the model goes to @kai-os; @stuchapin's MTP head is what unlocks the self-speculative path β see Credits.The headline: q8_0 KV beats the kvarn6 we were asked for β +17% prefill, higher fidelity, and it still fits the full 262K on dual.
π΄ Results Card β 2Γ RTX 3090 PCIe (no NVLink), beellama v0.3.2, think-off
β Serving
vllm/dual-maxβTTFT ~79 ms Β· prefill ~1197 t/s Β· n=5 measured, CV <1.3% Β·
DRAFT_N_MAX=2= +13% decode (lossless, opt-in). Sampling temp 0.6 / top-k 20 / top-p 0.95.β Reference row β stock Qwen3.6-27B at its max-fidelity dual tier (
vllm/qwen-27b-dual-max: official FP8 weights + int8-PTH KV, vLLM TP=2). Reported as a single ~56 TPS decode figure (narr/code not separately split). This is a different engine + quant (vLLM FP8 vs beellama Q8) β not apples-to-apples on raw TPS; it's here so you can place Carnice against base. The comparable axis is capability β see the 8-pack below.β‘ Quality β core 8-pack (/150)
vs base: stock Qwen3.6-27B at the same max-fidelity dual tier (
vllm/qwen-27b-dual-max) scores 110/150 on the same--fullharness (det 65 Β· sandbox 45). So the Carnice fine-tune lands within ~5β7 pts (sampling noise band) of base capability β i.e. no measurable capability cost for the agentic/Hermes character. Think-off β think-on (wash). In-band with our other 27B-class composes (qwopus-coder 103). Plus: verify-full 8/8 Β· verify-stress 8/8 (NIAH ladder filled to 240K = 91% of n_ctx, 5.2 GB ceiling margin) Β· soak-continuous PASS (20Γ5, 0 MiB growth, 0/100 silent-empty, p50 42.2 TPS, 100% retention).β’ Takeaways
carnice-v2-single). On dual's 48 GB you don't need the compression, so q8_0 is faster, higher-fidelity (q8 > kvarn6 q6-class), and reference-aligned. (-b/-ub/--no-mmapwere A/B'd flat β KV type was the only prefill lever.)vllm/dual-max's 110/150 on the 8-pack β you're trading none of the underlying Qwen capability for the Hermes/agentic tuning.-ts 0.55,0.45to offset the MTP context pinning to the main card). On a 2-GPU box it's mutually exclusive with single-card work.Why dual Q8 Carnice?
Carnice-V2 is a Hermes-style agentic/reasoning fine-tune β the use case is tool-use + reasoning with thinking on. The single-card path (
carnice-v2-single, Q5 + kvarn) trades quality for fitting 24 GB; on two cards you can run Q8 weights + high-fidelity q8_0 KV over the full 262K with no compression compromise. This is the quality-max Carnice.Getting it
The MTP GGUF is public: stuchapin/Carnice-V2-27B-MTP-GGUF (
Carnice-V2-27B-Q8_0-mtp.gguf).setup.sh/launch.shauto-fetch it, orhf download stuchapin/Carnice-V2-27B-MTP-GGUF. Runs on beellama.cpp v0.3.2 (pinned digest in the compose).Run it
bash scripts/switch.sh --owui beellama/carnice-v2-dual-q8-mtp # launches on :8070 and wires Open WebUIServes an OpenAI-compatible API on
:8070. Reasoning is on by default (it's a reasoning SFT); setREASONING=offfor fast chat,DRAFT_N_MAX=2for the +13% decode.What'd help
Cross-rig numbers (other 3090 pairs / 4090s β fit, TPS, MTP accept), and anyone who finds or trains a Carnice-matched DFlash drafter (the one thing that'd beat MTP here). The portable finding: on a dual card with headroom, standard q8_0 KV beats KVarN software-compression on prefill β KVarN's win is single-card-only.
Credits
--spec-type draft-mtpdecode.--spec-type draft-mtp+ KVarN.Shipped via PR #416. π§ͺ Experimental only because beellama v0.3.2 is a rolling pre-release β fully validated otherwise. Originally requested in #403.
Beta Was this translation helpful? Give feedback.
All reactions