🧪 Dual Carnice-V2-27B (Q8) on 2× 3090 — beellama MTP, q8_0 KV over kvarn6, base-parity quality #417

noonghunna · 2026-06-16T12:15:05Z

noonghunna
Jun 16, 2026
Maintainer

We just shipped beellama/carnice-v2-dual-q8-mtp 🧪 — Carnice-V2-27B (@kai-os's Hermes-style agentic SFT of Qwen3.6-27B), Q8_0 weights + embedded MTP head (MTP GGUF by @stuchapin), on beellama.cpp v0.3.2 (@Anbeeld), layer-split across two 3090s. The dual / quality-max follow-through to the single-card Carnice — requested in #403. All credit for the model goes to @kai-os; @stuchapin's MTP head is what unlocks the self-speculative path — see Credits.

The headline: q8_0 KV beats the kvarn6 we were asked for — +17% prefill, higher fidelity, and it still fits the full 262K on dual.

🎴 Results Card — 2× RTX 3090 PCIe (no NVLink), beellama v0.3.2, think-off

① Serving

Config	Spec-dec	KV	ctx	Narr	Code	Accept	VRAM
Carnice Q8 — MTP n=1 ⭐	draft-mtp	q8_0	262K	40.7	44.0	~81%	21.8 / 21.2 GB
Carnice Q8 — MTP n=2 (opt-in)	draft-mtp	q8_0	262K	~45	—	62%	same
base Qwen3.6-27B `vllm/dual-max` †	MTP n=3	int8-PTH	262K	~56 †	~56 †	—	—

TTFT ~79 ms · prefill ~1197 t/s · n=5 measured, CV <1.3% · DRAFT_N_MAX=2 = +13% decode (lossless, opt-in). Sampling temp 0.6 / top-k 20 / top-p 0.95.

† Reference row — stock Qwen3.6-27B at its max-fidelity dual tier (vllm/qwen-27b-dual-max: official FP8 weights + int8-PTH KV, vLLM TP=2). Reported as a single ~56 TPS decode figure (narr/code not separately split). This is a different engine + quant (vLLM FP8 vs beellama Q8) — not apples-to-apples on raw TPS; it's here so you can place Carnice against base. The comparable axis is capability — see the 8-pack below.

② Quality — core 8-pack (/150)

Pack	Carnice think-off	Carnice think-on
toolcall-15	13	14
instructfollow-15	13	14
structoutput-15	14	14
dataextract-15	9	10
reasonmath-15	13	12
bugfind-15	12	13
hermesagent-20	12	12
cli-40	17	16
Total /150	103 (69%)	105 (70%)

vs base: stock Qwen3.6-27B at the same max-fidelity dual tier (vllm/qwen-27b-dual-max) scores 110/150 on the same --full harness (det 65 · sandbox 45). So the Carnice fine-tune lands within ~5–7 pts (sampling noise band) of base capability — i.e. no measurable capability cost for the agentic/Hermes character. Think-off ≈ think-on (wash). In-band with our other 27B-class composes (qwopus-coder 103). Plus: verify-full 8/8 · verify-stress 8/8 (NIAH ladder filled to 240K = 91% of n_ctx, 5.2 GB ceiling margin) · soak-continuous PASS (20×5, 0 MiB growth, 0/100 silent-empty, p50 42.2 TPS, 100% retention).

③ Takeaways

q8_0 KV, not kvarn6 (which the request named): a measured A/B showed q8_0 prefills +17% (1003 vs 860 t/s). KVarN is software KV-compression — its compute cost only pays off on a tight single card (where it wins on carnice-v2-single). On dual's 48 GB you don't need the compression, so q8_0 is faster, higher-fidelity (q8 > kvarn6 q6-class), and reference-aligned. (-b/-ub/--no-mmap were A/B'd flat — KV type was the only prefill lever.)
MTP, not DFlash: the only DFlash drafters that exist are base-Qwen3.6-27B, which mismatch the fine-tune → ~10% accept, slower than MTP. The embedded MTP head is distribution-matched by construction (~81% @ n=1). No Carnice-matched DFlash drafter exists.
Same capability as base, distinct character: within noise of vllm/dual-max's 110/150 on the 8-pack — you're trading none of the underlying Qwen capability for the Hermes/agentic tuning.
Dual-only: Q8_0 is ~29 GB > 24 GB → both cards, layer-split (-ts 0.55,0.45 to offset the MTP context pinning to the main card). On a 2-GPU box it's mutually exclusive with single-card work.

Why dual Q8 Carnice?

Carnice-V2 is a Hermes-style agentic/reasoning fine-tune — the use case is tool-use + reasoning with thinking on. The single-card path (carnice-v2-single, Q5 + kvarn) trades quality for fitting 24 GB; on two cards you can run Q8 weights + high-fidelity q8_0 KV over the full 262K with no compression compromise. This is the quality-max Carnice.

Getting it

The MTP GGUF is public: stuchapin/Carnice-V2-27B-MTP-GGUF (Carnice-V2-27B-Q8_0-mtp.gguf). setup.sh/launch.sh auto-fetch it, or hf download stuchapin/Carnice-V2-27B-MTP-GGUF. Runs on beellama.cpp v0.3.2 (pinned digest in the compose).

Run it

bash scripts/switch.sh --owui beellama/carnice-v2-dual-q8-mtp   # launches on :8070 and wires Open WebUI

Serves an OpenAI-compatible API on :8070. Reasoning is on by default (it's a reasoning SFT); set REASONING=off for fast chat, DRAFT_N_MAX=2 for the +13% decode.

What'd help

Cross-rig numbers (other 3090 pairs / 4090s — fit, TPS, MTP accept), and anyone who finds or trains a Carnice-matched DFlash drafter (the one thing that'd beat MTP here). The portable finding: on a dual card with headroom, standard q8_0 KV beats KVarN software-compression on prefill — KVarN's win is single-card-only.

Credits

The model: @kai-os — Carnice-V2-27B, the Hermes-style agentic SFT of Qwen3.6-27B. All capability and character is theirs.
The MTP head: @stuchapin — the Q8_0 + embedded MTP-head GGUF, which unlocks the lossless --spec-type draft-mtp decode.
The engine: beellama.cpp (@Anbeeld) — --spec-type draft-mtp + KVarN.

Shipped via PR #416. 🧪 Experimental only because beellama v0.3.2 is a rolling pre-release — fully validated otherwise. Originally requested in #403.

richardchen874-sys · 2026-06-26T05:32:44Z

richardchen874-sys
Jun 26, 2026

This is a very useful benchmark for dual-3090 local inference.

The q8_0 KV vs KVarN result is especially interesting. It is a good reminder that the best memory strategy depends heavily on the actual hardware envelope. On a tight single-card setup, compression may be worth the compute overhead, but on dual 3090s with enough VRAM headroom, higher-fidelity q8_0 KV can be both faster and cleaner.

For agentic workflows, I think the most useful next layer would be measuring not only raw TPS, but workflow-level behavior:

tool-call reliability
structured-output failure rate
retry rate
latency over long multi-step tasks
cost per completed local workflow
context retention over long sessions
whether MTP acceptance changes during tool-heavy runs
comparison between local OpenAI-compatible serving and managed API endpoints

The OpenAI-compatible serving path is also important. It makes this kind of local setup much easier to test with existing agent tools, but the real question is whether the model stays stable across planning, tool use, code edits, and long-context replay.

Great work documenting the tradeoff between KV strategy, MTP, quality, and hardware fit.

7 replies

noonghunna Jun 27, 2026
Maintainer Author

Appreciate it, Richard — and your "where each is stronger" framing is the right one. Cleanest path is you running the same packs against your endpoint and sharing the JSON; we line it up against our local numbers for the identical model + packs and publish the methodology alongside, so it's reproducible and even-handed either way.

Install:

pip install git+https://github.com/noonghunna/benchlocal-cli.git
benchlocal-cli list        # all packs

Full quality suite — the 8-pack (/150) + aider-polyglot-30 — against your endpoint:

for p in toolcall-15 instructfollow-15 structoutput-15 dataextract-15 \
         reasonmath-15 bugfind-15 hermesagent-20 cli-40 aider-polyglot-30; do
  benchlocal-cli run \
    --endpoint https://<your-hub>/v1 \
    --model <model-id-on-your-hub> \
    --api-key <your-key> \
    --pack "$p" \
    --request-delay 0.5 \
    --max-transient-retries 5 \
    --save-json "cloud-$p.json" --progress
done

On rate limits — yes, two controls, built for exactly this cloud case:

--request-delay <sec> — proactive pacing: minimum seconds between requests, so you stay under your hub's RPM ceiling. Start at 0.5, raise if you still throttle (or BENCHLOCAL_REQUEST_DELAY).
--max-transient-retries <N> (default 3) — reactive: 429 / transient failures auto-retry before a scenario fails. Bump to 5–6 for a strict limit.
They compose: pace to avoid the throttle, retry to recover when it still hits. Leave --retry-on-timeout off — a timeout means the token budget was genuinely exhausted, so retrying just burns more.
Spend guard: add --max-total-tokens <N> per pack as a hard cost ceiling. The three agentic packs (hermesagent-20, cli-40, aider-polyglot-30) use by far the most tokens — set it generously so it caps a runaway without truncating a legit run.

Two notes:

Same model > same hardware. Pick a model your hub serves that we also bench — qwen3.6-27b is ideal (our local baseline is richest there); otherwise tell us which and we'll pull the matching local numbers.
The last three are sandboxed agent loops. Against a public endpoint they should run fine (no localhost rewrite needed, unlike a local server), but we haven't validated agent-over-cloud — so if they misbehave, the deterministic six are the rock-solid apples-to-apples and you can flag the rest.

Each JSON logs per-pack token counts + per-case latencies, so cost-per-completed-run falls straight out (your price × tokens) — exactly the metric you listed. Drop the cloud-*.json here and we'll publish the head-to-head vs local qwen3.6-27b, strengths on both sides. 🙏

noonghunna Jun 27, 2026
Maintainer Author

Quick follow-up — I folded that whole cloud-run recipe into the benchlocal-cli README so it's the canonical reference rather than just this thread: Running against a cloud / managed endpoint (the cloud knobs, the rate-limit pacing + retry pair, the spend guard, provider pinning, and which packs are the clean apples-to-apples). Whenever you've got cloud-*.json, send it over and we'll line it up against local. 🙏

richardchen874-sys Jun 29, 2026

Appreciate the detailed recipe — this is exactly the kind of reproducible comparison I had in mind.

I agree that the cleanest path is to run the same packs against a managed OpenAI-compatible endpoint, keep the methodology identical, and compare the JSON outputs rather than making a vague local-vs-cloud claim.

I will start with the more deterministic packs first, then decide whether to add the heavier agentic packs after checking token usage and latency behavior:

toolcall-15
instructfollow-15
structoutput-15
dataextract-15
reasonmath-15
bugfind-15

Then, if the endpoint behaves well, I can test the heavier workflow packs separately:

hermesagent-20
cli-40
aider-polyglot-30

The rate-limit and spend-guard notes are very useful. I will use request pacing, transient retries, and a token ceiling so the comparison stays controlled instead of turning into an uncontrolled cloud spend test.

For the model, I agree that matching qwen3.6-27b or the closest equivalent is the most useful path, since that makes the local vs managed comparison much more even-handed.

Once I have the cloud-*.json files, I’ll share them here so the comparison can focus on the real metrics:

completed runs
token counts
per-case latency
transient failures / retries
cost per completed benchmark run
where local and managed endpoints are each stronger

Thanks for also folding the cloud-run recipe into the README. That makes this much easier to reproduce cleanly.

richardchen874-sys Jun 29, 2026

Appreciate it, Richard — and your "where each is stronger" framing is the right one. Cleanest path is you running the same packs against your endpoint and sharing the JSON; we line it up against our local numbers for the identical model + packs and publish the methodology alongside, so it's reproducible and even-handed either way.

Install:
pip install git+https://github.com/noonghunna/benchlocal-cli.git
benchlocal-cli list        # all packs
Full quality suite — the 8-pack (/150) + aider-polyglot-30 — against your endpoint:
for p in toolcall-15 instructfollow-15 structoutput-15 dataextract-15 \
         reasonmath-15 bugfind-15 hermesagent-20 cli-40 aider-polyglot-30; do
  benchlocal-cli run \
    --endpoint https://<your-hub>/v1 \
    --model <model-id-on-your-hub> \
    --api-key <your-key> \
    --pack "$p" \
    --request-delay 0.5 \
    --max-transient-retries 5 \
    --save-json "cloud-$p.json" --progress
done
On rate limits — yes, two controls, built for exactly this cloud case:

--request-delay <sec> — proactive pacing: minimum seconds between requests, so you stay under your hub's RPM ceiling. Start at 0.5, raise if you still throttle (or BENCHLOCAL_REQUEST_DELAY).

--max-transient-retries <N> (default 3) — reactive: 429 / transient failures auto-retry before a scenario fails. Bump to 5–6 for a strict limit.

They compose: pace to avoid the throttle, retry to recover when it still hits. Leave --retry-on-timeout off — a timeout means the token budget was genuinely exhausted, so retrying just burns more.

Spend guard: add --max-total-tokens <N> per pack as a hard cost ceiling. The three agentic packs (hermesagent-20, cli-40, aider-polyglot-30) use by far the most tokens — set it generously so it caps a runaway without truncating a legit run.

Two notes:

Same model > same hardware. Pick a model your hub serves that we also bench — qwen3.6-27b is ideal (our local baseline is richest there); otherwise tell us which and we'll pull the matching local numbers.

The last three are sandboxed agent loops. Against a public endpoint they should run fine (no localhost rewrite needed, unlike a local server), but we haven't validated agent-over-cloud — so if they misbehave, the deterministic six are the rock-solid apples-to-apples and you can flag the rest.

Each JSON logs per-pack token counts + per-case latencies, so cost-per-completed-run falls straight out (your price × tokens) — exactly the metric you listed. Drop the cloud-*.json here and we'll publish the head-to-head vs local qwen3.6-27b, strengths on both sides. 🙏

I finished the first managed OpenAI-compatible endpoint run and attached the generated cloud-*.json files as a zip.

Model used: Doubao-Seed-2.0-Code
Endpoint type: managed OpenAI-compatible endpoint
Run mode: deterministic packs first, with request pacing and transient retries enabled.

I kept this first run focused on the lighter deterministic packs before moving to the heavier workflow packs, so the comparison stays controlled and cost-aware.

Summary:

toolcall-15: 15 / 15, 100%, p50 8.82s, p95 13.02s
structoutput-15: 14 / 15, 93%, p50 8.97s, p95 13.11s
instructfollow-15: 13 / 15, 87%, p50 8.39s, p95 14.98s
dataextract-15: 7 / 15, 47%, p50 11.17s, p95 24.30s
reasonmath-15: 12 / 15, 80%, p50 15.99s, p95 110.82s
bugfind-15: 12 / 15, 80%, p50 21.63s, p95 114.59s

A few early observations:

Tool calling looks strong on this endpoint.
Structured output is mostly good, with one schema violation.
Instruction following is decent, with a couple of strict-format misses.
Data extraction is clearly the weakest area in this run, especially numeric fields, invoice-style fields, and structured item extraction.
Reasoning/math and bug finding both passed 12 / 15, but with high p95 latency.
bugfind-15 required Docker sandbox setup locally before it could run properly.

I’ll keep the heavier workflow packs separate for now, since hermesagent-20, cli-40, and aider-polyglot-30 may have very different token usage, latency, and runtime behavior.

The goal here is to make the local vs managed endpoint comparison based on completed runs, latency, failure modes, token usage, retries, and cost per completed benchmark run rather than a vague local-vs-cloud claim.
cloud-doubao-seed-2-code-results.zip

noonghunna Jun 30, 2026
Maintainer Author

Aligned on the framing — "where each stack is stronger," confounds named — and good confirmation that the latency tail tracks the thinking-on packs.

The two-arm round is the one to chase next: if Doubao exposes a native reasoning toggle, run reasonmath / bugfind both ways and we'll line each arm up against the matching local arm. If it doesn't, "reasoning isn't independently controllable on this endpoint" is itself worth recording.

On the test key — I'll actually take you up on it, but for a slightly different purpose than the comparison, which is where I think it earns its keep: hardening benchlocal-cli's verifiers against a different model family. Our pack verifiers were developed mostly against local qwen/gemma-style output; a frontier model from another family is exactly the adversarial input that surfaces over-fit grading. Your run already has a candidate — dataextract at 47%: is that a real model weakness, or is the verifier too strict on a model that formats numeric / invoice fields differently? Running the packs against your endpoint and inspecting the graded-as-failed cases is the only way to tell those apart and make the verifiers model-agnostic — which improves the packs for everyone, independent of any local-vs-cloud number.

So yes please to a small benchmark-only test key — we'll treat it strictly as a dev credential (env-only, never committed, packs only; no private/customer data, as you scoped). Easiest if you send it privately rather than in-thread. The public local-vs-cloud comparison still stays on the you-run → share-JSON → we-publish-the-method model, so that part remains vendor-neutral and reproducible by anyone.

Whenever you've got the two-arm reasonmath / bugfind JSON, drop it here too. And thanks — using a managed frontier model to stress-test the verifiers is a genuinely useful angle we hadn't set up. 🙏

noonghunna · 2026-06-30T01:41:46Z

noonghunna
Jun 30, 2026
Maintainer Author

This is great, Richard — and you ran it exactly right: deterministic-first, paced, retries on, cost-aware. Two caveats to set up an honest comparison, then the numbers.

Caveat 1 — model and endpoint differ. You ran Doubao-Seed-2.0-Code (a managed frontier code model); our local baseline is qwen3.6-27b. So gaps fold "different model" + "different deployment" together — useful for "where is each stronger," but not the same-model deployment isolation (for that, a qwen3.6-27b your hub serves would be apples-to-apples).

Caveat 2 — reasoning state, which we should have flagged in the recipe (our miss). benchlocal-cli sets thinking per-pack via each pack's default_thinking metadata. Of the six you ran, three default thinking-ON — instructfollow-15, reasonmath-15, and bugfind-15 — while toolcall-15, structoutput-15, and dataextract-15 default off. But it signals thinking via chat_template_kwargs.enable_thinking — a vLLM-side convention a managed endpoint likely ignores, using its own reasoning control instead. So your Doubao run's actual reasoning state is unknown, and it's probably not matched to how a local model would run those packs. Worth pinning down for the next round (more below).

Quality — the deterministic five (no-Docker packs):

Pack	Doubao-Seed-2.0-Code (managed)	qwen3.6-27b (local dual-3090)
toolcall-15	15	13–14
structoutput-15	14	14
instructfollow-15	13	13–15
dataextract-15	7	8–10
reasonmath-15	12	12–13
det-5 total	61/75	64–65/75
bugfind-15 (sandboxed)	12/15	12–14/15

(Our base qwen3.6-27b is benched as det/sandbox totals — det 64/75 on vllm/dual, 65/75 on vllm/dual-max; per-pack ranges are from the qwen3.6-27b compose family, which shares the base.)

Three things stand out:

They land in the same band. A local quantized 27B on two 3090s (~64–65) sits within n=15 sampling noise of a managed frontier code model (61) on the deterministic packs. The headline for local inference: you're not trading away capability to self-host.
dataextract is a shared floor. Doubao 7, our qwen 8–10 — a different model family hits the same wall (numeric typing / string-vs-number / invoice fields). That's the pack probing a real model-class weakness, not a quirk of either model — nice cross-model validation.
The latency tail lines up exactly with the thinking-on packs — likely CoT, not endpoint variance. Your p95 blows out only on reasonmath (~111s) and bugfind (~115s) while everything else is 13–24s — and those are precisely the two long-case packs that default thinking-on. So the tail reads as reasoning-token generation on hard cases, not shared-tenant queueing — which is exactly why nailing the reasoning state (caveat 2) matters before reading too much into latency.

For the next round — run both reasoning arms. Since thinking's effect is pack- and model-specific (in our own tests it's often a wash, sometimes a slight regression on strict-format packs), the clean approach is two arms and report the delta:

# all-OFF arm
benchlocal-cli run --endpoint … --model … --api-key … --pack reasonmath-15 --no-thinking    --save-json off-reasonmath.json
# all-ON arm
benchlocal-cli run --endpoint … --model … --api-key … --pack reasonmath-15 --enable-thinking --thinking-max-tokens 16384 --save-json on-reasonmath.json

Caveat 2 still applies — those flags steer a vLLM-style server cleanly, but against Doubao you'd want to set its native reasoning toggle to match each arm (otherwise the flag is a no-op and both arms are identical). reasonmath + bugfind are where you'd expect the biggest spread.

Where each is stronger, on this run: Doubao — excellent tool-calling, zero hardware/maintenance, scales on demand. Local qwen3.6-27b — slight dataextract edge, reasoning state fully under your control, full privacy/control, cost is just the power cap. Different strengths, no dominator — the framing you called from the start.

Then the heavy packs: hermesagent-20 / cli-40 / aider-polyglot-30 are where the latency tail and cost-per-completed-workflow really bite (your JSON already logs per-pack token counts, so cost-per-run falls straight out once you add price). Run those with the token ceiling when ready. Appreciate you closing the loop with real numbers — and the reasoning-state gap is on us; we'll add it to the README cloud recipe. 🙏

1 reply

richardchen874-sys Jun 30, 2026

Thanks for the detailed read — this is very helpful.

I agree with both caveats.

The Doubao run is useful as a managed-endpoint comparison, but it is not a clean same-model deployment isolation against local qwen3.6-27b. So I would treat this first pass as a “where each endpoint/model stack is stronger” comparison, not as a strict local-vs-cloud model-equivalence claim.

The reasoning-state point also makes sense. I saw the same latency pattern: reasonmath-15 and bugfind-15 were the main p95 outliers, while the non-reasoning/default-off packs stayed much tighter. That does suggest the tail is likely reasoning-token generation on harder cases rather than generic endpoint variance.

For the next round, I’ll first check whether this managed endpoint exposes a native reasoning / thinking control for Doubao-Seed-2.0-Code. If it does, I can run the cleaner two-arm comparison you suggested.

Also, if you want to keep a managed OpenAI-compatible endpoint as a standing cloud comparison target for benchlocal, I can provide a small free test credit / test key for this purpose.

The safest scope would be:

benchmark packs only
no private code
no sensitive data
no customer data
no payment needed for the test
token usage, latency, failures, and cost-per-completed-run fully visible

That way you can run the same packs from your side too, instead of only relying on the JSON files I upload.

I think that would make the local vs managed comparison more useful: not “local or cloud is globally better,” but where each stack is stronger, weaker, and more predictable under the same benchmark workflow.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🧪 Dual Carnice-V2-27B (Q8) on 2× 3090 — beellama MTP, q8_0 KV over kvarn6, base-parity quality #417

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments 8 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

🧪 Dual Carnice-V2-27B (Q8) on 2× 3090 — beellama MTP, q8_0 KV over kvarn6, base-parity quality #417

Uh oh!

noonghunna Jun 16, 2026 Maintainer

🎴 Results Card — 2× RTX 3090 PCIe (no NVLink), beellama v0.3.2, think-off

① Serving

② Quality — core 8-pack (/150)

③ Takeaways

Why dual Q8 Carnice?

Getting it

Run it

What'd help

Credits

Replies: 2 comments · 8 replies

Uh oh!

richardchen874-sys Jun 26, 2026

Uh oh!

noonghunna Jun 27, 2026 Maintainer Author

Uh oh!

noonghunna Jun 27, 2026 Maintainer Author

Uh oh!

richardchen874-sys Jun 29, 2026

Uh oh!

richardchen874-sys Jun 29, 2026

Uh oh!

noonghunna Jun 30, 2026 Maintainer Author

Uh oh!

Uh oh!

noonghunna Jun 30, 2026 Maintainer Author

Uh oh!

richardchen874-sys Jun 30, 2026

noonghunna
Jun 16, 2026
Maintainer

Replies: 2 comments 8 replies

richardchen874-sys
Jun 26, 2026

noonghunna Jun 27, 2026
Maintainer Author

noonghunna Jun 27, 2026
Maintainer Author

noonghunna Jun 30, 2026
Maintainer Author

noonghunna
Jun 30, 2026
Maintainer Author