🧠 Carnice-V2-27B on a single 3090 — a Hermes-agentic Qwen3.6 that edges the Qwopus-Coder sibling (🧪) #407

noonghunna · 2026-06-14T12:05:45Z

noonghunna
Jun 14, 2026
Maintainer

Adding Carnice-V2-27B (stuchapin's GGUF of kai-os/Carnice-V2-27b — a Hermes-style agentic SFT of Qwen3.6-27B) as a single-3090 model on beellama.cpp with KVarN. A Q5_K_M GGUF with an embedded MTP head, reasoning-on by default, behind an OpenAI-compatible endpoint.

This is the revival of the Carnice config asked about in #403 — note the dead link there pointed at the old, archived dual-vLLM AutoRound path. The supported path now is this single-card beellama one (Q5_K_M fits one 24 GB card at 160K, so it doesn't need dual).

Sibling of beellama/qwopus-coder — same engine + KVarN-4 KV — but where Qwopus is the coder, Carnice is the agentic-reasoning slot.

Status: 🧪 experimental (--force to launch). It rides the pre-release beellama KVarN build, so no production guarantee yet — stays experimental until Anbeeld tags a stable release.

Results Card — 1× RTX 3090

① Serving

Config	Spec-dec	KV / ctx	decode TPS (narr / code)	VRAM / card
kvarn4 / kvarn4 (default)	MTP n=1 (embedded)	kvarn4 / 160K	46.8 / 50.5	23.5 GB
kvarn4 / kvarn4 (n=2 opt-in)	MTP n=2 (embedded)	kvarn4 / 160K	~52 / ~56	23.5 GB

_{beellama v0.3.2-preview (KVarN, digest-pinned), -fa on, single 3090 sm_86, reasoning-on. 3 warm + 5 measured, temp 0.6 / top-k 20 / top-p 0.95. MTP acceptance ~94% @ n=1. Measured ceilings: 160K @ 23.5 GB (default, ~1 GB free), 176K @ 23.9 GB (runtime ceiling), 192K OOMs. NIAH recalled clean to 150K (verify-stress 8/8). Soak PASS: 0 growth, 0/100 silent-empty, 100.5% retention.}

② Quality — core 8-pack → /150 (reasoning-on)

pack (max)	Carnice-V2	Qwopus-Coder (ref)
toolcall-15	14	14
instructfollow-15	15	14
structoutput-15	14	14
dataextract-15	10	9
reasonmath-15	13	11
bugfind-15	12	14
hermesagent-20	13	10
cli-40	19	17
TOTAL / 150	110	103

_{Both kvarn4/kvarn4, reasoning-on, single run/arm — treat ≤±5/150 as noise. Carnice's +7 is concentrated where a Hermes-agentic SFT should win: HermesAgent 13 vs 10, InstructFollow 15, ReasonMath 13 (Qwopus edges bugfind). dataextract sits at 10 across both — a Qwen-family number-formatting trait, not the KV.}

③ Takeaways

It earns its slot — +7 over the Qwopus-Coder sibling at matched engine/KV/thinking, driven by agentic + instruction-following. Carnice is the single-card agentic-reasoning pick; Qwopus stays the coder.
The --spec-draft-n-max story — Carnice's MTP head is mtp_num_hidden_layers=1, so the author's rule is n=1 (n=3 cascades errors). The card also reports n=2 crashed on a 2×3090 — but on our single-card kvarn4 it doesn't crash, and runs ~12% faster (spec-dec is lossless, so the lower acceptance costs speed only, not quality). We ship n=1 as the safe default; n=2 is a one-env-var opt-in (DRAFT_N_MAX=2).
Engine note — the model card says these GGUFs "only load on llama.cpp PR-22673; mainline fails." That doesn't apply to beellama — it loads the fused MTP head cleanly (speculative decoding context initialized).

Requirements

1× RTX 3090 (24 GB, sm_86). 4090/5090 (sm_89/120) via the multi-arch build, compiled-not-validated here.
The KVarN engine build — beellama v0.3.2 preview, digest-pinned in the engine profile (v0.3.0/earlier reject kvarn* cache types). Launchers inject it automatically.

Getting it / Run it

WEIGHTS=carnice-v2 bash scripts/setup.sh qwen3.6-27b               # fetch the Q5_K_M MTP GGUF (~19 GB) + SHA-verify
bash scripts/switch.sh beellama/carnice-v2-single-q5km-mtp --force # serves :8068, reasoning-on
curl -s http://localhost:8068/v1/models | jq .

Reasoning-on by default (deepseek-format thoughts in reasoning_content, 4096-token budget). Flip REASONING=off for a fast no-think pass; DRAFT_N_MAX=2 for the +12% speed opt-in.

Credits

stuchapin — the Carnice-V2-27B MTP GGUFs + the --spec-draft-n-max 1 tuning finding.
kai-os — the Carnice-V2-27b fine-tune.
Anbeeld — beellama.cpp + KVarN (writeup & benchmarks). 🙌

noonghunna · 2026-06-14T19:51:10Z

noonghunna
Jun 14, 2026
Maintainer Author

Base Qwen3.6-27B reference — what does the fine-tune actually add?

Carnice-V2 and Qwopus-Coder are both SFTs of the same base, so here's the vanilla Qwen3.6-27B (unsloth Q5_K_S) on the same core 8-pack, reasoning-on, to make the fine-tune deltas visible:

pack (max)	Base Qwen3.6-27B	Carnice-V2	Qwopus-Coder
toolcall-15	14	14	14
instructfollow-15	14	15	14
structoutput-15	14	14	14
dataextract-15	10	10	9
reasonmath-15	13	13	11
bugfind-15	12	12	14
hermesagent-20	14	13	10
cli-40	22	19	17
TOTAL / 150	113	110	103

_{Base = unsloth Qwen3.6-27B Q5_K_S on beellama DFlash (q5_0/q4_1 KV), reasoning-on, single-run temp 0, from #239 (2026-05-30 grader). Carnice/Qwopus are kvarn4 KV; spec-dec is lossless on all three so the 8-pack tracks the underlying model — but KV + grader-date differ, so treat cross-column deltas ≤±5–7/150 as noise.}

Read: at reasoning-on the three sit inside the 8-pack noise band (base 113 / Carnice 110 / Qwopus 103) — the SFTs don't lift the total so much as redistribute it. Carnice's signal is instructfollow 15 + the agentic shape (hermesagent 13 vs Qwopus 10); Qwopus trades agentic for bugfind 14; base itself stays strong on the open-ended packs (cli-40 22, hermesagent 14). So the slot framing holds — Carnice = the agentic-reasoning pick, Qwopus = the coder — but measured against base it's a redistribution within the model's own ~100–113 cross-config 8-pack spread rather than a net headroom gain. A KV/grader-matched base run would tighten this; the column above is directional.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🧠 Carnice-V2-27B on a single 3090 — a Hermes-agentic Qwen3.6 that edges the Qwopus-Coder sibling (🧪) #407

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

🧠 Carnice-V2-27B on a single 3090 — a Hermes-agentic Qwen3.6 that edges the Qwopus-Coder sibling (🧪) #407

Uh oh!

noonghunna Jun 14, 2026 Maintainer

Results Card — 1× RTX 3090

① Serving

② Quality — core 8-pack → /150 (reasoning-on)

③ Takeaways

Requirements

Getting it / Run it

Credits

Replies: 1 comment

Uh oh!

noonghunna Jun 14, 2026 Maintainer Author

Base Qwen3.6-27B reference — what does the fine-tune actually add?

noonghunna
Jun 14, 2026
Maintainer

noonghunna
Jun 14, 2026
Maintainer Author