π§ Carnice-V2-27B on a single 3090 β a Hermes-agentic Qwen3.6 that edges the Qwopus-Coder sibling (π§ͺ) #407
Replies: 1 comment
-
Base Qwen3.6-27B reference β what does the fine-tune actually add?Carnice-V2 and Qwopus-Coder are both SFTs of the same base, so here's the vanilla Qwen3.6-27B (unsloth Q5_K_S) on the same core 8-pack, reasoning-on, to make the fine-tune deltas visible:
Base = unsloth Qwen3.6-27B Q5_K_S on beellama DFlash (q5_0/q4_1 KV), reasoning-on, single-run temp 0, from #239 (2026-05-30 grader). Carnice/Qwopus are kvarn4 KV; spec-dec is lossless on all three so the 8-pack tracks the underlying model β but KV + grader-date differ, so treat cross-column deltas β€Β±5β7/150 as noise. Read: at reasoning-on the three sit inside the 8-pack noise band (base 113 / Carnice 110 / Qwopus 103) β the SFTs don't lift the total so much as redistribute it. Carnice's signal is instructfollow 15 + the agentic shape (hermesagent 13 vs Qwopus 10); Qwopus trades agentic for bugfind 14; base itself stays strong on the open-ended packs (cli-40 22, hermesagent 14). So the slot framing holds β Carnice = the agentic-reasoning pick, Qwopus = the coder β but measured against base it's a redistribution within the model's own ~100β113 cross-config 8-pack spread rather than a net headroom gain. A KV/grader-matched base run would tighten this; the column above is directional. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Adding Carnice-V2-27B (stuchapin's GGUF of kai-os/Carnice-V2-27b β a Hermes-style agentic SFT of Qwen3.6-27B) as a single-3090 model on beellama.cpp with KVarN. A Q5_K_M GGUF with an embedded MTP head, reasoning-on by default, behind an OpenAI-compatible endpoint.
This is the revival of the Carnice config asked about in #403 β note the dead link there pointed at the old, archived dual-vLLM AutoRound path. The supported path now is this single-card beellama one (Q5_K_M fits one 24 GB card at 160K, so it doesn't need dual).
Sibling of
beellama/qwopus-coderβ same engine + KVarN-4 KV β but where Qwopus is the coder, Carnice is the agentic-reasoning slot.Status: π§ͺ experimental (
--forceto launch). It rides the pre-release beellama KVarN build, so no production guarantee yet β stays experimental until Anbeeld tags a stable release.Results Card β 1Γ RTX 3090
β Serving
beellama v0.3.2-preview (KVarN, digest-pinned),
-fa on, single 3090 sm_86, reasoning-on. 3 warm + 5 measured, temp 0.6 / top-k 20 / top-p 0.95. MTP acceptance ~94% @ n=1. Measured ceilings: 160K @ 23.5 GB (default, ~1 GB free), 176K @ 23.9 GB (runtime ceiling), 192K OOMs. NIAH recalled clean to 150K (verify-stress 8/8). Soak PASS: 0 growth, 0/100 silent-empty, 100.5% retention.β‘ Quality β core 8-pack β /150 (reasoning-on)
Both kvarn4/kvarn4, reasoning-on, single run/arm β treat β€Β±5/150 as noise. Carnice's +7 is concentrated where a Hermes-agentic SFT should win: HermesAgent 13 vs 10, InstructFollow 15, ReasonMath 13 (Qwopus edges bugfind). dataextract sits at 10 across both β a Qwen-family number-formatting trait, not the KV.
β’ Takeaways
--spec-draft-n-maxstory β Carnice's MTP head ismtp_num_hidden_layers=1, so the author's rule is n=1 (n=3 cascades errors). The card also reports n=2 crashed on a 2Γ3090 β but on our single-card kvarn4 it doesn't crash, and runs ~12% faster (spec-dec is lossless, so the lower acceptance costs speed only, not quality). We ship n=1 as the safe default; n=2 is a one-env-var opt-in (DRAFT_N_MAX=2).speculative decoding context initialized).Requirements
kvarn*cache types). Launchers inject it automatically.Getting it / Run it
Reasoning-on by default (deepseek-format thoughts in
reasoning_content, 4096-token budget). FlipREASONING=offfor a fast no-think pass;DRAFT_N_MAX=2for the +12% speed opt-in.Credits
--spec-draft-n-max 1tuning finding.Beta Was this translation helpful? Give feedback.
All reactions