Qwopus3.6-27B-Coder on a single 3090 - first KVarN compose 🧪 #392
Replies: 11 comments
-
No DFlash for this model?
-
Right, no DFlash for this one, and it's a model thing, not a preference. DFlash needs a separate, distribution-matched draft GGUF (the way
It's also the better fit on a single 24 GB card: the MTP head shares the model backbone (~1.5 GB, which is why MTP-on tops out at ~160K vs ~230K with it off), whereas DFlash would add a separate draft model with its own KV cache: more KV pressure, less context headroom. On a context-hungry coder where KVarN is already doing the heavy lifting, the lean embedded head wins.
Not anti-DFlash at all: it's our single-card default for the base 27B (output-lossless + tool-grammar-neutral). This model just ships MTP, and MTP suits the single-card context budget. If a DFlash drafter for the coder ever appears, happy to A/B it.
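To make the context-budget trade-off concrete, here is a rough back-of-envelope sketch. It uses only the rounded figures quoted above (~1.5 GB for the MTP head, ~160K vs ~230K context) and assumes, as a simplification, that KV memory grows linearly with context and that all memory freed by dropping MTP goes to KV. The function names are illustrative, not part of any tool in the stack.

```python
# Back-of-envelope only: all constants are rounded figures quoted in this thread.
MTP_COST_BYTES = 1.5e9   # "~1.5 GB" attributed to the MTP head
CTX_MTP_ON = 160_000     # "~160K" context ceiling with MTP on
CTX_MTP_OFF = 230_000    # "~230K" context ceiling with MTP off


def implied_kv_bytes_per_token(freed_bytes, ctx_without, ctx_with):
    """Bytes of KV cache that each extra token of context appears to cost.

    Assumes linear KV growth and that every freed byte becomes KV space.
    """
    return freed_bytes / (ctx_without - ctx_with)


def context_cost_tokens(extra_bytes, bytes_per_token):
    """Context (in tokens) given up to host `extra_bytes` of extra weights or cache."""
    return extra_bytes / bytes_per_token


per_token = implied_kv_bytes_per_token(MTP_COST_BYTES, CTX_MTP_OFF, CTX_MTP_ON)
print(f"implied KV cost: ~{per_token / 1024:.0f} KiB per token")
print(f"1.5 GB of extras costs ~{context_cost_tokens(1.5e9, per_token):,.0f} tokens of context")
```

The same arithmetic is why a separate drafter with its own KV cache would eat further into the window: any extra gigabyte resident on the card translates into tens of thousands of tokens of lost context at this per-token cost.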
-
I'm running Jackrong/Qwopus3.6-27B-v2-MTP-GGUF (non-Coder) with beellama 0.3.2 and DFlash doesn't seem to hurt TG at all. Ctx is 100k because I'm running on Windows with a display.
-
Useful data point, thanks. And fair, I'd soften "tanks acceptance." The thing I should've been precise about: spec-dec only proposes, and the target model verifies every token, so a mismatched drafter costs acceptance (speedup), not output quality; it can't make TG worse, it just speeds it up less. So "doesn't hurt TG" fits: the base-27B DFlash drafter is evidently close enough to the v2 fine-tune's distribution to keep acceptance usable. Good to know it holds on the v2. Two reasons it's still MTP for this model, though:
Genuinely curious what you're seeing, though: DFlash acceptance / actual TG speedup on the v2, and is it Anbeeld's base-27B drafter you paired? If the base drafter accelerates a fine-tune cleanly, that's a finding worth writing down.
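The "drafter proposes, target verifies" point can be illustrated with a toy greedy speculative decoder. This is a minimal sketch, not beellama's implementation: `target_next`, `good_draft` and `bad_draft` are made-up deterministic stand-ins for real models, verification is done token by token rather than in one batched forward pass, and only greedy decoding is covered (sampling needs a rejection-sampling step that is not shown here).

```python
def target_next(tokens):
    """Toy stand-in for the target model's greedy next token (hypothetical rule)."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def good_draft(tokens):
    """Drafter that always agrees with the target."""
    return target_next(tokens)


def bad_draft(tokens):
    """Mismatched drafter: wrong whenever the context length is a multiple of 3."""
    return (target_next(tokens) + (len(tokens) % 3 == 0)) % 50


def greedy_decode(prompt, n_new):
    """Plain greedy decoding with the target only (the reference output)."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(prompt, draft_next, n_new, k=4):
    """Greedy speculative decoding: returns (new tokens, draft acceptance rate)."""
    tokens = list(prompt)
    accepted = proposed = 0
    while len(tokens) - len(prompt) < n_new:
        # 1) The drafter proposes k tokens autoregressively.
        ctx, proposal = list(tokens), []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        proposed += k
        # 2) The target checks each proposal; the first mismatch is replaced
        #    by the target's own token and the rest of the draft is dropped.
        for t in proposal:
            want = target_next(tokens)
            if t == want:
                tokens.append(t)
                accepted += 1
            else:
                tokens.append(want)
                break
        else:
            # All k accepted: the target contributes one bonus token.
            tokens.append(target_next(tokens))
    return tokens[len(prompt):len(prompt) + n_new], accepted / proposed


prompt = [1, 2, 3]
reference = greedy_decode(prompt, 40)
out_good, acc_good = speculative_decode(prompt, good_draft, 40)
out_bad, acc_bad = speculative_decode(prompt, bad_draft, 40)
print(out_good == reference, out_bad == reference)
print(f"acceptance: good={acc_good:.2f} bad={acc_bad:.2f}")
```

Both drafters yield the same tokens as plain greedy decoding; only the acceptance rate differs, which is what determines how many target passes are saved.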
-
Yes, I just use the base-27B drafter from Anbeeld. I haven't benchmarked it, but I usually get 70-100 tps (mixed chat and code). With MTP, I get 50-60. Agreed that 100k ctx is a bit tight, so I'm only using it for light tasks for now. Still waiting on components for a proper dual-card build.
-
Yep, with DFlash the draft acceptance is really bad: only 21%. 58 tps mixed code + chat. Q5_K_S, 131k ctx (without MTP), 22/24 GB VRAM.
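For a feel of what a 21% acceptance rate means, the sketch below uses the standard geometric estimate of tokens emitted per target verification pass. It assumes the reported figure is a per-token acceptance probability and that each drafted token is accepted independently; the draft lengths tried are hypothetical, since the thread does not state the one in use, and the time spent running the drafter is ignored.

```python
def expected_tokens_per_step(acceptance, draft_len):
    """Expected tokens emitted per target verification pass.

    Assumes each drafted token is accepted independently with probability
    `acceptance`; the pass always yields at least one token from the target.
    """
    if acceptance >= 1.0:
        return draft_len + 1.0
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)


# Draft lengths here are illustrative, not taken from the thread.
for k in (2, 4, 8):
    print(k, round(expected_tokens_per_step(0.21, k), 3))
```

At low acceptance the estimate barely moves with draft length, so longer drafts mostly add drafter cost without saving target passes.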
-
I'm testing this with OpenCode and it gets stuck in a loop:
-
Thanks for the log. That's enough to see it's not a generation loop; it's the harness reprocessing your whole context every turn. The tell repeats three times:
At ~150K, KVarN's windowed KV can't reuse the prompt cache across OpenCode's resent context, so each turn erases the checkpoints and reprocesses the full ~150K from scratch (~2-4 min at your prefill rate). OpenCode then hits its own timeout and cancels mid-reprocess (
Two things would confirm and likely fix it:
If you can also share whether the card is headless or driving a display, that helps: a desktop-shared 3090 loses a few GB, which tightens this further.
-
Rig report to follow below. Btw, I got a "no safetensors" message when I ran the setup script. Not sure if this matters? See below.
WEIGHTS=qwopus-coder bash scripts/setup.sh qwen3.6-27b
Setup root: /opt/ai/compose/club-3090
-
scripts/report.sh --full
club-3090 rig report
Generated: 2026-06-16 01:27:13 UTC
Redacted output (paths, host, user, tokens). Re-run with
System
CPU + RAM
Disk
GPU hardware
NVLink: No NVLink detected (PCIe-only)
Topology: PCIe / GPU topology matrix
PCIe / P2P detail (lspci): lspci PCIe/P2P detail (LnkSta / ACS / topology)
Full nvidia-smi: Full nvidia-smi output
Display / desktop state
Container runtime
Stack version
Profile state
KV math calibration
Full kv-calc --calibration output
Active container
Container Python / CUDA versions
Boot log highlights
Full boot log (first 200 lines): First 200 lines of docker logs
Recent failed boot attempts: Exited vLLM/llama.cpp containers exist but all >24h old, likely not relevant to current investigation.
verify-full.sh output: verify-full output
verify-stress.sh output: verify-stress output (7 boundary checks incl. Cliff 2 needle recall)
soak-test.sh (SOAK_MODE=continuous) output: soak-test stdout (5-session × 5-turn ramping conversation, ~25 min)
bench.sh output: bench output (3 warmups + 5 measured per prompt)
bench-agentic.sh output: bench-agentic output (1 session x 12 default turns, curve-shape estimate; ~8 min estimate)
Generated by
-
Both things are good news: nothing's actually wrong with your setup.
The "No .safetensors found" message is a false alarm, not a failed download. Your GGUF downloaded fine, the
Thanks for the rig report: it rules hardware out of the loop. Three reads:
So the stuck behaviour is exactly the cache-invalidation reprocess from before, confirmed not hardware: at ~150K, KVarN's windowed KV can't reuse the prompt cache, OpenCode resends its growing context each turn → beellama reprocesses the full ~150K from scratch (~4 min at your ~640 t/s prefill) → OpenCode hits its own timeout, cancels, resubmits → reprocess → cancel → reprocess. Looks stuck; it's grinding. The fix is to keep the working context well under the ~160K MTP-on ceiling so each reprocess finishes before OpenCode's cancel timeout:
If 96K still loops, share the OpenCode timeout setting and I'll help size it, but I'd bet capping the resent context clears it.
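The sizing argument is simple arithmetic, sketched here under stated assumptions: prefill runs at a constant ~640 t/s (the figure quoted above), nothing is reused from the prompt cache, and the client timeout is a hypothetical 180 s, since OpenCode's actual setting is not given in this thread.

```python
def reprocess_seconds(context_tokens, prefill_tokens_per_s):
    """Time to re-prefill the whole context from scratch (no cache reuse)."""
    return context_tokens / prefill_tokens_per_s


def max_context_for_timeout(timeout_s, prefill_tokens_per_s, safety_pct=80):
    """Largest context (tokens) whose full reprocess fits inside the timeout,
    keeping (100 - safety_pct)% of the timeout as headroom."""
    return timeout_s * prefill_tokens_per_s * safety_pct // 100


PREFILL_TPS = 640            # "~640 t/s prefill" from the rig report discussion
HYPOTHETICAL_TIMEOUT_S = 180  # placeholder: the real OpenCode timeout is unknown

for ctx in (150_000, 96_000):
    secs = reprocess_seconds(ctx, PREFILL_TPS)
    verdict = "fits" if secs < HYPOTHETICAL_TIMEOUT_S else "times out"
    print(f"{ctx:>7} tokens -> {secs:5.0f} s ({secs / 60:.1f} min): {verdict}")

print("context cap for this timeout:", max_context_for_timeout(HYPOTHETICAL_TIMEOUT_S, PREFILL_TPS))
```

With the real timeout plugged in, the cap from `max_context_for_timeout` is the number to compare against the resent context size.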
-
Adding Qwopus3.6-27B-Coder (Jackrong's coder fine-tune of Qwen3.6-27B) as a single-3090 coding model on beellama.cpp, and it's our first compose to use KVarN, Anbeeld's new KV-cache compression (beellama v0.3.2 preview). A Q5_K_M GGUF with an embedded MTP head, served behind an OpenAI-compatible endpoint.
The headline: kvarn4 KV is quality- and decode-neutral vs q5_0/q4_1, and buys ~1.2× context, so on one 24 GB card you get a large coding window at full q5-class fidelity, with spec-dec.
Status: 🧪 experimental (--force to launch). It rides a pre-release beellama build (KVarN is preview), so no production guarantee yet; it stays experimental until Anbeeld tags a stable release.
Results Card: 1× RTX 3090
① Serving
beellama v0.3.2-preview (KVarN, digest-pinned), -fa on, FA_ALL_QUANTS, single 3090 sm_86. 3 warm + 5 measured, temp 0.6 / top-k 20 / top-p 1.0. Needle recalled clean at ~72K depth (= the q5_0/q4_1 control). MTP costs ~1.5 GB, so MTP-on tops out at ~160K; drop MTP (env opt-in in the compose) for the ~230K window at ~35 TPS. Full 262K fits only on kvarn2 (2-bit, too low for code).
② Quality: core 8-pack, /150
Single run/arm: treat ≤±5/150 as noise. off = thinking-off, on = thinking-on. kvarn4 vs q5_0/q4_1 lands within ±2-4 either way: quality-neutral. (dataextract is low across all four: a model number-formatting trait, not the KV.)
③ Takeaways
Requirements
kvarn* cache types). The launchers inject it automatically.
Getting it / Run it
Big-context mode (~230K, MTP off) is a documented env-toggle in the compose header. Refine-by-reply, OpenAI-compatible, thinking-off by default.
Credits