Qwopus3.6-27B-Coder on a single 3090 — first KVarN compose 🧪 #392

noonghunna (Maintainer) · 2026-06-12T22:30:37Z

Adding Qwopus3.6-27B-Coder (Jackrong's coder fine-tune of Qwen3.6-27B) as a single-3090 coding model on beellama.cpp — and it's our first compose to use KVarN, Anbeeld's new KV-cache compression (beellama v0.3.2 preview). A Q5_K_M GGUF with an embedded MTP head, served behind an OpenAI-compatible endpoint.

New to KVarN? Anbeeld's writeup — KVarN KV Cache: implementation and benchmarks — is the place to start (the PPL/KLD analysis + per-level quality tiers).

The headline: kvarn4 KV is quality- and decode-neutral vs q5_0/q4_1, and buys ~1.2× context — so on one 24 GB card you get a large coding window at full q5-class fidelity, with spec-dec.

Status: 🧪 experimental (--force to launch). It rides a pre-release beellama build (KVarN is preview), so no production guarantee yet — it stays experimental until Anbeeld tags a stable release.

Results Card — 1× RTX 3090

① Serving

| Config | Spec-dec | KV / ctx | Decode TPS (narr / code) | VRAM / card |
|---|---|---|---|---|
| kvarn4 / kvarn4 (default) | MTP (embedded) | kvarn4 / 160K | 45.6 / 57.6 | 23.3 GB |
| q5_0 / q4_1 (reference) | MTP (embedded) | q5_0+q4_1 / 160K | 46.9 / 58.7 | 23.8 GB |
| kvarn4 / kvarn4 (big-ctx) | none | kvarn4 / 230K | ~35 (code) | 23.5 GB |

beellama v0.3.2-preview (KVarN, digest-pinned), -fa on, FA_ALL_QUANTS, single 3090 sm_86. 3 warm + 5 measured, temp 0.6 / top-k 20 / top-p 1.0. Needle recalled clean at ~72K depth (= the q5_0/q4_1 control). MTP costs ~1.5 GB, so MTP-on tops out ~160K; drop MTP (env opt-in in the compose) for the ~230K window at ~35 TPS. Full 262K fits only on kvarn2 (2-bit, too low for code).

② Quality — core 8-pack → /150

| Pack (max) | kvarn4 off | kvarn4 on | q5_0/q4_1 off | q5_0/q4_1 on |
|---|---|---|---|---|
| toolcall-15 | 14 | 14 | 14 | 15 |
| instructfollow-15 | 13 | 14 | 12 | 15 |
| structoutput-15 | 14 | 14 | 14 | 14 |
| dataextract-15 | 10 | 9 | 9 | 8 |
| reasonmath-15 | 12 | 11 | 12 | 12 |
| bugfind-15 | 12 | 14 | 12 | 14 |
| hermesagent-20 | 10 | 10 | 10 | 9 |
| cli-40 | 19 | 17 | 19 | 20 |
| TOTAL / 150 | 104 | 103 | 102 | 107 |

Single run/arm: treat ≤±5/150 as noise. off = thinking-off, on = thinking-on. kvarn4 vs q5_0/q4_1 lands within ±2–4 either way → quality-neutral. (dataextract is low across all four: a model number-formatting trait, not the KV.)
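The noise-band reading of the table can be checked with a few lines of Python. This is only a sketch: the scores are copied from the table above, and the `NOISE` constant is the ±5/150 band stated in the note, not a statistically derived interval.

```python
# Scores copied from the quality table; column order is
# (kvarn4 off, kvarn4 on, q5_0/q4_1 off, q5_0/q4_1 on).
SCORES = {
    "toolcall-15":       (14, 14, 14, 15),
    "instructfollow-15": (13, 14, 12, 15),
    "structoutput-15":   (14, 14, 14, 14),
    "dataextract-15":    (10,  9,  9,  8),
    "reasonmath-15":     (12, 11, 12, 12),
    "bugfind-15":        (12, 14, 12, 14),
    "hermesagent-20":    (10, 10, 10,  9),
    "cli-40":            (19, 17, 19, 20),
}

NOISE = 5  # single-run noise band from the note above, in points out of 150

# Sum each column to reproduce the TOTAL row.
totals = [sum(column) for column in zip(*SCORES.values())]
kv_off, kv_on, ref_off, ref_on = totals

# Compare kvarn4 against the q5_0/q4_1 reference within each thinking mode.
deltas = {"thinking-off": kv_off - ref_off, "thinking-on": kv_on - ref_on}

for arm, delta in deltas.items():
    verdict = "within noise" if abs(delta) <= NOISE else "outside noise"
    print(f"{arm}: kvarn4 - reference = {delta:+d} ({verdict})")
```

Both deltas fall inside the band, which is the basis for calling the result quality-neutral; with a single run per arm it does not establish more than that.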

③ Takeaways

KVarN-4 is free here — quality-neutral (104/103 ≈ 102/107) and decode-neutral (57.6 ≈ 58.7 code TPS) vs q5_0/q4_1, at less memory. Confirms Anbeeld's KLD "kvarn4 = q5-class" on real tasks (disc #329).
More context for free — kvarn4 lifts the single-card ceiling ~196K → ~230K at q5-class fidelity.
MTP vs context — keep MTP for speed (~160K, ~57 code TPS) or drop it for the big window (~230K, ~35 TPS). Both in the compose.
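For a rough sense of the two trade-offs, the ratios can be worked out directly from the figures quoted above. A back-of-envelope sketch: the variable names are illustrative, and the inputs are the approximate (~) values from this post, so the outputs are approximate too.

```python
# Approximate figures quoted in the post (context in thousands of tokens).
CEILING_BEFORE_K = 196   # single-card ceiling before kvarn4
CEILING_AFTER_K = 230    # single-card ceiling with kvarn4, MTP off
MTP_ON_CTX_K = 160       # context window with MTP enabled
MTP_ON_CODE_TPS = 57     # code decode speed with MTP
MTP_OFF_CODE_TPS = 35    # code decode speed without MTP

# How much further kvarn4 pushes the single-card context ceiling.
ceiling_gain = CEILING_AFTER_K / CEILING_BEFORE_K

# What dropping MTP buys in context, and what it costs in decode speed.
mtp_off_ctx_gain = CEILING_AFTER_K / MTP_ON_CTX_K
mtp_off_speed_ratio = MTP_OFF_CODE_TPS / MTP_ON_CODE_TPS

print(f"kvarn4 ceiling gain:   {ceiling_gain:.2f}x")
print(f"MTP off, context gain: {mtp_off_ctx_gain:.2f}x")
print(f"MTP off, code TPS:     {mtp_off_speed_ratio:.0%} of MTP-on speed")
```

Read this way, turning MTP off trades roughly 40% of code decode speed for roughly 44% more context.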

Requirements

1× RTX 3090 (24 GB, sm_86). 4090/5090 (sm_89/120) should work via the multi-arch build but are compiled-not-validated here.
The KVarN engine build — beellama v0.3.2 preview, digest-pinned in the engine profile (the v0.3.0/earlier images reject kvarn* cache types). The launchers inject it automatically.

Getting it / Run it

WEIGHTS=qwopus-coder bash scripts/setup.sh qwen3.6-27b   # fetch the Q5_K_M MTP GGUF (~18 GB) + SHA-verify
bash scripts/switch.sh beellama/qwopus-coder --force      # serves :8067, model: Qwopus3.6-27B-Coder-MTP-Q5_K_M.gguf
curl -s http://localhost:8067/v1/models | jq .

Big-context mode (~230K, MTP off) is a documented env-toggle in the compose header. Refine-by-reply, OpenAI-compatible, thinking-off by default.
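Once the server answers on `/v1/models`, a chat request can be sent from Python with the standard library alone. This is a minimal sketch, assuming the server exposes the usual OpenAI-style `/v1/chat/completions` route (the post only shows `/v1/models`); the helper names `build_chat_request` and `chat` are made up for illustration, and the sampling values are borrowed from the benchmark settings above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8067/v1"           # port from the switch.sh line above
MODEL = "Qwopus3.6-27B-Coder-MTP-Q5_K_M.gguf"   # model name from the switch.sh line


def build_chat_request(prompt, temperature=0.6, max_tokens=512):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",  # assumed route, see lead-in
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    # Long timeout: prompt processing on a large context can take minutes.
    with urllib.request.urlopen(build_chat_request(prompt), timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the container running, `print(chat("Write a function that reverses a linked list."))` should return the model's answer.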

Credits

Jackrong — the Qwopus3.6-27B-Coder fine-tune + the MTP GGUFs.
Anbeeld — beellama.cpp + KVarN (writeup & benchmarks). 🙌

henrykrinkle01 · 2026-06-13T00:41:24Z

No dflash for this model?

noonghunna (Maintainer, Author) · 2026-06-13T00:49:15Z

Right — no DFlash for this one, and it's a model thing, not a preference.

DFlash needs a separate, distribution-matched draft GGUF (the way beellama/dflash pairs the base Qwen3.6-27B with Anbeeld's DFlash drafter). The Qwopus coder is a fine-tune and there's no DFlash drafter trained for it — pointing the base-27B drafter at a fine-tune mismatches the target distribution and tanks acceptance. What Jackrong did ship is an embedded MTP head baked into the GGUF, so MTP is the in-the-box, matched spec path here, no extra download.

It's also the better fit on a single 24 GB card: the MTP head shares the model backbone (~1.5 GB — that's why MTP-on tops out ~160K vs ~230K with it off), whereas DFlash would add a separate draft model with its own KV cache → more KV pressure, less context headroom. On a context-hungry coder where KVarN is already doing the heavy lifting, the lean embedded head wins.

Not anti-DFlash at all — it's our single-card default for the base 27B (output-lossless + tool-grammar-neutral). This model just ships MTP, and MTP suits the single-card context budget. If a DFlash drafter for the coder ever appears, happy to A/B it.

henrykrinkle01 · 2026-06-13T01:06:30Z

I'm running Jackrong/Qwopus3.6-27B-v2-MTP-GGUF (non-Coder) with beellama 0.3.2 and DFlash doesn't seem to hurt TG at all. Ctx is 100K because I'm running on Windows with a display attached.

noonghunna (Maintainer, Author) · 2026-06-13T10:44:47Z

Useful data point, thanks. And fair, I'd soften "tanks acceptance." The thing I should've been precise about: spec-dec only proposes, and the target model verifies every token, so a mismatched drafter costs acceptance (speedup), not output quality; it can't degrade what the model generates, it only shrinks the speedup. So "doesn't hurt TG" fits: the base-27B DFlash drafter is evidently close enough to the v2 fine-tune's distribution to keep acceptance usable. Good to know it holds on the v2.
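The propose/verify point can be illustrated with a toy greedy decoder. Everything here is made up for illustration: `target_next`, `good_drafter`, and `bad_drafter` are stand-in functions, not real models, and this is not how beellama implements MTP or DFlash. Real engines verify the whole draft in one batched forward pass, and sampled (non-greedy) decoding uses a rejection-sampling rule instead of exact matching.

```python
def target_next(tokens):
    """Toy deterministic 'target model': next token depends on the context."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def good_drafter(tokens):
    """A perfectly matched drafter: always agrees with the target."""
    return target_next(tokens)


def bad_drafter(tokens):
    """A mismatched drafter: rarely agrees with the target."""
    return (tokens[-1] + 1) % 50


def plain_decode(prompt, n):
    """Greedy decoding with the target model only."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n, drafter, k=3):
    """Greedy speculative decoding: draft k tokens, keep the verified prefix."""
    out = list(prompt)
    proposed = accepted = 0
    while len(out) - len(prompt) < n:
        # Drafter proposes k tokens autoregressively.
        draft, ctx = [], list(out)
        for _ in range(k):
            token = drafter(ctx)
            draft.append(token)
            ctx.append(token)
        proposed += k
        # Target checks each proposal; the first mismatch is replaced by the
        # target's own token and the rest of the draft is discarded.
        for token in draft:
            expected = target_next(out)
            if token != expected:
                out.append(expected)
                break
            out.append(token)
            accepted += 1
        else:
            out.append(target_next(out))  # all accepted: one bonus target token
    return out[len(prompt):len(prompt) + n], accepted / proposed


prompt = [1, 2, 3]
baseline = plain_decode(prompt, 40)
good_out, good_rate = speculative_decode(prompt, 40, good_drafter)
bad_out, bad_rate = speculative_decode(prompt, 40, bad_drafter)
print(good_out == baseline, bad_out == baseline)  # same text either way
print(f"acceptance: good={good_rate:.2f} bad={bad_rate:.2f}")
```

Every emitted token is either a draft token the target agreed with or the target's own choice, so the output matches target-only decoding regardless of drafter quality; only the acceptance rate, and with it the speedup, changes.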

Two reasons it's still MTP for this model, though:

1. Different fine-tune: you're on Qwopus3.6-27B-v2 (non-Coder); this thread (#392) is about the Coder. No Coder-matched DFlash drafter exists either, and the Coder GGUF ships an embedded MTP head, so MTP is the in-the-box path there too.
2. The context cost is real even when TG isn't: your 100K is mostly Windows + display, agreed, but DFlash's separate drafter + its own KV stacks on top of that. Headless on a 24 GB card the Coder gets ~160K with MTP / ~230K spec-off; the lean embedded head is what buys that headroom. On a single tight card, context is the thing we're protecting.

Genuinely curious what you're seeing, though — DFlash acceptance / actual TG speedup on the v2, and is it Anbeeld's base-27B drafter you paired? If the base drafter accelerates a fine-tune cleanly, that's a finding worth writing down.

henrykrinkle01 · 2026-06-13T13:36:28Z

Yes, I just use the base-27B drafter from Anbeeld. I haven't benchmarked it or anything, but I usually get 70-100 tps (mixed chat and code). With MTP, I get 50-60. Agreed that 100K ctx is a bit tight, so I'm just temporarily using it for light tasks. Still waiting for components for a proper dual-card build.

henrykrinkle01 · 2026-06-13T17:28:45Z

Yep, with DFlash the draft acceptance is really bad: only 21%. 58 tps on mixed code + chat. Q5_K_S, 131K ctx (without MTP), 22/24 GB VRAM.
Still faster than MTP, though: same prompt, 3K output, 50 tps with MTP.

sghomelab · 2026-06-15T18:50:08Z

I'm testing this with OpenCode and it gets stuck in a loop:
beellama-qwopus-coder | 692.50.209.840 I slot print_timing: id 0 | task 105398 | prompt processing, n_tokens = 131072, progress = 0.87, t = 197.66 s / 663.12 tokens per second
beellama-qwopus-coder | 692.54.764.142 I slot print_timing: id 0 | task 105398 | prompt processing, n_tokens = 133120, progress = 0.89, t = 202.21 s / 658.31 tokens per second
beellama-qwopus-coder | 692.59.360.404 I slot print_timing: id 0 | task 105398 | prompt processing, n_tokens = 135168, progress = 0.90, t = 206.81 s / 653.58 tokens per second
beellama-qwopus-coder | 693.03.997.603 I slot print_timing: id 0 | task 105398 | prompt processing, n_tokens = 137216, progress = 0.91, t = 211.45 s / 648.94 tokens per second
beellama-qwopus-coder | 693.08.688.083 I slot print_timing: id 0 | task 105398 | prompt processing, n_tokens = 139264, progress = 0.93, t = 216.14 s / 644.33 tokens per second
beellama-qwopus-coder | 693.13.428.598 I slot print_timing: id 0 | task 105398 | prompt processing, n_tokens = 141312, progress = 0.94, t = 220.88 s / 639.77 tokens per second
beellama-qwopus-coder | 693.18.220.345 I slot print_timing: id 0 | task 105398 | prompt processing, n_tokens = 143360, progress = 0.95, t = 225.67 s / 635.26 tokens per second
beellama-qwopus-coder | 693.23.051.938 I slot print_timing: id 0 | task 105398 | prompt processing, n_tokens = 145408, progress = 0.97, t = 230.50 s / 630.83 tokens per second
beellama-qwopus-coder | 693.27.928.077 I slot print_timing: id 0 | task 105398 | prompt processing, n_tokens = 147456, progress = 0.98, t = 235.38 s / 626.46 tokens per second
beellama-qwopus-coder | 693.32.847.791 I slot print_timing: id 0 | task 105398 | prompt processing, n_tokens = 149504, progress = 0.99, t = 240.30 s / 622.16 tokens per second
beellama-qwopus-coder | 693.33.603.614 I slot print_timing: id 0 | task 105398 | prompt processing, n_tokens = 149804, progress = 1.00, t = 241.05 s / 621.46 tokens per second
beellama-qwopus-coder | 693.34.799.928 I slot print_timing: id 0 | task 105398 | prompt processing, n_tokens = 150275, progress = 1.00, t = 242.25 s / 620.33 tokens per second
beellama-qwopus-coder | 693.35.892.932 I slot create_check: id 0 | task 105398 | created context checkpoint 1 of 32 (pos_min = 150274, pos_max = 150274, n_tokens = 150275, size = 3336.684 MiB)
beellama-qwopus-coder | 693.36.129.009 I slot print_timing: id 0 | task 105398 | prompt processing, n_tokens = 150316, progress = 1.00, t = 243.58 s / 617.11 tokens per second
beellama-qwopus-coder | 693.36.468.300 I slot update_slots: id 0 | task 105398 | accepted 3/ 3 draft tokens
beellama-qwopus-coder | 693.36.638.842 I slot update_slots: id 0 | task 105398 | accepted 3/ 3 draft tokens
beellama-qwopus-coder | 693.36.807.048 I slot update_slots: id 0 | task 105398 | accepted 3/ 3 draft tokens
beellama-qwopus-coder | 693.36.807.163 W slot process_toke: id 0 | task 105398 | raw tool marker observed while lazy grammar is enabled; keeping DFlash governed by active grammar boundary in_reasoning=0 n_decoded=12 reasoning_tokens=0 visible_tokens=13
beellama-qwopus-coder | 693.49.224.921 I slot print_timing: id 0 | task 105398 | n_decoded = 100, tg = 7.74 t/s
beellama-qwopus-coder | 693.50.793.877 I slot print_timing: id 0 | task 105398 | prompt eval time = 243748.57 ms / 150320 tokens ( 1.62 ms per token, 616.70 tokens per second)
beellama-qwopus-coder | 693.50.793.880 I slot print_timing: id 0 | task 105398 | eval time = 14495.29 ms / 111 tokens ( 130.59 ms per token, 7.66 tokens per second)
beellama-qwopus-coder | 693.50.793.880 I slot print_timing: id 0 | task 105398 | total time = 258243.86 ms / 150431 tokens
beellama-qwopus-coder | 693.50.793.881 I slot print_timing: id 0 | task 105398 | graphs reused = 95745
beellama-qwopus-coder | 693.50.793.882 I slot print_timing: id 0 | task 105398 | draft acceptance = 1.00000 ( 9 accepted / 9 generated)
beellama-qwopus-coder | 693.50.793.897 I statistics draft-mtp: #calls(b,g,a) = 484 8289 8289, #gen drafts = 8289, #acc drafts = 7067, #gen tokens = 24867, #acc tokens = 18276, dur(b,g,a) = 0.637, 60104.089, 7.385 ms
beellama-qwopus-coder | 693.50.796.230 I slot release: id 0 | task 105398 | stop processing: n_tokens = 150430, truncated = 0
beellama-qwopus-coder | 693.50.796.363 I srv update_slots: all slots are idle
beellama-qwopus-coder | 693.51.241.324 I srv params_from_: Chat format: peg-native
beellama-qwopus-coder | 693.51.330.122 I slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.996 (> 0.100 thold), f_keep = 1.000
beellama-qwopus-coder | 693.51.330.732 I reasoning-budget: activated, budget=2147483647 tokens
beellama-qwopus-coder | 693.51.330.733 I reasoning-budget: deactivated (natural end)
beellama-qwopus-coder | 693.51.330.785 I slot launch_slot_: id 0 | task 105584 | processing task, is_child = 0
beellama-qwopus-coder | 693.53.012.865 I slot create_check: id 0 | task 105584 | created context checkpoint 2 of 32 (pos_min = 150564, pos_max = 150564, n_tokens = 150565, size = 3342.202 MiB)
beellama-qwopus-coder | 693.55.311.131 I slot create_check: id 0 | task 105584 | created context checkpoint 3 of 32 (pos_min = 151076, pos_max = 151076, n_tokens = 151077, size = 3352.972 MiB)
beellama-qwopus-coder | 693.55.640.327 I slot update_slots: id 0 | task 105584 | accepted 3/ 3 draft tokens
beellama-qwopus-coder | 693.55.814.478 I slot update_slots: id 0 | task 105584 | accepted 3/ 3 draft tokens
beellama-qwopus-coder | 693.55.980.759 I slot update_slots: id 0 | task 105584 | accepted 3/ 3 draft tokens
beellama-qwopus-coder | 693.56.147.177 I slot update_slots: id 0 | task 105584 | accepted 3/ 3 draft tokens
beellama-qwopus-coder | 693.56.317.761 I slot update_slots: id 0 | task 105584 | accepted 3/ 3 draft tokens
beellama-qwopus-coder | 693.56.486.007 I slot update_slots: id 0 | task 105584 | accepted 3/ 3 draft tokens
beellama-qwopus-coder | 693.56.653.699 I slot update_slots: id 0 | task 105584 | accepted 3/ 3 draft tokens
beellama-qwopus-coder | 693.56.820.245 I slot update_slots: id 0 | task 105584 | accepted 3/ 3 draft tokens
beellama-qwopus-coder | 693.56.820.361 W slot process_toke: id 0 | task 105584 | raw tool marker observed while lazy grammar is enabled; keeping DFlash governed by active grammar boundary in_reasoning=0 n_decoded=32 reasoning_tokens=0 visible_tokens=33
beellama-qwopus-coder | 694.06.424.445 I slot print_timing: id 0 | task 105584 | n_decoded = 100, tg = 9.13 t/s
beellama-qwopus-coder | 694.09.429.903 I slot print_timing: id 0 | task 105584 | n_decoded = 121, tg = 8.67 t/s
beellama-qwopus-coder | 694.12.435.748 I slot print_timing: id 0 | task 105584 | n_decoded = 142, tg = 8.37 t/s
beellama-qwopus-coder | 694.15.441.237 I slot print_timing: id 0 | task 105584 | n_decoded = 163, tg = 8.16 t/s
beellama-qwopus-coder | 694.18.447.505 I slot print_timing: id 0 | task 105584 | n_decoded = 184, tg = 8.01 t/s
beellama-qwopus-coder | 694.21.453.984 I slot print_timing: id 0 | task 105584 | n_decoded = 205, tg = 7.89 t/s
beellama-qwopus-coder | 694.24.477.521 I slot print_timing: id 0 | task 105584 | n_decoded = 226, tg = 7.79 t/s
beellama-qwopus-coder | 694.27.488.730 I slot print_timing: id 0 | task 105584 | n_decoded = 247, tg = 7.71 t/s
beellama-qwopus-coder | 694.30.499.696 I slot print_timing: id 0 | task 105584 | prompt eval time = 4139.70 ms / 651 tokens ( 6.36 ms per token, 157.26 tokens per second)
beellama-qwopus-coder | 694.30.499.700 I slot print_timing: id 0 | task 105584 | eval time = 35029.19 ms / 268 tokens ( 130.71 ms per token, 7.65 tokens per second)
beellama-qwopus-coder | 694.30.499.701 I slot print_timing: id 0 | task 105584 | total time = 39168.89 ms / 919 tokens
beellama-qwopus-coder | 694.30.499.702 I slot print_timing: id 0 | task 105584 | graphs reused = 95985
beellama-qwopus-coder | 694.30.499.703 I slot print_timing: id 0 | task 105584 | draft acceptance = 1.00000 ( 24 accepted / 24 generated)
beellama-qwopus-coder | 694.30.499.721 I statistics draft-mtp: #calls(b,g,a) = 485 8297 8297, #gen drafts = 8297, #acc drafts = 7075, #gen tokens = 24891, #acc tokens = 18300, dur(b,g,a) = 0.639, 60172.241, 7.393 ms
beellama-qwopus-coder | 694.30.502.024 I slot release: id 0 | task 105584 | stop processing: n_tokens = 151348, truncated = 0
beellama-qwopus-coder | 694.30.502.073 I srv update_slots: all slots are idle
beellama-qwopus-coder | 695.10.782.474 I srv params_from_: Chat format: peg-native
beellama-qwopus-coder | 695.10.874.897 I slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.997 (> 0.100 thold), f_keep = 1.000
beellama-qwopus-coder | 695.10.875.838 I reasoning-budget: activated, budget=2147483647 tokens
beellama-qwopus-coder | 695.10.875.839 I reasoning-budget: deactivated (natural end)
beellama-qwopus-coder | 695.10.875.891 I slot launch_slot_: id 0 | task 105832 | processing task, is_child = 0
beellama-qwopus-coder | 695.12.113.682 I slot create_check: id 0 | task 105832 | created context checkpoint 4 of 32 (pos_min = 151349, pos_max = 151349, n_tokens = 151350, size = 3358.424 MiB)
beellama-qwopus-coder | 695.14.394.440 I slot create_check: id 0 | task 105832 | created context checkpoint 5 of 32 (pos_min = 151819, pos_max = 151819, n_tokens = 151820, size = 3369.028 MiB)
beellama-qwopus-coder | 695.14.720.592 I slot update_slots: id 0 | task 105832 | accepted 3/ 3 draft tokens
beellama-qwopus-coder | 695.14.891.576 I slot update_slots: id 0 | task 105832 | accepted 3/ 3 draft tokens
beellama-qwopus-coder | 695.14.891.730 W slot process_toke: id 0 | task 105832 | raw tool marker observed while lazy grammar is enabled; keeping DFlash governed by active grammar boundary in_reasoning=0 n_decoded=9 reasoning_tokens=0 visible_tokens=9
beellama-qwopus-coder | 695.15.056.393 I slot update_slots: id 0 | task 105832 | accepted 0/ 3 draft tokens
beellama-qwopus-coder | 695.27.883.577 I slot print_timing: id 0 | task 105832 | n_decoded = 100, tg = 7.50 t/s
beellama-qwopus-coder | 695.30.890.419 I slot print_timing: id 0 | task 105832 | n_decoded = 121, tg = 7.41 t/s
beellama-qwopus-coder | 695.33.889.741 I slot print_timing: id 0 | task 105832 | prompt eval time = 3676.56 ms / 476 tokens ( 7.72 ms per token, 129.47 tokens per second)
beellama-qwopus-coder | 695.33.889.744 I slot print_timing: id 0 | task 105832 | eval time = 19337.27 ms / 142 tokens ( 136.18 ms per token, 7.34 tokens per second)
beellama-qwopus-coder | 695.33.889.744 I slot print_timing: id 0 | task 105832 | total time = 23013.83 ms / 618 tokens
beellama-qwopus-coder | 695.33.889.745 I slot print_timing: id 0 | task 105832 | graphs reused = 96118
beellama-qwopus-coder | 695.33.889.746 I slot print_timing: id 0 | task 105832 | draft acceptance = 0.66667 ( 6 accepted / 9 generated)
beellama-qwopus-coder | 695.33.889.764 I statistics draft-mtp: #calls(b,g,a) = 486 8300 8300, #gen drafts = 8300, #acc drafts = 7077, #gen tokens = 24900, #acc tokens = 18306, dur(b,g,a) = 0.641, 60198.151, 7.395 ms
beellama-qwopus-coder | 695.33.892.043 I slot release: id 0 | task 105832 | stop processing: n_tokens = 151965, truncated = 0
beellama-qwopus-coder | 695.33.892.098 I srv update_slots: all slots are idle
beellama-qwopus-coder | 695.41.192.504 I srv params_from_: Chat format: peg-native
beellama-qwopus-coder | 695.41.260.197 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 43423543871
beellama-qwopus-coder | 695.41.261.730 I reasoning-budget: activated, budget=2147483647 tokens
beellama-qwopus-coder | 695.41.261.732 I reasoning-budget: deactivated (natural end)
beellama-qwopus-coder | 695.41.261.784 I slot launch_slot_: id 0 | task 105971 | processing task, is_child = 0
beellama-qwopus-coder | 695.41.261.795 I slot update_slots: id 0 | task 105971 | Checking checkpoint with [151819, 151819] against 32...
beellama-qwopus-coder | 695.41.261.797 I slot update_slots: id 0 | task 105971 | Checking checkpoint with [151349, 151349] against 32...
beellama-qwopus-coder | 695.41.261.797 I slot update_slots: id 0 | task 105971 | Checking checkpoint with [151076, 151076] against 32...
beellama-qwopus-coder | 695.41.261.798 I slot update_slots: id 0 | task 105971 | Checking checkpoint with [150564, 150564] against 32...
beellama-qwopus-coder | 695.41.261.798 I slot update_slots: id 0 | task 105971 | Checking checkpoint with [150274, 150274] against 32...
beellama-qwopus-coder | 695.41.261.800 W slot update_slots: id 0 | task 105971 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see ggml-org/llama.cpp#13194 (comment))
beellama-qwopus-coder | 695.41.261.807 W slot update_slots: id 0 | task 105971 | erased invalidated context checkpoint (pos_min = 150274, pos_max = 150274, n_tokens = 150275, n_swa = 0, pos_next = 0, size = 3336.684 MiB)
beellama-qwopus-coder | 695.41.348.189 W slot update_slots: id 0 | task 105971 | erased invalidated context checkpoint (pos_min = 150564, pos_max = 150564, n_tokens = 150565, n_swa = 0, pos_next = 0, size = 3342.202 MiB)
beellama-qwopus-coder | 695.41.411.767 W slot update_slots: id 0 | task 105971 | erased invalidated context checkpoint (pos_min = 151076, pos_max = 151076, n_tokens = 151077, n_swa = 0, pos_next = 0, size = 3352.972 MiB)
beellama-qwopus-coder | 695.41.465.019 W slot update_slots: id 0 | task 105971 | erased invalidated context checkpoint (pos_min = 151349, pos_max = 151349, n_tokens = 151350, n_swa = 0, pos_next = 0, size = 3358.424 MiB)
beellama-qwopus-coder | 695.41.514.699 W slot update_slots: id 0 | task 105971 | erased invalidated context checkpoint (pos_min = 151819, pos_max = 151819, n_tokens = 151820, n_swa = 0, pos_next = 0, size = 3369.028 MiB)
beellama-qwopus-coder | 695.45.047.592 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 4096, progress = 0.04, t = 3.79 s / 1081.94 tokens per second
beellama-qwopus-coder | 695.46.856.177 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 6144, progress = 0.06, t = 5.59 s / 1098.25 tokens per second
beellama-qwopus-coder | 695.48.705.984 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 8192, progress = 0.08, t = 7.44 s / 1100.46 tokens per second
beellama-qwopus-coder | 695.50.597.729 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 10240, progress = 0.10, t = 9.34 s / 1096.84 tokens per second
beellama-qwopus-coder | 695.52.533.036 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 12288, progress = 0.12, t = 11.27 s / 1090.21 tokens per second
beellama-qwopus-coder | 695.54.507.214 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 14336, progress = 0.14, t = 13.25 s / 1082.34 tokens per second
beellama-qwopus-coder | 695.56.521.907 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 16384, progress = 0.16, t = 15.26 s / 1073.65 tokens per second
beellama-qwopus-coder | 695.58.582.331 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 18432, progress = 0.19, t = 17.32 s / 1064.17 tokens per second
beellama-qwopus-coder | 696.00.683.555 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 20480, progress = 0.21, t = 19.42 s / 1054.49 tokens per second
beellama-qwopus-coder | 696.02.826.987 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 22528, progress = 0.23, t = 21.57 s / 1044.65 tokens per second
beellama-qwopus-coder | 696.04.078.305 I srv params_from_: Chat format: peg-native
beellama-qwopus-coder | 696.05.012.660 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 24576, progress = 0.25, t = 23.75 s / 1034.74 tokens per second
beellama-qwopus-coder | 696.07.241.196 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 26624, progress = 0.27, t = 25.98 s / 1024.81 tokens per second
beellama-qwopus-coder | 696.09.513.283 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 28672, progress = 0.29, t = 28.25 s / 1014.88 tokens per second
beellama-qwopus-coder | 696.11.830.454 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 30720, progress = 0.31, t = 30.57 s / 1004.95 tokens per second
beellama-qwopus-coder | 696.14.190.694 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 32768, progress = 0.33, t = 32.93 s / 995.11 tokens per second
beellama-qwopus-coder | 696.16.596.539 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 34816, progress = 0.35, t = 35.33 s / 985.32 tokens per second
beellama-qwopus-coder | 696.19.044.655 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 36864, progress = 0.37, t = 37.78 s / 975.68 tokens per second
beellama-qwopus-coder | 696.21.535.845 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 38912, progress = 0.39, t = 40.27 s / 966.18 tokens per second
beellama-qwopus-coder | 696.24.069.412 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 40960, progress = 0.41, t = 42.81 s / 956.84 tokens per second
beellama-qwopus-coder | 696.26.645.880 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 43008, progress = 0.43, t = 45.38 s / 947.65 tokens per second
beellama-qwopus-coder | 696.29.265.595 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 45056, progress = 0.45, t = 48.00 s / 938.59 tokens per second
beellama-qwopus-coder | 696.31.931.050 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 47104, progress = 0.47, t = 50.67 s / 929.64 tokens per second
beellama-qwopus-coder | 696.34.641.991 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 49152, progress = 0.49, t = 53.38 s / 920.79 tokens per second
beellama-qwopus-coder | 696.37.393.326 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 51200, progress = 0.51, t = 56.13 s / 912.14 tokens per second
beellama-qwopus-coder | 696.40.183.583 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 53248, progress = 0.54, t = 58.92 s / 903.71 tokens per second
beellama-qwopus-coder | 696.43.017.696 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 55296, progress = 0.56, t = 61.76 s / 895.40 tokens per second
beellama-qwopus-coder | 696.45.893.980 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 57344, progress = 0.58, t = 64.63 s / 887.24 tokens per second
beellama-qwopus-coder | 696.48.815.101 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 59392, progress = 0.60, t = 67.55 s / 879.19 tokens per second
beellama-qwopus-coder | 696.51.781.522 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 61440, progress = 0.62, t = 70.52 s / 871.25 tokens per second
beellama-qwopus-coder | 696.54.787.604 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 63488, progress = 0.64, t = 73.53 s / 863.48 tokens per second
beellama-qwopus-coder | 696.57.840.622 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 65536, progress = 0.66, t = 76.58 s / 855.80 tokens per second
beellama-qwopus-coder | 697.00.935.079 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 67584, progress = 0.68, t = 79.67 s / 848.26 tokens per second
beellama-qwopus-coder | 697.04.075.212 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 69632, progress = 0.70, t = 82.81 s / 840.83 tokens per second
beellama-qwopus-coder | 697.07.259.735 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 71680, progress = 0.72, t = 86.00 s / 833.51 tokens per second
beellama-qwopus-coder | 697.10.484.154 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 73728, progress = 0.74, t = 89.22 s / 826.34 tokens per second
beellama-qwopus-coder | 697.13.753.333 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 75776, progress = 0.76, t = 92.49 s / 819.28 tokens per second
beellama-qwopus-coder | 697.17.067.452 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 77824, progress = 0.78, t = 95.81 s / 812.31 tokens per second
beellama-qwopus-coder | 697.20.430.470 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 79872, progress = 0.80, t = 99.17 s / 805.42 tokens per second
beellama-qwopus-coder | 697.23.837.599 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 81920, progress = 0.82, t = 102.58 s / 798.63 tokens per second
beellama-qwopus-coder | 697.27.291.814 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 83968, progress = 0.84, t = 106.03 s / 791.93 tokens per second
beellama-qwopus-coder | 697.30.785.434 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 86016, progress = 0.86, t = 109.52 s / 785.36 tokens per second
beellama-qwopus-coder | 697.34.325.412 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 88064, progress = 0.89, t = 113.06 s / 778.89 tokens per second
beellama-qwopus-coder | 697.37.913.100 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 90112, progress = 0.91, t = 116.65 s / 772.49 tokens per second
beellama-qwopus-coder | 697.41.546.108 I slot print_timing: id 0 | task 105971 | prompt processing, n_tokens = 92160, progress = 0.93, t = 120.28 s / 766.18 tokens per second
beellama-qwopus-coder | 697.41.572.201 W srv stop: cancel task, id_task = 105971
beellama-qwopus-coder | 697.42.091.523 I srv params_from_: Chat format: peg-native
beellama-qwopus-coder | 697.45.222.467 I slot release: id 0 | task 105971 | stop processing: n_tokens = 94208, truncated = 0
beellama-qwopus-coder | 697.45.222.530 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 43554874297
beellama-qwopus-coder | 697.45.223.027 I reasoning-budget: activated, budget=2147483647 tokens
beellama-qwopus-coder | 697.45.223.028 I reasoning-budget: deactivated (natural end)
beellama-qwopus-coder | 697.45.223.124 I slot launch_slot_: id 0 | task 105984 | processing task, is_child = 0
beellama-qwopus-coder | 697.45.223.138 W slot update_slots: id 0 | task 105984 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see ggml-org/llama.cpp#13194 (comment))
beellama-qwopus-coder | 697.48.726.773 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 4096, progress = 0.03, t = 3.50 s / 1169.07 tokens per second
beellama-qwopus-coder | 697.50.543.234 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 6144, progress = 0.04, t = 5.32 s / 1154.87 tokens per second
beellama-qwopus-coder | 697.52.400.060 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 8192, progress = 0.05, t = 7.18 s / 1141.44 tokens per second
beellama-qwopus-coder | 697.54.296.984 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 10240, progress = 0.07, t = 9.07 s / 1128.52 tokens per second
beellama-qwopus-coder | 697.56.236.271 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 12288, progress = 0.08, t = 11.01 s / 1115.76 tokens per second
beellama-qwopus-coder | 697.58.216.202 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 14336, progress = 0.09, t = 12.99 s / 1103.36 tokens per second
beellama-qwopus-coder | 698.00.238.494 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 16384, progress = 0.11, t = 15.02 s / 1091.15 tokens per second
beellama-qwopus-coder | 698.02.301.714 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 18432, progress = 0.12, t = 17.08 s / 1079.25 tokens per second
beellama-qwopus-coder | 698.04.405.390 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 20480, progress = 0.13, t = 19.18 s / 1067.65 tokens per second
beellama-qwopus-coder | 698.06.551.327 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 22528, progress = 0.15, t = 21.33 s / 1056.26 tokens per second
beellama-qwopus-coder | 698.08.741.517 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 24576, progress = 0.16, t = 23.52 s / 1044.97 tokens per second
beellama-qwopus-coder | 698.10.971.578 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 26624, progress = 0.18, t = 25.75 s / 1034.00 tokens per second
beellama-qwopus-coder | 698.13.246.531 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 28672, progress = 0.19, t = 28.02 s / 1023.15 tokens per second
beellama-qwopus-coder | 698.15.564.098 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 30720, progress = 0.20, t = 30.34 s / 1012.49 tokens per second
beellama-qwopus-coder | 698.17.924.895 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 32768, progress = 0.22, t = 32.70 s / 1002.03 tokens per second
beellama-qwopus-coder | 698.20.331.242 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 34816, progress = 0.23, t = 35.11 s / 991.68 tokens per second
beellama-qwopus-coder | 698.22.780.432 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 36864, progress = 0.24, t = 37.56 s / 981.54 tokens per second
beellama-qwopus-coder | 698.25.270.005 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 38912, progress = 0.26, t = 40.05 s / 971.66 tokens per second
beellama-qwopus-coder | 698.27.802.647 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 40960, progress = 0.27, t = 42.58 s / 961.97 tokens per second
beellama-qwopus-coder | 698.30.376.830 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 43008, progress = 0.28, t = 45.15 s / 952.48 tokens per second
beellama-qwopus-coder | 698.32.992.582 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 45056, progress = 0.30, t = 47.77 s / 943.20 tokens per second
beellama-qwopus-coder | 698.35.650.411 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 47104, progress = 0.31, t = 50.43 s / 934.10 tokens per second
beellama-qwopus-coder | 698.38.350.980 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 49152, progress = 0.32, t = 53.13 s / 925.16 tokens per second
beellama-qwopus-coder | 698.41.096.481 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 51200, progress = 0.34, t = 55.87 s / 916.36 tokens per second
beellama-qwopus-coder | 698.43.885.356 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 53248, progress = 0.35, t = 58.66 s / 907.71 tokens per second
beellama-qwopus-coder | 698.46.715.086 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 55296, progress = 0.36, t = 61.49 s / 899.24 tokens per second
beellama-qwopus-coder | 698.49.589.681 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 57344, progress = 0.38, t = 64.37 s / 890.90 tokens per second
beellama-qwopus-coder | 698.52.504.755 I slot print_timing: id 0 | task 105984 | prompt processing, n_tokens = 59392, progress = 0.39, t = 67.28 s / 882.74 tokens per second
beellama-qwopus-coder | 698.55.228.731 W srv stop: cancel task, id_task = 105984
beellama-qwopus-coder | 698.55.463.307 I slot release: id 0 | task 105984 | stop processing: n_tokens = 61440, truncated = 0
beellama-qwopus-coder | 698.55.463.448 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 43625115137
beellama-qwopus-coder | 698.55.464.846 I reasoning-budget: activated, budget=2147483647 tokens
beellama-qwopus-coder | 698.55.464.848 I reasoning-budget: deactivated (natural end)
beellama-qwopus-coder | 698.55.464.897 I slot launch_slot_: id 0 | task 106020 | processing task, is_child = 0
beellama-qwopus-coder | 698.55.464.906 W slot update_slots: id 0 | task 106020 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see ggml-org/llama.cpp#13194 (comment))
beellama-qwopus-coder | 698.58.957.078 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 4096, progress = 0.04, t = 3.49 s / 1172.91 tokens per second
beellama-qwopus-coder | 699.00.770.782 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 6144, progress = 0.06, t = 5.31 s / 1157.96 tokens per second
beellama-qwopus-coder | 699.02.627.540 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 8192, progress = 0.08, t = 7.16 s / 1143.72 tokens per second
beellama-qwopus-coder | 699.04.524.142 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 10240, progress = 0.10, t = 9.06 s / 1130.34 tokens per second
beellama-qwopus-coder | 699.05.338.860 I srv params_from_: Chat format: peg-native
beellama-qwopus-coder | 699.06.462.040 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 12288, progress = 0.12, t = 11.00 s / 1117.38 tokens per second
beellama-qwopus-coder | 699.08.442.498 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 14336, progress = 0.14, t = 12.98 s / 1104.67 tokens per second
beellama-qwopus-coder | 699.10.463.843 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 16384, progress = 0.16, t = 15.00 s / 1092.34 tokens per second
beellama-qwopus-coder | 699.12.526.426 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 18432, progress = 0.19, t = 17.06 s / 1080.33 tokens per second
beellama-qwopus-coder | 699.14.630.794 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 20480, progress = 0.21, t = 19.17 s / 1068.57 tokens per second
beellama-qwopus-coder | 699.16.776.844 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 22528, progress = 0.23, t = 21.31 s / 1057.06 tokens per second
beellama-qwopus-coder | 699.18.964.985 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 24576, progress = 0.25, t = 23.50 s / 1045.78 tokens per second
beellama-qwopus-coder | 699.21.197.695 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 26624, progress = 0.27, t = 25.73 s / 1034.63 tokens per second
beellama-qwopus-coder | 699.23.471.362 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 28672, progress = 0.29, t = 28.01 s / 1023.76 tokens per second
beellama-qwopus-coder | 699.25.788.541 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 30720, progress = 0.31, t = 30.32 s / 1013.07 tokens per second
beellama-qwopus-coder | 699.28.151.281 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 32768, progress = 0.33, t = 32.69 s / 1002.50 tokens per second
beellama-qwopus-coder | 699.30.555.809 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 34816, progress = 0.35, t = 35.09 s / 992.17 tokens per second
beellama-qwopus-coder | 699.33.002.866 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 36864, progress = 0.37, t = 37.54 s / 982.05 tokens per second
beellama-qwopus-coder | 699.35.493.838 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 38912, progress = 0.39, t = 40.03 s / 972.10 tokens per second
beellama-qwopus-coder | 699.38.025.081 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 40960, progress = 0.41, t = 42.56 s / 962.40 tokens per second
beellama-qwopus-coder | 699.40.600.799 I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 43008, progress = 0.43, t = 45.14 s / 952.86 tokens per second


noonghunna · 2026-06-15T20:51:24Z

noonghunna
Jun 15, 2026
Maintainer Author

Thanks for the log — that's enough to see it's not a generation loop; it's the harness reprocessing your whole context every turn. The tell repeats three times:

forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory…)

At ~150K, KVarN's windowed KV can't reuse the prompt cache across the context OpenCode resends, so each turn erases the checkpoints and reprocesses the full ~150K tokens from scratch (~2–4 min at your prefill rate). OpenCode then hits its own timeout, cancels mid-reprocess (the "cancel task" line in your log), and resubmits, which produces a reprocess → cancel → reprocess cycle. Decode at 150K is also only ~7–9 t/s, so it looks stuck but it is actually grinding.
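To make the timing concrete, here is a rough sketch (not part of the repo's tooling) that reads one of the "prompt processing" progress lines from the log above and extrapolates the full prompt size and the reprocess time. The function name `estimate_reprocess` and the returned keys are made up for illustration. It assumes the logged rate is the running average and stays flat for the rest of the prompt; the log shows the rate dropping with depth, so treat the result as a lower bound.

```python
import re

# Matches the "prompt processing" progress lines printed by the server log above.
PROGRESS_RE = re.compile(
    r"n_tokens\s*=\s*(\d+),\s*progress\s*=\s*([\d.]+),"
    r"\s*t\s*=\s*([\d.]+)\s*s\s*/\s*([\d.]+)\s*tokens per second"
)

def estimate_reprocess(line: str) -> dict:
    """Extrapolate total prompt size and reprocess time from one progress line."""
    m = PROGRESS_RE.search(line)
    if m is None:
        raise ValueError("not a prompt-processing progress line")
    n_tokens = int(m.group(1))
    progress = float(m.group(2))
    elapsed_s = float(m.group(3))
    rate = float(m.group(4))  # tokens per second so far
    if progress <= 0:
        raise ValueError("progress must be > 0 to extrapolate")
    # progress is rounded to two decimals in the log, so this is approximate.
    total_tokens = n_tokens / progress
    # Assumption: the rate stays flat for the remainder (it actually declines).
    remaining_s = (total_tokens - n_tokens) / rate
    return {
        "total_tokens": total_tokens,
        "total_min": (elapsed_s + remaining_s) / 60,
    }

# Last progress line before the cancel in the log above.
sample_line = (
    "I slot print_timing: id 0 | task 105984 | prompt processing, "
    "n_tokens = 59392, progress = 0.39, t = 67.28 s / 882.74 tokens per second"
)
result = estimate_reprocess(sample_line)
print(f"~{result['total_tokens']:.0f} tokens, at least {result['total_min']:.1f} min")
```

On that line this works out to about 152K tokens and just under 3 minutes per full reprocess, which lines up with the ~2–4 min range above.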

Two things would confirm and likely fix it:

1. A rig report: bash scripts/report.sh --full (redacted is fine). I want your exact CTX_SIZE, KV type, MTP n, beellama image, and VRAM headroom at depth.
2. Try a lower context budget in OpenCode. 150K of resent context per turn on a single 24 GB card is the stressor, and the cache-invalidation reprocess scales with it. Dropping OpenCode's context window (or CTX_SIZE) well under the MTP-on ~160K ceiling should stop the reprocess-cancel loop.

If you can also share whether the card is headless or driving a display, that helps: a desktop-shared 3090 loses a few GB to the display, which tightens this further.
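If you want to check the loop yourself before and after lowering the context, a small sketch like this counts the relevant markers in a saved log. The marker strings are copied from the log lines quoted in this thread; the function name `count_markers` and the dictionary keys are made up for illustration, and this is not part of the repo's scripts.

```python
from collections import Counter

# Substrings copied from the server log lines quoted in this thread.
MARKERS = {
    "full_reprocess": "forcing full prompt re-processing",
    "checkpoint_erased": "erased invalidated context checkpoint",
    "cancelled": "cancel task",
}

def count_markers(log_text: str) -> dict[str, int]:
    """Count how often each marker appears, at most once per log line."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, needle in MARKERS.items():
            if needle in line:
                counts[name] += 1
    # Return every marker, including those that never appeared.
    return {name: counts[name] for name in MARKERS}

# Three lines shortened from the log above, used as a stand-in for a real capture.
sample = """\
W srv stop: cancel task, id_task = 105984
W slot update_slots: id 0 | task 106020 | forcing full prompt re-processing due to lack of cache data
I slot print_timing: id 0 | task 106020 | prompt processing, n_tokens = 4096, progress = 0.04
"""
print(count_markers(sample))
```

Feeding it the text of a saved container log should show whether the cancel and full-reprocess counts keep climbing together. If the cancels stop after the context budget is lowered, the timeout loop was the cause.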


sghomelab · 2026-06-16T02:04:31Z

sghomelab
Jun 16, 2026

Rig report to follow below. Btw, I get a "No .safetensors found" message when I run the setup script. Not sure if this matters? See below.

WEIGHTS=qwopus-coder bash scripts/setup.sh qwen3.6-27b
[model] qwen3.6-27b:qwopus-coder-mtp-q5km -> Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF Qwopus3.6-27B-Coder-MTP-Q5_K_M.gguf -> qwopus3.6-27b-coder-gguf/jackrong-mtp-q5km
[preflight] checking environment...
[preflight] docker: 29.4.1 (compose v2 ok)
[preflight] gpu: 2× detected
[preflight] GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-85a6aab0-097d-628e-e33c-848cc5c04871)
[preflight] GPU 1: NVIDIA GeForce GTX 1080 (UUID: GPU-6613017b-bc0f-34ff-1380-91e814014c3f)
[preflight] disk: 1631 GB free at /opt/ai/models (need ~25 GB)
[preflight] ok.

Setup root: /opt/ai/compose/club-3090
Model dir: /opt/ai/models
[genesis] qwen3.6-27b doesn't use Genesis — skipping clone.
[model] Using 'hf download' (hf_transfer if available) ...
/home/mars/.hf-cli/venv/lib/python3.12/site-packages/huggingface_hub/constants.py:298: FutureWarning: The HF_HUB_ENABLE_HF_TRANSFER environment variable is deprecated as 'hf_transfer' is not used anymore. Please use HF_XET_HIGH_PERFORMANCE instead to enable high performance transfer with Xet. Visit https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hfxethighperformance for more details.
warnings.warn(
✓ Downloaded
path: /mnt/opt/ai/models/qwopus3.6-27b-coder-gguf/jackrong-mtp-q5km/Qwopus3.6-27B-Coder-MTP-Q5_K_M.gguf
[verify] Checking SHA256 of every *.safetensors against HF x-linked-etag ...
[verify] No .safetensors found in /opt/ai/models/qwopus3.6-27b-coder-gguf/jackrong-mtp-q5km — download may have failed.


sghomelab · 2026-06-16T02:04:38Z

sghomelab
Jun 16, 2026

scripts/report.sh --full

club-3090 rig report

Generated: 2026-06-16 01:27:13 UTC

Redacted output (paths, host, user, tokens). Re-run with --no-redact for full data.

System

OS: Ubuntu 24.04.2 LTS
Kernel: 6.8.0-124-generic
Environment: bare metal
Locale: en_US.UTF-8
Timezone: +08
Uptime: up 1 hour, 34 minutes

CPU + RAM

CPU: AMD Ryzen 9 9950X3D 16-Core Processor (32 threads)
RAM: 91Gi total, 87Gi available
Swap: 8.0Gi

Disk

<MODEL_DIR>: 1.6T available, ext4 filesystem
/huggingface: 506G available, ext4 filesystem
/var/lib/docker: 506G available, ext4 filesystem

GPU hardware

GPU 0: NVIDIA GeForce RTX 3090 | 24576 MiB | driver 580.159.04 | VBIOS 94.02.42.00.A9 | persistence=Enabled
- Power: limit=390.00 W (default=390.00 W, max=480.00 W) | current_draw=30.62 W
- PCIe: x4 lanes negotiated (GPU max x16, Gen up to 4) | bus 00000000:01:00.0 ⚠ slot is narrower than GPU capability — affects load + all-reduce bandwidth
GPU 1: NVIDIA GeForce GTX 1080 | 8192 MiB | driver 580.159.04 | VBIOS 86.04.60.00.D2 | persistence=Enabled
- Power: limit=180.00 W (default=180.00 W, max=217.00 W) | current_draw=6.14 W
- PCIe: x4 lanes negotiated (GPU max x16, Gen up to 3) | bus 00000000:02:00.0 ⚠ slot is narrower than GPU capability — affects load + all-reduce bandwidth
CUDA Runtime (per driver): 13.0
ECC mode: [N/A] (3090s don't have ECC; expect N/A)

NVLink

No NVLink detected (PCIe-only)

Topology

PCIe / GPU topology matrix

        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     0-31    0               N/A
GPU1    PHB      X      0-31    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

PCIe / P2P detail (lspci)

lspci PCIe/P2P detail (LnkSta / ACS / topology)

# lspci -t  (PCIe topology tree)
-[0000:00]-+-00.0
           +-00.2
           +-01.0
           +-01.1-[01]--+-00.0
           |            \-00.1
           +-01.2-[02]--+-00.0
           |            \-00.1
           +-01.3-[03]----00.0
           +-01.4-[04]----00.0
           +-01.5-[05]----00.0
           +-01.6-[06]----00.0
           +-01.7-[07]----00.0
           +-02.0
           +-03.0
           +-04.0
           +-08.0
           +-08.1-[08]--+-00.0
           |            +-00.1
           |            +-00.2
           |            +-00.3
           |            +-00.4
           |            +-00.5
           |            \-00.6
           +-08.3-[09]----00.0
           +-14.0
           +-14.3
           +-18.0
           +-18.1
           +-18.2
           +-18.3
           +-18.4
           +-18.5
           +-18.6
           \-18.7

# lspci -vvv -s 0000:01:00.0  (GPU function: LnkCap/LnkSta/ACSCap/ACSCtl)
                LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                LnkSta: Speed 2.5GT/s (downgraded), Width x4 (downgraded)

# lspci -vvv -s 0000:00:01.1  (upstream bridge of 0000:01:00.0: LnkCap/LnkSta/ACSCap/ACSCtl)
                LnkCap: Port #1, Speed 16GT/s, Width x4, ASPM not supported
                LnkSta: Speed 2.5GT/s, Width x4
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

# lspci -vvv -s 0000:02:00.0  (GPU function: LnkCap/LnkSta/ACSCap/ACSCtl)
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                LnkSta: Speed 2.5GT/s (downgraded), Width x4 (downgraded)

# lspci -vvv -s 0000:00:01.2  (upstream bridge of 0000:02:00.0: LnkCap/LnkSta/ACSCap/ACSCtl)
                LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
                LnkSta: Speed 2.5GT/s, Width x4
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

# lspci -nnk | grep -A3 -i nvidia  (driver binding + device IDs)
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1)
        Subsystem: ASUSTeK Computer Inc. GA102 [GeForce RTX 3090] [1043:87af]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
        Subsystem: ASUSTeK Computer Inc. GA102 High Definition Audio Controller [1043:87af]
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
        Subsystem: PC Partner Limited / Sapphire Technology GP104 [GeForce GTX 1080] [174b:1425]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
02:00.1 Audio device [0403]: NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0] (rev a1)
        Subsystem: PC Partner Limited / Sapphire Technology GP104 High Definition Audio Controller [174b:1425]
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel

Full nvidia-smi


Tue Jun 16 09:27:14 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.159.04             Driver Version: 580.159.04     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
| 53%   42C    P8             30W /  390W |   23270MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080        On  |   00000000:02:00.0 Off |                  N/A |
| 27%   35C    P8              6W /  180W |     111MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           43788      C   /app/llama-server                     23260MiB |
|    1   N/A  N/A            1701      C   ...ars/ComfyUI/.venv/bin/python3        106MiB |
+-----------------------------------------------------------------------------------------+

Display / desktop state

$DISPLAY: unset (headless)
Display processes running: none detected
GPU 0 idle VRAM: 23270 MiB (held by running beellama-qwopus-coder)
GPU 1 idle VRAM: 111 MiB (held by running beellama-qwopus-coder)

Container runtime

Docker: 29.4.1
docker compose (v2): 2.38.2
NVIDIA Container Toolkit: 1.19.1

Stack version

club-3090: v0.8.7-150-g99ced8e (branch: master, SHA 99ced8e)
GENESIS_PIN default: 7b9fd319 (per scripts/setup.sh)
Cached vLLM images:
- tag nightly-1acd67a795ebccdf9b9db7697ae9082058301657 digest sha256:2128e005fec5faea8eb3cb772d53aaf6575bb180ac178d7d86fe3e48d16c33eb (5 weeks ago)
- tag latest digest sha256:04563c302537a91aa49ebdfbceda96111c5712275999b7e8804fa598f0b5641d (7 weeks ago)
- tag nightly digest sha256:10c7a6ba51c69b8860546568b79faf48f9c7becc98595b5434624897e8d8f04c (7 weeks ago)

Profile state

Profile schema version: 1
Profile counts: 9 hardware, 7 models, 5 workloads, 12 engines, 10 drafters
Compose registry: 46 entries
Canonical scenarios: 9
Calibration:
- gemma-4-12b: 1 rows
- gemma-4-26b-a4b: 0 rows
- gemma-4-31b: 3 rows
- qwen3.6-27b: 2 rows
- qwen3.6-35b-a3b: 1 rows
Active estate: none (~/.club3090/estate.yml not found)

KV math calibration

Overall: 7/7 (100%)
No FAIL rows. kv-calc projections should agree with measured VRAM within the ±1.5 GB error band.

Full kv-calc --calibration output

========================================================================================
Calibration — predicted per-card VRAM vs measured BENCHMARKS rows
========================================================================================

  Predicted = weights + activation + overhead + drafter + (KV capped at available).
  Budget = mem_util × VRAM. Measured = nvidia-smi peak during bench (target ≈ budget).
  Verdict ✓ iff PASS/TIGHT and measured < VRAM (boot OK).

== qwen3.6-27b ==
  compose                     predicted    budget   measured  verdict
  ─────────────────────────   ─────────  ────────  ─────────  ───────
  dual                          19.91 GB   22.80 GB    23.60 GB    PASS ✓
  minimal@64K                   21.60 GB   21.60 GB    22.40 GB   TIGHT ✓

  Verdict accuracy: 2/2 (100%)

== qwen3.6-35b-a3b ==
  compose                     predicted    budget   measured  verdict
  ─────────────────────────   ─────────  ────────  ─────────  ───────
  qwen-35b-a3b-dual             13.68 GB   22.08 GB    22.10 GB    PASS ✓

  Verdict accuracy: 1/1 (100%)

== gemma-4-31b ==
  compose                     predicted    budget   measured  verdict
  ─────────────────────────   ─────────  ────────  ─────────  ───────
  gemma-dual-int8               22.80 GB   22.80 GB    22.20 GB   TIGHT ✓
  gemma-dual                    22.80 GB   22.80 GB    22.50 GB   TIGHT ✓
  gemma-dual-int8@256K          22.80 GB   22.80 GB    22.50 GB   TIGHT ✓

  Verdict accuracy: 3/3 (100%)

== gemma-4-12b ==
  compose                     predicted    budget   measured  verdict
  ─────────────────────────   ─────────  ────────  ─────────  ───────
  gemma-dual                    21.60 GB   21.60 GB    21.60 GB   TIGHT ✓

  Verdict accuracy: 1/1 (100%)

Overall: 7/7 (100%)

Notes:
  - This is a directional estimator (±1.5 GB error band on the breakdown).
  - vLLM's `gpu_worker.py` boot log is the authoritative source.
  - If predicted PASS but measured > budget, file an issue with `scripts/report.sh --bench`.

Active container

Name: beellama-qwopus-coder
Engine: unknown
Status: Up 51 seconds (healthy)
Ports: 0.0.0.0:8020->8080/tcp, [::]:8020->8080/tcp
Image: ghcr.io/anbeeld/beellama.cpp
Image digest: sha256:77d9e4fb7c743c86b222859999de1920d5b6a813a25a974e860cffe64d6a5f7e
Build tag (OCI version): preview-v0.3.2
Upstream commit (OCI revision): 98caf25aef870a6c745503c5dec9f5cbc5f50691
Upstream source: https://github.com/Anbeeld/beellama.cpp

Container Python / CUDA versions

**PyTorch:** OCI runtime exec failed: exec failed: unable to start container process: exec: "python3": executable file not found in $PATH
**vLLM:** OCI runtime exec failed: exec failed: unable to start container process: exec: "python3": executable file not found in $PATH

nvidia-smi inside container:
```
0, NVIDIA GeForce RTX 3090, 580.159.04
```

Boot log highlights

Full boot log (first 200 lines)

First 200 lines of docker logs

warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host
0.00.158.590 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.158.592 I device_info:
0.00.242.450 I   - CUDA0   : NVIDIA GeForce RTX 3090 (24124 MiB, 23859 MiB free)
0.00.242.457 I   - CPU     : AMD Ryzen 9 9950X3D 16-Core Processor (94106 MiB, 94106 MiB free)
0.00.242.526 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 500,610,700,750,800,860,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_HALF_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.242.571 I srv          init: running without SSL
0.00.242.605 I srv          init: using 31 threads for HTTP server
0.00.242.686 I srv         start: binding port with default address family
0.00.243.823 I srv  llama_server: loading model
0.00.243.834 I srv    load_model: loading model '/models/qwopus3.6-27b-coder-gguf/jackrong-mtp-q5km/Qwopus3.6-27B-Coder-MTP-Q5_K_M.gguf'
0.00.532.225 I srv    load_model: [spec] estimated memory usage of MTP context is 868.02 MiB
0.00.532.243 I common_init_result: fitting params to device memory ...
0.00.532.244 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.796.774 W common_fit_params: failed to fit params to free device memory: n_gpu_layers already set by user to -2, abort
0.05.144.379 W llama_init_from_model: KVarN requires non-unified KV streams; forcing kv_unified=false
0.05.144.383 W llama_init_from_model: KVarN preset kvarn_k4v4_g128 is experimental; only kvarn_k4v2_g128 is reference-aligned
0.05.144.401 W llama_context: n_ctx_seq (163840) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.05.198.577 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
TCQ decode: context-adaptive V alpha enabled
0.05.253.773 I srv    load_model: creating MTP draft context against the target model '/models/qwopus3.6-27b-coder-gguf/jackrong-mtp-q5km/Qwopus3.6-27B-Coder-MTP-Q5_K_M.gguf'
0.05.253.780 W llama_init_from_model: KVarN is target-context-only; disabling it for this auxiliary context
0.05.253.803 W llama_context: n_ctx_seq (163840) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.05.295.410 I srv    load_model: initializing slots, n_slots = 1
0.05.308.951 I common_context_can_seq_rm: the context supports bounded partial sequence removal
0.05.344.530 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
0.05.344.534 I common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00, n_embd=5120, backend_sampling=1
0.05.344.536 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=f16, cache_v=f16, ctx_tgt=yes, ctx_dft=yes, devices=[default]
0.05.344.649 I srv    load_model: speculative decoding context initialized
0.05.344.655 I slot   load_model: id  0 | task -1 | speculative decoding context initialized
0.05.344.655 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 163840
0.05.344.700 I srv    load_model: prompt cache is disabled - use `--cache-ram N` to enable it
0.05.344.701 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.05.344.702 I srv    load_model: context checkpoints enabled, max = 32, min spacing = 256
0.05.344.725 W srv          init: --cache-idle-slots requires --cache-ram, disabling
0.05.354.164 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>

</think>

'
0.05.361.129 I srv          init: init: chat template, thinking = 0
0.05.361.175 I srv  llama_server: model loaded
0.05.361.179 I srv  llama_server: server is listening on http://0.0.0.0:8080
0.05.361.187 I srv  update_slots: all slots are idle
0.41.329.262 I srv  params_from_: Chat format: peg-native
0.41.329.403 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
0.41.329.431 I reasoning-budget: activated, budget=2147483647 tokens
0.41.329.432 I reasoning-budget: deactivated (natural end)
0.41.329.452 I slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
0.41.533.502 I slot create_check: id  0 | task 0 | created context checkpoint 1 of 32 (pos_min = 8, pos_max = 8, n_tokens = 9, size = 175.850 MiB)
0.41.640.964 I slot update_slots: id  0 | task 0 | accepted  3/ 3 draft tokens
0.41.695.175 I slot update_slots: id  0 | task 0 | accepted  0/ 3 draft tokens
0.41.744.771 I slot update_slots: id  0 | task 0 | accepted  3/ 3 draft tokens
0.41.744.882 I slot print_timing: id  0 | task 0 | prompt eval time =     249.36 ms /    13 tokens (   19.18 ms per token,    52.13 tokens per second)
0.41.744.884 I slot print_timing: id  0 | task 0 |        eval time =     165.96 ms /    10 tokens (   16.60 ms per token,    60.26 tokens per second)
0.41.744.884 I slot print_timing: id  0 | task 0 |       total time =     415.31 ms /    23 tokens
0.41.744.889 I slot print_timing: id  0 | task 0 |    graphs reused =          3
0.41.744.890 I slot print_timing: id  0 | task 0 | draft acceptance = 0.66667 (    6 accepted /     9 generated)
0.41.744.909 I statistics        draft-mtp: #calls(b,g,a) =    1      3      3, #gen drafts =      3, #acc drafts =     2, #gen tokens =      9, #acc tokens =     6, dur(b,g,a) = 0.003, 17.631, 0.002 ms
0.41.744.935 I slot      release: id  0 | task 0 | stop processing: n_tokens = 22, truncated = 0
0.41.744.951 I srv  update_slots: all slots are idle
1.13.470.240 I srv  params_from_: Chat format: peg-native
1.13.470.788 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 1491973086000
1.13.470.830 I reasoning-budget: activated, budget=2147483647 tokens
1.13.470.833 I reasoning-budget: deactivated (natural end)
1.13.470.853 I slot launch_slot_: id  0 | task 6 | processing task, is_child = 0
1.13.470.866 I slot update_slots: id  0 | task 6 | Checking checkpoint with [8, 8] against 1...
1.13.470.871 W slot update_slots: id  0 | task 6 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
1.13.470.877 W slot update_slots: id  0 | task 6 | erased invalidated context checkpoint (pos_min = 8, pos_max = 8, n_tokens = 9, n_swa = 0, pos_next = 0, size = 175.850 MiB)
1.13.499.800 I srv  params_from_: Chat format: peg-native
1.14.039.578 I slot create_check: id  0 | task 6 | created context checkpoint 1 of 32 (pos_min = 538, pos_max = 538, n_tokens = 539, size = 186.690 MiB)
1.14.184.079 I slot update_slots: id  0 | task 6 | accepted  3/ 3 draft tokens
1.14.184.196 I slot print_timing: id  0 | task 6 | prompt eval time =     659.12 ms /   552 tokens (    1.19 ms per token,   837.49 tokens per second)
1.14.184.198 I slot print_timing: id  0 | task 6 |        eval time =      54.11 ms /     3 tokens (   18.04 ms per token,    55.45 tokens per second)
1.14.184.199 I slot print_timing: id  0 | task 6 |       total time =     713.22 ms /   555 tokens
1.14.184.199 I slot print_timing: id  0 | task 6 |    graphs reused =          3
1.14.184.200 I slot print_timing: id  0 | task 6 | draft acceptance = 1.00000 (    3 accepted /     3 generated)
1.14.184.217 I statistics        draft-mtp: #calls(b,g,a) =    2      4      4, #gen drafts =      4, #acc drafts =     3, #gen tokens =     12, #acc tokens =     9, dur(b,g,a) = 0.004, 23.529, 0.002 ms
1.14.184.256 I slot      release: id  0 | task 6 | stop processing: n_tokens = 556, truncated = 0
1.14.184.277 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 1492005525320
1.14.184.725 I reasoning-budget: activated, budget=2147483647 tokens
1.14.184.727 I reasoning-budget: deactivated (natural end)
1.14.184.738 I slot launch_slot_: id  0 | task 8 | processing task, is_child = 0
1.14.184.745 I slot update_slots: id  0 | task 8 | Checking checkpoint with [538, 538] against 3...
1.14.184.747 W slot update_slots: id  0 | task 8 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
1.14.184.748 W slot update_slots: id  0 | task 8 | erased invalidated context checkpoint (pos_min = 538, pos_max = 538, n_tokens = 539, n_swa = 0, pos_next = 0, size = 186.690 MiB)
1.17.651.299 I slot print_timing: id  0 | task 8 | prompt processing, n_tokens =   4096, progress = 0.50, t =   3.47 s / 1181.58 tokens per second
1.19.447.826 I slot print_timing: id  0 | task 8 | prompt processing, n_tokens =   6144, progress = 0.75, t =   5.26 s / 1167.38 tokens per second
1.20.817.636 I slot print_timing: id  0 | task 8 | prompt processing, n_tokens =   7657, progress = 0.94, t =   6.63 s / 1154.40 tokens per second
1.21.273.988 I slot print_timing: id  0 | task 8 | prompt processing, n_tokens =   8160, progress = 1.00, t =   7.09 s / 1151.04 tokens per second
1.21.391.632 I slot create_check: id  0 | task 8 | created context checkpoint 1 of 32 (pos_min = 8159, pos_max = 8159, n_tokens = 8160, size = 345.813 MiB)
1.21.444.230 I slot print_timing: id  0 | task 8 | prompt processing, n_tokens =   8169, progress = 1.00, t =   7.26 s / 1125.29 tokens per second
1.21.555.789 I slot update_slots: id  0 | task 8 | accepted  3/ 3 draft tokens
1.21.617.156 I slot update_slots: id  0 | task 8 | accepted  3/ 3 draft tokens
1.21.674.086 I slot update_slots: id  0 | task 8 | accepted  2/ 3 draft tokens
1.21.674.211 I slot print_timing: id  0 | task 8 | prompt eval time =    7310.18 ms /  8173 tokens (    0.89 ms per token,  1118.03 tokens per second)
1.21.674.213 I slot print_timing: id  0 | task 8 |        eval time =     179.16 ms /    10 tokens (   17.92 ms per token,    55.81 tokens per second)
1.21.674.214 I slot print_timing: id  0 | task 8 |       total time =    7489.35 ms /  8183 tokens
1.21.674.215 I slot print_timing: id  0 | task 8 |    graphs reused =          5
1.21.674.215 I slot print_timing: id  0 | task 8 | draft acceptance = 0.88889 (    8 accepted /     9 generated)
1.21.674.232 I statistics        draft-mtp: #calls(b,g,a) =    3      7      7, #gen drafts =      7, #acc drafts =     6, #gen tokens =     21, #acc tokens =    17, dur(b,g,a) = 0.006, 41.393, 0.004 ms
1.21.674.477 I slot      release: id  0 | task 8 | stop processing: n_tokens = 8184, truncated = 0
1.21.674.516 I srv  update_slots: all slots are idle
2.16.807.512 I srv  params_from_: Chat format: peg-native
2.16.815.585 I slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.488 (> 0.100 thold), f_keep = 0.507
2.16.816.012 I reasoning-budget: activated, budget=2147483647 tokens
2.16.816.014 I reasoning-budget: deactivated (natural end)
2.16.816.065 I slot launch_slot_: id  0 | task 23 | processing task, is_child = 0
2.16.816.078 I slot update_slots: id  0 | task 23 | Checking checkpoint with [8159, 8159] against 4152...
2.16.816.081 W slot update_slots: id  0 | task 23 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
2.16.816.083 W slot update_slots: id  0 | task 23 | erased invalidated context checkpoint (pos_min = 8159, pos_max = 8159, n_tokens = 8160, n_swa = 0, pos_next = 0, size = 345.813 MiB)
2.20.294.896 I slot print_timing: id  0 | task 23 | prompt processing, n_tokens =   4096, progress = 0.48, t =   3.48 s / 1177.41 tokens per second
2.22.084.095 I slot print_timing: id  0 | task 23 | prompt processing, n_tokens =   6144, progress = 0.72, t =   5.27 s / 1166.28 tokens per second
2.23.772.109 I slot print_timing: id  0 | task 23 | prompt processing, n_tokens =   7995, progress = 0.94, t =   6.96 s / 1149.36 tokens per second
2.23.937.734 I slot print_timing: id  0 | task 23 | prompt processing, n_tokens =   8151, progress = 0.96, t =   7.12 s / 1144.54 tokens per second
2.24.056.290 I slot create_check: id  0 | task 23 | created context checkpoint 1 of 32 (pos_min = 8150, pos_max = 8150, n_tokens = 8151, size = 345.778 MiB)
2.24.407.171 I slot print_timing: id  0 | task 23 | prompt processing, n_tokens =   8507, progress = 1.00, t =   7.59 s / 1120.66 tokens per second
2.24.525.769 I slot create_check: id  0 | task 23 | created context checkpoint 2 of 32 (pos_min = 8506, pos_max = 8506, n_tokens = 8507, size = 353.744 MiB)
2.24.637.355 I slot update_slots: id  0 | task 23 | accepted  1/ 3 draft tokens
2.24.699.065 I slot update_slots: id  0 | task 23 | accepted  3/ 3 draft tokens
2.24.756.080 I slot update_slots: id  0 | task 23 | accepted  0/ 3 draft tokens
2.24.813.104 I slot update_slots: id  0 | task 23 | accepted  2/ 3 draft tokens
2.24.869.458 I slot update_slots: id  0 | task 23 | accepted  1/ 3 draft tokens
2.24.926.087 I slot update_slots: id  0 | task 23 | accepted  3/ 3 draft tokens
2.24.982.770 I slot update_slots: id  0 | task 23 | accepted  3/ 3 draft tokens
2.25.039.601 I slot update_slots: id  0 | task 23 | accepted  3/ 3 draft tokens
2.25.095.758 I slot update_slots: id  0 | task 23 | accepted  0/ 3 draft tokens
2.25.152.597 I slot update_slots: id  0 | task 23 | accepted  3/ 3 draft tokens
2.25.152.710 W slot process_toke: id  0 | task 23 | raw tool marker observed while lazy grammar is enabled; keeping DFlash governed by active grammar boundary in_reasoning=0 n_decoded=30 reasoning_tokens=0 visible_tokens=30
2.25.208.794 I slot update_slots: id  0 | task 23 | accepted  0/ 3 draft tokens
2.27.687.978 I slot print_timing: id  0 | task 23 | n_decoded =    100, tg =  32.13 t/s
2.30.722.701 I slot print_timing: id  0 | task 23 | n_decoded =    185, tg =  30.09 t/s
2.33.751.858 I slot print_timing: id  0 | task 23 | n_decoded =    269, tg =  29.31 t/s
2.36.778.443 I slot print_timing: id  0 | task 23 | n_decoded =    353, tg =  28.93 t/s
2.39.798.613 I slot print_timing: id  0 | task 23 | n_decoded =    437, tg =  28.71 t/s
2.41.437.550 I slot print_timing: id  0 | task 23 | prompt eval time =    7759.23 ms /  8511 tokens (    0.91 ms per token,  1096.89 tokens per second)
2.41.437.554 I slot print_timing: id  0 | task 23 |        eval time =   16862.24 ms /   482 tokens (   34.98 ms per token,    28.58 tokens per second)
2.41.437.554 I slot print_timing: id  0 | task 23 |       total time =   24621.46 ms /  8993 tokens
2.41.437.555 I slot print_timing: id  0 | task 23 |    graphs reused =        463
2.41.437.556 I slot print_timing: id  0 | task 23 | draft acceptance = 0.57576 (   19 accepted /    33 generated)
2.41.437.571 I statistics        draft-mtp: #calls(b,g,a) =    4     18     18, #gen drafts =     18, #acc drafts =    14, #gen tokens =     54, #acc tokens =    36, dur(b,g,a) = 0.007, 105.572, 0.019 ms
2.41.437.891 I slot      release: id  0 | task 23 | stop processing: n_tokens = 8992, truncated = 0
2.41.437.948 I srv  update_slots: all slots are idle
4.04.870.431 I srv  params_from_: Chat format: peg-native
4.04.878.697 I slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.987 (> 0.100 thold), f_keep = 1.000
4.04.879.143 I reasoning-budget: activated, budget=2147483647 tokens
4.04.879.145 I reasoning-budget: deactivated (natural end)
4.04.879.184 I slot launch_slot_: id  0 | task 493 | processing task, is_child = 0
4.05.069.954 I slot create_check: id  0 | task 493 | created context checkpoint 3 of 32 (pos_min = 8993, pos_max = 8993, n_tokens = 8994, size = 364.415 MiB)
4.05.311.612 I slot update_slots: id  0 | task 493 | accepted  1/ 3 draft tokens
4.05.372.319 I slot update_slots: id  0 | task 493 | accepted  1/ 3 draft tokens
4.05.428.685 I slot update_slots: id  0 | task 493 | accepted  1/ 3 draft tokens
4.05.485.563 I slot update_slots: id  0 | task 493 | accepted  2/ 3 draft tokens
4.05.542.184 I slot update_slots: id  0 | task 493 | accepted  2/ 3 draft tokens
4.05.598.927 I slot update_slots: id  0 | task 493 | accepted  3/ 3 draft tokens
4.05.655.843 I slot update_slots: id  0 | task 493 | accepted  3/ 3 draft tokens
4.05.712.067 I slot update_slots: id  0 | task 493 | accepted  0/ 3 draft tokens
4.05.768.277 I slot update_slots: id  0 | task 493 | accepted  0/ 3 draft tokens
4.05.824.566 I slot update_slots: id  0 | task 493 | accepted  0/ 3 draft tokens
4.05.881.132 I slot update_slots: id  0 | task 493 | accepted  2/ 3 draft tokens
4.05.881.248 W slot process_toke: id  0 | task 493 | raw tool marker observed while lazy grammar is enabled; keeping DFlash governed by active grammar boundary in_reasoning=0 n_decoded=26 reasoning_tokens=0 visible_tokens=27
4.08.521.995 I slot print_timing: id  0 | task 493 | n_decoded =    100, tg =  30.58 t/s
4.11.549.994 I slot print_timing: id  0 | task 493 | n_decoded =    183, tg =  29.06 t/s
4.12.784.254 I slot print_timing: id  0 | task 493 | prompt eval time =     372.78 ms /   116 tokens (    3.21 ms per token,   311.18 tokens per second)
4.12.784.258 I slot print_timing: id  0 | task 493 |        eval time =    7532.27 ms /   217 tokens (   34.71 ms per token,    28.81 tokens per second)
4.12.784.258 I slot print_timing: id  0 | task 493 |       total time =    7905.05 ms /   333 tokens
4.12.784.259 I slot print_timing: id  0 | task 493 |    graphs reused =        661
4.12.784.260 I slot print_timing: id  0 | task 493 | draft acceptance = 0.45455 (   15 accepted /    33 generated)
4.12.784.277 I statistics        draft-mtp: #calls(b,g,a) =    5     29     29, #gen drafts =     29, #acc drafts =    22, #gen tokens =     87, #acc tokens =    51, dur(b,g,a) = 0.008, 169.236, 0.028 ms
4.12.784.603 I slot      release: id  0 | task 493 | stop processing: n_tokens = 9324, truncated = 0
4.12.784.645 I srv  update_slots: all slots are idle
4.13.933.389 I srv  params_from_: Chat format: peg-native
4.13.942.403 I slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.819 (> 0.100 thold), f_keep = 1.000
4.13.943.013 I reasoning-budget: activated, budget=2147483647 tokens
4.13.943.015 I reasoning-budget: deactivated (natural end)
4.13.943.061 I slot launch_slot_: id  0 | task 698 | processing task, is_child = 0
4.14.110.310 I slot create_check: id  0 | task 698 | created context checkpoint 4 of 32 (pos_min = 9325, pos_max = 9325, n_tokens = 9326, size = 370.100 MiB)
4.15.757.419 I slot create_check: id  0 | task 698 | created context checkpoint 5 of 32 (pos_min = 10869, pos_max = 10869, n_tokens = 10870, size = 402.440 MiB)
4.16.374.746 I slot create_check: id  0 | task 698 | created context checkpoint 6 of 32 (pos_min = 11381, pos_max = 11381, n_tokens = 11382, size = 413.209 MiB)
4.16.489.776 I slot update_slots: id  0 | task 698 | accepted  0/ 3 draft tokens
4.16.555.229 I slot update_slots: id  0 | task 698 | accepted  3/ 3 draft tokens
4.16.622.601 I slot update_slots: id  0 | task 698 | accepted  0/ 3 draft tokens
4.16.688.812 I slot update_slots: id  0 | task 698 | accepted  3/ 3 draft tokens
4.16.746.485 I slot update_slots: id  0 | task 698 | accepted  3/ 3 draft tokens
4.16.805.291 I slot update_slots: id  0 | task 698 | accepted  2/ 3 draft tokens
4.16.805.414 W slot process_toke: id  0 | task 698 | raw tool marker observed while lazy grammar is enabled; keeping DFlash governed by active grammar boundary in_reasoning=0 n_decoded=17 reasoning_tokens=0 visible_tokens=18
4.19.917.467 I slot print_timing: id  0 | task 698 | n_decoded =    100, tg =  28.65 t/s
4.22.935.898 I slot print_timing: id  0 | task 698 | n_decoded =    179, tg =  27.50 t/s
4.23.734.291 I slot print_timing: id  0 | task 698 | prompt eval time =    2484.00 ms /  2062 tokens (    1.20 ms per token,   830.11 tokens per second)
4.23.734.294 I slot print_timing: id  0 | task 698 |        eval time =    7307.21 ms /   200 tokens (   36.54 ms per token,    27.37 tokens per second)
4.23.734.295 I slot print_timing: id  0 | task 698 |       total time =    9791.21 ms /  2262 tokens

Recent failed boot attempts

Exited vLLM/llama.cpp containers exist but all >24h old — likely not relevant to current investigation.

verify-full.sh output

verify-full output

[autodetect] using running container=beellama-qwopus-coder url=http://localhost:8020  (skip: PREFLIGHT_NO_AUTODETECT=1)
[autodetect] served model='Qwopus3.6-27B-Coder-MTP-Q5_K_M.gguf' (from http://localhost:8020/v1/models; set MODEL= to override)
Running FULL functional test against http://localhost:8020
  model=Qwopus3.6-27B-Coder-MTP-Q5_K_M.gguf  container=beellama-qwopus-coder  engine=llamacpp

[1/9] Server reachable on /v1/models ...
  ✓ server is serving
[2/9] Genesis patches applied ...
  ⊘ llama.cpp engine — Genesis is vLLM-only, not applicable (skipped)
[warmup] priming engine (cold cudagraph/JIT, up to 180s, not scored) ...
[warmup] engine warm
[3/9] Basic completion — capital of France ...
  ✓ reply contains 'Paris'
[4/9] Tool calling ...
  ✓ tool_calls[] populated with get_weather
[5/9] Streaming (SSE) ...
  ✓ streamed 19 chunks, 71 chars:  Lines of code align, Searching for the hidden bug, Found it in the end. ...
[6/9] Streaming tool-calls (thinking-on) ...
  ✓ streamed delta.tool_calls (get_weather) + finish_reason=tool_calls, no <tool_call> leak
[7/9] Thinking / reasoning mode ...
  ✓ reasoning 256 chars, content 1 chars (finish=stop)
    reasoning: 1.  **Analyze the Request:**     *   Question: What is 2+2? ...
    content:   4...
[8/9] Output quality / cascade detection (2K-token completion) ...
  ✓ output OK — 10116 chars, variety=0.680, max_line_repeat=0, finish=stop
[9/9] MTP acceptance length threshold ...
  ⊘ llama.cpp engine — MTP acceptance check is vLLM-log-format-specific (run engine-side verification separately) (skipped)

All checks passed. Stack is ready for full-functionality use.

verify-stress.sh output

verify-stress output (7 boundary checks incl. Cliff 2 needle recall)

[autodetect] using running container=beellama-qwopus-coder url=http://localhost:8020  (skip: PREFLIGHT_NO_AUTODETECT=1)
[autodetect] served model='Qwopus3.6-27B-Coder-MTP-Q5_K_M.gguf' (from http://localhost:8020/v1/models; set MODEL= to override)
Running STRESS / boundary test against http://localhost:8020
  model=Qwopus3.6-27B-Coder-MTP-Q5_K_M.gguf  container=beellama-qwopus-coder  engine=llamacpp
  This script does the heavy stuff (longctx needle ladder + ~25K-token tool prefill).
  For the fast functional smoke (~2 min), use verify-full.sh instead.

[1/8] Long-context needle small rungs (10K / 30K) ...
    ✓   9820 tokens: recalled 'violet platypus 66' (got: violet platypus 66 )  prefill=1092.9 t/s (9s)
    ✓  29319 tokens: recalled 'silver chinchilla 80' (got: silver chinchilla 80 )  prefill=999.6 t/s (29s)
  ✓ all long-ctx depths recalled secret correctly
[2/8] Tool response prefill OOM (~25K-token mock tool response) ...
  ✓ tool prefill OK — text response (669 chars, finish=stop)
[3/8] IDE-agent one-shot prompt (sys + tool schemas + user request) ...
  ✓ IDE-agent one-shot OK — 48 completion tokens (209 chars), finish=stop
[4/8] Multi-turn agent prompt (sys + tools + 4-turn history) ...
  ✓ multi-turn agent OK
[5/8] LCB-coding shape (LeetCode-style problem + structured plan) ...
  ✓ LCB-coding shape OK
[6/8] Reasoning-heavy (math problem + max_tokens=8192) ...
  ✓ reasoning-heavy OK — 4720 completion tokens
[7/8] Long-context needle large rungs (60K / 90K — Cliff 2 territory) ...
    ✓  58570 tokens: recalled 'sapphire capybara 21' (got: sapphire capybara 21 )  prefill=868.2 t/s (67s)
    ✓  91071 tokens: recalled 'sapphire axolotl 61' (got: sapphire axolotl 61 )  prefill=762.0 t/s (120s)
  ✓ all long-ctx depths recalled secret correctly
[8/8] Context ceiling ladder (staggered NIAH from ~95000 → ~0.92 × n_ctx) ...
    n_ctx=163840  ladder: 95000 → 125000 → 150732 (3 rungs)
    calibrated: scale=100 → 6515 tokens (tok/scale_unit=65.15)
    VRAM free (ladder start): 831 MB
    ✓ rung 1/3: target=95K  actual=94K tok (57%)  recalled 'violet iguana 95'  prefill=732.4 t/s (121s)  VRAM_free=831MB
    ✓ rung 2/3: target=125K  actual=124K tok (76%)  recalled 'silver falcon 12'  prefill=656.6 t/s (181s)  VRAM_free=831MB
    ✓ rung 3/3: target=150K  actual=150K tok (91%)  recalled 'sapphire platypus 70'  prefill=602.1 t/s (240s)  VRAM_free=831MB

  ✓ ceiling ladder: all 3 rungs passed — fillable to 150415 tok (91% of n_ctx=163840)
    ⚠ VRAM margin thin at ceiling: 831 MB free < 1024 MB threshold
      Recall succeeded at 91% fill, but sustained agent load also carries
      prompt-cache + context-checkpoint overhead (~292 MiB, see #197).
      Agent users should target a CTX_SIZE where margin ≥ 1024 MB at this depth.

1 stress check(s) failed. See hints above.

soak-test.sh (SOAK_MODE=continuous) output

soak-test stdout (5-session × 5-turn ramping conversation, ~25 min)

[soak] ERROR: no running club-3090 container found (vllm-/llama-cpp-/ik-llama-/beellama-/sglang- × qwen36-27b/qwen36-35b-a3b/gemma-4-31b); set CONTAINER=... or CONTAINER=none for host engines
scripts/soak-test.sh: line 262: results/report-soak-20260616-094328/nvidia-smi-final.csv: No such file or directory
scripts/soak-test.sh: line 265: results/report-soak-20260616-094328/docker-stats-final.jsonl: No such file or directory
[soak] artifacts: results/report-soak-20260616-094328

_soak summary.md not produced — check stdout above_

bench.sh output

bench output (3 warmups + 5 measured per prompt)

[autodetect] using running container=beellama-qwopus-coder url=http://localhost:8020  (skip: PREFLIGHT_NO_AUTODETECT=1)
[autodetect] served model='Qwopus3.6-27B-Coder-MTP-Q5_K_M.gguf' (from http://localhost:8020/v1/models; set MODEL= to override)

========== NARRATIVE (prompt=65 chars, max_tokens=1000) ==========
=== warmups (3) ===
  warm-1     wall= 19.17s  ttft=   283ms  toks=1000  wall_TPS= 52.17  decode_TPS= 52.95
  warm-2     wall= 21.06s  ttft=    88ms  toks=1000  wall_TPS= 47.48  decode_TPS= 47.68
  warm-3     wall= 19.19s  ttft=    88ms  toks=1000  wall_TPS= 52.12  decode_TPS= 52.36

=== measured (5) ===
  run-1      wall= 18.76s  ttft=    87ms  toks=1000  wall_TPS= 53.31  decode_TPS= 53.56
  run-2      wall= 19.48s  ttft=    89ms  toks=1000  wall_TPS= 51.34  decode_TPS= 51.58
  run-3      wall= 19.75s  ttft=    89ms  toks=1000  wall_TPS= 50.63  decode_TPS= 50.86
  run-4      wall= 18.53s  ttft=    89ms  toks=1000  wall_TPS= 53.96  decode_TPS= 54.22
  run-5      wall= 18.12s  ttft=    88ms  toks=1000  wall_TPS= 55.20  decode_TPS= 55.47

=== summary [narrative] (n=5) ===
  wall_TPS       mean=  52.89   std=  1.88   CV= 3.6%   min=50.63   max=55.20
  decode_TPS     mean=  53.14   std=  1.90   CV= 3.6%   min=50.86   max=55.47
  TTFT          mean=    88ms  std=    1ms  min=87ms  max=89ms
  PP tok/s       n/a (long-prompt fallback below)

========== CODE (prompt=78 chars, max_tokens=800) ==========
=== warmups (3) ===
  warm-1     wall= 12.60s  ttft=   166ms  toks= 800  wall_TPS= 63.48  decode_TPS= 64.32
  warm-2     wall= 12.24s  ttft=    88ms  toks= 800  wall_TPS= 65.38  decode_TPS= 65.86
  warm-3     wall= 12.00s  ttft=    90ms  toks= 800  wall_TPS= 66.69  decode_TPS= 67.20

=== measured (5) ===
  run-1      wall= 11.87s  ttft=    88ms  toks= 800  wall_TPS= 67.42  decode_TPS= 67.93
  run-2      wall= 12.07s  ttft=    89ms  toks= 800  wall_TPS= 66.28  decode_TPS= 66.78
  run-3      wall= 11.76s  ttft=    88ms  toks= 800  wall_TPS= 68.04  decode_TPS= 68.55
  run-4      wall= 12.00s  ttft=    89ms  toks= 800  wall_TPS= 66.66  decode_TPS= 67.16
  run-5      wall= 12.60s  ttft=    88ms  toks= 800  wall_TPS= 63.51  decode_TPS= 63.96

=== summary [code] (n=5) ===
  wall_TPS       mean=  66.38   std=  1.74   CV= 2.6%   min=63.51   max=68.04
  decode_TPS     mean=  66.87   std=  1.77   CV= 2.6%   min=63.96   max=68.55
  TTFT          mean=    89ms  std=    1ms  min=88ms  max=89ms
  PP tok/s       n/a (long-prompt fallback below)

========== PROMPT-PROCESSING (fallback target=10000 prompt tokens, max_tokens=16) ==========
=== measured (1) ===
  run-1      wall= 12.60s  ttft= 12242ms  prompt_toks= 13179  PP_tok/s=1076.58

=== summary [prompt-processing] (n=1) ===
  PP tok/s       mean=1076.58   std=  0.00   CV= 0.0%   min=1076.58   max=1076.58
  TTFT          mean= 12242ms  std=    0ms  min=12242ms  max=12242ms

=== GPU state ===
0, 98 %, 23298 MiB, 24576 MiB, 332.52 W, 67
1, 0 %, 111 MiB, 8192 MiB, 6.14 W, 36

=== Last 3 SpecDecoding metrics ===

bench-agentic.sh output

bench-agentic output (1 session x 12 default turns, curve-shape estimate; ~8 min estimate)

[autodetect] using running container=beellama-qwopus-coder url=http://localhost:8020  (skip: PREFLIGHT_NO_AUTODETECT=1)

========================================================================
SESSION 1/1 — 12 turns, context grows to ~29,033 tokens
========================================================================
  Turn  Prompt tok   TTFT ms  Decode TPS  Result chars
  ----- ---------- --------- ----------- -------------
  1          1,212      1538        43.1           307
  2          1,395       658        45.8           249
  3          1,576       869        41.8           278
  4          1,786       683        53.5         8,373
  5          4,843      3486        37.2         8,912
  6          7,572      3168        36.5         3,106
  7          8,918      1983        31.9         6,495
  8         10,881      2620        35.7         2,576
  9         12,223      2092        41.6        25,250
  10        21,387     10160        29.0        17,407
  11        27,539      7811        26.5        21,299
  12        35,238      9615        17.9        21,883  ⚠ tool-call miss (synthetic result injected)


========================================================================
SUMMARY — multi-turn prefill stress (1 session(s) × 12 turns)
========================================================================
  tool-call misses: 1/12 turns — ramp continued via synthetic results (#255); depth/curve unaffected, but tool-call reliability is degraded at depth on this config.
  Turn  Prompt tok   TTFT ms   σ ms  Decode TPS  Notes
  ----- ---------- --------- ------ -----------  ───────────────────────────────────
  1          1,212      1538      0        43.1  cold-start (compile/warmup — excluded from growth)
  2          1,395       658      0        45.8  warm baseline
  3          1,576       869      0        41.8
  4          1,786       683      0        53.5
  5          4,843      3486      0        37.2  ⚠  TTFT 5.3× warm-baseline (O(n)-like growth for this arch_class)
  6          7,572      3168      0        36.5  ⚠  TTFT 4.8× warm-baseline (O(n)-like growth for this arch_class)
  7          8,918      1983      0        31.9  ↑  TTFT 3.0× warm-baseline
  8         10,881      2620      0        35.7  ↑  TTFT 4.0× warm-baseline
  9         12,223      2092      0        41.6  ↑  TTFT 3.2× warm-baseline
  10        21,387     10160      0        29.0  ⚠  TTFT 15.4× warm-baseline (O(n)-like growth for this arch_class)
  11        27,539      7811      0        26.5  ⚠  TTFT 11.9× warm-baseline (O(n)-like growth for this arch_class)
  12        35,238      9615      0        17.9  ⚠  TTFT 14.6× warm-baseline (O(n)-like growth for this arch_class)

────────────────────────────────────────────────────────────────────────
  TTFT growth by accumulated context (12 turns, 1 sessions):
    Turn 1 (cold):           1538 ms TTFT  — compile/warmup, excluded from growth
    Turn 2 (warm base):      658 ms TTFT @ 1,395 prompt tokens
    Turn 12:                 9615 ms TTFT @ 35,238 prompt tokens
    Context grew 25.3×,  TTFT grew 14.6× (warm baseline → last turn)
    ⚠  TTFT grew near-linearly — O(n)-like accumulated-context cost for this cell.
    (Full-context O(n) growth would approach 25.3× with context)

  Note — DeltaNet/SSM state is NOT prefix-cacheable on vLLM Qwen3-Next cells.
  Attention KV caching can still work, but recurrent-state recomputation scales
  O(n) with sequence length. Prior single-card 24 GB vLLM Qwen3-Next observations
  saw degradation above ~35K tokens and timeouts around ~74K; treat those as
  informational per-arch_class guideposts. llama.cpp is not affected.

=== GPU state ===
0, 100 %, 23298 MiB, 24576 MiB, 388.64 W, 68
1, 0 %, 111 MiB, 8192 MiB, 6.14 W, 36

=== Last 3 SpecDecoding metrics ===

Generated by bash scripts/report.sh. Flags: --verify (verify-full), --stress (verify-stress 7/7 incl. Cliff 2 needles), --soak (SOAK_MODE=continuous, catches Cliff 2b), --bench (canonical TPS), --agentic (multi-turn TTFT/decode curve-shape, ~8 min estimate), --full (all five, ~43 min estimate). Use --no-redact to disable redaction (internal sharing only).


noonghunna · 2026-06-16T12:24:13Z

noonghunna
Jun 16, 2026
Maintainer Author

Both things are good news — nothing's actually wrong with your setup.

The "No .safetensors found" message is a false alarm, not a failed download. Your GGUF downloaded fine: the ✓ Downloaded … Qwopus3.6-27B-Coder-MTP-Q5_K_M.gguf line above it is the truth, and you've clearly been running it (those are your serving logs further up). What tripped is the verify step globbing for *.safetensors on a GGUF-only model: it finds none and prints "may have failed." On current master the catalog tells setup.sh to verify *.gguf for this model (just confirmed: WEIGHT_VERIFY_GLOB='*.gguf'), so a git pull makes the message go away. Your weights are intact either way, so there is no need to re-download.
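If it helps to see why the glob matters, here is a minimal sketch in Python (setup.sh itself is shell, and its real verify logic isn't shown in this thread). `verify_weights` is a hypothetical helper; only the two glob patterns and the GGUF file name come from the messages above.

```python
import tempfile
from pathlib import Path


def verify_weights(model_dir: Path, weight_glob: str) -> list[str]:
    """Return the sorted names of files in model_dir that match weight_glob."""
    return sorted(p.name for p in model_dir.glob(weight_glob))


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    # Empty stand-in for the downloaded GGUF (illustration only).
    (model_dir / "Qwopus3.6-27B-Coder-MTP-Q5_K_M.gguf").touch()

    # Old behaviour: a safetensors glob on a GGUF-only model matches nothing.
    old_check = verify_weights(model_dir, "*.safetensors")
    # Current master: the catalog sets WEIGHT_VERIFY_GLOB='*.gguf'.
    new_check = verify_weights(model_dir, "*.gguf")

print("safetensors glob:", old_check)
print("gguf glob:", new_check)
```

An empty match list from the first glob is what produces the "may have failed" warning even though the file is on disk.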

Thanks for the rig report — it rules hardware out of the loop. Three reads:

PCIe x4 on the 3090 only slows the one-time model load from host RAM (it's the cross-card/host-transfer link). Single-card decode doesn't touch it — so it's not the loop cause. (It does mean a slow cold start; that's expected and one-time.)
The GTX 1080 (GPU 1) is correctly unused. The single-card compose pins to CUDA_VISIBLE_DEVICES=0 (your 3090), so beellama never tries to spill layers onto the 8 GB Pascal card — which is what you want; it couldn't help here. Nothing to change.
Headless 3090 (your idle draw says so) → full 24 GB for KV. Good.
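To double-check the "GPU 1 is unused" read yourself, a small sketch that parses the two "GPU state" lines from your bench output. The column meanings (index, utilization, memory used, memory total, power, temperature) are my assumption from the units, and the 256 MiB idle threshold is an arbitrary illustration.

```python
# The two "=== GPU state ===" lines copied from the bench.sh output.
GPU_STATE = """\
0, 98 %, 23298 MiB, 24576 MiB, 332.52 W, 67
1, 0 %, 111 MiB, 8192 MiB, 6.14 W, 36
"""


def parse_gpu_state(text: str) -> dict[int, dict[str, float]]:
    """Parse 'index, util %, used MiB, total MiB, ...' lines into a dict."""
    gpus = {}
    for line in text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        gpus[int(fields[0])] = {
            "util_pct": float(fields[1].split()[0]),
            "mem_used_mib": float(fields[2].split()[0]),
            "mem_total_mib": float(fields[3].split()[0]),
        }
    return gpus


gpus = parse_gpu_state(GPU_STATE)
# "Idle" here means 0 % utilization and under 256 MiB used (assumed threshold).
idle = {i: g["util_pct"] == 0 and g["mem_used_mib"] < 256 for i, g in gpus.items()}
print(idle)
```

GPU 0 (the 3090) carries the whole load while GPU 1 (the 1080) sits idle, which is what the pinning is meant to achieve.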

So the stuck behaviour is exactly the cache-invalidation reprocess from before, confirmed not hardware: at ~150K, KVarN's windowed KV can't reuse the prompt cache, OpenCode resends its growing context each turn → beellama reprocesses the full ~150K from scratch (~4 min at your ~640 t/s prefill) → OpenCode hits its own timeout, cancels, resubmits → reprocess→cancel→reprocess. Looks stuck; it's grinding.
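The arithmetic behind that, as a rough sketch. The 150K context and ~640 t/s prefill are the figures above; the 120 s client timeout is purely hypothetical, since I don't know OpenCode's actual setting.

```python
def reprocess_seconds(context_tokens: int, prefill_tps: float) -> float:
    """Time to re-prefill the whole context when no cached prefix is reusable."""
    return context_tokens / prefill_tps


def will_loop(context_tokens: int, prefill_tps: float, client_timeout_s: float) -> bool:
    """True if a full reprocess outlasts the client timeout (cancel and resubmit)."""
    return reprocess_seconds(context_tokens, prefill_tps) > client_timeout_s


t = reprocess_seconds(150_000, 640.0)
print(f"full reprocess: {t:.0f} s (~{t / 60:.1f} min)")

# Hypothetical 120 s timeout: the client gives up before prefill finishes.
print("loops:", will_loop(150_000, 640.0, client_timeout_s=120.0))
```

Any timeout shorter than the reprocess time produces the same pattern: the request is cancelled before the first token, and the next attempt starts from zero again.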

The fix is to keep the working context well under the ~160K MTP-on ceiling so each reprocess finishes before OpenCode's cancel timeout:

Cap OpenCode's context window (and/or enable its compaction if it has it). Try ~96K to start: your stress ladder measured ~730-760 t/s prefill at 91K-94K, so a full reprocess there takes ~2 min instead of ~4 min, which should break the cancel loop.
Or drop CTX_SIZE on the compose to match.
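For sizing the cap, a sketch that uses the prefill rates from your verify-stress output rather than a single number, since prefill slows as depth grows. The timeouts in the loop and the 0.8 safety margin are hypothetical; plug in OpenCode's real timeout once you know it.

```python
# (prompt tokens, measured prefill tokens/s) from the verify-stress output.
# The last two depths are rounded because the log prints them as "124K" / "150K".
MEASURED = [
    (9_820, 1092.9),
    (29_319, 999.6),
    (58_570, 868.2),
    (91_071, 762.0),
    (124_000, 656.6),
    (150_000, 602.1),
]


def largest_depth_within(timeout_s: float, margin: float = 0.8) -> int:
    """Largest measured depth whose full reprocess fits in timeout_s * margin."""
    budget_s = timeout_s * margin
    fitting = [tokens for tokens, tps in MEASURED if tokens / tps <= budget_s]
    return max(fitting, default=0)


for timeout_s in (60, 180, 300):  # hypothetical client timeouts
    print(timeout_s, "s ->", largest_depth_within(timeout_s), "tokens")
```

It only picks among depths you actually measured, so treat the result as a starting point and leave headroom for the model's reply on top of the prompt.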

If 96K still loops, share the OpenCode timeout setting and I'll help size it — but I'd bet capping the resent context clears it.

Replies: 11 comments
