When Should a Local Qwen3.6-27B Think? I Measured It. #221

noonghunna · 2026-05-24T22:37:20Z

noonghunna
May 24, 2026
Maintainer

It's quietly become the default model for people who run their own — and the base for a whole derivative scene on Hugging Face. So I measured the one setting most of us get wrong: reasoning mode.

The model that earned a cult

If you run models locally, you already know Qwen3.6-27B. It punches so far above its class that it's the de-facto local default by a wide margin — and, more tellingly, the base for a steady stream of community work on Hugging Face: agentic fine-tunes, delta-merges, MTP-drafter variants, exotic quants. You don't get that cottage industry around a mediocre model. The base is exceptional; the scene is the proof.

Which is why one result surprised me. The feature this generation is sold on — "thinking," the reasoning channel — is the one place where, run locally, more is not better. That's not a knock on the model. It's the opposite: Qwen3.6-27B answers directly so well that, locally, thinking is mostly a tax you don't need to pay.

Here are the numbers, the two distinct ways thinking fails, and the benchmarking trap that nearly had me publish the exact opposite.

What I measured

Model: Qwen3.6-27B, Q4_K_M, single RTX 3090, llama.cpp.
Arms: reasoning on (default) vs off (enable_thinking=false). Greedy, identical prompts.
Benches: HumanEval+ (164) and LiveCodeBench v6 (70) for code; GSM-Symbolic and GPQA-Diamond for math and science.
Why "local" matters: on your own GPU, thinking is billed in your seconds and your KV budget — a cost an API leaderboard never shows you.

The finding: thinking doesn't pay

Task	thinking ON	thinking OFF
HumanEval+ (code)	92.1%	94.5%
LiveCodeBench v6 (code)	64.3%	90.0%
GSM-Symbolic (math)	80%	100%
GPQA-Diamond (science)	86%	80% — tie

Reasoning-off wins or ties everywhere I could measure, and reasoning-on always cost multiples of the tokens — ≈13× the output on HumanEval+. The lone nominal "win," GPQA, is a within-noise tie (below). One caveat on the math row: GSM at this difficulty is easy enough that no-think already maxes out, so it shows thinking's cost, not whether it could help on harder problems.

For local code, the call is unconditional: default thinking off.

Two ways it fails

Thinking doesn't fail one way — it fails two, by difficulty:

Easy problems → overthinking. On HumanEval+ it finishes, then talks itself out of a correct answer — restructures working code and drops an import, over-reasons a one-liner into the wrong algorithm. It knew the answer cold; thinking added entropy.
Hard problems → runaway. On LiveCodeBench, ~half of reasoning-on runs never emit an answer at all — they reason until they hit the token cap and return nothing. No-think solves the same problems directly in a fraction of the budget.

Runaway is the mode that bites locally: it doesn't just waste tokens, it burns your whole context window to hand back silence — worse than a wrong answer, because you paid full price for it.

The +36 that vanished

Here's the part every benchmark writeup should include and almost none do.

My first GPQA run had thinking winning by 36 of the 50 questions — 43/50 (86%) vs 7/50 (14%). I nearly wrote it up. It was a bug in my harness: the no-think arm had a smaller token budget and was getting cut off mid-answer (on GPQA the model reasons in the answer text, so it needs the room). Give both arms an adequate budget and that 14% climbs to 80% — and the 36-question gap collapses to 3 of 50 (86% vs 80%), inside the noise. A tie.

So my headline nearly read "thinking helps science!" — backwards, and entirely an artifact of how I measured. That's the real, transferable lesson, and it holds for any model. Before you trust a thinking delta:

give both arms an adequate, non-truncating budget (adequate, not necessarily equal);
count truncation separately from wrong — running out of budget isn't reasoning badly;
avoid ceiling tasks — if no-think already scores 100%, you can only measure thinking hurting;
validate your grader both ways — too strict fakes a loss, too loose fakes a win.

If you're building on 27B — a fine-tune, a merge, an MTP variant, a fresh quant — that's the recipe to check whether thinking pays for your derivative. I'd genuinely like to see those numbers.

Fair caveats

This is one setup, and the honest reading depends on it:

Quantization. Everything here is Q4. Runaway is exactly where 4-bit error compounds over a long chain, so the runaway tail may be a quantization effect rather than the model's design — short reasoning matched full precision (GPQA Q4 86% ≈ the published bf16 ~88%). A bf16 run is the fair test, and it's next.
GPQA was a tie, not a loss — reasoning-on holds its own on the hardest recall task.
The model is strong. It gets the right answers; thinking just adds cost here, on these tasks, run this way.
Scope. One model, one quant, one engine, n=1 per problem. I'm not claiming this for "local models" in general — only for Qwen3.6-27B, run the way most of us actually run it.

The takeaway

Qwen3.6-27B is the local model to beat, and it earned that with its direct answers, not its reasoning channel. For local code-gen, default thinking off — faster, cheaper, at least as accurate. Keep reasoning-on in your pocket for genuinely multi-step work; and if you bound it (a structured-CoT grammar), know that bounding makes it terminate, not out-reason writing the code — it still trailed no-think by ~24 points on the hard set.

Most of us run this model quantized, on a single card. That's the reality these numbers come from, and the experience worth optimizing for. Getting the most out of the best local model, it turns out, is mostly a matter of letting it answer.

Method, numbers, and the budget-ladder are reproducible — happy to share the harness. The reasoning-on bf16 comparison is next; if you run the same A/B on your own 27B derivative, post your numbers.

Update (2026-05-25) — the off / free / bounded three-way on hard math

@sztlink ran the cleanest non-code test this writeup was missing: the full AIME-2026 set (30 problems) on the same local setup (single 3090, Qwen3.6-27B Q4_K_M, llama.cpp, temp 0).

arm	score	note
stock, thinking-off (4096)	17/30	the baseline
stock, thinking-on (30000)	18/30	+1 — within n=30 noise, at ~7× the tokens
stock, bounded thinking (GBNF, 4096)	12/30	below no-think
Huihui abliterated, thinking-on (30000)	15/30	no ablation signal

So even on hard, contamination-resistant contest math — where thinking is most supposed to pay, with plenty of headroom (no GSM-style ceiling) — free thinking gave no reliable uplift (+1 of 30, within noise). And the tell: ~half the thinking-on runs (15/30) still hit the 30k cap without answering — so the runaway is genuine non-termination, not too small a budget.

The third arm is the one that sharpens the whole post. Bounded thinking — a grammar that forces the reasoning into a short plan so it must terminate — hurt here: 12/30, below no-think. That's the opposite of what it does on code, where bounding rescued the runaway (it beat free-thinking by +24pp on LiveCodeBench, purely by terminating). The reason is a clean mirror: on code the runaway lives inside the <think> block, so bounding it helps; on AIME the length is in the answer itself — the derivation is the work — so even with reasoning bounded, 17/30 still hit the cap in the answer body. Bounded CoT is a containment tool for code, not a booster for math.

This off / free / bounded three-way on AIME-2026 is the cleanest non-code test in the writeup — GSM was ceiling-limited, GPQA was a public set — and it lands where everything else did: no-think is the default, even on hard contest math. (Single-pass n=30, stock Q4_K_M, single 3090, llama.cpp, temp 0; receipts in sztlink/boring-receipts. The bounded arm used a strict custom GOAL/APPROACH/EDGE grammar sharing a 4096 total budget — not our shipped scratchpad config — so part of its drop is the plan eating answer-room; directionally clear, not a 1:1 eval.)

laurimyllari · 2026-05-24T22:56:45Z

laurimyllari
May 24, 2026

Very nice! I'd love to try to reproduce some of your numbers and experiment. Is the test method based on benchlocal-cli and could be applied to other models and endpoints too?

1 reply

noonghunna May 24, 2026
Maintainer Author

yup. all docs updated.

sztlink · 2026-05-25T01:31:46Z

sztlink
May 25, 2026

Since you asked for derivative numbers: I ran one small local GGUF datapoint on a single RTX 3090 (llama.cpp / llama-server, --reasoning off, temp 0, max_tokens=4096) using the full AIME 2026 set (30 problems).

stock Qwen3.6-27B Q4_K_M: 17/30
Huihui Qwen3.6-27B abliterated Q4_K_M: 15/30

So in this bounded local GGUF setup, I didn’t see ablated > stock. Not a direct reproduction of your BF16 setup — just a local-context datapoint.

Receipt with commands/log pointers/results:
https://github.com/sztlink/boring-receipts/blob/main/receipts/2026-05-24-3090-aime26-full-stock-vs-huihui-qwen36-27b-q4km.md

Would this kind of derivative datapoint be useful, or is this setup too far from the BF16 baseline to be informative?

6 replies

sztlink May 25, 2026

Thanks — that was exactly the next run. The overnight 30k reasoning pass finished.

Same local regime: single RTX 3090, llama.cpp / llama-server, Qwen3.6-27B Q4_K_M, AIME 2026 full set, temp 0.

Stock Qwen3.6-27B:

--reasoning off, max_tokens=4096: 17/30
--reasoning on, max_tokens=30000: 18/30

Huihui abliterated with --reasoning on, max_tokens=30000: 15/30.

So in this local GGUF setup, thinking-on gave stock +1 problem but at very high output cost; 15/30 stock cases and 15/30 Huihui cases still hit finish_reason=length at 30k. I’d read it narrowly as: no clear AIME reasoning uplift in this 24GB Q4_K_M setup, and still no ablated > stock signal here. Single-pass n=30 caveat applies.

Receipt/logs: https://github.com/sztlink/boring-receipts/blob/main/receipts/2026-05-25-3090-aime26-full-stock-vs-huihui-qwen36-27b-q4km-30k-reasoning-on.md

noonghunna May 25, 2026
Maintainer Author

@sztlink this is the datapoint — thank you. You ran the exact experiment overnight, with a receipt. Exemplary, and genuinely the cleanest non-code reasoning number we have.

How I read it:

Thinking-on gave stock +1/30 (18 vs 17) — within noise at n=30. So even on AIME — hard contest math, the place thinking is most supposed to pay, and with plenty of headroom (no GSM ceiling) — there's no clear reasoning uplift. That's the strongest version of the finding: thinking doesn't reliably help even here.
For that +1, it spent ~7× the tokens (30k vs 4096).
The killer detail: ~half the thinking-on runs (15/30) still hit finish_reason=length at 30k. That matters a lot — it means runaway isn't budget-starvation (30k is 7× the off-budget and half still never terminate), it's genuine non-termination. Exactly the failure mode from the code benches, reproduced on math at a generous budget.
Abliterated still ≤ stock (15 ≤ 18) — no ablation signal.

This is the test GSM and GPQA couldn't be — AIME has the headroom GSM lacked and the contamination-resistance GPQA lacked — and the answer comes back the same. I'm folding it into the writeup as the non-code anchor, credited (receipt linked).

One nuance it sharpens for me: the post flagged "maybe runaway is a Q4 quant artifact." Your 30k-budget-still-50%-truncation result makes runaway look real rather than budget-driven — though a bf16 run would still tell us whether full precision runs away less. Either way, this closes the most important open question in the post. 🙏

noonghunna May 25, 2026
Maintainer Author

@sztlink one more arm worth a run, since you're cranking max_tokens to fight the runaway: bounded thinking. Instead of giving the reasoning more room, give it a grammar — a GBNF that constrains the <think> block to a short GOAL/APPROACH/EDGE plan, so it's forced to terminate and answer in a fraction of the budget. That kills your "~half hit the 30k cap" problem outright — every problem answers.

It also completes the picture: free thinking didn't beat no-think on your AIME run (18 vs 17), so bounded is the one arm left untested.

Honest caveat from our code data, though: bounding rescues the truncation (our FSM beat free-thinking by +24pp on LiveCodeBench — entirely by terminating where free ran away) but still trailed no-think (66% vs 90%). So it's a containment tool: it makes thinking finish, not out-reason just answering directly. Whether that holds on hard math is genuinely open — a 3-line bounded plan might be too tight for AIME-style derivations, or it might be the sweet spot. Either result is a clean datapoint.

It ships as bounded-thinking.yml (llama.cpp single) — writeup + the grammar in docs/STRUCTURED_COT.md. If you run it on AIME-2026 you'd have the full three-way (off / free / bounded) on the cleanest non-code bench we've got — I'd fold that into the writeup too. 🙏

sztlink May 25, 2026

Ran the bounded arm too. Result is useful but negative.

Same local regime: RTX 3090, llama.cpp / llama-server, stock Qwen3.6-27B Q4_K_M, AIME 2026 full set, temp 0.

Three-way stock comparison now:

no-think, max_tokens=4096: 17/30
free thinking, max_tokens=30000: 18/30
bounded thinking, strict GAE GBNF, max_tokens=4096: 12/30

The grammar structure fired on 30/30 cases. Reasoning was bounded, but answer-body generation still hit finish_reason=length on 17/30 cases, and accuracy dropped below no-think.

Important caveat: this was a stricter local GOAL/APPROACH/EDGE grammar, not an exact DeepSeek-scratchpad reproduction. The DeepSeek scratchpad smoke mostly fired, but one case degenerated into repeated PLAN: 1. 1. 1..., so I tightened the grammar for the full run.

Receipt/logs/grammar:
https://github.com/sztlink/boring-receipts/blob/main/receipts/2026-05-25-3090-aime26-stock-qwen36-27b-q4km-bounded-gae-4096.md

noonghunna May 25, 2026
Maintainer Author

@sztlink the three-way is complete, and it's the useful kind of negative — thank you, receipt and all 🙏. Full picture, AIME-2026, stock Q4_K_M, single 3090:

no-think (4096): 17/30
free thinking (30k): 18/30
bounded GAE (4096): 12/30

Bounding hurt — and the why is the gem in your data: the grammar fired on 30/30 and the reasoning was bounded as designed, yet 17/30 still hit finish_reason=length — in the answer body, not the think block. That's the mirror image of the code result. On LiveCodeBench the runaway lived inside <think>, so bounding it rescued +24pp by forcing termination. On AIME the length demand is in the derivation itself — the answer is the multi-step work — so a tight think-grammar can't help, and at a 4096 total budget the plan even eats room the answer needed. Bounded CoT is a containment tool for code (short answer, runaway in the thinking), not a booster for math (long answer, the work can't be skipped). Confirms the caveat exactly: a 3-line GOAL/APPROACH/EDGE plan is too tight for AIME-style derivations.

Two things before we call it a clean loss, if you're up for one more run:

Budget confound. no-think and bounded both ran at 4096 total, but bounded spends part of that on the plan → the answer arm had less room than no-think got. So some of the 12/30 is answer-starvation, not bounding per se. Clean isolation: bounded think + a larger ANSWER budget (8–12k). Given no-think fits 4096 and still beat it, I'd bet bounded stays ≤ no-think — but it removes the confound.
It's your stricter GAE grammar, not our shipped bounded-thinking.yml (DeepSeek scratchpad) — so this is directional for bounded-CoT-on-math, not a 1:1 eval of our config. And the PLAN: 1. 1. 1... degeneration you hit on the scratchpad smoke is a real structured-CoT gotcha (loose grammar → repetition loop on hard problems) — that earns its own line in our STRUCTURED_COT notes. Thanks for catching it.

Net: the cleanest non-code bench we have says no-think is the default even on hard contest math — free ties at ~7× the tokens + ~50% runaway, bounded trails. I'm folding the full off / free / bounded triplet into the writeup as the non-code anchor, credited, receipts linked. This thread is basically the rigorous appendix the post was asking for. 🙏

Mateleo · 2026-05-25T11:58:34Z

Mateleo
May 25, 2026

I don't understand this part from your post "My first GPQA run had thinking winning by 36 points — 86% to 14%"
Do you mean 50% ?

Also I'm still wandering, why thinking is that bad ? What about speed ?

1 reply

noonghunna May 25, 2026
Maintainer Author

@Mateleo good questions — and you caught a genuinely confusing bit of wording, thanks.

On the "+36": that's +36 questions out of 50, not percentage points. First run: thinking got 43/50 right (86%), no-think got 7/50 (14%) — a 36-question gap, which in percentages is the 86-vs-14 = 72-point gap. The post wrote it as "+36 points" (mixing the question count with the percentages), which is exactly what tripped you up — that's on me, the wording there is loose.

The important part: that 86-vs-14 was a measurement bug — the no-think arm's answers were getting cut off by a too-small token budget. Fix the budget and the fair result is 43/50 (86%) vs 40/50 (80%) — a 3-question / 6-point gap, which is within the noise at n=50. So it's really a tie, not a win for either side.

Why is thinking "that bad"? It's narrower than "the model reasons badly":

Easy problems → it overthinks and talks itself out of an answer it already had right.
Hard problems → it runs away: keeps reasoning until it hits the token limit and never emits an answer (~half of the hard coding problems). No-think just writes the solution directly.
It's task-specific — on science recall (GPQA) thinking was a tie, not a loss; it only clearly hurt on code. So it's "no reliable gain at real cost," not "can't reason."
Caveat: this is a 4-bit quant, so the runaway tail might be quantization error compounding on long chains rather than the model itself — a full-precision run is next.

Speed — great question, and it's the clincher locally:

Per-token decode speed is identical (same model either way).
But thinking emits far more tokens before the answer — ~13× the output on HumanEval+ — so time-to-a-usable-answer is multiples longer with thinking on.
Worst case (runaway) it spends the whole budget and returns nothing.
So thinking-off is dramatically faster end-to-end and equal-or-better accuracy. That's the whole point: locally you pay for thinking in your own GPU-seconds, and here it buys nothing.

Hope that clears it up. 🙏

noonghunna · 2026-06-01T18:48:55Z

noonghunna
Jun 1, 2026
Maintainer Author

Correction (2026-06-03): the footnote's “acceptance collapses on prose (~0.10)” was an over-read of the noisy acceptance-rate diagnostic. A careful tok/s re-test (measured no-spec controls, three v0.3.0 images) shows DFlash prose is net-positive — dual-Q8 +52% over the no-spec baseline, single +27%, gemma single +28–31%. The “prose collapse” is retracted; full A/B in #288. The thinking-on-vs-off numbers in this card are unaffected.

A cross-engine data point for this thread — same ON-vs-OFF question, but on a different stack: beellama v0.3.0 (Anbeeld's llama.cpp fork w/ DFlash spec-dec), dual RTX 3090 (PCIe, no NVLink), Qwen3.6-27B Q8 weights at the full 262K native context. Same conclusion as the OP.

Serving — beellama v0.3.0, dual 3090 (TP via layer-split)

Config	Spec-dec	KV / ctx	decode TPS (narr / code)	TTFT	VRAM / card
Qwen3.6-27B UD-Q8_K_XL	DFlash (Anbeeld IQ4_XS draft, n=3)	q5_0 / q4_1 · 262K	36.0 / 84.6	~93 ms	~21.2 GB (balanced via `--tensor-split 0.575,0.425`)

(engine-internal decode TPS; 3 warm + 5 measured; temp 0.6 / top-k 20 / top-p 1.0. DFlash is output-lossless, so thinking on/off changes only the request, never the spec path. Code runs ~2.3× narrative because DFlash acceptance runs higher on code (~0.58) than prose (~0.3); prose is still net-positive on tok/s — see the correction note at the top. The MTP spec option on this stack caps ~160K on dual Q8; DFlash is what reaches full 262K. Image: ghcr.io/noonghunna/beellama-cpp:multiarch-v0.3.0-efe856397; multi-GPU DFlash testing tracked in #288.)

Quality — 8-pack suite, thinking ON vs OFF (`benchlocal-cli`, verifier-backed, n=1)

Pack	thinking ON	thinking OFF
toolcall-15	14/15	13/15
instructfollow-15	15/15	15/15
structoutput-15	14/15	14/15
dataextract-15	9/15	8/15
reasonmath-15	12/15	13/15
bugfind-15	14/15	14/15
hermesagent-20	13/20	13/20
cli-40	20/40	18/40
TOTAL (8-pack)	111/150 (74%)	108/150 (72%)

Optional reasoning/code packs (run on top of the core 8-pack — kept separate so the /150 stays intact; aider-polyglot-30 is also in this optional set but wasn't run here):

Pack	thinking ON	thinking OFF
humaneval-plus-30	21/30	24/30
lcb-v6-30	22/30	18/30

(30-sample subsets — absolute % aren't directly comparable to the OP's fuller HumanEval+/LCB runs; the comparable signal is the ON-vs-OFF delta.)

Takeaways

Thinking buys no reliable edge here either — 8-pack +3/150 (~2pp), squarely within n=1 noise. Corroborates the OP's thesis on a completely different serving stack (DFlash spec-dec, dual-card, Q8, full 262K).
The "should-help" packs split both ways: the optional code packs disagree (humaneval+ favors OFF by 3, lcb-v6 favors ON by 4) and reasonmath edges OFF (+1). A genuine reasoning benefit would move them together — this is pack-level variance, not signal.
Thinking-OFF stays the default: equal quality, far fewer tokens, and on a spec-dec engine that means materially higher throughput (prose decode stays ~36 TPS while thinking just inflates token count for no quality return).
Weak spots are mode-independent: dataextract (~55% — the model emits JSON numbers as strings) and cli-40 agentic (~47%). hermesagent landed 65%, above our historical Qwen baseline.

tl;dr — a third independent config (after the OP's runs and our earlier fine-tune A/Bs) where reasoning-on shows no measurable quality gain on Qwen3.6-27B. Thinking-off is the right local default.

0 replies

lolren · 2026-06-03T06:53:44Z

lolren
Jun 3, 2026

For me, the non-thinking version sometimes fails. just stops mid work.
The thinking version is solid.
Some tricks to make this model behave like a SOTA model:
Have a SOTa model plan the task step by step. In my case GPT5.5. Then have the SOTA model write the prompt for this, usually in a file. A long detailed prompt, with instructions how to test, how to use x hardware etc..
And finally, give the instructions to this model. It works amazing. I love it more and more, especially after the cloud models token fiasco.
Thank you @noonghunna for all your hard work.. This club is actually what made the model useful for most of us.

Edit: I am using PI coding agent, I've had the model to write me a continue plugin, that just makes the model continue : like /continue 10 would continue 10 times. I've also instructed the model to append all the final messages and highlights in a md file, so i can see them after... This way, I can be sure it will work on my long project all night without me having to wake up and see the model not code.

3 replies

noonghunna Jun 3, 2026
Maintainer Author

Thanks @lolren — genuinely made my day, and exactly the kind of report that makes this thread worth keeping. 🙏

A few thoughts on what you're hitting:

The non-thinking "stops mid-work" is almost certainly fixable — it's usually not the model quitting, it's sampling. Our composes default temperature to 0.6 server-side, but a temperature your agent sends overrides that, and most harnesses (PI included) send their own — often the model card's 1.0, which is exactly when Qwen tends to emit an early stop. Set temperature ≈ 0.6 in PI's client config and non-thinking usually gets a lot more reliable. Details in the agent-stops-mid-task FAQ entry (tracked in #232). That said — if thinking-on is rock-solid for you, that matches what this thread measured, so no need to fight it.

The "frontier model plans → local model executes" split is a great pattern. The local model is at its best executing a tight, well-specified plan rather than doing the open-ended planning itself, so handing it a detailed prompt with test/hardware instructions plays right to its strengths.

Love the /continue + append-highlights-to-.md overnight setup. Beyond the convenience, periodically condensing to a highlights file keeps the live context lean — which also sidesteps the accumulated-context wobble we document in CLIFFS (single-card sessions can drift once accumulated context climbs past ~20–25K tokens). If you're ever up for sharing that PI /continue plugin, I suspect a bunch of folks here would grab it.

Thanks again — this is the good stuff.

lolren Jun 4, 2026

I will tweak and share the continue PI plugin, as it's not perfect now, it just works. Will share it when I get back from work, this evening.
A bit of update: I've been running Qwen (thinking mode) for about 30 hours until this morning when I've noticed it was kinda stuck with a complex task.
I've fired GPT 5.5 Xhigh to solve the issue and , for the first time, I HAD THE LONGEST Codex session for just under 6 hours! It made the issue better, but it could not fix it as well completely! So I've asked GPT 5.5 to create a handover and Qwen 27B is back at it!
I honestly think this is one of the best model we will have for a long time, at this parameter count! I would be shocked if someone shoots itself in the foot and releases a better one that can be closer to SOTA models than this!
This has saved me tons of tokens!
Also, I've asked GPT 5.5 Xhigh to find and fix all issues in the enormous code QWEN added! and he did find some, as they all do, but nothing serious! most of the implementations were spot on!
Just in case anyone is curios of the work I'm doing this is the repo and I've asked the model to complete Thread and Matter support for this microcontroller! It has root access, no sandbox to my Linux Machine!
GPT 5.5Xhigh estimated this work to be about 200 more slices! But Qwen 27B got it to a sorta working state!
This work was started with GPT 5.2Xhigh and continued now with 5.5Xhigh and Qwen, Deepseek v4 pro and Flash! Gemini, models tried but failed every task (up until 3.1).
P.S. Refreshing every day to see when Qwen 3.7 27B is released! If it will, and if the gains are the same like from 3.5 to 3.6, wow! Who would pay for cloud anymore?!?
At the moment, I've resumed Qwen to work on the remaining implementations!

Edit: I don't know what changed from the early days of using this. Maybe some bugs in VLLM have been resolved, but it's much smarter now!

noonghunna Jun 4, 2026
Maintainer Author

great feedback! thank you!

lolren · 2026-06-04T12:36:30Z

lolren
Jun 4, 2026

meantime-on-r-vibecoding-v0-jp0qctw6ixxg1

0 replies

lolren · 2026-06-04T19:24:42Z

lolren
Jun 4, 2026

https://github.com/lolren/pi-ext-auto-continue

1 reply

Dmtrii-tesla Jun 9, 2026

/goal command is good for many cases

efschu · 2026-06-09T06:29:32Z

efschu
Jun 9, 2026

my default vllm setting for thinking is false, but sometimes, i want it to think. so i added a feature to my omp (oh-my-pi) so i could easily turn on thinking with reasoning budget.

can1357/oh-my-pi#2170

3 replies

efschu · 2026-06-09T13:30:04Z

efschu
Jun 9, 2026

nah, auto is doing something with another model, default is smol.

this would be a to big change. for now i added a 0-thinking level so you can switch manually between 0 (off) and 16k high token for thinking.

0 replies

When Should a Local Qwen3.6-27B Think? I Measured It. #221

Uh oh!

Uh oh!

noonghunna May 24, 2026 Maintainer

The model that earned a cult

What I measured

The finding: thinking doesn't pay

Two ways it fails

The +36 that vanished

Fair caveats

The takeaway

Update (2026-05-25) — the off / free / bounded three-way on hard math

Replies: 9 comments · 15 replies

Uh oh!

Uh oh!

Uh oh!

noonghunna May 24, 2026 Maintainer Author

Uh oh!

Uh oh!

Uh oh!

noonghunna May 25, 2026 Maintainer Author

Uh oh!

noonghunna May 25, 2026 Maintainer Author

Uh oh!

Uh oh!

noonghunna May 25, 2026 Maintainer Author

Uh oh!

Uh oh!

noonghunna May 25, 2026 Maintainer Author

Uh oh!

Uh oh!

noonghunna Jun 1, 2026 Maintainer Author

Serving — beellama v0.3.0, dual 3090 (TP via layer-split)

Quality — 8-pack suite, thinking ON vs OFF (benchlocal-cli, verifier-backed, n=1)

Takeaways

Uh oh!

Uh oh!

Uh oh!

noonghunna Jun 3, 2026 Maintainer Author

Uh oh!

Uh oh!

Uh oh!

noonghunna Jun 4, 2026 Maintainer Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

noonghunna
May 24, 2026
Maintainer

Replies: 9 comments 15 replies

noonghunna May 24, 2026
Maintainer Author

noonghunna May 25, 2026
Maintainer Author

noonghunna May 25, 2026
Maintainer Author

noonghunna May 25, 2026
Maintainer Author

noonghunna May 25, 2026
Maintainer Author

noonghunna
Jun 1, 2026
Maintainer Author

Quality — 8-pack suite, thinking ON vs OFF (`benchlocal-cli`, verifier-backed, n=1)

noonghunna Jun 3, 2026
Maintainer Author

noonghunna Jun 4, 2026
Maintainer Author