When Should a Local Qwen3.6-27B Think? I Measured It. #221
Replies: 9 comments 15 replies
-
|
Very nice! I'd love to try to reproduce some of your numbers and experiment. Is the test method based on benchlocal-cli and could be applied to other models and endpoints too? |
Beta Was this translation helpful? Give feedback.
-
|
Since you asked for derivative numbers: I ran one small local GGUF datapoint on a single RTX 3090 (
So in this bounded local GGUF setup, I didn’t see ablated > stock. Not a direct reproduction of your BF16 setup — just a local-context datapoint. Receipt with commands/log pointers/results: Would this kind of derivative datapoint be useful, or is this setup too far from the BF16 baseline to be informative? |
Beta Was this translation helpful? Give feedback.
-
|
I don't understand this part from your post "My first GPQA run had thinking winning by 36 points — 86% to 14%" Also I'm still wandering, why thinking is that bad ? What about speed ? |
Beta Was this translation helpful? Give feedback.
-
A cross-engine data point for this thread — same ON-vs-OFF question, but on a different stack: beellama v0.3.0 (Anbeeld's llama.cpp fork w/ DFlash spec-dec), dual RTX 3090 (PCIe, no NVLink), Qwen3.6-27B Q8 weights at the full 262K native context. Same conclusion as the OP. Serving — beellama v0.3.0, dual 3090 (TP via layer-split)
(engine-internal decode TPS; 3 warm + 5 measured; temp 0.6 / top-k 20 / top-p 1.0. DFlash is output-lossless, so thinking on/off changes only the request, never the spec path. Code runs ~2.3× narrative because DFlash acceptance runs higher on code (~0.58) than prose (~0.3); prose is still net-positive on tok/s — see the correction note at the top. The MTP spec option on this stack caps ~160K on dual Q8; DFlash is what reaches full 262K. Image: Quality — 8-pack suite, thinking ON vs OFF (
|
| Pack | thinking ON | thinking OFF |
|---|---|---|
| toolcall-15 | 14/15 | 13/15 |
| instructfollow-15 | 15/15 | 15/15 |
| structoutput-15 | 14/15 | 14/15 |
| dataextract-15 | 9/15 | 8/15 |
| reasonmath-15 | 12/15 | 13/15 |
| bugfind-15 | 14/15 | 14/15 |
| hermesagent-20 | 13/20 | 13/20 |
| cli-40 | 20/40 | 18/40 |
| TOTAL (8-pack) | 111/150 (74%) | 108/150 (72%) |
Optional reasoning/code packs (run on top of the core 8-pack — kept separate so the /150 stays intact; aider-polyglot-30 is also in this optional set but wasn't run here):
| Pack | thinking ON | thinking OFF |
|---|---|---|
| humaneval-plus-30 | 21/30 | 24/30 |
| lcb-v6-30 | 22/30 | 18/30 |
(30-sample subsets — absolute % aren't directly comparable to the OP's fuller HumanEval+/LCB runs; the comparable signal is the ON-vs-OFF delta.)
Takeaways
- Thinking buys no reliable edge here either — 8-pack +3/150 (~2pp), squarely within n=1 noise. Corroborates the OP's thesis on a completely different serving stack (DFlash spec-dec, dual-card, Q8, full 262K).
- The "should-help" packs split both ways: the optional code packs disagree (humaneval+ favors OFF by 3, lcb-v6 favors ON by 4) and reasonmath edges OFF (+1). A genuine reasoning benefit would move them together — this is pack-level variance, not signal.
- Thinking-OFF stays the default: equal quality, far fewer tokens, and on a spec-dec engine that means materially higher throughput (prose decode stays ~36 TPS while thinking just inflates token count for no quality return).
- Weak spots are mode-independent: dataextract (~55% — the model emits JSON numbers as strings) and cli-40 agentic (~47%). hermesagent landed 65%, above our historical Qwen baseline.
tl;dr — a third independent config (after the OP's runs and our earlier fine-tune A/Bs) where reasoning-on shows no measurable quality gain on Qwen3.6-27B. Thinking-off is the right local default.
Beta Was this translation helpful? Give feedback.
-
|
For me, the non-thinking version sometimes fails. just stops mid work. Edit: I am using PI coding agent, I've had the model to write me a continue plugin, that just makes the model continue : like /continue 10 would continue 10 times. I've also instructed the model to append all the final messages and highlights in a md file, so i can see them after... This way, I can be sure it will work on my long project all night without me having to wake up and see the model not code. |
Beta Was this translation helpful? Give feedback.
-
|
Beta Was this translation helpful? Give feedback.
-
Beta Was this translation helpful? Give feedback.
-
|
my default vllm setting for thinking is false, but sometimes, i want it to think. so i added a feature to my omp (oh-my-pi) so i could easily turn on thinking with reasoning budget. |
Beta Was this translation helpful? Give feedback.
-
|
nah, auto is doing something with another model, default is smol. this would be a to big change. for now i added a 0-thinking level so you can switch manually between 0 (off) and 16k high token for thinking. |
Beta Was this translation helpful? Give feedback.

Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
It's quietly become the default model for people who run their own — and the base for a whole derivative scene on Hugging Face. So I measured the one setting most of us get wrong: reasoning mode.
The model that earned a cult
If you run models locally, you already know Qwen3.6-27B. It punches so far above its class that it's the de-facto local default by a wide margin — and, more tellingly, the base for a steady stream of community work on Hugging Face: agentic fine-tunes, delta-merges, MTP-drafter variants, exotic quants. You don't get that cottage industry around a mediocre model. The base is exceptional; the scene is the proof.
Which is why one result surprised me. The feature this generation is sold on — "thinking," the reasoning channel — is the one place where, run locally, more is not better. That's not a knock on the model. It's the opposite: Qwen3.6-27B answers directly so well that, locally, thinking is mostly a tax you don't need to pay.
Here are the numbers, the two distinct ways thinking fails, and the benchmarking trap that nearly had me publish the exact opposite.
What I measured
enable_thinking=false). Greedy, identical prompts.The finding: thinking doesn't pay
Reasoning-off wins or ties everywhere I could measure, and reasoning-on always cost multiples of the tokens — ≈13× the output on HumanEval+. The lone nominal "win," GPQA, is a within-noise tie (below). One caveat on the math row: GSM at this difficulty is easy enough that no-think already maxes out, so it shows thinking's cost, not whether it could help on harder problems.
For local code, the call is unconditional: default thinking off.
Two ways it fails
Thinking doesn't fail one way — it fails two, by difficulty:
Runaway is the mode that bites locally: it doesn't just waste tokens, it burns your whole context window to hand back silence — worse than a wrong answer, because you paid full price for it.
The +36 that vanished
Here's the part every benchmark writeup should include and almost none do.
My first GPQA run had thinking winning by 36 of the 50 questions — 43/50 (86%) vs 7/50 (14%). I nearly wrote it up. It was a bug in my harness: the no-think arm had a smaller token budget and was getting cut off mid-answer (on GPQA the model reasons in the answer text, so it needs the room). Give both arms an adequate budget and that 14% climbs to 80% — and the 36-question gap collapses to 3 of 50 (86% vs 80%), inside the noise. A tie.
So my headline nearly read "thinking helps science!" — backwards, and entirely an artifact of how I measured. That's the real, transferable lesson, and it holds for any model. Before you trust a thinking delta:
If you're building on 27B — a fine-tune, a merge, an MTP variant, a fresh quant — that's the recipe to check whether thinking pays for your derivative. I'd genuinely like to see those numbers.
Fair caveats
This is one setup, and the honest reading depends on it:
The takeaway
Qwen3.6-27B is the local model to beat, and it earned that with its direct answers, not its reasoning channel. For local code-gen, default thinking off — faster, cheaper, at least as accurate. Keep reasoning-on in your pocket for genuinely multi-step work; and if you bound it (a structured-CoT grammar), know that bounding makes it terminate, not out-reason writing the code — it still trailed no-think by ~24 points on the hard set.
Most of us run this model quantized, on a single card. That's the reality these numbers come from, and the experience worth optimizing for. Getting the most out of the best local model, it turns out, is mostly a matter of letting it answer.
Method, numbers, and the budget-ladder are reproducible — happy to share the harness. The reasoning-on bf16 comparison is next; if you run the same A/B on your own 27B derivative, post your numbers.
Update (2026-05-25) — the off / free / bounded three-way on hard math
@sztlink ran the cleanest non-code test this writeup was missing: the full AIME-2026 set (30 problems) on the same local setup (single 3090, Qwen3.6-27B Q4_K_M, llama.cpp, temp 0).
So even on hard, contamination-resistant contest math — where thinking is most supposed to pay, with plenty of headroom (no GSM-style ceiling) — free thinking gave no reliable uplift (+1 of 30, within noise). And the tell: ~half the thinking-on runs (15/30) still hit the 30k cap without answering — so the runaway is genuine non-termination, not too small a budget.
The third arm is the one that sharpens the whole post. Bounded thinking — a grammar that forces the reasoning into a short plan so it must terminate — hurt here: 12/30, below no-think. That's the opposite of what it does on code, where bounding rescued the runaway (it beat free-thinking by +24pp on LiveCodeBench, purely by terminating). The reason is a clean mirror: on code the runaway lives inside the
<think>block, so bounding it helps; on AIME the length is in the answer itself — the derivation is the work — so even with reasoning bounded, 17/30 still hit the cap in the answer body. Bounded CoT is a containment tool for code, not a booster for math.This off / free / bounded three-way on AIME-2026 is the cleanest non-code test in the writeup — GSM was ceiling-limited, GPQA was a public set — and it lands where everything else did: no-think is the default, even on hard contest math. (Single-pass n=30, stock Q4_K_M, single 3090, llama.cpp, temp 0; receipts in
sztlink/boring-receipts. The bounded arm used a strict custom GOAL/APPROACH/EDGE grammar sharing a 4096 total budget — not our shipped scratchpad config — so part of its drop is the plan eating answer-room; directionally clear, not a 1:1 eval.)Beta Was this translation helpful? Give feedback.
All reactions