Qwen3.6 35b-a3b experiments #241

laurimyllari · 2026-05-27T22:52:23Z


More speed? How does TPS narrative 206.4 / code 256.0 (TTFT 112/115 ms) sound?

Inspired by the recent quality measurements, I started experimenting with the 35B-A3B quants to see how their numbers compare. I tried one before with pi.dev, with less-than-stellar results, but the speeds are impressive.

I started by doing a partial offload of Unsloth UD-Q8_K_XL (https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF). I created a simple compose file with the following:

      --fit
      --ctx-size ${CTX_SIZE:-262144}
      -b ${BATCH_SIZE:-4096}
      -ub ${UBATCH_SIZE:-1024}
      -np ${NP:-1}
      -ctk ${KV_TYPE:-q8_0}
      -ctv ${KV_TYPE:-q5_0}
      -khad
      -vhad
      --merge-qkv
      -fa on
      --cache-ram 4096
      --no-mmap
      --jinja
      --parallel-tool-calls
      --reasoning ${REASONING:-off}
      --reasoning-format ${REASONING_FORMAT:-deepseek}
      --temp ${TEMP:-${TEMPERATURE:-0.6}}
      --top-p ${TOP_P:-0.95}
      --top-k ${TOP_K:-20}
      --min-p ${MIN_P:-0.0}
      --repeat-penalty ${REPEAT_PENALTY:-1.0}
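
The `${VAR:-default}` placeholders above resolve shell-style: the default applies when the variable is unset or empty. Here is a minimal Python sketch of that resolution, assuming Compose follows those semantics for nested defaults too. The `expand` helper is hypothetical and written only for illustration. It shows the nested `TEMP`/`TEMPERATURE` fallback, and that `-ctk` and `-ctv` read the same `KV_TYPE` variable, so setting it would replace both the q8_0 and q5_0 defaults.

```python
import re

# Matches an innermost ${NAME:-default}, i.e. one whose default has no nested ${...}.
_PLACEHOLDER = re.compile(r"\$\{(\w+):-([^${}]*)\}")


def expand(template: str, env: dict[str, str]) -> str:
    """Resolve ${NAME:-default} placeholders, innermost first.

    Assumes shell-style semantics: the default is used when NAME is unset or empty.
    """
    previous = None
    while previous != template:
        previous = template
        template = _PLACEHOLDER.sub(
            lambda m: env.get(m.group(1)) or m.group(2), template
        )
    return template


kv_flags = "-ctk ${KV_TYPE:-q8_0} -ctv ${KV_TYPE:-q5_0}"
temp_flag = "--temp ${TEMP:-${TEMPERATURE:-0.6}}"

print(expand(kv_flags, {}))                       # -ctk q8_0 -ctv q5_0
print(expand(kv_flags, {"KV_TYPE": "q4_0"}))      # -ctk q4_0 -ctv q4_0
print(expand(temp_flag, {"TEMPERATURE": "1.0"}))  # --temp 1.0
```

Keeping the asymmetric cache while still allowing overrides would need one variable per cache type; what to call them is up to the compose author.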

I tried a few different options (-ngl 999 --n-cpu-moe 24, MTP (which ran out of host memory), --merge-up-gate-experts, --run-time-repack), but the fastest partial offload used just --fit --no-mmap. I didn't try -ot because I didn't know how to improve on --fit with it.

Speeds were not great, but considering that our 27B llama.cpp option was at ~20 TPS until quite recently, they weren't bad at all.

I ran a benchmark and a few quality tests with a 120k context as a baseline for Qwen 3.6 35B-A3B quality.

TPS narrative 64.5 / code 64.2 (TTFT 248/227 ms).
Prompt-processing fallback: 1697 tok/s
Quality: 102/150 (68%) — toolcall-15 14/15 (93%) · instructfollow-15 13/15 (87%) · structoutput-15 12/15 (80%) · dataextract-15 12/15 (80%) · reasonmath-15 11/15 (73%) · bugfind-15 13/15 (87%) · hermesagent-20 9/20 (45%) · cli-40 18/40 (45%)
Quality: 104/150 (69%) — toolcall-15 14/15 (93%) · instructfollow-15 13/15 (87%) · structoutput-15 12/15 (80%) · dataextract-15 12/15 (80%) · reasonmath-15 11/15 (73%) · bugfind-15 13/15 (87%) · hermesagent-20 11/20 (55%) · cli-40 18/40 (45%)
Quality: 104/150 (69%) — toolcall-15 14/15 (93%) · instructfollow-15 13/15 (87%) · structoutput-15 12/15 (80%) · dataextract-15 12/15 (80%) · reasonmath-15 11/15 (73%) · bugfind-15 13/15 (87%) · hermesagent-20 11/20 (55%) · cli-40 18/40 (45%)
Quality: 103/150 (69%) — toolcall-15 14/15 (93%) · instructfollow-15 14/15 (93%) · structoutput-15 12/15 (80%) · dataextract-15 12/15 (80%) · reasonmath-15 11/15 (73%) · bugfind-15 13/15 (87%) · hermesagent-20 10/20 (50%) · cli-40 17/40 (42%)
Quality: 102/150 (68%) — toolcall-15 14/15 (93%) · instructfollow-15 13/15 (87%) · structoutput-15 12/15 (80%) · dataextract-15 11/15 (73%) · reasonmath-15 11/15 (73%) · bugfind-15 13/15 (87%) · hermesagent-20 10/20 (50%) · cli-40 18/40 (45%)
Aider-polyglot: 15/30 — cpp 3/5 · go 2/5 · java 2/5 · javascript 5/5 · python 2/5 · rust 1/5
Aider-polyglot: 13/30 — cpp 1/5 · go 2/5 · java 2/5 · javascript 5/5 · python 2/5 · rust 1/5
Aider-polyglot: 16/30 — cpp 4/5 · go 2/5 · java 2/5 · javascript 5/5 · python 2/5 · rust 1/5
Aider-polyglot: 16/30 — cpp 3/5 · go 2/5 · java 2/5 · javascript 5/5 · python 2/5 · rust 2/5
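
The summary lines above are regular enough to parse, which makes it easy to check that the per-pack scores add up to the reported total. This is a rough sketch: `parse_quality` is a hypothetical helper, not part of quality-test.sh or benchlocal-cli, and it assumes the line layout shown in this post.

```python
import re

# "<pack> <passed>/<size> (<pct>%)"; separators between fields are ignored.
_PACK = re.compile(r"([\w-]+) (\d+)/(\d+) \(\d+%\)")
_TOTAL = re.compile(r"Quality: (\d+)/(\d+)")


def parse_quality(line: str) -> dict:
    """Split a 'Quality:' summary line into its total and per-pack scores."""
    total = _TOTAL.search(line)
    if total is None:
        raise ValueError("not a 'Quality:' summary line")
    packs = {name: (int(p), int(n)) for name, p, n in _PACK.findall(line)}
    return {"passed": int(total[1]), "size": int(total[2]), "packs": packs}


# First baseline run from above (the separator after the total is dropped here).
line = (
    "Quality: 102/150 (68%) toolcall-15 14/15 (93%) · instructfollow-15 13/15 (87%) · "
    "structoutput-15 12/15 (80%) · dataextract-15 12/15 (80%) · reasonmath-15 11/15 (73%) · "
    "bugfind-15 13/15 (87%) · hermesagent-20 9/20 (45%) · cli-40 18/40 (45%)"
)
result = parse_quality(line)
pack_sum = sum(passed for passed, _ in result["packs"].values())
print(result["passed"], pack_sum)  # both are 102 for this line
```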

These numbers seemed quite reasonable, and in line with the expectation that 35b-a3b is not as strong as the 27b model. Numbers were closer than I expected though.

I picked a few recommended quants that fit entirely in VRAM with a reasonably sized q8/q5 KV cache, and enabled MTP where available. For each, I first ran a bench, quality, and aider-polyglot, then followed up with a three-run quality-full/aider-polyglot rebench. Runtimes differed quite a lot between quants, so I noted them below. I have the full logs too if anyone wants to take a closer look at any of the data. It's pretty quick to reproduce at these speeds, though. :)

Unsloth UD-IQ4_XS
- TPS narrative 186.1 / code 234.6 (TTFT 119/114 ms).
- Quality: 101/150 (67%) — toolcall-15 14/15 (93%) · instructfollow-15 14/15 (93%) · structoutput-15 12/15 (80%) · dataextract-15 11/15 (73%) · reasonmath-15 12/15 (80%) · bugfind-15 13/15 (87%) · hermesagent-20 9/20 (45%) · cli-40 16/40 (40%)
- Aider-polyglot: 18/30 — cpp 3/5 · go 3/5 · java 3/5 · javascript 5/5 · python 2/5 · rust 2/5
- launch: CTX_SIZE=196608 docker compose -f fit-mtp.yml up
- quality-full takes ~18 minutes
- Quality: 100/150 (67%) — toolcall-15 14/15 (93%) · instructfollow-15 14/15 (93%) · structoutput-15 12/15 (80%) · dataextract-15 11/15 (73%) · reasonmath-15 12/15 (80%) · bugfind-15 13/15 (87%) · hermesagent-20 9/20 (45%) · cli-40 15/40 (38%)
- Quality: 100/150 (67%) — toolcall-15 14/15 (93%) · instructfollow-15 14/15 (93%) · structoutput-15 12/15 (80%) · dataextract-15 11/15 (73%) · reasonmath-15 12/15 (80%) · bugfind-15 13/15 (87%) · hermesagent-20 9/20 (45%) · cli-40 15/40 (38%)
- Quality: 102/150 (68%) — toolcall-15 14/15 (93%) · instructfollow-15 14/15 (93%) · structoutput-15 12/15 (80%) · dataextract-15 11/15 (73%) · reasonmath-15 12/15 (80%) · bugfind-15 13/15 (87%) · hermesagent-20 11/20 (55%) · cli-40 15/40 (38%)
- aider-polyglot takes ~12 minutes
- Aider-polyglot: 15/30 — cpp 1/5 · go 3/5 · java 3/5 · javascript 5/5 · python 2/5 · rust 1/5
- Aider-polyglot: 17/30 — cpp 3/5 · go 3/5 · java 3/5 · javascript 5/5 · python 2/5 · rust 1/5
- Aider-polyglot: 17/30 — cpp 3/5 · go 3/5 · java 3/5 · javascript 5/5 · python 2/5 · rust 1/5

Mudler APEX I-Compact
- https://huggingface.co/mudler/Qwen3.6-35B-A3B-APEX-MTP-GGUF
- TPS narrative 205.2 / code 254.5 (TTFT 126/123 ms).
- Quality: 107/150 (71%) — toolcall-15 14/15 (93%) · instructfollow-15 13/15 (87%) · structoutput-15 13/15 (87%) · dataextract-15 13/15 (87%) · reasonmath-15 14/15 (93%) · bugfind-15 13/15 (87%) · hermesagent-20 12/20 (60%) · cli-40 15/40 (38%)
- Aider-polyglot: 14/30 — cpp 3/5 · go 2/5 · java 3/5 · javascript 3/5 · python 2/5 · rust 1/5
- launch: CTX_SIZE=196608 GGUF_FILE=qwen3.6-35b-a3b-gguf/mudler-icompact/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf docker compose -f fit-mtp.yml up
- quality-full takes ~12 minutes
- Quality: 106/150 (71%) — toolcall-15 14/15 (93%) · instructfollow-15 13/15 (87%) · structoutput-15 13/15 (87%) · dataextract-15 13/15 (87%) · reasonmath-15 14/15 (93%) · bugfind-15 13/15 (87%) · hermesagent-20 11/20 (55%) · cli-40 15/40 (38%)
- Quality: 105/150 (70%) — toolcall-15 14/15 (93%) · instructfollow-15 13/15 (87%) · structoutput-15 13/15 (87%) · dataextract-15 13/15 (87%) · reasonmath-15 14/15 (93%) · bugfind-15 13/15 (87%) · hermesagent-20 11/20 (55%) · cli-40 14/40 (35%)
- Quality: 105/150 (70%) — toolcall-15 14/15 (93%) · instructfollow-15 13/15 (87%) · structoutput-15 13/15 (87%) · dataextract-15 13/15 (87%) · reasonmath-15 14/15 (93%) · bugfind-15 13/15 (87%) · hermesagent-20 11/20 (55%) · cli-40 14/40 (35%)
- aider-polyglot takes ~16 minutes
- Aider-polyglot: 13/30 — cpp 2/5 · go 2/5 · java 3/5 · javascript 3/5 · python 2/5 · rust 1/5
- Aider-polyglot: 14/30 — cpp 2/5 · go 3/5 · java 3/5 · javascript 3/5 · python 2/5 · rust 1/5
- Aider-polyglot: 14/30 — cpp 2/5 · go 3/5 · java 3/5 · javascript 3/5 · python 2/5 · rust 1/5

ByteShape IQ4_XS
- https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF
- TPS narrative 206.4 / code 256.0 (TTFT 112/115 ms).
- launch: CTX_SIZE=196608 GGUF_FILE=qwen3.6-35b-a3b-gguf/byteshape-iq4xs/Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf docker compose -f fit-mtp.yml up
- quality-full takes ~24 minutes
- Quality: 103/150 (69%) — toolcall-15 14/15 (93%) · instructfollow-15 13/15 (87%) · structoutput-15 12/15 (80%) · dataextract-15 11/15 (73%) · reasonmath-15 14/15 (93%) · bugfind-15 12/15 (80%) · hermesagent-20 12/20 (60%) · cli-40 15/40 (38%)
- Quality: 104/150 (69%) — toolcall-15 14/15 (93%) · instructfollow-15 13/15 (87%) · structoutput-15 12/15 (80%) · dataextract-15 11/15 (73%) · reasonmath-15 14/15 (93%) · bugfind-15 12/15 (80%) · hermesagent-20 12/20 (60%) · cli-40 16/40 (40%)
- Quality: 103/150 (69%) — toolcall-15 14/15 (93%) · instructfollow-15 13/15 (87%) · structoutput-15 12/15 (80%) · dataextract-15 11/15 (73%) · reasonmath-15 14/15 (93%) · bugfind-15 12/15 (80%) · hermesagent-20 11/20 (55%) · cli-40 16/40 (40%)
- aider-polyglot takes ~12 minutes
- Aider-polyglot: 16/30 — cpp 3/5 · go 4/5 · java 3/5 · javascript 3/5 · python 2/5 · rust 1/5
- Aider-polyglot: 15/30 — cpp 3/5 · go 4/5 · java 3/5 · javascript 3/5 · python 1/5 · rust 1/5
- last aider-polyglot run got stuck on javascript complex numbers
- quality runs took 2x time compared to other quants

Thireus 4.2574bpw
- https://gguf0.thireus.com/quant_assign.html
- TPS narrative 177.8 / code 178.7 (TTFT 150/146 ms).
- Quality: 103/150 (69%) — toolcall-15 14/15 (93%) · instructfollow-15 13/15 (87%) · structoutput-15 12/15 (80%) · dataextract-15 11/15 (73%) · reasonmath-15 14/15 (93%) · bugfind-15 13/15 (87%) · hermesagent-20 10/20 (50%) · cli-40 16/40 (40%)
- Aider-polyglot: 17/30 — cpp 3/5 · go 3/5 · java 4/5 · javascript 5/5 · python 2/5 · rust 0/5
- launch: CTX_SIZE=196608 GGUF_FILE=qwen3.6-35b-a3b-gguf/thireus-4.2574bpw/Qwen3.6-35B-A3B.WEB_USER-4.2574bpw-17GB-GGUF_17GB-GPU_0GB-CPU.c636b0e_2194a4d.gguf docker compose -f fit.yml up
- quality-full takes ~19 minutes
- Quality: 97/150 (65%) — toolcall-15 13/15 (87%) · instructfollow-15 14/15 (93%) · structoutput-15 12/15 (80%) · dataextract-15 11/15 (73%) · reasonmath-15 12/15 (80%) · bugfind-15 12/15 (80%) · hermesagent-20 9/20 (45%) · cli-40 14/40 (35%)
- Quality: 100/150 (67%) — toolcall-15 13/15 (87%) · instructfollow-15 14/15 (93%) · structoutput-15 12/15 (80%) · dataextract-15 11/15 (73%) · reasonmath-15 12/15 (80%) · bugfind-15 12/15 (80%) · hermesagent-20 11/20 (55%) · cli-40 15/40 (38%)
- Quality: 100/150 (67%) — toolcall-15 13/15 (87%) · instructfollow-15 14/15 (93%) · structoutput-15 12/15 (80%) · dataextract-15 11/15 (73%) · reasonmath-15 12/15 (80%) · bugfind-15 12/15 (80%) · hermesagent-20 11/20 (55%) · cli-40 15/40 (38%)
- aider-polyglot takes ~15 minutes
- Aider-polyglot: 15/30 — cpp 4/5 · go 2/5 · java 3/5 · javascript 4/5 · python 2/5 · rust 0/5
- Aider-polyglot: 15/30 — cpp 4/5 · go 2/5 · java 3/5 · javascript 4/5 · python 2/5 · rust 0/5
- Aider-polyglot: 17/30 — cpp 4/5 · go 2/5 · java 4/5 · javascript 4/5 · python 2/5 · rust 1/5

I'll give the Unsloth UD-IQ4_XS a try in pi.dev for real coding. I'm not sure if there's much here for a real informed decision as the numbers are noisy and sometimes seem to point in opposite directions.
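
To put the noise in numbers, here is a small sketch that summarizes the 8-pack totals reported above for each quant. It pools every reported run (initial and rebench), which is an assumption about how to group them. The sample sizes are tiny, so treat the spread as a rough indication only.

```python
from statistics import mean, stdev

# 8-pack totals (out of 150) copied from the runs listed above, all runs pooled.
quality = {
    "Unsloth UD-IQ4_XS": [101, 100, 100, 102],
    "Mudler APEX I-Compact": [107, 106, 105, 105],
    "ByteShape IQ4_XS": [103, 104, 103],
    "Thireus 4.2574bpw": [103, 97, 100, 100],
}


def summarize(runs: list[int]) -> dict:
    """Mean, sample standard deviation, and range of a handful of run totals."""
    return {
        "n": len(runs),
        "mean": round(mean(runs), 2),
        "stdev": round(stdev(runs), 2),
        "min": min(runs),
        "max": max(runs),
    }


for name, runs in quality.items():
    print(f"{name:24s} {summarize(runs)}")
```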

@noonghunna A couple of notes on the quality tests:

toolcall-15

TC-05 Date and Time Parsing
- Current date (benchmark_reference_date and benchmark_reference_day) is not included in the request so assistant has no date to go by

instructfollow-15

IF-10 Exact Word Count
- request has
```
  {
    "top_p": 1,
    "max_tokens": 16384,
    "temperature": 0,
    "chat_template_kwargs": {
      "enable_thinking": true
    }
  }
```
- Qwen Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
- With the recommended sampler parameters, test passes on bytesloth 35b-a3b
IF-12 Contradictory Constraints — Standardized Conflict Format
- Assistant response: "IMPOSSIBLE - Three sentences of exactly ten words each would total thirty words, which directly contradicts the requirement for a total of exactly twenty-five words."
- Verifier looks for "25" and "30" in addition to "IMPOSSIBLE -"
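
For IF-12, one possible relaxation is to accept a number written either as digits or as words. The verifier's code isn't shown in this thread, so this is a hypothetical sketch: `spell` and `mentions` are made-up helpers, limited to 0..99 and English.

```python
import re

_ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def spell(n: int) -> str:
    """English words for 0..99, e.g. 25 -> 'twenty-five'."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 is supported")
    if n < 20:
        return _ONES[n]
    tens, ones = divmod(n, 10)
    return _TENS[tens] if ones == 0 else f"{_TENS[tens]}-{_ONES[ones]}"


def mentions(text: str, n: int) -> bool:
    """True if text contains n as digits or as a standalone spelled-out number."""
    word = re.escape(spell(n))
    # The hyphen guards stop 'thirty' from matching inside 'thirty-one'.
    pattern = rf"\b{n}\b|(?<![\w-]){word}(?![\w-])"
    return re.search(pattern, text.lower()) is not None


response = (
    "IMPOSSIBLE - Three sentences of exactly ten words each would total thirty "
    "words, which directly contradicts the requirement for a total of exactly "
    "twenty-five words."
)
print("25" in response, mentions(response, 25), mentions(response, 30))
```

The digit-only check rejects the response quoted above, while the relaxed check accepts it.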

structoutput-15

SO-07 Deeply nested JSON with arrays of objects and required null values.
- Expects an object with "user"
- Validation is questionable, doesn't check null values etc, just that the response
  has $.user.id with value 42
- deepseek-v4-flash also fails this
SO-08 CSV with Special Characters
- Verifier expects wrong columns?
SO-10 JSON to Markdown Table
- Verifier says "missing markdown header | name | score | grade |" but response
  started with | name | score | grade |\n
- Verifier might be too strict, possibly expecting exactly three hyphens on 2nd line for each column?
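
If the SO-10 check really does require exactly three hyphens, a more tolerant header test could follow the GFM table rules, where each delimiter cell is one or more hyphens with optional alignment colons. This is a minimal sketch of such a check, not the verifier's actual code. `has_table_header` is a hypothetical name, and escaped pipes inside cells are not handled.

```python
import re

# A GFM delimiter cell: hyphens with an optional leading/trailing colon.
_DELIM_CELL = re.compile(r":?-+:?")


def _cells(row: str) -> list[str]:
    """Split a pipe-table row into trimmed cells, ignoring optional outer pipes."""
    return [cell.strip() for cell in row.strip().strip("|").split("|")]


def has_table_header(markdown: str, columns: list[str]) -> bool:
    """True if the text starts with the expected header row and a valid delimiter row."""
    lines = markdown.strip().splitlines()
    if len(lines) < 2:
        return False
    header, delim = _cells(lines[0]), _cells(lines[1])
    return (
        header == columns
        and len(delim) == len(columns)
        and all(_DELIM_CELL.fullmatch(cell) for cell in delim)
    )


cols = ["name", "score", "grade"]
print(has_table_header("| name | score | grade |\n|---|---|---|\n| a | 1 | A |", cols))
print(has_table_header("| name | score | grade |\n| :----- | ----: | :-: |\n| a | 1 | A |", cols))
```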

noonghunna · 2026-05-28T01:01:01Z

noonghunna
May 28, 2026
Maintainer

@laurimyllari this is a fantastic writeup — five quants, full 8-pack + aider-polyglot with n=3 rebenches, runtime notes, and careful per-scenario verifier findings. Exactly the kind of rigor that moves the catalog. Reactions + tracking IDs:

(Edited: engine is ik-llama, per your clarification — corrected below.)

The speed is the story. 35B-A3B at 206 narr / 256 code TPS on ik-llama --fit is a config we don't ship today — we have ik APEX presets for this model, but none tuned with your --fit full-offload + asymmetric KV. And the quality (67–71%, mudler APEX I-Compact topping at 107/150) lands right where you'd expect: a touch under the 27B (we measure that at 109/150 thinking-off) but closer than the "A3B is weaker" prior suggests. Filed club-3090#242 to capture your --fit config on top of our existing ik composes — your config as a PR would be ideal if you're up for it.

Two things jumped out:

Your KV is q8_0(K)/q5_0(V) — exactly the asymmetric K-high/V-low pattern we just documented from Anbeeld's KV-quant benchmarks (new FAQ entry). Nice independent convergence.
You ran MTP at 200+ TPS on the MoE. That's interesting, because on vLLM we found built-in MTP shares the MoE forward and cuts TPS by ~51% on 35B-A3B (so we default it off there). ik-llama, which leads upstream on MTP, evidently dodges that; flagged to validate in club-3090#242.
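
As a back-of-the-envelope check on what the asymmetric cache buys, the sketch below compares storage cost against an f16 cache. It assumes the upstream ggml block layouts (32 values per block; 34 bytes for q8_0, 22 bytes for q5_0) also apply in ik-llama, and that K and V hold the same number of values. Model dimensions are not needed for the ratio.

```python
# Bytes per 32-value block in upstream ggml (assumed unchanged in the fork).
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q5_0": 22}
BLOCK_VALUES = 32


def bits_per_value(kind: str) -> float:
    """Average storage cost of one cached value, block scale included."""
    return BLOCK_BYTES[kind] * 8 / BLOCK_VALUES


def kv_ratio_vs_f16(k_type: str, v_type: str) -> float:
    """Cache size relative to f16/f16, assuming K and V hold equally many values."""
    used = bits_per_value(k_type) + bits_per_value(v_type)
    return used / (2 * bits_per_value("f16"))


print(bits_per_value("q8_0"), bits_per_value("q5_0"))  # 8.5 5.5
print(kv_ratio_vs_f16("q8_0", "q5_0"))                 # 0.4375
print(kv_ratio_vs_f16("q8_0", "q8_0"))                 # 0.53125
```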

Your verifier notes are gold — each now has a tracking ID:

TC-05 (no date injected) → benchlocal-cli#35
IF-10 (thinking pack forced to temp=0; passes at the recommended thinking sampler) → benchlocal-cli#36 — the most impactful catch; it's cross-cutting (every thinking-enabled pack greedy-samples off-distribution).
IF-12 / SO-07 / SO-08 / SO-10 (digit-vs-spelled, under-validated JSON, wrong columns, over-strict markdown header) → benchlocal-cli#37

On the noise you flagged — matches our experience exactly: 8-pack swings ±5-7 and aider ±2-3 at n=1, so your n≥3 is the right call. Mudler APEX I-Compact (consistently 105–107 + cleanest reasonmath/dataextract) reads as the real signal.

Thanks again — this is the kind of contribution the project runs on. Happy to add your numbers as a cross-rig BENCHMARKS row once #242 lands.


noonghunna · 2026-05-28T01:58:01Z

noonghunna
May 28, 2026
Maintainer

Following up with the consolidated picture, @laurimyllari — your matrix gave us the 35B-A3B quality data we were missing, so here are all three models we serve locally side-by-side. Heavy caveat: mixed engines + rigs, the TPS bases aren't identical — read this as the rough landscape, not a leaderboard.

| Model | Config | Narr / Code TPS | Max ctx | 8-pack /150 | aider /30 |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B | ik `--fit`, Mudler APEX I-Compact (your run) | ~205 / 254 | ~196K | 105–107 | 13–14 |
| Qwen3.6-27B | ik-llama IQ4_KS + MTP, 1× 3090 | 60 / 69 | 200K | 101 | 19 |
| Gemma-4-31B | beellama.cpp + DFlash, 1× 3090 | 47 / 88 | ~150K (SWA) | 109 / 114 (thinking-on) | TBD |

What stands out:

35B-A3B is the runaway speed king — ~3–4× the dense models' decode (MoE, 3B active). Your Mudler APEX I-Compact is the standout: top quality and near-top speed. Its 8-pack (105–107) lands a hair under Gemma and basically tied with the cluster — closer than the "A3B is weaker" prior suggests, exactly as you found.
Gemma-4-31B narrowly tops the 8-pack (109 thinking-off / 114 thinking-on), and unusually for our local models reasoning-on actually helps it (+5). On Qwen3.6-27B thinking is net-negative for code (we measured HE+ −4 / LCB −18 vs thinking-off), so it ships thinking-off.
27B keeps the aider crown (19 vs 13–18) and the safest long-context single-card config.

But here's the thing you already put your finger on — "I'm not sure if there's much here for a real informed decision as the numbers are noisy and sometimes seem to point in opposite directions." That's exactly our problem. All three sit inside a ~98–114 8-pack band that's barely wider than the ±5–7 run-to-run noise. Even your careful n=3 matrix couldn't cleanly rank them — which is, honestly, one of the most useful findings in the thread.

So I'd genuinely value your eyes on how to get more discrimination per GPU-hour, because we keep hitting the same wall:

A sharper-but-slower suite you may not have run: alongside the --full 8-pack you used, quality-test.sh --reasoning runs HumanEval+-30 / LiveCodeBench-v6-30 / GSM-symbolic-30 / GPQA-diamond (execution-backed, thinking-on). It separates models better than the 8-pack, but it's slow enough that we don't run it by default — and we've only baselined the 27B. Given how fast your setup is, it might be cheap for you to run.
The real gap: we have zero long-horizon agentic coverage — the thing local coding agents actually do all day. Everything (8-pack, reasoning suite, even aider) is short-to-medium-horizon, isolated-ish edits. We're eyeing DeepSWE (113 real-repo, test-verified tasks, run via Pier + mini-swe-agent against an OpenAI-compatible endpoint) to fill it — but two worries before burning the GPU-hours: (1) long-horizon × 113 tasks on a single card is expensive, and (2) it targets frontier agents, so 30B-class local models might all bunch up near zero and collapse the discrimination at the bottom instead of the top.

The question we keep circling, and I'd love to bounce off you: lean on the reasoning suite despite the runtime? A long-horizon bench like DeepSWE (have you tried it, or SWE-bench-style, locally)? Real-task replay — which is exactly the pi.dev coding you said you'd point UD-IQ4_XS at, and honestly the signal I'd trust most? Or something cheaper we're missing entirely?

And more broadly — any feedback on club-3090 as a whole: rough edges, missing configs, docs that didn't land? Your writeups are some of the highest-signal contributions the project gets. Thanks again. 🙏


laurimyllari · 2026-05-28T16:05:33Z

laurimyllari
May 28, 2026
Author

Thank you for the kind words @noonghunna.

The way I've been thinking about this is to start with a two-pronged approach: 1) try to get as much signal as possible out of the current tests, and 2) find real use cases that break and use those to measure and improve (record/replay).

This is why I've been digging into the quality results, to try to understand what they're actually measuring and to find the real failures. Eliminating the false negatives should give us some more useful tests, and there might be some false positives too (although I haven't thought about how to track those down yet - maybe a run with a tiny or broken model could be a starting point).

I think the results reporting could be leveled up. I noticed the large variation in runtime above but we don't really have any visibility into what's behind it. It could be something useful as one of the failure modes often mentioned for this particular model is that it tends to get into a loop or overthink. The byteshape quant failure especially with one of the aider runs getting stuck interests me as it could be the model doing something strange or it could be the test harness failing - either of those would be great to understand. There is probably already signal that we're capturing but not seeing. A good first step might be to surface more runtime information (time and tokens) and split the failures into verification failures and token limits.
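
As a rough illustration of surfacing runtime information, the sketch below flags scenarios whose wall time is far above the run's median. The record fields (`scenario`, `seconds`, `tokens`) and the sample values are made up for the example. They are not the result schema of the test harness.

```python
from statistics import median


def slow_scenarios(records: list[dict], factor: float = 3.0) -> list[tuple]:
    """(scenario, seconds, tokens) for runs slower than `factor` times the median."""
    cutoff = factor * median(r["seconds"] for r in records)
    return [
        (r["scenario"], r["seconds"], r["tokens"])
        for r in records
        if r["seconds"] > cutoff
    ]


# Made-up records for illustration only, not measurements from this thread.
records = [
    {"scenario": "A-01", "seconds": 12.0, "tokens": 900},
    {"scenario": "A-02", "seconds": 15.0, "tokens": 1100},
    {"scenario": "A-03", "seconds": 14.0, "tokens": 1000},
    {"scenario": "A-04", "seconds": 160.0, "tokens": 16384},
]
print(slow_scenarios(records))  # [('A-04', 160.0, 16384)]
```

An outlier that also sits at the token cap would point at looping or overthinking. One well under the cap would point at the harness or the server instead.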

I'll try the larger quality-test.sh --reasoning set; I'm pretty interested to see how it looks. Do we have a collection of quality baselines? It would be nice to be able to just run a new model/configuration and compare.

I agree with your concerns about the larger test suites, and they would probably need an initial effort (in addition to implementing the test setup) to ensure that the results are both valid and useful. My intuition is to first focus on the current test suite and see how to get the most signal out of it - a lot of that probably applies to a heavier suite too.

For the second approach (real-task replay) I've been tinkering with a little proxy tool. I don't know if you already have something like this or if there are established tools like it. For me it has been a way to learn how to use agents and see what they can do. The way I diagnosed the IF-10 failure above was to run the whole quality test through the proxy, look at the logs for failed tests, take part of the prompt and search for it to find the right capture, then replay that request with different sampler parameters. I'm hoping to also use this to record failures from pi.dev sessions and reproduce them. This is probably similar to what @mgabor3141 used for the excellent tool-calling reliability capture/repro. It's a simple Python + SQLite backend with a web frontend; happy to share if anyone would find value in it. The only reason I don't have it in a public repo is that there are probably zero hand-written lines of code in there.

On club-3090: I really appreciate what you're doing. Having a well-documented and simple "download this, run this compose" makes it really easy to provision local model serving and get started quickly. I hope more people find this. There are two potential directions that interest me: either a great vLLM setup supporting parallel requests, or llama-swap to support multiple models and configurations. I experimented with the latter, with multiple presets for overriding request parameters and automatically switching between models and configurations (see https://github.com/laurimyllari/club-3090/blob/lauri/models/qwen3.6-27b/ik_llama-cpp/compose/single/ik_llama-swap.config.yaml for an example). It is more of a hacker setup than a production service, but it could be useful to devs, and powerful if integrated with the rest of the club-3090 approach.

2 replies

noonghunna May 29, 2026
Maintainer

Two-prong framing is exactly right — and quick but important: most of your reporting wishlist is already in the output you had in front of you.

Every quality-test.sh run ends with a Failure breakdown: block — one line per failed scenario as pack scenario: failure_mode (detail), with the full detail string, e.g.:

- dataextract-15 DE-05: verifier_fail (7/14 atomic fields correct (50%). product_name: mismatch | product_price_paid: expected number)

So the verification-vs-error split you wanted is already there (verifier_fail vs timeout / agent_runner_timeout / server_error …), one glance below the scoreboard — I think you just pasted the one-line Quality: summary into the thread. For deeper digs, every run also --save-jsons to results/quality/quality-<ts>.json with per-scenario tokens + latency + the full verifier trace, and benchlocal-cli inspect <json> --failed / --scenario X --full / --mode timeout / --diff prev.json reads it back (filter, full trace, regression diff). Fair cop that none of this was documented anywhere a user would look — just fixed that (docs in club-3090 QUALITY_TEST.md + benchlocal-cli README, plus a footer on quality-test.sh pointing at it).

The one real gap you identified stands: there's no distinct token-limit / overthink failure mode — a loop that truncates currently scores as a plain verifier_fail, indistinguishable from a wrong answer. Per-scenario latency already flags the outliers (one of my dataextract cases just clocked 533 s), but a first-class "ran to length" class would make looped vs. hung vs. genuinely wrong one-glance. Filed → noonghunna/benchlocal-cli#61.
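
One way such a class could be derived is sketched below. It assumes each scenario record keeps the `finish_reason` from the OpenAI-compatible response, where `"length"` means the token cap was hit. The record fields and the `ran_to_length` label are hypothetical, not the benchlocal-cli schema.

```python
def classify(record: dict) -> str:
    """Label one scenario result, separating truncated generations from wrong answers."""
    if record.get("passed"):
        return "pass"
    if record.get("finish_reason") == "length":
        # The model hit max_tokens, so the verifier never saw a finished answer.
        return "ran_to_length"
    return record.get("failure_mode", "verifier_fail")


# Made-up records for illustration.
print(classify({"passed": True, "finish_reason": "stop"}))                                   # pass
print(classify({"passed": False, "finish_reason": "length", "failure_mode": "verifier_fail"}))  # ran_to_length
print(classify({"passed": False, "finish_reason": "stop", "failure_mode": "verifier_fail"}))    # verifier_fail
```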

False positives via a negative control — love it; running a deliberately weak/broken model and treating any PASS as a too-lenient verifier is clean grader-QA. Filed → noonghunna/benchlocal-cli#62.
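
The negative-control idea can be sketched as follows: feed junk outputs to each verifier and list the ones that accept any of them. The two verifiers are stand-ins written for this example. The second mirrors the SO-07 description earlier in the thread (checking only `$.user.id == 42`), and neither is real benchlocal-cli code.

```python
import json
from typing import Callable


def user_id_is_42(response: str) -> bool:
    """Stand-in verifier that only checks $.user.id == 42."""
    try:
        return json.loads(response)["user"]["id"] == 42
    except (ValueError, KeyError, TypeError):
        return False


def suspicious_passes(
    verifiers: dict[str, Callable[[str], bool]], broken_outputs: list[str]
) -> list[str]:
    """Names of verifiers that accept at least one output from a broken model."""
    return sorted(
        name
        for name, check in verifiers.items()
        if any(check(output) for output in broken_outputs)
    )


verifiers = {
    "nonempty": lambda s: bool(s.strip()),  # deliberately too lenient
    "user_id_is_42": user_id_is_42,
}
broken_outputs = ["", "lorem ipsum", "{}", "[]"]
print(suspicious_passes(verifiers, broken_outputs))  # ['nonempty']
```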

Baselines — the compare mechanism already exists (inspect --diff prev.json, and run --previous-result … --exit-on-regression for CI). What we lack is a curated set of trusted baselines to diff against; that's coming via a measurement-record corpus we're building — filed #252 to publish one.

--reasoning suite — please do; happy to drop our 27B thinking-on reference numbers (HumanEval+-30 / LiveCodeBench-v6-30 / GSM-symbolic-30 / GPQA-diamond) so you have a line to compare against.

Your replay proxy — I'd genuinely like to see it (sounds close to @mgabor3141's capture/repro). Honest caveat so I don't over-promise: for suite failures, inspect --scenario X --full already reconstructs the prompt/response/verifier-trace your workflow rebuilds, so the proxy's real edge is capturing real pi.dev sessions and generating agentic scenarios from them — high value, but I'd park integrating it until we hit a failure inspect can't explain or we commit to building the long-horizon agentic suite. Send it over regardless — it's the natural seed for when that day comes.

llama-swap — neat experiment, clean config. We do have switch.sh for operator-driven model switching today; honestly, at our model sizes (27–35B, TP=2 or full-card) request-driven auto-swap means a cold multi-GB reload on the triggering request, so the win over a deliberate switch is thin — and wrapping our vLLM/beellama composes (with their vendored overlays) under a llama.cpp-oriented proxy fights the compose/registry layer. Where it'd really pay off is concurrent multi-model from one endpoint (e.g. a small model for FIM + a big model for chat). Is that the workflow you're after? If it's more "easily A/B different configs," I'd rather sand down switch.sh. Curious what it actually bought you in practice.

Thanks again — genuinely some of the highest-signal input the project gets. 🙏

laurimyllari May 29, 2026
Author

@noonghunna Thanks for the followups and especially tips on benchlocal-cli inspect, that was the tool that I needed. :) Reading through the result JSONs manually was not great.

re llama-swap, you're right about the concurrent multi-model from one endpoint being one great use case, and I'd love that if I had the VRAM for it (a larger model for coding, a small model for quickly processing observations/memory etc). It can also be helpful for providing multiple endpoints to the same model while enforcing different parameters (for example to point a coding agent and a chat interface at).

The model switch is fairly expensive, but my original use case was to use 35b-a3b as a worker/implementer agent spawned by the stronger 27b acting as orchestrator, planner and reviewer. With no concurrency it's workable, although fragile (a single request for the wrong model costs maybe 10-20s).

Take those more as musings than feature requests. The one really high-value setup would be a production-ready single-card vLLM. llama-swap hackery is easy to do ad hoc.
