Qwen3.6 35b-a3b experiments #241
Replies: 3 comments 2 replies
-
|
@laurimyllari this is a fantastic writeup — five quants, full 8-pack + aider-polyglot with n=3 rebenches, runtime notes, and careful per-scenario verifier findings. Exactly the kind of rigor that moves the catalog. Reactions + tracking IDs:
The speed is the story. 35B-A3B at 206 narr / 256 code TPS on ik-llama Two things jumped out:
Your verifier notes are gold — each now has a tracking ID:
On the noise you flagged — matches our experience exactly: 8-pack swings ±5-7 and aider ±2-3 at n=1, so your n≥3 is the right call. Mudler APEX I-Compact (consistently 105–107 + cleanest reasonmath/dataextract) reads as the real signal. Thanks again — this is the kind of contribution the project runs on. Happy to add your numbers as a cross-rig BENCHMARKS row once #242 lands. |
Beta Was this translation helpful? Give feedback.
-
|
Following up with the consolidated picture, @laurimyllari — your matrix gave us the 35B-A3B quality data we were missing, so here are all three models we serve locally side-by-side. Heavy caveat: mixed engines + rigs, the TPS bases aren't identical — read this as the rough landscape, not a leaderboard.
What stands out:
But here's the thing you already put your finger on — "I'm not sure if there's much here for a real informed decision as the numbers are noisy and sometimes seem to point in opposite directions." That's exactly our problem. All three sit inside a ~98–114 8-pack band that's barely wider than the ±5–7 run-to-run noise. Even your careful n=3 matrix couldn't cleanly rank them — which is, honestly, one of the most useful findings in the thread. So I'd genuinely value your eyes on how to get more discrimination per GPU-hour, because we keep hitting the same wall:
The question we keep circling, and I'd love to bounce off you: lean on the reasoning suite despite the runtime? A long-horizon bench like DeepSWE (have you tried it, or SWE-bench-style, locally)? Real-task replay — which is exactly the pi.dev coding you said you'd point UD-IQ4_XS at, and honestly the signal I'd trust most? Or something cheaper we're missing entirely? And more broadly — any feedback on club-3090 as a whole: rough edges, missing configs, docs that didn't land? Your writeups are some of the highest-signal contributions the project gets. Thanks again. 🙏 |
Beta Was this translation helpful? Give feedback.
-
|
Thank you for the kind words @noonghunna. The way I've been thinking about this is to start with a two prong approach. 1) Try to get as much signal as possible out of the current tests and 2) find breaking real use-cases and use those to measure/improve (record/replay). This is why I've been digging into the quality results, to try to understand what they're actually measuring and to find the real failures. Eliminating the false negatives should give us some more useful tests, and there might be some false positives too (although I haven't thought about how to track those down yet - maybe a run with a tiny or broken model could be a starting point). I think the results reporting could be leveled up. I noticed the large variation in runtime above but we don't really have any visibility into what's behind it. It could be something useful as one of the failure modes often mentioned for this particular model is that it tends to get into a loop or overthink. The byteshape quant failure especially with one of the aider runs getting stuck interests me as it could be the model doing something strange or it could be the test harness failing - either of those would be great to understand. There is probably already signal that we're capturing but not seeing. A good first step might be to surface more runtime information (time and tokens) and split the failures into verification failures and token limits. I'll try the larger I agree with your concerns about the larger test suites, and they would probably need an initial effort (in addition to implementing the test setup) to ensure that the results are both valid and useful. My intuition is to first focus on the current test suite and see how to get the most signal out of it - a lot of that probably applies to a heavier suite too. For the second approach (real-task replay) I've been tinkering with a little proxy tool. I don't know if you already have something like this or if there are established tools like it. For me it has been a way to learn how to use agents and see what they can do. The way I diagnosed the IF-10 failure above was to run the whole quality test through the proxy, look at the logs for failed tests, take part of the prompt and search for it to find the right capture, then replay that request with different sampler parameters. I'm hoping to use this also to record failures from pi.dev sessions and be able to reproduce them. This is probably similar to what @mgabor3141 used for the excellent tool-calling reliability capture/repro. It's a simple python+sqlite backend and a web frontend - happy to share if anyone would find value in it. The only reason I don't have it in a public repo is that there's probably zero hand written lines of code in there.. On club-3090 - I really appreciate what you're doing. Having a well documented and simple "download this, run this compose" makes it really easy to provision local model serving and get started quickly. I hope more people find this. There are two potential directions that are interesting to me - either a great vLLM set up supporting parallel requests, or a llama-swap to support multiple models and configurations. I experimented with the latter, having multiple presets for overriding request parameters and automatically switching between models and configurations (see https://github.com/laurimyllari/club-3090/blob/lauri/models/qwen3.6-27b/ik_llama-cpp/compose/single/ik_llama-swap.config.yaml for an example). It is more of a hacker setup than a production service, but could be useful to devs and powerful if integrated with the rest of the club-3090 approach. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
More speed? How does TPS narrative 206.4 / code 256.0 (TTFT 112/115 ms) sound?
Inspired by the recent quality measurements, I started experimenting with the 35B-A3B quants to see how they'd compare numbers wise. I've tried using one before with pi.dev with less than stellar results - but the speeds are impressive.
I started by doing a partial offload of Unsloth UD-Q8_K_XL (https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF). I created a simple compose file with the following:
I tried a few different options (
-ngl 999 --n-cpu-moe 24, mtp (ran out of host memory..),--merge-up-gate-experts,--run-time-repack) but basically the fastest partial offload was with just--fit --no-mmap. I didn't try-otas I didn't know how to try to improve on--fit.Speeds were not great, but considering that our 27b llama.cpp option pretty recently was ~20tps, it wasn't bad at all.
I ran a benchmark and a few quality tests with a 120k context as a baseline for Qwen 3.6 35B-A3B quality.
These numbers seemed quite reasonable, and in line with the expectation that 35b-a3b is not as strong as the 27b model. Numbers were closer than I expected though.
I picked a few recommended quants that fit entirely in VRAM with a reasonably sized q8/q5 kv cache. I also enabled MTP where available. I first ran a bench, quality and aider-polyglot for each, then followed up with a three run quality-full/aider-polyglot rebench. I noticed pretty big differences in the runtimes and noted them below - I have the full logs too if anyone wants to take a closer look at any of the data. It's pretty quick to reproduce though at these speeds. :)
Unsloth UD-IQ4_XS
CTX_SIZE=196608 docker compose -f fit-mtp.yml upMudler APEX I-Compact
CTX_SIZE=196608 GGUF_FILE=qwen3.6-35b-a3b-gguf/mudler-icompact/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf docker compose -f fit-mtp.yml upByteShape IQ4_XS
CTX_SIZE=196608 GGUF_FILE=qwen3.6-35b-a3b-gguf/byteshape-iq4xs/Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf docker compose -f fit-mtp.yml upThireus 4.2574bpw
CTX_SIZE=196608 GGUF_FILE=qwen3.6-35b-a3b-gguf/thireus-4.2574bpw/Qwen3.6-35B-A3B.WEB_USER-4.2574bpw-17GB-GGUF_17GB-GPU_0GB-CPU.c636b0e_2194a4d.gguf docker compose -f fit.yml upI'll give the Unsloth UD-IQ4_XS a try in pi.dev for real coding. I'm not sure if there's much here for a real informed decision as the numbers are noisy and sometimes seem to point in opposite directions.
@noonghunna A couple of notes on the quality tests:
toolcall-15
instructfollow-15
IF-10 Exact Word Count
request has
Qwen Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
With the recommended sampler parameters, test passes on bytesloth 35b-a3b
IF-12 Contradictory Constraints \u2014 Standardized Conflict Format
structoutput-15
has $.user.id with value 42
started with
| name | score | grade |\nBeta Was this translation helpful? Give feedback.
All reactions