# Head-to-Head, Gemma 4 31B vs Qwen3.6 27B — 6 configs on dual 3090 (2026-05-11) #119
Update — v0.5.0 ships, Qwen
| Metric (Qwen INT8 PTH) | 2026-05-10 row above | v0.5.0 re-bench (2026-05-12) | Δ |
|---|---|---|---|
| toolcall-15 | 10/15 (67%) | 10/15 (67%) | 0 |
| instructfollow-15 | 13/15 | 13/15 | 0 |
| structoutput-15 | 13/15 | 13/15 | 0 |
| dataextract-15 | 15/15 | 15/15 | 0 |
| reasonmath-15 | 6/15 | 6/15 | 0 |
| bugfind-15 | 11/15 | 11/15 | 0 |
| hermesagent-20 | 9/20 (45%) | 12/20 (60%) | +3 (+15pp) |
| cli-40 | 17/40 (42%) | 17/40 (42%) | 0 |
| TOTAL | 94/150 (62.7%) | 97/150 (64.7%) | +3 (+2pp) |
The move is concentrated in hermesagent-20 (multi-turn agentic loops), exactly the pack where the froggeric template's three load-bearing fixes compound: empty `<think></think>` past-turn pollution, no-user-query graceful fallback, and unclosed-think-before-tool-call. The other 7 packs are flat, so there is no regression on single-turn flows.
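As a quick sanity check on the table above, the totals and the "only one pack moved" claim can be recomputed from the per-pack rows. This is a minimal sketch: the `PACKS` dictionary and the helper names are illustrative, not part of the repo's tooling, and the numbers are copied by hand from the table.

```python
# Per-pack scores copied from the table: (passed_before, passed_after, scenarios).
# The dictionary layout and function names are hypothetical, for illustration only.
PACKS = {
    "toolcall-15": (10, 10, 15),
    "instructfollow-15": (13, 13, 15),
    "structoutput-15": (13, 13, 15),
    "dataextract-15": (15, 15, 15),
    "reasonmath-15": (6, 6, 15),
    "bugfind-15": (11, 11, 15),
    "hermesagent-20": (9, 12, 20),
    "cli-40": (17, 17, 40),
}


def totals(packs):
    """Return (passed_before, passed_after, total_scenarios) summed over all packs."""
    before = sum(b for b, _, _ in packs.values())
    after = sum(a for _, a, _ in packs.values())
    size = sum(n for _, _, n in packs.values())
    return before, after, size


def moved_packs(packs):
    """Return {pack_name: delta} for every pack whose score changed."""
    return {name: a - b for name, (b, a, _) in packs.items() if a != b}


before, after, size = totals(PACKS)
print(f"{before}/{size} ({100 * before / size:.1f}%) -> "
      f"{after}/{size} ({100 * after / size:.1f}%)")
print("packs that moved:", moved_packs(PACKS))
```

The totals come out to 94/150 and 97/150, with hermesagent-20 as the only pack that changed.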
Cross-model implications
The headline cross-model gap from the original report — "Qwen wins multi-turn agent (aider) by 5 points, Gemma wins single-turn quality (quality-full) by 3-7 points" — needs an update on the agentic dimension specifically:
| hermesagent-20 | Original | v0.5.0 |
|---|---|---|
| Qwen INT8 PTH | 9/20 (45%) | 12/20 (60%) |
| Gemma INT8 PTH | 11/20 (55%) | 11/20 (55%) (unchanged — Gemma composes untouched) |
| Gemma bf16 | 14/20 (70%) | 14/20 (70%) (unchanged) |
Qwen catches up on hermesagent-20 (60% vs the 55-70% Gemma range), but Gemma bf16 still leads. The Qwen-vs-Gemma gap on multi-turn agentic tasks narrowed from -25pp to -10pp at the high end (vs Gemma bf16) and flipped from -10pp to +5pp in Qwen's favor in the matched-config comparison (vs Gemma INT8 PTH). The aider-polyglot Qwen lead from the original row still stands (19/30 vs 14/30) and likely widens with the template fix.
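The percentage-point gaps can be derived directly from the hermesagent-20 rows in the table above. A minimal sketch, with hypothetical constant and function names; the scores are the ones reported in the table (all out of 20):

```python
# hermesagent-20 scores from the table (passed out of 20 scenarios).
# Names are illustrative; nothing here is part of the benchmark scripts.
TOTAL = 20
QWEN_ORIGINAL, QWEN_V050 = 9, 12
GEMMA_INT8_PTH, GEMMA_BF16 = 11, 14


def pct(passed, total=TOTAL):
    """Pass rate in percent."""
    return 100.0 * passed / total


def gap_pp(qwen_passed, gemma_passed, total=TOTAL):
    """Qwen minus Gemma, in percentage points (negative means Gemma leads)."""
    return pct(qwen_passed, total) - pct(gemma_passed, total)


for label, gemma in (("Gemma bf16", GEMMA_BF16), ("Gemma INT8 PTH", GEMMA_INT8_PTH)):
    before = gap_pp(QWEN_ORIGINAL, gemma)
    after = gap_pp(QWEN_V050, gemma)
    print(f"vs {label}: {before:+.0f}pp -> {after:+.0f}pp")
```

Gemma's rows did not change between the two runs, so the entire movement in both gaps comes from Qwen's +3 scenarios.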
What's NOT re-benched yet
Only Qwen INT8 PTH was re-run end-to-end against v0.5.0. The other two Qwen rows in the original matrix are now stale for the same reason (both got the froggeric default in v0.5.0) but haven't been re-benched:
- Qwen bf16 (`dual/bf16.yml`): expect a similar `hermesagent-20` lift (~+10 to +15pp)
- Qwen TQ3+MTP (Genesis) (`dual/tq3-mtp-genesis.yml`): also expect a lift, especially relevant since the 2026-05-11 row showed hermesagent-20 at only 1/20 (5%); the template fix should restore a meaningful score there. Could be the most impactful re-bench of the three.
Carnice and Qwopus composes are intentionally not affected by the v0.5.0 change (they ship bespoke Hermes-JSON templates and were excluded from the default-on rollout — see commit 84498d4).
Reproducer
git fetch
git checkout v0.5.0
docker compose -f models/qwen3.6-27b/vllm/compose/dual/int8.yml down
MODEL_DIR=/mnt/models/huggingface docker compose -f models/qwen3.6-27b/vllm/compose/dual/int8.yml up -d
bash scripts/quality-test.sh --full

The `--chat-template` flag is now baked into the compose, so there is nothing extra to set; the template at `models/qwen3.6-27b/vllm/patches/froggeric-chat-template/chat_template.jinja` is mounted at `/etc/qwen-froggeric-chat-template.jinja` at boot.
If anyone re-benches Qwen bf16 or Qwen TQ3+MTP (Genesis) against v0.5.0, drop the numbers below — would close the matrix.
All legs run via `scripts/rebench-full.sh` (bench + verify-stress + quality-full + soak + aider-polyglot-30). Each leg's results are in `results/rebench/<tag>/REPORT.md`.

Pin note: legs 1-5 share nightly pin `1acd67a7`. The Genesis-backed leg (6) runs on the older Genesis-allowlisted pin `01d4d1ad3`: Genesis v7.72.2's `KNOWN_GOOD_VLLM_PINS` does not include `1acd67a7`, and on `1acd67a7` Genesis triggers the `maybe_override_with_speculators` + transformers 5.8.0 `cached_file` regression. Re-evaluate when Sander tags v7.73.x with a refreshed allowlist.

Glossary
`turboquant_3bit_nc`). ~5× the KV pool of BF16, ~2× INT8 PTH. ~4pp quality penalty vs BF16 per the TQ paper. Hybrid-attention models (Qwen3-Next) need a special multi-query verify kernel for spec-decode.

KV pool ÷ max-model-len. E.g. 1.22M tokens ÷ 262K = 4.66×: the rig can serve 4 streams at full context, or 1 stream + 12 parallel agent threads at 100K each.

"crimson otter 19") is planted at a known token depth and the model is asked to recall it. Tests long-context attention precision.

Config matrix
- `qwen3.6-27b/vllm/compose/dual/int8.yml`
- `gemma-4-31b/vllm/compose/dual/int8.yml`
- `gemma-4-31b/vllm/compose/dual/bf16.yml`
- `qwen3.6-27b/vllm/compose/dual/bf16.yml`
- `qwen3.6-27b/vllm/compose/dual/tq3-mtp.yml` (tombstoned)
- `qwen3.6-27b/vllm/compose/dual/tq3-mtp-genesis.yml`

TPS (decode, single-stream)
TQ3 caveat: very high variance (decode_TPS min=58, max=122 across 3 narrative runs). The workspace-lock patch (see below) combined with MTP appears to cause inconsistent decode behavior. Per-run output is degraded — see Quality and Verify-stress.
Genesis row note: the Genesis-backed TQ3+MTP row is the working version of the TQ3+MTP intent. CV is 1-4% (matched-config-comparable), output is coherent, and MTP runs at AL≈3.5 / per-position [0.95, 0.84, 0.75] sustained.
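The AL≈3.5 figure is consistent with the per-position acceptance rates quoted above, under one common definition of acceptance length: each verify step always emits one token from the target model, plus one more for every draft position that was accepted. The sketch below assumes that definition and assumes each per-position rate is the fraction of verify steps in which that draft position was accepted; the function name is hypothetical and the benchmark scripts may compute AL differently.

```python
def mean_acceptance_length(per_position_rates):
    """Expected tokens emitted per verify step.

    Assumes AL = 1 (the token the target model always contributes)
    + the sum of per-position draft acceptance rates.
    """
    return 1.0 + sum(per_position_rates)


# Per-position acceptance rates reported for the Genesis-backed TQ3+MTP leg.
rates = [0.95, 0.84, 0.75]
print(f"AL = {mean_acceptance_length(rates):.2f}")
```

Under these assumptions the three rates give an AL of about 3.54, which rounds to the reported 3.5.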
Quality (--full, 8 packs, 150 scenarios)
TQ3 caveat: the catastrophic drop is not a TurboQuant capability claim — it's the workspace-lock relaxation patch corrupting model output on anything but trivial generations. See "TQ3 patch chain" below.
Genesis row note: within ~5pp of Qwen INT8 PTH baseline (86 vs 94) — consistent with the TQ3 paper's ~4pp expected quality penalty vs BF16-equivalent KV. Confirms Genesis P67 closes the multi-query verify path that the patch-only TQ3 leg breaks.
Verify-stress (7 boundary checks)
"turururururur", "v111111"). Mid-range (tool prefill / IDE-agent / multi-turn / LCB / reasoning-heavy) all passed. Long-context attention is corrupted with the workspace patch + MTP combination.

`turquoise iguana 82` was recalled correctly at 60K).

Soak (10 sessions × 5 turns)
Aider-polyglot-30
TQ3 aider note: the run hit the 2700s timeout with 0/30 exercises completed. Each exercise needs coherent multi-turn code output; the workspace-patch corruption produces malformed responses the aider harness can't make sense of, so every exercise burns its full timeout window without completion.
Genesis row note: 18/30 lands -1 vs Qwen INT8 PTH (19/30) — within noise. Confirms TQ3 quality drop doesn't measurably hurt multi-turn coding tasks at this scale.
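One rough way to put a number on "within noise" for 18/30 vs 19/30 is to compare the difference with its binomial standard error. This is only a sketch under a simplifying assumption: it treats the two runs as independent samples of 30 pass/fail trials, whereas both legs actually ran the same 30 exercises, so a paired analysis would be more precise. The function name is illustrative.

```python
import math


def diff_in_standard_errors(passed_a, passed_b, total):
    """Difference of two pass rates, expressed in units of its standard error.

    Assumes two independent binomial samples of the same size (a simplification).
    """
    p_a = passed_a / total
    p_b = passed_b / total
    se = math.sqrt(p_a * (1 - p_a) / total + p_b * (1 - p_b) / total)
    return (p_a - p_b) / se


z = diff_in_standard_errors(18, 19, 30)
print(f"difference = {z:.2f} standard errors")
```

A one-exercise difference at n=30 is a small fraction of one standard error, so a sample this size cannot distinguish the two legs on aider-polyglot.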
Headline takeaways (with the data we have so far)
Gemma bf16 leads on raw TPS (109 narr / 143 code) — no KV-quant overhead. INT8 PTH costs Gemma ~11% TPS for 2× KV pool (262K vs 200K). bf16 is the right pick if you have ctx budget at single-stream.
Qwen INT8 PTH leads on aider-polyglot (19/30, 63.3%) — multi-turn code editing benefits from the 2-seq concurrency budget Qwen INT8 PTH affords at 262K. Gemma's INT8 PTH leg ran at seqs=1 (Gemma weights bigger → less KV budget left over) so it can't match.
Gemma bf16 leads on quality-full (103/150, 68.7%), Gemma INT8 PTH second (100/150). The 3-point INT8-PTH → bf16 quality lift is real but small. Qwen legs land 94-96/150 — Qwen's instruction-following is slightly weaker than Gemma 4 on the cli-40 benchmark in particular (but matches everywhere else).
Cross-model: Qwen wins multi-turn agent (aider) by 5 points, Gemma wins single-turn quality (quality-full) by 3-7 points. The gap is real, not config-drift; both INT8 PTH legs are matched on ctx (262K) and KV class.
TQ3 KV pool is dramatic (1.42M tokens vs INT8 PTH's 605K = +134%, 5.41× concurrency at 262K, no-MTP). The patch-only TQ3+MTP attempt is NOT usable (see leg 5 caveats), but the Genesis-backed TQ3+MTP path IS deployable (see leg 6): 1.22M KV pool / 4.66× concurrency, 7/7 verify-stress, 0/100 silent-empty, 18/30 aider — all within noise of the Qwen INT8 PTH baseline at 2× the KV capacity. Genesis P67 (multi-query verify kernel for spec-decode K+1) closes the path the patch-only chain cannot.
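The concurrency and pool-gain figures above follow from the KV pool ÷ max-model-len ratio. The sketch below uses the rounded pool sizes quoted in this post and treats "262K" as 262,000 tokens, which is an assumption (the configured max-model-len may be 262,144), so the last digit can differ slightly from the reported values. Function and variable names are illustrative.

```python
def concurrency(kv_pool_tokens, max_model_len):
    """How many full-context sequences fit in the KV pool."""
    return kv_pool_tokens / max_model_len


def pool_gain_pct(pool_tokens, baseline_tokens):
    """KV pool size relative to a baseline, as a percentage increase."""
    return 100.0 * (pool_tokens / baseline_tokens - 1.0)


MAX_MODEL_LEN = 262_000  # assumption: "262K" taken literally
POOLS = {
    "INT8 PTH": 605_000,
    "TQ3 (no-MTP)": 1_420_000,
    "TQ3+MTP (Genesis)": 1_220_000,
}

baseline = POOLS["INT8 PTH"]
for name, pool in POOLS.items():
    print(f"{name}: {concurrency(pool, MAX_MODEL_LEN):.2f}x concurrency, "
          f"{pool_gain_pct(pool, baseline):+.0f}% pool vs INT8 PTH")
```

With these rounded inputs the INT8 PTH pool fits a little over two full-context sequences, which lines up with the 2-seq budget mentioned for the aider run, and the Genesis-backed pool comes out at roughly twice the INT8 PTH capacity.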
TQ3 + MTP — two paths (one broken, one working)
dual/tq3-mtp.yml, tombstoned)

dual/tq3-mtp-genesis.yml)

The workspace lock was added (per upstream code comment) to prevent a "DBO tensor leak" when ubatches hold views into the old workspace. Relaxing it in user space corrupts outputs because the lock guards more invariants than DBO alone; Genesis P67 instead introduces a proper multi-query Triton kernel for spec-decode K+1 verify against the compressed TQ cache, which is the real fix.
Path forward for upstream-only (no-Genesis) TQ3 + MTP: wait for an upstream P67-equivalent multi-query verify kernel (none of #40914 / #40798 / #40792 / #42215 implement it today; see `docs/UPSTREAM.md` and the `tq3-mtp.yml` header). Until then, the Genesis-backed compose is the production-grade TQ+MTP path.

Notes
`REPORT.md` in `results/rebench/<tag>/REPORT.md` with phase timings, per-language breakdown, reproducer command, and delta vs the previous comparable leg.

`decode_TPS` (post-warmup, 3 measured runs each).

Pending
(none — all 6 legs complete as of 2026-05-11 18:16 UTC. The 6th leg adds the Genesis-backed TQ3+MTP path as the deployable counterpart to leg 5's broken patch-only attempt. Together legs 5+6 establish: TQ3+MTP without Genesis is not viable today; with Genesis it lands within ~5pp of the INT8 PTH quality baseline at 2× the KV pool.)