# Head-to-Head, Gemma 4 31B vs Qwen3.6 27B — 6 configs on dual 3090 (2026-05-11) #119

noonghunna (Maintainer) · 2026-05-12T01:07:38Z

All legs run via `scripts/rebench-full.sh` (bench + verify-stress + quality-full + soak + aider-polyglot-30). Each leg's results are in `results/rebench/<tag>/REPORT.md`.

Pin note: legs 1-5 share nightly pin 1acd67a7. The Genesis-backed leg (6) runs on the older Genesis-allowlisted pin 01d4d1ad3, because Genesis v7.72.2's `KNOWN_GOOD_VLLM_PINS` does not include 1acd67a7, and on 1acd67a7 Genesis triggers the `maybe_override_with_speculators` + transformers 5.8.0 `cached_file` regression. Re-evaluate when Sander tags v7.73.x with a refreshed allowlist.

Glossary

| Term | Definition |
|---|---|
| BF16 (bfloat16) | 16-bit floating point; vLLM's default unquantized KV-cache type. Highest quality, largest VRAM footprint per token. |
| INT8 PTH (per-token-head) | 8-bit KV-cache quantization scaled per (token, head) pair. ~2× the KV pool of BF16 at near-baseline quality. The production-safe baseline on this stack. |
| TQ3 / TurboQuant 3-bit | Aggressive ~3-bit-per-token KV quantization (`turboquant_3bit_nc`). ~5× the KV pool of BF16, ~2× INT8 PTH. ~4pp quality penalty vs BF16 per the TQ paper. Hybrid-attention models (Qwen3-Next) need a special multi-query verify kernel for spec-decode. |
| KV cache pool | GPU memory reserved for cached attention keys + values. Determines how many concurrent streams (or how much context per stream) fit on-device. This, not compute, is the actual bottleneck on a 2× 24 GB rig. |
| Max concurrency @ ctx | `KV pool ÷ max-model-len`. e.g. 1.22M tokens ÷ 262K = 4.66×: the rig can serve 4 streams at full context, or roughly 12 parallel agent threads at 100K each (12 × 100K = 1.2M tokens). |
| MTP (Multi-Token Prediction) | Built-in speculative-decoding heads shipped with Qwen3-Next / Gemma 4. The model proposes K draft tokens per step then verifies them in a K+1 batch. When accept rates are high, throughput goes up ~1.5-3×. |
| Spec-decode K+1 verify | The verify pass after MTP drafts K tokens. Needs a special multi-query attention kernel against the compressed KV cache; without it, TQ3 + MTP corrupts long-context output (see broken leg). |
| Genesis (genesis-vllm-patches) | Sander's modular vLLM patch stack: fixes that haven't merged upstream yet. v7.72.2 ships ~30+ patches covering TQ + hybrid attention + spec-decode interactions. |
| Genesis P67 | The specific Genesis patch that makes TQ3+MTP work: a proper multi-query Triton kernel for spec-decode K+1 verify against compressed TQ cache. No upstream PR implements this today. |
| TP=2 (tensor-parallel = 2) | Model layers split across both GPUs. Doubles available VRAM at the cost of PCIe round-trips. Standard mode for 27-31B models on dual-3090. |
| TTFT | Time-To-First-Token: milliseconds from request to the first output token. Reflects prefill cost + dispatch overhead. |
| CV (coefficient of variation) | stddev ÷ mean, expressed as %. A CV of 1-5% is healthy benchmark noise; >20% means the run isn't reproducible. |
| wall TPS vs decode TPS | "wall" tokens-per-second includes prefill latency (first-token cost); "decode" excludes prefill (pure generation throughput). Decode is closer to sustained-output feel; wall is round-trip cost. |
| narrative / code (bench prompts) | The two canonical bench prompts. narrative = "Write a detailed 800-word essay explaining transformer attention." (max_tokens=1000). code = quicksort-style prompt (max_tokens=800, prefill + decode mix). |
| verify-stress | A 7-check ladder: needle 10K/30K, tool prefill 25K, IDE-agent, multi-turn, LCB-coding, reasoning-heavy 8K, needle 60K/91K. Pass = correct outputs across all checks. |
| needle in a haystack | Test where a secret phrase (e.g. `"crimson otter 19"`) is planted at a known token depth and the model is asked to recall it. Tests long-context attention precision. |
| soak / silent_empty | Soak = 20 sessions × 5 turns to catch slow regressions (VRAM growth, latency drift). silent_empty = a turn that returned empty content despite being prompted, a known cliff on broken configs. |
| aider-polyglot-30 | Upstream Aider polyglot benchmark, 30 exercises across 6 languages (cpp/go/java/js/python/rust). Tests multi-turn code-editing under tool use. Deterministic grader. |
| AutoRound INT4 | 4-bit weight quantization (model weights, not KV cache). Different from KV-cache quantization. Tom (Lorbus) calibrated the Qwen3.6-27B weights used here. |

Config matrix

| Leg | compose | tag | ctx | seqs | KV dtype | MTP n | KV pool | concurrency |
|---|---|---|---|---|---|---|---|---|
| Qwen INT8 PTH | `qwen3.6-27b/vllm/compose/dual/int8.yml` | qwen-int8-pth-n4-2026-05-10 | 262K | 2 | int8_per_token_head | 4 | 605K tok | 2.31× |
| Gemma INT8 PTH | `gemma-4-31b/vllm/compose/dual/int8.yml` | gemma-int8-pth-n4-2026-05-11 | 262K | 1 | int8_per_token_head | 4 | 499K tok | 1.90× |
| Gemma bf16 | `gemma-4-31b/vllm/compose/dual/bf16.yml` | gemma-bf16-n4-2026-05-11 | 200K | 1 | bf16 | 4 | 241K tok | 1.21× |
| Qwen bf16 | `qwen3.6-27b/vllm/compose/dual/bf16.yml` | qwen-bf16-n4-2026-05-11 | 200K | 1 | bf16 | 4 | 319K tok | 1.59× |
| Qwen INT8 TQ3 | `qwen3.6-27b/vllm/compose/dual/tq3-mtp.yml` (tombstoned) | qwen-int8-tq3-n3-2026-05-11 | 262K | 2 | turboquant_3bit_nc | 3 | 1.42M tok | 5.41× |
| Qwen TQ3+MTP (Genesis) | `qwen3.6-27b/vllm/compose/dual/tq3-mtp-genesis.yml` | qwen-tq3-mtp-genesis-2026-05-11 | 262K | 2 | turboquant_3bit_nc | 3 | 1.22M tok | 4.66× |
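The concurrency column follows from the glossary formula (KV pool ÷ max-model-len). A minimal sketch that recomputes it from the rounded figures in the matrix, assuming "K" means 1,000 and "M" means 1,000,000 (the real max-model-len may be 262,144, so small differences in the last digit are expected):

```python
# KV pool (tokens) and max-model-len per leg, read off the config matrix.
# Values are the rounded table figures, so results match only approximately.
LEGS = {
    "Qwen INT8 PTH": (605_000, 262_000),
    "Gemma INT8 PTH": (499_000, 262_000),
    "Gemma bf16": (241_000, 200_000),
    "Qwen bf16": (319_000, 200_000),
    "Qwen INT8 TQ3": (1_420_000, 262_000),
    "Qwen TQ3+MTP (Genesis)": (1_220_000, 262_000),
}


def max_concurrency(kv_pool_tokens: int, max_model_len: int) -> float:
    """How many full-context streams fit in the KV pool."""
    return kv_pool_tokens / max_model_len


for leg, (pool, ctx) in LEGS.items():
    print(f"{leg:24s} {max_concurrency(pool, ctx):.2f}x")
```

The integer part is the number of streams that fit at full context; the fractional remainder is headroom for shorter requests.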

TPS (decode, single-stream)

How to read this table. Two prompts run per leg: narrative (an 800-word essay — sustained decode) and code (a quicksort prompt — prefill + decode mix). decode TPS excludes the first-token prefill cost; wall TPS (in REPORT.md per-leg) includes it. CV = run-to-run variability (lower is better; >20% means the run isn't reproducible). All numbers are from 3 measured runs after 1 warmup. See Glossary above for any terms in doubt.

| Leg | narrative TPS | narrative CV | code TPS | code CV |
|---|---|---|---|---|
| Qwen INT8 PTH | 85.0 | 2.9% | 121.1 | 0.5% |
| Gemma INT8 PTH | 97.4 | 4.4% | 129.4 | 0.7% |
| Gemma bf16 | 108.6 | 2.9% | 142.9 | 0.5% |
| Qwen bf16 | 86.9 | 3.2% | 126.2 | 4.8% |
| Qwen INT8 TQ3 | 98.7 | 35.9% ⚠️ | 104.5 | 27.1% ⚠️ |
| Qwen TQ3+MTP (Genesis) | 89.2 | 3.8% | 119.1 | 0.9% |

TQ3 caveat: very high variance (decode_TPS min=58, max=122 across 3 narrative runs). The workspace-lock patch (see below) combined with MTP appears to cause inconsistent decode behavior. Per-run output is degraded — see Quality and Verify-stress.
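To make the CV figure concrete, here is a small sketch of the calculation. The per-run values are not in this post: the minimum and maximum come from the caveat above, and the middle value is hypothetical, chosen so that the mean matches the reported 98.7. Using the sample standard deviation is also an assumption about how the bench script computes it.

```python
import statistics


def cv_percent(samples: list[float]) -> float:
    """Coefficient of variation: sample stddev / mean, as a percentage."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)


# Illustrative only: 58 and 122 are the reported min/max; 116.1 is a
# hypothetical middle run picked so the mean lands on 98.7.
tq3_narrative_runs = [58.0, 116.1, 122.0]

print(f"mean = {statistics.mean(tq3_narrative_runs):.1f} TPS")
print(f"CV   = {cv_percent(tq3_narrative_runs):.1f}%")
```

With these inputs the CV comes out at roughly 36%, in line with the table, which suggests (but does not prove) that the reported CV is a sample-stddev figure over the three measured runs.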

Genesis row note: the Genesis-backed TQ3+MTP row is the working version of the TQ3+MTP intent. CV is 1-4% (matched-config-comparable), output is coherent, and MTP runs at AL≈3.5 / per-position [0.95, 0.84, 0.75] sustained.
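The AL and per-position numbers above are consistent with each other under a common definition. A sketch, assuming each per-position rate is the fraction of verify steps in which the draft token at that position was accepted, and that every step always emits one token from the verify pass itself:

```python
def mean_acceptance_length(per_position_accept: list[float]) -> float:
    """Expected tokens emitted per step: 1 verified token plus accepted drafts.

    Assumes per-position rates are unconditional (fraction of all steps in
    which that draft position was accepted), so they can simply be summed.
    """
    return 1.0 + sum(per_position_accept)


al = mean_acceptance_length([0.95, 0.84, 0.75])
print(f"AL = {al:.2f}")
```

With MTP n=3 the ceiling is 4 tokens per step, so an AL near 3.5 means most drafts are being accepted.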

Quality (--full, 8 packs, 150 scenarios)

| Leg | total | toolcall | instruct | structout | dataext | reasonmath | bugfind | hermes | cli-40 |
|---|---|---|---|---|---|---|---|---|---|
| Qwen INT8 PTH | 94/150 (62.7%) | 10/15 | 13/15 | 13/15 | 15/15 | 6/15 | 11/15 | 9/20 | 17/40 |
| Gemma INT8 PTH | 100/150 (66.7%) | 9/15 | 13/15 | 13/15 | 15/15 | 6/15 | 14/15 | 11/20 | 19/40 |
| Gemma bf16 | 103/150 (68.7%) | 9/15 | 13/15 | 13/15 | 15/15 | 6/15 | 14/15 | 14/20 | 19/40 |
| Qwen bf16 | 96/150 (64.0%) | 10/15 | 13/15 | 13/15 | 15/15 | 6/15 | 11/15 | 9/20 | 19/40 |
| Qwen INT8 TQ3 | 18/150 (12.0%) ⚠️ | 2/15 | 8/15 | 7/15 | 0/15 | 0/15 | 0/15 | 1/20 | 0/40 |
| Qwen TQ3+MTP (Genesis) | 86/150 (57.3%) | 9/15 | 13/15 | 13/15 | 15/15 | 4/15 | 11/15 | 1/20 | 20/40 |

TQ3 caveat: the catastrophic drop is not a TurboQuant capability claim — it's the workspace-lock relaxation patch corrupting model output on anything but trivial generations. See "TQ3 patch chain" below.

Genesis row note: within ~5pp of Qwen INT8 PTH baseline (86 vs 94) — consistent with the TQ3 paper's ~4pp expected quality penalty vs BF16-equivalent KV. Confirms Genesis P67 closes the multi-query verify path that the patch-only TQ3 leg breaks.
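The ~5pp figure can be rebuilt from the per-pack scores, which also shows where the gap comes from. A minimal sketch using only numbers from the quality table (the dictionary layout and function names are illustrative, not taken from the bench scripts):

```python
PACKS = ["toolcall", "instruct", "structout", "dataext",
         "reasonmath", "bugfind", "hermes", "cli-40"]
PACK_SIZES = [15, 15, 15, 15, 15, 15, 20, 40]  # 150 scenarios in total

SCORES = {
    "Qwen INT8 PTH": [10, 13, 13, 15, 6, 11, 9, 17],
    "Qwen TQ3+MTP (Genesis)": [9, 13, 13, 15, 4, 11, 1, 20],
}


def pass_rate(scores: list[int]) -> float:
    """Raw pass rate in percent across all packs."""
    return 100.0 * sum(scores) / sum(PACK_SIZES)


baseline = SCORES["Qwen INT8 PTH"]
genesis = SCORES["Qwen TQ3+MTP (Genesis)"]

delta_pp = pass_rate(genesis) - pass_rate(baseline)
per_pack_delta = {p: g - b for p, g, b in zip(PACKS, genesis, baseline)}

print(f"total delta: {delta_pp:+.1f}pp")
print({p: d for p, d in per_pack_delta.items() if d != 0})
```

The per-pack breakdown shows the deficit is concentrated in hermes (1/20 vs 9/20), partly offset by a gain on cli-40, rather than spread evenly across packs.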

Verify-stress (7 boundary checks)

| Leg | result |
|---|---|
| Qwen INT8 PTH | ✓ pass |
| Gemma INT8 PTH | ✓ pass |
| Gemma bf16 | ✓ pass |
| Qwen bf16 | ✓ pass |
| Qwen INT8 TQ3 | ✗ 2 failed: long-context needle at 10K/30K/60K/90K returned garbage (`"turururururur"`, `"v111111"`). Mid-range (tool prefill / IDE-agent / multi-turn / LCB / reasoning-heavy) all passed. Long-context attention is corrupted with the workspace patch + MTP combination. |
| Qwen TQ3+MTP (Genesis) | ✓ pass (7/7: all rungs including the 60K needle. The 90K rung saw a transient HTTP 000, but `turquoise iguana 82` was recalled correctly at 60K). |
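For readers unfamiliar with the needle rungs, the idea can be sketched as follows. This is not the verify-stress script (which is not shown in this post); the function names, filler text, and prompt wording are hypothetical, and word count stands in for token count.

```python
def build_needle_prompt(needle: str, filler_sentence: str,
                        total_words: int, depth: float) -> str:
    """Plant `needle` at a fractional depth (0.0-1.0) inside repeated filler."""
    filler_words = filler_sentence.split()
    words = [filler_words[i % len(filler_words)] for i in range(total_words)]
    cut = int(len(words) * depth)
    planted = words[:cut] + [f"The secret phrase is {needle}."] + words[cut:]
    question = "What is the secret phrase? Answer with the phrase only."
    return " ".join(planted) + "\n\n" + question


def recalled(response: str, needle: str) -> bool:
    """Pass if the model's answer contains the planted phrase."""
    return needle.lower() in response.lower()


prompt = build_needle_prompt(
    needle="crimson otter 19",
    filler_sentence="The quick brown fox jumps over the lazy dog.",
    total_words=1000,
    depth=0.5,
)
print(len(prompt.split()), "words in prompt")
```

A corrupted long-context path shows up as a response that fails `recalled`, such as the `"turururururur"` output quoted in the table.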

Soak (20 sessions × 5 turns)

| Leg | verdict | silent-empty turns | p50 decode TPS | notes |
|---|---|---|---|---|
| Qwen INT8 PTH | PASS | 0/100 | — | clean |
| Gemma INT8 PTH | PASS | 0/100 | — | clean |
| Gemma bf16 | PASS | 0/100 | — | clean |
| Qwen bf16 | PASS | 0/100 | — | clean |
| Qwen INT8 TQ3 | PASS* | 34/100 (34%) ⚠️ | 126.5 | verdict-PASS, but 34 of 100 turns came back silent-empty vs 0/100 on every other leg; same workspace-patch corruption surface that breaks verify-stress + quality-full |
| Qwen TQ3+MTP (Genesis) | PASS | 0/100 | 118.54 | tps_retention 99.2%, max_growth 0/200 MiB, 0 errors (clean) |
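The silent-empty count is simple to reproduce from raw turn outputs. A sketch, assuming a turn counts as silent-empty when its content is missing or whitespace-only (the soak script's exact rule is not shown here, and the sample data is synthetic):

```python
from typing import Optional


def silent_empty_count(turn_contents: list[Optional[str]]) -> tuple[int, int]:
    """Return (empty_turns, total_turns) for a flat list of turn outputs."""
    empty = sum(1 for content in turn_contents if not (content or "").strip())
    return empty, len(turn_contents)


# Synthetic example shaped like one soak run: 20 sessions x 5 turns = 100 turns.
turns: list[Optional[str]] = ["ok"] * 66 + [""] * 30 + [None] * 2 + ["  \n"] * 2
empty, total = silent_empty_count(turns)
print(f"silent-empty: {empty}/{total} ({100 * empty / total:.0f}%)")
```

Reporting the count next to the verdict matters because, as the TQ3 row shows, a run can be marked PASS while a third of its turns are empty.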

Aider-polyglot-30

| Leg | total | cpp | go | java | js | python | rust |
|---|---|---|---|---|---|---|---|
| Qwen INT8 PTH | 19/30 (63.3%) | 3/5 | 4/5 | 4/5 | 4/5 | 3/5 | 1/5 |
| Gemma INT8 PTH | 14/30 (46.7%) | 2/5 | 3/5 | 1/5 | 4/5 | 3/5 | 1/5 |
| Gemma bf16 | 14/30 (46.7%) | 2/5 | 3/5 | 0/5 | 4/5 | 3/5 | 2/5 |
| Qwen bf16 | 17/30 (56.7%) | 3/5 | 3/5 | 3/5 | 4/5 | 2/5 | 2/5 |
| Qwen INT8 TQ3 | 0/30 (0%) ⚠️ | — | — | — | — | — | — |
| Qwen TQ3+MTP (Genesis) | 18/30 (60.0%) | 3/5 | 3/5 | 3/5 | 4/5 | 3/5 | 2/5 |

TQ3 aider note: the run hit the 2700s timeout with 0/30 exercises completed. Each exercise needs coherent multi-turn code output; the workspace-patch corruption produces malformed responses the aider harness can't make sense of, so every exercise burns its full timeout window without completion.

Genesis row note: 18/30 lands -1 vs Qwen INT8 PTH (19/30) — within noise. Confirms TQ3 quality drop doesn't measurably hurt multi-turn coding tasks at this scale.

Headline takeaways (with the data we have so far)

- Gemma bf16 leads on raw TPS (109 narr / 143 code) with no KV-quant overhead. INT8 PTH costs Gemma ~10% TPS (97 narr / 129 code) in exchange for ~2× the KV pool (499K vs 241K tokens) and a longer context (262K vs 200K). bf16 is the right pick if 200K of context at single-stream is enough.
- Qwen INT8 PTH leads on aider-polyglot (19/30, 63.3%). Multi-turn code editing benefits from the 2-seq concurrency budget Qwen INT8 PTH affords at 262K. Gemma's INT8 PTH leg ran at seqs=1 (Gemma's weights are bigger, so less KV budget is left over) and can't match.
- Gemma bf16 leads on quality-full (103/150, 68.7%), with Gemma INT8 PTH second (100/150). The 3-point INT8 PTH → bf16 lift is small and comes entirely from hermes (11/20 → 14/20). Qwen legs land at 94-96/150: Qwen trails Gemma mainly on bugfind (11/15 vs 14/15) and hermes (9/20 vs 11-14/20), is 2 behind on cli-40 in the INT8 PTH leg only (17/40 vs 19/40), and is level or ahead on the other packs.
- Cross-model: Qwen wins multi-turn agent (aider) by 5 points, Gemma wins single-turn quality (quality-full) by 3-7 points. The gap is real, not config-drift; both INT8 PTH legs are matched on ctx (262K) and KV class.
- TQ3 KV pool is dramatic (1.42M tokens vs INT8 PTH's 605K = +134%, 5.41× concurrency at 262K, no-MTP). The patch-only TQ3+MTP attempt is NOT usable (see leg 5 caveats), but the Genesis-backed TQ3+MTP path IS deployable (see leg 6): 1.22M KV pool / 4.66× concurrency, 7/7 verify-stress, 0/100 silent-empty, 18/30 aider, all within noise of the Qwen INT8 PTH baseline at 2× the KV capacity. Genesis P67 (multi-query verify kernel for spec-decode K+1) closes the path the patch-only chain cannot.

TQ3 + MTP — two paths (one broken, one working)

| Path | Patches | Result | When to use |
|---|---|---|---|
| Patch-only (`dual/tq3-mtp.yml`, tombstoned) | marlin-pad + workspace-lock relaxation | broken: verify-stress 5/7, quality 18/150, aider 0/30 | only as a re-test artifact; not deployable |
| Genesis-backed (`dual/tq3-mtp-genesis.yml`) | marlin-pad + Genesis v7.72.2 (52 `GENESIS_*` env vars; P67 ON) | working: verify-stress 7/7, quality 86/150, aider 18/30 | when you want the TQ3 KV pool and MTP throughput |

The workspace lock was added (per the upstream code comment) to prevent a "DBO tensor leak" when ubatches hold views into the old workspace. Relaxing it from user space corrupts outputs because the lock guards more invariants than DBO alone. Genesis P67 instead introduces a proper multi-query Triton kernel for spec-decode K+1 verify against the compressed TQ cache, which is the real fix.

Path forward for upstream-only (no-Genesis) TQ3 + MTP: wait for an upstream P67-equivalent multi-query verify kernel (none of #40914 / #40798 / #40792 / #42215 implement it today — see docs/UPSTREAM.md and tq3-mtp.yml header). Until then, the Genesis-backed compose is the production-grade TQ+MTP path.

Notes

- Each leg has its full REPORT.md in results/rebench/<tag>/REPORT.md with phase timings, per-language breakdown, reproducer command, and delta vs the previous comparable leg.
- All TPS numbers above are decode_TPS (post-warmup, 3 measured runs each).
- Quality % uses raw pass rate; some packs (cli-40) have stricter rubrics than others.
- Aider scores reflect the upstream Aider polyglot-30 v1.0.0 deterministic grader.

Pending

(none — all 6 legs complete as of 2026-05-11 18:16 UTC. The 6th leg adds the Genesis-backed TQ3+MTP path as the deployable counterpart to leg 5's broken patch-only attempt. Together legs 5+6 establish: TQ3+MTP without Genesis is not viable today; with Genesis it lands within ~5pp of the INT8 PTH quality baseline at 2× the KV pool.)

noonghunna (Maintainer, Author) · 2026-05-12T21:25:16Z

Update — v0.5.0 ships, Qwen `hermesagent-20` re-benches at 12/20 (60%)

v0.5.0 (release, 2026-05-12) ships froggeric/Qwen-Fixed-Chat-Templates as the default chat template on all 22 vanilla Qwen 3.6-27B composes via --chat-template. Background and the A/B that drove the call: discussion #121.

This makes the Qwen rows in the matrix above stale on the multi-turn agentic pack. Re-bench of the matched-config Qwen INT8 PTH leg (same compose, same vLLM nightly, same KV class, same MTP n — only difference is the chat template now defaults to froggeric):

| Metric (Qwen INT8 PTH) | 2026-05-10 row above | v0.5.0 re-bench (2026-05-12) | Δ |
|---|---|---|---|
| `toolcall-15` | 10/15 (67%) | 10/15 (67%) | 0 |
| `instructfollow-15` | 13/15 | 13/15 | 0 |
| `structoutput-15` | 13/15 | 13/15 | 0 |
| `dataextract-15` | 15/15 | 15/15 | 0 |
| `reasonmath-15` | 6/15 | 6/15 | 0 |
| `bugfind-15` | 11/15 | 11/15 | 0 |
| `hermesagent-20` | 9/20 (45%) | 12/20 (60%) | +3 (+15pp) |
| `cli-40` | 17/40 (42%) | 17/40 (42%) | 0 |
| TOTAL | 94/150 (62.7%) | 97/150 (64.7%) | +3 (+2pp) |

The move is concentrated in hermesagent-20 (multi-turn agentic loops) — exactly the pack where the froggeric template's three load-bearing fixes compound: empty <think></think> past-turn pollution, no-user-query graceful fallback, unclosed-think-before-tool-call. The other 7 packs are flat = no regression on single-turn flows.

Cross-model implications

The headline cross-model gap from the original report — "Qwen wins multi-turn agent (aider) by 5 points, Gemma wins single-turn quality (quality-full) by 3-7 points" — needs an update on the agentic dimension specifically:

| `hermesagent-20` | Original | v0.5.0 |
|---|---|---|
| Qwen INT8 PTH | 9/20 (45%) | 12/20 (60%) |
| Gemma INT8 PTH | 11/20 (55%) | 11/20 (55%) (unchanged; Gemma composes untouched) |
| Gemma bf16 | 14/20 (70%) | 14/20 (70%) (unchanged) |

Qwen catches up on hermesagent-20 (60% vs the 55-70% Gemma range) but Gemma bf16 still leads. For Qwen INT8 PTH, the gap on multi-turn agentic tasks narrowed from -25pp to -10pp vs Gemma bf16, and flipped from -10pp to +5pp vs the matched-config Gemma INT8 PTH. The aider-polyglot Qwen lead from the original row still stands (19/30 vs 14/30) and likely widens with the template fix, though aider was not re-run.
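The gap figures follow directly from the hermesagent-20 table. A small sketch that recomputes them in percentage points (the data layout and helper name are illustrative):

```python
HERMES_TOTAL = 20

# Passes out of 20, from the hermesagent-20 table above.
HERMES = {
    "Qwen INT8 PTH": {"original": 9, "v0.5.0": 12},
    "Gemma INT8 PTH": {"original": 11, "v0.5.0": 11},
    "Gemma bf16": {"original": 14, "v0.5.0": 14},
}


def gap_pp(leg: str, other: str, column: str) -> float:
    """Pass-rate gap of `leg` minus `other`, in percentage points."""
    diff = HERMES[leg][column] - HERMES[other][column]
    return 100.0 * diff / HERMES_TOTAL


for other in ("Gemma INT8 PTH", "Gemma bf16"):
    before = gap_pp("Qwen INT8 PTH", other, "original")
    after = gap_pp("Qwen INT8 PTH", other, "v0.5.0")
    print(f"Qwen INT8 PTH vs {other}: {before:+.0f}pp -> {after:+.0f}pp")
```

Both comparisons move by the same 15pp, since only the Qwen score changed between the two columns.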

What's NOT re-benched yet

Only Qwen INT8 PTH was re-run end-to-end against v0.5.0. The other two Qwen rows in the original matrix are now stale for the same reason (both got the froggeric default in v0.5.0) but haven't been re-benched:

- Qwen bf16 (dual/bf16.yml): expect a similar hermesagent-20 lift (~+10 to +15pp).
- Qwen TQ3+MTP (Genesis) (dual/tq3-mtp-genesis.yml): also expect a lift, especially relevant since the 2026-05-11 row showed hermesagent-20 at only 1/20 (5%); the template fix should restore a meaningful score there. Could be the most impactful re-bench of the three.

Carnice and Qwopus composes are intentionally not affected by the v0.5.0 change (they ship bespoke Hermes-JSON templates and were excluded from the default-on rollout — see commit 84498d4).

Reproducer

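    # Check out the v0.5.0 release
    git fetch
    git checkout v0.5.0
    # Restart the Qwen INT8 PTH compose so the new default chat template is picked up
    docker compose -f models/qwen3.6-27b/vllm/compose/dual/int8.yml down
    MODEL_DIR=/mnt/models/huggingface docker compose -f models/qwen3.6-27b/vllm/compose/dual/int8.yml up -d
    # Run the full quality suite (8 packs, 150 scenarios)
    bash scripts/quality-test.sh --full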

The --chat-template flag is now baked into the compose so nothing extra to set; the template at models/qwen3.6-27b/vllm/patches/froggeric-chat-template/chat_template.jinja is mounted into /etc/qwen-froggeric-chat-template.jinja at boot.

If anyone re-benches Qwen bf16 or Qwen TQ3+MTP (Genesis) against v0.5.0, drop the numbers below — would close the matrix.
