Gemma-4-31B at long context on a single 24 GB GPU: why it's hard, and what actually works #239
Thank you for trying it out. I'm currently cooking up BeeLlama v0.3.0 with an upstream llama.cpp rebase (MTP support and more) and a proper GitHub CI/CD setup for prebuilts and Docker images. Stay tuned!
Second model on beellama.cpp: Qwen3.6-27B, single RTX 3090
Companion to the Gemma-4 beellama numbers above: same engine (beellama DFlash + windowed KV), different model.
Config: unsloth Qwen3.6-27B Q5_K_S + Anbeeld/Qwen3.6-27B-DFlash-GGUF IQ4_XS drafter,
Notable: DFlash sidesteps the DeltaNet KV-rollback that blocks built-in spec-dec for Qwen3-Next. This is the first working spec-dec on the hybrid arch on our rig.
Bench: narrative 49.5 / code 102.1 TPS, VRAM 21.9 GB @ 100K. (Re-confirmed 2026-05-30: 50.2 / 99.7 TPS, 22.1 GB @ 102K.)
8-pack quality, reasoning OFF vs ON (benchlocal-cli, temp 0; same-session re-run 2026-05-30):
Reasoning is a slight net-positive (+6) on Qwen here. The gains concentrate on the agentic packs (cli-40 +4, dataextract +2), the same shape as the Gemma-4 result above, though smaller and within the ±5–7 8-pack noise band (n=1). Both settings sit in line with base Qwen3.6-27B (~99–103 in the same harness): DFlash is lossless, so quality holds at either.
aider-polyglot-30 (reasoning-on): 20/30 (67%) in ~1h51m, versus the thinking-off baseline of 19/30 (63%) in ~21 min. Reasoning-on is therefore +1 (a tie within noise) at ~5× the wall-clock (111 min vs ~21 min), so the small 8-pack bump doesn't translate to a real coding-agency gain; thinking-off is the right config for Qwen coding agents.
Caveat: beellama is a deep fork chain with no prebuilt Docker, so treat it as community/experimental.
Gemma-4-31B on mainline llama.cpp + FA_ALL_QUANTS, single RTX 3090
Detailed numbers for the supported, no-fork path from the main post.
Config:
Speed:
| Workload | Wall TPS | Decode TPS | TTFT | VRAM (1× 3090) |
|---|---|---|---|---|
| Narrative (1000 tok) | 36.4 | 36.8 | 0.30 s | ~19.2 GB |
| Code (800 tok) | 36.4 | 36.8 | 0.26 s | ~19.2 GB |
Dead-steady ~36.8 tok/s decode on a single card (332 W). Prefill ~277 tok/s.
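To put the prefill figure in long-context terms, here is a back-of-envelope sketch. It assumes the ~277 tok/s prefill rate stays constant at every context depth, which is an assumption rather than something measured above (prefill usually slows as the KV cache grows), so read the results as a lower bound on wait time. The function names are illustrative.

```python
PREFILL_TPS = 277.0   # measured prefill rate quoted above
DECODE_TPS = 36.8     # measured decode rate quoted above


def prefill_seconds(prompt_tokens: int, prefill_tps: float = PREFILL_TPS) -> float:
    """Seconds spent ingesting the prompt, assuming a constant prefill rate."""
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    return prompt_tokens / prefill_tps


def total_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Prefill time plus decode time for a single request."""
    return prefill_seconds(prompt_tokens) + output_tokens / DECODE_TPS


if __name__ == "__main__":
    for prompt in (1_000, 32_000, 100_000):
        print(f"{prompt:>7} prompt tokens: "
              f"{prefill_seconds(prompt):6.1f} s prefill, "
              f"{total_seconds(prompt, 1_000):6.1f} s with 1000 output tokens")
```

Under that constant-rate assumption, a cold 100K-token prompt costs about six minutes of prefill (100,000 / 277 ≈ 361 s) before the first output token, which is why prompt caching matters more than decode speed at this context size.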
Behavioral quality — 8-pack (benchlocal-cli, single-run, temp 0)
Reported with thinking OFF for now; the thinking-ON column is a queued re-test (Gemma-4 is token-efficient with reasoning enabled, so it's worth measuring both — unlike models where thinking-off is strictly better).
| Pack | Thinking OFF | Thinking ON |
|---|---|---|
| toolcall-15 | 12/15 (80%) | pending |
| instructfollow-15 | 13/15 (87%) | pending |
| structoutput-15 | 13/15 (87%) | pending |
| dataextract-15 | 11/15 (73%) | pending |
| reasonmath-15 | 13/15 (87%) | pending |
| bugfind-15 | 15/15 (100%) | pending |
| hermesagent-20 | 13/20 (65%) | pending |
| cli-40 | 19/40 (48%) | pending |
| TOTAL | 109/150 (73%) | pending |
aider-polyglot-30 to follow as well. All measured on a single RTX 3090.
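Since these are single-run scores, it helps to estimate how far a 150-item total can move by chance. A minimal sketch, assuming each item is an independent pass/fail trial with the same pass probability (a simplification: packs differ in difficulty, and temp-0 runs are not truly random draws). Under that assumption the spread is the binomial standard deviation.

```python
import math


def binomial_sd(passed: int, total: int) -> float:
    """Standard deviation of the pass count under a simple binomial model."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need total > 0 and 0 <= passed <= total")
    p = passed / total
    return math.sqrt(total * p * (1.0 - p))


def difference_sd(passed_a: int, passed_b: int, total: int) -> float:
    """SD of the gap between two independent runs over `total` items each."""
    return math.hypot(binomial_sd(passed_a, total), binomial_sd(passed_b, total))


if __name__ == "__main__":
    print(f"one run at 109/150: +/- {binomial_sd(109, 150):.1f} items")
    print(f"gap between two such runs: +/- {difference_sd(109, 109, 150):.1f} items")
```

For 109/150 this gives about ±5.5 items for one run and about ±7.7 for the gap between two runs, so single-digit differences between configurations are hard to separate from noise under this model.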
vLLM dual-3090 angle: Gemma-4-31B quality + a reasoning-parser gotcha
Complement to the llama.cpp/beellama findings above: here's the vLLM TP=2 side on the same 2× RTX 3090 (PCIe, no NVLink). Engine vLLM v0.22.0, Intel AutoRound INT4 weights + Google's
| Config | KV / ctx | KV pool | decode TPS (narr / code) | TTFT (narr / code) | VRAM/card |
|---|---|---|---|---|---|
| gemma-int8-mtp | int8-PTH / 262K | 447,199 tok | 106 / 140 | 62 / 76 ms | ~22.1 GB |
| gemma-bf16-mtp | bf16 / 131K | 195,079 tok | 119 / 152 | 74 / 71 ms | ~22.2 GB |
(decode TPS is per-token throughput — the same with thinking on/off; reasoning-on costs latency via longer generations, not throughput. 3 warm + 5 measured, canonical 800-word essay + quicksort, temp 0.6.)
8-pack quality (benchlocal-cli --full, verifier-backed)
| Pack | int8 reason-OFF | int8 reason-ON | bf16 reason-ON |
|---|---|---|---|
| ToolCall-15 | 13 | 13 | 14 |
| InstructFollow-15 | 8 | 14 | 14 |
| StructOutput-15 | 14 | 14 | 14 |
| DataExtract-15 | 10 | 9 | 11 |
| ReasonMath-15 | 14 | 12 | 12 |
| BugFind-15 | 13 | 15 | 15 |
| HermesAgent-20 | 14 | 12 | 12 |
| CLI-40 | 19 | 18 | 19 |
| TOTAL | 105 / 150 (70%) | 107 / 150 (71%) | 111 / 150 (74%) |
Takeaways
- Reasoning-on is ~a wash on aggregate (+2 to +6, inside the ±5–7 8-pack noise band) and costs real latency (ReasonMath/BugFind p95 roughly doubled), so thinking-OFF stays our default; reasoning-on is a clean opt-in once the parser is wired. The one consistent, sizeable gain is InstructFollow (8→14) — reasoning lets the model plan format/length constraints.
- bf16 vs int8 KV is near-identical on quality (bf16-reason 111 vs int8-reason 107 — within noise), so int8-PTH is effectively lossless here; pick by context need (262K vs 131K), not quality.
- Soak (continuous / Cliff-2b detector) on `gemma-int8-mtp`: PASS, with 0 errors, 0 silent-empty, 0 VRAM growth, 100% TPS retention, p50 88 TPS.
Configs shipped as vllm/gemma-int8-mtp / vllm/gemma-bf16-mtp.
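The context trade-off between the two configs follows from KV element size: int8 stores one byte per element and bf16 two, so roughly twice as many tokens should fit in the same memory. A small sketch comparing that expectation with the reported pools. It assumes "262K" and "131K" mean 262,144 and 131,072 tokens, which the table does not state.

```python
# Reported KV pools (tokens) and bytes per KV element for each config.
# The "ctx" values are assumed exact sizes for "262K" / "131K".
CONFIGS = {
    "gemma-int8-mtp": {"pool_tokens": 447_199, "bytes_per_elem": 1, "ctx": 262_144},
    "gemma-bf16-mtp": {"pool_tokens": 195_079, "bytes_per_elem": 2, "ctx": 131_072},
}


def expected_ratio(a: str, b: str) -> float:
    """Pool-size ratio a/b predicted from element size alone."""
    return CONFIGS[b]["bytes_per_elem"] / CONFIGS[a]["bytes_per_elem"]


def pool_ratio(a: str, b: str) -> float:
    """Pool-size ratio a/b as reported."""
    return CONFIGS[a]["pool_tokens"] / CONFIGS[b]["pool_tokens"]


def full_length_sequences(name: str) -> float:
    """How many max-context sequences fit in the pool at once."""
    cfg = CONFIGS[name]
    return cfg["pool_tokens"] / cfg["ctx"]


if __name__ == "__main__":
    a, b = "gemma-int8-mtp", "gemma-bf16-mtp"
    print(f"expected {expected_ratio(a, b):.2f}x, reported {pool_ratio(a, b):.2f}x")
    for name in CONFIGS:
        print(f"{name}: {full_length_sequences(name):.2f} full-length sequences")
```

The reported pools differ by about 2.29×, a little more than the 2× that element size alone predicts; the numbers here do not say where the extra headroom comes from. Either way, each pool holds between one and two maximum-length sequences, so the choice really is about context length.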
Dual-3090 beellama angle: Gemma-4-31B Q8 + DFlash at 192K
Complement to the single-card beellama numbers above: same engine (beellama v0.3.0 DFlash + windowed KV), but dual 3090 (TP via layer-split) and the Q8_K_XL target, so quality is the focus rather than fitting one card.
Serving: beellama v0.3.0, dual 3090 (layer-split, PCIe no-NVLink)
(bench.sh n=5. 192K is the robust ceiling on dual Q8: 262K OOMs even when balanced, because Q8 is 33 GB and Gemma's full-attn layers grow KV with ctx; Qwen3.6 DeltaNet reaches 262K, Gemma doesn't. The DFlash draft pins to one card (
Quality: 8-pack (
| Pack | thinking ON | thinking OFF |
|---|---|---|
| toolcall-15 | 13/15 | 13/15 |
| instructfollow-15 | 14/15 | 14/15 |
| structoutput-15 | 14/15 | 14/15 |
| dataextract-15 | 11/15 | 11/15 |
| reasonmath-15 | 13/15 | 14/15 |
| bugfind-15 | 15/15 | 15/15 |
| hermesagent-20 | 13/20 | 13/20 |
| cli-40 | 20/40 | 19/40 |
| TOTAL (8-pack) | 113/150 (75%) | 113/150 (75%) |
Takeaways
- Thinking on vs off is an exact tie (113 = 113) — six packs identical, the two that move (reasonmath, cli-40) cancel. Same conclusion as the Qwen thread (When Should a Local Qwen3.6-27B Think? I Measured It. #221): reasoning-on buys no measurable quality on this stack → thinking-off is the right default (fewer tokens, higher TPS).
- DFlash is code-good / prose-net-negative on v0.3.0: code acceptance ~0.5 (→ 65 TPS), but prose acceptance collapsed to ~0.08 (→ narr 36 ≈ the no-spec baseline). This is the known v0.3.0-wide DFlash-on-prose regression (Qwen3.6 DeltaNet — no SWA — shows the identical collapse, so it's not SWA-specific); Anbeeld is actively fixing it.
- Strengths: bugfind 15/15, dataextract 11/15 (73%) — notably cleaner JSON numeric-typing than Qwen3.6-27B (which sits at 53% on the same pack). Weak: cli-40 agentic (~48%, same as Qwen).
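The link between acceptance rate and throughput in the DFlash takeaway can be illustrated with the standard speculative-decoding estimate: if each drafted token is accepted independently with probability α and γ tokens are drafted per step, one target pass yields (1 − α^(γ+1)) / (1 − α) tokens on average. The sketch below is a generic model, not beellama's implementation. It assumes the reported "acceptance" is this per-token rate, and the draft length (8) and the lumped draft overhead (10% of a target pass) are illustrative guesses, not measured values.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Mean tokens emitted per target forward pass with i.i.d. acceptance."""
    if not 0.0 <= alpha <= 1.0 or gamma < 0:
        raise ValueError("need 0 <= alpha <= 1 and gamma >= 0")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int = 8,
                      draft_overhead: float = 0.10) -> float:
    """Speedup over plain decoding; draft_overhead is a hypothetical
    per-step drafting cost expressed as a fraction of one target pass."""
    return expected_tokens_per_step(alpha, gamma) / (1.0 + draft_overhead)


if __name__ == "__main__":
    for label, alpha in (("code", 0.5), ("prose", 0.08)):
        print(f"{label}: alpha={alpha} -> {estimated_speedup(alpha):.2f}x")
```

With these placeholder values the model gives roughly 1.8× at α = 0.5 and just under 1× at α = 0.08, the same direction as the code and prose numbers in the takeaway: once acceptance collapses, the drafting overhead eats whatever the few accepted tokens save.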
TL;DR
Running Gemma-4-31B at 100K+ context on a single 24 GB GPU turned out to be mostly a build-configuration problem, not a fundamental one. Gemma-4's global-attention layers use head_dim=512, and the prebuilt `ggml-org/llama.cpp` image doesn't compile the quantized-KV FlashAttention kernels for it → ~12 tok/s. Built from source with `-DGGML_CUDA_FA_ALL_QUANTS=ON`, mainline llama.cpp runs Gemma-4 (q5/q4 KV) at 36.5 tok/s with full big context, no fork. beellama.cpp adds a DFlash spec-decode topping (54→193 tok/s, workload-dependent) but isn't required for the base capability.
The core problem (and the nuance that fixes it)
Gemma-4 interleaves sliding-window attention (head_dim=256) with global attention (head_dim=512), ~5:1. Two consequences:
- […] `TRITON_ATTN` path on a 3090.
- […] `fattn-tile-dkq512`), but the quantized-KV instances are gated behind `GGML_CUDA_FA_ALL_QUANTS` (default OFF), which the prebuilt image doesn't compile. Build from source with the flag → fast.
Engine comparison: single RTX 3090, 24 GB, Gemma-4-31B (~Q4/INT4)
[…] `FA_ALL_QUANTS`)
The supported path: mainline llama.cpp + `FA_ALL_QUANTS`
The official `ggml-org/llama.cpp:server-cuda` image omits one compile flag. Build from source with it:
Measured on a single 3090, Gemma-4-31B q5_0/q4_1 KV: 36.5 tok/s, 3× the stock image's 12, with full big context, no fork. The spec-decode topping (MTP, +>2× on dense) is in flight upstream: ggml-org/llama.cpp#23398 (Gemma-4 MTP, WIP); #22527 (FA+SWA crash) also open.
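The build-from-source step can be sketched as a small helper that assembles the commands. `GGML_CUDA_FA_ALL_QUANTS` and the q5_0/q4_1 KV types come from the post; `-DGGML_CUDA=ON` and `CMAKE_CUDA_ARCHITECTURES=86` (sm_86, the 3090) are the usual llama.cpp CUDA build options, and the model path and context size are hypothetical placeholders. Treat this as a sketch under those assumptions, not the author's exact command; flash attention may also need to be switched on explicitly depending on the llama.cpp version.

```python
import shlex


def configure_args(build_dir: str = "build", cuda_arch: str = "86") -> list[str]:
    """CMake configure step with the quantized-KV FlashAttention kernels enabled."""
    return [
        "cmake", "-B", build_dir,
        "-DGGML_CUDA=ON",
        "-DGGML_CUDA_FA_ALL_QUANTS=ON",            # the flag the prebuilt image omits
        f"-DCMAKE_CUDA_ARCHITECTURES={cuda_arch}",  # 86 = sm_86 (RTX 3090)
    ]


def build_args(build_dir: str = "build") -> list[str]:
    """CMake build step."""
    return ["cmake", "--build", build_dir, "--config", "Release", "-j"]


def server_args(model_path: str, ctx_size: int, build_dir: str = "build") -> list[str]:
    """llama-server launch with q5_0 keys and q4_1 values in the KV cache."""
    return [
        f"{build_dir}/bin/llama-server",
        "-m", model_path,
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", "99",
        "--cache-type-k", "q5_0",
        "--cache-type-v", "q4_1",
    ]


if __name__ == "__main__":
    # Hypothetical model path and context size, for illustration only.
    steps = (configure_args(), build_args(),
             server_args("models/gemma-4-31b-Q4_K_S.gguf", 131072))
    for argv in steps:
        print(shlex.join(argv))  # print rather than run, so nothing is built by accident
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; pass each list to `subprocess.run(argv, check=True)` inside a llama.cpp checkout to actually build.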
beellama.cpp — the "available now" speed topping (measured)
beellama.cpp (@Anbeeld) bundles the same FA_ALL_QUANTS base + DFlash speculative decoding + windowed KV. Its base speed = mainline+FA_ALL_QUANTS; its edge is the DFlash topping until #23398 lands.
Config: unsloth Q4_K_S target + Anbeeld/gemma-4-31B-it-DFlash-GGUF IQ4_XS drafter, `--spec-type dflash`, `q5_0`/`q4_1` KV, built from source (sm_86).
Single-card context ceiling (q5_0/q4_1 KV, drafter loaded):
`--ctx-size` […]
Decode: workload-dependent (DFlash is lossless; acceptance scales with predictability):
Net: 131–200K context on a single 3090 at 54–193 tok/s, lossless, coherent. Caveat: deep fork chain (llama.cpp → TheTom/turboquant → buun-llama → beellama), no prebuilt Docker, single maintainer — community/experimental (the author notes v0.3.0 with an upstream rebase + prebuilt Docker/CI is in progress).
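Much of this context headroom comes from the q5_0/q4_1 KV cache types. As a rough sketch of why: ggml's legacy quant formats pack 32 elements per block with a small header, so the effective bits per element follow from the block size. The byte counts below are taken from ggml's block definitions and are worth re-checking against the source; the calculation assumes K and V hold the same number of elements per token and ignores any other per-token overhead.

```python
# cache type -> (elements per block, bytes per block)
BLOCK_LAYOUT = {
    "f16": (1, 2),
    "q8_0": (32, 34),
    "q5_1": (32, 24),
    "q5_0": (32, 22),
    "q4_1": (32, 20),
    "q4_0": (32, 18),
}


def bits_per_element(cache_type: str) -> float:
    """Effective storage cost per KV element, block header included."""
    elems, nbytes = BLOCK_LAYOUT[cache_type]
    return nbytes * 8 / elems


def context_gain_vs_f16(type_k: str, type_v: str) -> float:
    """How many times more tokens fit in a fixed KV budget than with f16/f16."""
    f16_bits = 2 * bits_per_element("f16")
    return f16_bits / (bits_per_element(type_k) + bits_per_element(type_v))


if __name__ == "__main__":
    print(f"q5_0: {bits_per_element('q5_0')} bits, q4_1: {bits_per_element('q4_1')} bits")
    print(f"q5_0/q4_1 vs f16: {context_gain_vs_f16('q5_0', 'q4_1'):.2f}x tokens per GB")
```

q5_0 keys plus q4_1 values average 5.25 bits per element against 16 for f16, so about 3× as many tokens fit in the same KV budget.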
Speed: `bench.sh`, n=5, reasoning ON (standard-protocol anchor; the sweep above shows the wider DFlash range):
Prefill ~855 tok/s. Versus the mainline supported path (36.8 tok/s on both workloads): 1.28× narrative, 2.40× code. That is the DFlash topping, quantified.
Quality — 8-pack, same beellama engine, reasoning OFF vs ON (benchlocal-cli, temp 0):
Reasoning is net-positive on Gemma-4 (+5) — gains concentrate on agentic/tool work (toolcall, hermesagent, cli all +2), only n=15 noise elsewhere. (Opposite of Qwen, where thinking hurt coding.) And beellama reasoning-OFF = 109/150 lands exactly on the mainline thinking-OFF number (see the llama.cpp comment below) — empirical confirmation that DFlash is quality-neutral at temp 0, i.e. the 2.4×-on-code speedup is free.
Upstream tracking
[…] `fattn-tile-dkq512`), gated by `FA_ALL_QUANTS`.
Recommendation
[…] `FA_ALL_QUANTS`: big ctx + 36.5 tok/s, no fork. ⭐
All numbers measured on a single RTX 3090. Corrections welcome.