Gemma-4-31B at long context on a single 24 GB GPU: why it's hard, and what actually works #239

noonghunna (Maintainer) · 2026-05-27T04:44:38Z

TL;DR

Running Gemma-4-31B at 100K+ context on a single 24 GB GPU turned out to be mostly a build-configuration problem, not a fundamental one. Gemma-4's global-attention layers use head_dim=512, and the prebuilt ggml-org/llama.cpp image doesn't compile the quantized-KV FlashAttention kernels for it, so decode drops to ~12 tok/s. Built from source with -DGGML_CUDA_FA_ALL_QUANTS=ON, mainline llama.cpp runs Gemma-4 (q5_0/q4_1 KV) at 36.5 tok/s with full long context, no fork needed. beellama.cpp adds a DFlash spec-decode topping (54 to 193 tok/s, workload-dependent) but isn't required for the base capability.

The core problem (and the nuance that fixes it)

Gemma-4 interleaves sliding-window attention (SWA, head_dim=256) with global attention (head_dim=512) at roughly 5:1. Two consequences:

1. KV cache: SWA-aware allocation caps the SWA layers (most of the stack) at the 1024-token window, so only the global layers grow with context and long context fits in modest VRAM (mainline GGML already does this).
2. Attention speed: head_dim=512 needs a FlashAttention kernel that supports it. Two different FlashAttention implementations matter here:
   - Dao-AILab FA (used by vLLM / SGLang): the head_dim=512 PRs are Hopper/Blackwell-only, with no Ampere support, so vLLM/SGLang fall back to the slow TRITON_ATTN path on a 3090.
   - GGML's own FA (llama.cpp): it does have a head_dim=512 kernel (fattn-tile-dkq512), but the quantized-KV instances are gated behind GGML_CUDA_FA_ALL_QUANTS (default OFF), which the prebuilt image doesn't enable. Building from source with the flag restores the fast path.
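
To make the KV-cache point concrete, here is a rough sizing sketch in Python. The layer counts and KV-head count are placeholders picked for illustration (the post only gives the head dims, the ~5:1 ratio, and the 1024-token window), so read the output as the shape of the effect, not as Gemma-4's real footprint. The bits-per-element values follow from GGML's q5_0 and q4_1 block layouts.

```python
# Rough KV-cache sizing: SWA layers are capped at the window, global layers grow with context.
# ASSUMPTIONS (not from the post): the layer counts and KV-head count are illustrative placeholders.

Q5_0_BITS = 5.5  # GGML q5_0: 22 bytes per 32-element block
Q4_1_BITS = 5.0  # GGML q4_1: 20 bytes per 32-element block


def layer_kv_bytes(tokens: int, kv_heads: int, head_dim: int,
                   k_bits: float = Q5_0_BITS, v_bits: float = Q4_1_BITS) -> float:
    """Bytes of K plus V cache that one layer holds for `tokens` cached positions."""
    elems = tokens * kv_heads * head_dim
    return elems * (k_bits + v_bits) / 8.0


def kv_cache_gib(ctx: int, swa_aware: bool, *,
                 n_swa_layers: int = 50,     # hypothetical
                 n_global_layers: int = 10,  # hypothetical, keeps the ~5:1 ratio
                 kv_heads: int = 8,          # hypothetical
                 window: int = 1024) -> float:
    """Total KV cache in GiB, with or without capping SWA layers at the window."""
    swa_tokens = min(ctx, window) if swa_aware else ctx
    total = (n_swa_layers * layer_kv_bytes(swa_tokens, kv_heads, 256)
             + n_global_layers * layer_kv_bytes(ctx, kv_heads, 512))
    return total / 2**30


if __name__ == "__main__":
    for ctx in (32_768, 131_072, 200_000):
        full = kv_cache_gib(ctx, swa_aware=False)
        capped = kv_cache_gib(ctx, swa_aware=True)
        print(f"ctx={ctx:>7}: full KV {full:6.2f} GiB, SWA-aware {capped:6.2f} GiB")
```

With SWA-aware allocation the SWA term stops growing once the context exceeds the window, which is why an engine that allocates full KV for every layer hits a context wall much earlier on the same card.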

Engine comparison — single RTX 3090, 24 GB, Gemma-4-31B (~Q4/INT4)

Engine	Windowed KV (big ctx)	Fast head_dim=512	Single-card result
llama.cpp (mainline, built w/ `FA_ALL_QUANTS`)	✅	✅	big ctx + 36.5 tok/s — supported, no fork ⭐
llama.cpp (stock prebuilt image)	✅	❌ flag off	big ctx, ~12 tok/s
vLLM	✅ hybrid allocator	❌ Dao-AILab FA → slow TRITON_ATTN	dual-card great (86–145 @ 262K); single-card quant-KV needs open PR #40391
ik_llama	❌ full KV → ~24K wall	✅ + Google MTP drafter (~1.7×)	fast but ctx-capped
SGLang	(same Dao-AILab FA wall)	❌	untested; same head_dim=512 issue
beellama.cpp	✅	✅	131–200K ctx, 54–193 tok/s w/ DFlash, lossless (community/experimental)

The supported path: mainline llama.cpp + `FA_ALL_QUANTS`

The official ggml-org/llama.cpp:server-cuda image omits one compile flag. Build from source with it:

cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES=86 ...

Measured on a single 3090 with Gemma-4-31B and q5_0/q4_1 KV: 36.5 tok/s, about 3× the stock image's ~12 tok/s, at full long context and with no fork. The spec-decode topping (MTP, more than 2× on dense models) is in flight upstream: ggml-org/llama.cpp#23398 (Gemma-4 MTP, WIP). #22527 (FA+SWA crash) is also still open.

beellama.cpp — the "available now" speed topping (measured)

beellama.cpp (@Anbeeld) bundles the same FA_ALL_QUANTS base + DFlash speculative decoding + windowed KV. Its base speed = mainline+FA_ALL_QUANTS; its edge is the DFlash topping until #23398 lands.

Config: unsloth Q4_K_S target + Anbeeld/gemma-4-31B-it-DFlash-GGUF IQ4_XS drafter, --spec-type dflash, q5_0/q4_1 KV, built from source (sm_86).

Single-card context ceiling (q5_0/q4_1 KV, drafter loaded):

`--ctx-size`	VRAM	Status
131K	21.8 GB	comfortable
163K	22.6 GB	~2 GB headroom
200K	23.65 GB	boots (redline)
262K	—	OOM
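
One way to sanity-check the 262K OOM row is to fit a line through the three measured points and extrapolate. This is only a sketch: it assumes VRAM grows roughly linearly with `--ctx-size`, and it treats the card as a flat 24 GB with no allowance for driver or display overhead.

```python
# Least-squares line through the three measured (ctx, VRAM) points from the table,
# extrapolated to 262K. ASSUMPTION: VRAM grows roughly linearly with --ctx-size.

measured = [(131, 21.8), (163, 22.6), (200, 23.65)]  # (ctx in K tokens, VRAM in GB)
CARD_GB = 24.0  # nominal card size; real usable VRAM is somewhat lower


def fit_line(points):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(measured)


def predict_vram_gb(ctx_k: float) -> float:
    """Predicted VRAM in GB at a context of `ctx_k` thousand tokens."""
    return intercept + slope * ctx_k


if __name__ == "__main__":
    print(f"~{slope * 1000:.1f} MB per 1K tokens of context")
    print(f"262K -> ~{predict_vram_gb(262):.1f} GB (card has {CARD_GB:.0f} GB)")
```

The extrapolated 262K figure lands above the card's 24 GB, which is consistent with the OOM in the table.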

Decode — workload-dependent (DFlash is lossless; acceptance scales with predictability):

Workload	TPS	DFlash accept
Prose (essay)	54	~20%
Code (LRU cache)	67	~19%
Predictable (numbers)	182	~75%
Boilerplate (repeat)	193	~80%
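
Why acceptance drives throughput so strongly can be seen with the textbook speculative-decoding estimate. This is a generic model, not DFlash's actual scheme: it assumes each drafted token is accepted independently with probability `accept` and that the drafter proposes a fixed `draft_len` tokens per step. The post does not say how its "DFlash accept" percentage is defined or what draft length is used, so the draft length below is a placeholder.

```python
def expected_tokens_per_step(accept: float, draft_len: int) -> float:
    """Expected tokens emitted per target forward pass.

    Counts the accepted draft prefix plus the one token the target model
    always contributes. Generic estimate; not taken from the DFlash code.
    """
    if not 0.0 <= accept <= 1.0:
        raise ValueError("accept must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept ** (draft_len + 1)) / (1.0 - accept)


if __name__ == "__main__":
    DRAFT_LEN = 8  # hypothetical; the post does not state the draft length
    for accept in (0.20, 0.75, 0.80):
        tokens = expected_tokens_per_step(accept, DRAFT_LEN)
        print(f"accept={accept:.2f}: ~{tokens:.2f} tokens per target pass")
```

The estimate ignores drafter cost, so it is an upper bound on the speedup; it still shows why low-acceptance prose gains little while highly predictable output gains a lot.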

Net: 131–200K context on a single 3090 at 54–193 tok/s, lossless, coherent. Caveat: deep fork chain (llama.cpp → TheTom/turboquant → buun-llama → beellama), no prebuilt Docker, single maintainer — community/experimental (the author notes v0.3.0 with an upstream rebase + prebuilt Docker/CI is in progress).

Speed — bench.sh, n=5, reasoning ON (standard-protocol anchor; the sweep above shows the wider DFlash range):

Workload	Decode TPS	TTFT	VRAM @ 100K
Narrative	47.1 (CV 3.2%)	104 ms	~21.8 GB
Code	88.4 (CV 6.1%)	105 ms	~21.8 GB

Prefill ~855 tok/s. Against the mainline supported path (36.8 tok/s on both workloads), that is 1.28× on narrative and 2.40× on code: the DFlash topping, quantified.
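
The ratios are plain division over the decode numbers reported in this thread; a short sketch for anyone re-running the comparison with their own measurements:

```python
# Decode throughput figures as reported in this thread (tok/s).
MAINLINE_DECODE_TPS = 36.8  # mainline llama.cpp + FA_ALL_QUANTS, both workloads
BEELLAMA_DECODE_TPS = {"narrative": 47.1, "code": 88.4}  # beellama + DFlash, bench.sh

# Headline figures for the stock prebuilt image vs. the from-source build.
STOCK_IMAGE_TPS = 12.0
SOURCE_BUILD_TPS = 36.5

dflash_speedup = {
    workload: tps / MAINLINE_DECODE_TPS
    for workload, tps in BEELLAMA_DECODE_TPS.items()
}
build_flag_speedup = SOURCE_BUILD_TPS / STOCK_IMAGE_TPS

if __name__ == "__main__":
    for workload, ratio in dflash_speedup.items():
        print(f"DFlash vs mainline, {workload}: {ratio:.2f}x")
    print(f"FA_ALL_QUANTS build vs stock image: {build_flag_speedup:.2f}x")
```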

Quality — 8-pack, same beellama engine, reasoning OFF vs ON (benchlocal-cli, temp 0):

Pack	reasoning OFF	reasoning ON
toolcall-15	12/15	14/15
instructfollow-15	14/15	14/15
structoutput-15	13/15	13/15
dataextract-15	11/15	11/15
reasonmath-15	13/15	12/15
bugfind-15	14/15	14/15
hermesagent-20	13/20	15/20
cli-40	19/40	21/40
TOTAL	109/150 (73%)	114/150 (76%)

Reasoning is net-positive on Gemma-4 (+5): the gains concentrate on agentic/tool work (toolcall, hermesagent, cli all +2), with only single-item n=15 noise elsewhere. (Opposite of Qwen, where thinking hurt coding.) The beellama reasoning-OFF total of 109/150 also matches the mainline thinking-OFF total (see the llama.cpp comment below), although instructfollow and bugfind each differ by one item in opposite directions. That is consistent with DFlash being quality-neutral at temp 0, i.e. the 2.4×-on-code speedup comes at no measured quality cost.
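
For anyone cross-checking the table, the totals and per-pack deltas can be recomputed directly from the reported scores:

```python
# (pack, max score, reasoning OFF, reasoning ON), copied from the table above.
RESULTS = [
    ("toolcall-15", 15, 12, 14),
    ("instructfollow-15", 15, 14, 14),
    ("structoutput-15", 15, 13, 13),
    ("dataextract-15", 15, 11, 11),
    ("reasonmath-15", 15, 13, 12),
    ("bugfind-15", 15, 14, 14),
    ("hermesagent-20", 20, 13, 15),
    ("cli-40", 40, 19, 21),
]

total_max = sum(max_score for _, max_score, _, _ in RESULTS)
total_off = sum(off for _, _, off, _ in RESULTS)
total_on = sum(on for _, _, _, on in RESULTS)

# Per-pack change when reasoning is switched on.
deltas = {name: on - off for name, _, off, on in RESULTS}

if __name__ == "__main__":
    print(f"OFF {total_off}/{total_max} ({total_off / total_max:.0%})")
    print(f"ON  {total_on}/{total_max} ({total_on / total_max:.0%})")
    for name, delta in deltas.items():
        if delta:
            print(f"{name}: {delta:+d}")
```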

Upstream tracking

GGML FA head_dim=512: already in mainline (fattn-tile-dkq512), gated by FA_ALL_QUANTS.
Gemma-4 MTP (spec-dec): ggml-org/llama.cpp#23398 (WIP); base MTP mechanism #22673 merged.
Dao-AILab FA head_dim=512 (vLLM/SGLang): #2467 (Hopper), #2447 (Blackwell) — no Ampere.
vLLM per-head KV: #40391 (open).

Recommendation

Single-card Gemma-4, supported: mainline llama.cpp built with FA_ALL_QUANTS — big ctx + 36.5 tok/s, no fork. ⭐
Want the DFlash speed topping now: beellama.cpp (community/experimental).
Dual-card: vLLM (Marlin INT4 + per-head KV).
Note: the "FlashAttention won't fix Ampere" caveat is about Dao-AILab FA (vLLM/SGLang) — GGML's own FA does head_dim=512; on llama.cpp it's a compile flag, not a fork.

All numbers measured on a single RTX 3090. Corrections welcome.

Anbeeld · 2026-05-27T09:09:38Z


Thank you for trying it out. I'm currently cooking up BeeLlama v0.3.0 with upstream llama.cpp rebase (MTP support and more) and proper setup of GitHub CI/CD for prebuilts and Docker images. Stay tuned!

2 replies

noonghunna May 27, 2026
Maintainer Author

I'm on it matey, you are a star! thank you!

JDWarner May 27, 2026

I've found on vLLM that the official Google assistant MTP is extremely good at 7 predicted tokens and, importantly, it's 2GB smaller than Z-lab's DFlash drafter. Really excited to see if that extra 2GB can push up to full context on a single card!

noonghunna (Maintainer, Author) · 2026-05-27T09:22:40Z

Second model on beellama.cpp: Qwen3.6-27B — single RTX 3090

Companion to the Gemma-4 beellama numbers above — same engine (beellama DFlash + windowed KV), different model. Config: unsloth Qwen3.6-27B Q5_K_S + Anbeeld/Qwen3.6-27B-DFlash-GGUF IQ4_XS drafter, --spec-type dflash, q5_0/q4_1 KV, 100K ctx, built from source (FA_ALL_QUANTS, sm_86).

Notable: DFlash sidesteps the DeltaNet KV-rollback that blocks built-in spec-dec for Qwen3-Next — first working spec-dec on the hybrid arch on our rig.

Bench: narrative 49.5 / code 102.1 TPS, VRAM 21.9 GB @ 100K. (Re-confirmed 2026-05-30: 50.2 / 99.7 TPS, 22.1 GB @ 102K.)

8-pack quality — reasoning OFF vs ON (benchlocal-cli, temp 0; same-session re-run 2026-05-30):

Pack	reasoning OFF	reasoning ON
toolcall-15	13/15 (87%)	14/15 (93%)
instructfollow-15	14/15 (93%)	14/15 (93%)
structoutput-15	14/15 (93%)	14/15 (93%)
dataextract-15	8/15 (53%)	10/15 (67%)
reasonmath-15	12/15 (80%)	13/15 (87%)
bugfind-15	14/15 (93%)	12/15 (80%)
hermesagent-20	14/20 (70%)	14/20 (70%)
cli-40	18/40 (45%)	22/40 (55%)
TOTAL	107/150 (71%)	113/150 (75%)

Reasoning is a slight net-positive (+6) on Qwen here. The gains concentrate on cli-40 (+4) and dataextract (+2), partly offset by bugfind (-2). That is a similar shape and size to the Gemma-4 result above (+5), and it sits within the ±5–7 8-pack noise band (n=1). Neither setting falls below base Qwen3.6-27B (~99–103 in the same harness), as expected given that DFlash is lossless.
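
A back-of-envelope way to see why a +6 swing on 150 items is hard to distinguish from noise is the binomial standard deviation of a total score. This is a simplification and not the method behind the ±5–7 band quoted above: it assumes every item is an independent pass/fail trial with the same pass probability, which is not strictly true when both runs share the same items.

```python
import math


def binomial_sd(n_items: int, passes: int) -> float:
    """Standard deviation of the pass count if each item passes independently."""
    p = passes / n_items
    return math.sqrt(n_items * p * (1.0 - p))


def diff_sd(n_items: int, passes_a: int, passes_b: int) -> float:
    """Standard deviation of the difference between two independent runs."""
    return math.hypot(binomial_sd(n_items, passes_a), binomial_sd(n_items, passes_b))


if __name__ == "__main__":
    off, on, n = 107, 113, 150  # Qwen3.6-27B totals from the table above
    print(f"sd(OFF total) ~ {binomial_sd(n, off):.1f} items")
    print(f"sd(ON total)  ~ {binomial_sd(n, on):.1f} items")
    print(f"sd(ON - OFF)  ~ {diff_sd(n, off, on):.1f} items vs observed {on - off:+d}")
```

Under these assumptions a single run's total moves by roughly five to six items on its own, so a six-item difference between two single runs is within one standard deviation.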

aider-polyglot-30 (reasoning-on): 20/30 (67%) in ~1h51m, versus the thinking-off baseline of 19/30 (63%) in ~21 min. Reasoning-on is therefore +1 (a tie within noise) at ~5× the wall-clock (111 min vs 21 min). So the small 8-pack bump doesn't translate into a real coding-agent gain, and thinking-off is the right config for Qwen coding agents.

Caveat: beellama is a deep fork chain with no prebuilt Docker — community/experimental.


noonghunna (Maintainer, Author) · 2026-05-27T12:09:15Z

Gemma-4-31B on mainline llama.cpp + FA_ALL_QUANTS — single RTX 3090

Detailed numbers for the supported, no-fork path from the main post. Config: gemma-4-31B-it Q4_K_S, --cache-type-k q5_0 --cache-type-v q4_1, -c 65536, -fa on, -ngl 999, single 3090 — built from source with -DGGML_CUDA_FA_ALL_QUANTS=ON (sm_86, CUDA 12.8). (Mainline does SWA-windowed KV, so context scales to the 131–200K range like beellama; 65K used here for the bench.)
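
For reproduction, the flags above can be assembled into a launch command like this. It is a sketch under assumptions: the post does not name the binary, so `llama-server` is a guess based on the TTFT measurements, and the model path is a placeholder.

```python
import shlex


def llama_server_argv(model_path: str, ctx: int = 65536) -> list[str]:
    """Build a launch command from the flags listed in the config above.

    ASSUMPTIONS: `llama-server` as the binary and `-m` for the model path are
    not stated in the post; every other flag is copied from it.
    """
    return [
        "llama-server",
        "-m", model_path,
        "--cache-type-k", "q5_0",   # quantized K cache
        "--cache-type-v", "q4_1",   # quantized V cache
        "-c", str(ctx),             # context size used for the bench
        "-fa", "on",                # FlashAttention on
        "-ngl", "999",              # offload all layers to the GPU
    ]


if __name__ == "__main__":
    # Hypothetical path; point it at your own GGUF file.
    print(shlex.join(llama_server_argv("/models/gemma-4-31B-it-Q4_K_S.gguf")))
```

The quantized cache types only hit the fast head_dim=512 path when the binary was built with -DGGML_CUDA_FA_ALL_QUANTS=ON, as described in the main post.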

Speed — `bench.sh`, n=5 (warmup 3 + 5 measured), CV 0.1%

Workload	Wall TPS	Decode TPS	TTFT	VRAM (1× 3090)
Narrative (1000 tok)	36.4	36.8	0.30 s	~19.2 GB
Code (800 tok)	36.4	36.8	0.26 s	~19.2 GB

Dead-steady ~36.8 tok/s decode on a single card (332 W). Prefill ~277 tok/s.

Behavioral quality — 8-pack (benchlocal-cli, single-run, temp 0)

Reported with thinking OFF for now; the thinking-ON column is a queued re-test (Gemma-4 is token-efficient with reasoning enabled, so it's worth measuring both — unlike models where thinking-off is strictly better).

Pack	Thinking OFF	Thinking ON
toolcall-15	12/15 (80%)	pending
instructfollow-15	13/15 (87%)	pending
structoutput-15	13/15 (87%)	pending
dataextract-15	11/15 (73%)	pending
reasonmath-15	13/15 (87%)	pending
bugfind-15	15/15 (100%)	pending
hermesagent-20	13/20 (65%)	pending
cli-40	19/40 (48%)	pending
TOTAL	109/150 (73%)	pending

aider-polyglot-30 to follow as well. All measured on a single RTX 3090.


noonghunna (Maintainer, Author) · 2026-05-31T23:35:50Z

vLLM dual-3090 angle: Gemma-4-31B quality + a reasoning-parser gotcha

Complement to the llama.cpp/beellama findings above — here's the vLLM TP=2 side on the same 2× RTX 3090 (PCIe, no NVLink). Engine vLLM v0.22.0, Intel AutoRound INT4 weights + Google's gemma-4-31B-it-assistant MTP drafter (n=4). Two shipped dual configs:

gemma-int8-mtp — int8_per_token_head KV (vLLM PR #40391, rebased), 262K ctx
gemma-bf16-mtp — bf16 KV, 131K ctx (no KV overlay)

⚠️ The reasoning gotcha (this is why the llama.cpp path looked better at reasoning-on)

vLLM ships Gemma4ReasoningParser but it's off unless you pass --reasoning-parser gemma4. Without it, enable_thinking leaves Gemma's <|channel>thought…<channel|> trace in the response content → JSON/structured outputs fail to parse (invalid_json): DataExtract collapsed 10→0, StructOutput 14→8. Wiring --reasoning-parser gemma4 routes the trace into reasoning_content, leaving a clean answer → fully recovers. It's a vLLM config gap, not a model limit (and it's safe with thinking off — the parser no-ops when no channel tokens are present).
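
The failure mode is easy to reproduce offline. The sketch below is not vLLM's Gemma4ReasoningParser; it is a minimal stand-in that uses the channel delimiters exactly as they are spelled in this post (which may not match the real token strings) to show why the raw content fails JSON parsing and the separated content succeeds.

```python
import json

# Delimiters copied from the post as written; treat the exact spelling as an assumption.
OPEN = "<|channel>"
CLOSE = "<channel|>"


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, content). A no-op when no channel tokens are present."""
    start = text.find(OPEN)
    if start == -1:
        return "", text
    end = text.find(CLOSE, start + len(OPEN))
    if end == -1:
        return "", text
    reasoning = text[start + len(OPEN):end]
    content = text[:start] + text[end + len(CLOSE):]
    return reasoning.strip(), content.strip()


def is_valid_json(text: str) -> bool:
    """True when the whole string parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Hypothetical model output with the thinking trace left in the content field.
raw = '<|channel>thought\nThe user wants an object with a total.<channel|>{"total": 42}'

if __name__ == "__main__":
    reasoning, content = split_reasoning(raw)
    print("raw parses:    ", is_valid_json(raw))
    print("content parses:", is_valid_json(content))
```

In production the server-side parser is the right fix, since it routes the trace into reasoning_content before any client sees it.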

Serving (2× 3090, vLLM v0.22.0, TP=2, MTP n=4)

Config	KV / ctx	KV pool	decode TPS (narr / code)	TTFT (narr / code)	VRAM/card
`gemma-int8-mtp`	int8-PTH / 262K	447,199 tok	106 / 140	62 / 76 ms	~22.1 GB
`gemma-bf16-mtp`	bf16 / 131K	195,079 tok	119 / 152	74 / 71 ms	~22.2 GB

(decode TPS is per-token throughput — the same with thinking on/off; reasoning-on costs latency via longer generations, not throughput. 3 warm + 5 measured, canonical 800-word essay + quicksort, temp 0.6.)
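
The latency-versus-throughput distinction can be put into numbers with a simple model: response time is TTFT plus generated tokens divided by decode TPS. The TTFT and TPS values below come from the `gemma-int8-mtp` narrative row; the token counts are made-up placeholders, since the post does not report how many thinking tokens Gemma-4 emits.

```python
def response_latency_s(ttft_ms: float, decode_tps: float, generated_tokens: int) -> float:
    """End-to-end time for one response: time to first token plus decode time."""
    if decode_tps <= 0:
        raise ValueError("decode_tps must be positive")
    return ttft_ms / 1000.0 + generated_tokens / decode_tps


if __name__ == "__main__":
    TTFT_MS, DECODE_TPS = 62.0, 106.0  # gemma-int8-mtp, narrative workload
    ANSWER_TOKENS = 400                # hypothetical answer length
    THINKING_TOKENS = 1200             # hypothetical reasoning trace length

    off = response_latency_s(TTFT_MS, DECODE_TPS, ANSWER_TOKENS)
    on = response_latency_s(TTFT_MS, DECODE_TPS, ANSWER_TOKENS + THINKING_TOKENS)
    print(f"thinking off: {off:.1f} s, thinking on: {on:.1f} s at the same decode TPS")
```

Decode TPS is identical in both cases; only the number of generated tokens changes, which is why reasoning shows up in p95 latency rather than in the throughput column.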

8-pack quality (benchlocal-cli `--full`, verifier-backed)

Pack	int8 reason-OFF	int8 reason-ON	bf16 reason-ON
ToolCall-15	13	13	14
InstructFollow-15	8	14	14
StructOutput-15	14	14	14
DataExtract-15	10	9	11
ReasonMath-15	14	12	12
BugFind-15	13	15	15
HermesAgent-20	14	12	12
CLI-40	19	18	19
TOTAL	105 / 150 (70%)	107 / 150 (71%)	111 / 150 (74%)

Takeaways

Reasoning-on is roughly a wash on aggregate (+2 on int8 KV, +6 when bf16 reason-ON is compared against int8 reason-OFF, both inside the ±5–7 8-pack noise band) and costs real latency (ReasonMath/BugFind p95 roughly doubled), so thinking-OFF stays our default; reasoning-on is a clean opt-in once the parser is wired. The one consistent, sizeable gain is InstructFollow (8→14): reasoning lets the model plan format/length constraints.
bf16 vs int8 KV is near-identical on quality (bf16-reason 111 vs int8-reason 107 — within noise), so int8-PTH is effectively lossless here; pick by context need (262K vs 131K), not quality.
Soak (continuous / Cliff-2b detector) on gemma-int8-mtp: PASS — 0 errors, 0 silent-empty, 0 VRAM growth, 100% TPS retention, p50 88 TPS.

Configs shipped as vllm/gemma-int8-mtp / vllm/gemma-bf16-mtp.


noonghunna (Maintainer, Author) · 2026-06-01T22:25:23Z

Dual-3090 beellama angle: Gemma-4-31B Q8 + DFlash at 192K

Complement to the single-card beellama numbers above: same engine (beellama v0.3.0 DFlash + windowed KV), but on dual 3090s (layer-split across the two cards, which is a pipeline rather than true tensor parallelism) and with the Q8_K_XL target, so quality is the focus rather than fitting one card. Image ghcr.io/noonghunna/beellama-cpp:multiarch-v0.3.0-efe856397 (our stop-gap build; Anbeeld's official server-cuda-v0.3.0 now exists too).

Serving — beellama v0.3.0, dual 3090 (layer-split, PCIe no-NVLink)

Config	Spec-dec	KV / ctx	decode TPS (narr / code)	TTFT	VRAM / card
Gemma-4-31B UD-Q8_K_XL	DFlash (Anbeeld IQ4_XS draft)	q5_0 / q4_1 · 192K	36.0 / 65.4	~190 ms	~21.4 / 21.9 GB

(bench.sh n=5. 192K is the robust ceiling on dual Q8: 262K OOMs even with a balanced split, because the Q8 weights are 33 GB and Gemma's full-attention layers grow KV with context (Qwen3.6 DeltaNet reaches 262K, Gemma doesn't). The DFlash draft pins to one card (drafter=1 device), so VRAM skews; --tensor-split 0.55,0.45 rebalances to ~21.4/21.9 GB. Decode is Q8-bandwidth-bound and layer-split is a pipeline (no decode parallelism), so a lighter quant roughly doubles it.)

Quality — 8-pack (`benchlocal-cli`, verifier-backed), thinking ON vs OFF (n=1)

Pack	thinking ON	thinking OFF
toolcall-15	13/15	13/15
instructfollow-15	14/15	14/15
structoutput-15	14/15	14/15
dataextract-15	11/15	11/15
reasonmath-15	13/15	14/15
bugfind-15	15/15	15/15
hermesagent-20	13/20	13/20
cli-40	20/40	19/40
TOTAL (8-pack)	113/150 (75%)	113/150 (75%)

Takeaways

Thinking on vs off is an exact tie (113 = 113): six packs are identical and the two that move (reasonmath, cli-40) cancel. Same conclusion as the Qwen thread (When Should a Local Qwen3.6-27B Think? I Measured It. #221): reasoning-on buys no measurable quality on this stack, so thinking-off is the right default (fewer tokens, lower latency).
DFlash is code-good / prose-net-negative on v0.3.0: code acceptance ~0.5 (→ 65 TPS), but prose acceptance collapsed to ~0.08 (→ narr 36 ≈ the no-spec baseline). This is the known v0.3.0-wide DFlash-on-prose regression (Qwen3.6 DeltaNet — no SWA — shows the identical collapse, so it's not SWA-specific); Anbeeld is actively fixing it.
Strengths: bugfind 15/15, dataextract 11/15 (73%) — notably cleaner JSON numeric-typing than Qwen3.6-27B (which sits at 53% on the same pack). Weak: cli-40 agentic (~48%, same as Qwen).

