🌀 DiffusionGemma 26B-A4B — vLLM's first diffusion LLM on dual 3090 (🧪 experimental) #364

noonghunna · 2026-06-11T05:55:36Z

noonghunna
Jun 11, 2026
Maintainer

🧪 Experimental — runs on the official vllm/vllm-openai:gemma image + 3 vendored Ampere/TP fix-mounts; gated on the unmerged vllm#45163. No production guarantee yet — cross-rig numbers very welcome (see "What'd help").

We just wired up vllm/diffusiongemma-dual — Google's DiffusionGemma 26B-A4B, vLLM's first discrete-diffusion LM (dLLM). Instead of autoregressive token-by-token decoding, it denoises a whole 256-token "canvas" in parallel via bidirectional attention. It's a gemma4-SWA-MoE backbone (3.8B active) with a block-diffusion decode head. All credit for the model goes to Google; RedHatAI for the fp8 weights; vLLM for the arch + the official image — see Credits.

The headline: it runs on 2× 3090 (Ampere sm_86) at all — vLLM's official :gemma image targets H100/B200, and fp8 on Ampere hits a Marlin shared-memory wall; 3 small vendored fixes (sm_86 fp8 Marlin sub-tile-K pad + a TP=2 vocab fix) unlock it, and throughput is content-dependent in a way AR models aren't.

🎴 Results Card — 2× RTX 3090, official `:gemma` image (digest-pinned), eager, TP=2

① Serving

Config	Spec-dec	KV	ctx	Narr	Code	Peak (low-entropy)	VRAM
fp8 + 3 fix-mounts ⭐	none (diffusion)	bf16	262K	177	180	~1100	23 GB/card

wall_TPS only — decode_TPS is meaningless for diffusion (it emits a whole canvas at once). Throughput is content-dependent (entropy-bound sampler): ~7 tokens committed per denoising step on prose vs ~25 on low-entropy text — a ~3.4× swing. Long-ctx: the engine serves cleanly to 262K (verify-stress ladder: no crash/OOM, soak 0-growth), but exact needle recall is clean only to ~30K and degrades beyond (1-char drops at 57K, misses by 90K) — a diffusion long-ctx-attention limit, not a system failure. bench.sh FORCE_TOKENS=2000 → ~257 TPS (sustained throughput rises with output size — the block-parallel win the default bench doesn't show).

② Quality — core 8-pack (/150, gemma4 parsers): reasoning OFF vs ON

Clean A/B — every pack forced thinking-off vs thinking-on (verifier-graded; no timeout contamination):

Pack	OFF	ON
toolcall-15	12	13
instructfollow-15	12	14
structoutput-15	14	14
dataextract-15	11	9
reasonmath-15	13	12
bugfind-15	10	11
hermesagent-20	11	9
cli-40	13	14
Total /150	96 (64%)	96 (64%)
det-6 · sandbox-2	72 · 24	73 · 23

Reasoning is a wash on DiffusionGemma — 96/150 either way. Thinking-on gains (instructfollow +2, toolcall/bugfind/cli +1) are exactly cancelled by losses (dataextract/hermesagent −2, reasonmath −1). The interesting part is the contrast with its autoregressive sibling gemma-4-26b-a4b (same gemma4-SWA-MoE backbone, AWQ): that one gains +11 with thinking (98 → 109), but the diffusion decode doesn't benefit from an explicit CoT pass — block-parallel canvas denoising already refines bidirectionally/iteratively, so chain-of-thought adds ~nothing (and slightly hurts extraction + agentic). Practical upshot: serve it thinking-off — same quality, lower latency.

hermesagent-20 11/20 (thinking-off) = end-to-end agentic tool-use works. The gemma4 parsers are load-bearing — the deterministic 5-pack jumps 57% → 84% with them. cli-40 (35%) is the drag (strict shell-semantics + safety discipline). verify-full: completion + tool-call + reasoning + output-quality (8319 chars, 0 line-repeats, no cascade) all pass. soak-continuous PASS — 0 MiB VRAM growth, 0/25 silent-empty turns, 0 errors, 100% TPS retention (5 sessions × 5 turns).

③ Takeaways

A diffusion LLM that actually serves on Ampere. The unlock is the 3 vendored fixes, not the model — without them :gemma clean dies in warmup on the fp8 Marlin wall (Invalid thread config … K=352/1056 … 99 KB shared mem).
Throughput rises with output size + low entropy — ~177/180 typical, ~257 at a forced 2K output, ~1100 on predictable text. Plan around the curve, not a single number.
Reasoning is a wash — serve thinking-off. Forced thinking-on scored identically (96/150) to thinking-off; unlike the AR sibling (+11 with CoT), the diffusion decode gains nothing from an explicit reasoning pass. Lower latency for free.
Dual-card only on vLLM — fp8 weights are ~26 GB; no Ampere-compatible 4-bit vLLM quant exists yet (NVFP4 is Blackwell-only; the rest are MLX/GGUF). Single-card today = the Q4_K_M GGUF via llama-diffusion-cli (CLI-only).
It self-terminates ~1.5–2K words — won't one-shot a 10K-word essay even with a high cap; use multi-turn.

How it runs here

vLLM publishes an official vllm/vllm-openai:gemma image (a stock build of the dgemma branch) with the arch baked in. We pin it by digest (purge-resistant) and bind-mount 3 fixes that aren't upstream (vLLM tests H100/B200 + TP=1, so neither our Ampere fp8 path nor our TP=2 path is covered):

marlin.py + marlin_utils_fp8.py — sm_86 fp8 Marlin sub-tile-K pad (the K=352/1056 warmup wall).
diffusion_gemma.py — TP-vocab soft-embed + dtype fix (the TP=2 path).

It's gated on the unmerged vllm#45163 — when the arch + an Ampere Marlin fix land upstream, the mounts go away.

Requirements

2× 24 GB GPU (TP=2). Single-card vLLM isn't available — fp8 is ~26 GB; no Ampere 4-bit quant exists yet.
~26 GB weights (RedHatAI fp8-dynamic) + the official :gemma image (auto-pulled).

Getting it / Run it

git clone https://github.com/noonghunna/club-3090 && cd club-3090
./scripts/setup.sh diffusiongemma-26b-a4b               # fetch the RedHatAI fp8 weights
./scripts/switch.sh --force vllm/diffusiongemma-dual    # --force: it's experimental (non-functional status)

Serves an OpenAI-compatible API on :8042 (model diffusiongemma-26b-a4b) with gemma4 tool-calling + reasoning. --force is required for the experimental status.

📖 New to the stack? The FAQ covers the setup basics — where weights land / putting them on another drive (MODEL_DIR), switching models, setting your own default config, and offline / air-gapped installs (see its Setup section).

Tip — gpu-mode users: one-word mode —
gpu-mode dgemma
It frees both GPUs (stops any other dual-card model), boots DiffusionGemma on :8199, brings up support services, and wires Open WebUI. It's dual-card, so run gpu-mode off before switching to another model.

Streaming note: SSE delivers a whole 256-token canvas as ~1 chunk (block-parallel decode), not token-by-token — expected for a dLLM.

What'd help

Cross-rig numbers (other 3090 pairs / 4090s — fit, TPS, the content-dependent curve), and especially an Ampere-compatible 4-bit quant (AWQ/GPTQ → Marlin WNA16, ~13 GB) — that's the missing piece for a single-card vLLM path.

Credits

The model: Google — DiffusionGemma 26B-A4B (vLLM's first discrete-diffusion LM).
The fp8 weights: RedHatAI — diffusiongemma-26B-A4B-it-FP8-dynamic (compressed-tensors).
The engine + image: vLLM — the DiffusionGemma arch (#45163) and the official vllm/vllm-openai:gemma image we pin.

tomByrer · 2026-06-11T15:15:16Z

tomByrer
Jun 11, 2026

self-terminates ~1.5–2K words — won't one-shot a 10K-word essay even with a high cap; use multi-turn

Good point... I guess this is bad at 1shot vibe-coding?

3 replies

noonghunna Jun 11, 2026
Maintainer Author

yeah probably a few attempts might do, it tends occasionally miss out characters in NIAH recalls. The test still pass though and strangely the quality tests are not too bad either. I won't count on it for production level coding, but nonetheless can still do basic scripting/devops easily.

tomByrer Jun 12, 2026

can still do basic scripting/devops easily

I'm beginning to think many so-so models can be coaxed into doing 'production level coding' via slicing code changes into very small steps, like you were mentoring a junior programmer. Maybe tasks would be split between smaller models that are better at a certain task.

I need to block out an afternoon or 4 to sort this out, unless others have suggestions?

noonghunna Jun 12, 2026
Maintainer Author

a "loaded" harness can certainly squeeze more out of the small models, try factory droid cli. hermes i didn't like for being too slow. by loaded i'm implying loaded-with-skills.

noonghunna · 2026-06-14T15:58:15Z

noonghunna
Jun 14, 2026
Maintainer Author

DiffusionGemma vs its autoregressive sibling — full head-to-head

Both gemma4-SWA-MoE 26B-A4B, same gemma4 parsers, core 8-pack (/150). The dLLM is the experimental :gemma-image diffusion build; the AR sibling is gemma-4-26b-a4b (cyankiwi AWQ, INT8-PTH KV) — numbers from rebench gemma-26ba4b-int8r.

Pack	Diffusion OFF	Diffusion ON	AR OFF	AR ON
toolcall-15	12	13	13	13
instructfollow-15	12	14	14	14
structoutput-15	14	14	14	14
dataextract-15	11	9	9	11
reasonmath-15	13	12	13	14
bugfind-15	10	11	13	13
hermesagent-20	11	9	10	9
cli-40	13	14	12	21
Total /150	96	96	98	109

Breakdown

Think-OFF it's a near-tie (96 vs 98) — and a genuine mixed bag per-pack: the diffusion model actually leads on dataextract (11 vs 9), cli-40 (13 vs 12) and hermesagent (11 vs 10); the AR sibling leads on instructfollow (14 vs 12) and bugfind (13 vs 10). Net AR +2 (noise).
Think-ON is where they diverge (96 vs 109, +13 AR). The AR sibling's entire CoT gain (98 → 109) is concentrated in cli-40 (12 → 21, +9) plus dataextract (+2) and reasonmath (+1). The diffusion model is reasoning-invariant — 96 → 96, with per-pack moves cancelling out (instructfollow +2, dataextract −2, hermesagent −2). Block-parallel canvas denoising already refines bidirectionally, so an explicit CoT pass adds ~nothing.

Upshot

Max quality → the AR sibling, thinking-on (109) — ~on par with gemma-4-31B's 107. Its think-on agentic jump (cli-40 +9) is the decisive gap.
Lowest latency, no quality loss → either at thinking-off (~96–98) — and the diffusion model is already at its ceiling there (CoT-free by design, so no latency tax for a reasoning pass it can't use).
The dLLM's draw is the property (reasoning-invariant, bidirectional decode), not a quality win. The AR 26B-A4B remains the higher-ceiling model of the pair.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🌀 DiffusionGemma 26B-A4B — vLLM's first diffusion LLM on dual 3090 (🧪 experimental) #364

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

🌀 DiffusionGemma 26B-A4B — vLLM's first diffusion LLM on dual 3090 (🧪 experimental) #364

Uh oh!

Uh oh!

noonghunna Jun 11, 2026 Maintainer

🎴 Results Card — 2× RTX 3090, official :gemma image (digest-pinned), eager, TP=2

① Serving

② Quality — core 8-pack (/150, gemma4 parsers): reasoning OFF vs ON

③ Takeaways

How it runs here

Requirements

Getting it / Run it

What'd help

Credits

Replies: 2 comments · 3 replies

Uh oh!

tomByrer Jun 11, 2026

Uh oh!

noonghunna Jun 11, 2026 Maintainer Author

Uh oh!

tomByrer Jun 12, 2026

Uh oh!

Uh oh!

noonghunna Jun 12, 2026 Maintainer Author

Uh oh!

noonghunna Jun 14, 2026 Maintainer Author

DiffusionGemma vs its autoregressive sibling — full head-to-head

Breakdown

Upshot

noonghunna
Jun 11, 2026
Maintainer

🎴 Results Card — 2× RTX 3090, official `:gemma` image (digest-pinned), eager, TP=2

Replies: 2 comments 3 replies

tomByrer
Jun 11, 2026

noonghunna Jun 11, 2026
Maintainer Author

noonghunna Jun 12, 2026
Maintainer Author

noonghunna
Jun 14, 2026
Maintainer Author