π DiffusionGemma 26B-A4B β vLLM's first diffusion LLM on dual 3090 (π§ͺ experimental) #364
noonghunna
announced in
Announcements
Replies: 2 comments 3 replies
-
Good point... I guess this is bad at 1shot vibe-coding? |
Beta Was this translation helpful? Give feedback.
3 replies
-
DiffusionGemma vs its autoregressive sibling β full head-to-headBoth
Breakdown
Upshot
|
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
π§ͺ Experimental β runs on the official
vllm/vllm-openai:gemmaimage + 3 vendored Ampere/TP fix-mounts; gated on the unmerged vllm#45163. No production guarantee yet β cross-rig numbers very welcome (see "What'd help").We just wired up
vllm/diffusiongemma-dualβ Google's DiffusionGemma 26B-A4B, vLLM's first discrete-diffusion LM (dLLM). Instead of autoregressive token-by-token decoding, it denoises a whole 256-token "canvas" in parallel via bidirectional attention. It's a gemma4-SWA-MoE backbone (3.8B active) with a block-diffusion decode head. All credit for the model goes to Google; RedHatAI for the fp8 weights; vLLM for the arch + the official image β see Credits.The headline: it runs on 2Γ 3090 (Ampere sm_86) at all β vLLM's official
:gemmaimage targets H100/B200, and fp8 on Ampere hits a Marlin shared-memory wall; 3 small vendored fixes (sm_86 fp8 Marlin sub-tile-K pad + a TP=2 vocab fix) unlock it, and throughput is content-dependent in a way AR models aren't.π΄ Results Card β 2Γ RTX 3090, official
:gemmaimage (digest-pinned), eager, TP=2β Serving
wall_TPSonly βdecode_TPSis meaningless for diffusion (it emits a whole canvas at once). Throughput is content-dependent (entropy-bound sampler): ~7 tokens committed per denoising step on prose vs ~25 on low-entropy text β a ~3.4Γ swing. Long-ctx: the engine serves cleanly to 262K (verify-stress ladder: no crash/OOM, soak 0-growth), but exact needle recall is clean only to ~30K and degrades beyond (1-char drops at 57K, misses by 90K) β a diffusion long-ctx-attention limit, not a system failure.bench.sh FORCE_TOKENS=2000β ~257 TPS (sustained throughput rises with output size β the block-parallel win the default bench doesn't show).β‘ Quality β core 8-pack (/150, gemma4 parsers): reasoning OFF vs ON
Clean A/B β every pack forced thinking-off vs thinking-on (verifier-graded; no timeout contamination):
Reasoning is a wash on DiffusionGemma β 96/150 either way. Thinking-on gains (instructfollow +2, toolcall/bugfind/cli +1) are exactly cancelled by losses (dataextract/hermesagent β2, reasonmath β1). The interesting part is the contrast with its autoregressive sibling
gemma-4-26b-a4b(same gemma4-SWA-MoE backbone, AWQ): that one gains +11 with thinking (98 β 109), but the diffusion decode doesn't benefit from an explicit CoT pass β block-parallel canvas denoising already refines bidirectionally/iteratively, so chain-of-thought adds ~nothing (and slightly hurts extraction + agentic). Practical upshot: serve it thinking-off β same quality, lower latency.hermesagent-20 11/20 (thinking-off) = end-to-end agentic tool-use works. The gemma4 parsers are load-bearing β the deterministic 5-pack jumps 57% β 84% with them. cli-40 (35%) is the drag (strict shell-semantics + safety discipline). verify-full: completion + tool-call + reasoning + output-quality (8319 chars, 0 line-repeats, no cascade) all pass. soak-continuous PASS β 0 MiB VRAM growth, 0/25 silent-empty turns, 0 errors, 100% TPS retention (5 sessions Γ 5 turns).
β’ Takeaways
:gemmaclean dies in warmup on the fp8 Marlin wall (Invalid thread config β¦ K=352/1056 β¦ 99 KB shared mem).llama-diffusion-cli(CLI-only).How it runs here
vLLM publishes an official
vllm/vllm-openai:gemmaimage (a stock build of the dgemma branch) with the arch baked in. We pin it by digest (purge-resistant) and bind-mount 3 fixes that aren't upstream (vLLM tests H100/B200 + TP=1, so neither our Ampere fp8 path nor our TP=2 path is covered):marlin.py+marlin_utils_fp8.pyβ sm_86 fp8 Marlin sub-tile-K pad (the K=352/1056 warmup wall).diffusion_gemma.pyβ TP-vocab soft-embed + dtype fix (the TP=2 path).It's gated on the unmerged vllm#45163 β when the arch + an Ampere Marlin fix land upstream, the mounts go away.
Requirements
:gemmaimage (auto-pulled).Getting it / Run it
Serves an OpenAI-compatible API on
:8042(modeldiffusiongemma-26b-a4b) with gemma4 tool-calling + reasoning.--forceis required for the experimental status.π New to the stack? The FAQ covers the setup basics β where weights land / putting them on another drive (
MODEL_DIR), switching models, setting your own default config, and offline / air-gapped installs (see its Setup section).What'd help
Cross-rig numbers (other 3090 pairs / 4090s β fit, TPS, the content-dependent curve), and especially an Ampere-compatible 4-bit quant (AWQ/GPTQ β Marlin WNA16, ~13 GB) β that's the missing piece for a single-card vLLM path.
Credits
diffusiongemma-26B-A4B-it-FP8-dynamic(compressed-tensors).vllm/vllm-openai:gemmaimage we pin.Beta Was this translation helpful? Give feedback.
All reactions