BeeLlama v0.3.0 #288

Anbeeld · 2026-05-31T21:14:59Z

Anbeeld
May 31, 2026

Greetings to the club. I'm actively working on BeeLlama v0.3.0 right now, aiming to bring the full upstream feature set into it, and fix multi-slot and multi-GPU. That said, I only possess one 3090, and rely on community reports for this.

I want to ask some dual+ 3090/4090/5090 folks to build and try the latest v0.3.0 branch and report back how it works rn, with DFlash, MTP, and no-spec too.

That would help a lot. You can share your results and logs either here or in BeeLlama issue for multi-GPU Anbeeld/beellama.cpp#39

Single GPU folks can of course help as well, mainly in terms of edge cases and general stability where I missed something myself.

Thanks a lot in advance.

noonghunna · 2026-06-01T00:43:44Z

noonghunna
Jun 1, 2026
Maintainer

UPDATE (2026-06-02) — Anbeeld's official images are live; we've switched to them. club-3090 now defaults to ghcr.io/anbeeld/beellama.cpp:server-cuda-v0.3.0-<commit> (validated e0663be2713c on 2× 3090). PR #48 is closed — superseded by Anbeeld's own CI fix. ⚠️ His official CUDA build targets sm_86 + sm_89 only (3090 / 4090) — no sm_120, so it won't run on a 5090 / Blackwell yet; our snapshot below stays published purely as the 5090 stopgap (sm_86;89;120) until his CI adds Blackwell. Original note kept below for history.

Quick follow-up to lower the barrier for the dual+ GPU testers you're after, @Anbeeld — the main thing slowing community reports is that everyone has to compile from source.

We're hosting an unofficial multi-arch (sm_86/89/120 = RTX 3090/4090/5090) Docker image of v0.3.0 so testers can docker pull instead of building. One caveat to be upfront about: it's a point-in-time snapshot pinned to one commit — it won't track your ongoing v0.3.0 commits. It's a stopgap so people can start this week; the canonical nightly really belongs on your side (your branch, your registry). We'll post the exact pull tag here once it's built + smoke-checked on our rig.
To make a nightly that auto-tracks v0.3.0, we drafted a GitHub Actions workflow (nightly + on-push → your GHCR) and opened a PR: Fix turbo-KV build flag (FA_ALL_QUANTS) + add CUDA→GHCR build workflow Anbeeld/beellama.cpp#48. It also fixes a small real thing — the CUDA Dockerfile is missing -DGGML_CUDA_FA_ALL_QUANTS=ON, which the turbo/TCQ KV types need at runtime.

We did it as a PR rather than just running it on our side because you own the dev branch, so CI linked to it is the right home for it — and that keeps you in control of the official image. Happy to adjust the workflow however you'd like.

0 replies

noonghunna · 2026-06-01T01:45:54Z

noonghunna
Jun 1, 2026
Maintainer

UPDATE (2026-06-02): club-3090 now uses Anbeeld's official image by default (ghcr.io/anbeeld/beellama.cpp:server-cuda-v0.3.0-…) — the BEELLAMA_IMAGE=… prefix below is no longer needed for the default case (the composes are wired to it). The duals ship now (no longer parked). 5090 / Blackwell: his official build is sm_86;89 only — see the 5090 notes in the pull + launch steps.

beellama v0.3.0 on 2× RTX 3090 — results + how to test 📌 (one-stop)

Consolidating our testing here so it's all in one place. If you're testing too: skim "Already reported" so you don't re-file known stuff — we mostly want you to confirm it on your topology (esp. 4090 / 5090 / 3+ GPU) and flag anything new.

✅ Already reported (known — please don't re-file, just confirm/extend)

Tested commit efe856397 on 2× RTX 3090 (Ampere sm_86, PCIe, no NVLink). Tracked upstream on beellama #39.

	DFlash	MTP
Gemma-4-31B	✅ works (code ~157 TPS) · ⚠️ prose acceptance regressed	❌ won't load — `unknown model architecture: gemma4_mtp`
Qwen3.6-27B	✅ works (code ~145 TPS) · ⚠️ prose regressed (same)	✅ works · robust on prose (~0.44 vs DFlash ~0.07)

✅ Multi-GPU DFlash crash is FIXED (both models). No more drafter decode failed -1; --spec-draft-device CUDA0 no longer ggml_aborts. Boot now logs drafter=1 devices; enabling GPU cross ring (was drafter=2 … disabling on the old build). Code DFlash is excellent — 4.2× over no-spec.
⚠️ Prose/free-text acceptance regressed to ~0.07 (net-negative vs no-spec). It's NOT multi-GPU (identical single ≡ dual) and NOT the SWA mismatch — Qwen3.6-27B is DeltaNet (not SWA-windowed) yet shows the identical collapse, so it's a v0.3.0-wide DFlash-on-prose regression. Code is unaffected.
❌ Gemma MTP won't load (gemma4_mtp arch unsupported); ✅ Qwen MTP loads + holds prose where DFlash collapses.
We also filed PR Anbeeld#48: enable FA_ALL_QUANTS (turbo KV) + a CUDA→GHCR nightly workflow.

Our 2× 3090 numbers (indicative — q5_0 K / q4_1 V)

Config	Prose: accept · TPS	Code: accept · TPS
Gemma DFlash dual	0.07 · 35	0.33–0.64 · 95–157
Qwen DFlash dual	0.09 · 38	0.58 · 145
Qwen MTP dual	0.44 · 42	0.87 · 58
no-spec baseline	— · 37	— · 37

Pullable image — now Anbeeld's official CI build (no compiling):

# 3090 / 4090 (sm_86 / sm_89) — canonical, tracks Anbeeld's v0.3.0:
docker pull ghcr.io/anbeeld/beellama.cpp:server-cuda-v0.3.0-e0663be2713c

5090 / Blackwell (sm_120): Anbeeld's official build is sm_86;89 only, so it won't run on a 5090 yet. Until his CI adds sm_120, use our multi-arch snapshot (sm_86;89;120):
docker pull ghcr.io/noonghunna/beellama-cpp:multiarch-v0.3.0-efe856397

📦 First-time setup (skip if you already run club-3090 with these GGUFs)

Repo + runtime — clone club-3090 (or git pull); you need Docker + the NVIDIA Container Toolkit. Pull the image (above).
Grab the GGUFs — you need a target + a DFlash drafter (not bundled). Into your model store (MODEL_DIR, default = the repo's models-cache/):

export MODEL_DIR="$PWD/models-cache"   # or wherever you keep models
# Gemma-4-31B  (target + DFlash drafter)
hf download unsloth/gemma-4-31B-it-GGUF      gemma-4-31B-it-Q4_K_S.gguf       --local-dir "$MODEL_DIR/gemma-4-31b-gguf/unsloth-q4ks"
hf download Anbeeld/gemma-4-31B-it-DFlash-GGUF gemma4-31b-it-dflash-IQ4_XS.gguf --local-dir "$MODEL_DIR/gemma-4-31b-gguf/anbeeld-dflash-iq4xs"
# Qwen3.6-27B  (target + DFlash drafter)
hf download unsloth/Qwen3.6-27B-GGUF         Qwen3.6-27B-Q5_K_S.gguf          --local-dir "$MODEL_DIR/qwen3.6-27b-gguf/unsloth-q5ks"
hf download Anbeeld/Qwen3.6-27B-DFlash-GGUF  Qwen3.6-27B-DFlash-IQ4_XS.gguf   --local-dir "$MODEL_DIR/qwen3.6-27b-gguf/anbeeld-dflash-iq4xs"

(huggingface-cli download … on older hub versions. club-3090's scripts/setup.sh can also fetch via the catalog — scripts/setup.sh --help.)
3. We already ship beellama composes — you'll launch one below; list them with:

bash scripts/switch.sh --list | grep beellama

(single + dual, Gemma + Qwen.)

🧪 How to test on your rig + what to report

The quickest high-signal run is ~5 min, two light commands — no heavy bench. DFlash is output-lossless, so we're after spec-dec behaviour (does multi-GPU DFlash engage, acceptance/TPS, crashes), not quality scores.

1. Launch — club-3090 now points the beellama composes at Anbeeld's official image by default (no BEELLAMA_IMAGE needed):

# dual-GPU (the multi-GPU path Anbeeld needs) — 🧪 experimental, so --force:
bash scripts/switch.sh --force beellama/gemma-dflash-dual

# single card:
bash scripts/switch.sh beellama/gemma-dflash

On a 5090? prefix either command with BEELLAMA_IMAGE=ghcr.io/noonghunna/beellama-cpp:multiarch-v0.3.0-efe856397 until Anbeeld's CI ships sm_120.

⏱️ Times out, or crashes with "failed to open GGUF / No such file"?

The image pulled fine (the container starts in <1s) — the server couldn't load a model file. The log names which one:

docker logs --tail 50 "$(docker ps -a --format '{{.Names}}' | grep beellama | head -1)"

failed to open GGUF file … (No such file or directory) → that GGUF isn't under your MODEL_DIR. Note DFlash/MTP needs two files — the target (-m) and the drafter (--spec-draft-model); the target can load fine while the drafter is the one missing. The MODEL_DIR you hf download into must match the one you launch with (echo $MODEL_DIR). Re-run the step-2 downloads for both files into the right MODEL_DIR.
no kernel image is available → you're on a 5090/Blackwell (sm_120); Anbeeld's official build is sm_86/89 only — relaunch with the BEELLAMA_IMAGE=ghcr.io/noonghunna/beellama-cpp:multiarch-v0.3.0-efe856397 prefix.
CUDA out of memory → not enough free VRAM; lower CTX_SIZE= or free the card.

2. Warm it up — send it a code prompt and a prose/essay prompt (they behave very differently under DFlash).

3. Report back (here or on beellama #39) — two commands:

bash scripts/report.sh        # topology + GPUs + NVLink + boot log, one paste

# acceptance + TPS (per-request — report.sh only captures the boot log):
docker logs "$(docker ps --format '{{.Names}}' | grep beellama | head -1)" 2>&1 \
  | grep -iE 'draft acceptance|decode failed|abort' | tail -20

What we're checking: boot log shows dflash: multi-GPU placement detected (… drafter=1 devices); enabling GPU cross ring and no drafter decode failed -1; plus DFlash acceptance + TPS for code vs prose (+ your topology, which report.sh captures).

Optional extras (skip unless you fancy it): bash scripts/verify-full.sh for a functional 8/8 + coherence (pass MODEL=$(curl -s localhost:8062/v1/models | jq -r '.data[0].id') URL=http://localhost:8062 so it doesn't 404 on the Qwen default), or the benchlocal quality-test.sh 8-pack. Neither is needed.

Thanks for jumping on it — shout if anything trips up. 🐝

🔧 Power users — custom configs / other spec-dec angles

Want to poke at other aspects (different KV quant, context ceiling, draft placement, MTP vs DFlash)? Two routes:

A. Tweak knobs via env + switch.sh (env flows straight through) — for KV / ctx / split / draft-file / sampling:

CTX_SIZE=200000 KV_TYPE_K=q8_0 KV_TYPE_V=q5_0 SPLIT_MODE=layer \
  bash scripts/switch.sh --force beellama/gemma-dflash-dual

B. Run a compose file directly (full control, or your own file):

MODEL_DIR=/path/to/your/gguf-models \
BEELLAMA_IMAGE=ghcr.io/anbeeld/beellama.cpp:server-cuda-v0.3.0-e0663be2713c \
  docker compose -f models/gemma-4-31b/beellama/compose/dual/beellama-q4ks-dflash/dflash.yml up

Env knobs the beellama composes expose: CTX_SIZE · KV_TYPE_K/KV_TYPE_V (q5_0 / q4_1 / q8_0 / …) · CROSS_CTX · SPLIT_MODE (layer / row) · MAIN_GPU · BATCH_SIZE/UBATCH_SIZE · DRAFT_FILE / GGUF_FILE · TEMP / TOP_K / TOP_P · REASONING (on/off).

Changing the spec type needs a command edit (it's not an env knob) — copy the compose, edit its command: block, then docker compose -f your-copy.yml up:

MTP instead of DFlash (Qwen only — Gemma's gemma4_mtp won't load): point -m at an MTP-head GGUF + --spec-type mtp. On our rig MTP held prose acceptance (~0.44) where DFlash collapsed (~0.07) — worth confirming cross-rig.
Draft placement: add --spec-draft-device CUDA0 (now crash-free on v0.3.0).
no-spec baseline: drop the --spec-* lines for the target-only TPS to compare against.

Whatever you run, jot the config next to the numbers so it's reproducible. 🙏

0 replies

Anbeeld · 2026-06-01T19:40:22Z

Anbeeld
Jun 1, 2026
Author

Fixed my CI/CD to produce builds and Docker images, with v0.3.0 already having some successful output in that regard. Please try it.

3 replies

noonghunna Jun 1, 2026
Maintainer

🐝 This is great news — official multi-arch v0.3.0 images are exactly what unblocks wider testing. Congrats on getting the CI/CD green, and thanks!

We've been hammering our own stop-gap multiarch v0.3.0 build (efe856397) on 2× RTX 3090 (Ampere sm_86) — Qwen3.6-27B and Gemma-4-31B DFlash dual at Q8, up to 262K (Qwen) / 192K (Gemma) — so we have a direct baseline to compare your official image against. We'll pull the CUDA image (server-cuda-v0.3.0) and validate it on Ampere (boot + DFlash acceptance + coherence) and report numbers back here.

One flag worth double-checking in the CUDA build: our PR #48 (which I see you closed — all good, your own CI is cleaner) added -DGGML_CUDA_FA_ALL_QUANTS=ON. The stock CUDA Dockerfile omits it, and it's required for quantized KV cache (q5_0 / q4_1, etc.) — which is what most dual-3090 owners run to reach big context. Without it, flash-attention rejects those KV types at boot. If your CI image doesn't already set it, it's a one-line add and a big win for the GPU-poor crowd running turbo KV. (If it's already in, even better — we'll confirm on our Ampere test.)

Will report the validation shortly. 🙏

Anbeeld Jun 1, 2026
Author

Thanks. I also updated Bee to latest llama.cpp upstream (CI/CD in process though), which brings VRAM optimization, so now it should fit a bit more context. I'm also working on fixing prose regressions at this very minute.

noonghunna Jun 2, 2026
Maintainer

Tried it — your official v0.3.0 CUDA image is a clean drop-in on Ampere. 🎉

Pulled ghcr.io/anbeeld/beellama.cpp:server-cuda-v0.3.0-e0663be2713c and validated on 2× RTX 3090 (sm_86, PCIe, no NVLink):

Boots single and dual, GPU cross-ring DFlash working — same results we got building from source.
FA_ALL_QUANTS is in your build (our q5_0/q4_1 turbo-KV paths work), so your CI covers both things our PR Make docker optional #48 was for — I've closed Make docker optional #48 as superseded by your own CI fix. That's the canonical path we were hoping for. 👍

To lower the barrier for the dual+ testers you're after: club-3090 now points its image at your official server-cuda-v0.3.0 tag (retiring our stopgap snapshot as the default) and ships the dual configs as ready-to-run composes — so testers can docker pull your image and up -d instead of compiling:

Qwen3.6-27B — MTP dual + DFlash dual (full 262K)
Gemma-4-31B — DFlash dual (192K) + the q4ks dual

These are the same configs from our consolidated report above, now one command for anyone on a dual rig. (I'm pinning the validated e0663be2713c; saw 8b00fe11583b went up too.)

One thing still open (already in our report — just confirming it's in the official build, not our snapshot): the DFlash-on-prose acceptance regression reproduces on the official image. Code path is great (~0.5 accept, ~157 TPS Gemma / ~145 Qwen); prose accept drops to ~0.07 (net-negative). It's cross-model — both Qwen (DeltaNet) and Gemma (SWA) — so it's not the SWA-window mismatch. Qwen MTP is unaffected (robust ~0.44). No urgency, just keeping it on the v0.3.0 list.

One CI heads-up while you iterate: I checked the published server-cuda-v0.3.0 image — libggml-cuda.so is compiled for sm_86 + sm_89 only (no PTX fallback), so it covers 3090/4090 but a 5090 (sm_120 / Blackwell) hits no kernel image is available for execution on the device. Since your OP wanted 5090 testers too, adding 120 to the arch list (-DCMAKE_CUDA_ARCHITECTURES="86;89;120", or a 90-PTX fallback) would let them pull directly. We're keeping our …:multiarch-v0.3.0-… snapshot (sm_86;89;120) as the 5090 stopgap meanwhile.

Way down the road / good-to-have (definitely not asking you to touch the v0.3.0 plan): one combination we'd love to see someday is Google's own Gemma-4 MTP drafter (gemma-4-31B-it-assistant, the gemma4_mtp arch) on top of your windowed SWA KV. The appeal is footprint + prose-robustness, not a flag chase:

We measured that drafter at ~1.0 GB resident (4 layers, SWA-1024 that matches Gemma's own window) vs DFlash's ~4.3 GB on a single card. On a 24 GB 3090 that frees ~3.3 GB for KV — which, with your windowed allocation, looks like real headroom past the current ~128K-text / ~150K-vision single-card DFlash ceiling. (I can't put a number on it yet — the arch doesn't load in the lineage to test.)
Being a separate MTP head rather than DFlash, it should sidestep the prose regression above entirely — i.e. a prose-robust spec-dec path for Gemma, the way MTP already is for Qwen.

Only blocker is arch support: gemma4_mtp loads on ik_llama (their #1744) but not mainline llama.cpp yet (ggml-org/llama.cpp#23398 is the WIP), so it's upstream-gated rather than anything you'd hand-build. Totally fine to park until that lands — flagging it now just so it's on the radar. 🙂

Thanks again for getting the CI + images out — that's exactly what unblocks community testing. We'll keep our composes tracking your official tags.

Anbeeld · 2026-06-02T23:02:11Z

Anbeeld
Jun 2, 2026
Author

Just pushed a multi-step adaptive DM update that should help with prose. Nothing magical, just preventing it from collapsing below baseline, hopefully. Please try it, I'm aiming to release v0.3.0 this week so hope to get main issues ironed out soon. It's already just huge in perspective to v0.2.0, massive update under the hood.

One note on the measurements – AR by itself is not a valuable metric, especially when comparing different spec types. BeeLlama has adaptive draft-max, meaning AR is heavily influenced by DM at that exact time point, and low-ish AR is not always bad because drafting itself is cheap.

It is useful as a diagnostic within one spec system and one prompt and in context of other metrics, but that's it really. E.g. MTP might have 70-80% AR but it drafts 3 tokens while DFlash might draft 16, so there's no direct comparison of AR. The only metric that is crucial by itself is the simplest of them, tok/s.

Not saying you did something wrong or anything like that, really just a note based on my experience how people misunderstand AR.

3 replies

noonghunna Jun 2, 2026
Maintainer

Thanks — and the AR note is well taken, no disagreement. We actually landed the prose conclusion on tok/s, not AR: prose DFlash came out ≈35–36 tok/s vs the ≈37 no-spec baseline (i.e. spec-dec wasn't buying anything on prose), and AR (~0.07) was just the diagnostic we used to explain why. Fully agree it's not comparable across spec types — MTP-drafts-3 vs DFlash-drafts-16 — so the re-test will be reported on tok/s (prose vs no-spec baseline), same prompts, single + dual.

One heads-up on actually trying it: the adaptive-DM work is on the v0.3.0 branch (HEAD 63abcd3, "Improve adaptive DFlash draft depth selection"), but the newest published image is server-cuda-v0.3.0-8b00fe1 — a 2026-06-01 docs commit that predates the whole adaptive-DFlash-draft cluster. So there's nothing on GHCR yet that carries the fix. Is your CI building the v0.3.0 branch (just lagging), or only on tag? We're happy to wait for the next CI image / the v0.3.0 tag, or build 63abcd3 ourselves for an early prose-tok/s signal before you release — whatever's most useful.

Anbeeld Jun 3, 2026
Author

GitHub CI/CD is quite slow, takes hours to build, so yeah.

noonghunna Jun 3, 2026
Maintainer

Gemma leg done — re-ran the full prose A/B on tok/s (your AR point reframed the whole thing). Setup: bench.sh narrative, n=5, no-spec controls, three images — your current e0663be, the new 63abcd3, and our earlier self-built efe856397.

Qwen3.6-27B · single Q5_K_S (1×3090) — no-spec 35.9 t/s

image	prose t/s	AR
`e0663be`	45.3	0.34
`63abcd3`	45.7	0.34

Qwen3.6-27B · dual Q8_K_XL (2×3090, layer-split) — no-spec 23.4 t/s

image	prose t/s	AR
`efe856397`	37.0	0.32
`e0663be`	36.2	0.33
`63abcd3`	35.6	0.41

Gemma-4-31B · single Q4_K_S (1×3090) — no-spec 34.8 t/s

image	prose t/s	AR
`efe856397`	45.5	0.29
`e0663be`	45.2	0.28
`63abcd3`	44.6	0.29

Two takeaways:

Prose DFlash is net-positive across the board — +26–31% single-card, +52–58% dual — on all three images, both models, including the new adaptive-DM HEAD. The new commit nudges qwen-dual AR (0.33→0.41) with tok/s flat; no downside anywhere.
We can't reproduce the "prose collapse" we flagged earlier. Re-running the same efe856397 build that gave us ~0.07–0.10 AR now drafts prose at ~0.29–0.32 AR at the same tok/s — so that was a noisy/prompt-dependent AR reading (exactly your point), and our "net-negative" call also leaned on a wrong no-spec baseline (we'd been using ~37; the real ones are 23.4 dual-Q8 / ~35 single). On tok/s vs the correct baseline, prose DFlash was net-positive the whole time. That one's on us — and thanks for the AR nudge, it's what sent us back to measure it properly.

(63abcd3 was our own sm_86 build of the v0.3.0 HEAD, since CI hasn't published an image past 8b00fe1 yet.)

BeeLlama v0.3.0 #288

Uh oh!

Uh oh!

Anbeeld May 31, 2026

Replies: 4 comments · 6 replies

Uh oh!

Uh oh!

noonghunna Jun 1, 2026 Maintainer

Uh oh!

Uh oh!

noonghunna Jun 1, 2026 Maintainer

beellama v0.3.0 on 2× RTX 3090 — results + how to test 📌 (one-stop)

✅ Already reported (known — please don't re-file, just confirm/extend)

Our 2× 3090 numbers (indicative — q5_0 K / q4_1 V)

📦 First-time setup (skip if you already run club-3090 with these GGUFs)

🧪 How to test on your rig + what to report

⏱️ Times out, or crashes with "failed to open GGUF / No such file"?

🔧 Power users — custom configs / other spec-dec angles

Uh oh!

Uh oh!

Anbeeld Jun 1, 2026 Author

Uh oh!

Uh oh!

noonghunna Jun 1, 2026 Maintainer

Uh oh!

Uh oh!

Anbeeld Jun 1, 2026 Author

Uh oh!

Uh oh!

noonghunna Jun 2, 2026 Maintainer

Uh oh!

Anbeeld Jun 2, 2026 Author

Uh oh!

noonghunna Jun 2, 2026 Maintainer

Uh oh!

Anbeeld Jun 3, 2026 Author

Uh oh!

noonghunna Jun 3, 2026 Maintainer

Anbeeld
May 31, 2026

Replies: 4 comments 6 replies

noonghunna
Jun 1, 2026
Maintainer

noonghunna
Jun 1, 2026
Maintainer

Anbeeld
Jun 1, 2026
Author

noonghunna Jun 1, 2026
Maintainer

Anbeeld Jun 1, 2026
Author

noonghunna Jun 2, 2026
Maintainer

Anbeeld
Jun 2, 2026
Author

noonghunna Jun 2, 2026
Maintainer

Anbeeld Jun 3, 2026
Author

noonghunna Jun 3, 2026
Maintainer