Replies: 4 comments 6 replies
-
Quick follow-up to lower the barrier for the dual+ GPU testers you're after, @Anbeeld — the main thing slowing community reports is that everyone has to compile from source.
We did it as a PR rather than just running it on our side because you own the dev branch, so CI linked to it is the right home for it, and that keeps you in control of the official image. Happy to adjust the workflow however you'd like.
-
beellama v0.3.0 on 2× RTX 3090: results + how to test 📌 (one-stop)
Consolidating our testing here so it's all in one place. If you're testing too: skim "Already reported" so you don't re-file known stuff. We mostly want you to confirm it on your topology (esp. 4090 / 5090 / 3+ GPU) and flag anything new.
✅ Already reported (known: please don't re-file, just confirm/extend)
Tested commit
Our 2× 3090 numbers (indicative — q5_0 K / q4_1 V)
Pullable image, now Anbeeld's official CI build (no compiling):
# 3090 / 4090 (sm_86 / sm_89) - canonical, tracks Anbeeld's v0.3.0:
docker pull ghcr.io/anbeeld/beellama.cpp:server-cuda-v0.3.0-e0663be2713c
📦 First-time setup (skip if you already run club-3090 with these GGUFs)
export MODEL_DIR="$PWD/models-cache" # or wherever you keep models
# Gemma-4-31B (target + DFlash drafter)
hf download unsloth/gemma-4-31B-it-GGUF gemma-4-31B-it-Q4_K_S.gguf --local-dir "$MODEL_DIR/gemma-4-31b-gguf/unsloth-q4ks"
hf download Anbeeld/gemma-4-31B-it-DFlash-GGUF gemma4-31b-it-dflash-IQ4_XS.gguf --local-dir "$MODEL_DIR/gemma-4-31b-gguf/anbeeld-dflash-iq4xs"
# Qwen3.6-27B (target + DFlash drafter)
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q5_K_S.gguf --local-dir "$MODEL_DIR/qwen3.6-27b-gguf/unsloth-q5ks"
hf download Anbeeld/Qwen3.6-27B-DFlash-GGUF Qwen3.6-27B-DFlash-IQ4_XS.gguf --local-dir "$MODEL_DIR/qwen3.6-27b-gguf/anbeeld-dflash-iq4xs"
(
bash scripts/switch.sh --list | grep beellama
(single + dual, Gemma + Qwen.)
🧪 How to test on your rig + what to report
The quickest high-signal run is ~5 min, two light commands, no heavy bench. DFlash is output-lossless, so we're after spec-dec behaviour (does multi-GPU DFlash engage, acceptance/TPS, crashes), not quality scores.
1. Launch: club-3090 now points the beellama composes at Anbeeld's official image by default (no
# dual-GPU (the multi-GPU path Anbeeld needs) - 🧪 experimental, so --force:
bash scripts/switch.sh --force beellama/gemma-dflash-dual
# single card:
bash scripts/switch.sh beellama/gemma-dflash
⏱️ Times out, or crashes with "failed to open GGUF / No such file"?
The image pulled fine (the container starts in <1s); the server couldn't load a model file. The log names which one:
docker logs --tail 50 "$(docker ps -a --format '{{.Names}}' | grep beellama | head -1)"
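If the log points at a missing file, a quick check of the model directory can save a round trip. The sketch below assumes the four GGUF files sit exactly where the `hf download --local-dir` commands in the first-time setup put them, and that the composes read them from `MODEL_DIR`; `check_models` is a hypothetical helper, not part of club-3090. You only need the pair for the model you are launching, so a MISSING line for the other model is harmless.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): verify the GGUF files from the first-time setup.
# Assumption: paths mirror the `hf download --local-dir` targets shown earlier.
check_models() {
  local dir="$1" missing=0 rel
  for rel in \
    "gemma-4-31b-gguf/unsloth-q4ks/gemma-4-31B-it-Q4_K_S.gguf" \
    "gemma-4-31b-gguf/anbeeld-dflash-iq4xs/gemma4-31b-it-dflash-IQ4_XS.gguf" \
    "qwen3.6-27b-gguf/unsloth-q5ks/Qwen3.6-27B-Q5_K_S.gguf" \
    "qwen3.6-27b-gguf/anbeeld-dflash-iq4xs/Qwen3.6-27B-DFlash-IQ4_XS.gguf"
  do
    # -s is true only if the file exists and is non-empty (catches aborted downloads of size 0)
    if [ -s "$dir/$rel" ]; then
      echo "ok       $rel"
    else
      echo "MISSING  $rel"
      missing=$((missing + 1))
    fi
  done
  # Return success only when nothing is missing
  [ "$missing" -eq 0 ]
}

check_models "${MODEL_DIR:-$PWD/models-cache}" || echo "Some model files are missing or empty."
```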
2. Warm it up: send it a code prompt and a prose/essay prompt (they behave very differently under DFlash).
3. Report back (here or on beellama #39) with two commands:
bash scripts/report.sh # topology + GPUs + NVLink + boot log, one paste
# acceptance + TPS (per-request — report.sh only captures the boot log):
docker logs "$(docker ps --format '{{.Names}}' | grep beellama | head -1)" 2>&1 \
  | grep -iE 'draft acceptance|decode failed|abort' | tail -20
What we're checking: boot log shows
Optional extras (skip unless you fancy it):
Thanks for jumping on it. Shout if anything trips up. 🐝
🔧 Power users: custom configs / other spec-dec angles
Want to poke at other aspects (different KV quant, context ceiling, draft placement, MTP vs DFlash)? Two routes:
A. Tweak knobs via env +
CTX_SIZE=200000 KV_TYPE_K=q8_0 KV_TYPE_V=q5_0 SPLIT_MODE=layer \
bash scripts/switch.sh --force beellama/gemma-dflash-dual
B. Run a compose file directly (full control, or your own file):
MODEL_DIR=/path/to/your/gguf-models \
BEELLAMA_IMAGE=ghcr.io/anbeeld/beellama.cpp:server-cuda-v0.3.0-e0663be2713c \
docker compose -f models/gemma-4-31b/beellama/compose/dual/beellama-q4ks-dflash/dflash.yml up
Env knobs the beellama composes expose:
Changing the spec type needs a command edit (it's not an env knob): copy the compose, edit its
Whatever you run, jot the config next to the numbers so it's reproducible. 🙏
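One low-effort way to capture the config is to print the knobs from the same shell you launched from and paste the output above your numbers. This is only a sketch: `print_run_config` is a hypothetical helper, and it covers just the variables named in this post (`BEELLAMA_IMAGE`, `CTX_SIZE`, `KV_TYPE_K`, `KV_TYPE_V`, `SPLIT_MODE`). It assumes you set them in the current shell rather than inline on the launch command, and anything left unset is printed as `(unset)`.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): print the env knobs mentioned in this post,
# one per line, so they can be pasted next to benchmark numbers.
print_run_config() {
  local var val
  for var in BEELLAMA_IMAGE CTX_SIZE KV_TYPE_K KV_TYPE_V SPLIT_MODE; do
    # Look up the variable whose name is stored in $var (empty if unset)
    eval "val=\${$var:-}"
    printf '%s=%s\n' "$var" "${val:-(unset)}"
  done
}

print_run_config
```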
-
Fixed my CI/CD to produce builds and Docker images, with v0.3.0 already having some successful output in that regard. Please try it.
-
Just pushed a multi-step adaptive DM update that should help with prose. Nothing magical, just preventing it from collapsing below baseline, hopefully. Please try it. I'm aiming to release v0.3.0 this week, so I hope to get the main issues ironed out soon. It's already a huge step up from v0.2.0, a massive update under the hood.
One note on the measurements: AR by itself is not a valuable metric, especially when comparing different spec types. BeeLlama has adaptive draft-max (DM), meaning AR is heavily influenced by DM at that exact point in time, and low-ish AR is not always bad because drafting itself is cheap. It is useful as a diagnostic within one spec system and one prompt, in the context of other metrics, but that's it really. E.g. MTP might have 70-80% AR but it drafts 3 tokens, while DFlash might draft 16, so there's no direct comparison of AR. The only metric that is crucial by itself is the simplest of them: tok/s. Not saying you did something wrong or anything like that, really just a note based on my experience of how people misunderstand AR.
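To put rough numbers on the MTP vs DFlash point, here is a back-of-the-envelope sketch. It assumes AR means accepted draft tokens divided by drafted tokens, takes 75% as the midpoint of the 70-80% MTP range above, and uses a made-up 30% AR for DFlash purely for illustration. `accepted_per_step` is a hypothetical helper, not a BeeLlama tool.

```shell
#!/usr/bin/env bash
# Sketch: average accepted draft tokens per verification step = AR * draft length.
# Assumption: AR = accepted / drafted. The 30% DFlash figure is illustrative only.
accepted_per_step() {
  # $1 = acceptance rate in percent, $2 = number of drafted tokens
  LC_ALL=C awk -v ar="$1" -v n="$2" 'BEGIN { printf "%.2f\n", ar / 100 * n }'
}

echo "MTP    (75% AR,  3 drafted): $(accepted_per_step 75 3) accepted tokens/step"
echo "DFlash (30% AR, 16 drafted): $(accepted_per_step 30 16) accepted tokens/step"
```

With these inputs the lower-AR configuration still accepts more tokens per step (4.80 vs 2.25). The sketch ignores drafting cost and verification overhead, which is why measured tok/s stays the number to compare.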
-
Greetings to the club. I'm actively working on BeeLlama v0.3.0 right now, aiming to bring the full upstream feature set into it, and fix multi-slot and multi-GPU. That said, I only possess one 3090, and rely on community reports for this.
I want to ask some dual+ 3090/4090/5090 folks to build and try the latest v0.3.0 branch and report back how it works rn, with DFlash, MTP, and no-spec too.
That would help a lot. You can share your results and logs either here or in the BeeLlama multi-GPU issue, Anbeeld/beellama.cpp#39
Single GPU folks can of course help as well, mainly in terms of edge cases and general stability where I missed something myself.
Thanks a lot in advance.