Replies: 5 comments 10 replies
-
Looks great! I'll pick it up in a couple of days; I'm tied up in some refactoring right now. Any luck enabling MTP for Gemma?
-
Just a note that the latest v0.3.2 Preview should have the architectural stuff cleaned up now, so it should be fine for some initial KVarN testing. Inference speed optimizations will follow later.
-
@Anbeeld, sorry for the slow turnaround getting back to you on this! I finally put KVarN through the club-3090 rig properly and focused on the things PPL/KLD don't cover: real-task quality (8-pack), decode speed, and single-card context ceilings.
Setup: RTX 3090 (sm_86), Qwen3.6-27B coder fine-tune, Q5_K_M GGUF, beellama v0.3.2-preview (
Serving
3 warm + 5 measured; temp 0.6 / top-k 20 / top-p 1.0.
Plain single-card (24 GB) ctx ceilings: q5_0/q4_1 ~196K · kvarn4 ~230K · kvarn2 262K.
Needle recalled at ~72K depth (kvarn4 = q5_0/q4_1 control).
Bottom row drops MTP to free its ~1.5 GB for the larger window (~35 vs ~57 code TPS).
Quality: core 8-pack → /150
Single run per arm, so treat ≤±5/150 as noise. off =
Takeaways
tl;dr: on Ampere, kvarn4/kvarn4 is a free (quality- and decode-neutral) KV that buys ~1.2× context, and it's now our default KV for single-card coder serving. Really nice work 🙌
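For anyone who wants to check the arithmetic behind the "~1.2× context" figure, here is a minimal sketch that uses only the numbers quoted in the comment above. The function names and the `within_noise` helper are illustrative and not part of any benchmark tooling mentioned in this thread.

```python
# Context ceilings quoted above, in thousands of tokens (approximate).
CTX_CEILING_K = {"q5_0/q4_1": 196, "kvarn4": 230, "kvarn2": 262}


def context_gain(arm: str, baseline: str = "q5_0/q4_1") -> float:
    """Ratio of an arm's context ceiling to the baseline ceiling."""
    return CTX_CEILING_K[arm] / CTX_CEILING_K[baseline]


def within_noise(score_a: float, score_b: float, band: float = 5.0) -> bool:
    """True if two single-run scores (out of 150) differ by at most the noise band."""
    return abs(score_a - score_b) <= band


if __name__ == "__main__":
    for arm in CTX_CEILING_K:
        print(f"{arm}: {context_gain(arm):.2f}x baseline context")
    # Decode-speed cost of dropping MTP for the larger window: ~35 vs ~57 code TPS.
    print(f"no-MTP decode speed: {35 / 57:.0%} of the MTP speed")
```

With these inputs kvarn4 comes out at roughly 1.17× the q5_0/q4_1 ceiling, which is where the rounded ~1.2× comes from, and kvarn2 at roughly 1.34×.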
-
v0.3.2 Preview has had a massive update: an upstream merge (Gemma MTP etc.), a much more mature KVarN implementation (better performance, fixed VRAM consumption edge cases), more KV cache quant types (q6_1, q3_1 through q2_0, turbo4_tcq), and more. Public builds have been switched to the updated default quant pair setup, with TurboQuant left out because either the standard quants or KVarN work better depending on the situation; turbo is still available if you build it yourself with the HALF or ALL quant flags. I'll repeat my own KLD benchmarks soon-ish with a generous pair matrix, this time on the Gemma 4 31B model, and I'd also be happy to get external feedback.
-
Thanks @Anbeeld, pulled the updated
Good news first: KVarN now works cleanly on Gemma 4 too, not just Qwen.
The thing I wanted to flag: I tried the Gemma 4 MTP path from your changelog ("including E2B/E4B assistants") with
Tracing it: the runtime PR (#23398) merged to master 2026-06-07, but the converter (#23727,
Question: is there a gemma-4 assistant GGUF you've confirmed loads on the merged path, or what arch/flags does your inherited gemma4 MTP expect (
-
It would be interesting to see your testing of, and perspective on, the KVarN KV cache.
I added a basic implementation of it to BeeLlama.cpp v0.3.2 Preview.
Thanks!