KVarN KV Cache #329

Anbeeld · 2026-06-05T14:35:16Z

Anbeeld
Jun 5, 2026

Would be interesting to see your testing and perspective of KVarN KV Cache.
I added a basic implementation of it to BeeLlama.cpp v0.3.2 Preview.
Thanks!

noonghunna · 2026-06-05T15:31:51Z

noonghunna
Jun 5, 2026
Maintainer

looks great! i'll pick it up in a couple of days, tied up in some refactoring right now.

Any luck enabling MTP for gemma?

3 replies

Anbeeld Jun 5, 2026
Author

Haven't looked at it yet with how much stuff is happening around.

tomByrer Jun 6, 2026

KVarN 6-bit matches q8_0, 4-bit matches q5_0 (quality)
https://www.reddit.com/r/LocalLLaMA/comments/1tyockn/kv_cache_quant_benchmarks_kvarn_6bit_matches_q8_0/

'official' thread with article.
CAUTION: occasional 6am hallucinations ;)
https://www.reddit.com/r/LocalLLaMA/comments/1txlhxu/i_implemented_kvarn_in_my_llamacpp_fork_and_ran/

So my initial takeaway is if you have to drop below q8 cache quant, seriously look into KVarN.

edit: seems to help vLLM also. Good article to bookmark; has lots of reference links:
https://www.reddit.com/r/LocalLLaMA/comments/1twptw2/kvarn_new_kvcache_quant_from_huawei_35_kv_cache/

Anbeeld Jun 6, 2026
Author

Not only AI hallucinates, apparently!

Anbeeld · 2026-06-07T12:05:13Z

Anbeeld
Jun 7, 2026
Author

Just a note that latest v0.3.2 Preview should have architectural stuff cleaned up now, so should be fine for some initial KVarN testing. Inference speed optimizations will follow later.

0 replies

noonghunna · 2026-06-12T21:41:09Z

noonghunna
Jun 12, 2026
Maintainer

@Anbeeld — sorry for the slow turnaround getting back to you on this! Finally put KVarN through the club-3090 rig properly, and focused on the things PPL/KLD don't cover — real-task quality (8-pack), decode speed, and single-card context ceilings. Setup: RTX 3090 (sm_86), Qwen3.6-27B coder fine-tune, Q5_K_M GGUF, beellama v0.3.2-preview (server-cuda-preview-v0.3.2, commit 98caf25), -fa on, FA_ALL_QUANTS, single card.

Serving

Config	Spec-dec	KV / ctx	decode TPS (narr / code)	TTFT	VRAM / card
kvarn4 / kvarn4	MTP (embedded)	kvarn4 / 160K	45.6 / 57.6	~89 ms	23.3 GB
q5_0 / q4_1 (ref)	MTP (embedded)	q5_0+q4_1 / 160K	46.9 / 58.7	~89 ms	23.8 GB
kvarn4 / kvarn4	none	kvarn4 / 230K	~35 (code)	—	23.5 GB

Notes: 3 warm + 5 measured runs; temp 0.6 / top-k 20 / top-p 1.0. Plain single-card (24 GB) ctx ceilings: q5_0/q4_1 ~196K · kvarn4 ~230K · kvarn2 262K. Needle recalled at ~72K depth (kvarn4 = q5_0/q4_1 control). Bottom row drops MTP to free its ~1.5 GB for the larger window (~35 vs ~57 code TPS).
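For anyone who wants to sanity-check the ratios behind the Serving numbers, here is a minimal Python sketch. It only re-does arithmetic on the figures quoted above; the dictionary layout and the helper name `pct_delta` are illustrative choices, not part of any benchmark harness.

```python
# Figures copied from the Serving table and its notes above.
decode_tps = {
    "kvarn4/kvarn4": {"narr": 45.6, "code": 57.6},
    "q5_0/q4_1": {"narr": 46.9, "code": 58.7},
}
# Single-card context ceilings in thousands of tokens (approximate in the source).
ctx_ceiling_k = {"q5_0/q4_1": 196, "kvarn4": 230, "kvarn2": 262}


def pct_delta(new: float, ref: float) -> float:
    """Relative difference of `new` versus `ref`, in percent."""
    return (new - ref) / ref * 100.0


# Decode speed of kvarn4 relative to the q5_0/q4_1 reference, with MTP on.
decode_delta = {
    kind: round(
        pct_delta(decode_tps["kvarn4/kvarn4"][kind], decode_tps["q5_0/q4_1"][kind]), 1
    )
    for kind in ("narr", "code")
}

# Context ceiling of each KV type as a multiple of the q5_0/q4_1 ceiling.
ctx_gain = {
    name: round(ceiling / ctx_ceiling_k["q5_0/q4_1"], 2)
    for name, ceiling in ctx_ceiling_k.items()
}

print(decode_delta)
print(ctx_gain)
```

On these inputs the decode gap works out to roughly 2 to 3 percent in favor of the reference, and the kvarn4 ceiling to about 1.17 times the q5_0/q4_1 ceiling.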

Quality — core 8-pack → /150

pack (max)	kvarn4 off	kvarn4 on	q5_0/q4_1 off	q5_0/q4_1 on
toolcall-15	14	14	14	15
instructfollow-15	13	14	12	15
structoutput-15	14	14	14	14
dataextract-15	10	9	9	8
reasonmath-15	12	11	12	12
bugfind-15	12	14	12	14
hermesagent-20	10	10	10	9
cli-40	19	17	19	20
TOTAL / 150	104	103	102	107

Notes: single run per arm, so treat ≤±5/150 as noise. off = --no-thinking, on = --enable-thinking. dataextract sits 8–10 across all four (a model number-formatting trait, not KV).
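The totals and the noise reading can be re-derived from the per-pack rows. A minimal Python sketch, using only the scores in the table above; the variable names and the `NOISE` constant (taken from the "≤±5/150" note) are illustrative:

```python
# Per-pack scores copied from the quality table above.
# Each entry: pack name -> (max score, [kvarn4 off, kvarn4 on, q5_0/q4_1 off, q5_0/q4_1 on])
packs = {
    "toolcall-15": (15, [14, 14, 14, 15]),
    "instructfollow-15": (15, [13, 14, 12, 15]),
    "structoutput-15": (15, [14, 14, 14, 14]),
    "dataextract-15": (15, [10, 9, 9, 8]),
    "reasonmath-15": (15, [12, 11, 12, 12]),
    "bugfind-15": (15, [12, 14, 12, 14]),
    "hermesagent-20": (20, [10, 10, 10, 9]),
    "cli-40": (40, [19, 17, 19, 20]),
}
arms = ["kvarn4 off", "kvarn4 on", "q5_0/q4_1 off", "q5_0/q4_1 on"]
NOISE = 5  # the post treats differences of up to 5 points out of 150 as noise

max_total = sum(mx for mx, _ in packs.values())
totals = {
    arm: sum(scores[i] for _, scores in packs.values())
    for i, arm in enumerate(arms)
}

# Compare like with like: same thinking mode, different KV cache type.
delta_off = totals["kvarn4 off"] - totals["q5_0/q4_1 off"]
delta_on = totals["kvarn4 on"] - totals["q5_0/q4_1 on"]
within_noise = abs(delta_off) <= NOISE and abs(delta_on) <= NOISE

for arm, total in totals.items():
    print(f"{arm}: {total}/{max_total} ({total / max_total:.0%})")
print("delta off:", delta_off, "delta on:", delta_on, "within noise:", within_noise)
```

Pairing the arms by thinking mode keeps the comparison at a single variable (the KV cache type), which is the comparison the table is meant to support.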

Takeaways

Quality-neutral. All four arms land 102–107/150 (68–71%); no pack systematically favors q5_0/q4_1. KVarN-4 holds the q5-class quality your KLD predicts — on real tool/instruct/extract/reasoning/agentic tasks, not just Wikitext.
Decode-speed-neutral. kvarn4 ≈ q5_0/q4_1 (57.6 vs 58.7 code TPS). The "slower" caveat is prompt-processing; decode is unaffected.
Context win. kvarn4 lifts the single-card ceiling ~196K → ~230K at q5-class fidelity (needle clean @72k). Full 262K fits only on kvarn2 — passed an easy needle, but I wouldn't trust 2-bit for code.
Gotcha for testers: the commit-pinned server-cuda-preview-v0.3.2-317c65e27e1e tag predates the KVarN merge (rejects kvarn* cache types — only turbo/TCQ). KVarN is only in the rolling server-cuda-preview-v0.3.2 (98caf25); had to pin that build's digest for an immutable handle.

tl;dr — on Ampere, kvarn4/kvarn4 is a free (quality- + decode-neutral) KV that buys ~1.2× context — it's now our default KV for single-card coder serving. Really nice work 🙌

4 replies

Anbeeld Jun 12, 2026
Author

I'm currently working on ironing out the implementation; the current one is pretty raw, with multiple caveats. Stay tuned.

tomByrer Jun 13, 2026

@noonghunna do you happen to manually review the test outputs?
Protokris has his own un-minifying test, & even though some tests 'fail', when he manually reviewed he found the actual failure to be minor enough to pass the test.

noonghunna Jun 13, 2026
Maintainer

Good question. Honest answer: the 8-pack is automated, not hand-graded scenario-by-scenario — but with two things that target exactly the failure mode you're describing.

Where it matters most we verify by execution, not string-match: the sandboxed packs (bugfind / cli / hermesagent) actually run the model's code or agent actions and check whether they work — so a cosmetically-different-but-correct answer passes on merit. The deterministic packs (toolcall / instructfollow / structoutput / dataextract / reasonmath) are stricter format/exact-match checks — those are the ones prone to the "trivially different but actually fine" false-fail.

And we hit your exact point before: an earlier keyword-list rubric was under-crediting real capability ~15–20 pp (correct answers failing on wording), so we swapped it for an outcome-based grader — baking the "is this failure actually minor?" judgment into the verifier instead of relying on a human to re-grade.

What we don't do is exhaustively hand-review all 150 — we spot-check fails (the dataextract 9/15 here traced to JSON numeric-typing, a known Qwen-family quirk, not a real miss). So treat the absolute /150 as a deliberately-conservative floor; what's solid is the comparison — kvarn4 vs q5/q4 both run the identical grader, so even if the floor under-counts, the relative "quality-neutral" read holds.

Haven't seen Protokris's test specifically, but the phenomenon's real — manual review is the true gold standard; we lean on execution-verification + spot-checks for throughput and try to be upfront that the auto-score is a floor, not a ceiling.
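To make the "trivially different but actually fine" false-fail concrete, here is a minimal Python sketch of a strict comparison next to a numeric-tolerant one. This is not the club-3090 grader: the function names, the normalization rules, and the sample invoice record are all hypothetical, chosen only to illustrate the JSON numeric-typing case mentioned above.

```python
import json


def strict_match(expected_json: str, actual_json: str) -> bool:
    """Exact comparison of parsed JSON: the string "3" does not equal the number 3."""
    return json.loads(expected_json) == json.loads(actual_json)


def _normalize(value):
    """Recursively turn numbers and numeric-looking strings into floats."""
    if isinstance(value, dict):
        return {key: _normalize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [_normalize(item) for item in value]
    if isinstance(value, bool):  # bool is a subclass of int, so keep it as-is
        return value
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value.replace(",", ""))  # accept "1,299.50"
        except ValueError:
            return value  # a genuine string such as "A-17"
    return value


def tolerant_match(expected_json: str, actual_json: str) -> bool:
    """Comparison that ignores number-vs-string typing but still checks values."""
    return _normalize(json.loads(expected_json)) == _normalize(json.loads(actual_json))


expected = '{"invoice": "A-17", "total": 1299.5, "items": 3}'
typed_as_strings = '{"invoice": "A-17", "total": "1,299.50", "items": "3"}'
wrong_value = '{"invoice": "A-17", "total": "1300", "items": "3"}'

print(strict_match(expected, typed_as_strings))    # False: typing differs
print(tolerant_match(expected, typed_as_strings))  # True: same values
print(tolerant_match(expected, wrong_value))       # False: value really differs
```

The tolerant version still fails a wrong value, so it only relaxes the typing check. Whether that relaxation is appropriate depends on whether the downstream consumer of the JSON cares about types.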

noonghunna Jun 13, 2026
Maintainer

One clarification worth adding, since Protokris's test is NVFP4 vs Q4_K_M — those are both weight quants (they compress the model's parameters). KVarN here is a different axis: it compresses the KV cache (the per-token context memory) and leaves the weights alone — the Qwopus coder is Q5_K_M either way. So Protokris is comparing weight fidelity; our kvarn4-vs-q5/q4 table is comparing KV-cache fidelity at fixed weights.

The "auto-grader is a conservative floor, manual review catches trivial fails" point applies to both, though — that's orthogonal to whether you're varying weights or KV.

Anbeeld · 2026-06-14T12:52:25Z

Anbeeld
Jun 14, 2026
Author

v0.3.2 Preview has been updated quite massively: an upstream merge (Gemma MTP etc.), a much more mature KVarN implementation (better performance, fixed VRAM consumption edge cases), more KV cache quant types (q6_1, q3_1 through q2_0, turbo4_tcq), and more. Public builds have switched to the updated default quant-pair setup, with TurboQuant left out because either standard quants or KVarN work better depending on the situation; turbo is still available if you build it yourself with the HALF or ALL quant flags.

I'll repeat my own KLD benchmarks soon-ish with a generous pair matrix and Gemma 4 31B model this time, but would also be happy for external feedback.

1 reply

noonghunna Jun 14, 2026
Maintainer

fantastic news matey! I was waiting for this. well done.

noonghunna · 2026-06-14T13:43:15Z

noonghunna
Jun 14, 2026
Maintainer

Thanks @Anbeeld — pulled the updated server-cuda-preview-v0.3.2 (digest f229f79…) and gave it a spin on Ampere.

Good news first: KVarN now works cleanly on Gemma 4 too, not just Qwen. gemma-4-12B Q8_0 (a coder fine-tune) serves coherent output single-3090 with --cache-type-k/v kvarn4, ~52 tok/s decode, reasoning channel parsed. So the "kvarn4 = free q5-class KV" result extends past the Qwen family. 👍 (side note: it still prints kvarn_k4v4_g128 is experimental; only kvarn_k4v2_g128 is reference-aligned — is k4v2 your recommendation for production, or is k4v4 fine on the matured implementation?)
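For reproducing this kind of run, the launch arguments can be assembled as below. This is a sketch under assumptions: the binary name `llama-server`, the model path, and the context size are placeholders that do not come from the post, and the `kvarn4` cache type is only accepted by builds that include the KVarN merge. The `-fa on` and `--cache-type-k` / `--cache-type-v` flags are the ones quoted in this thread.

```python
import shlex


def build_server_cmd(model_path, ctx_tokens, cache_type="kvarn4", binary="llama-server"):
    """Assemble an argv list that sets the same cache type for K and V.

    `binary`, `model_path` and `ctx_tokens` are placeholders for illustration.
    """
    return [
        binary,
        "-m", model_path,
        "-c", str(ctx_tokens),
        "-fa", "on",                     # flash attention, as in the setup above
        "--cache-type-k", cache_type,
        "--cache-type-v", cache_type,
    ]


cmd = build_server_cmd("models/coder-q8_0.gguf", 32768)  # hypothetical path and size
print(shlex.join(cmd))
```

Keeping the arguments as a list rather than a single string means the same value can be passed to `subprocess.run` without shell quoting issues.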

The thing I wanted to flag — I tried the Gemma 4 MTP path from your changelog ("including E2B/E4B assistants") with cortexist/gemma-4-12B-it-assistant-MTP-GGUF as a -md draft, and it fails on both your f229f79 and mainline server-cuda (5535de):

error loading model: unknown model architecture: 'gemma4_assistant'

Tracing it: the runtime PR (#23398) merged to master 2026-06-07, but the converter (#23727, Gemma4AssistantForCausalLM) is still open — so cortexist's GGUF looks fork-converted, with an arch tag the merged runtime doesn't recognize.

Question: is there a gemma-4 assistant GGUF you've confirmed loads on the merged path — or what arch/flags does your inherited gemma4 MTP expect (--spec-type draft-mtp -md …)? Would love to A/B assistant-MTP vs no-spec on the coder, but I'm stuck finding a draft artifact that loads. 🙏

2 replies

Anbeeld Jun 14, 2026
Author

Regarding MTP, I'm not making any adjustments to that in my fork, just following what upstream already has. My scope of work is DFlash and fun stuff like KVarN, and I've got enough conflicts to resolve with every merge as it is. 😛

noonghunna Jun 14, 2026
Maintainer

haha, yes i can imagine. No worries, I hope it gets merged upstream soon.
