Open-source TurboQuant · PolarQuant + QJL · Decoder LLM KV compression
The Google Research blog reports TurboQuant numbers on open LLMs (including Gemma, Mistral, Llama-3.1-8B-Instruct):
| Aspect | Summary |
|---|---|
| KV memory | Cache size reduced by roughly 6× or more (e.g. 3 bits per key/value without fine-tuning). |
| Attention speed | Up to about 8× faster attention logits vs. their described 32-bit JAX baseline on H100 (for 4-bit TurboQuant in their setup). |
| Downstream quality | On long context and needle-in-a-haystack they report strong / near-full results at this compression; comparisons to KIVI and others use LongBench and related benchmarks (LongBench, ZeroSCROLLS, RULER, L-Eval, etc.). |
| Vector search | For high-dimensional search (e.g. GloVe, d=200) — competitive 1@k recall vs. PQ and RaBitQ in a data-oblivious setting. |
These figures come from the original paper and its experimental setup; in your environment, actual memory savings, latency, and quality depend on the model, bit width, context length, and whether you use the fused Triton kernels or dequantize into standard attention.
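The memory row can be sanity-checked with simple arithmetic. The sketch below assumes Llama-3.1-8B-style KV shapes (32 layers, 8 KV heads under GQA, head_dim 128); `kv_cache_bytes` is a hypothetical helper, not part of the library, and the naive bits ratio ignores per-block scales and metadata, so it only approximates the headline compression figure.

```python
# Back-of-envelope KV-cache sizing (hypothetical helper, not turboquant API).
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_elem):
    # K and V each hold num_layers * num_kv_heads * head_dim * seq_len elements
    elems = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return elems * bits_per_elem / 8

# Llama-3.1-8B-style shapes at a 128k context
fp16 = kv_cache_bytes(32, 8, 128, 128_000, 16)
tq3  = kv_cache_bytes(32, 8, 128, 128_000, 3)   # 3-bit codes
print(f"fp16: {fp16 / 2**30:.1f} GiB, 3-bit: {tq3 / 2**30:.1f} GiB, "
      f"naive ratio: {fp16 / tq3:.2f}x")
```

The naive 16/3 ≈ 5.3× ratio is below the reported "roughly 6× or more" because the reported figures also reflect the paper's specific baseline and overhead accounting.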
- `TurboQuantProd`: PolarQuant + QJL, 1.5 / 2 / 2.5 / 3 / 4 bits, configurable `head_dim`.
- Compress / restore K/V pairs: `compress` / `decompress`, `quantize_kv` (including returning the compressed representation).
- Optional calibration from batches/tensors: `calibrate_turboquant_from_tensor`, `calibrate_turboquant_from_batches`, `CalibrationMode` (see `turboquant.calibration`).
- Scores from compressed K: `quantized_attention_scores_triton` — `q @ k^T / sqrt(d)` without fully materializing unpacked K; causal mask, additive mask, GQA/MQA (`num_kv_heads`).
- Fused attention (softmax × V) on compressed K/V: `quantized_attention_fused_triton`; supported `head_dim`: 16, 32, 64, 128, 256.
- Paged KV in the vLLM style: `quantized_attention_fused_triton_paged`, packing via `pack_dense_kv_to_paged`; zero-copy views over allocator buffers — `turboquant.vllm_pack` (`paged_kv_views_from_allocator_buffer`, `uint8_pages_to_paged_dict`).
- Centroid preload: `TurboQuantProd.preload_centroids(...)` (useful on first run for `bits >= 4`).
- Fused decode path without Triton: `quantized_attention_fused_auto` uses a portable SDPA fallback on MPS. Works with the decoder fused wrappers in the HF integration as a non-Triton backend (`enable_decoder_fused_attention`). `TurboQuantProd(device="mps")` is supported; the aliases `device="metal"` and `device="mlx"` map to MPS.
- `TurboQuantModel`: model wrapper + quantizer; legacy path via per-layer tuples `quantize_past_key_values` / `dequantize_past_key_values`.
- `TurboQuantDynamicCache`: compressed KV on full-width layers; sliding-window layers stay ordinary HF layers.
- `turboquant_encoder_decoder_cache` / `TurboQuantEncoderDecoderCache` (or `TurboQuantModel.make_encoder_decoder_cache()`): HF `EncoderDecoderCache` for cross-attention KV (encoder memory). T5 / mT5: `head_dim = config.d_kv`. M2M-100 / NLLB (dense): Hub configs often use `model_type: m2m_100` and `M2M100ForConditionalGeneration`; `head_dim = config.d_model // config.decoder_attention_heads` (typically 64 for the 600M/1.3B/418M checkpoints, 128 for `nllb-200-3.3B` with `d_model=2048`). The default path keeps stock HF attention on decompressed KV for T5 (unscaled logits); MarianMT uses the same cross-KV cache idea but is not covered by decoder fused attention (`supported_fused_attention_architectures`) until a dedicated wrapper exists.
- Export to per-layer paged tensors: `turboquant.hf_cache.export_cache_to_paged_per_layer`.
- Decoder fused attention (CUDA + Triton) without a full dequant every step, for several architectures: `enable_decoder_fused_attention` / `install_decoder_fused_attention` and related APIs in `turboquant.hf_fused_attention`. Includes Llama (incl. Llama 3.3), Mistral, Qwen2, Qwen3, Gemma2 (falls back to stock on softcap / non-standard query scaling), Phi-3 / Phi-4 / Phi-4-mini (HF alias `phi4`, same `Phi3Attention`), the Phi-4 Multimodal text decoder (`phi4_multimodal`, `Phi4MultimodalAttention`; vision/audio towers unchanged), DeepSeek-V2/V3 (MLA; aliases `deepseek`, `deepseek_r1`, `deepseek_r2`) with a safe stock fallback when fused parity constraints are not met, InternLM2 / InternLM2.5 (`internlm2`, fused `wqkv`) and InternLM3 (`internlm3`) via Hub `trust_remote_code` (see `turboquant.hf_internlm_fused`), Cohere, Granite, Starcoder2 — full list: `supported_fused_attention_architectures()`.
- Cache modes: e.g. `triton_fused_layers`, `strict_reencode`, optional `hybrid_float_cache` (memory vs. dequant trade-off on long context — see the code and `examples/hf_generate_turboquant_cache.py`).
- vLLM v1: patch, `--kv-cache-dtype turboquant`, `--turboquant-bits` — `integrations/vllm_upstream/README.md`.
- llama.cpp: same paged layout as vLLM, `*.tqmeta` sidecar, `turboquant.llama_cpp_pack` — `integrations/llama_cpp/README.md`.
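The vLLM-style paged KV layout mentioned above boils down to block-table addressing: tokens live in fixed-size pages, and a block table maps logical token positions to physical pages. A minimal stdlib sketch of the idea (illustrative only, not the actual `pack_dense_kv_to_paged` layout):

```python
# vLLM-style paged KV addressing: a block table maps a sequence's logical
# token index to a (physical_page, slot_in_page) location.
PAGE_SIZE = 16

def locate(block_table, token_idx):
    """Map a logical token index to (physical_page, slot_in_page)."""
    page = block_table[token_idx // PAGE_SIZE]
    return page, token_idx % PAGE_SIZE

# A 40-token sequence spread over non-contiguous physical pages 7, 2, 9
block_table = [7, 2, 9]
print(locate(block_table, 0))   # first token -> page 7, slot 0
print(locate(block_table, 17))  # -> page 2, slot 1
print(locate(block_table, 39))  # last token -> page 9, slot 7
```

Because pages need not be contiguous, sequences can grow without reallocating the whole cache; the fused paged kernel only needs the block table to gather compressed K/V.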
- `examples/simple_usage.py` — inner product / cosine metrics on the score matrix (not only vector MSE).
- `examples/hf_generate_turboquant_cache.py` — generation with a compressed cache; `--fused` for CUDA + Triton.
- `tests/test_hf_nllb_distilled_600m_hub_config.py` — optional Hub fetch of the `facebook/nllb-200-distilled-600M` config only (no weights): asserts `model_type == m2m_100`, `head_dim = 64`, and the `turboquant_encoder_decoder_cache` layer count (skips if offline / no cache).
- `benchmarks/needle_in_a_haystack_simple.py`, `benchmarks/longbench_simple.py` — simplified runs.
- High MSE between original and reconstructed K/V is expected. The method optimizes dot products (attention logits), not exact L2 reconstruction; judge quality by score distortion (see `examples/simple_usage.py`).
- Without a fused backend, standard HF attention after a cache update may need to dequantize the entire prefix to float K/V — the cost grows with context length. To lower per-step cost, use the fused backends: CUDA + Triton, or Apple Metal (MPS fallback path) (`examples/hf_generate_turboquant_cache.py --fused`, `turboquant/hf_fused_attention.py`).
- For Gemma2, take the head size from `config.head_dim`, not only `hidden_size // num_attention_heads`.
- Not every setup is covered by the fused layer (sliding window / some masks / architecture-specific constraints). For DeepSeek MLA, unsupported fused conditions automatically fall back to stock HF attention — see `turboquant/hf_fused_attention.py` and the integration READMEs.
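The fused score kernels support GQA/MQA via `num_kv_heads`. The sketch below shows the standard grouping, where each KV head serves a contiguous block of query heads; this is an assumption about the common scheme, not a claim about this repo's internals.

```python
# Standard GQA/MQA head grouping (common scheme, not turboquant internals):
# each KV head is shared by num_q_heads // num_kv_heads consecutive query heads.
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    assert num_q_heads % num_kv_heads == 0
    group = num_q_heads // num_kv_heads  # query heads per KV head
    return q_head // group

# Llama-3.1-8B-style: 32 query heads share 8 KV heads (groups of 4)
print([kv_head_for_query_head(h, 32, 8) for h in range(8)])
# MQA is the num_kv_heads == 1 case; MHA is num_kv_heads == num_q_heads.
```

This is why the compressed cache only needs to store `num_kv_heads` heads per layer: the kernel resolves each query head to its shared KV head at score time.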
From PyPI (distribution name `turboquant-kv`; the Python import remains `turboquant`):

```bash
pip install turboquant-kv
pip install "turboquant-kv[triton]"   # GPU: Triton; on Windows pulls triton-windows
pip install "turboquant-kv[hf]"       # transformers — dynamic cache and HF examples
```

From source:

```bash
git clone https://github.com/hackimov/turboquant-kv.git
cd turboquant-kv
python -m venv .venv
# Windows: .venv\Scripts\activate
# Linux/macOS: source .venv/bin/activate
pip install -e .
pip install -e ".[triton]"   # GPU: Triton; on Windows pulls triton-windows
pip install -e ".[hf]"       # transformers — dynamic cache and HF examples
```

Run everything from the repository root, with the venv activated and `pip install -e .` done (plus the extras below as needed).
This repo’s maintainers often use a local virtualenv named `venv_turboquant/` (listed in `.gitignore`). On Windows, after `.\venv_turboquant\Scripts\activate`, run the same commands as below; or call the interpreter directly, e.g. `venv_turboquant\Scripts\python.exe examples\simple_usage.py`.
Core without HF — compress/decompress demo and inner-product metrics (MSE / cosine on the score matrix):

```bash
python examples/simple_usage.py
# reproducible on CPU:
python examples/simple_usage.py --seed 42 --device cpu
```

How those metrics are built (same tensors as in the script):
```mermaid
flowchart LR
    K["K (float)"] --> C["TurboQuant\ncompress"]
    C --> Kr["K' (reconstructed)"]
    K --> S["S = K K^T / sqrt(d)"]
    Kr --> Sr["S' = K' K'^T / sqrt(d)"]
    S --> M["MSE(S, S')\ncosine(flat S, flat S')"]
    Sr --> M
```
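The same pipeline can be reproduced in a few lines of plain Python. The 2×2 `K` below is a toy stand-in (the script itself uses torch tensors and a real reconstruction); the point is only to show how the score matrices and the flat-cosine / MSE numbers are formed.

```python
# Score-distortion metric: compare S = K K^T / sqrt(d) for original vs
# "reconstructed" keys, then take cosine over the flattened matrices and MSE.
import math

def scores(K):
    d = len(K[0])
    return [[sum(a * b for a, b in zip(r1, r2)) / math.sqrt(d) for r2 in K]
            for r1 in K]

def flat_cosine(A, B):
    a = [x for row in A for x in row]
    b = [x for row in B for x in row]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

K  = [[1.0, 0.0], [0.0, 1.0]]
Kr = [[0.9, 0.1], [0.1, 0.9]]   # crude stand-in for a reconstructed K
S, Sr = scores(K), scores(Kr)
mse = sum((x - y) ** 2 for rx, ry in zip(S, Sr) for x, y in zip(rx, ry)) / 4
print(round(flat_cosine(S, Sr), 3), round(mse, 4))  # high cosine, small MSE
```

Note that the per-vector reconstruction error of `Kr` is visible, yet the score matrices stay close; that is exactly the property the README's notes say to judge quality by.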
Example output (CPU, `python examples/simple_usage.py --seed 42 --device cpu`) — stable for that seed/device pair; CUDA or another seed will differ:

```text
Device: cpu
Compressing KV...
Decompressing KV...
MSE inner product : 0.580963
Cosine similarity : 0.868170
--- attention score matrix (K K^T / sqrt(d)) ---
Cosine similarity (higher better) [#######################---] 0.868170
MSE inner product (lower better, cap=2) [##################--------] 0.580963
[OK] Good quality (cosine > 0.85)
```
Hugging Face + compressed KV in `generate` — requires `pip install -e ".[hf]"`:

```bash
python examples/hf_generate_turboquant_cache.py
python examples/hf_generate_turboquant_cache.py --from-config   # no Hub weight download
```

Fused attention (auto backend):
```bash
# CUDA + Triton path (recommended on NVIDIA):
pip install -e ".[triton]"
python examples/hf_generate_turboquant_cache.py --fused
python examples/hf_generate_turboquant_cache.py --fused --fused-arch mistral
# if the attention type is a subclass of a registered implementation:
python examples/hf_generate_turboquant_cache.py --fused --allow-attention-subclass
```

On Apple Silicon, the same `--fused` flag uses the Metal/MPS fallback path automatically (no Triton required).
The script also supports `--bits`, `--strict-reencode`, and `--hybrid-float-cache` — see `python examples/hf_generate_turboquant_cache.py -h`.
Proxy benchmarks (after `pip install -e .`):

```bash
python benchmarks/needle_in_a_haystack_simple.py
python benchmarks/longbench_simple.py
```

Minimal core API usage:

```python
import torch
from turboquant import TurboQuantProd

quantizer = TurboQuantProd(bits=3, head_dim=128)
k = torch.randn(1, 8, 256, 128)
v = torch.randn(1, 8, 256, 128)
compressed = quantizer.compress(k, v)
k_rec, v_rec = quantizer.decompress(compressed)
```

For a full model: `TurboQuantModel`, `make_dynamic_cache()`, and optionally `enable_decoder_fused_attention()` — see `examples/hf_generate_turboquant_cache.py`.
FAISS-like, in-memory approximate ANN with native TurboQuant compression:
```python
import torch
from turboquant.search import VectorIndex

index = VectorIndex(
    dim=128,
    bits=3,
    metric="ip",   # "ip" | "cosine" | "l2"
    device="cpu",  # or "cuda"
    seed=42,
)
xb = torch.randn(10000, 128)  # database vectors
xq = torch.randn(4, 128)      # query vectors
index.add(xb)                 # optional: index.add(xb, ids=custom_ids)
scores, ids = index.search(xq, k=10)
print(scores.shape, ids.shape)  # torch.Size([4, 10]) torch.Size([4, 10])
```

Notes:
- `index.ntotal` returns the number of indexed vectors.
- `index.reset()` clears all compressed vectors.
- Search runs in chunks (`search_chunk_size`) to bound memory.
Example script: `examples/vector_search_simple.py`.
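Chunked search of the kind `search_chunk_size` controls can be sketched with the stdlib alone: score one chunk of the database at a time and keep only a running top-k heap, so peak memory is bounded by the chunk size rather than the database size. This is an illustration of the idea, not `VectorIndex` internals.

```python
# Chunked brute-force top-k by inner product with a bounded-size min-heap.
import heapq
import random

random.seed(0)
db = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
q  = [random.gauss(0, 1) for _ in range(8)]

def search_chunked(q, db, k, chunk=128):
    heap = []  # min-heap of (score, id): holds the k best seen so far
    for start in range(0, len(db), chunk):
        for i, vec in enumerate(db[start:start + chunk], start):
            s = sum(a * b for a, b in zip(q, vec))
            if len(heap) < k:
                heapq.heappush(heap, (s, i))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, i))
    return sorted(heap, reverse=True)  # best first

top = search_chunked(q, db, k=5)
print([i for _, i in top])
```

The result is identical to scoring the whole database at once; only the traversal order and memory profile change, which is what makes the chunk size a pure memory knob.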
| Target | Document |
|---|---|
| vLLM | integrations/vllm_upstream/README.md |
| llama.cpp | integrations/llama_cpp/README.md |
In-code notes: `VLLM_INTEGRATION_NOTES`, `LLAMA_CPP_INTEGRATION_NOTES` in `turboquant/hf_cache.py`.
```bash
pip install -r requirements.txt
# for test_hf_*: pip install -e ".[hf]"
python -m unittest discover -s tests -p "test_*.py" -v
```

Without `[hf]`, some tests are skipped; without GPU/Triton, CUDA tests are skipped.