v0.5.0

@jagmarques released this 12 Jun 23:57 · 29 commits to main since this release
Tag v0.5.0 · commit 99203fa

Adds a standalone quant-only compression path and several improvements to the eviction pipeline.

What changed

  • compress_kv_cache(past_key_values, mode="quant_only") -- near-lossless compression without eviction. K3V2 with pb=0 raises perplexity (PPL) by 0.276% at 6.1x compression (Mistral-7B-v0.1, wikitext-2, n=161 chunks). Validated on 7 model architectures; NIAH recall is preserved up to 32K context on A100.
  • key_bits / value_bits on nexusquant_evict for asymmetric compression (K3V2: 3-bit keys, 2-bit values).
  • layer_bit_profile and compress_layers params.
  • NexusQuantEvictTruncate: physically removes evicted tokens, saving real GPU memory.
  • New quality preset "asym" (K3V2 at 60% eviction).
  • Triton E8 GPU kernel + fused dequant-matmul (nexusquant/kernels/).
  • vLLM PagedAttention integration.
  • SWA-aware compression for hybrid-attention models (Gemma-2, MiMo-style).
  • Fixed rope_scaling propagation for Llama-3.1-family models.
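
As a rough sanity check on the 6.1x figure quoted for K3V2, the sketch below compares it with the ideal ratio for 3-bit keys and 2-bit values. It assumes a 16-bit (fp16/bf16) baseline cache and key and value tensors of equal shape; neither assumption is stated in these notes, and the helper names are hypothetical, not part of nexusquant. Under those assumptions the ideal ratio is about 6.4x, so the reported 6.1x would correspond to a small per-element overhead, plausibly quantization metadata such as scales.

```python
# Back-of-the-envelope check of the K3V2 compression ratio.
# Assumptions (not from the release notes): 16-bit baseline cache,
# and K and V tensors with the same number of elements.

def ideal_compression_ratio(key_bits, value_bits, baseline_bits=16):
    """Baseline size divided by quantized size, ignoring all metadata."""
    avg_bits = (key_bits + value_bits) / 2  # equal K and V element counts assumed
    return baseline_bits / avg_bits


def implied_bits_per_element(ratio, baseline_bits=16):
    """Effective storage cost per element implied by a measured ratio."""
    return baseline_bits / ratio


ideal = ideal_compression_ratio(3, 2)       # K3V2: 3-bit keys, 2-bit values
effective = implied_bits_per_element(6.1)   # ratio reported in these notes

print(f"ideal K3V2 ratio: {ideal:.2f}x")
print(f"bits/element implied by 6.1x: {effective:.3f}")
print(f"implied overhead: {effective - 2.5:.3f} bits/element")
```

The overhead number is only an inference from the two ratios; the actual storage layout is defined by the library, not by this arithmetic.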

Requirements note

The quant-only path works with transformers >= 4.46. The eviction hook path requires transformers >= 5.0 and torch >= 2.4 (see README compatibility note).
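
The version thresholds above can be checked before choosing a code path. The following is a minimal sketch using only the standard library; the helper names (major_minor, supported_paths, installed_version) are hypothetical and not part of nexusquant, and only the major.minor components of each version string are compared.

```python
import re
from importlib import metadata


def major_minor(version):
    """Extract (major, minor) from strings like '4.46.3' or '2.4.0+cu121'."""
    m = re.match(r"(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(m.group(1)), int(m.group(2))


def supported_paths(transformers_version, torch_version):
    """Map the thresholds in the requirements note onto installed versions."""
    tf = major_minor(transformers_version)
    pt = major_minor(torch_version)
    return {
        "quant_only": tf >= (4, 46),
        "eviction_hooks": tf >= (5, 0) and pt >= (2, 4),
    }


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    tf, pt = installed_version("transformers"), installed_version("torch")
    if tf is None or pt is None:
        print("transformers and/or torch is not installed")
    else:
        print(supported_paths(tf, pt))
```

For example, supported_paths("4.46.3", "2.4.0+cu121") reports the quant-only path as available and the eviction hook path as unavailable.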

Install

PyPI 0.5.0 is pending trusted-publisher configuration. In the meantime, install from source:

pip install git+https://github.com/jagmarques/nexusquant.git