Release v0.5.0 · jagmarques/nexusquant

Adds a standalone quant-only compression path and several improvements to the eviction pipeline.

What changed

compress_kv_cache(past_key_values, mode="quant_only"): near-lossless compression without eviction. K3V2 with pb=0 raises perplexity by 0.276% at 6.1x compression (Mistral-7B-v0.1, wikitext-2, n=161 chunks). Validated on 7 model architectures; NIAH (needle-in-a-haystack) recall is preserved up to 32K context on A100.
key_bits / value_bits on nexusquant_evict for asymmetric compression (K3V2: 3-bit keys, 2-bit values).
New layer_bit_profile and compress_layers parameters.
NexusQuantEvictTruncate: physically removes evicted tokens, saving real GPU memory.
New quality preset "asym" (K3V2 at 60% eviction).
Triton E8 GPU kernel + fused dequant-matmul (nexusquant/kernels/).
vLLM PagedAttention integration.
SWA-aware compression for hybrid-attention models (Gemma-2, MiMo-style).
rope_scaling propagation fixed for Llama-3.1-family.
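
As a rough sanity check on the K3V2 figures above, the sketch below compares the ideal compression ratio of 3-bit keys and 2-bit values with the reported 6.1x. It assumes a 16-bit (fp16/bf16) baseline and equally sized key and value tensors; neither is stated in these notes, and the function names are hypothetical rather than part of nexusquant.

```python
def ideal_ratio(key_bits, value_bits, baseline_bits=16):
    """Compression ratio if only the quantized payload is stored.

    Assumes keys and values hold the same number of elements, so the
    average payload is the mean of the two bit widths.
    """
    avg_bits = (key_bits + value_bits) / 2
    return baseline_bits / avg_bits


def implied_bits_per_element(reported_ratio, baseline_bits=16):
    """Effective bits per element implied by a measured compression ratio."""
    return baseline_bits / reported_ratio


if __name__ == "__main__":
    # K3V2: 3-bit keys, 2-bit values (from the release notes).
    print(f"ideal K3V2 ratio: {ideal_ratio(3, 2):.2f}x")
    # 6.1x is the ratio reported for the quant-only path.
    print(f"effective bits at 6.1x: {implied_bits_per_element(6.1):.3f}")
```

Under these assumptions the ideal ratio is 6.4x, so the reported 6.1x corresponds to roughly 2.62 effective bits per element. The difference is presumably per-block metadata such as scales, though the release notes do not say so.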

Requirements note

The quant-only path works with transformers >= 4.46. The eviction hook path requires transformers >= 5.0 and torch >= 2.4 (see README compatibility note).
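
The version requirements above can be checked before choosing a code path. The following is a minimal sketch, not part of nexusquant: parse_version and available_paths are hypothetical helpers, and the parser deliberately ignores pre-release and local build suffixes.

```python
import re


def parse_version(text):
    """Return up to three leading numeric components as a tuple.

    Local build tags such as "+cu121" are dropped. Pre-release markers
    (rc, dev) are ignored, so "5.0.0rc1" is treated like "5.0.0".
    """
    public = text.split("+")[0]
    return tuple(int(p) for p in re.findall(r"\d+", public)[:3])


def available_paths(transformers_version, torch_version):
    """List the nexusquant paths whose stated requirements are met."""
    tf = parse_version(transformers_version)
    pt = parse_version(torch_version)
    paths = []
    # Quant-only path: transformers >= 4.46 (no torch minimum is stated).
    if tf >= (4, 46):
        paths.append("quant_only")
    # Eviction hook path: transformers >= 5.0 and torch >= 2.4.
    if tf >= (5, 0) and pt >= (2, 4):
        paths.append("eviction")
    return paths


if __name__ == "__main__":
    print(available_paths("4.46.0", "2.3.1"))
    print(available_paths("5.0.0", "2.4.0+cu121"))
```

In a real environment the two arguments would typically be transformers.__version__ and torch.__version__.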

Install

PyPI 0.5.0 is pending trusted-publisher configuration. In the meantime, install from source:

pip install git+https://github.com/jagmarques/nexusquant.git
