HieraSparse is a sparse KV-cache system for LLM inference that reduces memory footprint and attention compute while preserving generation quality. It combines a hierarchical, block-based memory layout with N:M structured sparse attention kernels and near-zero-overhead online compression.
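As a rough illustration of the N:M structured pattern (a minimal NumPy sketch of a 2:4 rule, not HieraSparse's API — `nm_prune` is a hypothetical helper): in every group of M consecutive entries, only the N largest-magnitude entries survive, giving the kernels a fixed, hardware-friendly sparse layout.

```python
import numpy as np

def nm_prune(x, n=2, m=4):
    """Keep the n largest-magnitude entries in every group of m
    consecutive elements along the last axis; zero out the rest.
    Illustrative only -- not the HieraSparse kernel interface."""
    groups = x.reshape(-1, m)
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(x.shape)

scores = np.array([[0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.01, 0.6]])
pruned = nm_prune(scores)  # exactly two nonzeros per group of four
```

Because every group of four holds exactly two survivors, the sparse layout is predictable, which is what lets a GPU kernel skip the zeroed entries without per-row bookkeeping.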
- NVIDIA GPU (pre-tuned kernels: L40S only; tuning scripts provided for other GPUs in the same architecture generation)
- CUDA 12.8 or above
```bash
bash scripts/install_hierasparse.sh
```

This creates a `hierasparse` conda environment with Python 3.10, PyTorch 2.10, flash-attn, and TileLang.
```bash
conda activate hierasparse
python example/generation.py \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --cache hierasparse \
    --block_seq_size 64 \
    --prune_key_prefill_ratio 0.5 \
    --prune_value_prefill_ratio 0.5
```

Run each task after installation:
```bash
# Quality evaluation on LongBench (~300 min full, ~30 min fast subset)
bash scripts/bench_longbench.sh

# Compression kernel latency
bash scripts/bench_compression.sh

# Attention kernel latency (prefill + decode)
bash scripts/bench_kernel.sh

# Optimization ablation
bash scripts/bench_optimization.sh

# Baseline comparison (requires installing MUSTAFAR separately)
bash scripts/bench_mustafar.sh

# Layer-wise breakdown vs. sequence length
bash scripts/bench_layer.sh

# End-to-end generation latency and memory usage
bash scripts/bench_e2e.sh
```

Repository layout:

```
hierasparse/
    caches/             # KV cache implementations (dense, compressed, hierarchical)
    kernels/            # Sparse prefill/decode attention and compression kernels
    models/             # Patched model classes (Llama, Mistral, Qwen3)
    interface.py        # HuggingFace attention interface wiring
    operators.py        # Kernel dispatch logic
    prune_method.py     # Pruning/sparsification methods
    compress_method.py  # Compression methods
archived_kernels/       # Pre-compiled kernel sources for L40S
scripts/                # Installation and benchmark scripts
benchmark/              # Benchmark scripts (quality + efficiency)
example/                # generation.py end-to-end example
```
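The block-based layout under `caches/` can be pictured with a toy model — this is a speculative sketch of the general idea (fixed-size token blocks that can be evicted or compressed wholesale, which is what `--block_seq_size` controls), not the actual `hierasparse` cache implementation:

```python
from collections import defaultdict

class BlockKVCache:
    """Toy block-based KV cache: tokens land in fixed-size blocks so
    whole blocks can be pruned or compressed independently.
    Illustrative sketch only -- not the HieraSparse cache classes."""

    def __init__(self, block_seq_size=64):
        self.block_seq_size = block_seq_size
        self.blocks = defaultdict(list)  # block index -> list of (k, v)

    def append(self, pos, k, v):
        # token position maps to a block by integer division
        self.blocks[pos // self.block_seq_size].append((k, v))

    def evict_block(self, block_idx):
        # dropping one block frees block_seq_size entries at once
        return self.blocks.pop(block_idx, None)

cache = BlockKVCache(block_seq_size=4)
for pos in range(10):
    cache.append(pos, f"k{pos}", f"v{pos}")
print(sorted(cache.blocks))  # [0, 1, 2]
cache.evict_block(0)
print(sorted(cache.blocks))  # [1, 2]
```

Operating on whole blocks rather than individual tokens keeps the metadata small and the memory layout contiguous, which is the usual motivation for block-granular KV caches.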