Skip to content

lqhl/rabitq-rs

Repository files navigation

RaBitQ-rs

Crates.io Documentation Build Status License Rust Version

Pure Rust implementation of advanced vector quantization and search algorithms:

  • RaBitQ quantization with IVF (Inverted File) search - Production ready
  • MSTG (Multi-Scale Tree Graph) index - ⚠️ Experimental (v0.5.0)

Up to 32× memory compression with significantly higher accuracy than traditional Product Quantization (PQ) or Scalar Quantization (SQ).

Why RaBitQ?

RaBitQ delivers exceptional compression-accuracy trade-offs for approximate nearest neighbor search:

  • vs. Raw Vectors: Achieve up to 32× memory reduction with configurable compression ratios
  • vs. PQ/SQ: Substantially higher accuracy, especially at high recall targets (90%+)
  • How it Works: Combines 1-bit binary codes with multi-bit refinement codes for precise distance estimation

The quantization applies to residual vectors (data - centroid), enabling accurate reconstruction even at extreme compression rates.

Why This Rust Implementation?

This library provides a feature-complete RaBitQ + IVF search engine with all optimizations from the original C++ implementation:

  • Faster Config Mode: 100-500× faster index building with <1% accuracy loss
  • FHT Rotation: Fast Hadamard Transform for efficient orthonormal rotations
  • SIMD Accelerated: Optimized distance computations on modern CPUs
  • Memory Safe: No segfaults, no unsafe code in dependencies
  • Complete IVF Pipeline: Training (built-in k-means or FAISS-compatible), search, persistence

MSTG: Multi-Scale Tree Graph Index (⚠️ Experimental)

New in v0.5.0: MSTG is an experimental hybrid index combining hierarchical clustering with graph navigation.

⚠️ Experimental Status: MSTG is under active development and lacks comprehensive testing and tuning. For production workloads, we recommend using the battle-tested IVF+RaBitQ index. MSTG is provided for research and experimentation purposes.

Key Features

  • Hierarchical Balanced Clustering: Adaptive multi-scale clustering with configurable branching factor
  • Closure Assignment: Vectors can belong to multiple clusters (RNG rule) for better recall
  • HNSW Graph Navigation: Fast approximate nearest centroid search
  • Flexible Precision: FP32, BF16, FP16, or INT8 for centroids
  • Dynamic Pruning: Adaptive cluster selection based on query-centroid distances
  • Python Bindings: Seamless integration with NumPy and ann-benchmarks

When to Use MSTG vs IVF+RaBitQ

Feature IVF+RaBitQ MSTG (Experimental)
Maturity ✅ Production Ready ⚠️ Experimental
Testing ✅ Comprehensive ⚠️ Limited
Best For All use cases (10K-10M+) Research & experimentation
Recall 85-95% (well-tuned) 90-98% (theoretical)
Build Time Fast (k-means) Moderate (hierarchical)
Query Speed Fast Very Fast (HNSW navigation)
Memory Low Low-Medium (configurable)
Stability ✅ Stable ⚠️ API may change
Documentation ✅ Complete 🚧 In progress

Recommendation: Use IVF+RaBitQ for production workloads. It's well-tested, documented, and optimized. Consider MSTG only if you're researching advanced indexing techniques or willing to help with testing and tuning.

Platform Support

x86_64 (Intel/AMD): ✅ Fully supported and tested
ARM64 (Apple Silicon, ARM servers): ⚠️ Known issues - See ARM64_KNOWN_ISSUES.md

ARM64 Notice: The current implementation has critical bugs on ARM64/AArch64 platforms that affect both L2 and Inner Product metrics. Distance calculations produce incorrect results, making the library unreliable on Apple Silicon Macs and ARM servers. We're actively investigating these issues. For now, production use on ARM64 is NOT recommended.

Quick Start

Add to your Cargo.toml:

[dependencies]
rabitq-rs = "0.5"

IVF+RaBitQ Index (Recommended for Production)

Build an IVF+RaBitQ index and search in ~10 lines:

use rabitq_rs::{IvfRabitqIndex, SearchParams, Metric, RotatorType};
use rand::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rng = StdRng::seed_from_u64(42);
    let dim = 128;

    // Generate 10,000 random vectors
    let dataset: Vec<Vec<f32>> = (0..10_000)
        .map(|_| (0..dim).map(|_| rng.gen::<f32>()).collect())
        .collect();

    // Train index with 256 clusters, 7-bit quantization
    let index = IvfRabitqIndex::train(
        &dataset,
        256,                         // nlist (number of clusters)
        7,                           // total_bits (1 sign + 6 magnitude)
        Metric::L2,
        RotatorType::FhtKacRotator,  // Fast Hadamard Transform
        42,                          // random seed
        false,                       // use_faster_config
    )?;

    // Search: probe 32 clusters, return top 10 neighbors
    let params = SearchParams::new(10, 32);
    let results = index.search(&dataset[0], params)?;

    println!("Top neighbor ID: {}, distance: {}", results[0].id, results[0].score);
    Ok(())
}

MSTG Index (⚠️ Experimental - For Research Only)

Build an MSTG index and search in ~10 lines:

use rabitq_rs::mstg::{MstgIndex, MstgConfig, SearchParams};
use rand::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rng = StdRng::seed_from_u64(42);
    let dim = 128;

    // Generate 10,000 random vectors
    let dataset: Vec<Vec<f32>> = (0..10_000)
        .map(|_| (0..dim).map(|_| rng.gen::<f32>()).collect())
        .collect();

    // Build index with default configuration
    let index = MstgIndex::build(&dataset, MstgConfig::default())?;

    // Search: balanced mode with top 10 results
    let params = SearchParams::balanced(10);
    let results = index.search(&dataset[0], &params);

    println!("Top neighbor ID: {}, distance: {:.6}", results[0].vector_id, results[0].distance);
    Ok(())
}

⚠️ Note: MSTG is experimental and not recommended for production use. Use IVF+RaBitQ for stable, well-tested performance.

Usage

IVF+RaBitQ Index

Training with Built-in K-Means

use rabitq_rs::{IvfRabitqIndex, Metric, RotatorType};

let index = IvfRabitqIndex::train(
    &vectors,
    1024,                        // nlist: number of IVF clusters
    7,                           // total_bits: quantization precision
    Metric::L2,                  // or Metric::InnerProduct
    RotatorType::FhtKacRotator,
    12345,                       // seed for reproducibility
    false,                       // use_faster_config
)?;

Training with Pre-computed Clusters (FAISS-Compatible)

If you already have centroids and cluster assignments from FAISS or another tool:

use rabitq_rs::{IvfRabitqIndex, Metric, RotatorType};

let index = IvfRabitqIndex::train_with_clusters(
    &vectors,
    &centroids,      // shape: [nlist, dim]
    &assignments,    // shape: [num_vectors], cluster ID per vector
    7,               // total_bits
    Metric::L2,
    RotatorType::FhtKacRotator,
    42,              // seed
    false,           // use_faster_config
)?;

Using Faster Config for Large Datasets

For datasets >100K vectors, enable faster_config to accelerate training by 100-500× with minimal accuracy loss (<1%):

let index = IvfRabitqIndex::train(
    &vectors,
    4096,
    7,
    Metric::L2,
    RotatorType::FhtKacRotator,
    42,
    true,  // use_faster_config = true
)?;

Searching

use rabitq_rs::SearchParams;

let params = SearchParams::new(
    100,   // top_k: return 100 nearest neighbors
    64,    // nprobe: search 64 nearest clusters
);

let results = index.search(&query_vector, params)?;

for result in results.iter().take(10) {
    println!("ID: {}, Distance: {}", result.id, result.score);
}

Parameter Tuning:

  • nprobe: Higher = better recall, slower search (typical: 10-1024)
  • top_k: Number of neighbors to return
  • total_bits: More bits = higher accuracy, larger index (typical: 3-7)

Saving and Loading

// Save index to disk (with CRC32 checksum)
index.save("my_index.bin")?;

// Load index from disk
let loaded_index = IvfRabitqIndex::load("my_index.bin")?;

MSTG Index (⚠️ Experimental)

Building an MSTG Index

use rabitq_rs::mstg::{MstgIndex, MstgConfig, ScalarPrecision};
use rabitq_rs::Metric;

let mut config = MstgConfig::default();

// Configure clustering
config.max_posting_size = 5000;     // Max vectors per cluster
config.branching_factor = 10;       // Hierarchical branching
config.balance_weight = 1.0;        // Balance vs purity trade-off

// Configure closure assignment (multi-assignment)
config.closure_epsilon = 0.15;      // Distance threshold
config.max_replicas = 8;            // Max assignments per vector

// Configure quantization
config.rabitq_bits = 7;             // 3-8 bits (7 recommended)
config.faster_config = true;        // Fast mode
config.metric = Metric::L2;         // or Metric::InnerProduct

// Configure HNSW graph
config.hnsw_m = 32;                 // Connectivity
config.hnsw_ef_construction = 200;  // Build quality
config.centroid_precision = ScalarPrecision::BF16; // FP32/BF16/FP16/INT8

let index = MstgIndex::build(&vectors, config)?;

Searching

use rabitq_rs::mstg::SearchParams;

// Option 1: Pre-defined modes
let params = SearchParams::high_recall(100);  // 95%+ recall
let params = SearchParams::balanced(100);     // ~90% recall (default)
let params = SearchParams::low_latency(100);  // ~80% recall, fastest

// Option 2: Custom parameters
let params = SearchParams::new(
    150,   // ef_search: candidates to explore
    0.6,   // pruning_epsilon: dynamic pruning threshold
    100,   // top_k: number of results
);

let results = index.search(&query_vector, &params);

for result in results.iter().take(10) {
    println!("ID: {}, Distance: {:.6}", result.vector_id, result.distance);
}

Parameter Tuning:

  • ef_search: Higher = better recall, slower (typical: 50-300)
  • pruning_epsilon: Higher = more clusters searched (typical: 0.4-0.8)
  • max_posting_size: Smaller = more clusters, better balance (typical: 3000-10000)
  • rabitq_bits: More bits = higher accuracy (typical: 5-7)

⚠️ Reminder: MSTG is experimental. For production use, refer to the IVF+RaBitQ documentation above.

CLI Tool: Benchmark on GIST-1M

The ivf_rabitq binary lets you evaluate RaBitQ on standard datasets like GIST-1M:

1. Download GIST-1M Dataset

mkdir -p data/gist
wget -P data/gist ftp://ftp.irisa.fr/local/texmex/corpus/gist.tar.gz
tar -xzvf data/gist/gist.tar.gz -C data/gist

2. Build Index

cargo run --release --bin ivf_rabitq -- \
    --base data/gist/gist_base.fvecs \
    --nlist 4096 \
    --bits 3 \
    --faster-config \
    --save data/gist/index.bin

3. Run Benchmark

cargo run --release --bin ivf_rabitq -- \
    --load data/gist/index.bin \
    --queries data/gist/gist_query.fvecs \
    --gt data/gist/gist_groundtruth.ivecs \
    --benchmark

This performs an automatic nprobe sweep and reports:

nprobe | QPS     | Recall@100
-------|---------|------------
5      | 12500   | 45.2%
10     | 8200    | 62.8%
20     | 5100    | 75.4%
...

Alternative: Pre-cluster with FAISS

For C++ library compatibility, generate clusters using the provided Python helper:

python python/ivf.py \
    data/gist/gist_base.fvecs \
    4096 \
    data/gist/gist_centroids_4096.fvecs \
    data/gist/gist_clusterids_4096.ivecs \
    l2

cargo run --release --bin ivf_rabitq -- \
    --base data/gist/gist_base.fvecs \
    --centroids data/gist/gist_centroids_4096.fvecs \
    --assignments data/gist/gist_clusterids_4096.ivecs \
    --bits 3 \
    --save data/gist/index.bin

Python Bindings

RaBitQ-RS provides Python bindings for the MSTG (Multi-Scale Tree Graph) index, enabling seamless integration with Python machine learning workflows and ann-benchmarks.

Installation

# Install dependencies
pip install maturin numpy

# Build and install from source
maturin develop --release --features python

Quick Start (Python)

import numpy as np
from rabitq_rs import MstgIndex

# Generate sample data
data = np.random.randn(10000, 128).astype(np.float32)
query = np.random.randn(128).astype(np.float32)

# Create and build index
index = MstgIndex(
    dimension=128,
    metric="euclidean",          # or "angular"
    max_posting_size=5000,
    rabitq_bits=7,               # 3-8 bits
    faster_config=True,          # Fast quantization mode
    centroid_precision="bf16"    # "fp32", "bf16", "fp16", or "int8"
)

index.fit(data)

# Search
index.set_query_arguments(ef_search=150, pruning_epsilon=0.6)
results = index.query(query, k=10)

# Results: numpy array of shape (k, 2) with [id, distance]
for neighbor_id, distance in results:
    print(f"ID: {int(neighbor_id)}, Distance: {distance:.6f}")

Configuration Parameters

Index Parameters (Build Time):

  • dimension: Vector dimensionality (required)
  • metric: Distance metric - "euclidean" (L2) or "angular" (inner product)
  • max_posting_size: Maximum vectors per cluster (default: 5000)
  • rabitq_bits: Quantization bits, 3-8 (default: 7)
  • faster_config: Fast quantization mode (default: True)
  • centroid_precision: Centroid storage - "fp32", "bf16", "fp16", "int8" (default: "bf16")
  • hnsw_m: HNSW connectivity (default: 32)
  • hnsw_ef_construction: HNSW build quality (default: 200)

Query Parameters (Search Time):

  • ef_search: Centroid candidates to explore (default: 150)
  • pruning_epsilon: Dynamic pruning threshold (default: 0.6)
  • k: Number of neighbors to return

Batch Queries

# Query multiple vectors efficiently
queries = np.random.randn(100, 128).astype(np.float32)
results_list = index.batch_query(queries, k=10)

# Returns list of numpy arrays, one per query
for i, results in enumerate(results_list):
    print(f"Query {i}: Top neighbor ID = {int(results[0, 0])}")

Memory Usage

memory_bytes = index.get_memory_usage()
print(f"Index size: {memory_bytes / 1024**2:.2f} MB")

ann-benchmarks Integration

RaBitQ-RS includes a complete ann-benchmarks wrapper for standardized evaluation:

# Run benchmarks
cd ann_benchmarks
python run.py --algorithm rabitq-mstg --dataset fashion-mnist-784-euclidean

See ann_benchmarks/README.md for detailed benchmark configuration and usage.

Testing

Run the test suite before committing changes:

# Rust tests
cargo fmt
cargo clippy --all-targets --all-features
cargo test

# Python tests
python test_python_bindings.py

All tests use seeded RNGs for reproducible results.

Citation

If you use RaBitQ in your research, please cite the original paper:

@inproceedings{guo2024rabitq,
  title={RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search},
  author={Guo, Jianyang and Wang, Zihan and Tang, Yun and Wang, Jihao and Cai, Ruihang and Cai, Peng and Zhang, Xiyuan and Wang, Jianye and Zhao, Jun},
  booktitle={Proceedings of the ACM on Management of Data},
  year={2024}
}

Original C++ Implementation: VectorDB-NTU/RaBitQ-Library

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

About

IVF + RaBitQ in Rust

Resources

License

Stars

Watchers

Forks

Contributors 9

Languages