rustcluster

Fast, Rust-backed clustering for Python. Six algorithms, sklearn-compatible API, purpose-built embedding clustering with 11x PCA speedup.

Highlights

6 algorithms: KMeans, MiniBatchKMeans, DBSCAN, HDBSCAN, AgglomerativeClustering, EmbeddingCluster
EmbeddingCluster: purpose-built pipeline for OpenAI/Cohere/Voyage embeddings (L2-normalize, PCA, spherical K-means)
EmbeddingReducer: standalone PCA transformer with save/load (fit once, cluster for free), chunked transform, sample-fit, and an f32 fast path that halves working-set memory
rustcluster.utils.extract_embeddings_from_spark: stream a Spark DataFrame into NumPy via Arrow, no Python list-of-lists overhead
rustcluster.index: FAISS-flavored flat vector search (IndexFlatL2, IndexFlatIP) plus a fused all-pairs similarity_graph kernel
faer-accelerated PCA: 11x faster than hand-rolled matmul via SIMD-optimized GEMM
3 distance metrics: euclidean, cosine, manhattan
3 evaluation metrics: silhouette score, Calinski-Harabasz, Davies-Bouldin
KD-tree acceleration for DBSCAN/HDBSCAN neighbor queries (10-200x on low-d data)
Native f32/f64: no silent upcast, doubles cache efficiency with f32
Cluster slotting: snapshot fitted clusters, assign new points 100x faster than refitting
Pickle serialization for all fitted models
GIL released during all compute, plays well with threads and async
631 tests across Rust (237) and Python (394) suites

Installation

pip install rustcluster

Or from source (requires Rust toolchain + Python 3.10+):

pip install maturin
git clone https://github.com/mfbaig35r/rustcluster.git
cd rustcluster
maturin develop --release

Quickstart

K-Means

from rustcluster import KMeans

model = KMeans(n_clusters=3, random_state=42)
model.fit(X)
model.labels_           # cluster assignments
model.cluster_centers_  # centroids (k x d)
model.inertia_          # sum of squared distances
model.predict(X_new)    # assign new data

Embedding Clustering

Purpose-built pipeline for dense embedding vectors (OpenAI, Cohere, Voyage, etc.):

from rustcluster.experimental import EmbeddingCluster

model = EmbeddingCluster(n_clusters=50, reduction_dim=128)
model.fit(embeddings)          # L2-normalize, PCA, spherical K-means
model.labels_                  # cluster assignments
model.cluster_centers_         # unit-norm centroids in reduced space
model.intra_similarity_        # per-cluster cosine similarity
model.reduced_data_            # access PCA-reduced data

EmbeddingReducer (Fit Once, Cluster Many)

PCA is 99% of the embedding pipeline runtime. Separate reduction from clustering to iterate for free:

from rustcluster.experimental import EmbeddingReducer

# Pay the PCA cost once
reducer = EmbeddingReducer(target_dim=128)
X_reduced = reducer.fit_transform(embeddings)  # 323K x 1536 -> 128 in ~56s
reducer.save("pca_128.bin")

# Iterate on clustering for free
reducer = EmbeddingReducer.load("pca_128.bin")
X_reduced = reducer.transform(new_embeddings)

EmbeddingCluster(n_clusters=50, reduction_dim=None).fit(X_reduced)   # ~4s
EmbeddingCluster(n_clusters=100, reduction_dim=None).fit(X_reduced)  # ~8s
EmbeddingCluster(n_clusters=200, reduction_dim=None).fit(X_reduced)  # ~15s

Matryoshka models (e.g., text-embedding-3-small) can skip PCA entirely:

reducer = EmbeddingReducer(target_dim=128, method="matryoshka")
X_reduced = reducer.fit_transform(embeddings)  # instant: column slice + L2-normalize

Memory-optimized end-to-end workflow (v0.7.0). Runs on a 28 GB Databricks driver without manual chunking:

from rustcluster.utils import extract_embeddings_from_spark
from rustcluster.experimental import EmbeddingReducer, EmbeddingCluster

# Stream embeddings from Spark with no Python list-of-lists overhead
embeddings, pdf = extract_embeddings_from_spark(
    df,
    embedding_col="embedding",
    metadata_cols=["supplier_id", "commodity"],
    dtype=np.float32,
)

# Fit on a 50K sample, transform all 312K in chunks, f32 throughout
reducer = EmbeddingReducer(target_dim=128, fit_sample_size=50_000)
X = reducer.fit_transform(embeddings, chunk_size=50_000)

model = EmbeddingCluster(n_clusters=20, reduction_dim=None).fit(X)

Output dtype tracks input dtype: float32 in, float32 out. (Prior to v0.7.0 the output was always upcast to float64.)

See the embedding clustering guide for full documentation.

Cluster Slotting (Incremental Assignment)

Fit once, assign new data forever. Snapshot freezes cluster centroids; new points are assigned without re-clustering:

from rustcluster import KMeans, ClusterSnapshot

# Fit and snapshot
model = KMeans(n_clusters=50).fit(X_train)
snapshot = model.snapshot()
snapshot.save("clusters/")

# Later: load and assign new data (no refit needed)
snapshot = ClusterSnapshot.load("clusters/")
labels = snapshot.assign(X_new)                   # 100x faster than refitting

Works with KMeans, MiniBatchKMeans, and EmbeddingCluster. EmbeddingCluster snapshots bake in the full preprocessing pipeline (L2-normalize, PCA, spherical assignment).

Confidence scoring and rejection:

result = snapshot.assign_with_scores(X_new, confidence_threshold=0.3)
result.labels_       # -1 for rejected points
result.confidences_  # [0, 1); higher means more decisive assignment
result.distances_    # distance to nearest centroid
result.rejected_     # boolean mask

Adaptive thresholds (v2): Per-cluster rejection thresholds calibrated from training data. Fixes the problem where a global threshold rejects too many points from diffuse clusters:

snapshot.calibrate(X_train)  # compute per-cluster confidence distributions
result = snapshot.assign_with_scores(X_new, adaptive_threshold=True, adaptive_percentile="p10")

Mahalanobis boundaries (v2): Diagonal Mahalanobis distance accounts for per-cluster, per-dimension variance:

snapshot.calibrate(X_train)
labels = snapshot.assign(X_new, boundary_mode="mahalanobis")

Drift detection:

report = snapshot.drift_report(X_recent)
report.global_mean_distance_  # compare to training baseline
report.relative_drift_        # per-cluster drift
report.kappa_drift_           # vMF concentration shift (spherical only, v2)
report.direction_drift_       # centroid direction shift (spherical only, v2)
report.rejection_rate_        # fraction of points beyond per-cluster bounds (requires calibrate())

rejection_rate_ is NaN until snapshot.calibrate(X_train) is called. The per-cluster bounds come from the calibration distribution, not the fit-time data.

Hierarchical slotting (v2): Cascading snapshots for multi-level classification (e.g., commodity, then sub-commodity):

from rustcluster.experimental import HierarchicalSnapshot

hier = HierarchicalSnapshot.build(X_train, root_model, n_sub_clusters=10)
root_labels, child_labels = hier.assign(X_new)
hier.save("clusters/hierarchy/")

Persistence: safetensors (centroids) + JSON (metadata). A 50-cluster, 128d snapshot is ~50 KB vs GBs of training data.

Validated on 323K CROSS ruling embeddings (113x speedup, 99.86% fidelity) and 312K supplier embeddings (453x speedup, 99.94% fidelity). Hierarchical slotting improved heading purity from 37% to 54% on CROSS rulings.

Mini-Batch K-Means

from rustcluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=3, batch_size=256, random_state=42)
model.fit(X_large)      # scales to large datasets

DBSCAN

from rustcluster import DBSCAN

model = DBSCAN(eps=0.5, min_samples=5)
model.fit(X)
model.labels_                # -1 for noise
model.core_sample_indices_   # core point indices

HDBSCAN

from rustcluster import HDBSCAN

model = HDBSCAN(min_cluster_size=5)
model.fit(X)
model.labels_              # -1 for noise
model.probabilities_       # soft membership [0, 1]
model.cluster_persistence_ # per-cluster stability

Agglomerative Clustering

from rustcluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=3, linkage="ward")
model.fit(X)
model.labels_     # cluster assignments
model.children_   # merge history
model.distances_  # distance at each merge

Evaluation Metrics

from rustcluster import silhouette_score, calinski_harabasz_score, davies_bouldin_score

silhouette_score(X, labels)         # [-1, 1], higher is better
calinski_harabasz_score(X, labels)  # higher is better
davies_bouldin_score(X, labels)     # lower is better

Distance Metrics

All algorithms accept a metric parameter:

KMeans(n_clusters=5, metric="cosine")
DBSCAN(eps=0.3, metric="manhattan")
HDBSCAN(min_cluster_size=5, metric="euclidean")

Metric	Aliases	KD-tree acceleration	Notes
`"euclidean"`	`"l2"`	Yes	Default for all algorithms
`"cosine"`		No (brute force)	K-means forces Lloyd (Hamerly assumes Euclidean)
`"manhattan"`	`"cityblock"`, `"l1"`	Yes

Ward linkage requires euclidean metric.

Performance

K-Means vs scikit-learn

Single-threaded, n_init=1, median of 5 runs:

n	d	k	Speedup vs sklearn
1,000	8	8	2.9x
10,000	8	8	2.4x
100,000	8	32	3.2x
100,000	32	32	1.4x

DBSCAN and HDBSCAN use KD-tree acceleration for d <= 16 with euclidean or manhattan metrics, reducing neighbor queries from O(n^2) to O(n log n).

Embedding Clustering

Measured on 323K embeddings (text-embedding-3-small, 1536d → 128d, K=98, Apple Silicon):

Workflow	Time
Full pipeline (PCA + cluster)	58s
Subsequent run (cached reduced data)	7.5s
5 clustering configs on cached data	74s
Matryoshka (no PCA needed)	~5s

Memory (v0.7.0, 312K x 1536d -> 128d on a 28 GB Databricks driver):

Stage	v0.6.x	v0.7.0
Arrow extraction from Spark	~5 GB	~2 GB
PCA fit	3.8 GB centered matrix	~300 MB (sample-fit + f32)
PCA transform	3.8 GB peak	~300 MB per chunk (f32)
Total Python peak	~8-9 GB	~1.5-3 GB

Full benchmarks: python benches/benchmark.py

Serialization

All models support pickle:

import pickle

model = KMeans(n_clusters=3).fit(X)
data = pickle.dumps(model)
model_restored = pickle.loads(data)  # fitted state preserved

EmbeddingReducer uses a compact binary format:

reducer.save("pca_128.bin")                   # 1.5 KB
reducer = EmbeddingReducer.load("pca_128.bin") # instant

ClusterSnapshot uses safetensors + JSON for portable, safe persistence:

snapshot = model.snapshot()
snapshot.save("clusters/")                     # safetensors + metadata.json
snapshot = ClusterSnapshot.load("clusters/")   # zero-copy load

Development

maturin develop --release              # build
cargo test --no-default-features --lib # Rust tests (237)
pytest tests/ -v                       # Python tests (394, + 6 opt-in perf via -m perf)
python benches/benchmark.py            # benchmark vs sklearn
cargo fmt -- --check                   # formatting
cargo clippy --no-default-features --lib -- -D warnings  # linting

Architecture

Three-layer kernel design separating concerns:

PyO3 boundary (src/lib.rs): input validation, GIL release, dtype dispatch
Algorithm logic (src/kmeans.rs, src/snapshot/, etc.): iteration, convergence, ndarray types
Hot kernel (src/utils.rs, src/distance.rs): raw &[F] slices for auto-vectorization

The embedding pipeline adds:

Embedding module (src/embedding/): spherical K-means, PCA (faer-backed), vMF refinement, EmbeddingReducer

See docs/architecture-decisions.md for details and docs/lessons-building-rustcluster.md for the full build story.

Contributing

See CONTRIBUTING.md for how to add algorithms, distance metrics, and tests.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 112 Commits
.claude/plans		.claude/plans
.github/workflows		.github/workflows
benches		benches
docs		docs
examples		examples
experiments		experiments
notebooks		notebooks
python/rustcluster		python/rustcluster
src		src
tests		tests
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

rustcluster

Highlights

Installation

Quickstart

K-Means

Embedding Clustering

EmbeddingReducer (Fit Once, Cluster Many)

Cluster Slotting (Incremental Assignment)

Mini-Batch K-Means

DBSCAN

HDBSCAN

Agglomerative Clustering

Evaluation Metrics

Distance Metrics

Performance

K-Means vs scikit-learn

Embedding Clustering

Serialization

Development

Architecture

Contributing

License

About

Uh oh!

Releases 12

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

rustcluster

Highlights

Installation

Quickstart

K-Means

Embedding Clustering

EmbeddingReducer (Fit Once, Cluster Many)

Cluster Slotting (Incremental Assignment)

Mini-Batch K-Means

DBSCAN

HDBSCAN

Agglomerative Clustering

Evaluation Metrics

Distance Metrics

Performance

K-Means vs scikit-learn

Embedding Clustering

Serialization

Development

Architecture

Contributing

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 12

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages