Fast, Rust-backed clustering for Python. Six algorithms, sklearn-compatible API, purpose-built embedding clustering with 11x PCA speedup.
- 6 algorithms: KMeans, MiniBatchKMeans, DBSCAN, HDBSCAN, AgglomerativeClustering, EmbeddingCluster
- EmbeddingCluster: purpose-built pipeline for OpenAI/Cohere/Voyage embeddings (L2-normalize, PCA, spherical K-means)
- EmbeddingReducer: standalone PCA transformer with save/load (fit once, cluster for free), chunked transform, sample-fit, and an f32 fast path that halves working-set memory
- rustcluster.utils.extract_embeddings_from_spark: stream a Spark DataFrame into NumPy via Arrow, no Python list-of-lists overhead
- rustcluster.index: FAISS-flavored flat vector search (IndexFlatL2, IndexFlatIP) plus a fused all-pairs similarity_graph kernel
- faer-accelerated PCA: 11x faster than hand-rolled matmul via SIMD-optimized GEMM
- 3 distance metrics: euclidean, cosine, manhattan
- 3 evaluation metrics: silhouette score, Calinski-Harabasz, Davies-Bouldin
- KD-tree acceleration for DBSCAN/HDBSCAN neighbor queries (10-200x on low-d data)
- Native f32/f64: no silent upcast, doubles cache efficiency with f32
- Cluster slotting: snapshot fitted clusters, assign new points 100x faster than refitting
- Pickle serialization for all fitted models
- GIL released during all compute, plays well with threads and async
- 631 tests across Rust (237) and Python (394) suites
pip install rustcluster

Or from source (requires Rust toolchain + Python 3.10+):
pip install maturin
git clone https://github.com/mfbaig35r/rustcluster.git
cd rustcluster
maturin develop --release

from rustcluster import KMeans
model = KMeans(n_clusters=3, random_state=42)
model.fit(X)
model.labels_ # cluster assignments
model.cluster_centers_ # centroids (k x d)
model.inertia_ # sum of squared distances
model.predict(X_new) # assign new data

Purpose-built pipeline for dense embedding vectors (OpenAI, Cohere, Voyage, etc.):
from rustcluster.experimental import EmbeddingCluster
model = EmbeddingCluster(n_clusters=50, reduction_dim=128)
model.fit(embeddings) # L2-normalize, PCA, spherical K-means
model.labels_ # cluster assignments
model.cluster_centers_ # unit-norm centroids in reduced space
model.intra_similarity_ # per-cluster cosine similarity
model.reduced_data_ # access PCA-reduced data

PCA is 99% of the embedding pipeline runtime. Separate reduction from clustering to iterate for free:
from rustcluster.experimental import EmbeddingReducer
# Pay the PCA cost once
reducer = EmbeddingReducer(target_dim=128)
X_reduced = reducer.fit_transform(embeddings) # 323K x 1536 -> 128 in ~56s
reducer.save("pca_128.bin")
# Iterate on clustering for free
reducer = EmbeddingReducer.load("pca_128.bin")
X_reduced = reducer.transform(new_embeddings)
EmbeddingCluster(n_clusters=50, reduction_dim=None).fit(X_reduced) # ~4s
EmbeddingCluster(n_clusters=100, reduction_dim=None).fit(X_reduced) # ~8s
EmbeddingCluster(n_clusters=200, reduction_dim=None).fit(X_reduced) # ~15s

Matryoshka models (e.g., text-embedding-3-small) can skip PCA entirely:
reducer = EmbeddingReducer(target_dim=128, method="matryoshka")
X_reduced = reducer.fit_transform(embeddings) # instant: column slice + L2-normalize

Memory-optimized end-to-end workflow (v0.7.0). Runs on a 28 GB Databricks driver without manual chunking:
import numpy as np
from rustcluster.utils import extract_embeddings_from_spark
from rustcluster.experimental import EmbeddingReducer, EmbeddingCluster
# Stream embeddings from Spark with no Python list-of-lists overhead
embeddings, pdf = extract_embeddings_from_spark(
df,
embedding_col="embedding",
metadata_cols=["supplier_id", "commodity"],
dtype=np.float32,
)
# Fit on a 50K sample, transform all 312K in chunks, f32 throughout
reducer = EmbeddingReducer(target_dim=128, fit_sample_size=50_000)
X = reducer.fit_transform(embeddings, chunk_size=50_000)
model = EmbeddingCluster(n_clusters=20, reduction_dim=None).fit(X)

Output dtype tracks input dtype: float32 in, float32 out. (Prior to v0.7.0 the output was always upcast to float64.)
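To make the sample-fit and chunked-transform idea concrete, here is a minimal NumPy sketch. It is not rustcluster's implementation: the helper names fit_pca_on_sample and transform_in_chunks are hypothetical, and plain SVD-based PCA stands in for the faer-backed kernel. It shows the two properties described above: only a sample is centered during the fit, and the output keeps the input dtype.

```python
import numpy as np

def fit_pca_on_sample(X, target_dim, sample_size, seed=0):
    """Fit PCA on a random row sample; returns (mean, components)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    sample = X[idx]
    mean = sample.mean(axis=0, dtype=X.dtype)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(sample - mean, full_matrices=False)
    return mean, vt[:target_dim].astype(X.dtype)

def transform_in_chunks(X, mean, components, chunk_size):
    """Project X chunk by chunk so only one centered chunk is live at a time."""
    out = np.empty((X.shape[0], components.shape[0]), dtype=X.dtype)
    for start in range(0, X.shape[0], chunk_size):
        stop = start + chunk_size
        out[start:stop] = (X[start:stop] - mean) @ components.T
    return out

# Synthetic float32 data standing in for real embeddings.
rng = np.random.default_rng(42)
X = rng.standard_normal((2_000, 64)).astype(np.float32)
mean, components = fit_pca_on_sample(X, target_dim=8, sample_size=500)
X_reduced = transform_in_chunks(X, mean, components, chunk_size=256)
```

The peak temporary here is one centered chunk plus the centered sample, rather than a centered copy of the full matrix.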
See the embedding clustering guide for full documentation.
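For readers unfamiliar with spherical K-means, the following NumPy sketch illustrates the idea under simple assumptions. It is not the library's code: the function names, the greedy seeding, and the fixed iteration count are illustrative choices. Points are L2-normalized, assigned by cosine similarity, and each centroid is the re-normalized sum of its members.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row to unit length (zero rows are left at zero)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def init_centers(X, n_clusters, rng):
    # Greedy seeding (an illustrative choice): start from a random point, then
    # repeatedly add the point least similar to the centers picked so far.
    centers = [X[rng.integers(X.shape[0])]]
    while len(centers) < n_clusters:
        best_sim = (X @ np.array(centers).T).max(axis=1)
        centers.append(X[np.argmin(best_sim)])
    return np.array(centers)

def spherical_kmeans(X, n_clusters, n_iter=20, seed=0):
    X = l2_normalize(X)
    centers = init_centers(X, n_clusters, np.random.default_rng(seed))
    for _ in range(n_iter):
        # On unit vectors, the dot product is the cosine similarity.
        labels = np.argmax(X @ centers.T, axis=1)
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members):  # an empty cluster keeps its previous center
                centers[k] = members.sum(axis=0)
        centers = l2_normalize(centers)
    labels = np.argmax(X @ centers.T, axis=1)
    return labels, centers

# Two synthetic groups pointing in clearly different directions.
rng = np.random.default_rng(0)
a = rng.normal(loc=[5.0, 0.0, 0.0], scale=0.1, size=(50, 3))
b = rng.normal(loc=[0.0, 5.0, 0.0], scale=0.1, size=(50, 3))
labels, centers = spherical_kmeans(np.vstack([a, b]), n_clusters=2)
```

Because the centroids stay on the unit sphere, the per-cluster mean of the member-to-centroid dot products is a cosine similarity, which is the kind of quantity intra_similarity_ reports.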
Fit once, assign new data forever. Snapshot freezes cluster centroids; new points are assigned without re-clustering:
from rustcluster import KMeans, ClusterSnapshot
# Fit and snapshot
model = KMeans(n_clusters=50).fit(X_train)
snapshot = model.snapshot()
snapshot.save("clusters/")
# Later: load and assign new data (no refit needed)
snapshot = ClusterSnapshot.load("clusters/")
labels = snapshot.assign(X_new) # 100x faster than refitting

Works with KMeans, MiniBatchKMeans, and EmbeddingCluster. EmbeddingCluster snapshots bake in the full preprocessing pipeline (L2-normalize, PCA, spherical assignment).
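Conceptually, assigning against frozen centroids is a single nearest-centroid lookup with no iteration, which is why it is so much cheaper than refitting. The sketch below shows that lookup in plain NumPy for the Euclidean case; assign_to_centroids is a hypothetical helper, not part of the rustcluster API.

```python
import numpy as np

def assign_to_centroids(X_new, centroids):
    """Return nearest-centroid labels and distances for frozen centroids."""
    # Squared distances via ||x||^2 - 2 x.c + ||c||^2, clipped at 0 for round-off.
    sq = (
        (X_new ** 2).sum(axis=1, keepdims=True)
        - 2.0 * X_new @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    sq = np.maximum(sq, 0.0)
    labels = sq.argmin(axis=1)
    return labels, np.sqrt(sq[np.arange(len(X_new)), labels])

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
X_new = np.array([[1.0, 0.0], [9.0, 10.0], [0.0, 3.0]])
labels, distances = assign_to_centroids(X_new, centroids)
# labels -> [0, 1, 0], distances -> [1.0, 1.0, 3.0]
```

The cost is one n x k distance matrix per batch, independent of how many iterations the original fit needed.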
Confidence scoring and rejection:
result = snapshot.assign_with_scores(X_new, confidence_threshold=0.3)
result.labels_ # -1 for rejected points
result.confidences_ # [0, 1); higher means more decisive assignment
result.distances_ # distance to nearest centroid
result.rejected_ # boolean mask

Adaptive thresholds (v2): Per-cluster rejection thresholds calibrated from training data. Fixes the problem where a global threshold rejects too many points from diffuse clusters:
snapshot.calibrate(X_train) # compute per-cluster confidence distributions
result = snapshot.assign_with_scores(X_new, adaptive_threshold=True, adaptive_percentile="p10")

Mahalanobis boundaries (v2): Diagonal Mahalanobis distance accounts for per-cluster, per-dimension variance:
snapshot.calibrate(X_train)
labels = snapshot.assign(X_new, boundary_mode="mahalanobis")

Drift detection:
report = snapshot.drift_report(X_recent)
report.global_mean_distance_ # compare to training baseline
report.relative_drift_ # per-cluster drift
report.kappa_drift_ # vMF concentration shift (spherical only, v2)
report.direction_drift_ # centroid direction shift (spherical only, v2)
report.rejection_rate_ # fraction of points beyond per-cluster bounds (requires calibrate())

rejection_rate_ is NaN until snapshot.calibrate(X_train) is called. The per-cluster bounds come from the calibration distribution, not the fit-time data.
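The relationship between calibration and the rejection rate can be sketched as follows. This is an illustration under stated assumptions, not the library's algorithm: the helper names are hypothetical, and using a percentile of calibration distances as the per-cluster bound is an assumed choice. It shows why the rate is undefined before calibration and why per-cluster bounds treat tight and diffuse clusters differently.

```python
import numpy as np

def calibrate_bounds(distances, labels, n_clusters, percentile=95.0):
    """Per-cluster distance bound taken from a calibration set."""
    bounds = np.full(n_clusters, np.nan)
    for k in range(n_clusters):
        d = distances[labels == k]
        if d.size:
            bounds[k] = np.percentile(d, percentile)
    return bounds

def rejection_rate(distances, labels, bounds=None):
    """Fraction of points farther from their centroid than the cluster's bound."""
    if bounds is None:  # no calibration yet, so the rate is undefined
        return float("nan")
    return float(np.mean(distances > bounds[labels]))

# Calibration distances: cluster 0 is tight, cluster 1 is diffuse.
cal_labels = np.array([0] * 5 + [1] * 5)
cal_dist = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0])
bounds = calibrate_bounds(cal_dist, cal_labels, n_clusters=2, percentile=100.0)

new_labels = np.array([0, 0, 1, 1])
new_dist = np.array([0.4, 0.9, 4.5, 6.0])
rate = rejection_rate(new_dist, new_labels, bounds)  # 0.5
```

With a single global bound of 0.5, the point at distance 4.5 in the diffuse cluster would also be rejected; the per-cluster bound of 5.0 keeps it.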
Hierarchical slotting (v2): Cascading snapshots for multi-level classification (e.g., commodity, then sub-commodity):
from rustcluster.experimental import HierarchicalSnapshot
hier = HierarchicalSnapshot.build(X_train, root_model, n_sub_clusters=10)
root_labels, child_labels = hier.assign(X_new)
hier.save("clusters/hierarchy/")

Persistence: safetensors (centroids) + JSON (metadata). A 50-cluster, 128d snapshot is ~50 KB vs GBs of training data.
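As a rough sanity check on that size, the centroid matrix alone can be estimated with simple arithmetic. The element width the snapshot actually stores is an assumption here, and JSON metadata or calibration statistics would add to the total.

```python
def centroid_bytes(n_clusters, dim, itemsize):
    """Raw size of an n_clusters x dim centroid matrix in bytes."""
    return n_clusters * dim * itemsize

size_f32 = centroid_bytes(50, 128, 4)  # 25_600 bytes if stored as float32
size_f64 = centroid_bytes(50, 128, 8)  # 51_200 bytes if stored as float64
```

Either way the snapshot is tens of kilobytes, because its size depends on the number of clusters and dimensions rather than on the number of training rows.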
Validated on 323K CROSS ruling embeddings (113x speedup, 99.86% fidelity) and 312K supplier embeddings (453x speedup, 99.94% fidelity). Hierarchical slotting improved heading purity from 37% to 54% on CROSS rulings.
from rustcluster import MiniBatchKMeans
model = MiniBatchKMeans(n_clusters=3, batch_size=256, random_state=42)
model.fit(X_large) # scales to large datasets

from rustcluster import DBSCAN
model = DBSCAN(eps=0.5, min_samples=5)
model.fit(X)
model.labels_ # -1 for noise
model.core_sample_indices_ # core point indices

from rustcluster import HDBSCAN
model = HDBSCAN(min_cluster_size=5)
model.fit(X)
model.labels_ # -1 for noise
model.probabilities_ # soft membership [0, 1]
model.cluster_persistence_ # per-cluster stability

from rustcluster import AgglomerativeClustering
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
model.fit(X)
model.labels_ # cluster assignments
model.children_ # merge history
model.distances_ # distance at each merge

from rustcluster import silhouette_score, calinski_harabasz_score, davies_bouldin_score
silhouette_score(X, labels) # [-1, 1], higher is better
calinski_harabasz_score(X, labels) # higher is better
davies_bouldin_score(X, labels) # lower is better

All algorithms accept a metric parameter:
KMeans(n_clusters=5, metric="cosine")
DBSCAN(eps=0.3, metric="manhattan")
HDBSCAN(min_cluster_size=5, metric="euclidean")

| Metric | Aliases | KD-tree acceleration | Notes |
|---|---|---|---|
| "euclidean" | "l2" | Yes | Default for all algorithms |
| "cosine" | | No (brute force) | K-means forces Lloyd (Hamerly assumes Euclidean) |
| "manhattan" | "cityblock", "l1" | Yes | |
Ward linkage requires euclidean metric.
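For reference, the three metrics can be written out in a few lines of NumPy. This is a sketch of the standard definitions, not the Rust kernels; in particular, defining cosine distance as 1 minus cosine similarity is an assumption about the convention used.

```python
import numpy as np

def euclidean(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    return float(np.sum(np.abs(a - b)))

def cosine_distance(a, b):
    # 1 - cosine similarity; this convention is an assumption.
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 0.0])
b = np.array([0.0, 4.0])
d_l2 = euclidean(a, b)         # 5.0
d_l1 = manhattan(a, b)         # 7.0
d_cos = cosine_distance(a, b)  # 1.0 (orthogonal vectors)
```

Euclidean and Manhattan distances are built from per-coordinate differences, which is what a KD-tree's axis-aligned splits rely on. Cosine distance depends on the vectors' norms and does not decompose that way.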
Single-threaded, n_init=1, median of 5 runs:
| n | d | k | Speedup vs sklearn |
|---|---|---|---|
| 1,000 | 8 | 8 | 2.9x |
| 10,000 | 8 | 8 | 2.4x |
| 100,000 | 8 | 32 | 3.2x |
| 100,000 | 32 | 32 | 1.4x |
DBSCAN and HDBSCAN use KD-tree acceleration for d <= 16 with euclidean or manhattan metrics, reducing neighbor queries from O(n^2) to O(n log n).
Measured on 323K embeddings (text-embedding-3-small, 1536d → 128d, K=98, Apple Silicon):
| Workflow | Time |
|---|---|
| Full pipeline (PCA + cluster) | 58s |
| Subsequent run (cached reduced data) | 7.5s |
| 5 clustering configs on cached data | 74s |
| Matryoshka (no PCA needed) | ~5s |
Memory (v0.7.0, 312K x 1536d -> 128d on a 28 GB Databricks driver):
| Stage | v0.6.x | v0.7.0 |
|---|---|---|
| Arrow extraction from Spark | ~5 GB | ~2 GB |
| PCA fit | 3.8 GB centered matrix | ~300 MB (sample-fit + f32) |
| PCA transform | 3.8 GB peak | ~300 MB per chunk (f32) |
| Total Python peak | ~8-9 GB | ~1.5-3 GB |
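The PCA rows of this table follow from the matrix sizes. The short calculation below reproduces them, assuming "312K" means exactly 312,000 rows and using the 50,000-row sample and chunk size from the workflow example.

```python
def matrix_gb(rows, cols, itemsize):
    """Size of a dense rows x cols matrix in decimal gigabytes."""
    return rows * cols * itemsize / 1e9

full_f64 = matrix_gb(312_000, 1536, 8)   # whole matrix, float64
sample_f32 = matrix_gb(50_000, 1536, 4)  # 50K-row sample or chunk, float32
print(f"{full_f64:.2f} GB, {sample_f32:.3f} GB")  # 3.83 GB, 0.307 GB
```

The roughly 12x reduction comes from two independent factors: about 6x fewer rows held at once and 2x smaller elements.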
Full benchmarks: python benches/benchmark.py
All models support pickle:
import pickle
from rustcluster import KMeans
model = KMeans(n_clusters=3).fit(X)
data = pickle.dumps(model)
model_restored = pickle.loads(data) # fitted state preserved

EmbeddingReducer uses a compact binary format:
reducer.save("pca_128.bin") # 1.5 KB
reducer = EmbeddingReducer.load("pca_128.bin") # instant

ClusterSnapshot uses safetensors + JSON for portable, safe persistence:
snapshot = model.snapshot()
snapshot.save("clusters/") # safetensors + metadata.json
snapshot = ClusterSnapshot.load("clusters/") # zero-copy load

maturin develop --release # build
cargo test --no-default-features --lib # Rust tests (237)
pytest tests/ -v # Python tests (394, + 6 opt-in perf via -m perf)
python benches/benchmark.py # benchmark vs sklearn
cargo fmt -- --check # formatting
cargo clippy --no-default-features --lib -- -D warnings # linting

Three-layer kernel design separating concerns:
- PyO3 boundary (src/lib.rs): input validation, GIL release, dtype dispatch
- Algorithm logic (src/kmeans.rs, src/snapshot/, etc.): iteration, convergence, ndarray types
- Hot kernel (src/utils.rs, src/distance.rs): raw &[F] slices for auto-vectorization
The embedding pipeline adds:
- Embedding module (src/embedding/): spherical K-means, PCA (faer-backed), vMF refinement, EmbeddingReducer
See docs/architecture-decisions.md for details and docs/lessons-building-rustcluster.md for the full build story.
See CONTRIBUTING.md for how to add algorithms, distance metrics, and tests.
MIT