
Move k-means implementation from diskann-providers to diskann-disk #933

Merged
arrayka merged 3 commits into main from copilot/move-kmeans-to-diskann-disk
Apr 10, 2026

Conversation

Copilot AI commented Apr 9, 2026

K-means in diskann-providers was the last consumer of the old BLAS-based clustering path; PQ training has since migrated to diskann-quantization. The only active call site remaining was disk-index partitioning in diskann-disk.

For now, we will keep the diskann-providers implementation and move it to diskann-disk, rather than switching to the one in diskann-quantization, for the following reasons:

  • K-means in diskann-providers performs better at higher dimensions (>100):
image
  • K-means in diskann-providers supports multi-threading:
image

We will work on closing these performance gaps and converging the two implementations in separate PRs.

Changes in this PR

diskann-disk

  • Added src/utils/kmeans.rs — k-means implementation moved from diskann-providers
  • Added src/utils/math_util.rs — mathematical utilities (compute_vecs_l2sq, compute_closest_centers, compute_closest_centers_in_block, and helpers) extracted from diskann-providers and deduplicated
  • Exported k_means_clustering, k_meanspp_selecting_pivots, run_lloyds, compute_vecs_l2sq, compute_closest_centers, compute_closest_centers_in_block from utils/mod.rs
  • Updated utils/partition.rs to import kmeans functions and math utilities from local modules instead of diskann-providers
  • Moved kmeans criterion and iai-callgrind benchmarks from diskann-providers/benches to diskann-disk/benches
  • Added proptest and approx to dev-dependencies
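The bodies of the moved math utilities are not shown in this description. As a rough illustration of what helpers with names like compute_vecs_l2sq and compute_closest_centers typically compute, here is a minimal sketch. The function names `vecs_l2sq` and `closest_centers`, their signatures, and the row-major layout are assumptions for illustration, not the crate's API. It uses the identity ||x - c||² = ||x||² - 2·x·c + ||c||², which is what lets the dot products be batched into a matrix multiply in a BLAS-based version; plain loops stand in for BLAS here.

```rust
/// Squared L2 norm of each row of a row-major `n x dim` matrix.
/// Hypothetical helper (assumes `dim > 0`); not the diskann-disk signature.
fn vecs_l2sq(data: &[f32], dim: usize) -> Vec<f32> {
    data.chunks_exact(dim)
        .map(|row| row.iter().map(|v| v * v).sum::<f32>())
        .collect()
}

/// For each point, the index of its nearest center by squared L2 distance,
/// computed as ||x||^2 - 2 * <x, c> + ||c||^2.
/// Hypothetical helper; not the diskann-disk signature.
fn closest_centers(points: &[f32], centers: &[f32], dim: usize) -> Vec<usize> {
    let point_norms = vecs_l2sq(points, dim);
    let center_norms = vecs_l2sq(centers, dim);
    points
        .chunks_exact(dim)
        .zip(&point_norms)
        .map(|(p, &pn)| {
            let mut best = 0usize;
            let mut best_dist = f32::INFINITY;
            for (j, c) in centers.chunks_exact(dim).enumerate() {
                // In a BLAS-based version this dot product is one entry
                // of a points-by-centers matrix product.
                let dot: f32 = p.iter().zip(c).map(|(a, b)| a * b).sum();
                let dist = pn - 2.0 * dot + center_norms[j];
                if dist < best_dist {
                    best_dist = dist;
                    best = j;
                }
            }
            best
        })
        .collect()
}
```

Precomputing the norms once means each point-center pair costs a single dot product, which is the usual reason this decomposition is preferred at higher dimensions.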

diskann-providers

  • Deleted src/utils/kmeans.rs
  • Removed k_means_clustering, k_meanspp_selecting_pivots, run_lloyds, compute_vecs_l2sq, compute_vec_l2sq from the public API
  • Removed the now-deduplicated math utility implementations from math_util.rs
  • Removed dead OPQ code: generate_optimized_pq_pivots, opq_quantize_all_chunks, copy_chunk_centroids_to_full_table, their test, and unused imports/constants — these were the sole remaining callers of k-means in this crate and were already gated behind #[allow(dead_code)]

arrayka commented Apr 10, 2026

@copilot update the PR description to reflect the recent changes

Copilot AI commented Apr 10, 2026

@copilot update the PR description to reflect the recent changes

Updated. The description now reflects all changes including the subsequent math_util.rs extraction into diskann-disk and the deduplication of compute_vecs_l2sq/compute_closest_centers.

Copilot AI requested a review from arrayka April 10, 2026 03:11
@codecov-commenter

Codecov Report

❌ Patch coverage is 94.35028% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.33%. Comparing base (ea37491) to head (1639a95).

Files with missing lines | Patch % | Lines
diskann-disk/src/utils/math_util.rs | 94.36% | 19 Missing ⚠️
diskann-disk/src/utils/kmeans.rs | 94.11% | 1 Missing ⚠️

@@            Coverage Diff             @@
##             main     #933      +/-   ##
==========================================
- Coverage   89.38%   89.33%   -0.06%     
==========================================
  Files         447      448       +1     
  Lines       84299    84054     -245     
==========================================
- Hits        75353    75089     -264     
- Misses       8946     8965      +19     
Flag | Coverage Δ
miri | 89.33% <94.35%> (-0.06%) ⬇️
unittests | 89.17% <94.35%> (-0.06%) ⬇️

Files with missing lines | Coverage Δ
diskann-disk/src/utils/partition.rs | 92.54% <ø> (ø)
diskann-providers/src/model/pq/pq_construction.rs | 91.03% <ø> (-1.12%) ⬇️
diskann-providers/src/utils/math_util.rs | 95.18% <ø> (+0.54%) ⬆️
diskann-disk/src/utils/kmeans.rs | 90.86% <94.11%> (ø)
diskann-disk/src/utils/math_util.rs | 94.36% <94.36%> (ø)
@arrayka arrayka marked this pull request as ready for review April 10, 2026 03:29
@arrayka arrayka requested review from a team and Copilot April 10, 2026 03:29
Copilot AI left a comment

Pull request overview

Moves the legacy BLAS-based, multithreaded k-means implementation out of diskann-providers and into diskann-disk, aligning the code with its only remaining production call site (disk index partitioning) while removing dead OPQ-related clustering code from providers.

Changes:

  • Add diskann-disk::utils::{kmeans, math_util} and re-export clustering + closest-center utilities from diskann-disk::utils.
  • Update disk partitioning to use the newly-local k-means/math utilities.
  • Remove k-means exports, k-means benchmarks, and dead OPQ code from diskann-providers; move corresponding benchmarks to diskann-disk and add dev-deps for tests.
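The re-export arrangement described above can be sketched as follows. This is only an illustration of the `pub use` pattern: the module and function names mirror those in the PR text, but the bodies and signatures are placeholders, since the real ones are not shown here.

```rust
// Stand-in for a crate's `utils` module tree. Names mirror the PR text;
// bodies and signatures are placeholders, not the real diskann-disk code.
mod utils {
    pub mod kmeans {
        /// Placeholder for the clustering routine.
        pub fn run_lloyds() -> &'static str {
            "kmeans::run_lloyds"
        }
    }

    pub mod math_util {
        /// Placeholder for the squared-norm helper.
        pub fn compute_vecs_l2sq() -> &'static str {
            "math_util::compute_vecs_l2sq"
        }
    }

    // Re-exports: callers can write `utils::run_lloyds` without
    // naming the submodule that defines it.
    pub use self::kmeans::run_lloyds;
    pub use self::math_util::compute_vecs_l2sq;
}
```

With this layout a caller such as a partitioning module can import from `utils` directly, and the submodule split can change later without touching call sites.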

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.

File | Description
diskann-providers/src/utils/mod.rs | Stops re-exporting k-means + closest-center math utilities from providers.
diskann-providers/src/utils/math_util.rs | Removes k-means/closest-center math utilities, leaving residual/vector helpers.
diskann-providers/src/model/pq/pq_construction.rs | Deletes dead OPQ code that depended on the old k-means path; removes unused imports/constants.
diskann-providers/benches/benchmarks/mod.rs | Drops k-means criterion benchmark module from providers.
diskann-providers/benches/benchmarks_iai/mod.rs | Drops k-means iai-callgrind benchmark module from providers.
diskann-providers/benches/bench_main.rs | Removes k-means criterion benchmark registration from providers.
diskann-providers/benches/bench_main_iai.rs | Removes k-means iai-callgrind benchmark registration from providers.
diskann-disk/src/utils/partition.rs | Switches partitioning to import k-means + closest-center utilities from diskann-disk::utils.
diskann-disk/src/utils/mod.rs | Exposes new math_util and kmeans modules + re-exports their APIs.
diskann-disk/src/utils/math_util.rs | Introduces closest-center and L2-norm utilities (moved/deduped from providers) with tests.
diskann-disk/src/utils/kmeans.rs | Introduces k-means++ pivot selection and Lloyd’s algorithm (moved from providers) with tests/proptests.
diskann-disk/Cargo.toml | Adds approx and proptest to dev-dependencies to support moved tests.
diskann-disk/benches/benchmarks/mod.rs | Registers the moved criterion k-means benchmarks in disk crate.
diskann-disk/benches/benchmarks/kmeans_bench.rs | Adds criterion benchmark using diskann-disk::utils k-means/math utilities.
diskann-disk/benches/benchmarks_iai/mod.rs | Registers the moved iai-callgrind k-means benchmarks in disk crate.
diskann-disk/benches/benchmarks_iai/kmeans_bench_iai.rs | Adds iai-callgrind benchmark using diskann-disk::utils k-means/math utilities.
diskann-disk/benches/bench_main.rs | Registers k-means criterion benchmark entrypoint in disk crate.
diskann-disk/benches/bench_main_iai.rs | Registers k-means iai-callgrind benchmark entrypoint in disk crate.
Cargo.lock | Records new dev-dependency resolution for approx and proptest in diskann-disk.
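For readers unfamiliar with the algorithm named in the kmeans.rs row, one iteration of Lloyd's algorithm can be sketched as below. This is a textbook single-threaded version under assumed conventions (row-major `f32` data, a hypothetical `lloyd_step` function, empty clusters keep their previous center); it is not the moved implementation, which the PR describes as BLAS-based and multi-threaded.

```rust
/// One iteration of Lloyd's algorithm on row-major data.
/// Returns the updated centers. Hypothetical helper (assumes `dim > 0`).
fn lloyd_step(points: &[f32], centers: &[f32], dim: usize) -> Vec<f32> {
    let k = centers.len() / dim;
    let mut sums = vec![0.0f32; k * dim];
    let mut counts = vec![0usize; k];

    for p in points.chunks_exact(dim) {
        // Assignment step: nearest center by squared L2 distance.
        let mut best = 0usize;
        let mut best_dist = f32::INFINITY;
        for (j, c) in centers.chunks_exact(dim).enumerate() {
            let d: f32 = p.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum();
            if d < best_dist {
                best_dist = d;
                best = j;
            }
        }
        counts[best] += 1;
        for (s, v) in sums[best * dim..(best + 1) * dim].iter_mut().zip(p) {
            *s += *v;
        }
    }

    // Update step: each center becomes the mean of its assigned points.
    // A cluster with no points keeps its previous center (an assumption
    // made here; other policies exist).
    let mut next = centers.to_vec();
    for j in 0..k {
        if counts[j] > 0 {
            for d in 0..dim {
                next[j * dim + d] = sums[j * dim + d] / counts[j] as f32;
            }
        }
    }
    next
}
```

Calling `lloyd_step` repeatedly until the centers stop changing (or an iteration cap is hit) gives the full algorithm; k-means++ only affects how the initial centers are chosen.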
Comments suppressed due to low confidence (2)

diskann-disk/src/utils/kmeans.rs:314

  • sum clamps f32::INFINITY distances to f32::MAX, but later prefix_sum is built from the original pivot_dist values. If any pivot_dist is INFINITY (e.g., squared L2 overflow for large coordinates), prefix_sum can become inf while sum is finite, triggering the "Prefix sum should not be greater than sum" error even though the algorithm could proceed. Handle INFINITY consistently (either allow sum to be infinite in f64, or clamp/filter distances both when computing sum and when accumulating prefix_sum).

diskann-disk/src/utils/kmeans.rs:972

  • The proptest names use k_meansspp_... (double 's'), which looks like a typo of k_meanspp_... and makes these tests harder to grep/relate to the k-means++ implementation. Consider renaming the test functions for consistency.
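One way to apply the "clamp consistently" option from the first suppressed comment is sketched below. The helper `sample_weighted` is hypothetical and is not the code at kmeans.rs:314; it takes a pre-drawn uniform value `u` in [0, 1) instead of a random number generator so that it has no external dependencies. The point is only that the total and the prefix sums are both built from the same clamped values, accumulated in f64.

```rust
/// Pick an index with probability proportional to `dists[i]`,
/// given a uniform draw `u` in [0, 1).
/// Hypothetical helper; not the implementation in kmeans.rs.
fn sample_weighted(dists: &[f32], u: f64) -> Option<usize> {
    // Clamp once, up front, so the total and the running prefix sum
    // are computed from identical values. INFINITY becomes f32::MAX.
    let clamped: Vec<f64> = dists
        .iter()
        .map(|&d| f64::from(d.clamp(0.0, f32::MAX)))
        .collect();

    // Accumulating in f64 keeps the sum finite even when many entries
    // equal f32::MAX.
    let total: f64 = clamped.iter().sum();
    if total <= 0.0 || total.is_nan() {
        return None; // no positive weight (or a NaN distance) to sample from
    }

    let target = u * total;
    let mut prefix = 0.0f64;
    for (i, &w) in clamped.iter().enumerate() {
        prefix += w;
        if prefix > target {
            return Some(i);
        }
    }
    // Floating-point rounding can leave `prefix` marginally below `target`;
    // fall back to the last entry with positive weight.
    clamped.iter().rposition(|&w| w > 0.0)
}
```

Because the prefix sum is built from the same clamped values as the total, it cannot exceed the total by more than rounding error, so the failure mode described in the comment does not arise in this sketch.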


@hildebrandmw hildebrandmw left a comment

Didn't notice that the OPQ code was already marked with dead_code. Nice find. I think @arkrishn94 is also rooting out the rest of the OPQ plumbing, but the conflict shouldn't be too bad.

@arrayka arrayka enabled auto-merge (squash) April 10, 2026 18:13
@arkrishn94 arkrishn94 left a comment

Looks good. Will remove the rest of the OPQ plumbing in a quick follow up

@arrayka arrayka merged commit 8341f6d into main Apr 10, 2026
28 checks passed
@arrayka arrayka deleted the copilot/move-kmeans-to-diskann-disk branch April 10, 2026 18:15
Development

Successfully merging this pull request may close these issues.

Get rid of utils/kmeans.rs

6 participants