compute_vecs_l2sq: Replace scalar L2 Squared norm with SIMD-optimized FastL2NormSquared by arrayka · Pull Request #1107 · microsoft/DiskANN

arrayka · 2026-05-27T02:36:06Z

Changes

diskann-disk/src/utils/math_util.rs:
- Replaced the private compute_vec_l2sq() helper function with FastL2NormSquared.evaluate(chunk)
[Unrelated improvements] kmeans_bench.rs (criterion):
- Switched from iter() to iter_batched with BatchSize::SmallInput to avoid measuring setup costs (clone, allocation)
- Inlined snrm2_benchmark_rust() helper into the benchmark closure
[Unrelated improvements] kmeans_bench_iai.rs (iai-callgrind):
- Moved thread pool creation into setup_data() so it's excluded from measurement
- Removed unnecessary data.clone() calls: the benchmark body is executed only once rather than in a loop, so the input can be consumed directly
- Added black_box to prevent dead-code elimination of outputs
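
For readers unfamiliar with why a SIMD-oriented norm can beat a scalar loop, here is a minimal sketch of the general idea. It is not the `FastL2NormSquared` implementation and does not use the `diskann_vector` API; the function names and the lane count of 8 are assumptions chosen for illustration. Several independent accumulators remove the serial dependency of a single running sum, which gives the compiler room to vectorize.

```rust
/// Scalar reference: a single accumulator, so every addition depends on
/// the previous one.
fn l2_norm_squared_scalar(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum()
}

/// Hypothetical SIMD-friendly variant (illustration only): eight independent
/// accumulators that the compiler is free to map onto vector lanes.
fn l2_norm_squared_lanes(v: &[f32]) -> f32 {
    const LANES: usize = 8; // assumed width, not taken from the PR
    let mut acc = [0.0f32; LANES];
    let mut chunks = v.chunks_exact(LANES);
    for chunk in &mut chunks {
        for (a, x) in acc.iter_mut().zip(chunk) {
            *a += x * x;
        }
    }
    // Elements that do not fill a whole chunk are handled with a scalar tail.
    let tail: f32 = chunks.remainder().iter().map(|x| x * x).sum();
    acc.iter().sum::<f32>() + tail
}

fn main() {
    // 896 dimensions, as in the measurements reported in this PR.
    // Small integer values keep every partial sum exactly representable.
    let v: Vec<f32> = (0..896).map(|i| (i % 7) as f32).collect();
    let scalar = l2_norm_squared_scalar(&v);
    let lanes = l2_norm_squared_lanes(&v);
    assert_eq!(scalar, lanes);
    println!("scalar = {scalar}, lanes = {lanes}");
}
```

With arbitrary floating-point inputs the two variants can differ in the last bits because the additions are reordered, so a comparison against a scalar reference would normally use a tolerance.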

Performance

iai-callgrind shows a 60% reduction in estimated CPU cost (see the Estimated Cycles metric) after switching to FastL2NormSquared.evaluate(chunk).
The measurement was taken in a single-threaded run with 896 dimensions.

bench_main_iai::kmeans_bench_iai::snrm2_benchmark_rust_iai
  Instructions:                    29000796|304003517            (-90.4604%) [-10.4826x]
  L1 Hits:                         34794766|388098122            (-91.0345%) [-11.1539x]
  L2 Hits:                                2|15                   (-86.6667%) [-7.50000x]
  RAM Hits:                         5606322|5606321              (+0.00002%) [+1.00000x]
  Total read+write:                40401090|393704458            (-89.7382%) [-9.74490x]
  Estimated Cycles:               231016046|584319432            (-60.4641%) [-2.52935x]
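
The gap between the ~90% drop in instructions and the ~60% drop in estimated cycles can be checked by hand. The sketch below assumes the cache-weighting that iai-style tools use for this metric (L1 hits + 5 × last-level hits + 35 × RAM hits); the function name is hypothetical. Under that assumption the figures in the table reproduce exactly, and the unchanged RAM hits account for most of the remaining cost.

```rust
/// Estimated cycles under the assumed weighting:
/// L1 hits + 5 * L2 hits + 35 * RAM hits.
fn estimated_cycles(l1_hits: u64, l2_hits: u64, ram_hits: u64) -> u64 {
    l1_hits + 5 * l2_hits + 35 * ram_hits
}

fn main() {
    // Values copied from the iai-callgrind output above (new | old).
    let new = estimated_cycles(34_794_766, 2, 5_606_322);
    let old = estimated_cycles(388_098_122, 15, 5_606_321);
    assert_eq!(new, 231_016_046);
    assert_eq!(old, 584_319_432);

    // Share of the estimate attributable to RAM hits, before and after.
    let ram_share = |ram: u64, total: u64| 100.0 * (35 * ram) as f64 / total as f64;
    println!("RAM share (old): {:.1}%", ram_share(5_606_321, old));
    println!("RAM share (new): {:.1}%", ram_share(5_606_322, new));
}
```

Read this way, the new code is largely bound by memory traffic, which the change does not affect.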

criterion shows a 61% latency reduction.
The measurement was taken in a single-threaded run with 896 dimensions.

.\bench_main-a1fa58ecbc5b2232.exe --bench Snrm2 --color=never
Gnuplot not found, using plotters backend
Benchmarking kmeans-computation/Snrm2 Rust Run
Benchmarking kmeans-computation/Snrm2 Rust Run: Warming up for 3.0000 s

Warning: Unable to complete 50 samples in 5.0s. You may wish to increase target time to 8.3s, or reduce sample count to 30.
Benchmarking kmeans-computation/Snrm2 Rust Run: Collecting 50 samples in estimated 8.2595 s (50 iterations)
Benchmarking kmeans-computation/Snrm2 Rust Run: Analyzing
kmeans-computation/Snrm2 Rust Run
                        time:   [45.687 ms 46.349 ms 47.244 ms]
                        change: [-61.988% -61.410% -60.611%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 50 measurements (26.00%)
  2 (4.00%) low severe
  4 (8.00%) low mild
  1 (2.00%) high mild
  6 (12.00%) high severe

P.S. Neither tool shows noticeable performance improvements for 4-dimensional vectors.

codecov-commenter · 2026-05-27T02:59:19Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.46%. Comparing base (3dc4a28) to head (fcad292).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1107      +/-   ##
==========================================
- Coverage   89.48%   89.46%   -0.02%     
==========================================
  Files         474      482       +8     
  Lines       89753    91075    +1322     
==========================================
+ Hits        80316    81481    +1165     
- Misses       9437     9594     +157

| Flag | Coverage Δ | |
|---|---|---|
| miri | `89.46% <100.00%> (-0.02%)` | ⬇️ |
| unittests | `89.11% <100.00%> (-0.02%)` | ⬇️ |

| Files with missing lines | Coverage Δ | |
|---|---|---|
| diskann-disk/src/utils/math_util.rs | `98.82% <100.00%> (-0.02%)` | ⬇️ |

... and 15 files with indirect coverage changes

Copilot

Pull request overview

This PR updates diskann-disk’s squared L2 norm computation to use the SIMD-optimized diskann_vector::norm::FastL2NormSquared implementation, and adjusts the k-means benchmark harnesses to better isolate measured work.

Changes:

Replaced the scalar per-vector L2-squared loop in compute_vecs_l2sq with FastL2NormSquared.evaluate(...).
Updated the Criterion benchmark to use iter_batched to avoid including setup (clones/allocations) in timing.
Updated the iai-callgrind benchmark to move thread-pool creation into setup and add black_box to prevent DCE.
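
The two benchmark-hygiene points in this summary (keep setup out of the timed region, and stop the optimizer from discarding results) can be illustrated without any benchmarking crate. This is a minimal sketch using only the standard library; `time_batched` and `sum_of_squares` are hypothetical names, and this is not how Criterion's `iter_batched` or iai-callgrind are implemented.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Stand-in workload for the sketch.
fn sum_of_squares(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum()
}

/// Hypothetical helper: runs `setup` outside the timed region and only
/// measures `routine`, accumulating the elapsed time over `iters` runs.
fn time_batched<S, I, R, O>(iters: usize, mut setup: S, mut routine: R) -> Duration
where
    S: FnMut() -> I,
    R: FnMut(I) -> O,
{
    let mut total = Duration::ZERO;
    for _ in 0..iters {
        let input = setup(); // allocation and cloning happen here, untimed
        let start = Instant::now();
        let out = routine(black_box(input)); // hide the input from the optimizer
        total += start.elapsed();
        black_box(out); // keep the result alive so the work is not eliminated
    }
    total
}

fn main() {
    let data = vec![1.5f32; 896];
    let elapsed = time_batched(100, || data.clone(), |v| sum_of_squares(&v));
    println!("total measured time: {elapsed:?}");
}
```

If the clone were performed inside `routine` instead, its cost would be folded into every measurement, which is the effect the PR's benchmark changes aim to avoid.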

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| diskann-disk/src/utils/math_util.rs | Switches `compute_vecs_l2sq` to use `FastL2NormSquared` for faster L2-squared norms. |
| diskann-disk/benches/benchmarks/kmeans_bench.rs | Uses Criterion `iter_batched` and inlines snrm2 benchmark logic to reduce setup overhead in measurements. |
| diskann-disk/benches/benchmarks_iai/kmeans_bench_iai.rs | Moves pool creation into setup, removes unnecessary clones, and adds `black_box` for more reliable callgrind measurements. |

@harsha-simhadri

# DiskANN v0.53.0 Release Notes

## Breaking Changes

An AI generated, human reviewed list of changes is summarized below.

### Paged search overhauled — channel-based API ([#1078](#1078))

`PagedSearchState` and its `'static`-bound pause/resume model have been replaced with an async, channel-based interface. The recommended way to drive paged search is now via a `tokio::sync::mpsc` channel, with the searcher embedded in an otherwise-`'static` future. See the [rendered RFC](https://github.com/microsoft/DiskANN/blob/main/rfcs/01078-paged-search.md) for the new shape. Callers wired against `PagedSearchState` must migrate to the channel API.

Users of paged search via `wrapped_async::DiskANNIndex` that know their inner futures will never suspend can use the new `wrapped_async::DiskANNIndex::paged_search_no_await`; this will efficiently run paged searches with minimal synchronization overhead.

### `DiskANNIndex::flat_search` removed ([#1076](#1076))

`DiskANNIndex::flat_search` and the `IdIterator` trait have been removed from the `diskann` crate. Equivalent functionality lives on the new inherent method `DiskIndexSearcher::flat_search` in `diskann-disk`. This unblocks the experimental directions in #1067 and #983.

```rust
// Before
diskann_index.flat_search(query, ...)?;

// After
disk_index_searcher.flat_search(query, ...).await?;
```

### `DiskIndexSearcher::flat_search` now batched ([#1097](#1097))

The new `DiskIndexSearcher::flat_search` uses the bulk `pq_distances` path instead of one-vector-at-a-time `Accessor::build_query_computer` + `evaluate_similarity`. Downstream behavior is equivalent but tighter resource bounds apply.

### `centroid` removed from PQ interfaces ([#1010](#1010))

The dataset-centroid argument has been removed from `FixedChunkPQTable` construction, `populate`, and most other PQ APIs. The shift only ever worked for L2 distance and was silently ignored for inner-product / cosine, so passing it was a footgun. When an L2 shift is required, fold it into the PQ pivots instead (the library now does this internally).

```rust
// Before
let table = FixedChunkPQTable::new(.., centroid, ..);

// After — drop the centroid argument
let table = FixedChunkPQTable::new(.., ..);
```

### Flat search interface ([#983](#983))

A new `flat` module in `diskann` adds a provider-agnostic brute-force search surface, mirroring the shape of graph search. Backends implement a single trait, `DistancesUnordered<C>` (in `flat/strategy.rs`), which fuses iteration and distance computation, allowing any backend (in-memory, quantized, disk, remote) to plug into a shared algorithm. See the [rendered RFC](https://github.com/microsoft/DiskANN/blob/main/rfcs/00983-flat-search.md). This is additive but is the new canonical surface — direct ad-hoc flat-search call sites should migrate.

### `bf_tree` extracted into `diskann-bftree` crate ([#1020](#1020))

The bf_tree provider has been moved out of `diskann-providers` (previously at `diskann-providers/src/model/graph/provider/async_/bf_tree/`) into a new standalone `diskann-bftree` crate. Along with the move:

- Switched from PQ to spherical quantization.
- Dropped dependencies on `DeletionCheck`, `AsDeletionCheck`, and `RemoveDeletedIdsAndCopy`.
- Simplified generics.

Consumers must update their `Cargo.toml` to depend on `diskann-bftree` and update import paths.

### `direct_distance_impl` and `inner_product_raw` re-exposed ([#1081](#1081))

`direct_distance_impl` (free function) and `FixedChunkPQTable::inner_product_raw` are `pub` again after being privatized in #1044. Restored to unblock a downstream user. Not breaking in the typical direction — this restores previously available API surface.

### MinMax `recompress` takes a grid-scale parameter ([#1109](#1109))

The MinMax `recompress` API now accepts a grid-scale parameter.

## New Features

- SIMD-optimized L2-squared norm ([#1107](#1107))
- Significantly faster bitmap computation ([#1099](#1099))
  - Large speedup on the bitmap construction path used by filtered search.
- LLVM IR bloat regression check in CI ([#1083](#1083))
  - CI now flags regressions in generated LLVM IR size, helping catch unintended monomorphization blow-ups.
- Recall computation fix for under-k groundtruth ([#1069](#1069))

## Merged PRs

* Revise README for DiskANN3 by @harsha-simhadri in #1046
* [CI] Try to fix publishing step by @hildebrandmw in #1057
* [benchmark] Remove `DispatchRule` by @hildebrandmw in #1064
* [benchmark] Automatic Input Registration by @hildebrandmw in #1066
* Remove centroid from most PQ interfaces by @hildebrandmw in #1010
* [diskann/disk] Remove `flat_search` from `DiskANNIndex` by @hildebrandmw in #1076
* macos build and miri check to nightly by @harsha-simhadri in #1058
* [API] Make some methods public again by @hildebrandmw in #1081
* [benchmark] Simply `Inputs` more by @hildebrandmw in #1077
* Turn on stack protection for the diskann-garnet NuGet build by @jackmoffitt in #1082
* Fix options for diskann-garnet nuget pipeline by @jackmoffitt in #1091
* [CI] add LLVM IR bloat regression check by @arazumov in #1083
* Bump openssl from 0.10.79 to 0.10.80 by @dependabot[bot] in #1093
* [Disk CI benchmarks] Use 1ES.Pool=diskann-github by @arazumov in #869
* Fix recall computation for fewer than k groundtruth results by @magdalendobson in #1069
* bf_tree migration away from diskann-providers by @JordanMaples in #1020
* [RFC/diskann] Overhaul paged search by @hildebrandmw in #1078
* Remove unsafe code from compute_vec_l2sq by @arazumov in #1094
* Remove direct accessor call in `diskann-garnet` by @hildebrandmw in #1098
* Refactor `DiskIndexSearcher::flat_search` to use batching by @hildebrandmw in #1097
* [flat index] Flat Search Interface by @arkrishn94 in #983
* migrating multi-hop tests from diskann-providers to diskann by @JordanMaples in #928
* Significantly speed up bitmap computation by @magdalendobson in #1099
* `compute_vecs_l2sq`: Replace scalar L2 Squared norm with SIMD-optimized FastL2NormSquared by @arazumov in #1107
* [minmax] Add grid scaling to recompress API by @arkrishn94 in #1109

**Full Changelog**: v0.52.0...v0.53.0

Alex Razumov (from Dev Box) added 3 commits May 26, 2026 17:03

Added side-by-side comparison

00a7398

Removing comparison

8b9bd93

fmt

d239119

arrayka marked this pull request as ready for review May 27, 2026 03:30

arrayka requested review from a team and Copilot May 27, 2026 03:30

Copilot started reviewing on behalf of arrayka May 27, 2026 03:31

Copilot AI reviewed May 27, 2026

Comment thread diskann-disk/benches/benchmarks/kmeans_bench.rs

Added .expect()

fcad292

hildebrandmw approved these changes May 27, 2026

arrayka enabled auto-merge (squash) May 27, 2026 21:59

arrayka linked an issue May 27, 2026 that may be closed by this pull request

Explore compute_vec_l2sq speed-up #1096

Closed

arkrishn94 approved these changes May 27, 2026

arrayka merged commit c32b838 into main May 27, 2026
25 of 26 checks passed

arrayka deleted the u/alrazu/fast_l2sq branch May 27, 2026 22:37

arkrishn94 mentioned this pull request May 28, 2026

Bump version to 0.53.0 #1111

Merged

arrayka merged 4 commits into main from u/alrazu/fast_l2sq

5 participants