compute_vecs_l2sq: Replace scalar L2 Squared norm with SIMD-optimized FastL2NormSquared #1107

Merged
arrayka merged 4 commits into main from u/alrazu/fast_l2sq on May 27, 2026

arrayka commented May 27, 2026

Changes

  • diskann-disk/src/utils/math_util.rs:
    • Replaced the private compute_vec_l2sq() helper function with FastL2NormSquared.evaluate(chunk)
  • [Unrelated improvements] kmeans_bench.rs (criterion):
    • Switched from iter() to iter_batched with BatchSize::SmallInput to avoid measuring setup costs (clone, allocation)
    • Inlined snrm2_benchmark_rust() helper into the benchmark closure
  • [Unrelated improvements] kmeans_bench_iai.rs (iai-callgrind):
    • Moved thread pool creation into setup_data() so it's excluded from measurement
    • Removed unnecessary data.clone() calls; the benchmark body executes only once, so no per-iteration copies are needed
    • Added black_box to prevent dead-code elimination of outputs
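
The source does not show the body of FastL2NormSquared, so the following is only a minimal sketch of the general idea behind replacing a scalar squared-norm loop with a SIMD-friendly one. The names `l2sq_scalar`, `l2sq_chunked`, and `vecs_l2sq` are hypothetical, and the eight-lane width is an assumption chosen for illustration.

```rust
/// Scalar baseline: a single accumulator, so every addition depends on the
/// previous one.
fn l2sq_scalar(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum()
}

/// SIMD-friendly variant (illustrative, not the library's implementation):
/// eight independent accumulators over fixed-width chunks, a shape the
/// compiler can map onto vector registers.
fn l2sq_chunked(v: &[f32]) -> f32 {
    const LANES: usize = 8; // assumed lane count
    let mut acc = [0.0f32; LANES];
    let chunks = v.chunks_exact(LANES);
    let tail = chunks.remainder(); // elements that do not fill a full chunk
    for chunk in chunks {
        for (a, x) in acc.iter_mut().zip(chunk) {
            *a += x * x;
        }
    }
    let tail_sum: f32 = tail.iter().map(|x| x * x).sum();
    acc.iter().sum::<f32>() + tail_sum
}

/// Hypothetical stand-in for a per-vector driver: `data` holds vectors of
/// length `dim` back to back. `dim` must be non-zero.
fn vecs_l2sq(data: &[f32], dim: usize) -> Vec<f32> {
    data.chunks_exact(dim).map(l2sq_chunked).collect()
}
```

Because floating-point addition is not associative, the two variants can differ in the last bits for general inputs; they agree exactly when all partial sums are representable, as with small integer-valued inputs.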

Performance

iai-callgrind shows a 60% reduction in estimated CPU cost (see the Estimated Cycles metric) after switching to FastL2NormSquared.evaluate(chunk).
The measurement was taken in a single-threaded run with 896 dimensions.

bench_main_iai::kmeans_bench_iai::snrm2_benchmark_rust_iai
  Instructions:                    29000796|304003517            (-90.4604%) [-10.4826x]
  L1 Hits:                         34794766|388098122            (-91.0345%) [-11.1539x]
  L2 Hits:                                2|15                   (-86.6667%) [-7.50000x]
  RAM Hits:                         5606322|5606321              (+0.00002%) [+1.00000x]
  Total read+write:                40401090|393704458            (-89.7382%) [-9.74490x]
  Estimated Cycles:               231016046|584319432            (-60.4641%) [-2.52935x]
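
The Estimated Cycles rows above are consistent with iai-callgrind's documented weighting of cache events (L1 hits + 5 × L2 hits + 35 × RAM hits). The sketch below recomputes both columns (new|old) from the reported hit counts; the struct and function names are hypothetical helpers for illustration.

```rust
/// Cache-hit counts taken from one column of the iai-callgrind output.
struct CacheSummary {
    l1_hits: u64,
    l2_hits: u64,
    ram_hits: u64,
}

/// Weighted sum used for the Estimated Cycles metric.
fn estimated_cycles(s: &CacheSummary) -> u64 {
    s.l1_hits + 5 * s.l2_hits + 35 * s.ram_hits
}

/// Relative change from `old` to `new`, in percent.
fn percent_change(new: u64, old: u64) -> f64 {
    (new as f64 - old as f64) / old as f64 * 100.0
}
```

With the numbers above, the RAM-hit term is unchanged at roughly 196 million cycles and makes up about 85% of the new total, which is why a 90% drop in instructions translates into a 60% drop in estimated cycles.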

criterion shows a 61% latency reduction.
The measurement was taken in a single-threaded run with 896 dimensions.

.\bench_main-a1fa58ecbc5b2232.exe --bench Snrm2 --color=never
Gnuplot not found, using plotters backend
Benchmarking kmeans-computation/Snrm2 Rust Run
Benchmarking kmeans-computation/Snrm2 Rust Run: Warming up for 3.0000 s

Warning: Unable to complete 50 samples in 5.0s. You may wish to increase target time to 8.3s, or reduce sample count to 30.
Benchmarking kmeans-computation/Snrm2 Rust Run: Collecting 50 samples in estimated 8.2595 s (50 iterations)
Benchmarking kmeans-computation/Snrm2 Rust Run: Analyzing
kmeans-computation/Snrm2 Rust Run
                        time:   [45.687 ms 46.349 ms 47.244 ms]
                        change: [-61.988% -61.410% -60.611%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 50 measurements (26.00%)
  2 (4.00%) low severe
  4 (8.00%) low mild
  1 (2.00%) high mild
  6 (12.00%) high severe

P.S. Neither tool shows noticeable performance improvements for 4-dimensional vectors.

codecov-commenter commented May 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.46%. Comparing base (3dc4a28) to head (fcad292).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1107      +/-   ##
==========================================
- Coverage   89.48%   89.46%   -0.02%     
==========================================
  Files         474      482       +8     
  Lines       89753    91075    +1322     
==========================================
+ Hits        80316    81481    +1165     
- Misses       9437     9594     +157     
Flag Coverage Δ
miri 89.46% <100.00%> (-0.02%) ⬇️
unittests 89.11% <100.00%> (-0.02%) ⬇️

Files with missing lines Coverage Δ
diskann-disk/src/utils/math_util.rs 98.82% <100.00%> (-0.02%) ⬇️

... and 15 files with indirect coverage changes

arrayka marked this pull request as ready for review May 27, 2026 03:30
arrayka requested review from a team and Copilot May 27, 2026 03:30

Copilot AI left a comment

Pull request overview

This PR updates diskann-disk’s squared L2 norm computation to use the SIMD-optimized diskann_vector::norm::FastL2NormSquared implementation, and adjusts the k-means benchmark harnesses to better isolate measured work.

Changes:

  • Replaced the scalar per-vector L2-squared loop in compute_vecs_l2sq with FastL2NormSquared.evaluate(...).
  • Updated the Criterion benchmark to use iter_batched to avoid including setup (clones/allocations) in timing.
  • Updated the iai-callgrind benchmark to move thread-pool creation into setup and add black_box to prevent DCE.
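
The benchmark changes listed above share one idea: build inputs outside the timed region and keep the optimizer from discarding the result. The following std-only sketch illustrates that idea and is not Criterion's or iai-callgrind's implementation; `time_batched` is a hypothetical helper, and unlike Criterion's `iter_batched` it times each call individually rather than a whole batch.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `routine` `iterations` times and returns the time spent inside it.
/// Each input is produced by `setup` before the clock starts, so clones and
/// allocations made there are not measured.
fn time_batched<I, O>(
    iterations: usize,
    mut setup: impl FnMut() -> I,
    mut routine: impl FnMut(I) -> O,
) -> Duration {
    let mut total = Duration::ZERO;
    for _ in 0..iterations {
        let input = setup(); // untimed: clone or allocate here
        let start = Instant::now();
        let output = routine(black_box(input));
        total += start.elapsed();
        // Mark the output as used so the computation is not optimized away.
        black_box(output);
    }
    total
}
```

A call such as `time_batched(50, || data.clone(), |v| v.iter().map(|x| x * x).sum::<f32>())` would time only the squared-norm loop, not the clone.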

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| diskann-disk/src/utils/math_util.rs | Switches compute_vecs_l2sq to use FastL2NormSquared for faster L2-squared norms. |
| diskann-disk/benches/benchmarks/kmeans_bench.rs | Uses Criterion iter_batched and inlines snrm2 benchmark logic to reduce setup overhead in measurements. |
| diskann-disk/benches/benchmarks_iai/kmeans_bench_iai.rs | Moves pool creation into setup, removes unnecessary clones, and adds black_box for more reliable callgrind measurements. |

Comment thread diskann-disk/benches/benchmarks/kmeans_bench.rs
arrayka enabled auto-merge (squash) May 27, 2026 21:59
arrayka linked an issue May 27, 2026 that may be closed by this pull request
arrayka merged commit c32b838 into main May 27, 2026
25 of 26 checks passed
arrayka deleted the u/alrazu/fast_l2sq branch May 27, 2026 22:37
arkrishn94 added a commit that referenced this pull request May 28, 2026
# DiskANN v0.53.0 Release Notes

## Breaking Changes

An AI-generated, human-reviewed list of changes is summarized below.

### Paged search overhauled — channel-based API
([#1078](#1078))

`PagedSearchState` and its `'static`-bound pause/resume model have been
replaced with an async, channel-based interface. The recommended way to
drive paged search is now via a `tokio::sync::mpsc` channel, with the
searcher embedded in an otherwise-`'static` future. See the [rendered
RFC](https://github.com/microsoft/DiskANN/blob/main/rfcs/01078-paged-search.md)
for the new shape. Callers wired against `PagedSearchState` must migrate
to the channel API.

Users of paged search via `wrapped_async::DiskANNIndex` that know their
inner futures will never suspend can use the new
`wrapped_async::DiskANNIndex::paged_search_no_await`; this will
efficiently run paged searches with minimal synchronization overhead.
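
As a loose illustration of driving a search page by page through a channel, here is a sketch that swaps in the standard library's blocking `std::sync::mpsc::sync_channel` and a thread in place of `tokio::sync::mpsc` and a future. It does not reproduce the DiskANN API; `spawn_paged_search` and its arguments are hypothetical.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Hypothetical producer: sorts `(id, distance)` candidates by distance and
/// sends their ids one page at a time. `page_size` must be non-zero.
fn spawn_paged_search(mut candidates: Vec<(u32, f32)>, page_size: usize) -> Receiver<Vec<u32>> {
    // A small bounded channel keeps the producer from running far ahead of
    // the consumer.
    let (tx, rx) = sync_channel(1);
    thread::spawn(move || {
        candidates.sort_by(|a, b| a.1.total_cmp(&b.1));
        for page in candidates.chunks(page_size) {
            let ids: Vec<u32> = page.iter().map(|(id, _)| *id).collect();
            if tx.send(ids).is_err() {
                break; // the consumer dropped the receiver; stop searching
            }
        }
    });
    rx
}
```

The consumer asks for the next page with `rx.recv()` and stops paging simply by dropping the receiver, which is the property that removes the need for an explicit pause/resume state object in this sketch.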

### `DiskANNIndex::flat_search` removed
([#1076](#1076))

`DiskANNIndex::flat_search` and the `IdIterator` trait have been removed
from the `diskann` crate. Equivalent functionality lives on the new
inherent method `DiskIndexSearcher::flat_search` in `diskann-disk`. This
unblocks the experimental directions in #1067 and #983.

```rust
// Before
diskann_index.flat_search(query, ...)?;

// After
disk_index_searcher.flat_search(query, ...).await?;
```

### `DiskIndexSearcher::flat_search` now batched
([#1097](#1097))

The new `DiskIndexSearcher::flat_search` uses the bulk `pq_distances`
path instead of one-vector-at-a-time `Accessor::build_query_computer` +
`evaluate_similarity`. Downstream behavior is equivalent but tighter
resource bounds apply.

### `centroid` removed from PQ interfaces
([#1010](#1010))

The dataset-centroid argument has been removed from `FixedChunkPQTable`
construction, `populate`, and most other PQ APIs. The shift only ever
worked for L2 distance and was silently ignored for inner-product /
cosine, so passing it was a footgun. When an L2 shift is required, fold
it into the PQ pivots instead (the library now does this internally).

```rust
// Before
let table = FixedChunkPQTable::new(.., centroid, ..);

// After — drop the centroid argument
let table = FixedChunkPQTable::new(.., ..);
```
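
To see why a centroid shift can be folded into the pivots for L2 but has no equivalent for inner product, consider the sketch below. It assumes the shift means subtracting a centroid `c` from the query before comparing against a pivot `p`, so folding means using `p + c` as the pivot; the helper names are hypothetical and this is not the library's code.

```rust
/// Squared L2 distance between two equal-length vectors.
fn l2sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Inner product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Query shifted by the centroid: q - c.
fn shift(q: &[f32], c: &[f32]) -> Vec<f32> {
    q.iter().zip(c).map(|(x, y)| x - y).collect()
}

/// Pivot with the centroid folded in: p + c.
fn fold_centroid(p: &[f32], c: &[f32]) -> Vec<f32> {
    p.iter().zip(c).map(|(x, y)| x + y).collect()
}
```

Since (q - c) - p = q - (p + c), `l2sq(&shift(q, c), p)` equals `l2sq(q, &fold_centroid(p, c))` up to rounding. No such identity holds for `dot`: with q = [3, 5], c = [1, 1], p = [1, 2], the shifted inner product is 10 while the inner product against the folded pivot is 21.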

### Flat search interface
([#983](#983))

A new `flat` module in `diskann` adds a provider-agnostic brute-force
search surface, mirroring the shape of graph search. Backends implement
a single trait, `DistancesUnordered<C>` (in `flat/strategy.rs`), which
fuses iteration and distance computation, allowing any backend
(in-memory, quantized, disk, remote) to plug into a shared algorithm.
See the [rendered
RFC](https://github.com/microsoft/DiskANN/blob/main/rfcs/00983-flat-search.md).
This is additive but is the new canonical surface — direct ad-hoc
flat-search call sites should migrate.
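
The sketch below illustrates what fusing iteration and distance computation behind one trait can look like. The real `DistancesUnordered<C>` signature is not shown in these notes, so the trait `VisitDistances`, the `InMemory` backend, and the `flat_search` function here are hypothetical and may differ substantially from the library.

```rust
/// Hypothetical backend trait: the backend both walks its vectors and
/// computes distances, reporting `(id, distance)` pairs in any order.
trait VisitDistances {
    fn visit_distances(&self, query: &[f32], visit: &mut dyn FnMut(u32, f32));
}

/// Toy in-memory backend: vectors of length `dim` stored back to back.
struct InMemory {
    dim: usize,
    data: Vec<f32>,
}

impl VisitDistances for InMemory {
    fn visit_distances(&self, query: &[f32], visit: &mut dyn FnMut(u32, f32)) {
        for (id, v) in self.data.chunks_exact(self.dim).enumerate() {
            let d: f32 = v.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum();
            visit(id as u32, d);
        }
    }
}

/// Shared brute-force algorithm: collect every distance, keep the k smallest.
fn flat_search(backend: &dyn VisitDistances, query: &[f32], k: usize) -> Vec<(u32, f32)> {
    let mut all = Vec::new();
    backend.visit_distances(query, &mut |id: u32, d: f32| all.push((id, d)));
    all.sort_by(|a, b| a.1.total_cmp(&b.1));
    all.truncate(k);
    all
}
```

In this shape the top-k logic never needs to know how vectors are stored, so a quantized or disk-backed backend would only need its own `visit_distances`.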

### `bf_tree` extracted into `diskann-bftree` crate
([#1020](#1020))

The bf_tree provider has been moved out of `diskann-providers`
(previously at
`diskann-providers/src/model/graph/provider/async_/bf_tree/`) into a new
standalone `diskann-bftree` crate. Along with the move:

- Switched from PQ to spherical quantization.
- Dropped dependencies on `DeletionCheck`, `AsDeletionCheck`, and
`RemoveDeletedIdsAndCopy`.
- Simplified generics.

Consumers must update their `Cargo.toml` to depend on `diskann-bftree`
and update import paths.

### `direct_distance_impl` and `inner_product_raw` re-exposed
([#1081](#1081))

`direct_distance_impl` (free function) and
`FixedChunkPQTable::inner_product_raw` are `pub` again after being
privatized in #1044. Restored to unblock a downstream user. Not breaking
in the typical direction — this restores previously available API
surface.

### MinMax `recompress` takes a grid-scale parameter
([#1109](#1109))

The MinMax `recompress` API now accepts a grid-scale parameter. 

## New Features

- SIMD-optimized L2-squared norm ([#1107](#1107))
- Significantly faster bitmap computation ([#1099](#1099))
  - Large speedup on the bitmap construction path used by filtered search.
- LLVM IR bloat regression check in CI ([#1083](#1083))
  - CI now flags regressions in generated LLVM IR size, helping catch
    unintended monomorphization blow-ups.
- Recall computation fix for under-k groundtruth ([#1069](#1069))

## Merged PRs

* Revise README for DiskANN3 by @harsha-simhadri in
#1046
* [CI] Try to fix publishing step by @hildebrandmw in
#1057
* [benchmark] Remove `DispatchRule` by @hildebrandmw in
#1064
* [benchmark] Automatic Input Registration by @hildebrandmw in
#1066
* Remove centroid from most PQ interfaces by @hildebrandmw in
#1010
* [diskann/disk] Remove `flat_search` from `DiskANNIndex` by
@hildebrandmw in #1076
* macos build and miri check to nightly by @harsha-simhadri in
#1058
* [API] Make some methods public again by @hildebrandmw in
#1081
* [benchmark] Simply `Inputs` more by @hildebrandmw in
#1077
* Turn on stack protection for the diskann-garnet NuGet build by
@jackmoffitt in #1082
* Fix options for diskann-garnet nuget pipeline by @jackmoffitt in
#1091
* [CI] add LLVM IR bloat regression check by @arazumov in
#1083
* Bump openssl from 0.10.79 to 0.10.80 by @dependabot[bot] in
#1093
* [Disk CI benchmarks] Use 1ES.Pool=diskann-github by @arazumov in
#869
* Fix recall computation for fewer than k groundtruth results by
@magdalendobson in #1069
* bf_tree migration away from diskann-providers by @JordanMaples in
#1020
* [RFC/diskann] Overhaul paged search by @hildebrandmw in
#1078
* Remove unsafe code from compute_vec_l2sq by @arazumov in
#1094
* Remove direct accessor call in `diskann-garnet` by @hildebrandmw in
#1098
* Refactor `DiskIndexSearcher::flat_search` to use batching by
@hildebrandmw in #1097
* [flat index] Flat Search Interface by @arkrishn94 in
#983
* migrating multi-hop tests from diskann-providers to diskann by
@JordanMaples in #928
* Significantly speed up bitmap computation by @magdalendobson in
#1099
* `compute_vecs_l2sq`: Replace scalar L2 Squared norm with
SIMD-optimized FastL2NormSquared by @arazumov in
#1107
* [minmax] Add grid scaling to recompress API by @arkrishn94 in
#1109

**Full Changelog**:
v0.52.0...v0.53.0

Development

Successfully merging this pull request may close these issues.

Explore compute_vec_l2sq speed-up

5 participants