Significantly speed up bitmap computation #1099

Merged: magdalendobson merged 37 commits into main from users/magdalen/add_filter_utils on May 26, 2026

Conversation

@magdalendobson
Contributor

@magdalendobson magdalendobson commented May 21, 2026

Introduction

Bitmap computation in diskann-label-filter is unacceptably slow. Currently, with a 1 million size slice of yfcc and a 10k query set, computing the query bitmaps takes 43.10 seconds. With just a 100K slice of the caselaw dataset and a 10k query set, computing the bitmaps takes 6.03 seconds. This was making it hard to run experiments on filtered search algorithms for the full sizes of these datasets.

Speeding up the bitmap computation is conceptually simple. Instead of iterating over every base label for every query filter, we compute an inverted index for each label type, which maps the label value to the documents with the same value. Then, at query time, we query the inverted index for the relevant label values, and compose the resulting sets as necessary to find the documents satisfying the entire filter expression. At a high level, that is what this PR does.

Lower level details

The overall workflow of the main function, compute_query_bitmaps, is as follows:

  1. Check whether the query expression contains any ASTExpr::Not clauses. If so, default to the existing slow path. This is because we don't store the document universe for each label, and thus can't compute the complement of an arbitrary bitset.
  2. Otherwise, move to the fast path.
  3. Flatten the base labels so that nested values map to a single string key (e.g. the JSON object {"car": {"color": "red", "make": "Mazda"}} would be transformed to {"car.color": "red", "car.make": "Mazda"}), and re-organize as a hash map of labels to values.
  4. For each label, compute either an inverted index (strings and bools) or a B-tree (ints and floats) depending on its type.
  5. At query time, use either the inverted index or the B-tree to produce a bitset for each CompareOp in the clause, and then compose them with AND and OR as needed to produce the final bitset.
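
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Expr`, `Index`, `contains_not`, `build_index`, and `eval_fast` are hypothetical names, only equality comparisons are modeled, and a `BTreeSet<usize>` stands in for the bitset type.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical, simplified filter AST (a stand-in for the real `ASTExpr`).
enum Expr {
    Eq(String, String), // field == value
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
    Not(Box<Expr>),
}

/// (field, value) -> ids of the documents carrying that label.
type Index = HashMap<(String, String), BTreeSet<usize>>;

fn eq(field: &str, value: &str) -> Expr {
    Expr::Eq(field.to_string(), value.to_string())
}

/// Step 1: detect NOT anywhere in the expression.
fn contains_not(e: &Expr) -> bool {
    match e {
        Expr::Eq(..) => false,
        Expr::Not(_) => true,
        Expr::And(a, b) | Expr::Or(a, b) => contains_not(a) || contains_not(b),
    }
}

/// Steps 3-4 (simplified): labels are assumed to be flattened already.
fn build_index(docs: &[Vec<(&str, &str)>]) -> Index {
    let mut index = Index::new();
    for (id, labels) in docs.iter().enumerate() {
        for (field, value) in labels {
            index
                .entry((field.to_string(), value.to_string()))
                .or_default()
                .insert(id);
        }
    }
    index
}

/// Step 5: compose per-comparison sets with AND / OR.
/// Returns `None` when a NOT is present, signalling the slow path.
fn eval_fast(e: &Expr, index: &Index) -> Option<BTreeSet<usize>> {
    match e {
        Expr::Eq(f, v) => Some(index.get(&(f.clone(), v.clone())).cloned().unwrap_or_default()),
        Expr::And(a, b) => {
            let (x, y) = (eval_fast(a, index)?, eval_fast(b, index)?);
            Some(x.intersection(&y).copied().collect())
        }
        Expr::Or(a, b) => {
            let (x, y) = (eval_fast(a, index)?, eval_fast(b, index)?);
            Some(x.union(&y).copied().collect())
        }
        Expr::Not(_) => None,
    }
}

fn main() {
    let index = build_index(&[
        vec![("color", "red"), ("make", "Mazda")],
        vec![("color", "red"), ("make", "Ford")],
        vec![("color", "blue"), ("make", "Mazda")],
    ]);
    let query = Expr::Or(
        Box::new(Expr::And(Box::new(eq("color", "red")), Box::new(eq("make", "Mazda")))),
        Box::new(eq("color", "blue")),
    );
    println!("{:?}", eval_fast(&query, &index));
}
```

The index is built once over the base labels and reused for every query, which is where the savings over the per-query scan come from.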

We also add a utility to diskann-label-filter for computing the specificity of a set of query filters with respect to a base set, outputting some statistics on it, and optionally outputting the individual specificity values to a file for further processing.
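
As a rough sketch of what such a utility might compute, assuming (this is not stated above) that the specificity of a filter is the fraction of base points it matches. The names `specificities`, `Stats`, and `summarize` are hypothetical and not the utility's actual interface.

```rust
/// Fraction of base points matched by each query filter.
/// Assumption: specificity = matching points / total base points.
fn specificities(match_counts: &[usize], num_base: usize) -> Vec<f64> {
    match_counts
        .iter()
        .map(|&c| c as f64 / num_base as f64)
        .collect()
}

/// Summary statistics over a set of specificity values.
struct Stats {
    min: f64,
    max: f64,
    mean: f64,
    median: f64,
}

/// Returns `None` for an empty input.
fn summarize(values: &[f64]) -> Option<Stats> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    Some(Stats {
        min: sorted[0],
        max: sorted[n - 1],
        mean: sorted.iter().sum::<f64>() / n as f64,
        median,
    })
}

fn main() {
    // Per-query match counts, e.g. the popcount of each query bitmap.
    let spec = specificities(&[10, 250, 500, 1000], 1000);
    let stats = summarize(&spec).expect("non-empty input");
    println!(
        "min={} max={} mean={} median={}",
        stats.min, stats.max, stats.mean, stats.median
    );
}
```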

Inverted Index

The inverted index maps each label value, converted to a string, to a bitset containing the doc ids corresponding to that value.
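
A minimal sketch of this structure for a single field, using only the standard library. `Bits` is a hypothetical hand-rolled bitset standing in for whatever bitset type the implementation uses, and `build_inverted_index` is an illustrative name.

```rust
use std::collections::HashMap;

/// Hypothetical minimal growable bitset backed by 64-bit words.
#[derive(Default, Clone, Debug, PartialEq)]
struct Bits {
    words: Vec<u64>,
}

impl Bits {
    fn insert(&mut self, id: usize) {
        let w = id / 64;
        if self.words.len() <= w {
            self.words.resize(w + 1, 0);
        }
        self.words[w] |= 1u64 << (id % 64);
    }

    fn contains(&self, id: usize) -> bool {
        self.words
            .get(id / 64)
            .map_or(false, |&w| (w >> (id % 64)) & 1 == 1)
    }

    /// All set ids in ascending order.
    fn ids(&self) -> Vec<usize> {
        (0..self.words.len() * 64)
            .filter(|&i| self.contains(i))
            .collect()
    }
}

/// `values[i]` is the (already stringified) label value of doc `i`.
/// Bools are assumed to have been rendered as "true" / "false".
fn build_inverted_index(values: &[&str]) -> HashMap<String, Bits> {
    let mut index: HashMap<String, Bits> = HashMap::new();
    for (id, value) in values.iter().enumerate() {
        index.entry(value.to_string()).or_default().insert(id);
    }
    index
}

fn main() {
    let index = build_inverted_index(&["red", "blue", "red", "true"]);
    println!("docs with value \"red\": {:?}", index["red"].ids());
}
```

An equality comparison then becomes a single hash lookup that returns a ready-made bitset.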

B-Tree

For simplicity, the B-tree implementation converts integers to floats before inserting so that we don't have to deal with two different types of B-tree. The performance of this piece of code isn't sensitive enough that it makes sense to differentiate, but this could be changed in the future.
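
One caveat with this conversion, shown as a small standalone sketch: an `f64` has a 53-bit significand, so distinct `i64` values above 2^53 can collapse onto the same float key, and `u64` values above `i64::MAX` cannot be represented as `i64` at all. The helper names `round_trips` and `fits_i64` are illustrative only.

```rust
/// True when `x` survives a round trip through f64 unchanged.
fn round_trips(x: i64) -> bool {
    (x as f64) as i64 == x
}

/// True when a u64 value can be stored losslessly as an i64.
fn fits_i64(x: u64) -> bool {
    i64::try_from(x).is_ok()
}

fn main() {
    let exact: i64 = 1 << 53; // 9_007_199_254_740_992
    let inexact: i64 = exact + 1; // not representable as f64

    // Two different integers map to the same f64 key.
    assert_ne!(exact, inexact);
    assert_eq!(exact as f64, inexact as f64);

    println!("2^53 round-trips:     {}", round_trips(exact));
    println!("2^53 + 1 round-trips: {}", round_trips(inexact));
    println!("u64::MAX fits in i64: {}", fits_i64(u64::MAX));
}
```

For label values well below 2^53 the conversion is exact, so whether this matters depends on the data.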

The B-tree maps each numeric value to a vector of doc ids rather than a bitset, because concatenating vectors is much cheaper than extending bitsets, and a range query may concatenate many of them.
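
A minimal sketch of such a structure using the standard library's `BTreeMap`. Since `f64` does not implement `Ord`, this sketch wraps it in a hypothetical `Key` type ordered by `f64::total_cmp`; `Num`, `build_tree`, and `range_ids` are likewise illustrative names, and the real implementation may handle ordering differently.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;

/// Hypothetical totally ordered wrapper so f64 can be a BTreeMap key.
#[derive(Clone, Copy, Debug)]
struct Key(f64);

impl PartialEq for Key {
    fn eq(&self, other: &Self) -> bool {
        self.0.total_cmp(&other.0) == Ordering::Equal
    }
}
impl Eq for Key {}
impl PartialOrd for Key {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Key {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.total_cmp(&other.0)
    }
}

/// A numeric label value; integers are widened to f64 on insertion.
enum Num {
    Int(i64),
    Float(f64),
}

/// `values[i]` is the numeric label of doc `i`.
fn build_tree(values: &[Num]) -> BTreeMap<Key, Vec<usize>> {
    let mut tree: BTreeMap<Key, Vec<usize>> = BTreeMap::new();
    for (id, v) in values.iter().enumerate() {
        let key = match v {
            Num::Int(i) => *i as f64,
            Num::Float(f) => *f,
        };
        tree.entry(Key(key)).or_default().push(id);
    }
    tree
}

/// Ids of all docs whose value lies in the closed interval [lo, hi].
fn range_ids(tree: &BTreeMap<Key, Vec<usize>>, lo: f64, hi: f64) -> Vec<usize> {
    // `BTreeMap::range` panics on an inverted range, so guard first.
    if Key(lo) > Key(hi) {
        return Vec::new();
    }
    let mut out = Vec::new();
    for (_, ids) in tree.range(Key(lo)..=Key(hi)) {
        out.extend_from_slice(ids); // cheap concatenation per matching key
    }
    out
}

fn main() {
    let tree = build_tree(&[
        Num::Int(3),
        Num::Float(1.5),
        Num::Int(10),
        Num::Float(3.0),
        Num::Int(7),
    ]);
    println!("{:?}", range_ids(&tree, 2.0, 7.0));
}
```

The concatenated id vector would be converted to a bitset once at the end of the range query, rather than growing a bitset key by key.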

Timings

Returning to the earlier timings: for the 1 million size slice of yfcc and a 10k query set, computing the query bitmaps now takes 0.6 seconds, down from 43.10 seconds (roughly a 72x speedup). For the 100K slice of the caselaw dataset and a 10k query set, computing the bitmaps now takes 1.728 seconds, down from 6.03 seconds (roughly a 3.5x speedup).

@magdalendobson magdalendobson marked this pull request as ready for review May 21, 2026 22:13
@magdalendobson magdalendobson requested review from a team and Copilot May 21, 2026 22:13
Contributor

Copilot AI left a comment

Pull request overview

This PR targets a major performance improvement in diskann-label-filter by introducing a fast-path for computing per-query bitmaps using precomputed per-field accelerators (inverted-index style maps for equality and a numeric BTree for range queries), while falling back to the existing evaluator when NOT is present. It also adds an example utility for computing “specificity” statistics over query filters.

Changes:

  • Add utils::compute_bitmap::compute_query_bitmaps implementing an accelerated bitmap computation path (with a NOT-guarded slow fallback).
  • Export the new bitmap API from diskann-label-filter and add an example (compute_specificities) to compute stats/output.
  • Minor doc comment updates in flattening utilities and dependency updates for the new module.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| diskann-label-filter/src/utils/flatten_utils.rs | Updates doc examples for configurable flattening (one example is currently inconsistent with behavior). |
| diskann-label-filter/src/utils/compute_bitmap.rs | New accelerated bitmap computation implementation plus unit tests. |
| diskann-label-filter/src/lib.rs | Exposes the new module and re-exports compute_query_bitmaps. |
| diskann-label-filter/examples/compute_specificities.rs | New example for computing/saving specificity stats from computed bitmaps. |
| diskann-label-filter/Cargo.toml | Adds dependencies needed by the new bitmap computation module. |
| Cargo.lock | Locks new transitive deps (bit-set, rayon) for this crate. |

Comment thread diskann-label-filter/src/utils/flatten_utils.rs Outdated
Comment thread diskann-label-filter/src/utils/compute_bitmap.rs Outdated
Comment thread diskann-label-filter/src/utils/compute_bitmap.rs Outdated
Comment thread diskann-label-filter/src/utils/compute_bitmap.rs Outdated
Comment thread diskann-label-filter/src/utils/compute_bitmap.rs Outdated
Comment thread diskann-tools/src/bin/compute_specificities.rs
Comment thread diskann-label-filter/examples/compute_specificities.rs Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI and others added 2 commits May 22, 2026 14:11
Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/f16c59eb-89cf-4480-b6fe-afe4be5e7c8e

Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/727d3d88-3d0b-47bf-a023-9170d72fb87a

Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Copilot AI and others added 2 commits May 22, 2026 14:17
Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/47e2bd0f-cb8b-495f-8274-02a88596b0e6

Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/1bc31d27-7a19-4c4c-9ecc-c10260b944a3

Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
@magdalendobson
Contributor Author

Performance improvement of bitmap calculation is really needed to unblock serious work on the filtered algorithms, so I have a vested interest in seeing something like this merged ASAP.

That said, there is a lot of work needed to fully productionize it. As examples:

  1. The conversion of i64 to f64 is an imprecise conversion. Similarly, u64 would have to be treated separately from i64, which it currently isn't in AttributeValue.

With the move to diskann-tools, is this still something you would like to see handled?

  1. I'm not convinced the handling of "." as a separator is robust: what happens if a user string contains a "."? For example, currently {"a.b": 1} and {"a": {"b": 1}} get lowered the same, creating ambiguity.

Doesn't our existing code already explicitly treat those two expressions as the same? E.g.

/// Helper to get a nested value from a label using dot notation (e.g., "specs.cpu")
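
To make the ambiguity under discussion concrete, here is a small sketch of dot-separated flattening. `Value`, `flatten`, and `flat` are hypothetical stand-ins and not the crate's flattening utilities; the point is only that a literal `"a.b"` key and a nested `a` / `b` pair produce the same flattened key.

```rust
use std::collections::BTreeMap;

/// Hypothetical minimal JSON-like value.
enum Value {
    Num(i64),
    Str(String),
    Obj(Vec<(String, Value)>),
}

/// Recursively flatten nested objects into "parent.child" keys.
fn flatten(prefix: &str, v: &Value, out: &mut BTreeMap<String, String>) {
    match v {
        Value::Obj(fields) => {
            for (k, child) in fields {
                let key = if prefix.is_empty() {
                    k.clone()
                } else {
                    format!("{prefix}.{k}")
                };
                flatten(&key, child, out);
            }
        }
        Value::Num(n) => {
            out.insert(prefix.to_string(), n.to_string());
        }
        Value::Str(s) => {
            out.insert(prefix.to_string(), s.clone());
        }
    }
}

fn flat(v: &Value) -> BTreeMap<String, String> {
    let mut out = BTreeMap::new();
    flatten("", v, &mut out);
    out
}

fn main() {
    // {"a": {"b": 1}}
    let nested = Value::Obj(vec![(
        "a".to_string(),
        Value::Obj(vec![("b".to_string(), Value::Num(1))]),
    )]);
    // {"a.b": 1}
    let dotted = Value::Obj(vec![("a.b".to_string(), Value::Num(1))]);

    // Both lower to the single entry "a.b" -> "1".
    assert_eq!(flat(&nested), flat(&dotted));
    println!("{:?}", flat(&nested));
}
```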

  1. Probably want to use RoaringBitSet instead of BitSet.

After the move to diskann-tools, since the existing drivers use usize for vector ids, I don't think it makes sense to move to RoaringBitSet.

  1. This adds anyhow as a low-level library error type, which is not a great fit.

Resolved by the move to diskann-tools.

  1. This also unconditionally adds rayon as a dependency of diskann-label-filter and doesn't provide a caller with a clear ability to opt-out.

Resolved by the move to diskann-tools.

  1. Copy-paste in eval_query_using_accelerators that could be factored out.

Resolved in latest edits.

  1. Many of the helper structs are made public that don't necessarily need to be.
    This is probably fine for the datasets we are testing on where everything is "nice", but I have concerns about the overall correctness in the presence of corner cases.

In latest edits I made the helper functions and structs private.

  1. Minor, but the PR description says R-tree when the implementation uses a B-tree.

Resolved.

To unblock algorithmic work, though, what if we do the following:

  1. Put this behind an "experimental" feature flag and in an "experimental" module to clearly indicate that things can go wrong.
  2. Gate the new dependencies on this feature to avoid unconditionally adding dependencies (in particular, anyhow and rayon).
  3. Enable this feature in benchmarks to unblock filtered work.

@hildebrandmw
Contributor

Thanks Magdalen, most of this gets resolved by moving it to diskann-tools, which is a better fit.

Comment thread diskann-tools/src/bin/compute_specificities.rs Outdated
Comment thread diskann-tools/src/utils/compute_bitmap.rs
Comment thread diskann-tools/src/utils/compute_bitmap.rs Outdated
Comment thread diskann-tools/src/utils/compute_bitmap.rs Outdated
Comment thread diskann-tools/src/utils/compute_bitmap.rs
Contributor

@harsha-simhadri harsha-simhadri left a comment

posted some questions inline. thanks

Comment thread diskann-label-filter/src/utils/flatten_utils.rs
Comment thread diskann-tools/src/utils/ground_truth.rs
@magdalendobson
Contributor Author

posted some questions inline. thanks

I believe these are all now resolved, would you mind approving if you agree?

Comment thread diskann-tools/src/utils/compute_bitmap.rs Outdated
@magdalendobson magdalendobson merged commit e2dc9a0 into main May 26, 2026
24 checks passed
@magdalendobson magdalendobson deleted the users/magdalen/add_filter_utils branch May 26, 2026 18:07
arkrishn94 added a commit that referenced this pull request May 28, 2026
# DiskANN v0.53.0 Release Notes

## Breaking Changes

An AI-generated, human-reviewed list of changes is summarized below.

### Paged search overhauled — channel-based API
([#1078](#1078))

`PagedSearchState` and its `'static`-bound pause/resume model have been
replaced with an async, channel-based interface. The recommended way to
drive paged search is now via a `tokio::sync::mpsc` channel, with the
searcher embedded in an otherwise-`'static` future. See the [rendered
RFC](https://github.com/microsoft/DiskANN/blob/main/rfcs/01078-paged-search.md)
for the new shape. Callers wired against `PagedSearchState` must migrate
to the channel API.

Users of paged search via `wrapped_async::DiskANNIndex` that know their
inner futures will never suspend can use the new
`wrapped_async::DiskANNIndex::paged_search_no_await`; this will
efficiently run paged searches with minimal synchronization overhead.

### `DiskANNIndex::flat_search` removed
([#1076](#1076))

`DiskANNIndex::flat_search` and the `IdIterator` trait have been removed
from the `diskann` crate. Equivalent functionality lives on the new
inherent method `DiskIndexSearcher::flat_search` in `diskann-disk`. This
unblocks the experimental directions in #1067 and #983.

```rust
// Before
diskann_index.flat_search(query, ...)?;

// After
disk_index_searcher.flat_search(query, ...).await?;
```

### `DiskIndexSearcher::flat_search` now batched
([#1097](#1097))

The new `DiskIndexSearcher::flat_search` uses the bulk `pq_distances`
path instead of one-vector-at-a-time `Accessor::build_query_computer` +
`evaluate_similarity`. Downstream behavior is equivalent but tighter
resource bounds apply.

### `centroid` removed from PQ interfaces
([#1010](#1010))

The dataset-centroid argument has been removed from `FixedChunkPQTable`
construction, `populate`, and most other PQ APIs. The shift only ever
worked for L2 distance and was silently ignored for inner-product /
cosine, so passing it was a footgun. When an L2 shift is required, fold
it into the PQ pivots instead (the library now does this internally).

```rust
// Before
let table = FixedChunkPQTable::new(.., centroid, ..);

// After — drop the centroid argument
let table = FixedChunkPQTable::new(.., ..);
```
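
To illustrate why an L2 shift can be folded into the pivots, here is a small numeric sketch. It assumes the shift means subtracting a centroid `c` from each vector before comparing against a pivot `p`; under that assumption the squared distance between `x - c` and `p` equals the squared distance between `x` and `p + c`. The helpers `l2_sq`, `sub`, and `add` are illustrative and not part of the PQ API.

```rust
/// Squared L2 distance between two equal-length vectors.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Element-wise difference a - b.
fn sub(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x - y).collect()
}

/// Element-wise sum a + b.
fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

fn main() {
    let x = [1.0_f32, 2.0, 3.0]; // data vector
    let c = [0.5_f32, 0.5, 0.5]; // centroid (the shift)
    let p = [0.25_f32, -1.0, 2.0]; // pivot trained on shifted data

    // Shift the data, keep the pivot.
    let shifted_data = l2_sq(&sub(&x, &c), &p);
    // Keep the data, fold the shift into the pivot.
    let shifted_pivot = l2_sq(&x, &add(&p, &c));

    assert!((shifted_data - shifted_pivot).abs() < 1e-6);
    println!("{shifted_data} == {shifted_pivot}");
}
```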

### Flat search interface
([#983](#983))

A new `flat` module in `diskann` adds a provider-agnostic brute-force
search surface, mirroring the shape of graph search. Backends implement
a single trait, `DistancesUnordered<C>` (in `flat/strategy.rs`), which
fuses iteration and distance computation, allowing any backend
(in-memory, quantized, disk, remote) to plug into a shared algorithm.
See the [rendered
RFC](https://github.com/microsoft/DiskANN/blob/main/rfcs/00983-flat-search.md).
This is additive but is the new canonical surface — direct ad-hoc
flat-search call sites should migrate.

### `bf_tree` extracted into `diskann-bftree` crate
([#1020](#1020))

The bf_tree provider has been moved out of `diskann-providers`
(previously at
`diskann-providers/src/model/graph/provider/async_/bf_tree/`) into a new
standalone `diskann-bftree` crate. Along with the move:

- Switched from PQ to spherical quantization.
- Dropped dependencies on `DeletionCheck`, `AsDeletionCheck`, and
`RemoveDeletedIdsAndCopy`.
- Simplified generics.

Consumers must update their `Cargo.toml` to depend on `diskann-bftree`
and update import paths.

### `direct_distance_impl` and `inner_product_raw` re-exposed
([#1081](#1081))

`direct_distance_impl` (free function) and
`FixedChunkPQTable::inner_product_raw` are `pub` again after being
privatized in #1044. Restored to unblock a downstream user. Not breaking
in the typical direction — this restores previously available API
surface.

### MinMax `recompress` takes a grid-scale parameter
([#1109](#1109))

The MinMax `recompress` API now accepts a grid-scale parameter. 

## New Features

- SIMD-optimized L2-squared norm
([#1107](#1107))
- Significantly faster bitmap computation
([#1099](#1099))
  - Large speedup on the bitmap construction path used by filtered search.
- LLVM IR bloat regression check in CI
([#1083](#1083))
  - CI now flags regressions in generated LLVM IR size, helping catch
    unintended monomorphization blow-ups.
- Recall computation fix for under-k groundtruth
([#1069](#1069))

## Merged PRs

* Revise README for DiskANN3 by @harsha-simhadri in
#1046
* [CI] Try to fix publishing step by @hildebrandmw in
#1057
* [benchmark] Remove `DispatchRule` by @hildebrandmw in
#1064
* [benchmark] Automatic Input Registration by @hildebrandmw in
#1066
* Remove centroid from most PQ interfaces by @hildebrandmw in
#1010
* [diskann/disk] Remove `flat_search` from `DiskANNIndex` by
@hildebrandmw in #1076
* macos build and miri check to nightly by @harsha-simhadri in
#1058
* [API] Make some methods public again by @hildebrandmw in
#1081
* [benchmark] Simply `Inputs` more by @hildebrandmw in
#1077
* Turn on stack protection for the diskann-garnet NuGet build by
@jackmoffitt in #1082
* Fix options for diskann-garnet nuget pipeline by @jackmoffitt in
#1091
* [CI] add LLVM IR bloat regression check by @arazumov in
#1083
* Bump openssl from 0.10.79 to 0.10.80 by @dependabot[bot] in
#1093
* [Disk CI benchmarks] Use 1ES.Pool=diskann-github by @arazumov in
#869
* Fix recall computation for fewer than k groundtruth results by
@magdalendobson in #1069
* bf_tree migration away from diskann-providers by @JordanMaples in
#1020
* [RFC/diskann] Overhaul paged search by @hildebrandmw in
#1078
* Remove unsafe code from compute_vec_l2sq by @arazumov in
#1094
* Remove direct accessor call in `diskann-garnet` by @hildebrandmw in
#1098
* Refactor `DiskIndexSearcher::flat_search` to use batching by
@hildebrandmw in #1097
* [flat index] Flat Search Interface by @arkrishn94 in
#983
* migrating multi-hop tests from diskann-providers to diskann by
@JordanMaples in #928
* Significantly speed up bitmap computation by @magdalendobson in
#1099
* `compute_vecs_l2sq`: Replace scalar L2 Squared norm with
SIMD-optimized FastL2NormSquared by @arazumov in
#1107
* [minmax] Add grid scaling to recompress API by @arkrishn94 in
#1109

**Full Changelog**:
v0.52.0...v0.53.0
magdalendobson added a commit that referenced this pull request May 29, 2026
# Introduction

Bitmap computation in diskann-label-filter is unacceptably slow.
Currently, with a 1 million size slice of yfcc and a 10k query set,
computing the query bitmaps takes 43.10 seconds. With just a 100K slice
of the caselaw dataset and a 10k query set, computing the bitmaps takes
6.03 seconds. This was making it hard to run experiments on filtered
search algorithms for the full sizes of these datasets.

Speeding up the bitmap computation is conceptually simple. Instead of
iterating over every base label for every query filter, we compute an
inverted index for each label type, which maps the label value to the
documents with the same value. Then, at query time, we query the
inverted index for the relevant label values, and compose the resulting
sets as necessary to find the documents satisfying the entire filter
expression. At a high level, that is what this PR does.

# Lower level details

The overall workflow of the main function, `compute_query_bitmaps`, is
as follows:
1. Check whether the query expression contains any `ASTExpr::Not`
clauses. If so, default to the existing slow path. This is because we
don't store the document universe for each label, and thus can't compute
the complement of an arbitrary bitset.
2. Otherwise, move to the fast path.
3. Flatten the base labels so that nested values map to a single string
key (e.g. the JSON object {"car": {"color": "red", "make": "Mazda"}}
would be transformed to {"car.color": "red", "car.make": "Mazda"}), and
re-organize as a hash map of labels to values.
4. For each label, compute either an inverted index (strings and bools)
or a B-tree (ints and floats) depending on its type.
5. At query time, use either the inverted index or the B-tree to produce
a bitset for each `CompareOp` in the clause, and then compose them with
AND and OR as needed to produce the final bitset.

We also add a utility to `diskann-label-filter` for computing the
specificity of a set of query filters with respect to a base set,
outputting some statistics on it, and optionally outputting the
individual specificity values to a file for further processing.

## Inverted Index
The inverted index maps each label value, converted to a string, to a
bitset containing the doc ids corresponding to that value.

## B-Tree
For simplicity, the B-tree implementation converts integers to floats
before inserting so that we don't have to deal with two different types
of B-tree. The performance of this piece of code isn't sensitive enough
that it makes sense to differentiate, but this could be changed in the
future.

The B-tree maps each numeric value to a vector of doc ids rather than a
bitset, because concatenating vectors is much cheaper than extending
bitsets, and a range query may concatenate many of them.

# Timings

Returning to the earlier discussion of timings, for the 1 million size
slice of yfcc and a 10k query set, computing the query bitmaps now takes
0.6 seconds. For the 100K slice of the caselaw dataset and a 10k query
set, computing the bitmaps now takes 1.728 seconds.

---------

Co-authored-by: Magdalen Manohar <magdalen@magdalen.localdomain>
Co-authored-by: Magdalen Manohar <mmanohar@microsoft.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

7 participants