Support for inline-beta filtered search with expressions #782
- Refactored recall utilities in diskann-benchmark
- Updated tokio utilities
- Added attribute and format parser improvements in label-filter
- Updated ground_truth utilities in diskann-tools
Pull request overview
This PR integrates label-filtered (“document”) insertion and inline beta filtered search into the DiskANN benchmark/tooling flow, enabling benchmarks that operate on { vector, attributes } documents and evaluate filtered queries.
Changes:
- Added `DocumentInsertStrategy` and supporting public types to insert/query `Document` objects (vector + attributes) through `DocumentProvider`.
- Extended inline beta filter search to handle predicate encoding failures and added a constructor for `InlineBetaStrategy`.
- Added a new benchmark input/backend (`document-index-build`) plus example config for running document + filter benchmarks.
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| test_data/disk_index_search/data.256.label.jsonl | Updates LFS pointer for label test data used in filter benchmarks. |
| diskann-tools/src/utils/ground_truth.rs | Adds array-aware label matching/expansion and extensive tracing diagnostics for filter ground-truth generation. |
| diskann-tools/Cargo.toml | Adds serde_json dependency (and adjusts manifest metadata). |
| diskann-providers/src/model/graph/provider/async_/inmem/full_precision.rs | Adds Vec<T> query support for full-precision in-mem provider (for inline beta usage). |
| diskann-label-filter/src/lib.rs | Exposes the new document_insert_strategy module under encoded_attribute_provider. |
| diskann-label-filter/src/inline_beta_search/inline_beta_filter.rs | Adds InlineBetaStrategy::new and introduces is_valid_filter fast-path logic. |
| diskann-label-filter/src/inline_beta_search/encoded_document_accessor.rs | Adjusts filter encoding to be optional and threads is_valid_filter into the query computer. |
| diskann-label-filter/src/encoded_attribute_provider/roaring_attribute_store.rs | Makes RoaringAttributeStore public for cross-crate use. |
| diskann-label-filter/src/encoded_attribute_provider/encoded_filter_expr.rs | Changes encoded filter representation to Option, allowing “invalid filter” fallback behavior. |
| diskann-label-filter/src/encoded_attribute_provider/document_provider.rs | Allows vector types used in documents to be ?Sized. |
| diskann-label-filter/src/encoded_attribute_provider/document_insert_strategy.rs | New strategy wrapper enabling insertion/search over Document values. |
| diskann-label-filter/src/encoded_attribute_provider/ast_label_id_mapper.rs | Simplifies lookup error messaging and signature for attribute→id mapping. |
| diskann-label-filter/src/document.rs | Makes Document generic over ?Sized vectors. |
| diskann-benchmark/src/utils/tokio.rs | Adds a reusable multi-thread Tokio runtime builder. |
| diskann-benchmark/src/utils/recall.rs | Re-exports knn recall helper for benchmark use. |
| diskann-benchmark/src/inputs/mod.rs | Registers a new document_index input module. |
| diskann-benchmark/src/inputs/document_index.rs | New benchmark input schema for document-index build + filtered search runs. |
| diskann-benchmark/src/backend/mod.rs | Registers new document_index backend benchmarks. |
| diskann-benchmark/src/backend/index/result.rs | Extends search result reporting with query count and wall-clock summary columns. |
| diskann-benchmark/src/backend/document_index/mod.rs | New backend module entrypoint for document index benchmarks. |
| diskann-benchmark/src/backend/document_index/benchmark.rs | New end-to-end benchmark: build via DocumentInsertStrategy + filtered search via InlineBetaStrategy. |
| diskann-benchmark/example/document-filter.json | Adds example job configuration for document filter benchmark runs. |
| Cargo.lock | Adds serde_json to the lockfile dependencies. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…DiskANN into sync-from-cdb-diskann
…the documentItem on which insertStrategy and setElement are implemented to references. This is due to the change in index.insert requiring copy trait on the element being set. Other changes include fixes to api breakage.
| pub struct FilteredQuery<V> { | ||
| query: V, | ||
| pub struct FilteredQuery<'a, V: ?Sized> { | ||
| query: &'a V, |
Okay, now that we've made the API change, there's one thing we can do to clean up how this works a little. Instead of
    pub struct FilteredQuery<V> {
        query: V,
        filter_expr: ASTExpr,
    }
    impl<V> FilteredQuery<V> {
        fn query<'a>(&'a self) -> V::Target
        where
            V: Reborrow<'a>,
        {
            self.query.reborrow()
        }
    }
And instead of requiring &V for the inner trait bounds, we use <V as Reborrow<'a>>::Target (or just V::Target when the associated lifetime is unambiguous).
This does a couple of things. First, it lets FilteredQuery hold an owned query if needed and gets rid of the repeated lifetime bound.
Second, it will compose slightly better with providers that use non-slice types (e.g. multi-vectors).
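A minimal, self-contained sketch of the suggestion above. The `Reborrow` trait here is a hypothetical stand-in written for this example (the trait actually used in DiskANN may be defined differently), and `AstExpr` is a placeholder for `ASTExpr`. It only illustrates how one `FilteredQuery<V>` can hold either an owned or a borrowed query while exposing the same borrowed view.

```rust
// Hypothetical stand-in for the `Reborrow` trait mentioned in the review.
trait Reborrow<'a> {
    type Target;
    fn reborrow(&'a self) -> Self::Target;
}

// An owned vector reborrows as a slice.
impl<'a> Reborrow<'a> for Vec<f32> {
    type Target = &'a [f32];
    fn reborrow(&'a self) -> &'a [f32] {
        self.as_slice()
    }
}

// A borrowed slice reborrows as a (possibly shorter-lived) slice.
impl<'a, 'b> Reborrow<'a> for &'b [f32] {
    type Target = &'a [f32];
    fn reborrow(&'a self) -> &'a [f32] {
        self
    }
}

// Placeholder for the real `ASTExpr` type.
#[derive(Clone, Debug, PartialEq)]
struct AstExpr(String);

struct FilteredQuery<V> {
    query: V,
    filter_expr: AstExpr,
}

impl<V> FilteredQuery<V> {
    fn new(query: V, filter_expr: AstExpr) -> Self {
        Self { query, filter_expr }
    }

    // No lifetime parameter on the struct: the borrow is tied to `&self`.
    fn query<'a>(&'a self) -> <V as Reborrow<'a>>::Target
    where
        V: Reborrow<'a>,
    {
        self.query.reborrow()
    }

    fn filter_expr(&self) -> &AstExpr {
        &self.filter_expr
    }
}

fn main() {
    // Owned query.
    let owned = FilteredQuery::new(vec![1.0_f32, 2.0], AstExpr("a == b".into()));
    assert_eq!(owned.query(), &[1.0_f32, 2.0][..]);

    // Borrowed query, same struct and same accessor.
    let data = [3.0_f32, 4.0];
    let borrowed = FilteredQuery::new(&data[..], AstExpr("c == d".into()));
    assert_eq!(borrowed.query(), &[3.0_f32, 4.0][..]);
    assert_eq!(borrowed.filter_expr(), &AstExpr("c == d".into()));
}
```

A provider with a non-slice query type would, under this sketch, only need its own `Reborrow` impl; `FilteredQuery` itself would not change.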
| let filtered_query = FilteredQuery::new(query_vec, ast_expr.clone()); | ||
|
|
||
| // Use a concrete IdDistance scratch buffer so that both the IDs and distances | ||
| // are captured. Afterwards, the valid IDs are forwarded into the framework buffer. |
Perhaps diskann-benchmark-core should be updated to capture distances as well. I think this can be done in a non-breaking way (not a blocker for this PR).
| let query_vec = self.queries.row(index); | ||
| let (_, ref ast_expr) = self.predicates[index]; | ||
| let strategy = InlineBetaStrategy::new(self.beta, common::FullPrecision); | ||
| let filtered_query = FilteredQuery::new(query_vec, ast_expr.clone()); |
One theme I've been observing throughout diskann-label-filter is the design kind of inherently forces patterns like cloning the ast_expr for the query.
I'm not reviewing the benchmark code in too much detail, but I strongly encourage looking for patterns like forced clones in loops as opportunities for making the underlying implementation better.
I looked through the benchmark code. This clone right here seems like the only one that would cause performance issues. Avoiding it would mean holding a reference to the ast_expr instead of owning it.
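A rough sketch of what holding a reference could look like. All types here are hypothetical placeholders (the real `FilteredQuery` and `ASTExpr` in diskann-label-filter may be shaped differently); the point is only that the per-query loop borrows each expression rather than cloning it.

```rust
// Placeholder for the real `ASTExpr`; deliberately not `Clone`, so the
// loop below cannot clone it by accident.
#[derive(Debug)]
struct AstExpr(String);

// Hypothetical borrowed variant: the query and the filter expression are
// both references, so constructing one allocates nothing.
struct FilteredQuery<'q, 'e> {
    query: &'q [f32],
    filter_expr: &'e AstExpr,
}

impl<'q, 'e> FilteredQuery<'q, 'e> {
    fn new(query: &'q [f32], filter_expr: &'e AstExpr) -> Self {
        Self { query, filter_expr }
    }
}

fn main() {
    let queries = vec![vec![0.1_f32, 0.2], vec![0.3, 0.4]];
    let predicates = vec![
        (0_usize, AstExpr("color == red".to_string())),
        (1_usize, AstExpr("size == large".to_string())),
    ];

    for (query, (_, ast_expr)) in queries.iter().zip(predicates.iter()) {
        // Borrow instead of `ast_expr.clone()`.
        let fq = FilteredQuery::new(query, ast_expr);
        // The query points at the very same expression object.
        assert!(std::ptr::eq(fq.filter_expr, ast_expr));
        assert_eq!(fq.query.len(), 2);
        assert!(!fq.filter_expr.0.is_empty());
    }
}
```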
…n in the FilteredQuerySearch instead of owning it
This PR has the following changes:
- Adds support for inline-beta search with filter expressions that combine equality comparisons using AND and OR.
- Adds a benchmark that evaluates performance and recall on a small dataset and also serves as an example of how to set up filtered search with expressions.
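To illustrate what a filter expression built from AND, OR, and equality comparisons means for a single document, here is a small sketch. The `Expr` enum and `eval` function are hypothetical and written only for this example; they are not the crate's `ASTExpr` or its encoded representation, and documents are assumed to carry flat string attributes.

```rust
use std::collections::HashMap;

// Hypothetical expression tree with the three supported node kinds.
enum Expr {
    Eq(String, String),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

// Returns true when the document's attributes satisfy the expression.
fn eval(expr: &Expr, attrs: &HashMap<String, String>) -> bool {
    match expr {
        Expr::Eq(field, value) => attrs.get(field) == Some(value),
        Expr::And(lhs, rhs) => eval(lhs, attrs) && eval(rhs, attrs),
        Expr::Or(lhs, rhs) => eval(lhs, attrs) || eval(rhs, attrs),
    }
}

fn eq(field: &str, value: &str) -> Expr {
    Expr::Eq(field.to_string(), value.to_string())
}

fn main() {
    let mut attrs = HashMap::new();
    attrs.insert("color".to_string(), "red".to_string());
    attrs.insert("size".to_string(), "large".to_string());

    // color == red AND (size == small OR size == large)
    let expr = Expr::And(
        Box::new(eq("color", "red")),
        Box::new(Expr::Or(
            Box::new(eq("size", "small")),
            Box::new(eq("size", "large")),
        )),
    );
    assert!(eval(&expr, &attrs));

    // color == blue does not match this document.
    assert!(!eval(&eq("color", "blue"), &attrs));
}
```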