RFC - Beta Filter For Disk Search#1101
Conversation
Adds an RFC proposing a `SearchPlan` enum to replace the disk search API's
`(vector_filter, is_flat_search)` parameter pair, and introducing beta-biased
graph search as a new capability. The hierarchical `SearchPlan { FlatScan,
Graph(GraphMode) }` shape makes invalid combinations unrepresentable and
provides a single extension point for future graph algorithms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Reframe Motivation around supporting BetaFilter on the disk path and designing an extension point for future filter algorithms. - Move the full design doc content under the Proposal section. - Set Benchmark Results, Future Work, References to N/A initially, then add one Future Work item: align disk and in-memory BetaFilter semantics (disk applies bias + post-filter; in-memory only biases). - Drop the Multihop worked example and its back-references. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds an RFC proposing a new SearchPlan/GraphMode API for disk search to support beta-biased filtering, replace (vector_filter, is_flat_search) with a single typed parameter, and create an extensible hook for future graph search variants.
Changes:
- Introduces a hierarchical
SearchPlan { FlatScan, Graph(GraphMode) }model for disk search configuration. - Defines
Predicate,GraphModevariants (includingBetaFilter), and migration examples from the current API. - Documents intended internal plumbing changes (strategy projection,
pq_distancesbeta application, and post-filtering behavior).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1101 +/- ##
==========================================
+ Coverage 89.46% 90.53% +1.06%
==========================================
Files 482 482
Lines 91082 91082
==========================================
+ Hits 81491 82461 +970
+ Misses 9591 8621 -970
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
suri-kumkaran
left a comment
There was a problem hiding this comment.
Thanks for working through this — I’m aligned with the design. I do feel we can trim some of the redundant or overly explanatory parts of the RFC and leave a few of the implementation details for the implementation PRs.
- Add lifetime parameter to Predicate/SearchPlan/GraphMode so callers can pass closures borrowing non-'static data, matching the existing VectorFilter<'a, Data> flexibility. - Make GraphMode::beta_filter fallible (returns Result<Self, BetaError>) so invalid beta from JSON/CLI surfaces as a validation error, not a process crash. - Migrate the benchmark JSON schemas in the same PR (no external consumers; no grace period needed). - Restructure for outside-in narrative: search() signature → projection function → DiskSearchStrategy struct change. - Trim implementation-detail sections (pq_distances code block, data-flow diagram, plumbing table, testing list) that belong in the implementation PR. - Rename "The five cases" to "Supported configurations". - Add a Future Work item: align disk vs. in-memory BetaFilter semantics (disk applies bias + post-filter; in-memory only biases). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@dyhyfu please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.
Contributor License AgreementContribution License AgreementThis Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
|
suri-kumkaran
left a comment
There was a problem hiding this comment.
This is getting close!
- Introduce FilterMode<'a> sum type at the strategy layer: replaces the (Option<&Predicate>, Option<f32>) tuple with one enum (None | Filter | BetaFilter), making (None, Some(β)) unrepresentable and extending better for future variants (Multihop, LabelBeta, ...). - Fix the internal/external ID mismatch in RerankAndFilter::post_process: predicate is keyed on ExternalId per DataProvider contract; flat_search is correct; RerankAndFilter and the new pq_distances both add to_external_id conversion. Today's identity mapping hides the bug. - Add Non-Goal: newtyping DiskProvider::ExternalId is deferred. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
suri-kumkaran
left a comment
There was a problem hiding this comment.
Aligned on the design. ID convention is a clean fix and the FilterMode<'a> sum type makes the invariants compiler-enforced.
On Predicate<'a> vs Predicate ('static-only): fine either way — lifetime preserves borrow flexibility, 'static-only is simpler. Author's call.
Approving once the inline comments are resolved.
| Current state: | ||
|
|
||
| - **Input predicate is keyed on `ExternalId`** by `DataProvider` contract. | ||
| - **`Index::flat_search`** is correct — calls `to_external_id` before invoking the predicate ([diskann/src/graph/index.rs](../diskann/src/graph/index.rs)). |
There was a problem hiding this comment.
There's no flat_search in diskann/src/graph/index.rs (grep confirms — only disk_provider.rs has one). I think this might be a conflation with diskann::flat::FlatIndex::knn_search in diskann/src/flat/index.rs (the module from RFC 00983-flat-search) — that's a separate, parallel abstraction that the disk path doesn't delegate to.
The disk's own flat_search (disk_provider.rs:912) is a standalone implementation, and it invokes the predicate with positional/internal IDs directly:
let mut iter = (0..provider.num_points as u32).filter(vector_filter);No to_external_id call anywhere in the disk flat_search path (confirmed: grep for to_external_id in disk_provider.rs returns only the trait definition and one unit test). The 0..num_points idiom looks generic but is the positional/internal space by construction — there's no abstraction between integer counting and positional storage. Identity translation masks this today; under any future non-identity provider, flat-scan and graph paths would diverge silently.
So all three sites (flat_search, pq_distances, RerankAndFilter::post_process) are currently incorrect — this RFC needs to fix all three.
Suggested edits:
- Remove the "Sites already correct" bullet (line 269).
- Add
flat_searchto "Sites updated in the implementation PR" (line 262):flat_search(the disk-local one indisk_provider.rs) — callto_external_idon each id in(0..num_points)before invoking the predicate.
There was a problem hiding this comment.
Both BetaFilter and MultihopSearch take QueryLabelProvider<DP::InternalId> — their predicates run in the internal-id space. The only to_internal_id / to_external_id translations in index.rs happen at the insert/delete API boundary, not on the search path.
We can keep Predicate keyed on internal-id here too.
There was a problem hiding this comment.
Checking latest code, flat_search has been moved to disk_provider.rs and is using internal_id as input param. The "ID convension" part will be removed and Predicate will be keyed on internal_id.
There was a problem hiding this comment.
The motivation for external-keyed predicates was broader than the flat_search audit — that was a symptom, not the core argument. External-keyed is still the right contract because:
- Matches the
DataProviderabstraction; doesn't leak the positional internal representation. - Matches in-memory
BetaFilter(keyed onDP::ExternalIdviaQueryLabelProvider) — closes one corner of the divergence the Future Work item flags. - Future-proof: non-identity translation (sparse IDs, deletion compaction,
u64→u32) stays disk-internal; callers don't break silently.
Practically: how do external clients construct a predicate over internal IDs? They only see external IDs — from search outputs, ingest, the DataProvider contract. Asking them to maintain a bitmap over a space they can't observe pushes implementation details out as caller responsibility, and silently couples every consumer to today's identity-translation invariant.
The fix is small and inlines to a no-op under identity translation. Worth keeping in the RFC.
There was a problem hiding this comment.
DiskProvider doesn't really expose an Internal/External contract today. Both types are pinned to u32, VectorIdType = u32 is a hard where bound on the impl (disk_provider.rs:95-103), I'd suggest dropping the ID-convention change from this RFC.
impl<Data> DataProvider for DiskProvider<Data>
where Data: GraphDataType<VectorIdType = u32>,
{
type InternalId = u32;
type ExternalId = u32;
fn to_internal_id(&self, _: &DefaultContext, gid: &u32) -> Result<u32, _> { Ok(*gid) }
fn to_external_id(&self, _: &DefaultContext, id: u32) -> Result<u32, _> { Ok(id) }
}For Practical, external clients implements to_internal_id / to_external_id on their provider, so they define the translation and own both sides. The question isn't whether they can observe internal id — it's which side of the boundary is the natural place to do the translation.
For the 1:N domain model (which is what our team has: external = doc_id, internal = vector_id, one doc-id → many vector-ids), "caller side" is a feasible option.
- DataProvider's ExternalId ↔ InternalId is 1:1 by trait signature; a 1:N domain doesn't fit through it.
- So we keep the doc_id ↔ {vector_id} map outside the library, for example in a filtered scenario:
- Filtered doc-ids → vector-id bitmap (outside DiskANN).
- DiskANN filter-search returns vector-ids.
- Vector-ids → doc-ids (outside DiskANN).
There was a problem hiding this comment.
Two separate things getting mixed up — the outer mapping (domain ↔ vector IDs) and the inner consistency (which ID space DiskANN uses internally).
Outer mapping (1:N): agreed, this lives outside the library. DataProvider::ExternalId ↔ InternalId is 1:1 by signature, so doc-id ↔ {vector-id} has to be maintained caller-side. The flow you describe (filtered doc-ids → vector-id bitmap → DiskANN search → vector-ids → doc-ids) is the only way it can work. No disagreement there.
Inner consistency (the RFC's change): by the time DiskANN receives the predicate, it's already in vector-id space (= external ID). The question is whether DiskANN invokes that predicate with ExternalId or InternalId. Today three sites — RerankAndFilter::post_process, pq_distances, flat_search — pass internal directly. This is a latent bug: it produces wrong results the moment DiskProvider ever uses non-identity translation (e.g., for deletion compaction or sparse IDs added in place to the existing provider — no fork required).
Re: "DiskProvider doesn't expose the distinction today" — true, and that's exactly why the bug is invisible. The trait exists precisely so future in-place changes don't break consumers. Fixing now is essentially free (translation inlines to a no-op under identity). Leaving it means the next person extending DiskProvider has to remember to also audit every predicate site, with no compile-time help.
Want to keep the ID-convention section in this RFC — it pre-existing-bug we surface and fix here. Fix lands in the impl PR.
- Switch the documented ID convention back to internal IDs, matching the implementation: all three invocation sites (flat_search, pq_distances, RerankAndFilter) pass internal IDs uniformly; no to_external_id conversion at the predicate boundary. Identity mapping (InternalId == ExternalId == u32) makes this transparent to callers either way. - Remove the §ID convention section, the Non-Goal about newtyping ExternalId, and the related note in the Future Work item -- all obsolete after the contract flip. - Trim the "Project at the boundary" bullet in §Key design decisions: it dove into FilterMode internals before the reader had context, and the rationale already lives in §search_strategy projection. Replace with a shorter bullet phrased as "variant matching is delegated to search_strategy()" -- avoids the "accessor" terminology clash with the repo's existing Accessor trait. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds beta-biased filtering on the disk search path by introducing a
SearchPlanenum that replaces the existing(vector_filter, is_flat_search)parameter pair onsearcher.search(). The new enum is hierarchical (SearchPlan { FlatScan, Graph(GraphMode) }), makes invalid combinations unrepresentable, and gives future filter algorithms a single extension point without further growing the publicsearch()signature.Motivation
BetaFilterstrategy; the disk side has no equivalent.beta. Bolting it on as a thirdOption<f32>parameter creates meaningless combinations (is_flat_search=true+beta=Some(_)) the type system can't reject.MultihopSearchexists indiskannbut the disk API has no way to select it).What's in the RFC
SearchPlan { FlatScan { filter: Option<Predicate> }, Graph(GraphMode) }+GraphMode { Unfiltered, PostFilter(p), BetaFilter { predicate, beta } }.Predicate = Box<dyn Fn(u32) -> bool + Send + Sync>— closure-based.GraphModematch site insearch_strategy()that projects(predicate, beta); downstreamDiskAccessorandRerankAndFilterread those fields directly — adding a newGraphModevariant touches one place.Caller migration
(vector_filter, is_flat_search)plan(None, false)SearchPlan::graph()(None, true)SearchPlan::flat()(Some(p), false)SearchPlan::graph_with(GraphMode::post_filter(p))(Some(p), true)SearchPlan::flat_filtered(p)SearchPlan::graph_with(GraphMode::beta_filter(p, β))Known follow-ups (deferred)
GraphMode::BetaFilterapplies bias + hard post-filter; in-memoryBetaFilteronly biases. Future work to align.See
rfcs/01101-disk-beta-filter.mdfor the full design.