Maintenance of diskann-providers #899
Description
The diskann-providers crate has become a dumping ground for architectural cruft. Containing it in one place is useful, but it blurs what actually belongs here. Here's my opinionated take on how to tackle this.
I think four things need to happen before we can cleanly extract the inmem providers into their own crate (and similarly split out bf-tree), ordered from easy/mechanical to architecturally involved:
- Cruft Removal: Consolidate and clean up.
- Migrate unit/integration tests in `diskann_async.rs` to `diskann`.
- Move reasonable portions of the PQ code to `diskann-quantization`.
- Finalize our comprehensive saving and loading story.
- Make a clean `diskann-inmem` crate and others.
Importantly, I don't think much outside the following justifies creating an auxiliary crate — at least until several rounds of cleansing and extraction have been performed. (I reserve the right to be wrong.)
- `MinMaxRepr`: Maybe we should group this with the implementations in `diskann` and feature `MinMax` more prominently?
- Saving/Loading overhaul. This is a great candidate for a new crate.
Cruft Removal
There's a bunch of utilities that have been either superseded or are straight up dead code (though not caught due to public exports). These should be audited and either deleted, or upgraded to their new counterparts. A few quick examples:
- `utils/normalizing_util.rs`: Is this still used?
- `utils/kmeans.rs`: KMeans has been replaced for PQ-related purposes by the faster implementation in `diskann-quantization`. The last user of this flow is the disk-index partitioning. It's possible that the BLAS-based implementation in `diskann-providers` is faster than the approach taken in `diskann-quantization` for full-dimensional data. To that end, we should either:
  - Audit and replace the use of KMeans in disk partitioning to use `diskann-quantization`, and get rid of the `diskann-providers` implementation.
  - Upgrade the implementation in `diskann-quantization` to be on par for full-dimensional data.
  - Move the implementation to `diskann-disk`, where it's actually used.
- `utils/math_util.rs`: See the `kmeans` discussion above. Also, vector generation with norm can be replaced by the implementations in `diskann-utils`.
- Rethink the file-centric flow? Many of the implementations in `diskann-providers/utils` derive their staying power from working in place on raw files. Is this still needed, and if so, is there a cleaner way of decoupling the implementation from files?
- `rayon_util.rs`: The purpose of `rayon_util` is to make the rayon thread pool explicit and to allow inspection of the number of available threads. One side effect is that traits are designed to take either an explicit thread pool or an integer number of threads. Because this is done with generics, it encourages monomorphization high in the stack. I vote we either always take thread pools, to make this explicit to callers, or use an enum to cut down on generics.
- Synthetic/structured data: Audit the users and move synthetic data generation to more appropriate locations (`diskann-tools` already has its own synthetic data creation?). Structured data like the grid points already has a replacement in `diskann`, so we can remove the implementations in `diskann-providers`.
- `model/scratch/pq_scratch.rs`: I'm pretty sure the only ultimate user here is `diskann-disk`, and that this data structure and its transitive dependencies (`quantizer_preprocess.rs`) can all be moved to `diskann-disk`.
- `model/graph/graph_data_model/adjacency_list.rs`: This has been almost completely replaced by the struct in `diskann/graph`, but `diskann-disk` apparently uses the interesting conversion from `&[u8]`. This can either be moved to `diskann-disk`, or `diskann-disk` can be updated to use the implementation in `diskann/graph`.
- `model/graph/traits/graph_data_types.rs`: This type should no longer be needed by non-`diskann-disk` code (and if it is, changing it to a raw `T: VectorRepr` should be straightforward). We should remove this type from `diskann-providers` and update the remaining uses within `diskann-providers` to take a vector type, ID type, and associated data type independently, as needed. The problem with using this large list of associated types to instantiate generic functions that only use a portion of it is that it can lead to redundant monomorphizations, and it makes it unclear at the use site whether all associated data types are actually used.
- Get rid of `AlignedBoxWithSlice`. This has literally been replaced with `diskann-quantization::alloc::Poly` as its implementation. Its one non-trivial method, `split_into_nonoverlapping_mut_slices`, can simply be replaced with rayon's `ParChunksMut`.
Test Migration
- Migrate tests. Currently, `diskann_async.rs` is pulling double duty as tests for the inmem index and tests for `diskann/graph/index.rs`.
  The test infrastructure in `diskann` has gotten much better for running such unit tests.
  All unit tests should be migrated to `diskann` and use appropriate baselines.
  It might not be straightforward to decouple the graph invariant tests from the inmem-specific code, but using the test provider in `diskann` will provide much better checks on the indexing code.
  If we still want inmem tests, we should look carefully at using a trait object to wrap the code under test, to avoid the insane number of test monomorphizations currently present in `diskann_async.rs`.

I do question the usefulness of these tests when it comes to providing long-term guarantees. For really protecting the integrity of the inmem providers in an end-to-end context, I think we'll want the upcoming native A/B test infrastructure in `diskann-benchmark` and checked-in baselines as a basis for regression testing.
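The trait-object idea looks roughly like this. All names here (`DynIndex`, `ToyIndex`, `check_insert_invariants`) are hypothetical stand-ins, not the actual provider API:

```rust
/// Hypothetical object-safe facade over the provider under test. Each
/// concrete provider configuration is monomorphized once, at the point it
/// is boxed/borrowed as `dyn DynIndex`, instead of once per generic test.
trait DynIndex {
    fn insert(&mut self, id: u32, vector: &[f32]);
    fn len(&self) -> usize;
}

/// Toy stand-in for one inmem provider configuration.
struct ToyIndex {
    data: Vec<(u32, Vec<f32>)>,
}

impl DynIndex for ToyIndex {
    fn insert(&mut self, id: u32, vector: &[f32]) {
        self.data.push((id, vector.to_vec()));
    }
    fn len(&self) -> usize {
        self.data.len()
    }
}

/// The test driver takes `&mut dyn DynIndex`, so it compiles exactly once
/// no matter how many provider configurations feed into it.
fn check_insert_invariants(index: &mut dyn DynIndex) {
    index.insert(0, &[1.0, 0.0]);
    index.insert(1, &[0.0, 1.0]);
    assert_eq!(index.len(), 2);
}

fn main() {
    let mut idx = ToyIndex { data: Vec::new() };
    check_insert_invariants(&mut idx);
    assert_eq!(idx.data.len(), 2);
}
```

The dynamic dispatch overhead is irrelevant for test code, and the compile-time win scales with the number of provider configurations under test.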
PQ
Much of the PQ code in diskann-providers predates the diskann-quantization crate. Some items are candidates for moving to diskann-quantization or elsewhere.
What to do with FixedChunkPQTable
I really do not like this data structure. The way it stores the PQ pivot data is almost the worst possible layout for any kind of efficient processing.
Ideally, we'd use something more like the TransposedTable in diskann-quantization for distance table population (though this comes at the potential cost of slower quant-quant distances).
However, there are a few issues that keep us from being able to just do this:
- For historical reasons, a `centroids` vector has been used to center PQ data.
  For vanilla PQ based on L2 distances between centers, an affine shift is not really useful.
  However, we need to support this shift for backwards compatibility with some implementations.
  I'd be comfortable moving the application of this shift elsewhere in the pre-processing stack, but I do not think `diskann-quantization` should be obligated to support it.
- The state of OPQ is pretty nebulous at this point. As far as I know, no one is using it, and I'd be surprised if it actually worked correctly. We should make a decision on whether we actually want to support this or should simply ditch it. The latter would greatly simplify a lot of code.
- `FixedChunkPQTable` supports quant-quant distances. The implementation is efficient-ish but could be much better. We may consider building a distance table specifically to accelerate quant-quant distances, but this would be large (at least 64 kB per chunk).
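For illustration, a transposed pivot layout makes distance-table population a sequence of vectorizable passes over contiguous memory. This is only a sketch of the idea, not the actual `TransposedTable` in `diskann-quantization`; the `PivotChunk` type and its layout are assumptions:

```rust
/// Sketch of per-chunk L2 distance-table population for PQ, using a
/// transposed pivot layout (hypothetical, not the diskann-quantization type).
struct PivotChunk {
    dims: usize,
    num_centers: usize, // 256 for 8-bit PQ
    /// Transposed layout: for each dimension d, the values of all centers
    /// along d are contiguous (`transposed[d * num_centers + c]`), so the
    /// inner loop over centers is a clean, vectorizable stream.
    transposed: Vec<f32>, // length = dims * num_centers
}

impl PivotChunk {
    /// Build the distance table for one query sub-vector: table[c] ends up
    /// holding the squared L2 distance from the query to center c.
    fn distance_table(&self, query: &[f32]) -> Vec<f32> {
        assert_eq!(query.len(), self.dims);
        let mut table = vec![0.0f32; self.num_centers];
        for (d, &q) in query.iter().enumerate() {
            let row = &self.transposed[d * self.num_centers..(d + 1) * self.num_centers];
            for (t, &p) in table.iter_mut().zip(row) {
                let diff = q - p;
                *t += diff * diff;
            }
        }
        table
    }
}

fn main() {
    // 2 dims, 2 centers: center 0 = (0, 0), center 1 = (1, 1).
    let chunk = PivotChunk {
        dims: 2,
        num_centers: 2,
        transposed: vec![0.0, 1.0, /* dim 0 */ 0.0, 1.0 /* dim 1 */],
    };
    let table = chunk.distance_table(&[1.0, 1.0]);
    assert_eq!(table, vec![2.0, 0.0]);
}
```

The trade-off flagged above is visible here: this layout streams beautifully for table population, but quant-quant distances want the opposite access pattern (gather one center's full sub-vector), which is where the row-major `FixedChunkPQTable` layout is less bad.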
Candidates for moving to diskann-quantization
- Distance computers (table lookup + bulk lookup): Even this is not necessarily straightforward because of memory pooling.
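The pooling entanglement can be pictured with a toy buffer pool: the distance computers check scratch tables out of and back into a pool, so moving them to `diskann-quantization` means moving or abstracting this lifecycle too. The `TablePool` type here is a hypothetical sketch, not existing code:

```rust
/// Toy scratch pool: reuse distance-table buffers across queries instead
/// of allocating one per query. Hypothetical; for illustration only.
struct TablePool {
    free: Vec<Vec<f32>>,
    table_len: usize,
}

impl TablePool {
    fn new(table_len: usize) -> Self {
        TablePool { free: Vec::new(), table_len }
    }

    /// Hand out a zeroed buffer, reusing a returned allocation if one is
    /// available.
    fn acquire(&mut self) -> Vec<f32> {
        match self.free.pop() {
            Some(mut buf) => {
                buf.clear();
                buf.resize(self.table_len, 0.0);
                buf
            }
            None => vec![0.0; self.table_len],
        }
    }

    /// Return a buffer to the pool for later reuse.
    fn release(&mut self, buf: Vec<f32>) {
        self.free.push(buf);
    }
}

fn main() {
    let mut pool = TablePool::new(256);
    let a = pool.acquire();
    let ptr = a.as_ptr();
    pool.release(a);
    // The next acquire reuses the same allocation (same capacity, so
    // clear + resize do not reallocate).
    let b = pool.acquire();
    assert_eq!(b.as_ptr(), ptr);
    assert_eq!(b.len(), 256);
}
```

Whatever crate ends up owning the distance computers has to either own a pool like this or take it through a trait, which is the part that makes the move non-mechanical.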
Candidates for moving out
- Multi-PQ: This can likely be moved internally to where it is actually needed.
Open Questions
The code in pq_construction.rs is too opinionated on its file operations to justify moving to a lower level crate.
Save/Load Overhaul
Most of the code in storage/* is related to the various ad-hoc storage mechanisms used by various index implementations.
Suhas has been working on a proposal for a structured and unified way of saving and loading.
Hopefully, we can use this to dump the problematic SaveWith/LoadWith traits and get rid of the various ad-hoc methods strewn about.
I do not think it's worth trying to clean this up until the replacement is ready.
Extracting and Fortifying Inmem
There are currently several issues with the in-memory providers, not least of which is being buried 8 layers deep in a directory.
- Deep-rooted unsafety in `AlignedMemoryVectorStore` (this is a hard problem to tackle efficiently).
- Tricky generic bounds on some quantizers (this is largely `diskann-quantization`'s fault, I'm sorry).
- Sketchy APIs around start point creation/constructors.
These do not necessarily need to be fixed to create a diskann-inmem crate, but there are some considerations that must be made first.
Because of the reliance of the inmem providers on the PQ code, we either need to fix the PQ story before isolating, or make a new diskann-inmem that relies on diskann-providers until the other issues can be sorted out.
Similarly, the save/load story keeps these implementations largely coupled with the storage related code in diskann-providers.
Thus, without these underlying issues being addressed, a crate split wouldn't really make a semantic boundary.
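On the unsafety point above, one direction is to shrink the `unsafe` surface to construction and `Drop`, exposing only safe slice views. A toy sketch of that shape, not the actual `AlignedMemoryVectorStore` (the `AlignedF32` type is hypothetical):

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Toy aligned f32 buffer: the unsafety is confined to `new` and `Drop`,
/// and everything handed out is a safe, initialized slice.
struct AlignedF32 {
    ptr: *mut f32,
    len: usize,
    layout: Layout,
}

impl AlignedF32 {
    /// `len` must be non-zero and `align` a power of two.
    fn new(len: usize, align: usize) -> Self {
        assert!(len > 0);
        let layout =
            Layout::from_size_align(len * std::mem::size_of::<f32>(), align).expect("bad layout");
        // SAFETY: layout has non-zero size; all-zero bits are a valid 0.0f32.
        let ptr = unsafe { alloc_zeroed(layout) as *mut f32 };
        assert!(!ptr.is_null(), "allocation failed");
        AlignedF32 { ptr, len, layout }
    }

    fn as_slice(&self) -> &[f32] {
        // SAFETY: `ptr` is valid for `len` f32s and fully initialized.
        unsafe { std::slice::from_raw_parts(self.ptr, self.len) }
    }
}

impl Drop for AlignedF32 {
    fn drop(&mut self) {
        // SAFETY: `ptr` came from `alloc_zeroed` with this exact layout.
        unsafe { dealloc(self.ptr as *mut u8, self.layout) }
    }
}

fn main() {
    let v = AlignedF32::new(16, 64);
    assert_eq!(v.as_slice().len(), 16);
    assert_eq!(v.as_slice()[0], 0.0);
    assert_eq!(v.ptr as usize % 64, 0);
}
```

The hard part the bullet alludes to is doing this without losing performance in the real store (growth, concurrent readers, etc.); the sketch only shows the containment pattern.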