Remove centroid from most PQ interfaces.#1010
Conversation
There was a problem hiding this comment.
Pull request overview
This PR removes explicit dataset-centroid handling from most PQ interfaces, shifts compatibility to the storage/load path by folding legacy centroids into pivots, and simplifies downstream consumers so they can treat PQ tables as centroid-free. In the broader codebase, this moves PQ closer to the diskann-quantization abstractions and removes L2-only centering behavior from general-purpose interfaces.
Changes:
- Removes centroid fields/parameters from
FixedChunkPQTable,GeneratePivotArguments, and the_membufPQ construction/compression APIs. - Updates PQ storage read/write paths to preserve the on-disk layout and recover legacy behavior by folding non-zero stored centroids into pivots on load.
- Simplifies disk PQ consumers to always use
TransposedTable, and updates construction/save call sites across providers, disk builder code, tools, and benchmarks.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
diskann-tools/src/utils/build_pq.rs |
Updates tool-side PQ pivot generation call to the new generate_pq_pivots signature. |
diskann-providers/src/storage/pq_storage.rs |
Adds optional-centroid write path, legacy-centroid folding on load, and storage round-trip tests. |
diskann-providers/src/model/pq/pq_construction.rs |
Removes centroid handling from in-memory PQ APIs and adds legacy-centering toggle to file-based pivot generation. |
diskann-providers/src/model/pq/generate_pivot_arguments.rs |
Removes centroid-centering state from PQ generation arguments. |
diskann-providers/src/model/pq/fixed_chunk_pq_table.rs |
Drops centroid storage/processing from the PQ table type and updates related tests/helpers. |
diskann-providers/src/model/pq/distance/test_utils.rs |
Updates PQ test helpers to construct centroid-free tables. |
diskann-providers/src/model/pq/distance/l2.rs |
Stops query preprocessing from subtracting centroids before lookup-table generation. |
diskann-providers/src/model/graph/provider/async_/memory_quant_vector_provider.rs |
Saves PQ pivots without a centroid payload and updates test construction. |
diskann-providers/src/model/graph/provider/async_/inmem/provider.rs |
Refreshes public doc examples for the new FixedChunkPQTable::new shape. |
diskann-providers/src/model/graph/provider/async_/fast_memory_quant_vector_provider.rs |
Switches async fast in-memory PQ provider save/test paths to centroid-free tables. |
diskann-providers/src/model/graph/provider/async_/experimental/multi_pq_async.rs |
Updates experimental multi-PQ compression call sites to the new membuf API. |
diskann-providers/src/model/graph/provider/async_/bf_tree/quant_vector_provider.rs |
Adjusts BF-tree quant provider tests to the centroid-free table constructor. |
diskann-providers/src/model/graph/provider/async_/bf_tree/provider.rs |
Saves BF-tree PQ pivots without centroids and updates related tests. |
diskann-providers/src/index/diskann_async.rs |
Removes centroid buffers from async index PQ training and test setup. |
diskann-disk/src/storage/quant/pq/pq_generation.rs |
Threads the new legacy-centering flag through disk PQ generation. |
diskann-disk/src/storage/quant/pq/pq_dataset.rs |
Replaces the PQTable enum with always-transposed storage in PQData. |
diskann-disk/src/storage/quant/pq/mod.rs |
Removes the now-obsolete PQTable re-export. |
diskann-disk/src/storage/quant/mod.rs |
Stops re-exporting PQTable from the public quant module. |
diskann-disk/src/search/pq/quantizer_preprocess.rs |
Simplifies PQ preprocessing to a single TransposedTable path. |
diskann-disk/src/build/builder/quantizer.rs |
Saves builder-generated PQ pivots without a centroid payload. |
diskann-benchmark/src/backend/exhaustive/product.rs |
Updates benchmark-side PQ table construction to the centroid-free constructor. |
Comments suppressed due to low confidence (4)
diskann-providers/src/model/pq/pq_construction.rs:166
- This public
_membufentry point no longer exposes the centroid output buffer, which breaks existing callers that still use the 0.50.x signature. The PR description calls out caller-side migration, but the crate versioning policy says these packages obey SemVer, so this signature change needs a shim/deprecation path or a major release.
pub fn generate_pq_pivots_from_membuf<T: Copy + Into<f32>>(
parameters: &GeneratePivotArguments,
train_data_slice: &[T],
offsets: &mut [usize],
full_pivot_data: &mut [f32],
rng: &mut (impl Rng + ?Sized),
cancellation_token: &mut bool,
pool: RayonThreadPoolRef<'_>,
diskann-providers/src/model/pq/pq_construction.rs:166
- The doc comment for this public function still describes the removed
make_zero_mean/centroidparameters, so it no longer matches the actual API. Anyone reading the generated docs will be told to pass buffers that this signature no longer accepts.
pub fn generate_pq_pivots_from_membuf<T: Copy + Into<f32>>(
parameters: &GeneratePivotArguments,
train_data_slice: &[T],
offsets: &mut [usize],
full_pivot_data: &mut [f32],
rng: &mut (impl Rng + ?Sized),
cancellation_token: &mut bool,
pool: RayonThreadPoolRef<'_>,
diskann-providers/src/model/pq/pq_construction.rs:80
- Adding the new
legacy_center_dataparameter changes the signature of a public, re-exported function. That is another SemVer break fordiskann-providersconsumers on 0.50.x, so this needs a compatibility wrapper or a major-version plan instead of replacing the existing API in place.
pub fn generate_pq_pivots<Storage, Random>(
parameters: GeneratePivotArguments,
legacy_center_data: bool,
train_data: &mut [f32],
pq_storage: &PQStorage,
storage_provider: &Storage,
random_provider: RandomProvider<Random>,
pool: RayonThreadPoolRef<'_>,
diskann-providers/src/storage/pq_storage.rs:98
- Changing
write_pivot_datafrom taking&[f32]toOption<&[f32]>is a source-breaking public API change fordiskann-providersconsumers. Because the workspace is still on a SemVer-governed 0.50.x line, this needs a compatibility layer or a coordinated major-version bump rather than replacing the signature in place.
pub fn write_pivot_data<Storage>(
&self,
full_pivot_data: &[f32],
centroid: Option<&[f32]>,
chunk_offsets: &[usize],
num_centers: usize,
dim: usize,
storage_provider: &Storage,
) -> ANNResult<()>
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pub fn generate_pq_data_from_pivots_from_membuf<T: Copy + Into<f32>>( | ||
| vector_data: &[T], | ||
| pivot_data: &[f32], | ||
| num_pivots: usize, | ||
| centroid: Option<&[f32]>, | ||
| offsets: &[usize], | ||
| pq_out: &mut [u8], |
| /// * `centroid` - Optional per-dimension centroid. Pass `None` for the standard | ||
| /// (non-legacy) code path; a zero vector of length `dim` is written to preserve | ||
| /// the on-disk file format. Pass `Some(centroid)` only when legacy centroid | ||
| /// centering is enabled (see [`GeneratePivotArguments::with_legacy_centering`]). |
| /// | ||
| /// Panics under the following condition: | ||
| /// | ||
| /// * `base_vec.length() != self.get_dim()`. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1010 +/- ##
==========================================
- Coverage 89.49% 89.49% -0.01%
==========================================
Files 448 448
Lines 84118 84021 -97
==========================================
- Hits 75282 75193 -89
+ Misses 8836 8828 -8
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
harsha-simhadri
left a comment
There was a problem hiding this comment.
ok with this. could you please run on a few medium size datasets and check recall before merging. Thanks.
arkrishn94
left a comment
There was a problem hiding this comment.
Thanks for doing this Mark, lgtm. The only comment I had was on supporting the backward compatibility of generate_pq_pivots for diskann-disk tests. If you can create a small issue/sub-issue on this, we can close it in a quick follow up.
| /// | ||
| /// Argument `legacy_center_data` will center the provided data by the dataset mean. | ||
| /// This is to supply backwards compatibility with some `diskann-disk` tests that used this | ||
| /// feature and require exact reproducibility in some tests. | ||
| /// | ||
| /// This argument should **only** be used if the distance metric being used is L2. Otherwise | ||
| /// any computed distance on the resulting PQ compressed data will be incorrect. |
There was a problem hiding this comment.
I think we can reset the baseline for these tests by generating the pivots without centering so that we can remove this path here. Mind creating a small issue so we can pick this up as a quick follow up?
| /// For files written without legacy centering (`centroid = None` in | ||
| /// [`Self::write_pivot_data`]), the returned centroid will be all zeros and | ||
| /// can safely be accumulated as a no-op. | ||
| pub fn load_existing_pivot_data<Storage>( |
There was a problem hiding this comment.
It seems that we have two functions doing the same thing - loading pq pivots:
- load_existing_pivot_data
- load_pq_pivots_bin
We should converge them (in a separate PR). Let me create a work item for that.
| .unwrap(); | ||
|
|
||
| let (loaded_pivots, loaded_centroid, loaded_offsets) = pq_storage | ||
| .load_existing_pivot_data(&num_pq_chunks, &num_centers, &dim, &storage_provider) |
There was a problem hiding this comment.
Let's also ensure that load_pq_pivots_bin() loads pivots correctly and they match pivots.
Remove the "centroid" from most PQ interfaces. This centroid is/was the dataset centroid and served to shift a vector before compressing or populating the distance lookup table. Here are the problems:
This kind of shifting only works when computing L2 distances. This kind of shift does not preserve inner products or cosine similarity. So it is a little bit of a foot-gun and was already not being applied when these distances were being computed, leading to a somewhat confusing scenario where a centroid can be provided to the
FixedChunkPQTableconstructor and then ignored.When L2 distances are being used - application of the centroid can be done implicitly by adding it to the PQ pivots rather than subtracting it from each vector that gets processed. In this way, it always gets applied.
It's possible this has a small effect on training by conditioning the data a little better, but when we've applied this empirically in the past, it has not made a difference.
Carrying around this centroid is making it awkward to cleanly move PQ to
diskann-quantization, so this PR takes the first steps towards removing it.Logical Flow/Review Order
diskann_providers/src/model/pq/generate_pivot_arguments.rs:translate_to_centeris removed fromGeneratePivotArguments.diskann_providers/src/model/pq/pq_construction.rs: The functiongenerate_pq_pivotsgains alegacy_center_databoolean argument to preserve the behavior ofdiskann-disk. This is necessary because tests indiskann-diskcheck against hard-coded pivots. Not centering the data results in different PQ codes due to slightly different floating point calculations, breaking these tests.Data translation is completely removed from the
_membufAPIs.diskann_providers/src/model/pq/fixed_chunk_pq_table.rs: Remove centroids altogether fromFixedChunkPQTable.diskann-providers/src/storage/pq_storage.rs: This is where backwards compatibility comes in. We cannot change the layout of the storage file and we also need to support previously generated files. To that end,load_pq_pivots_binis updated to check if the loaded centroid is non-zero. If so, it is accumulated into the pivots to restore the original numerical behavior. Several tests are added to ensure this behavior.For saving, when no centroid is available, it can be passed to
write_pivot_dataas anOption::Noneand will then be inserted as zeros into the saved file.diskann-disk/src/search/pq/pq_dataset.rsanddiskann-disk/src/search/pq/quantizer_preprocess.rs: Preprocessing can now be greatly simplified. Because theFixedChunkPQTableno longer has centroids, theTransposedTable(offering much faster pre-processing times) can always be used.The rest of the changes involve updating the construction sites of
FixedChunkPQTableand invoking the new saving interface.Backwards Compatibility
There exist PQ schemas in the wild that were generated with centroids and we still need to correctly process those.
diskann-disk: PQ schemas previously generated I believe have been saved withPQStorage. Thus, reloading into aFixedChunkPQTablewithPQStorage::load_pq_pivots_binwill merge the centroid with the PQ pivots._membufAPI: Backwards compatibility is handled by the caller (the only real user I know of uses this via FFI, and I will take responsibility for updating that). Again, this can be done by simply folding the centroid into the pivot data, or manually doing the centering. Both of which are quite easy to do.