[PQ Cleanup] Part 2: Consolidate calculate_chunk_offsets* and accum_row_inplace #976

Open
arkrishn94 wants to merge 15 commits into main from u/adkrishnan/pq-cleanup-3

Conversation

@arkrishn94
Contributor

@arkrishn94 arkrishn94 commented Apr 24, 2026

A couple of small independent cleanups for PQ.

Move calculate_chunk_offsets[_auto] to ChunkOffsets as a constructor.

These two functions are pure prefix-sum math over dimensions and num_pq_chunks, so ChunkOffsetsBase / ChunkOffsetsView in diskann-quantization::views is the natural home for them. I've moved the logic into two constructors, from_dim for ChunkOffsets and from_dim_into for ChunkOffsetsView, to support both allocation patterns (owned and borrowed storage).
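
To make the prefix-sum concrete, here is a minimal sketch of the two allocation patterns. The names `fill_chunk_offsets` and `chunk_offsets` are hypothetical stand-ins, not the PR's `from_dim` / `from_dim_into` constructors, and input validation is left out on purpose.

```rust
/// Write chunk boundaries into `offsets`, which must hold `num_chunks + 1`
/// entries. The first `dim % num_chunks` chunks get one extra dimension.
/// Panics if `num_chunks` is zero or `offsets` is too short; validation is
/// deliberately omitted from this sketch.
fn fill_chunk_offsets(dim: usize, num_chunks: usize, offsets: &mut [usize]) {
    let (base, extra) = (dim / num_chunks, dim % num_chunks);
    offsets[0] = 0;
    for i in 0..num_chunks {
        // Running sum: previous boundary plus this chunk's width.
        offsets[i + 1] = offsets[i] + base + usize::from(i < extra);
    }
}

/// Allocating variant: builds a fresh vector and fills it in place.
fn chunk_offsets(dim: usize, num_chunks: usize) -> Vec<usize> {
    let mut offsets = vec![0; num_chunks + 1];
    fill_chunk_offsets(dim, num_chunks, &mut offsets);
    offsets
}
```

With 8 dimensions and 3 chunks, both variants produce the boundaries [0, 3, 6, 8].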

All in-repo call sites have been updated. There might be some overlapping edits with #1010.

Minor changes

  • QueryComputer and DistanceComputer had six trampoline impls forwarding &Vec<u8> and &&[u8] arguments to the canonical &[u8] impls. ElementRef in the accessor now allows us to get rid of these! This might conflict with #1008 (Remove unnecessary distance function implementations).
  • Inline accum_row_inplace from diskann-providers/src/model/pq at the two call-sites.
  • Move get_chunk_from_training_data in pq_construction.rs from public API into tests where it is used.
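
For readers unfamiliar with the term, a "trampoline impl" is an impl whose only job is to forward to another impl. The sketch below illustrates the pattern with a hypothetical `Distance` trait and `Hamming` type; it does not reproduce QueryComputer, DistanceComputer, or ElementRef, whose definitions are not shown in this thread.

```rust
/// Hypothetical stand-in for the distance traits discussed above.
trait Distance<Q> {
    fn distance(&self, query: Q) -> u32;
}

struct Hamming {
    code: Vec<u8>,
}

/// Canonical impl over a byte slice.
impl Distance<&[u8]> for Hamming {
    fn distance(&self, query: &[u8]) -> u32 {
        self.code
            .iter()
            .zip(query)
            .map(|(a, b)| (a ^ b).count_ones())
            .sum()
    }
}

/// A trampoline: it adds no logic and only forwards to the slice impl.
impl Distance<&Vec<u8>> for Hamming {
    fn distance(&self, query: &Vec<u8>) -> u32 {
        self.distance(query.as_slice())
    }
}

/// If callers hand over slices, the canonical impl alone is enough.
fn distance_to_all(d: &Hamming, queries: &[Vec<u8>]) -> Vec<u32> {
    queries.iter().map(|q| d.distance(q.as_slice())).collect()
}
```

Once every caller can produce the canonical argument type, the forwarding impls become dead weight and can be deleted without changing results.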

Relocates the object pool module so that it is available to crates that depend on diskann-utils but not diskann (notably diskann-quantization, which will gain pool-aware distance-table allocation in a follow-up). diskann::utils::object_pool stays as a re-export for backwards compatibility.

Direct importers in diskann-providers, diskann-disk, and diskann-garnet are switched to use diskann_utils::object_pool directly. Internal diskann users continue to use the re-export.
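
The backwards-compatible re-export can be pictured with a single-file sketch in which modules stand in for crates. The `ObjectPool` below is a hypothetical minimal pool, not the real type's API.

```rust
/// Stand-in for the `diskann-utils` crate, which now owns the module.
mod diskann_utils {
    pub mod object_pool {
        /// Minimal hypothetical pool; the real API is not shown in this PR.
        pub struct ObjectPool<T> {
            free: Vec<T>,
        }

        impl<T: Default> ObjectPool<T> {
            pub fn new() -> Self {
                Self { free: Vec::new() }
            }

            /// Reuse a returned object if one is available, else create one.
            pub fn get(&mut self) -> T {
                self.free.pop().unwrap_or_default()
            }

            pub fn put(&mut self, item: T) {
                self.free.push(item);
            }
        }
    }
}

/// Stand-in for the `diskann` crate, which keeps the old path alive.
mod diskann {
    pub mod utils {
        // Re-export: both paths now name the same module and the same types.
        pub use super::super::diskann_utils::object_pool;
    }
}
```

Because a re-export names the same item, code using the old path and code using the new path can exchange values freely.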
@arkrishn94 arkrishn94 requested review from a team and Copilot April 24, 2026 18:50
@arkrishn94 arkrishn94 changed the title {PQ Cleanup] Part 2: Relocate calculate_chunk_offsets* and remove redundant distance impls and [PQ Cleanup] Part 2: Relocate calculate_chunk_offsets* and remove redundant distance impls and Apr 24, 2026
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Relocates PQ-related chunk offset helpers into diskann-quantization, centralizes the cosine→L2 fallback rationale for disk PQ preprocessing, and removes redundant distance trampoline impls by leaning on accessor element refs.

Changes:

  • Moved calculate_chunk_offsets[_auto] from diskann-providers into diskann-quantization::views and updated call sites.
  • Migrated object_pool usage to diskann-utils::object_pool across crates and removed diskann::utils::object_pool.
  • Deduplicated cosine/L2 fallback commentary and removed redundant distance impls.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| diskann/src/utils/mod.rs | Removes utils::object_pool module exposure. |
| diskann/src/graph/search/scratch.rs | Switches AsPooled import to diskann_utils. |
| diskann/src/graph/index.rs | Switches ObjectPool/PooledRef import to diskann_utils. |
| diskann-utils/src/lib.rs | Exposes object_pool module publicly from diskann-utils. |
| diskann-quantization/src/views.rs | Adds calculate_chunk_offsets[_auto] helpers. |
| diskann-providers/src/model/pq/pq_construction.rs | Removes local chunk-offset helpers and imports from quantization crate. |
| diskann-providers/src/model/pq/mod.rs | Stops re-exporting chunk-offset helpers from providers PQ module. |
| diskann-providers/src/model/pq/distance/test_utils.rs | Updates import path for calculate_chunk_offsets_auto. |
| diskann-providers/src/model/pq/distance/l2.rs | Switches object pool import to diskann_utils. |
| diskann-providers/src/model/pq/distance/innerproduct.rs | Switches object pool import to diskann_utils. |
| diskann-providers/src/model/pq/distance/dynamic.rs | Switches object pool import and removes redundant trait impls. |
| diskann-providers/src/model/pq/distance/common.rs | Switches object pool import to diskann_utils. |
| diskann-providers/src/model/mod.rs | Removes re-export of calculate_chunk_offsets_auto. |
| diskann-providers/src/model/graph/provider/async_/memory_quant_vector_provider.rs | Switches object pool import to diskann_utils. |
| diskann-providers/src/model/graph/provider/async_/fast_memory_quant_vector_provider.rs | Switches object pool import and updates tests for new argument types. |
| diskann-providers/src/model/graph/provider/async_/bf_tree/quant_vector_provider.rs | Switches object pool import to diskann_utils. |
| diskann-garnet/src/provider.rs | Switches object pool imports to diskann_utils. |
| diskann-disk/src/search/provider/disk_provider.rs | Switches object pool imports to diskann_utils. |
| diskann-disk/src/search/pq/quantizer_preprocess.rs | Centralizes cosine→L2 fallback rationale into module docs. |
| diskann-benchmark/src/backend/exhaustive/product.rs | Updates chunk-offset helper call site to diskann-quantization. |

Comment thread diskann-quantization/src/views.rs Outdated
Comment on lines +219 to +231
pub fn calculate_chunk_offsets(dimensions: usize, num_pq_chunks: usize, offsets: &mut [usize]) {
    // Calculate each chunk's offset.
    // If we have 8 dimensions and 3 chunks then offsets would be [0,3,6,8].
    let mut chunk_offset: usize = 0;
    offsets[0] = chunk_offset;
    for chunk_index in 0..num_pq_chunks {
        chunk_offset += dimensions / num_pq_chunks;
        if chunk_index < (dimensions % num_pq_chunks) {
            chunk_offset += 1;
        }
        offsets[chunk_index + 1] = chunk_offset;
    }
}

Copilot AI Apr 24, 2026

As a public helper, this can panic with an unhelpful message when num_pq_chunks == 0 (division/mod by zero) or when offsets.len() != num_pq_chunks + 1 (out-of-bounds on offsets[0] / offsets[chunk_index + 1]). Consider adding an explicit check (e.g., assert!(num_pq_chunks > 0, ...) and assert_eq!(offsets.len(), num_pq_chunks + 1, ...)) or changing the API to return a Result with a clear error.
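
A minimal sketch of the assertion-based option, keeping the loop from the snippet under review and adding only the two precondition checks. The `_checked` suffix is a hypothetical name used here to distinguish it from the original.

```rust
/// Same prefix-sum as the reviewed helper, with explicit precondition checks
/// so that misuse panics with a message that names the actual problem.
pub fn calculate_chunk_offsets_checked(
    dimensions: usize,
    num_pq_chunks: usize,
    offsets: &mut [usize],
) {
    assert!(num_pq_chunks > 0, "num_pq_chunks must be greater than zero");
    assert_eq!(
        offsets.len(),
        num_pq_chunks + 1,
        "offsets must hold num_pq_chunks + 1 entries"
    );

    let mut chunk_offset = 0;
    offsets[0] = chunk_offset;
    for chunk_index in 0..num_pq_chunks {
        chunk_offset += dimensions / num_pq_chunks;
        if chunk_index < dimensions % num_pq_chunks {
            chunk_offset += 1;
        }
        offsets[chunk_index + 1] = chunk_offset;
    }
}
```

The checks still panic, so this only improves the message; returning a Result is the alternative when callers need to recover.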

Comment thread diskann/src/utils/mod.rs
Comment thread diskann-providers/src/model/pq/mod.rs
Comment thread diskann-disk/src/search/pq/quantizer_preprocess.rs Outdated
Comment thread diskann-quantization/src/views.rs Outdated
@codecov-commenter

codecov-commenter commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 90.83969% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.50%. Comparing base (65857ad) to head (58a398d).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| diskann-disk/src/storage/quant/pq/pq_generation.rs | 54.54% | 5 Missing ⚠️ |
| diskann-providers/src/model/pq/pq_construction.rs | 87.50% | 5 Missing ⚠️ |
| diskann-quantization/src/views.rs | 97.29% | 2 Missing ⚠️ |

@@            Coverage Diff             @@
##             main     #976      +/-   ##
==========================================
- Coverage   89.55%   89.50%   -0.05%     
==========================================
  Files         459      460       +1     
  Lines       85006    85489     +483     
==========================================
+ Hits        76131    76521     +390     
- Misses       8875     8968      +93     
| Flag | Coverage Δ |
| --- | --- |
| miri | 89.50% <90.83%> (-0.05%) ⬇️ |
| unittests | 89.35% <90.83%> (-0.05%) ⬇️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...iskann-benchmark/src/backend/exhaustive/product.rs | 100.00% <ø> (ø) |
| ...ovider/async_/fast_memory_quant_vector_provider.rs | 98.46% <100.00%> (-0.01%) ⬇️ |
| diskann-providers/src/model/pq/distance/dynamic.rs | 94.11% <ø> (+2.81%) ⬆️ |
| ...kann-providers/src/model/pq/distance/test_utils.rs | 100.00% <100.00%> (ø) |
| diskann-quantization/src/views.rs | 98.27% <97.29%> (-0.34%) ⬇️ |
| diskann-disk/src/storage/quant/pq/pq_generation.rs | 89.84% <54.54%> (-3.38%) ⬇️ |
| diskann-providers/src/model/pq/pq_construction.rs | 92.78% <87.50%> (-1.13%) ⬇️ |

... and 21 files with indirect coverage changes

@arkrishn94 arkrishn94 changed the title [PQ Cleanup] Part 2: Relocate calculate_chunk_offsets* and remove redundant distance impls and [PQ Cleanup] Part 2: Relocate calculate_chunk_offsets* and remove redundant distance impls Apr 25, 2026
Comment thread diskann-utils/src/views.rs Outdated
@arkrishn94 arkrishn94 changed the title [PQ Cleanup] Part 2: Relocate calculate_chunk_offsets* and remove redundant distance impls [PQ Cleanup] Part 2: Consolidate calculate_chunk_offsets* and accum_row_inplace May 4, 2026
Contributor

@hildebrandmw hildebrandmw left a comment

Turns out you updated the code midway through the review 😄

That said, I think my comments still hold.

Comment thread diskann-quantization/src/views.rs Outdated
pub fn from_dimensions(
dimensions: usize,
num_pq_chunks: usize,
) -> Result<Self, ChunkOffsetError> {
Contributor

In the spirit of bike-shedding, what about something like

pub fn partition(
    dim: NonZeroUsize,
    len: NonZeroUsize,
) -> Result<Self, ChunkOffsetError> {
    
}

Which would be internally consistent with the dim()/len() methods on chunk_offsets.

Also - bonus points for erroring before allocating if at all possible. The final result can still go through Self::new to make sure all the invariants are upheld, but if we can tell right away that things are not going to work out, might as well skip some unnecessary work.

I'm not completely set on NonZeroUsize, but do consider using bespoke errors here. Right now, if you pass num_pq_chunks == 0, the error is "offsets must have a length of at least 2, found 1" which is a little confusing. Perhaps even more confusing is when we pass num_pq_chunks > dim, in which case the error is "offsets must be strictly increasing ...".
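
One way the suggestion could look, as a sketch only: `PartitionError` is a hypothetical error type (the PR's `ChunkOffsetError` may differ), and the function returns a plain `Vec<usize>` instead of going through `Self::new`. `NonZeroUsize` rules out the zero-chunk case at the type level, and the remaining check runs before any allocation.

```rust
use std::fmt;
use std::num::NonZeroUsize;

/// Hypothetical bespoke error for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum PartitionError {
    /// More chunks were requested than there are dimensions to split.
    TooManyChunks { dim: usize, len: usize },
}

impl fmt::Display for PartitionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::TooManyChunks { dim, len } => write!(
                f,
                "cannot split {dim} dimensions into {len} non-empty chunks"
            ),
        }
    }
}

impl std::error::Error for PartitionError {}

/// Split `dim` dimensions into `len` chunks, returning `len + 1` offsets.
pub fn partition(dim: NonZeroUsize, len: NonZeroUsize) -> Result<Vec<usize>, PartitionError> {
    let (dim, len) = (dim.get(), len.get());
    // Validate first so an invalid request never allocates.
    if len > dim {
        return Err(PartitionError::TooManyChunks { dim, len });
    }
    let (base, extra) = (dim / len, dim % len);
    let mut offsets = Vec::with_capacity(len + 1);
    let mut next = 0;
    offsets.push(next);
    for i in 0..len {
        next += base + usize::from(i < extra);
        offsets.push(next);
    }
    Ok(offsets)
}
```

With this shape, asking for more chunks than dimensions reports that directly instead of surfacing as a "strictly increasing" invariant failure.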

Contributor Author

Thanks Mark. Made sure to catch the errors upfront and hopefully raise them correctly. Lmk if that looks good.

I'm less convinced about the naming, from_dim and from_dim_into seem reasonable? I'm not sure I'm following the naming of partition and partition_in :p

pub fn from_dimensions_into(
dimensions: usize,
num_pq_chunks: usize,
scratch: &'a mut [usize],
Contributor

We can derive num_pq_chunks from scratch.len(), in which case this can be

pub fn partition_in(
    dim: usize,
    offsets: &mut [usize],
)

This has a nice symmetry with the earlier suggestion of partition for the constructing case. It would also be nice if we could avoid mutating offsets when we know the configuration is invalid.
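
A sketch of the in-place variant under the same assumptions: the chunk count is derived from the buffer length, and every check runs before the first write so the buffer stays untouched on error. `PartitionInError` is hypothetical, and the function returns `()` instead of a view type.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum PartitionInError {
    /// `offsets` needs at least two entries to describe one chunk.
    BufferTooShort { found: usize },
    /// More chunks than dimensions would leave some chunk empty.
    TooManyChunks { dim: usize, chunks: usize },
}

/// Write chunk boundaries for `dim` dimensions into `offsets`.
/// The chunk count is derived as `offsets.len() - 1`.
pub fn partition_in(dim: usize, offsets: &mut [usize]) -> Result<(), PartitionInError> {
    // All validation happens before the first write, so `offsets` is
    // left unmodified when an error is returned.
    if offsets.len() < 2 {
        return Err(PartitionInError::BufferTooShort { found: offsets.len() });
    }
    let chunks = offsets.len() - 1;
    if chunks > dim {
        return Err(PartitionInError::TooManyChunks { dim, chunks });
    }

    let (base, extra) = (dim / chunks, dim % chunks);
    let mut next = 0;
    for (i, slot) in offsets.iter_mut().enumerate() {
        *slot = next;
        next += base + usize::from(i < extra);
    }
    Ok(())
}
```

Deriving the count from the slice removes one way for the two arguments to disagree, at the cost of making the buffer length part of the function's contract.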

Comment thread diskann-utils/src/views.rs Outdated
self.data.as_mut_slice().chunks_exact_mut(ncols)
}

/// Apply `op(row[j], &broadcast[j])` to every entry of every row, broadcasting
Contributor

Here's the rabbit hole I went down:

First, I felt that this was far too specific targeting a very niche use case and would instead be better served with

pub fn try_map_row_mut<F, R>(&mut self, f: F) -> R
where
    F: FnMut(&mut [T::Elem]) -> Result<(), R>

Then I realized that this could be trivially implemented with .row_iter_mut().try_for_each(f)

At which point, this made me question whether or not this is even needed.

My suggestion is to remove it if possible since it's a very specific function that can already be done almost trivially through existing means.
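
To illustrate why a dedicated helper may be unnecessary, here is a sketch of a fallible row-wise broadcast written with standard iterator methods only. It uses a flat row-major slice and `chunks_exact_mut` as a stand-in for `row_iter_mut()`, since `MatrixBase` is not shown here; `Overflow` and `add_to_rows` are hypothetical names.

```rust
/// Hypothetical error for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub struct Overflow;

/// Add `broadcast[j]` to column `j` of every row of a row-major matrix,
/// stopping at the first overflow. Rows processed before the failure keep
/// their updated values.
pub fn add_to_rows(data: &mut [u32], ncols: usize, broadcast: &[u32]) -> Result<(), Overflow> {
    assert!(ncols > 0, "ncols must be non-zero");
    assert_eq!(broadcast.len(), ncols, "broadcast must have one entry per column");

    data.chunks_exact_mut(ncols).try_for_each(|row| {
        for (x, b) in row.iter_mut().zip(broadcast) {
            *x = x.checked_add(*b).ok_or(Overflow)?;
        }
        Ok(())
    })
}
```

The whole operation is a row iterator plus `try_for_each`, which is the observation that motivates inlining it at the call sites.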

Contributor Author

@arkrishn94 arkrishn94 May 5, 2026

Inlined this operation at the two call-sites. Removed this from MatrixBase and the associated testing.

4 participants