feat: support IVF partitions multi-split #6423
    T::Native: Dot + L2 + Normalize,
    PrimitiveArray<T>: From<Vec<T::Native>>,
{
    let Some((row_ids, vectors)) = self.load_partition_raw_vectors(part_idx).await? else {
load_partition_raw_vectors flattens multivectors into individual vectors, so a single row ID can map to several different vectors. Will this cause multivector data to go missing from the index?
    new_centroids.extend(centroids.iter().map(|vec| vec.unwrap()));
    let split_plans = stream::iter(split_partitions.iter().copied().enumerate())
        .map(|(split_order, part_idx)| async move {
            let centroid2_part_idx = ivf.num_partitions() + split_order;
centroid2_part_idx is computed from ivf.num_partitions(). Could some partitions have been filtered out, so that the centroid2_part_idx stored in the plan exceeds the actual number of partitions? If so, the subsequent assign_ops[*target_idx] access might panic.
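To make the concern concrete, here is a minimal sketch of a bounds-checked lookup that returns an error instead of panicking when the target index is out of range. The AssignOps type and the push_assignment helper are hypothetical stand-ins for illustration only; they are not taken from the PR, and the real assign_ops structure may look different.

```rust
/// Hypothetical stand-in for the per-partition assignment lists;
/// the real `assign_ops` type in the PR may differ.
type AssignOps = Vec<Vec<u64>>;

/// Record `row_id` under `target_idx`, returning an error rather than
/// panicking when `target_idx` is past the end of `assign_ops`.
fn push_assignment(
    assign_ops: &mut AssignOps,
    target_idx: usize,
    row_id: u64,
) -> Result<(), String> {
    let len = assign_ops.len();
    match assign_ops.get_mut(target_idx) {
        Some(ops) => {
            ops.push(row_id);
            Ok(())
        }
        None => Err(format!(
            "target partition {target_idx} is out of range (len {len})"
        )),
    }
}
```

With direct indexing, the same out-of-range access would panic at runtime, which is the failure mode the question is about.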
Xuanwo left a comment
Thank you for working on this!
Feature
What is the new feature?
This PR allows a single optimize_indices call on the v3 IVF incremental optimize path to split multiple oversized IVF partitions in one pass.
Why do we need this feature?
Previously, optimize could split at most one oversized partition per run. After large appends, several partitions can exceed the split threshold at the same time, which forced repeated optimize cycles to bring the index back into a healthy partition layout.
How does it work?
check_partition_adjustment now collects all split candidates from the current snapshot and keeps the existing single-partition join fallback.
Performance Improvement
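As a rough illustration of the candidate collection described under "How does it work?", the sketch below gathers every oversized partition from a snapshot of partition sizes and falls back to a single join candidate only when nothing needs splitting. The names (Adjustment, plan_adjustment), the two thresholds, and the rule of joining the smallest undersized partition are assumptions made for this example; they are not the PR's actual types or selection logic.

```rust
/// Hypothetical result type for this sketch; not the PR's actual type.
#[derive(Debug, PartialEq)]
enum Adjustment {
    /// Every partition whose size exceeds the split threshold.
    Split(Vec<usize>),
    /// A single undersized partition to merge into a neighbor.
    Join(usize),
    /// Nothing to do.
    NoOp,
}

/// Collect all split candidates from one snapshot of partition sizes.
/// If there are none, fall back to at most one join candidate
/// (assumed here to be the smallest partition below `join_threshold`).
fn plan_adjustment(
    partition_sizes: &[usize],
    split_threshold: usize,
    join_threshold: usize,
) -> Adjustment {
    let splits: Vec<usize> = partition_sizes
        .iter()
        .copied()
        .enumerate()
        .filter(|&(_, size)| size > split_threshold)
        .map(|(idx, _)| idx)
        .collect();
    if !splits.is_empty() {
        return Adjustment::Split(splits);
    }

    partition_sizes
        .iter()
        .copied()
        .enumerate()
        .filter(|&(_, size)| size < join_threshold)
        .min_by_key(|&(_, size)| size)
        .map(|(idx, _)| Adjustment::Join(idx))
        .unwrap_or(Adjustment::NoOp)
}
```

Because all candidates are taken from the same snapshot, one pass can plan every split up front instead of re-scanning after each individual split.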
What is the performance issue or bottleneck?
The initial multi-split implementation removed the functional limitation, but overlapping reassign partitions still had avoidable overhead:
How does this PR improve performance?
Testing
cargo test -p lance compute_reassign_candidate_moves_vectors_to_new_centroids
cargo test -p lance test_partition_split_on_append_multivec
cargo test -p lance test_split_multiple_partitions_in_one_optimize
cargo test -p lance test_join_partition_on_delete_multivec