
feat: support IVF partitions multi-split #6423

Merged: BubbleCal merged 4 commits into main from yang/split-multiple-partitions on Apr 8, 2026

@BubbleCal BubbleCal commented Apr 7, 2026

Feature

What is the new feature?

This PR allows a single optimize_indices call on the v3 IVF incremental optimize path to split multiple oversized IVF partitions in one pass.

Why do we need this feature?

Previously, optimize could split at most one oversized partition per run. After large appends, several partitions can exceed the split threshold at the same time, which forced repeated optimize cycles to bring the index back into a healthy partition layout.

How does it work?

  • check_partition_adjustment now collects all split candidates from the current snapshot and keeps the existing single-partition join fallback.
  • The multi-split path preserves existing partition ids and appends one new partition per split partition.
  • When any split happens, optimize continues to merge all existing delta indices in the same round, preserving the existing merge semantics.
  • Candidate rows from overlapping reassign partitions are resolved globally so the same row is moved at most once, choosing the best destination by distance.
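The last point above (one move per row, best destination by distance) can be sketched roughly as follows. This is only an illustration under assumptions: the `Move` struct and `resolve_moves` function are hypothetical names, not the types or functions used in this PR.

```rust
use std::collections::HashMap;

/// Hypothetical candidate move produced by one split request.
/// (Illustrative only; not the type used in the PR.)
#[derive(Debug, Clone, Copy, PartialEq)]
struct Move {
    row_id: u64,
    dest_partition: usize,
    distance: f32,
}

/// Keep at most one move per row id, preferring the smallest distance.
/// The map entry is updated in place, so no per-request move list is kept.
fn resolve_moves(candidates: impl IntoIterator<Item = Move>) -> HashMap<u64, Move> {
    let mut best: HashMap<u64, Move> = HashMap::new();
    for m in candidates {
        best.entry(m.row_id)
            .and_modify(|cur| {
                if m.distance < cur.distance {
                    *cur = m;
                }
            })
            .or_insert(m);
    }
    best
}

fn main() {
    // Row 7 is a candidate for two overlapping split requests.
    let candidates = vec![
        Move { row_id: 7, dest_partition: 4, distance: 0.9 },
        Move { row_id: 7, dest_partition: 5, distance: 0.3 },
        Move { row_id: 8, dest_partition: 4, distance: 0.5 },
    ];
    let best = resolve_moves(candidates);
    // Row 7 moves once, to the closer destination.
    assert_eq!(best.len(), 2);
    assert_eq!(best[&7].dest_partition, 5);
    println!("{best:?}");
}
```

Keying by row id means a row that appears in several overlapping reassign partitions still ends up with a single destination.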

Performance Improvement

What is the performance issue or bottleneck?

The initial multi-split implementation removed the functional limitation, but overlapping reassign partitions still had avoidable overhead:

  • split planning was done sequentially
  • split plans retained full raw vector payloads longer than necessary
  • a reused candidate partition recomputed its baseline distance to the original centroid for every overlapping split request

How does this PR improve performance?

  • split plans are now built with bounded parallelism across compute CPUs
  • split plans no longer retain raw partition vectors after producing the original-partition assign ops
  • each reused candidate partition is loaded once and computes its baseline distance once, then reuses that result across overlapping split requests
  • best candidate moves are updated in-place per row id instead of materializing intermediate move vectors per request
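The "compute the baseline distance once" idea could look something like the sketch below. It assumes a squared-L2 metric and a hypothetical `BaselineCache` keyed by partition index; neither is confirmed by the PR, and the actual implementation may structure this differently.

```rust
use std::collections::HashMap;

/// Squared L2 distance between two equal-length vectors.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Hypothetical cache of per-row baseline distances (row to its original
/// centroid), keyed by candidate partition index.
struct BaselineCache {
    by_partition: HashMap<usize, Vec<f32>>,
    /// Counts how many times distances were actually computed.
    computations: usize,
}

impl BaselineCache {
    fn new() -> Self {
        Self { by_partition: HashMap::new(), computations: 0 }
    }

    /// Return cached baseline distances, computing them only on first use.
    fn get_or_compute(
        &mut self,
        part_idx: usize,
        vectors: &[Vec<f32>],
        centroid: &[f32],
    ) -> &[f32] {
        if !self.by_partition.contains_key(&part_idx) {
            self.computations += 1;
            let dists: Vec<f32> = vectors.iter().map(|v| l2_sq(v, centroid)).collect();
            self.by_partition.insert(part_idx, dists);
        }
        &self.by_partition[&part_idx]
    }
}

fn main() {
    let vectors = vec![vec![0.0, 0.0], vec![3.0, 4.0]];
    let centroid = [0.0, 0.0];
    let mut cache = BaselineCache::new();
    // Two overlapping split requests touch the same candidate partition 3.
    for _request in 0..2 {
        let baseline = cache.get_or_compute(3, &vectors, &centroid);
        assert_eq!(baseline.len(), 2);
    }
    // The distances were computed once and reused for the second request.
    assert_eq!(cache.computations, 1);
}
```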

Testing

  • cargo test -p lance compute_reassign_candidate_moves_vectors_to_new_centroids
  • cargo test -p lance test_partition_split_on_append_multivec
  • cargo test -p lance test_split_multiple_partitions_in_one_optimize
  • cargo test -p lance test_join_partition_on_delete_multivec

github-actions Bot commented Apr 7, 2026

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@BubbleCal BubbleCal changed the title from "feat reduce overlap work in ivf multi-split" to "feat: support IVF partitions multi-split" on Apr 7, 2026
@github-actions github-actions Bot added the enhancement New feature or request label Apr 7, 2026

codecov Bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 90.73634% with 39 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/index/vector/builder.rs | 87.62% | 22 Missing and 15 partials ⚠️ |
| rust/lance/src/index/vector/ivf/v2.rs | 98.36% | 2 Missing ⚠️ |

    T::Native: Dot + L2 + Normalize,
    PrimitiveArray<T>: From<Vec<T::Native>>,
{
    let Some((row_ids, vectors)) = self.load_partition_raw_vectors(part_idx).await? else {
Collaborator

load_partition_raw_vectors flattens each multivector into individual vectors, so one row id can map to several vectors. Could this cause multivector data to go missing from the index?
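To make the concern concrete, here is a small sketch of what flattening does to row ids. The `flatten` and `distinct_row_ids` helpers are hypothetical and are not Lance APIs; whether the real code path ever deduplicates by row id is exactly the open question in this comment.

```rust
use std::collections::HashSet;

/// Flatten multivector rows into parallel (row_id, vector) lists.
/// A row id is repeated once per sub-vector. (Hypothetical helper.)
fn flatten(rows: &[(u64, Vec<Vec<f32>>)]) -> (Vec<u64>, Vec<Vec<f32>>) {
    let mut row_ids = Vec::new();
    let mut vectors = Vec::new();
    for (row_id, multivec) in rows {
        for v in multivec {
            row_ids.push(*row_id);
            vectors.push(v.clone());
        }
    }
    (row_ids, vectors)
}

/// Number of distinct row ids in a flattened list.
fn distinct_row_ids(row_ids: &[u64]) -> usize {
    row_ids.iter().collect::<HashSet<_>>().len()
}

fn main() {
    // Row 10 has two sub-vectors, row 11 has one.
    let rows = vec![
        (10u64, vec![vec![1.0f32], vec![2.0]]),
        (11, vec![vec![3.0]]),
    ];
    let (row_ids, vectors) = flatten(&rows);
    assert_eq!(row_ids, vec![10, 10, 11]);
    // Three flat vectors but only two distinct row ids: any structure keyed
    // purely by row id would keep one entry for row 10 and drop the other.
    assert_eq!(vectors.len(), 3);
    assert_eq!(distinct_row_ids(&row_ids), 2);
}
```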

new_centroids.extend(centroids.iter().map(|vec| vec.unwrap()));
let split_plans = stream::iter(split_partitions.iter().copied().enumerate())
    .map(|(split_order, part_idx)| async move {
        let centroid2_part_idx = ivf.num_partitions() + split_order;
Collaborator

centroid2_part_idx is computed from ivf.num_partitions(). Could some partitions have been filtered out, so that the centroid2_part_idx in the plan exceeds the actual number of partitions? If so, the subsequent assign_ops[*target_idx] access might panic.
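A minimal sketch of the indexing scheme under discussion, with hypothetical helper names (`plan_new_partition_ids`, `try_push`) and a plain `Vec<Vec<u64>>` standing in for `assign_ops`. It only illustrates how `num_partitions + split_order` relates to the length of the target collection, and how a checked access would avoid a panic; it does not claim the real code is or is not affected.

```rust
/// Pair each split partition with the id of the new partition it produces:
/// existing ids are preserved, new ids are appended after them.
fn plan_new_partition_ids(
    num_partitions: usize,
    split_partitions: &[usize],
) -> Vec<(usize, usize)> {
    split_partitions
        .iter()
        .copied()
        .enumerate()
        .map(|(split_order, part_idx)| (part_idx, num_partitions + split_order))
        .collect()
}

/// Checked variant of `assign_ops[target].push(row_id)`.
/// Returns false instead of panicking when `target` is out of range.
fn try_push(assign_ops: &mut [Vec<u64>], target: usize, row_id: u64) -> bool {
    match assign_ops.get_mut(target) {
        Some(ops) => {
            ops.push(row_id);
            true
        }
        None => false,
    }
}

fn main() {
    let num_partitions = 4;
    let plan = plan_new_partition_ids(num_partitions, &[1, 3]);
    assert_eq!(plan, vec![(1, 4), (3, 5)]);

    // Sized for existing plus new partitions: every target is in range.
    let mut assign_ops: Vec<Vec<u64>> = vec![Vec::new(); num_partitions + plan.len()];
    for (_, target) in &plan {
        assert!(try_push(&mut assign_ops, *target, 42));
    }

    // If the collection were shorter (e.g. built from a filtered partition
    // list), target 5 would be out of range and plain indexing would panic.
    let mut too_short: Vec<Vec<u64>> = vec![Vec::new(); 5];
    assert!(!try_push(&mut too_short, 5, 42));
}
```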

Collaborator

@Xuanwo Xuanwo left a comment

Thank you for working on this!

@BubbleCal BubbleCal merged commit 5310f36 into main Apr 8, 2026
29 checks passed
@BubbleCal BubbleCal deleted the yang/split-multiple-partitions branch April 8, 2026 11:51