fix: do no work when optimize_indices is called but no new data has arrived#6711
Conversation
Repeated optimize_indices() calls on a steady-state dataset were rewriting identical files on every invocation. For each scalar index, the merge path unconditionally called update() and produced a new index file even with an empty unindexed input; for vector indices, the default-options auto-rebalance introduced in lance-format#6402 fired whenever multiple segments existed, which is the common state immediately after a previous optimize. Skip an index group at the top of the per-index loop when: - no fragments are unindexed for that group, AND - the caller did not request retrain, AND - num_indices_to_merge is None or 0 (no explicit delta merge). Pre-0.8 indices without a fragment_bitmap are conservatively treated as still needing optimize. The auto-rebalance path is unreachable through default options after this change; its regression test is marked #[ignore] until the feature is reintroduced behind an explicit opt-in. Adds a Python regression test that builds one index of each type, appends a fragment, runs optimize_indices() twice, and asserts the second call adds no files to the dataset directory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… check The previous commit's top-level skip in optimize_indices made the single-segment auto-rebalance from lance-format#6402 unreachable, since rebalance fires precisely when there are no unindexed fragments. That path is the real reason the regression test in lance-format#6402 exists, so the test was being ignored rather than maintained. Restructure so the top-level skip applies only to scalar indices (which have no rebalance concept) and add a rebalance-aware no-op inside the vector branch of merge_indices_with_unindexed_frags: when there is no new data and no retrain/merge requested, call select_segment_for_single_rebalance and bail when it returns no candidate. This preserves both behaviors: - A truly steady-state dataset is a no-op (the user's bug). - A skewed multi-segment dataset still self-heals one segment per call on the default path. Removes the #[ignore] from test_segmented_optimize_rebalances_only_one_segment and adjusts the python regression test to use num_partitions=1 so its small dataset isn't perpetually below MIN_PARTITION_SIZE_PERCENT and caught by the join-rebalance heuristic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
@claude review once |
| if unindexed.is_empty() | ||
| && !options.retrain | ||
| && options.num_indices_to_merge.is_none() | ||
| && select_segment_for_single_rebalance(&ivf_view)?.is_none() | ||
| { | ||
| return Ok(None); | ||
| } |
There was a problem hiding this comment.
🔴 The new vector no-op check at append.rs:197 uses options.num_indices_to_merge.is_none(), but the matching scalar no-op check in index.rs uses is_none_or(|n| n == 0). OptimizeOptions::append() is defined as num_indices_to_merge: Some(0) and is used by real callers (e.g. cleanup.rs and tests via optimize_indices(&OptimizeOptions::append())), so on a steady-state vector index a second append() call still falls through to optimize_vector_indices, writes a new UUID directory + manifest, and reproduces exactly the regression this PR is trying to fix. Suggest using is_none_or(|n| n == 0) on append.rs:197 (and the matching use_single_segment_rebalance gate two lines below) to align the two checks with the PR description ("None or Some(0)").
Extended reasoning...
What the bug is
The PR description states the no-op should fire when num_indices_to_merge is None or Some(0). The two checks the PR introduces don't agree on what "no merge requested" means:
- Scalar path (
rust/lance/src/index.rs:1260):options.num_indices_to_merge.is_none_or(|n| n == 0)— correctly covers bothNoneandSome(0). - Vector path (
rust/lance/src/index/append.rs:197):options.num_indices_to_merge.is_none()— only coversNone, notSome(0).
The pre-existing use_single_segment_rebalance gate two lines below (append.rs:204) is also is_none(), so for Some(0) neither gate fires.
Why Some(0) is a real, reachable input
OptimizeOptions::append() in rust/lance-index/src/optimize.rs:78-84 is defined as:
pub fn append() -> Self {
Self {
num_indices_to_merge: Some(0),
...
}
}This is a public entry point used by real Rust callers — notably rust/lance/src/dataset/cleanup.rs (auto-cleanup) calls optimize_indices(&OptimizeOptions::append()), and the test suite (rust/lance/src/index/append.rs:569) does the same. The Python regression test added in this PR uses the default kwargs, which build OptimizeOptions::default() (None), so the test happens to take the path that does work and never exercises Some(0).
Step-by-step proof on a steady-state vector index
- Create a dataset with an
IVF_PQindex covering all fragments, no unindexed fragments. - Call
dataset.optimize_indices(&OptimizeOptions::append())—options.num_indices_to_merge == Some(0). - In
Dataset::optimize_indices(index.rs), the new scalar short-circuit at line 1260 hasindex_group_is_scalar == falsefor the vector group, so thecontinueis not taken. Control reachesmerge_indices(..). merge_indicescallsmerge_indices_with_unindexed_fragswithunindexed == [].- At
append.rs:191-200, the new vector no-op check evaluatesunindexed.is_empty() && !options.retrain && options.num_indices_to_merge.is_none() && select_segment_for_single_rebalance(&ivf_view)?.is_none(). The third conjunct (is_none()) is false forSome(0), so the no-op short-circuit does not returnNone. - At
append.rs:202-206,use_single_segment_rebalanceis gated onoptions.num_indices_to_merge.is_none()— also false forSome(0). So we fall into the else branch (line 271+). new_data_streamisNone(unindexed.is_empty()).optimize_vector_indicesis called.- In
rust/lance/src/index/vector/ivf.rsaround line 415,let writer = object_store.create(&index_file).await?unconditionally creates a new UUID-named index file. - Back in
Dataset::optimize_indices, this populatesnew_indiceswith a non-emptyIndexMergeResults, so theif new_indices.is_empty() { return Ok(()) }short-circuit does not fire, and the function commits a new manifest + transaction. - Result: every call to
optimize_indices(&OptimizeOptions::append())on a steady-state vector index writes a new UUID directory and bumps the dataset version — the exact symptom the PR aims to fix.
Impact
OptimizeOptions::append() is documented as an entry point and is wired into auto-cleanup, so this isn't a hypothetical input. Any Rust caller (or anyone building OptimizeOptions { num_indices_to_merge: Some(0), .. } directly) on a vector index keeps re-emitting manifests and index files indefinitely. The PR's own description ("None or Some(0)") states the intended invariant — only half of the implementation enforces it.
Suggested fix
Change line 197 and line 204 of rust/lance/src/index/append.rs from:
&& options.num_indices_to_merge.is_none()to:
&& options.num_indices_to_merge.is_none_or(|n| n == 0)to match the scalar path and the documented intent. Strengthening the Python regression test to also call optimize_indices(num_indices_to_merge=0) would lock this in.
wjones127
left a comment
There was a problem hiding this comment.
I think the claude comment is valid, but not an important edge case.
Thanks for implementing this!
The new vector no-op check used num_indices_to_merge.is_none() while the scalar gate at the top of optimize_indices uses is_none_or(|n| n == 0). OptimizeOptions::append() is Some(0) and is used by real callers (cleanup.rs, internal tests), so on a steady-state vector index a second append() call still fell through to optimize_vector_indices and wrote a new UUID directory + manifest — reproducing the exact regression this PR is meant to fix. Switch both the no-op check and the use_single_segment_rebalance gate two lines below to is_none_or(|n| n == 0) so default() and append() behave the same on a steady state: bail when no segment needs rebalancing; otherwise enter the single-segment rebalance branch. Add a Rust regression test in index::append::tests that exercises OptimizeOptions::append() exactly: build a 1-partition IVF_PQ index, append a fragment, call optimize_indices(append()) once to fold it in, then call it again and assert the manifest version is unchanged and no new index directory was created. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for this fix! One thing I noted: this "fast path" is gated on Am I reading this wrong? Is this intended behavior? |
I think you're right, we could do better here. I guess it shouldn't matter at all what the value of |
Summary
optimize_indices()a no-op when no fragment is unindexed and the caller hasn't asked forretrainornum_indices_to_merge >= 1. Without this, a steady-state dataset (everything already indexed) had every call rewrite identical scalar-index files and trigger a vector single-segment auto-rebalance, producing a new manifest, transaction file, and per-index files on every call.optimize_indices()twice, and asserts the second call writes zero new files under the dataset directory.Why
Two independent paths were both writing files when there was nothing to do:
merge_indicesinrust/lance/src/index/append.rsalways calledindex.update(...)even whenunindexed.is_empty(), producing a new UUID directory with identical contents.num_segments > 1 && num_indices_to_merge.is_none() && !retrain && unindexed.is_empty(). After a prioroptimize_indices()had folded a delta into its own segment, every subsequent call re-rebalanced one segment.The user-visible symptom was that calling
optimize_indices()in a loop kept bumping the dataset version and creating index files indefinitely, even when the dataset hadn't changed.Approach
The fix lives in
Dataset::optimize_indicesinrust/lance/src/index.rs. Before callingmerge_indicesfor each grouped delta set, we now skip the group when:retrain = true, andnum_indices_to_mergeisNoneorSome(0)(no explicit delta consolidation requested), andfragment_bitmaps.The bitmap union is computed from the metadata we already loaded (no extra IO). Pre-0.8 indices without a
fragment_bitmapare treated conservatively — they still go through the merge path so they pick up the new format on first optimize.If any index group has work to do, the rest of the function is unchanged; the new check is purely a
continue. When every group is skipped, the existingif new_indices.is_empty() { return Ok(()); }short-circuit prevents any commit, so no manifest or transaction file is written.Test plan
test_optimize_indices_second_call_is_nooppassespython/tests/test_optimize.py,python/tests/test_scalar_index.py,python/tests/test_vector_index.py— 225 passed, 10 skipped (pre-existing), 1 deselected (unrelated pre-existing failure intest_create_index_progress_callback_error_before_completion_propagates)cargo test -p lance --lib index::— 389 passed, 1 ignored (the auto-rebalance test noted above)cargo clippy --lib --tests -- -D warnings— cleanNotes for reviewers
__lance_frag_reuse,__lance_mem_wal); those are filtered out before this loop by the existingis_system_indexcheck.