feat: opt-in env var to drop centroids from vector index stats #6654

Merged
westonpace merged 3 commits into lance-format:main from westonpace:fix-drop-centroids-index-stats
Apr 30, 2026

@westonpace (Member) commented Apr 30, 2026

Summary

Vector index statistics() serializes every centroid as JSON. For very large IVF indexes this can use a significant amount of memory and risks OOMing callers that just want metadata. We can't drop centroids outright without breaking existing consumers (e.g. the Python LanceIndex.centroids() helper), so this PR introduces an opt-in env var and a deprecation path:

  • LANCE_INCLUDE_VECTOR_CENTROIDS=false (or 0, case-insensitive): the stats method skips reading centroids entirely; the centroids field is omitted from the JSON (not serialized as null).
  • LANCE_INCLUDE_VECTOR_CENTROIDS=true (or 1, or any other non-falsy value): current behavior, no warning.
  • Unset: current behavior, plus a one-time warn! per process explaining that the default will change in a future release and how to lock in either behavior.

The check is centralized in a new maybe_centroids_for_stats helper so both the legacy IVFIndex::statistics (rust/lance/src/index/vector/ivf.rs) and the V2/V3 IvfIndex::statistics (rust/lance/src/index/vector/ivf/v2.rs) share the same gate. The struct field is now Option<Vec<Vec<f32>>> with #[serde(skip_serializing_if = "Option::is_none")] so the JSON shape is unchanged whenever centroids are included.
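To make the "omitted, not null" distinction concrete, here is a minimal std-only sketch. It hand-rolls the JSON output instead of using serde, so `StatsSketch` and `to_json` are illustrative stand-ins and not the actual `IvfIndexStatistics` type; the real code relies on `#[serde(skip_serializing_if = "Option::is_none")]` to get the same shape.

```rust
/// Hypothetical, trimmed-down stand-in for the stats struct.
struct StatsSketch {
    num_partitions: usize,
    // `None` means "centroids were not read"; the key is then left out entirely.
    centroids: Option<Vec<Vec<f32>>>,
}

impl StatsSketch {
    /// Hand-rolled JSON writer that mimics `skip_serializing_if = "Option::is_none"`.
    fn to_json(&self) -> String {
        let mut fields = vec![format!("\"num_partitions\":{}", self.num_partitions)];
        if let Some(centroids) = &self.centroids {
            let rows: Vec<String> = centroids
                .iter()
                .map(|row| {
                    let cells: Vec<String> = row.iter().map(|v| v.to_string()).collect();
                    format!("[{}]", cells.join(","))
                })
                .collect();
            fields.push(format!("\"centroids\":[{}]", rows.join(",")));
        }
        // When `centroids` is `None`, no "centroids" key (and no `null`) is emitted.
        format!("{{{}}}", fields.join(","))
    }
}
```

With `centroids: None` the output contains only the remaining keys, so a consumer that checks for the presence of `"centroids"` can tell "not included" apart from "included but empty".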

Behavior matrix

| LANCE_INCLUDE_VECTOR_CENTROIDS | Reads centroids? | centroids in JSON? | Warning? |
|---|---|---|---|
| unset | yes | yes | one-time warn! |
| true / 1 / other non-falsy | yes | yes | no |
| false / FALSE / 0 | no | omitted | no |
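The matrix above can be sketched as a small gate. This is a rough illustration under stated assumptions, not the helper merged in this PR: `include_centroids` and `centroids_for_stats_sketch` are hypothetical names, the raw env value is passed in as a parameter so the logic can be exercised without mutating the process environment, and `eprintln!` stands in for `log::warn!` to keep the block std-only.

```rust
use std::sync::Once;

/// Env var name taken from the PR description.
const LANCE_INCLUDE_VECTOR_CENTROIDS_ENV: &str = "LANCE_INCLUDE_VECTOR_CENTROIDS";

/// Guards the deprecation warning so it is emitted at most once per process.
static WARN_ONCE: Once = Once::new();

/// Decide whether centroids should be included, given the raw env value
/// (`None` = unset). Per the matrix, only `false` (any case) and `0` opt out.
fn include_centroids(raw: Option<&str>) -> bool {
    match raw {
        None => {
            // Unset: keep current behavior, but warn once about the future default.
            WARN_ONCE.call_once(|| {
                eprintln!(
                    "{} is unset; centroids are still included, but the default may change",
                    LANCE_INCLUDE_VECTOR_CENTROIDS_ENV
                );
            });
            true
        }
        Some(value) => !(value.eq_ignore_ascii_case("false") || value == "0"),
    }
}

/// Hypothetical gate: `load` is only invoked when centroids are wanted,
/// so the opt-out path never pays for reading them.
fn centroids_for_stats_sketch<F>(raw: Option<&str>, load: F) -> Option<Vec<Vec<f32>>>
where
    F: FnOnce() -> Vec<Vec<f32>>,
{
    include_centroids(raw).then(load)
}
```

A real caller would presumably pass something like `std::env::var(LANCE_INCLUDE_VECTOR_CENTROIDS_ENV).ok().as_deref()` as `raw`. Taking the loader as a closure is the design point: on the `false` / `0` row the centroids are never materialized at all.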

Files changed

  • rust/lance/src/index/vector/ivf.rs — struct field becomes Option<...> with skip_serializing_if; new LANCE_INCLUDE_VECTOR_CENTROIDS_ENV constant and maybe_centroids_for_stats helper (uses std::sync::Once so the deprecation warning fires at most once per process); legacy statistics() routed through the helper; two new unit tests.
  • rust/lance/src/index/vector/ivf/v2.rs — V2/V3 statistics() routed through the same helper.

Migration guidance for downstream consumers

  • Code that currently reads stats["centroids"] will keep working unchanged (still emitted by default; warning is informational).
  • Callers who don't need centroids can set LANCE_INCLUDE_VECTOR_CENTROIDS=false today to avoid the memory cost.
  • Callers who do need centroids should set LANCE_INCLUDE_VECTOR_CENTROIDS=true to silence the warning and lock in the existing behavior ahead of the future default flip.
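For Rust callers that launch a separate process to collect index stats, one way to follow the guidance above is to pin the variable on the child explicitly. This is a minimal sketch: `pin_centroid_behavior` is a hypothetical helper, and only `std::process::Command::env` is used.

```rust
use std::process::Command;

/// Pin the centroid behavior for a child process by setting the env var explicitly.
/// `include = true` keeps centroids and silences the warning; `false` opts out.
fn pin_centroid_behavior(cmd: &mut Command, include: bool) -> &mut Command {
    cmd.env(
        "LANCE_INCLUDE_VECTOR_CENTROIDS",
        if include { "true" } else { "false" },
    )
}
```

Setting the value on the child, and not on the current process, avoids mutating global state that other threads may be reading.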

Test plan

  • cargo test -p lance --lib maybe_centroids_for_stats — new test covers unset, true, 1, false, FALSE, 0; serialized via #[serial_test::serial(LANCE_INCLUDE_VECTOR_CENTROIDS)] so it doesn't race other env-var tests.
  • cargo test -p lance --lib stats_centroids_omitted — asserts the centroids field is fully absent from the serialized JSON when disabled (not serialized as null).
  • cargo test -p lance --lib test_index_stats — pre-existing IVF stats tests (IVF_FLAT/Hamming, IVF_PQ/L2, IVF_HNSW_SQ/Cosine, plus empty-partition) still pass against real datasets.
  • cargo clippy -p lance --tests -- -D warnings
  • cargo fmt --check -p lance

Notes / non-goals

  • No Python or Java binding changes, apart from removing one already-broken Python method.
  • No change to the legacy IndexFileVersion::Legacy JSON shape beyond the optional centroids field — the rest of IvfIndexStatistics is untouched.
  • The deprecation warning intentionally uses log::warn! (already imported in ivf.rs) rather than tracing to match the surrounding module's logging style.

🤖 Generated with Claude Code

…ids in stats

Vector index statistics include centroids serialized as JSON, which can
balloon memory for large indexes. Introduce LANCE_INCLUDE_VECTOR_CENTROIDS
to allow opting out without breaking existing callers: when set to false,
the stats method skips reading centroids; when unset, the current behavior
is preserved with a one-time deprecation warning; when set to true, the
behavior is preserved silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the enhancement New feature or request label Apr 30, 2026
Use the shared str_is_truthy helper from lance-core for parsing
LANCE_INCLUDE_VECTOR_CENTROIDS so the accepted truthy values match the
rest of the codebase (1/true/on/yes/y, case-insensitive).

Remove VectorIndexReader.centroids in the Python bindings: it called
self.dataset._ds.get_index_centroids, which has no Rust binding, so the
method already raised AttributeError on every call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

return self.stats["indices"][0]["num_partitions"]

def centroids(self) -> np.ndarray:
Member Author


This method was already broken (there is no get_index_centroids method) so I'm just pulling it out. This may be the shape we want for retrieving centroids in the future or it may not. Either way, we can address that in a follow-up.

Replace the direct str_is_truthy + std::env::var dance with the
parse_env_as_bool helper from lance-core. The unset-vs-set distinction
is still needed for the one-time deprecation warning, so retain a
single std::env::var(...).is_err() check just for the warning gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
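The two follow-up commits separate "is the variable set at all?" (which gates the warning) from "what does its value mean?" (which is delegated to a shared parser). The sketch below illustrates only that split. `is_truthy_sketch` and `should_warn` are hypothetical stand-ins, not lance-core's `str_is_truthy` or `parse_env_as_bool`, whose signatures and handling of unrecognized values are not shown in this thread.

```rust
/// Stand-in for the truthy check described in the commit message:
/// 1/true/on/yes/y, case-insensitive.
fn is_truthy_sketch(value: &str) -> bool {
    ["1", "true", "on", "yes", "y"]
        .iter()
        .any(|truthy| value.eq_ignore_ascii_case(truthy))
}

/// The warning gate depends only on whether the variable is set,
/// not on how its value parses. `None` models an unset variable.
fn should_warn(raw: Option<&str>) -> bool {
    raw.is_none()
}
```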
@westonpace westonpace marked this pull request as ready for review April 30, 2026 18:21
@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

codecov Bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 91.80328% with 5 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/index/vector/ivf.rs | 93.33% | 2 Missing and 2 partials ⚠️ |
| rust/lance/src/index/vector/ivf/v2.rs | 0.00% | 0 Missing and 1 partial ⚠️ |

@westonpace westonpace merged commit 0d30709 into lance-format:main Apr 30, 2026
30 checks passed
Labels

enhancement New feature or request python
