feat: opt-in env var to drop centroids from vector index stats #6654
Merged
westonpace merged 3 commits into lance-format:main on Apr 30, 2026
…ids in stats Vector index statistics include centroids serialized as JSON, which can balloon memory for large indexes. Introduce LANCE_INCLUDE_VECTOR_CENTROIDS to allow opting out without breaking existing callers: when set to false, the stats method skips reading centroids; when unset, the current behavior is preserved with a one-time deprecation warning; when set to true, the behavior is preserved silently. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Use the shared str_is_truthy helper from lance-core for parsing LANCE_INCLUDE_VECTOR_CENTROIDS so the accepted truthy values match the rest of the codebase (1/true/on/yes/y, case-insensitive). Remove VectorIndexReader.centroids in the Python bindings: it called self.dataset._ds.get_index_centroids, which has no Rust binding, so the method already raised AttributeError on every call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
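The accepted truthy values listed in this commit message can be illustrated with a small stand-in. This is only a sketch: `is_truthy_sketch` is a hypothetical name, not the actual `str_is_truthy` helper from lance-core (its implementation is not shown on this page). It simply encodes the set 1/true/on/yes/y, matched case-insensitively.

```rust
/// Hypothetical stand-in for a truthy-string check (not the lance-core helper).
/// Returns true for 1/true/on/yes/y, ignoring ASCII case.
fn is_truthy_sketch(value: &str) -> bool {
    const TRUTHY: [&str; 5] = ["1", "true", "on", "yes", "y"];
    TRUTHY.iter().any(|&t| value.eq_ignore_ascii_case(t))
}
```

In this sketch, any string outside the listed set (including the empty string) is treated as not truthy.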
westonpace commented Apr 30, 2026

    return self.stats["indices"][0]["num_partitions"]

    def centroids(self) -> np.ndarray:

This method was already broken (there is no get_index_centroids method) so I'm just pulling it out. This may be the shape we want for retrieving centroids in the future or it may not. Either way, we can address that in a follow-up.
Replace the direct str_is_truthy + std::env::var dance with the parse_env_as_bool helper from lance-core. The unset-vs-set distinction is still needed for the one-time deprecation warning, so retain a single std::env::var(...).is_err() check just for the warning gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
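The split described in this commit message (one boolean parse, plus a separate unset check that gates the one-time warning) can be sketched with the standard library alone. Everything named here is an assumption for illustration: `include_centroids_sketch`, its `raw` parameter, the counter, and the `eprintln!` stand-in for the logging macro. The real `parse_env_as_bool` signature is not shown on this page, so the parse below is a placeholder that assumes the truthy set 1/true/on/yes/y.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;

/// Guards the deprecation notice so it is emitted at most once per process.
static WARN_ONCE: Once = Once::new();
/// Counts emitted notices; exists only so the sketch can be checked.
static WARNINGS_EMITTED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical gate. `raw` is the env var value, or `None` when it is unset,
/// e.g. `std::env::var("LANCE_INCLUDE_VECTOR_CENTROIDS").ok()`.
fn include_centroids_sketch(raw: Option<&str>) -> bool {
    match raw {
        None => {
            // Unset: keep including centroids, but warn once per process.
            WARN_ONCE.call_once(|| {
                WARNINGS_EMITTED.fetch_add(1, Ordering::SeqCst);
                eprintln!(
                    "LANCE_INCLUDE_VECTOR_CENTROIDS is unset; \
                     the default will change in a future release"
                );
            });
            true
        }
        // Set: placeholder parse (assumed truthy set, ASCII case-insensitive).
        Some(v) => ["1", "true", "on", "yes", "y"]
            .iter()
            .any(|&t| v.eq_ignore_ascii_case(t)),
    }
}
```

Taking the raw value as a parameter, rather than reading the environment inside the function, keeps the sketch testable without mutating process-wide state.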
jackye1995 approved these changes Apr 30, 2026
Summary
Vector index `statistics()` serializes every centroid as JSON. For very large IVF indexes this can use a significant amount of memory and risks OOMing callers that just want metadata. We can't drop centroids outright without breaking existing consumers (e.g. the Python `LanceIndex.centroids()` helper), so this PR introduces an opt-in env var and a deprecation path:

- `LANCE_INCLUDE_VECTOR_CENTROIDS=false` (or `0`, case-insensitive): the stats method skips reading centroids entirely; the `centroids` field is omitted from the JSON (not serialized as `null`).
- `LANCE_INCLUDE_VECTOR_CENTROIDS=true` (or `1`, or any other non-falsy value): current behavior, no warning.
- Unset: current behavior, plus one `warn!` per process explaining that the default will change in a future release and how to lock in either behavior.

The check is centralized in a new `maybe_centroids_for_stats` helper so both the legacy `IVFIndex::statistics` (`rust/lance/src/index/vector/ivf.rs`) and the V2/V3 `IvfIndex::statistics` (`rust/lance/src/index/vector/ivf/v2.rs`) share the same gate. The struct field is now `Option<Vec<Vec<f32>>>` with `#[serde(skip_serializing_if = "Option::is_none")]` so the JSON shape is unchanged whenever centroids are included.

Behavior matrix
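| `LANCE_INCLUDE_VECTOR_CENTROIDS` | `centroids` in JSON? | `warn!` |
| --- | --- | --- |
| unset | yes | once per process |
| `true` / `1` / other non-falsy | yes | no |
| `false` / `FALSE` / `0` | no (field omitted) | no |

Files changed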
- `rust/lance/src/index/vector/ivf.rs`: struct field becomes `Option<...>` with `skip_serializing_if`; new `LANCE_INCLUDE_VECTOR_CENTROIDS_ENV` constant and `maybe_centroids_for_stats` helper (uses `std::sync::Once` so the deprecation warning fires at most once per process); legacy `statistics()` routed through the helper; two new unit tests.
- `rust/lance/src/index/vector/ivf/v2.rs`: V2/V3 `statistics()` routed through the same helper.

Migration guidance for downstream consumers
- `stats["centroids"]` will keep working unchanged (still emitted by default; the warning is informational).
- Set `LANCE_INCLUDE_VECTOR_CENTROIDS=false` today to avoid the memory cost.
- Set `LANCE_INCLUDE_VECTOR_CENTROIDS=true` to silence the warning and lock in the existing behavior ahead of the future default flip.

Test plan
- `cargo test -p lance --lib maybe_centroids_for_stats`: new test covers unset, `true`, `1`, `false`, `FALSE`, `0`; serialized via `#[serial_test::serial(LANCE_INCLUDE_VECTOR_CENTROIDS)]` so it doesn't race other env-var tests.
- `cargo test -p lance --lib stats_centroids_omitted`: asserts the `centroids` field is fully absent from the serialized JSON when disabled (not serialized as `null`).
- `cargo test -p lance --lib test_index_stats`: pre-existing IVF stats tests (IVF_FLAT/Hamming, IVF_PQ/L2, IVF_HNSW_SQ/Cosine, plus empty-partition) still pass against real datasets.
- `cargo clippy -p lance --tests -- -D warnings`
- `cargo fmt --check -p lance`

Notes / non-goals
- The `IndexFileVersion::Legacy` JSON shape is unchanged beyond the optional `centroids` field; the rest of `IvfIndexStatistics` is untouched.
- The warning uses `log::warn!` (already imported in `ivf.rs`) rather than `tracing` to match the surrounding module's logging style.

🤖 Generated with Claude Code
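The "omitted, not `null`" behavior noted in the summary and test plan can be illustrated without any crates. The sketch below hand-builds a tiny JSON object; it is not how the PR does it (the PR relies on `#[serde(skip_serializing_if = "Option::is_none")]`). The function name `stats_json_sketch` and the two-field payload are assumptions chosen for illustration.

```rust
/// Hypothetical, hand-rolled rendering of a two-field stats object.
/// The `centroids` key is written only when a value is present, which mirrors
/// the effect of skipping a `None` field during serialization.
fn stats_json_sketch(num_partitions: usize, centroids: Option<&[Vec<f32>]>) -> String {
    let mut out = format!("{{\"num_partitions\":{}", num_partitions);
    if let Some(rows) = centroids {
        // Render each centroid as a JSON array of numbers.
        let rendered: Vec<String> = rows
            .iter()
            .map(|row| {
                let cells: Vec<String> = row.iter().map(|x| x.to_string()).collect();
                format!("[{}]", cells.join(","))
            })
            .collect();
        out.push_str(&format!(",\"centroids\":[{}]", rendered.join(",")));
    }
    out.push('}');
    out
}
```

A consumer of such output would need to treat a missing `centroids` key as "not included" rather than expecting an explicit `null`.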