feat(index): serializable cache for the BTree scalar index #6793
Merged
The whole BTree index was cached at the top level as `Arc<dyn ScalarIndex>` under an `UnsizedCacheKey`, which can never carry a codec, so a persistent cache backend could not serialize it. Make top-level scalar index caching a plugin implementation detail via the existing `ScalarIndexPlugin::get_from_cache` / `put_in_cache` hooks. The default impls keep today's in-memory unsized caching; `BTreeIndexPlugin` overrides them to store a `BTreeIndexState` (the `page_lookup.lance` batch plus `batch_size` and the range-partition map) under a sized, codec-backed key, and reconstructs the index from that state. `get_from_cache` gains `index_store` / `frag_reuse_index` args so an override can rebuild without re-reading metadata. Also wire `BTreePageKey::codec()` so cached `FlatIndex` pages are serializable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
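The default-versus-override split described above can be illustrated with a minimal, synchronous sketch. Everything here (`Index`, `Cache`, `Plugin`, `ToyBTree`, `ToyBTreePlugin`) is a hypothetical stand-in rather than the actual `lance-index` API, and the real hooks take further arguments such as `index_store` and `frag_reuse_index`, which this sketch omits. It only shows the shape of the idea: the default hooks cache an opaque object that cannot be persisted, while the override stores bytes and rebuilds the index from them.

```rust
use std::any::Any;
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a scalar index trait object.
pub trait Index: Any {
    fn as_any(&self) -> &dyn Any;
    fn describe(&self) -> String;
}

/// Toy cache with two kinds of entries: opaque in-memory objects (no codec,
/// so they cannot be persisted) and byte blobs (which can be).
#[derive(Default)]
pub struct Cache {
    objects: HashMap<&'static str, Arc<dyn Index>>,
    blobs: HashMap<&'static str, Vec<u8>>,
}

impl Cache {
    /// Number of entries a persistent backend could write out.
    pub fn serializable_entries(&self) -> usize {
        self.blobs.len()
    }
}

/// Hypothetical plugin trait: the default hooks keep in-memory object caching.
pub trait Plugin {
    fn put_in_cache(&self, cache: &mut Cache, index: Arc<dyn Index>) {
        cache.objects.insert("index", index);
    }
    fn get_from_cache(&self, cache: &Cache) -> Option<Arc<dyn Index>> {
        cache.objects.get("index").cloned()
    }
}

/// Toy index whose whole state is a single number.
pub struct ToyBTree {
    pub batch_size: u64,
}

impl Index for ToyBTree {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn describe(&self) -> String {
        format!("btree(batch_size={})", self.batch_size)
    }
}

/// Uses the default hooks unchanged.
pub struct DefaultPlugin;
impl Plugin for DefaultPlugin {}

/// Overrides both hooks to store serialized state instead of the object.
pub struct ToyBTreePlugin;
impl Plugin for ToyBTreePlugin {
    fn put_in_cache(&self, cache: &mut Cache, index: Arc<dyn Index>) {
        // Only the concrete type knows how to turn itself into bytes.
        if let Some(btree) = index.as_any().downcast_ref::<ToyBTree>() {
            cache.blobs.insert("index", btree.batch_size.to_le_bytes().to_vec());
        }
    }
    fn get_from_cache(&self, cache: &Cache) -> Option<Arc<dyn Index>> {
        let bytes = cache.blobs.get("index")?;
        if bytes.len() != 8 {
            return None; // treat malformed state as a cache miss
        }
        let mut raw = [0u8; 8];
        raw.copy_from_slice(bytes);
        let rebuilt: Arc<dyn Index> = Arc::new(ToyBTree {
            batch_size: u64::from_le_bytes(raw),
        });
        Some(rebuilt)
    }
}
```

Keeping the override inside the plugin means the caller never needs to know which index types have a serializable form.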
…cache Adds the BTree analogue of test_fts_prewarm_with_serializing_backend_serves_query_with_no_io: after prewarming a multi-page BTree index through a cache backend that serializes every entry via its codec, an indexed-filter query reconstructs the index and every page it touches from the cache with zero read IOPS. Exercises the BTreeIndexState and FlatIndex CacheCodec round-trips end to end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Xuanwo (Collaborator) approved these changes on May 15, 2026 and left a comment:
Thank you for working on this!
- assert SearchResult equality directly instead of via Debug-string formatting.
- add test_btree_index_state_rejects_unknown_version covering the forward-compat version guard in BTreeIndexState::deserialize.
- add test_btree_index_state_reconstruct_applies_frag_reuse_index to verify the frag_reuse_index argument is threaded through reconstruct into the rebuilt BTreeIndex.
- add test_btree_index_state_range_partitioned_plugin_cache_roundtrip to cover the ranges_to_files = Some path through the plugin cache hooks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the JSON header (and the version field) in favor of a tight little-endian binary encoding. ranges_to_files entries are now written as fixed-width u32 fields plus a len-prefixed path, avoiding JSON encoding of arrays of numbers. The format carries no version tag; on-disk stability is not yet a promise — any mismatch will be detected (or trigger a deserialize error) and the cache will rebuild from source. A version field can be added later once the format stabilizes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
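As a rough illustration of the kind of encoding this commit describes (fixed-width little-endian `u32` fields plus a length-prefixed path), here is a minimal sketch. The `RangeEntry` struct, its field names, and the leading entry count are assumptions made for the example and are not taken from the PR; the actual layout in `BTreeIndexState` may differ.

```rust
/// Hypothetical stand-in for one `ranges_to_files` entry; the field names
/// are assumptions, not the PR's actual types.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RangeEntry {
    pub start: u32,
    pub end: u32,
    pub path: String,
}

/// Layout: u32 entry count, then per entry u32 start, u32 end,
/// u32 path length, path bytes. All integers are little-endian.
pub fn encode_entries(entries: &[RangeEntry]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(entries.len() as u32).to_le_bytes());
    for e in entries {
        out.extend_from_slice(&e.start.to_le_bytes());
        out.extend_from_slice(&e.end.to_le_bytes());
        out.extend_from_slice(&(e.path.len() as u32).to_le_bytes());
        out.extend_from_slice(e.path.as_bytes());
    }
    out
}

/// Read one little-endian u32 at `*pos`, advancing the cursor.
fn read_u32(buf: &[u8], pos: &mut usize) -> Result<u32, String> {
    let end = (*pos).checked_add(4).ok_or("offset overflow")?;
    let s = buf.get(*pos..end).ok_or("truncated u32")?;
    *pos = end;
    Ok(u32::from_le_bytes([s[0], s[1], s[2], s[3]]))
}

/// Decode the layout written by `encode_entries`. Any truncation, invalid
/// UTF-8, or trailing bytes is reported as an error so that a caller can
/// fall back to rebuilding from source.
pub fn decode_entries(buf: &[u8]) -> Result<Vec<RangeEntry>, String> {
    let mut pos = 0usize;
    let count = read_u32(buf, &mut pos)?;
    let mut entries = Vec::new();
    for _ in 0..count {
        let start = read_u32(buf, &mut pos)?;
        let end = read_u32(buf, &mut pos)?;
        let len = read_u32(buf, &mut pos)? as usize;
        let stop = pos.checked_add(len).ok_or("offset overflow")?;
        let raw = buf.get(pos..stop).ok_or("truncated path")?;
        let path = String::from_utf8(raw.to_vec()).map_err(|e| e.to_string())?;
        pos = stop;
        entries.push(RangeEntry { start, end, path });
    }
    if pos != buf.len() {
        return Err("trailing bytes".to_string());
    }
    Ok(entries)
}
```

Because there is no version tag, the decoder's only defense against a changed layout is strict bounds and length checking, which is why every read in the sketch returns a `Result`.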
The cache passed to get_from_cache/put_in_cache is already per-index namespaced by open_scalar_index (via LanceCache::for_index), so the extra key parameter just appended a redundant uuid component to every hash. Collapse ScalarIndexCacheKey to a unit struct with a constant key and drop the parameter from both methods. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
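The reasoning in this commit can be sketched with a toy namespaced cache. `ToyCache` and `IndexKey` are hypothetical names for illustration; only `LanceCache::for_index` and `ScalarIndexCacheKey` are named in the source, and their real implementations are not shown here. The point is that once the cache handle itself is scoped to one index, a constant key is enough.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Unit struct key: it carries no data because the cache namespace
/// already identifies the index.
pub struct IndexKey;

impl IndexKey {
    pub fn key(&self) -> &'static str {
        "index"
    }
}

/// Toy shared cache; `prefix` plays the role of the per-index namespace.
#[derive(Clone, Default)]
pub struct ToyCache {
    prefix: String,
    entries: Arc<Mutex<HashMap<String, String>>>,
}

impl ToyCache {
    /// Return a view over the same storage, scoped to one index.
    pub fn for_index(&self, uuid: &str) -> ToyCache {
        ToyCache {
            prefix: format!("{}{}/", self.prefix, uuid),
            entries: Arc::clone(&self.entries),
        }
    }

    pub fn insert(&self, key: &IndexKey, value: &str) {
        let full = format!("{}{}", self.prefix, key.key());
        self.entries.lock().unwrap().insert(full, value.to_string());
    }

    pub fn get(&self, key: &IndexKey) -> Option<String> {
        let full = format!("{}{}", self.prefix, key.key());
        self.entries.lock().unwrap().get(&full).cloned()
    }
}
```

Two scoped views can both store under `IndexKey` without colliding, so repeating the uuid inside the key would add nothing but extra hashing work.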
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BTreeIndex now retains the full lookup batch alongside the BTreeLookup built from it, duplicating min/max values for every page. Note a follow-up path that would eliminate the duplication. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ic doc rustdoc -D warnings rejects intra-doc links from public items to private ones. Drop the link; the name is enough for a reader. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
westonpace (Member) approved these changes on May 19, 2026 and left a comment:
Does this cache the pages themselves? It looks like it might just be caching the lookups?
Nevermind, I missed the batch IPC stream at the end. +1.
wjones127 added a commit that referenced this pull request on May 20, 2026:
…es (#6874)

Adds `CacheCodec` impls so Bitmap and LabelList index cache entries survive through a persistent cache backend, mirroring the BTree work in #6793.

- `CacheCodecImpl for RowAddrTreeMap` (delegates to existing `serialize_into`/`deserialize_from`), so per-value bitmap entries cached under `BitmapKey` are codec-backed.
- `BitmapIndexState` captures the value→offset map (Arrow IPC), the null bitmap, and the value type. `BitmapIndexPlugin` overrides `get_from_cache`/`put_in_cache` to store this sized state.
- `LabelListIndexState` wraps an inner `BitmapIndexState` plus `list_nulls` and gets the same plugin-level codec treatment.
- `open_scalar_index` skips the LabelList compatibility check on cache hits, so a fully-cached LabelList query no longer pays an extra `bitmap_page_lookup.lance` open per call.

## Tests

- Unit codec round-trip for `BitmapIndexState` (empty + populated).
- Integration tests `test_{bitmap,label_list}_prewarm_with_serializing_backend_serves_query_with_no_io` asserting zero IOPS after prewarm through a serializing cache backend.

Closes #6744
wombatu-kun pushed a commit to wombatu-kun/lance that referenced this pull request on May 21, 2026:
Mainline added serializable scalar-index caching (lance-format#6793, lance-format#6874) and moved the TRACE_IO_EVENTS / record_index_load instrumentation from the outer call site into `scalar::open_scalar_index`. The relocated trace references a `uuid_str` local that no longer exists after the branch dropped the `&str` form, and the inner `index` binding is shadowed by the loaded plugin index. Capture `index.uuid` (a `Uuid`) before the shadowing and format it via Display. Also re-add the `UnsizedCacheKey` import in `rust/lance/src/index.rs`; the new `ScalarIndexCacheKey` introduced by this branch implements it, but the import was lost when the auto-merge pruned the outer scalar-cache code that mainline migrated into the plugin layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
wombatu-kun pushed commits with the same message to wombatu-kun/lance seven more times, each referencing this pull request: on May 24, May 25, Jun 4 (three times), and Jun 5 (twice), 2026.
Makes BTree scalar index cache entries serializable, so a persistent cache backend can store and reload them without re-reading from storage.
Previously the whole BTree index was cached as `Arc<dyn ScalarIndex>` under an `UnsizedCacheKey`, which can never carry a codec, and each `FlatIndex` page was cached in-memory only.

Changes:

- `CacheCodecImpl for FlatIndex` (one BTree page) and `BTreePageKey::codec()`.
- `ScalarIndexPlugin::get_from_cache` / `put_in_cache` hooks. The default impl preserves today's in-memory unsized caching (backwards compatible); the BTree plugin overrides it with a sized, codec-backed `BTreeIndexState` (the lookup `RecordBatch` + `batch_size` + `ranges_to_files`, from which `try_from_serialized` rebuilds the index with no IO).
- `scalar::open_scalar_index` (get → miss → load → put); the dataset-level `ScalarIndexCacheKey` logic is removed from `Dataset::open_scalar_index`.

This keeps index-type-specific knowledge in `lance-index` rather than leaking a state trait + dispatch into `lance/src/index.rs`.

Adds an integration test asserting that after prewarming with a serializing cache backend, an indexed-filter query does 0 read IOPS.
Bitmap index will follow the same pattern in a separate PR.
🤖 Generated with Claude Code
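
For readers unfamiliar with the get → miss → load → put flow mentioned in the description, the following is a minimal sketch of that pattern. `Opener`, `load`, and `open` are hypothetical names, the string payload stands in for real index state, and the `loads` counter is only a loose analogue of the read-IOPS assertion in the integration test; none of this is the actual `scalar::open_scalar_index` code.

```rust
use std::collections::HashMap;

/// Toy opener that counts how often the slow path runs.
#[derive(Default)]
pub struct Opener {
    cache: HashMap<String, String>,
    pub loads: usize,
}

impl Opener {
    /// Hypothetical slow path standing in for reading index files from storage.
    fn load(&mut self, uuid: &str) -> String {
        self.loads += 1;
        format!("index-state-for-{}", uuid)
    }

    /// get -> miss -> load -> put
    pub fn open(&mut self, uuid: &str) -> String {
        // get: serve from the cache when possible.
        if let Some(hit) = self.cache.get(uuid) {
            return hit.clone();
        }
        // miss: fall back to the slow path, then put the result in the cache.
        let loaded = self.load(uuid);
        self.cache.insert(uuid.to_string(), loaded.clone());
        loaded
    }
}
```

Opening the same index twice should touch the slow path only once, which is the property the prewarm test checks with real IO counters.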