feat(index): serializable cache for the BTree scalar index #6793

Merged
wjones127 merged 9 commits into lance-format:main from wjones127:cache-codec-btree
May 20, 2026

@wjones127
Contributor

Makes BTree scalar index cache entries serializable, so a persistent cache backend can store and reload them without re-reading from storage.

Previously the whole BTree index was cached as Arc<dyn ScalarIndex> under an UnsizedCacheKey, which can never carry a codec, and each FlatIndex page was cached in-memory only.

Changes:

  • CacheCodecImpl for FlatIndex (one BTree page) and BTreePageKey::codec().
  • Top-level scalar index caching becomes a plugin implementation detail via the existing ScalarIndexPlugin::get_from_cache/put_in_cache hooks. The default impl preserves today's in-memory unsized caching (backwards compatible); the BTree plugin overrides it with a sized, codec-backed BTreeIndexState (the lookup RecordBatch + batch_size + ranges_to_files, from which try_from_serialized rebuilds the index with no IO).
  • Caching moves into scalar::open_scalar_index (get → miss → load → put); the dataset-level ScalarIndexCacheKey logic is removed from Dataset::open_scalar_index.

This keeps index-type-specific knowledge in lance-index rather than leaking a state trait + dispatch into lance/src/index.rs.

Adds an integration test asserting that after prewarming with a serializing cache backend, an indexed-filter query does 0 read IOPS.
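
The get → miss → load → put flow, and the reason a serializing backend needs a codec, can be sketched in a few lines. This is a minimal illustration, not the Lance implementation: `Codec`, `IndexState`, `SerializingCache`, `Storage`, and `open_index` are hypothetical stand-ins, and the real `BTreeIndexState` carries a lookup batch and a range map rather than a single integer.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a cache codec: converts an entry to bytes and back.
pub trait Codec: Sized {
    fn encode(&self) -> Vec<u8>;
    fn decode(bytes: &[u8]) -> Option<Self>;
}

/// Hypothetical, heavily simplified index state (assumption: one field only).
#[derive(Debug, Clone, PartialEq)]
pub struct IndexState {
    pub batch_size: u64,
}

impl Codec for IndexState {
    fn encode(&self) -> Vec<u8> {
        self.batch_size.to_le_bytes().to_vec()
    }

    fn decode(bytes: &[u8]) -> Option<Self> {
        if bytes.len() != 8 {
            return None; // malformed entry: treat as a miss
        }
        let mut raw = [0u8; 8];
        raw.copy_from_slice(bytes);
        Some(IndexState { batch_size: u64::from_le_bytes(raw) })
    }
}

/// A backend that keeps only bytes, so every entry must round-trip through its codec.
#[derive(Default)]
pub struct SerializingCache {
    entries: HashMap<String, Vec<u8>>,
}

impl SerializingCache {
    pub fn get<T: Codec>(&self, key: &str) -> Option<T> {
        self.entries.get(key).and_then(|bytes| T::decode(bytes))
    }

    pub fn put<T: Codec>(&mut self, key: &str, value: &T) {
        self.entries.insert(key.to_string(), value.encode());
    }
}

/// Simulated storage that counts how many reads were issued.
#[derive(Default)]
pub struct Storage {
    pub reads: usize,
}

impl Storage {
    fn load_state(&mut self) -> IndexState {
        self.reads += 1;
        IndexState { batch_size: 4096 }
    }
}

/// get -> miss -> load -> put: storage is touched only when the cache has no entry.
pub fn open_index(cache: &mut SerializingCache, storage: &mut Storage, key: &str) -> IndexState {
    if let Some(state) = cache.get::<IndexState>(key) {
        return state; // hit: rebuilt from cached bytes, no IO
    }
    let state = storage.load_state(); // miss: one read
    cache.put(key, &state);
    state
}
```

Opening the same index twice against this backend issues one read in total; the second open is served from the decoded bytes, which is the property the integration test checks at full scale.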

Bitmap index will follow the same pattern in a separate PR.

🤖 Generated with Claude Code

wjones127 and others added 3 commits May 14, 2026 14:23
The whole BTree index was cached at the top level as `Arc<dyn ScalarIndex>`
under an `UnsizedCacheKey`, which can never carry a codec, so a persistent
cache backend could not serialize it.

Make top-level scalar index caching a plugin implementation detail via the
existing `ScalarIndexPlugin::get_from_cache` / `put_in_cache` hooks. The
default impls keep today's in-memory unsized caching; `BTreeIndexPlugin`
overrides them to store a `BTreeIndexState` (the `page_lookup.lance` batch
plus `batch_size` and the range-partition map) under a sized, codec-backed
key, and reconstructs the index from that state. `get_from_cache` gains
`index_store` / `frag_reuse_index` args so an override can rebuild without
re-reading metadata.

Also wire `BTreePageKey::codec()` so cached `FlatIndex` pages are
serializable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cache

Adds the BTree analogue of
test_fts_prewarm_with_serializing_backend_serves_query_with_no_io: after
prewarming a multi-page BTree index through a cache backend that serializes
every entry via its codec, an indexed-filter query reconstructs the index
and every page it touches from the cache with zero read IOPS. Exercises the
BTreeIndexState and FlatIndex CacheCodec round-trips end to end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the enhancement New feature or request label May 14, 2026
@codecov

codecov Bot commented May 14, 2026

Codecov Report

❌ Patch coverage is 86.25000% with 55 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/btree.rs | 87.83% | 16 Missing and 25 partials ⚠️ |
| rust/lance-index/src/scalar/btree/flat.rs | 73.80% | 0 Missing and 11 partials ⚠️ |
| rust/lance/src/index/scalar.rs | 72.72% | 0 Missing and 3 partials ⚠️ |

Collaborator

@Xuanwo Xuanwo left a comment

Thank you for working on this!

wjones127 and others added 6 commits May 15, 2026 11:11
- assert SearchResult equality directly instead of via Debug-string formatting.
- add test_btree_index_state_rejects_unknown_version covering the
  forward-compat version guard in BTreeIndexState::deserialize.
- add test_btree_index_state_reconstruct_applies_frag_reuse_index to verify
  the frag_reuse_index argument is threaded through reconstruct into the
  rebuilt BTreeIndex.
- add test_btree_index_state_range_partitioned_plugin_cache_roundtrip to
  cover the ranges_to_files = Some path through the plugin cache hooks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the JSON header (and the version field) in favor of a tight
little-endian binary encoding. ranges_to_files entries are now written
as fixed-width u32 fields plus a len-prefixed path, avoiding JSON
encoding of arrays of numbers.

The format carries no version tag; on-disk stability is not yet a
promise — any mismatch will be detected (or trigger a deserialize
error) and the cache will rebuild from source. A version field can be
added later once the format stabilizes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
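
For readers unfamiliar with this style of encoding, the sketch below shows one way to write entries as fixed-width little-endian `u32` fields followed by a length-prefixed path. The entry shape (`start`, `end`, `path`), the leading entry count, and the names `RangeEntry`, `encode_entries`, and `decode_entries` are assumptions made for illustration; the actual field layout in the PR may differ. As in the commit message, a malformed buffer surfaces as a decode error so the caller can fall back to rebuilding from source.

```rust
/// Hypothetical entry shape: two fixed-width u32 fields and a file path.
#[derive(Debug, Clone, PartialEq)]
pub struct RangeEntry {
    pub start: u32,
    pub end: u32,
    pub path: String,
}

/// Layout (assumed): [count: u32] then per entry [start: u32][end: u32][path_len: u32][path bytes].
pub fn encode_entries(entries: &[RangeEntry]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(entries.len() as u32).to_le_bytes());
    for entry in entries {
        out.extend_from_slice(&entry.start.to_le_bytes());
        out.extend_from_slice(&entry.end.to_le_bytes());
        out.extend_from_slice(&(entry.path.len() as u32).to_le_bytes());
        out.extend_from_slice(entry.path.as_bytes());
    }
    out
}

/// Reads one little-endian u32 at `pos` and advances the cursor.
fn read_u32(buf: &[u8], pos: &mut usize) -> Result<u32, String> {
    let end = (*pos).checked_add(4).ok_or("offset overflow")?;
    let slice = buf.get(*pos..end).ok_or("truncated u32 field")?;
    let mut raw = [0u8; 4];
    raw.copy_from_slice(slice);
    *pos = end;
    Ok(u32::from_le_bytes(raw))
}

/// Decodes the layout above; any truncation or leftover bytes is reported as an error.
pub fn decode_entries(buf: &[u8]) -> Result<Vec<RangeEntry>, String> {
    let mut pos = 0usize;
    let count = read_u32(buf, &mut pos)? as usize;
    // No pre-allocation from `count`: a corrupt header must not trigger a huge allocation.
    let mut entries = Vec::new();
    for _ in 0..count {
        let start = read_u32(buf, &mut pos)?;
        let end = read_u32(buf, &mut pos)?;
        let len = read_u32(buf, &mut pos)? as usize;
        let stop = pos.checked_add(len).ok_or("offset overflow")?;
        let raw = buf.get(pos..stop).ok_or("truncated path")?;
        let path = String::from_utf8(raw.to_vec()).map_err(|e| e.to_string())?;
        pos = stop;
        entries.push(RangeEntry { start: start, end: end, path: path });
    }
    if pos != buf.len() {
        return Err("trailing bytes after last entry".to_string());
    }
    Ok(entries)
}
```

Because the format has no version tag, the only defence against a stale or foreign buffer is strict bounds checking like the above; a decode failure is then handled as an ordinary cache miss.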
The cache passed to get_from_cache/put_in_cache is already per-index
namespaced by open_scalar_index (via LanceCache::for_index), so the
extra key parameter just appended a redundant uuid component to every
hash. Collapse ScalarIndexCacheKey to a unit struct with a constant key
and drop the parameter from both methods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
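
Why a constant key is enough once the cache is already namespaced per index can be shown with a small model. Everything here is hypothetical (`ScopedCache`, `StateKey`, and the string prefix scheme are illustrative, not the `LanceCache` API): the scope carries the index identity, so the key itself no longer needs a uuid component.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical unit-struct key: every index stores its state under the same constant name.
pub struct StateKey;

impl StateKey {
    pub fn key(&self) -> &'static str {
        "state"
    }
}

/// Hypothetical cache handle: a shared byte store plus a prefix naming the current scope.
#[derive(Clone, Default)]
pub struct ScopedCache {
    prefix: String,
    store: Rc<RefCell<HashMap<String, Vec<u8>>>>,
}

impl ScopedCache {
    /// Returns a view of the same store scoped to one index (stand-in for per-index namespacing).
    pub fn for_index(&self, uuid: &str) -> ScopedCache {
        ScopedCache {
            prefix: format!("{}{}/", self.prefix, uuid),
            store: Rc::clone(&self.store),
        }
    }

    pub fn put(&self, key: &StateKey, value: Vec<u8>) {
        let full = format!("{}{}", self.prefix, key.key());
        self.store.borrow_mut().insert(full, value);
    }

    pub fn get(&self, key: &StateKey) -> Option<Vec<u8>> {
        let full = format!("{}{}", self.prefix, key.key());
        self.store.borrow().get(&full).cloned()
    }
}
```

Two indexes using the identical `StateKey` do not collide because their scopes differ, which is the redundancy the commit removes from the key.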
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BTreeIndex now retains the full lookup batch alongside the BTreeLookup
built from it, duplicating min/max values for every page. Note a
follow-up path that would eliminate the duplication.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ic doc

rustdoc -D warnings rejects intra-doc links from public items to private
ones. Drop the link; the name is enough for a reader.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wjones127 wjones127 marked this pull request as ready for review May 15, 2026 21:02

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Member

@westonpace westonpace left a comment

Does this cache the pages themselves? It looks like it might just be caching the lookups?

@westonpace
Member

Never mind, I missed the batch IPC stream at the end. +1.

@wjones127 wjones127 merged commit 1607010 into lance-format:main May 20, 2026
28 checks passed
wjones127 added a commit that referenced this pull request May 20, 2026
…es (#6874)

Adds `CacheCodec` impls so Bitmap and LabelList index cache entries
survive through a persistent cache backend, mirroring the BTree work in
#6793.

- `CacheCodecImpl for RowAddrTreeMap` (delegates to existing
`serialize_into`/`deserialize_from`), so per-value bitmap entries cached
under `BitmapKey` are codec-backed.
- `BitmapIndexState` captures the value→offset map (Arrow IPC), the null
bitmap, and the value type. `BitmapIndexPlugin` overrides
`get_from_cache`/`put_in_cache` to store this sized state.
- `LabelListIndexState` wraps an inner `BitmapIndexState` plus
`list_nulls` and gets the same plugin-level codec treatment.
- `open_scalar_index` skips the LabelList compatibility check on cache
hits, so a fully-cached LabelList query no longer pays an extra
`bitmap_page_lookup.lance` open per call.
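
The "delegates to existing `serialize_into`/`deserialize_from`" pattern from the first bullet can be sketched as follows. This is an assumption-laden illustration: `RowSet` is a hypothetical stand-in for a row-address set, its stream format is invented here, and the `CacheCodec` trait shown is a simplified shape rather than the trait defined in Lance.

```rust
use std::io::{self, Read, Write};

/// Hypothetical stand-in for a type that already has stream (de)serializers.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct RowSet {
    pub rows: Vec<u64>,
}

impl RowSet {
    /// Pre-existing writer-based serializer (format assumed: count, then each row, little-endian).
    pub fn serialize_into<W: Write>(&self, mut writer: W) -> io::Result<()> {
        writer.write_all(&(self.rows.len() as u64).to_le_bytes())?;
        for row in &self.rows {
            writer.write_all(&row.to_le_bytes())?;
        }
        Ok(())
    }

    /// Pre-existing reader-based deserializer; fails with an IO error on truncated input.
    pub fn deserialize_from<R: Read>(mut reader: R) -> io::Result<RowSet> {
        let mut raw = [0u8; 8];
        reader.read_exact(&mut raw)?;
        let count = u64::from_le_bytes(raw);
        let mut rows = Vec::new();
        for _ in 0..count {
            reader.read_exact(&mut raw)?;
            rows.push(u64::from_le_bytes(raw));
        }
        Ok(RowSet { rows: rows })
    }
}

/// Simplified, hypothetical codec trait: what a byte-oriented cache backend needs.
pub trait CacheCodec: Sized {
    fn to_bytes(&self) -> io::Result<Vec<u8>>;
    fn from_bytes(bytes: &[u8]) -> io::Result<Self>;
}

/// The codec adds no new format; it only adapts the existing serializers to byte buffers.
impl CacheCodec for RowSet {
    fn to_bytes(&self) -> io::Result<Vec<u8>> {
        let mut buf = Vec::new();
        self.serialize_into(&mut buf)?;
        Ok(buf)
    }

    fn from_bytes(bytes: &[u8]) -> io::Result<RowSet> {
        RowSet::deserialize_from(bytes)
    }
}
```

Delegating this way keeps a single source of truth for the byte format, so the cache codec cannot drift from the serializer the type already uses elsewhere.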

## Tests

- Unit codec round-trip for `BitmapIndexState` (empty + populated).
- Integration tests
`test_{bitmap,label_list}_prewarm_with_serializing_backend_serves_query_with_no_io`
asserting zero IOPS after prewarm through a serializing cache backend.

Closes #6744
wombatu-kun pushed a commit to wombatu-kun/lance that referenced this pull request May 21, 2026
Mainline added serializable scalar-index caching (lance-format#6793, lance-format#6874) and moved the TRACE_IO_EVENTS / record_index_load instrumentation from the outer call site into `scalar::open_scalar_index`. The relocated trace references a `uuid_str` local that no longer exists after the branch dropped the `&str` form, and the inner `index` binding is shadowed by the loaded plugin index. Capture `index.uuid` (a `Uuid`) before the shadowing and format it via Display.

Also re-add the `UnsizedCacheKey` import in `rust/lance/src/index.rs`; the new `ScalarIndexCacheKey` introduced by this branch implements it, but the import was lost when the auto-merge pruned the outer scalar-cache code that mainline migrated into the plugin layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>