Skip to content

feat(fts): implement CacheCodec for posting lists and positions#6691

Merged
wjones127 merged 2 commits into
lance-format:mainfrom
wkalt:ticket/oss-741/cachecodec-fts
May 6, 2026
Merged

feat(fts): implement CacheCodec for posting lists and positions#6691
wjones127 merged 2 commits into
lance-format:mainfrom
wkalt:ticket/oss-741/cachecodec-fts

Conversation

@wkalt
Copy link
Copy Markdown
Contributor

@wkalt wkalt commented May 5, 2026

OSS-741. Follows lance#6223. PostingListKey and PositionKey gain a codec mirroring partition_serde.rs (variant tag + JSON header + Arrow IPC streams). SharedPositionStream is backed by bytes::Bytes so the cache read path is end-to-end zero-copy.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the enhancement New feature or request label May 5, 2026
@wkalt wkalt force-pushed the ticket/oss-741/cachecodec-fts branch from 7e2cd07 to 1235bce Compare May 5, 2026 22:24
@codecov
Copy link
Copy Markdown

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 82.24299% with 76 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ust/lance-index/src/scalar/inverted/cache_codec.rs 82.68% 25 Missing and 46 partials ⚠️
rust/lance-index/src/scalar/inverted/index.rs 72.22% 5 Missing ⚠️

📢 Thoughts on this report? Let us know!

@wkalt
Copy link
Copy Markdown
Contributor Author

wkalt commented May 5, 2026

@claude review once

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one nit (inline) about preferring as_primitive_opt for type-safe IPC column reads — otherwise the codec mirrors partition_serde.rs cleanly and the round-trip + zero-copy tests look solid. Deferring to a human for the format-design call (variant tags, JSON header layout) since this is the persistence contract for FTS cache entries going forward.

Extended reasoning...

Overview

This PR introduces a new ~665-line cache_codec.rs implementing CacheCodecImpl for PostingList and Positions, plus changes to inverted/index.rs to (a) switch SharedPositionStream::bytes from Vec<u8> to bytes::Bytes for zero-copy reads, (b) wire PostingListKey / PositionKey to the new codec via CacheKey::codec(), and (c) make Positions field visibility pub(super). Tests exercise round-trip across all variants (Plain/Compressed × None/Legacy/SharedStream positions), verify that the deserialized shared-stream byte slice points back into the input Bytes (true zero-copy), and add an integration test (test_fts_prewarm_with_serializing_backend_serves_query_with_no_io) that drives a real FTS prewarm + phrase query through a serializing cache backend and asserts zero IO post-prewarm.

Security risks

None of significance. The codec consumes Bytes from a cache backend, which is trusted infrastructure rather than untrusted input. The IPC stream + JSON header path could panic on malformed input via as_primitive::<T>() (the nit flagged inline), but the realistic threat is corruption / version-skew rather than an attacker-controlled payload, and the fix is the standard _opt + ok_or_else pattern already used elsewhere in the same file.

Level of scrutiny

Moderate-to-high. This is a persistence format — once cache entries are written by a real persistent backend, the variant tags (POSTING_VARIANT_PLAIN = 0, etc.), JSON header field names (max_score, length, posting_tail_codec, codec), and IPC schema column ordering become a forward-compat contract. Worth a human eye specifically on: (1) whether the JSON headers should carry a format version field for future evolution, (2) whether the panic-on-type-mismatch behavior at cache_codec.rs:199-203 and :295-304 is acceptable for persistent backends, and (3) whether Positions::deserialize rejecting POSITIONS_TAG_NONE is the right policy.

Other factors

  • Codecov reports 82.83% patch coverage with 73 lines uncovered, mostly in error paths (truncated input, unknown tags, JSON header parse failures). The happy-path round-trip coverage is good.
  • The only notable behavioral change in index.rs is SharedPositionStream::size() now returns bytes.len() instead of bytes.capacity() — a small accounting tweak that is correct given Bytes does not expose capacity.
  • The integration test in dataset_index.rs introduces a SerializingBackend test helper inline and the comment notes "if a third user appears, lift this into a shared test utility" — reasonable deferral.
  • Author explicitly requested review (@claude review once), so a deferral with the nit context is more useful than silence here.

Comment thread rust/lance-index/src/scalar/inverted/cache_codec.rs
@watermelon12138
Copy link
Copy Markdown

Hi wkalt, I saw your name here: https://lance.org/community/maintainers/#activities. I'm taking the liberty to ask in your PR comment — do you have any idea roughly when V6.0.0 and V7.0.0 will be released? I'm considering whether to wait for these two versions, otherwise I'll have to use v4.0.0.

wkalt and others added 2 commits May 6, 2026 11:20
OSS-741. Follows lance#6223. PostingListKey and PositionKey gain a
codec mirroring partition_serde.rs (variant tag + JSON header +
Arrow IPC streams). SharedPositionStream is backed by bytes::Bytes
so the cache read path is end-to-end zero-copy.
The deserialization paths for PlainPostingList and the shared-position
storage read primitive columns out of an Arrow IPC stream that was
decoded from cache bytes. The bytes can come from a persistent cache
backend across versions or from a corrupted entry, so the IPC schema
is part of the (untrusted) input rather than something we have already
validated.

The previous code used as_primitive::<T>(), which panics on type
mismatch. Switch to as_primitive_opt + ok_or_else(Error::io) so a
malformed cache entry surfaces as an Error::io the caller can log and
fall back from instead of aborting the process. This matches the
pattern already used in this file for ListArray and LargeBinaryArray
columns and aligns with the rust/AGENTS.md guidance to prefer _opt
variants when the data type has not been pre-verified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wkalt wkalt force-pushed the ticket/oss-741/cachecodec-fts branch from b0c6d00 to 8cfd2b0 Compare May 6, 2026 18:21
@wkalt
Copy link
Copy Markdown
Contributor Author

wkalt commented May 6, 2026

@watermelon12138 Hi. 6.x should be coming as soon as tomorrow. We are just waiting for the release voting period to conclude.

There is not a date yet planned for 7.x, it'll likely be in the next few weeks.

Copy link
Copy Markdown
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Excited to have this supported.

@wjones127 wjones127 merged commit 0ef33c5 into lance-format:main May 6, 2026
34 of 36 checks passed
@watermelon12138
Copy link
Copy Markdown

thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants