feat(fts): implement CacheCodec for posting lists and positions#6691
Conversation
7e2cd07 to
1235bce
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
@claude review once |
There was a problem hiding this comment.
Found one nit (inline) about preferring as_primitive_opt for type-safe IPC column reads — otherwise the codec mirrors partition_serde.rs cleanly and the round-trip + zero-copy tests look solid. Deferring to a human for the format-design call (variant tags, JSON header layout) since this is the persistence contract for FTS cache entries going forward.
Extended reasoning...
Overview
This PR introduces a new ~665-line cache_codec.rs implementing CacheCodecImpl for PostingList and Positions, plus changes to inverted/index.rs to (a) switch SharedPositionStream::bytes from Vec<u8> to bytes::Bytes for zero-copy reads, (b) wire PostingListKey / PositionKey to the new codec via CacheKey::codec(), and (c) make Positions field visibility pub(super). Tests exercise round-trip across all variants (Plain/Compressed × None/Legacy/SharedStream positions), verify that the deserialized shared-stream byte slice points back into the input Bytes (true zero-copy), and add an integration test (test_fts_prewarm_with_serializing_backend_serves_query_with_no_io) that drives a real FTS prewarm + phrase query through a serializing cache backend and asserts zero IO post-prewarm.
Security risks
None of significance. The codec consumes Bytes from a cache backend, which is trusted infrastructure rather than untrusted input. The IPC stream + JSON header path could panic on malformed input via as_primitive::<T>() (the nit flagged inline), but the realistic threat is corruption / version-skew rather than an attacker-controlled payload, and the fix is the standard _opt + ok_or_else pattern already used elsewhere in the same file.
Level of scrutiny
Moderate-to-high. This is a persistence format — once cache entries are written by a real persistent backend, the variant tags (POSTING_VARIANT_PLAIN = 0, etc.), JSON header field names (max_score, length, posting_tail_codec, codec), and IPC schema column ordering become a forward-compat contract. Worth a human eye specifically on: (1) whether the JSON headers should carry a format version field for future evolution, (2) whether the panic-on-type-mismatch behavior at cache_codec.rs:199-203 and :295-304 is acceptable for persistent backends, and (3) whether Positions::deserialize rejecting POSITIONS_TAG_NONE is the right policy.
Other factors
- Codecov reports 82.83% patch coverage with 73 lines uncovered, mostly in error paths (truncated input, unknown tags, JSON header parse failures). The happy-path round-trip coverage is good.
- The only notable behavioral change in
index.rsisSharedPositionStream::size()now returnsbytes.len()instead ofbytes.capacity()— a small accounting tweak that is correct givenBytesdoes not expose capacity. - The integration test in
dataset_index.rsintroduces aSerializingBackendtest helper inline and the comment notes "if a third user appears, lift this into a shared test utility" — reasonable deferral. - Author explicitly requested review (
@claude review once), so a deferral with the nit context is more useful than silence here.
|
Hi wkalt, I saw your name here: https://lance.org/community/maintainers/#activities. I'm taking the liberty to ask in your PR comment — do you have any idea roughly when V6.0.0 and V7.0.0 will be released? I'm considering whether to wait for these two versions, otherwise I'll have to use v4.0.0. |
OSS-741. Follows lance#6223. PostingListKey and PositionKey gain a codec mirroring partition_serde.rs (variant tag + JSON header + Arrow IPC streams). SharedPositionStream is backed by bytes::Bytes so the cache read path is end-to-end zero-copy.
The deserialization paths for PlainPostingList and the shared-position storage read primitive columns out of an Arrow IPC stream that was decoded from cache bytes. The bytes can come from a persistent cache backend across versions or from a corrupted entry, so the IPC schema is part of the (untrusted) input rather than something we have already validated. The previous code used as_primitive::<T>(), which panics on type mismatch. Switch to as_primitive_opt + ok_or_else(Error::io) so a malformed cache entry surfaces as an Error::io the caller can log and fall back from instead of aborting the process. This matches the pattern already used in this file for ListArray and LargeBinaryArray columns and aligns with the rust/AGENTS.md guidance to prefer _opt variants when the data type has not been pre-verified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b0c6d00 to
8cfd2b0
Compare
|
@watermelon12138 Hi. 6.x should be coming as soon as tomorrow. We are just waiting for the release voting period to conclude. There is not a date yet planned for 7.x, it'll likely be in the next few weeks. |
wjones127
left a comment
There was a problem hiding this comment.
Looks great! Excited to have this supported.
|
thanks. |
OSS-741. Follows lance#6223. PostingListKey and PositionKey gain a codec mirroring partition_serde.rs (variant tag + JSON header + Arrow IPC streams). SharedPositionStream is backed by bytes::Bytes so the cache read path is end-to-end zero-copy.