feat(encoding): cache repetition index for FullZip encoding#4104
Signed-off-by: Xuanwo <github@xuanwo.io>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #4104 +/- ##
==========================================
- Coverage 78.05% 77.98% -0.08%
==========================================
Files 301 302 +1
Lines 102665 103312 +647
Branches 102665 103312 +647
==========================================
+ Hits 80134 80563 +429
- Misses 19576 19783 +207
- Partials 2955 2966 +11
westonpace
left a comment
Looks like the right direction. A few suggestions, but nothing urgent.
Ok(async move {
    let data = data.await?;
    let data = data
        .into_iter()
        .map(|d| LanceBuffer::from_bytes(d, 1))
        .collect();
    let num_rows = row_ranges.into_iter().map(|r| r.end - r.start).sum();

    match &details.value_decompressor {
        PerValueDecompressor::Fixed(decompressor) => {
            let bits_per_value = decompressor.bits_per_value();
            if bits_per_value == 0 {
                return Err(lance_core::Error::Internal {
                    message: "Invalid encoding: bits_per_value must be greater than 0".into(),
                    location: location!(),
                });
            }
            if bits_per_value % 8 != 0 {
                return Err(lance_core::Error::NotSupported {
                    source: "Bit-packed full-zip encoding (non-byte-aligned values) is not yet implemented".into(),
                    location: location!(),
                });
            }
            let bytes_per_value = bits_per_value / 8;
            let total_bytes_per_value =
                bytes_per_value as usize + details.ctrl_word_parser.bytes_per_word();
            Ok(Box::new(FixedFullZipDecoder {
                details,
                data,
                num_rows,
                offset_in_current: 0,
                bytes_per_value: bytes_per_value as usize,
                total_bytes_per_value,
            }) as Box<dyn StructuralPageDecoder>)
        }
        PerValueDecompressor::Variable(_decompressor) => {
            Ok(Box::new(VariableFullZipDecoder::new(
                details,
                data,
                num_rows,
                bits_per_offset,
                bits_per_offset,
            )) as Box<dyn StructuralPageDecoder>)
        }
    }
}
.boxed())
Is there any way we can consolidate this logic across the two implementations, so that we avoid the duplication?
I will try that in follow-up PRs.
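One possible shape for that consolidation, offered only as a sketch: the width validation from the snippet above could move into a small shared helper that both code paths call before constructing a decoder. The names `fixed_value_layout` and `WidthError` are hypothetical and do not exist in Lance; the real code returns `lance_core::Error` variants rather than this stand-in enum.

```rust
/// Hypothetical error type standing in for the `lance_core::Error` variants
/// used in the snippet above.
#[derive(Debug, PartialEq)]
enum WidthError {
    /// `bits_per_value` was zero, which is an invalid encoding.
    Zero,
    /// The width is not a whole number of bytes (bit-packed values).
    NotByteAligned(u64),
}

/// Hypothetical shared helper: validates the value width once and returns
/// `(bytes_per_value, total_bytes_per_value)`, where the total also counts
/// the control word stored next to each value.
fn fixed_value_layout(
    bits_per_value: u64,
    ctrl_word_bytes: usize,
) -> Result<(usize, usize), WidthError> {
    if bits_per_value == 0 {
        return Err(WidthError::Zero);
    }
    if bits_per_value % 8 != 0 {
        return Err(WidthError::NotByteAligned(bits_per_value));
    }
    let bytes_per_value = (bits_per_value / 8) as usize;
    Ok((bytes_per_value, bytes_per_value + ctrl_word_bytes))
}
```

With a helper like this, each scheduling path would only differ in how it obtains `data`, and the checks would live in one place.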
Let me polish a bit before merging.
…on-index' into 3579-cache-repetition-index
Test
Follow-up to #4104: make the code clearer. Signed-off-by: Xuanwo <github@xuanwo.io>
Summary
This PR introduces caching of the repetition index in FullZip encoding to reduce I/O operations when reading variable-length data from remote storage.
Why this change?
When reading variable-length data (like strings) with FullZip encoding, the system needs to load a repetition
index to determine byte offsets. Currently, this index is loaded from storage for every read operation, causing
2 additional I/O requests per query. For remote storage scenarios with sufficient RAM, caching this index can
significantly improve performance.
What's changed?
- FullZipCacheableState struct to store the decoded repetition index in memory
- FullZipScheduler::initialize() that loads and decodes the repetition index once
- schedule_ranges_with_cached_rep() method that uses cached data directly without I/O
- schedule_ranges_rep() now checks for cached data first before falling back to disk reads
- CachedPageData trait system for cache management

This is the foundation work for issue #3579. The caching is currently automatic when a repetition index exists, with configuration options coming in follow-up PRs.

I changed this PR to disable the repetition index cache by default so users won't be affected. I will re-add this part after configuration from field metadata and ReadParams has been added.
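The cache-first lookup with a fallback to storage can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical and simplified: `CachedRepIndex` and `Scheduler` are stand-ins rather than the actual `FullZipCacheableState` or `FullZipScheduler` types, the repetition index is modeled as a plain vector of row start offsets, and storage reads are simulated with a counter (two per uncached call, mirroring the two extra I/O requests described above).

```rust
use std::ops::Range;

/// Hypothetical decoded repetition index: entry `i` is the byte offset where
/// row `i` starts; the final entry marks the end of the last row.
struct CachedRepIndex {
    offsets: Vec<u64>,
}

impl CachedRepIndex {
    /// Map a row range to the byte range holding those rows, with no I/O.
    fn byte_range(&self, rows: &Range<u64>) -> Option<Range<u64>> {
        let start = *self.offsets.get(rows.start as usize)?;
        let end = *self.offsets.get(rows.end as usize)?;
        Some(start..end)
    }
}

/// Hypothetical scheduler. `stored_offsets` simulates the index on storage
/// and `index_reads` counts simulated I/O requests against it.
struct Scheduler {
    cached: Option<CachedRepIndex>,
    stored_offsets: Vec<u64>,
    index_reads: usize,
}

impl Scheduler {
    fn new(stored_offsets: Vec<u64>) -> Self {
        Scheduler { cached: None, stored_offsets, index_reads: 0 }
    }

    /// Load and decode the whole index once (one simulated read).
    fn initialize(&mut self) {
        if self.cached.is_none() {
            self.index_reads += 1;
            self.cached = Some(CachedRepIndex { offsets: self.stored_offsets.clone() });
        }
    }

    /// Use the cache when present; otherwise fall back to reading the index.
    fn schedule_ranges(&mut self, rows: &[Range<u64>]) -> Vec<Range<u64>> {
        if let Some(cache) = &self.cached {
            return rows.iter().filter_map(|r| cache.byte_range(r)).collect();
        }
        // Fallback path: this sketch charges two simulated reads per call.
        self.index_reads += 2;
        let mut out = Vec::new();
        for r in rows {
            let start = self.stored_offsets.get(r.start as usize).copied();
            let end = self.stored_offsets.get(r.end as usize).copied();
            if let (Some(s), Some(e)) = (start, end) {
                out.push(s..e);
            }
        }
        out
    }
}
```

In this model, calling `schedule_ranges` before `initialize` increments `index_reads` on every call, while calls made after `initialize` leave the counter unchanged. The trade-off is memory: the cached index stays resident for as long as the scheduler state does.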