Skip to content

feat(encoding): cache repetition index for FullZip encoding#4104

Merged
Xuanwo merged 6 commits into
mainfrom
3579-cache-repetition-index
Jul 3, 2025
Merged

feat(encoding): cache repetition index for FullZip encoding#4104
Xuanwo merged 6 commits into
mainfrom
3579-cache-repetition-index

Conversation

@Xuanwo
Copy link
Copy Markdown
Collaborator

@Xuanwo Xuanwo commented Jun 30, 2025

Summary

This PR introduces caching for repetition index in FullZip encoding to reduce I/O operations when reading variable-length data from remote storage.

Why this change?

When reading variable-length data (like strings) with FullZip encoding, the system needs to load a repetition
index to determine byte offsets. Currently, this index is loaded from storage for every read operation, causing
2 additional I/O requests per query. For remote storage scenarios with sufficient RAM, caching this index can
significantly improve performance.

What's changed?

  • Added FullZipCacheableState struct to store decoded repetition index in memory
  • Implemented cache initialization in FullZipScheduler::initialize() that loads and decodes the repetition
    index once
  • Added schedule_ranges_with_cached_rep() method that uses cached data directly without I/O
  • Modified schedule_ranges_rep() to check for cached data first before falling back to disk reads
  • Integrated with existing CachedPageData trait system for cache management

This is the foundation work for issue #3579. The caching is currently automatic when a repetition index exists, with configuration options coming in follow-up PRs.


I changed this PR into disable repetition index cache by default so users won't be affected. I will re-add this part after configration from field metadata and ReadParams has been added.

Signed-off-by: Xuanwo <github@xuanwo.io>
@github-actions github-actions Bot added the enhancement New feature or request label Jun 30, 2025
Xuanwo added 3 commits June 30, 2025 18:30
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
@Xuanwo Xuanwo marked this pull request as ready for review June 30, 2025 10:37
@Xuanwo Xuanwo requested a review from westonpace June 30, 2025 10:37
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jun 30, 2025

Codecov Report

Attention: Patch coverage is 68.57143% with 66 lines in your changes missing coverage. Please review.

Project coverage is 77.98%. Comparing base (d517f9b) to head (5d9702e).
Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
.../lance-encoding/src/encodings/logical/primitive.rs 68.57% 63 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4104      +/-   ##
==========================================
- Coverage   78.05%   77.98%   -0.08%     
==========================================
  Files         301      302       +1     
  Lines      102665   103312     +647     
  Branches   102665   103312     +647     
==========================================
+ Hits        80134    80563     +429     
- Misses      19576    19783     +207     
- Partials     2955     2966      +11     
Flag Coverage Δ
unittests 77.98% <68.57%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the right direction. A few suggestions, but nothing urgent.

Comment on lines +1877 to +1923
Ok(async move {
let data = data.await?;
let data = data
.into_iter()
.map(|d| LanceBuffer::from_bytes(d, 1))
.collect();
let num_rows = row_ranges.into_iter().map(|r| r.end - r.start).sum();

match &details.value_decompressor {
PerValueDecompressor::Fixed(decompressor) => {
let bits_per_value = decompressor.bits_per_value();
if bits_per_value == 0 {
return Err(lance_core::Error::Internal {
message: "Invalid encoding: bits_per_value must be greater than 0".into(),
location: location!(),
});
}
if bits_per_value % 8 != 0 {
return Err(lance_core::Error::NotSupported {
source: "Bit-packed full-zip encoding (non-byte-aligned values) is not yet implemented".into(),
location: location!(),
});
}
let bytes_per_value = bits_per_value / 8;
let total_bytes_per_value =
bytes_per_value as usize + details.ctrl_word_parser.bytes_per_word();
Ok(Box::new(FixedFullZipDecoder {
details,
data,
num_rows,
offset_in_current: 0,
bytes_per_value: bytes_per_value as usize,
total_bytes_per_value,
}) as Box<dyn StructuralPageDecoder>)
}
PerValueDecompressor::Variable(_decompressor) => {
Ok(Box::new(VariableFullZipDecoder::new(
details,
data,
num_rows,
bits_per_offset,
bits_per_offset,
)) as Box<dyn StructuralPageDecoder>)
}
}
}
.boxed())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anyway we can consolidate this logic across the two implementations? That way we don't have duplication?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try in the following PRs.

Comment thread rust/lance-encoding/src/encodings/logical/primitive.rs Outdated
Comment thread rust/lance-encoding/src/encodings/logical/primitive.rs Outdated
@Xuanwo Xuanwo marked this pull request as draft July 3, 2025 05:25
@Xuanwo
Copy link
Copy Markdown
Collaborator Author

Xuanwo commented Jul 3, 2025

Let me polish a bit before merging.

Xuanwo added 2 commits July 3, 2025 17:43
Signed-off-by: Xuanwo <github@xuanwo.io>
@Xuanwo
Copy link
Copy Markdown
Collaborator Author

Xuanwo commented Jul 3, 2025

Test index::append::tests::test_query_delta_indices::index_params_2_VectorIndexParams__with_ivf_hnsw_sq_params_MetricType__L2_IvfBui seems flaky. Can't reproduce locally.

@Xuanwo Xuanwo marked this pull request as ready for review July 3, 2025 10:06
@Xuanwo Xuanwo merged commit 254f413 into main Jul 3, 2025
41 of 42 checks passed
@Xuanwo Xuanwo deleted the 3579-cache-repetition-index branch July 3, 2025 10:31
Xuanwo added a commit that referenced this pull request Jul 7, 2025
Follow-up of #4104, make the code
more clear.

Signed-off-by: Xuanwo <github@xuanwo.io>
Xuanwo added a commit that referenced this pull request Jul 7, 2025
Follow-up of #4104, make the code
more clear.

This PR is based on #4141, should
be merged after that.

---------

Signed-off-by: Xuanwo <github@xuanwo.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants