@wjones127
Contributor

  • Changes FileMetadataCache to always use bytes mode. For backwards compatibility, it still accepts the old capacity argument and multiplies it by an assumed entry size of 10MB to compute the capacity in bytes.
  • Rewrites file metadata cache to be more generic. We will later reuse LanceCache to implement a consolidated index cache. In particular, uses strings instead of Path for keys.
  • Refactors how keys are generated: Dataset now keeps a metadata_cache that prefixes every key it inserts with the dataset URI. This way all use of this cache is namespaced, making the cache more resilient to collisions between datasets.
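
The namespacing described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PrefixedCache`, `with_key_prefix`, and the `Arc<Vec<u8>>` value type are hypothetical stand-ins, and the real cache also tracks byte sizes and eviction.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical cache handle: all handles share one map, but each handle
/// prepends its own prefix to every key it reads or writes.
#[derive(Clone, Default)]
struct PrefixedCache {
    entries: Arc<Mutex<HashMap<String, Arc<Vec<u8>>>>>,
    prefix: String,
}

impl PrefixedCache {
    /// Create a sub-cache that shares storage but namespaces its keys.
    fn with_key_prefix(&self, prefix: &str) -> Self {
        Self {
            entries: Arc::clone(&self.entries),
            prefix: format!("{}{}/", self.prefix, prefix),
        }
    }

    fn full_key(&self, key: &str) -> String {
        format!("{}{}", self.prefix, key)
    }

    fn insert(&self, key: &str, value: Arc<Vec<u8>>) {
        let full = self.full_key(key);
        self.entries.lock().unwrap().insert(full, value);
    }

    fn get(&self, key: &str) -> Option<Arc<Vec<u8>>> {
        let full = self.full_key(key);
        let guard = self.entries.lock().unwrap();
        guard.get(&full).cloned()
    }

    /// Number of entries in the shared map, across all prefixes.
    fn len(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}
```

With this shape, two datasets that both cache a file at the same relative path end up with distinct entries, because each dataset's handle carries a different URI prefix.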

@wjones127 wjones127 changed the title from "feat!: move file metadata cache to size capacity" to "feat!: move file metadata cache to bytes capacity" Jun 5, 2025
@github-actions github-actions bot added the java label Jun 13, 2025
@wjones127 wjones127 force-pushed the feat/bytes-based-metadata-cache branch from fc07566 to dd8bc97 Compare June 13, 2025 21:21
@codecov-commenter

codecov-commenter commented Jun 13, 2025

Codecov Report

Attention: Patch coverage is 81.00775% with 49 lines in your changes missing coverage. Please review.

Project coverage is 78.70%. Comparing base (a499cfa) to head (47983b6).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-core/src/cache.rs | 83.17% | 16 Missing and 2 partials ⚠️ |
| java/core/lance-jni/src/blocking_dataset.rs | 0.00% | 6 Missing ⚠️ |
| rust/lance/src/dataset.rs | 64.70% | 6 Missing ⚠️ |
| rust/lance/src/session.rs | 42.85% | 4 Missing ⚠️ |
| rust/lance-encoding-datafusion/src/zone.rs | 0.00% | 3 Missing ⚠️ |
| rust/lance-index/src/vector/ivf/shuffler.rs | 25.00% | 3 Missing ⚠️ |
| rust/lance-index/src/scalar/bitmap.rs | 33.33% | 2 Missing ⚠️ |
| rust/lance/src/dataset/builder.rs | 75.00% | 2 Missing ⚠️ |
| rust/lance/src/dataset/rowids.rs | 77.77% | 2 Missing ⚠️ |
| java/core/lance-jni/src/file_reader.rs | 0.00% | 1 Missing ⚠️ |

... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3949      +/-   ##
==========================================
+ Coverage   78.64%   78.70%   +0.05%     
==========================================
  Files         285      285              
  Lines      113290   113324      +34     
  Branches   113290   113324      +34     
==========================================
+ Hits        89100    89190      +90     
+ Misses      20752    20695      -57     
- Partials     3438     3439       +1     
Flag Coverage Δ
unittests 78.70% <81.00%> (+0.05%) ⬆️

@wjones127 wjones127 marked this pull request as ready for review June 16, 2025 19:43
Member

@westonpace westonpace left a comment

Nice cleanup and 🎉 for moving from paths to strings.

@Deprecated
public Builder setMetadataCacheSize(int metadataCacheSize) {
this.metadataCacheSize = metadataCacheSize;
int assumedEntrySize = 10 * 1024 * 1024; // 10MB per entry
Member

10MB feels high to me for metadata.

Contributor Author

I just changed this to 4MiB. My reasoning is that this makes the defaults equivalent: the previous default was 256 items, and 256 * 4MiB is 1GiB, the new default.

Member

Perfect, seems reasonable to me
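
The arithmetic behind that choice can be checked with a short sketch. It is written in Rust rather than in the Java of the snippet under review, and the constant and function names are hypothetical, not taken from the PR. It assumes the legacy argument is an entry count that gets multiplied by a fixed per-entry size.

```rust
/// Assumed size of one metadata entry, used only to translate the
/// legacy "number of entries" capacity into bytes (hypothetical name).
const ASSUMED_ENTRY_SIZE_BYTES: usize = 4 * 1024 * 1024; // 4 MiB

/// New default capacity discussed in this thread: 1 GiB.
const DEFAULT_METADATA_CACHE_SIZE_BYTES: usize = 1024 * 1024 * 1024;

/// Convert a legacy entry-count capacity into a byte capacity.
/// Saturates instead of overflowing for very large counts.
fn legacy_capacity_to_bytes(num_entries: usize) -> usize {
    num_entries.saturating_mul(ASSUMED_ENTRY_SIZE_BYTES)
}
```

With these values, the old default of 256 entries maps exactly onto the new 1 GiB default, so callers who never set the capacity see the same effective limit.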

fn new<T: DeepSizeOf + Send + Sync + 'static>(record: Arc<T>) -> Self {
let size_accessor =
- |record: &ArcAny| -> usize { record.downcast_ref::<T>().unwrap().deep_size_of() };
+ |record: &ArcAny| -> usize { record.clone().downcast::<T>().unwrap().deep_size_of() };
Member

Why is a clone needed?

Contributor Author

Ah I see I can avoid that with downcast_ref. Before I was downcasting the Arc itself, which consumes the Arc. But downcast_ref takes &self.
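
The difference between the two downcasts can be seen in a small standalone sketch. `ByteSize` and `Page` are hypothetical stand-ins for `DeepSizeOf` and a cached record type, and `ArcAny` is assumed here to be `Arc<dyn Any + Send + Sync>`. `Arc::downcast` takes the `Arc` by value, so calling it through a shared reference requires a clone first, whereas `downcast_ref` only borrows.

```rust
use std::any::Any;
use std::sync::Arc;

type ArcAny = Arc<dyn Any + Send + Sync>;

/// Hypothetical stand-in for the `DeepSizeOf` trait.
trait ByteSize {
    fn byte_size(&self) -> usize;
}

struct Page(Vec<u8>);

impl ByteSize for Page {
    fn byte_size(&self) -> usize {
        std::mem::size_of::<Self>() + self.0.capacity()
    }
}

/// `Arc::downcast` consumes the Arc, so we must clone it first
/// (a reference-count increment and decrement per call).
fn size_via_owned_downcast<T: ByteSize + Send + Sync + 'static>(record: &ArcAny) -> usize {
    record.clone().downcast::<T>().unwrap().byte_size()
}

/// `downcast_ref` borrows the inner `dyn Any`, so no clone is needed.
fn size_via_downcast_ref<T: ByteSize + Send + Sync + 'static>(record: &ArcAny) -> usize {
    record.downcast_ref::<T>().unwrap().byte_size()
}
```

Both functions report the same size; the second simply avoids touching the reference count.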

pub const BLOB_DIR: &str = "_blobs";
pub(crate) const DEFAULT_INDEX_CACHE_SIZE: usize = 256;
- pub(crate) const DEFAULT_METADATA_CACHE_SIZE: usize = 256;
+ // Default to 1 GiB for the metadata cache. Column metadata can be like 40MB,
Member

40MB? At what scale?

Contributor Author

@wjones127 wjones127 Jun 17, 2025

Ah, I think I asked you at one point how big this could be, and thought you said 40MB. Though maybe that's only for very, very large Lance files. I don't quite remember at what scale.

Comment on lines 27 to 32
/// Global cache for file metadata.
///
/// Sub-caches are created from this cache for each dataset by adding the
/// URI as a key prefix. See the [`LanceDataset::metadata_cache`] field.
/// This prevents collisions between different datasets.
pub(crate) file_metadata_cache: LanceCache,
Member

Should we just call it metadata_cache?

Contributor Author

Sure. It was convenient to have a different name during the refactor so I could spot uses of file_metadata_cache and replace them with metadata_cache. But now it's not as helpful.

"file_metadata_cache",
&format!(
- "FileMetadataCache(items={})",
+ "LanceCache(items={})",
Member

Maybe change to printing the byte size?
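
One possible shape for that suggestion, as a sketch only: `CacheStats` and its fields are hypothetical, and how the real cache would obtain its current byte size is not shown in this thread.

```rust
use std::fmt;

/// Hypothetical snapshot of a cache's state for debug output.
struct CacheStats {
    items: usize,
    size_bytes: usize,
}

impl fmt::Debug for CacheStats {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Report both the entry count and the total byte size, since
        // capacity is now expressed in bytes rather than items.
        write!(
            f,
            "LanceCache(items={}, size_bytes={})",
            self.items, self.size_bytes
        )
    }
}
```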

}

let item = Arc::new(MyType(42));
let item_dyn: Arc<dyn MyTrait> = item;
Member

Isn't Arc<dyn Foo> sized? Why is insert_unsized needed here? Also, why do we need insert_unsized at all?

Contributor Author

I'll try again to remove this and see if I can.

Contributor Author

Later on we'll want to store Arc<dyn ScalarIndex> in here.

This is where I got stuck:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=f641e98ab95dfb6186c7c4518775ae98
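
For context on the question above, here is a sketch of one way a type-erased cache could hold an `Arc<dyn Trait>`. It is an assumption for illustration, not a description of what this PR or the linked playground does; `AnyCache`, `get_unsized`, and the trait and type definitions are hypothetical. The idea is that `Arc<dyn MyTrait>` is itself a sized, `'static` type, so it can be wrapped in a second `Arc` and stored as an ordinary `dyn Any` value, at the cost of one extra indirection.

```rust
use std::any::Any;
use std::collections::HashMap;
use std::sync::Arc;

type ArcAny = Arc<dyn Any + Send + Sync>;

trait MyTrait: Send + Sync {
    fn value(&self) -> i32;
}

struct MyType(i32);

impl MyTrait for MyType {
    fn value(&self) -> i32 {
        self.0
    }
}

/// Hypothetical type-erased cache keyed by strings.
#[derive(Default)]
struct AnyCache {
    entries: HashMap<String, ArcAny>,
}

impl AnyCache {
    /// Store a sized value; `Arc<T>` coerces to `Arc<dyn Any + Send + Sync>`.
    fn insert<T: Send + Sync + 'static>(&mut self, key: &str, value: Arc<T>) {
        self.entries.insert(key.to_string(), value);
    }

    fn get<T: Send + Sync + 'static>(&self, key: &str) -> Option<Arc<T>> {
        self.entries.get(key)?.clone().downcast::<T>().ok()
    }

    /// Store a possibly unsized value by treating `Arc<T>` itself as the
    /// concrete stored type (double indirection).
    fn insert_unsized<T: ?Sized + Send + Sync + 'static>(&mut self, key: &str, value: Arc<T>) {
        self.insert(key, Arc::new(value));
    }

    fn get_unsized<T: ?Sized + Send + Sync + 'static>(&self, key: &str) -> Option<Arc<T>> {
        let outer: Arc<Arc<T>> = self.get::<Arc<T>>(key)?;
        Some(Arc::clone(&*outer))
    }
}
```

In this sketch the separate `*_unsized` methods exist only because the `Sized` methods take `Arc<T>` and erase `T` directly, which cannot work when `T` is `dyn MyTrait`; a lookup must then ask for the wrapped `Arc<dyn MyTrait>` type rather than the original concrete type.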

@wjones127 wjones127 merged commit ec1f7d1 into lance-format:main Jun 18, 2025
26 of 28 checks passed
@wjones127 wjones127 deleted the feat/bytes-based-metadata-cache branch June 18, 2025 22:57