-
Notifications
You must be signed in to change notification settings - Fork 526
feat!: move file metadata cache to bytes capacity #3949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: move file metadata cache to bytes capacity #3949
Conversation
…oss Java and Rust implementations
fc07566 to
dd8bc97
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3949 +/- ##
==========================================
+ Coverage 78.64% 78.70% +0.05%
==========================================
Files 285 285
Lines 113290 113324 +34
Branches 113290 113324 +34
==========================================
+ Hits 89100 89190 +90
+ Misses 20752 20695 -57
- Partials 3438 3439 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice cleanup and 🎉 for moving from paths to strings.
| @Deprecated | ||
| public Builder setMetadataCacheSize(int metadataCacheSize) { | ||
| this.metadataCacheSize = metadataCacheSize; | ||
| int assumedEntrySize = 10 * 1024 * 1024; // 10MB per entry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10MB feels high to me for metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this just now to 4MB. My reason for this is that makes the defaults equivalent. The previous default is 256 items. 256 * 4MiB is 1GiB, the new default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, seems reasonable to me
rust/lance-core/src/cache.rs
Outdated
| fn new<T: DeepSizeOf + Send + Sync + 'static>(record: Arc<T>) -> Self { | ||
| let size_accessor = | ||
| |record: &ArcAny| -> usize { record.downcast_ref::<T>().unwrap().deep_size_of() }; | ||
| |record: &ArcAny| -> usize { record.clone().downcast::<T>().unwrap().deep_size_of() }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is a clone needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see I can avoid that with downcast_ref. Before I was downcasting the Arc itself, which consumes the Arc. But downcast_ref takes &self.
| pub const BLOB_DIR: &str = "_blobs"; | ||
| pub(crate) const DEFAULT_INDEX_CACHE_SIZE: usize = 256; | ||
| pub(crate) const DEFAULT_METADATA_CACHE_SIZE: usize = 256; | ||
| // Default to 1 GiB for the metadata cache. Column metadata can be like 40MB, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
40MB? At what scale?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I think I asked you how big this could be at one point, and thought you said 40MB. Though maybe that's very very large lance files. I don't quite remember at what scale.
rust/lance/src/session.rs
Outdated
| /// Global cache for file metadata. | ||
| /// | ||
| /// Sub-caches are created from this cache for each dataset by adding the | ||
| /// URI as a key prefix. See the [`LanceDataset::metadata_cache`] field. | ||
| /// This prevents collisions between different datasets. | ||
| pub(crate) file_metadata_cache: LanceCache, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just call it metadata_cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. It was convenient to have a different name for the refactor so I could spot uses of file_metadata_cache and replace with metadata_cache. But now it's not as helpful.
rust/lance/src/session.rs
Outdated
| "file_metadata_cache", | ||
| &format!( | ||
| "FileMetadataCache(items={})", | ||
| "LanceCache(items={})", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe change to printing the byte size?
| } | ||
|
|
||
| let item = Arc::new(MyType(42)); | ||
| let item_dyn: Arc<dyn MyTrait> = item; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't Arc<dyn Foo> sized? Why is insert_unsized needed here? Also, why do we need insert_unsized at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try again to remove this and see if I can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Later on we'll want to store Arc<dyn ScalarIndex> in here.
This is where I got stuck:
FileMetadataCacheto always use bytes mode. For backwards compatibility, takes old capacity argument and uses 10MB assumed entry size to compute actual size.LanceCacheto implement a consolidated index cache. In particular, uses strings instead of Path for keys.metadata_cachethat inserts with a prefix of the dataset URI. This way all use of this cache is namespaced, making the cache more resilient to collisions between datasets.