
feat!: move object store registry to the session, re-use stores #3689

Merged
wjones127 merged 13 commits into lance-format:main from wjones127:feat/cache-stores
Apr 18, 2025

@wjones127
Contributor

@wjones127 wjones127 commented Apr 15, 2025

BREAKING CHANGE: removes object_store_registry from WriteParams and ReadParams. The registry is now taken from the Session, which is already on those parameters. Also, most ObjectStore constructors now return Arc<ObjectStore>.

Closes #3684

  • Add a cache of in-use object stores within the registry.
  • Move ObjectStoreRegistry onto the Session object. Combined with the cache, this lets datasets using the same session share object stores, as long as they use the same parameters.
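
A minimal sketch of the sharing behavior described above, using only the standard library. The types `Store`, `StoreParams`, `Registry`, `Session`, and `Dataset` here are simplified stand-ins invented for illustration and are not the real lance types or signatures; the key's `String` component is assumed to be a base location such as a bucket URI.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock, Weak};

// Hypothetical stand-in for ObjectStoreParams (assumption, not the lance type).
#[derive(Clone, Debug, PartialEq, Eq, Hash, Default)]
struct StoreParams {
    region: Option<String>,
}

// Hypothetical stand-in for an object store.
#[derive(Debug)]
struct Store {
    base: String,
}

#[derive(Default)]
struct Registry {
    // Weak references: the cache alone never keeps a store alive.
    active: RwLock<HashMap<(String, StoreParams), Weak<Store>>>,
}

impl Registry {
    fn get_store(&self, base: &str, params: &StoreParams) -> Arc<Store> {
        let key = (base.to_string(), params.clone());
        // Fast path: reuse a live store while holding only the read lock.
        let cached = {
            let active = self.active.read().unwrap();
            active.get(&key).and_then(|weak| weak.upgrade())
        };
        if let Some(store) = cached {
            return store;
        }
        // Slow path: re-check under the write lock, then create and record.
        let mut active = self.active.write().unwrap();
        if let Some(store) = active.get(&key).and_then(|weak| weak.upgrade()) {
            return store;
        }
        let store = Arc::new(Store { base: base.to_string() });
        active.insert(key, Arc::downgrade(&store));
        store
    }
}

// The session owns the registry, so datasets opened with the same session
// look up stores in the same cache.
#[derive(Default)]
struct Session {
    registry: Arc<Registry>,
}

struct Dataset {
    store: Arc<Store>,
}

impl Dataset {
    fn open(base: &str, params: &StoreParams, session: &Session) -> Self {
        Dataset { store: session.registry.get_store(base, params) }
    }
}

fn main() {
    let session = Session::default();
    let params = StoreParams::default();

    // Same session, same base, same parameters: one shared store.
    let a = Dataset::open("s3://bucket", &params, &session);
    let b = Dataset::open("s3://bucket", &params, &session);
    assert!(Arc::ptr_eq(&a.store, &b.store));
    assert_eq!(a.store.base, "s3://bucket");

    // Different parameters: a separate store, even within the same session.
    let other = StoreParams { region: Some("us-west-2".to_string()) };
    let c = Dataset::open("s3://bucket", &other, &session);
    assert!(!Arc::ptr_eq(&a.store, &c.store));
}
```

In this sketch, two datasets opened with different sessions would not share a store, since each session carries its own registry.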

@github-actions github-actions bot added the enhancement, python, and java labels Apr 15, 2025
@codecov-commenter

codecov-commenter commented Apr 15, 2025

Codecov Report

Attention: Patch coverage is 84.95763% with 71 lines in your changes missing coverage. Please review.

Project coverage is 78.41%. Comparing base (9dde5ea) to head (33d14dd).

Files with missing lines                           Patch %  Lines
rust/lance-io/src/object_store.rs                  66.66%   16 Missing and 16 partials ⚠️
rust/lance-io/src/object_store/providers.rs        84.16%   15 Missing and 4 partials ⚠️
rust/lance/src/session.rs                          57.14%   9 Missing ⚠️
rust/lance-io/src/object_store/providers/local.rs  90.47%   0 Missing and 2 partials ⚠️
rust/lance/src/dataset.rs                          97.72%   2 Missing ⚠️
rust/lance/src/dataset/builder.rs                  83.33%   2 Missing ⚠️
java/core/lance-jni/src/blocking_dataset.rs        0.00%    1 Missing ⚠️
rust/lance-io/src/object_store/providers/aws.rs    94.44%   0 Missing and 1 partial ⚠️
rust/lance-io/src/object_store/providers/azure.rs  90.00%   0 Missing and 1 partial ⚠️
rust/lance-io/src/object_store/providers/gcp.rs    90.00%   0 Missing and 1 partial ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3689      +/-   ##
==========================================
+ Coverage   78.38%   78.41%   +0.03%     
==========================================
  Files         267      267              
  Lines      100049   100257     +208     
  Branches   100049   100257     +208     
==========================================
+ Hits        78421    78619     +198     
- Misses      18513    18516       +3     
- Partials     3115     3122       +7     
Flag Coverage Δ
unittests 78.41% <84.95%> (+0.03%) ⬆️


@wjones127 wjones127 changed the title from "feat: re-use object store instances" to "feat!: move object store registry to the session, re-use stores" Apr 16, 2025
Comment thread rust/lance/src/dataset.rs
Comment on lines +6101 to +6102
#[tokio::test]
async fn test_session_store_registry() {
Contributor Author

This is the main test of these changes.

@wjones127 wjones127 force-pushed the feat/cache-stores branch 2 times, most recently from bb0047c to fb7b579 April 17, 2025 17:28
@wjones127 wjones127 marked this pull request as ready for review April 17, 2025 21:17
Member

@westonpace westonpace left a comment

A few questions but no major concerns.

Comment thread rust/lance-io/src/object_store.rs Outdated
})
}

fn extract_path(url: &Url) -> Path {
Member

Maybe some comment here about what this function does?

Contributor Author

Yeah this isn't clear. I think it actually makes much more sense to move it to the store provider.

Ok((store, path))
}

#[deprecated(note = "Use `from_uri` instead")]
Member

So now we can just pass a path into from_uri?

Contributor Author

Yeah that should work just the same. Trying to reduce the number of code paths we have to handle.

Comment thread rust/lance-io/src/object_store/providers.rs Outdated
let valid_schemes = self
.providers
.read()
.expect("ObjectStoreRegistry lock poisoned")
Member

This expect message isn't unique. If this triggers, we won't know which statement triggered it. Can we do something like LanceOptionExt::expect_ok for results? That adds the location and avoids needing to specify a string.

Contributor Author

Done.
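
For readers unfamiliar with the idea in this thread, here is a rough sketch of how a location-reporting helper for `Result` can be written with the standard `#[track_caller]` attribute. The trait name `ResultExt` and method name `expect_ok_here` are hypothetical and chosen for illustration; this is not the helper that exists in lance, whose exact shape is not shown in this conversation.

```rust
use std::fmt::Debug;
use std::panic::Location;
use std::sync::RwLock;

// Hypothetical extension trait (assumption): unwraps a Result and, on error,
// panics with the caller's file and line instead of a hand-written message.
trait ResultExt<T> {
    #[track_caller]
    fn expect_ok_here(self) -> T;
}

impl<T, E: Debug> ResultExt<T> for Result<T, E> {
    #[track_caller]
    fn expect_ok_here(self) -> T {
        match self {
            Ok(value) => value,
            Err(err) => {
                // With #[track_caller], this is the location of the call site,
                // not of this function body.
                let loc = Location::caller();
                panic!(
                    "expected Ok at {}:{}, got Err({:?})",
                    loc.file(),
                    loc.line(),
                    err
                );
            }
        }
    }
}

fn main() {
    // A lock guarding some registered schemes, standing in for the registry's
    // provider map.
    let providers = RwLock::new(vec!["s3", "gs", "az"]);

    // No message string is needed; a poisoned lock would report this line.
    let schemes = providers.read().expect_ok_here().clone();
    assert_eq!(schemes, vec!["s3", "gs", "az"]);
}
```

Because the location comes from the call site, every use of the helper produces a distinct panic message without each caller having to invent one.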

Comment thread rust/lance-io/src/object_store/providers.rs Outdated
// Cache of object stores currently in use. We use a weak reference so the
// cache itself doesn't keep them alive if no object store is actually using
// it.
active_stores: RwLock<HashMap<(String, ObjectStoreParams), Weak<ObjectStore>>>,
Member

So when will this session be active? Is it only when we have multiple readers / writers?

Contributor Author

Yeah this is mainly to address issues when we have a lot of tables open on the same bucket. We might one day cache with TTL, but not sure if it's worth the complexity given that it's not too expensive to reopen the connection.
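
To make the lifetime behavior of a weak-reference cache entry concrete, here is a small standard-library sketch. `Store` and `is_live` are hypothetical names used only for illustration; the point is that an entry can be reused only while some other owner still holds the store, which is the trade-off against a TTL cache mentioned above.

```rust
use std::sync::{Arc, Weak};

// Hypothetical stand-in for an object store holding a connection pool.
struct Store;

// An entry is reusable only if some strong reference still exists.
fn is_live(entry: &Weak<Store>) -> bool {
    entry.upgrade().is_some()
}

fn main() {
    let store = Arc::new(Store);
    // What a cache entry would hold: a weak handle that does not count as an owner.
    let cached: Weak<Store> = Arc::downgrade(&store);

    // While a dataset still holds the Arc, the entry can be upgraded and reused.
    let reused = cached.upgrade().expect("store is still alive");
    assert!(Arc::ptr_eq(&store, &reused));

    // Once every strong reference is gone, the store itself is dropped.
    drop(reused);
    drop(store);

    // The entry is now dead, so the next open would have to build a new store.
    assert!(!is_live(&cached));
    assert_eq!(cached.strong_count(), 0);
}
```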

active_stores: RwLock<HashMap<(String, ObjectStoreParams), Weak<ObjectStore>>>,
}

impl DeepSizeOf for ObjectStoreRegistry {
Member

Why do we need to impl DeepSizeOf for ObjectStoreRegistry? The registry itself is not cached, is it? Or is this just for debugging purposes?

Contributor Author

I guess it's not very helpful. I had it because we added the registry on Session, and Session does derive(DeepSizeOf). But I can just manually implement it for Session and skip this. Its memory size doesn't matter that much.

@wjones127 wjones127 merged commit 64d3ecb into lance-format:main Apr 18, 2025
25 of 27 checks passed
@wjones127 wjones127 deleted the feat/cache-stores branch April 18, 2025 15:15

Development

Successfully merging this pull request may close these issues.

Cache object stores in session

3 participants