Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-42934: Run existing unit tests against RemoteButler #961

Merged
merged 8 commits into from Feb 19, 2024

Conversation

dhirving
Copy link
Contributor

@dhirving dhirving commented Feb 16, 2024

Added a new Butler subclass HybridButler that delegates methods to RemoteButler where implemented, and uses DirectButler for the rest. Extracted the test server setup from test_server.py into a reusable function. Modified test_butler.py to run tests against Butler server via HybridButler.

This revealed a few minor incompatibilities between RemoteButler and DirectButler, so updated RemoteButler to conform with the DirectButler behavior.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@dhirving dhirving force-pushed the tickets/DM-42934 branch 2 times, most recently from 5ac533b to cbb532b Compare February 16, 2024 16:49
Pull Butler server test setup out to a separate function, to allow it to be re-used in tests outside of test_server.py.
DirectButler throws a FileNotFoundError for missing datasets in get(), not the LookupError that RemoteButler was throwing.
DirectButler throws if get() is given both a DatasetRef and DataId -- RemoteButler now does so as well for consistency.
Copy link

codecov bot commented Feb 16, 2024

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (3e35ddf) 88.39% compared to head (2e9104e) 88.41%.

Files Patch % Lines
python/lsst/daf/butler/tests/hybrid_butler.py 85.56% 14 Missing ⚠️
tests/test_butler.py 95.77% 2 Missing and 1 partial ⚠️
python/lsst/daf/butler/tests/server.py 96.49% 2 Missing ⚠️
tests/test_server.py 95.23% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #961      +/-   ##
==========================================
+ Coverage   88.39%   88.41%   +0.02%     
==========================================
  Files         307      309       +2     
  Lines       39508    39690     +182     
  Branches     8316     8331      +15     
==========================================
+ Hits        34922    35091     +169     
- Misses       3373     3385      +12     
- Partials     1213     1214       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dhirving dhirving marked this pull request as ready for review February 16, 2024 20:54
Copy link
Member

@timj timj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. I really like the hybrid butler approach -- makes it very clear what works with remote server.

_remote_butler: RemoteButler
_direct_butler: Butler

def __new__(cls, remote_butler: RemoteButler, direct_butler: Butler) -> "HybridButler":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def __new__(cls, remote_butler: RemoteButler, direct_butler: Butler) -> "HybridButler":
def __new__(cls, remote_butler: RemoteButler, direct_butler: Butler) -> Self:

Does typing.Self work here? (maybe also need from __future__ import annotations to remove the quotes?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self doesn't for some reason... it causes mypy to throw "expected Self but got HybridButler".

from __future__ import annotations works though, thanks -- I could not for the life of me figure out why this exact same pattern worked in DirectButler and RemoteButler but not in this file.

storageClass: str | StorageClass | None = None,
**kwargs: Any,
) -> DeferredDatasetHandle:
return self._direct_butler.getDeferred(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getDeferred in theory is simple to implement in remote butler since it is a deferred get and get works already. The minor complication is that the direct butler implementation does a check to make sure the get is likely to work (calling datastore at the moment but it should really call self.exists() and self.stored()) before it returns the deferred handle (this is so that you don't spend hours doing some processing only to find that the get fails). This might be something to add fairly soon (not on this ticket) along with exists and stored.

run: str | None = None,
inferDefaults: bool = True,
**kwargs: Any,
) -> Butler:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Self better here?

) -> tuple[Butler, DatasetType]:
"""Create a Butler for the test repository and insert some test data
into it.
"""
butler = Butler.from_config(self.tmpConfigFile, run=run)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this call self.create_empty_butler() here? Or is there some restriction on this having to be a direct butler? If it can't call it, maybe add a comment to that effect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense and will let me get rid of the other method override below. I didn't discover the need for create_empty_butler until later in the process so I didn't think about it.

butler = Butler.from_config(self.tmpConfigFile, run=run)
assert isinstance(butler, DirectButler), "Expect DirectButler in configuration"

collections = set(butler.registry.queryCollections())
self.assertEqual(collections, {run})

return self._setup_butler_data(butler, storageClass, datasetTypeName)

def create_empty_butler(self, run: str | None = None, writeable: bool | None = None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def create_empty_butler(self, run: str | None = None, writeable: bool | None = None):
def create_empty_butler(self, run: str | None = None, writeable: bool | None = None) -> Butler:

self.assertIsInstance(uri, ResourcePath)
self.assertIn("424", str(uri), f"Checking visit is in URI {uri}")
self.assertEqual(uri.fragment, "predicted", f"Checking for fragment in {uri}")
if self.predictionSupported:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a temporary restriction on remote butler or a permanent one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not currently implemented. I don't really understand the use cases for that feature so I'm not sure whether it makes sense to implement in the future or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's a tricky one. People wanted to know what the path would be if they were to store the dataset. In client/server this becomes a bit odder since you have to decide what it comes back as since returning a signed URL to something that doesn't exist seems odd.

We have discussed in the past whether getURI should have the option of returning the server-side URI (s3://bucket/path_in_store/file.fits) rather than the client-side signed URL -- presumably this is assumed to be a bad idea because the bucket name shouldn't leak to an untrusted user? Maybe for predicted URIs we could return something like butler://datastore_name/path_in_store/file.fits#predicted so that people could see what the final filename template would resolve to. I'm not sure how many users care about this at this point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could become a DirectButler-only interface? It certainly won't be the last one; we might even want to go ahead and move the entire write interface from Butler to DirectButler at some point.

def testStringification(self) -> None:
# RemoteButler does not have datastore/registry strings like
# DirectButler does.
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a bespoke test here that checks that RemoteButler stringifies into something reasonable? Maybe with the URL of the server?

Add a "HybridButler" class that uses RemoteButler when possible, and DirectButler for methods not yet implemented by RemoteButler.  Use HybridButler to  run parts of the main Butler test suite against RemoteButler.
When running unit tests in Docker, the tests and the main library code are not in the same folder.  Explicitly specify the test directory when creating the server so it is able to locate its configuration file.
Add a way to optionally skip "predict mode" tests, so we can run the rest of the composite tests against Butler server.
@dhirving dhirving merged commit 81146d7 into main Feb 19, 2024
18 checks passed
@dhirving dhirving deleted the tickets/DM-42934 branch February 19, 2024 19:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants