DM-42947: Implement RemoteButler.retrieveArtifacts #964
Conversation
Codecov report:

@@           Coverage Diff           @@
##             main     #964   +/-  ##
=======================================
+ Coverage   88.47%   88.49%   +0.02%
=======================================
  Files         309      310       +1
  Lines       39740    39783      +43
  Branches     8340     8348       +8
=======================================
+ Hits        35159    35206      +47
+ Misses       3368     3366       -2
+ Partials     1213     1211       -2
Force-pushed from adf49d2 to 107bed3.
Force-pushed from 107bed3 to a3d9631.
Looks okay. I think there will need to be a refactor of it at some point in the future to be more efficient (we have had people complaining that it's slow in direct butler when the number of refs is reasonably large).
Parameters
----------
destination_directory : `ResourcePath`
Suggested change:
- destination_directory : `ResourcePath`
+ destination_directory : `~lsst.resources.ResourcePath`
I don't think Sphinx resolves them otherwise, because ResourcePath is not part of daf_butler. It only really matters if this docstring turns up in the Sphinx build, and in the longer term we keep dreaming that Sphinx will pick up the type annotations and we can remove the type from here...
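For illustration, a minimal numpydoc-style docstring using the fully qualified form. The function name here is hypothetical; the point is only the `~lsst.resources.ResourcePath` cross-reference target:

```python
def retrieve_artifacts(destination_directory):
    """Retrieve artifacts into a directory.

    Parameters
    ----------
    destination_directory : `~lsst.resources.ResourcePath`
        Where to write the files.  The leading ``~`` tells Sphinx to
        display only ``ResourcePath`` while linking to the fully
        qualified ``lsst.resources.ResourcePath`` target, which lives
        outside this package.
    """
```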
FYI I asked Jonathan about this at JTM. The latest version of the documentation tooling does do that; it's just that Science Pipelines has not upgraded to use it yet.
target_path: ResourcePathExpression
if preserve_path:
    target_path = source_path
I know this was copied from above but looking at it again it might be clearer from a typing perspective as:
target_path: str
if preserve_path:
    if source_path.isabs():
        target_path = source_path.relativeToPathRoot
    else:
        target_path = source_path.path
else:
    target_path = source_path.basename()
then target_path is a simple string and doesn't have to be a ResourcePath in one branch. Thoughts?
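To make the branching concrete, here is a stdlib-only stand-in that mirrors the suggested logic on plain path strings. The helper and its `posixpath` equivalents of the `ResourcePath` methods are assumptions for illustration, not the real lsst.resources API:

```python
import posixpath


def determine_target_path(source_path: str, preserve_path: bool) -> str:
    """Mimic the suggested branching: always return a plain string."""
    if preserve_path:
        if posixpath.isabs(source_path):
            # Stand-in for ResourcePath.relativeToPathRoot: drop the
            # leading "/" so the path can be joined under the destination.
            return source_path.lstrip("/")
        # Stand-in for ResourcePath.path on a relative path.
        return source_path
    # Stand-in for ResourcePath.basename(): keep only the file name.
    return posixpath.basename(source_path)
```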
seems reasonable
Actually, should it be unquoted_path instead? I'm not sure about the vagaries of ResourcePath quoting and unquoting. I might just leave it as it was, since otherwise it's going to get immediately converted back into a ResourcePath in join() anyway.
Yes, join uses unquoted_path (so it does convert the resource path to a string inside there).
https://github.com/lsst/resources/blob/main/python/lsst/resources/_resourcePath.py#L742
Leaving it is fine, it is just a bit more convoluted with the typing as it currently stands.
target_uri = determine_destination_for_retrieved_artifact(
    destination, relative_path, preserve_path
)
target_uri.transfer_from(source_uri, transfer="copy", overwrite=overwrite)
This loop is organized slightly differently to FileDatastore in that it does the transfer in the main loop, and so can't report how many files it's about to transfer. The debug message was there before the transfer for cases where people have mistakenly asked for 10,000 datasets and are wondering why it's taking so long. We have had discussions in the past about refusing to do the transfers if there are too many refs, and also about implementing parallelism in the transfers (which is why it was written initially to do the transfers after all the database queries, so it could be modified to use futures).
I realize looking at the original now that it was never upgraded to use the bulk query interface _get_stored_records_associated_with_refs, but instead uses the much slower per-ref API that we had originally. This is another slowdown -- we have had people wondering what is going on even with a few hundred datasets, so a reorganization of this implementation to have a bulk endpoint ("give me the payloads of N refs") with futures to download with 10 threads will likely be important at some point.
This is intentional here to some extent... because of URL expiration, we don't want to request a huge number of URLs upfront and then download all of them at the end. If the files are large, the URLs will expire before we start downloading the later ones.
I think a bulk/parallel version of this would probably request like 10 URLs at a time from the server, instead of all of them.
Can you add a comment here to say that the reason is because of expiration?
Maybe also add a verbose log message after the loop to report how many files were transferred.
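Putting both suggestions together, a hedged sketch of the loop shape: signed URLs fetched in small batches so they don't expire before use, plus a summary log after the loop. All helper names here are hypothetical, not the real implementation:

```python
import logging

log = logging.getLogger(__name__)


def transfer_all(refs, get_signed_urls, transfer, batch_size=10):
    """Transfer files for ``refs``; return how many were transferred."""
    transferred = 0
    for start in range(0, len(refs), batch_size):
        batch = refs[start : start + batch_size]
        # Signed URLs expire, so request only a small batch right before
        # downloading rather than all of them up front.
        for url in get_signed_urls(batch):
            transfer(url)
            transferred += 1
    # Verbose summary after the loop, as suggested in the review.
    log.info("Transferred %d file(s).", transferred)
    return transferred
```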
Add an implementation for RemoteButler.retrieveArtifacts matching the DirectButler behavior. The code used to determine output file paths in FileDatastore.retrieveArtifacts() was factored out to a shared function.
Force-pushed from a3d9631 to 5804ff3.
Checklist
- doc/changes