
DM-42947: Implement RemoteButler.retrieveArtifacts #964

Merged: 1 commit into main from tickets/DM-42947 on Feb 22, 2024
Conversation

@dhirving (Contributor) commented Feb 20, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

codecov bot commented Feb 20, 2024

Codecov Report

Attention: Patch coverage is 96.49123%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 88.49%. Comparing base (b06f0c5) to head (5804ff3).

Files Patch % Lines
python/lsst/daf/butler/datastores/fileDatastore.py 81.81% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #964      +/-   ##
==========================================
+ Coverage   88.47%   88.49%   +0.02%     
==========================================
  Files         309      310       +1     
  Lines       39740    39783      +43     
  Branches     8340     8348       +8     
==========================================
+ Hits        35159    35206      +47     
+ Misses       3368     3366       -2     
+ Partials     1213     1211       -2     


@dhirving force-pushed the tickets/DM-42947 branch 6 times, most recently from adf49d2 to 107bed3 on February 21, 2024 at 22:45
@dhirving marked this pull request as ready for review on February 21, 2024 at 23:03
@timj (Member) left a comment

Looks okay. I think there will need to be a refactor of it at some point in the future to be more efficient (we have had people complaining that it's slow in direct butler when the number of refs is reasonably large).


Parameters
----------
destination_directory : `ResourcePath`
Member

Suggested change
destination_directory : `ResourcePath`
destination_directory : `~lsst.resources.ResourcePath`

I don't think Sphinx resolves them otherwise, because ResourcePath is not part of daf_butler. It only really matters if this docstring turns up in the Sphinx docs, and in the longer term we keep dreaming that Sphinx will pick up the type annotations and we can remove the type from here...
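As an aside, a minimal sketch of the convention being suggested (the function name and docstring here are illustrative, not the actual daf_butler signature): in Sphinx/numpydoc cross-references, the fully qualified target lets Sphinx resolve a class defined in another package, and the `~` prefix makes it render as just the final component.

```python
def retrieve_artifacts(destination_directory):
    """Retrieve artifacts into a directory (illustrative only).

    Parameters
    ----------
    destination_directory : `~lsst.resources.ResourcePath`
        Fully qualified so Sphinx can resolve the cross-reference even
        though ResourcePath lives outside daf_butler; the ``~`` prefix
        makes it render as just ``ResourcePath``.
    """
```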

Contributor Author

FYI I asked Jonathan about this at JTM. The latest version of the documentation tooling does do that; Science Pipelines just has not upgraded to use it yet.


target_path: ResourcePathExpression
if preserve_path:
target_path = source_path
Member

I know this was copied from above but looking at it again it might be clearer from a typing perspective as:

target_path: str
if preserve_path:
    if source_path.isabs():
        target_path = source_path.relativeToPathRoot
    else:
        target_path = source_path.path
else:
    target_path = source_path.basename()

then target_path is a simple string and doesn't have to be a ResourcePath in one branch. Thoughts?
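For illustration, the suggested branch logic can be sketched as a runnable snippet using plain POSIX path strings in place of ResourcePath (the helper name is hypothetical; `posixpath.basename` and stripping the leading slash stand in for `ResourcePath.basename()` and `relativeToPathRoot`):

```python
import posixpath

def determine_target_path(source_path: str, preserve_path: bool) -> str:
    """Sketch of the suggested logic with plain POSIX path strings."""
    if preserve_path:
        if posixpath.isabs(source_path):
            # Stand-in for ResourcePath.relativeToPathRoot: the path
            # relative to its root, i.e. without the leading "/".
            return source_path.lstrip("/")
        return source_path
    # Stand-in for ResourcePath.basename(): just the final component.
    return posixpath.basename(source_path)
```

With preserve_path the directory structure survives (absolute paths become root-relative); without it only the filename is kept.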

Contributor Author

seems reasonable

Contributor Author

Actually, should it be unquoted_path instead? I'm not sure of the vagaries of ResourcePath quoting and unquoting. I might just leave it as it was, since otherwise it's going to get immediately converted back into a ResourcePath in join() anyway.

Member

Yes, join uses unquoted_path (so it does convert the resource path to a string inside there).

https://github.com/lsst/resources/blob/main/python/lsst/resources/_resourcePath.py#L742

Leaving it is fine, it is just a bit more convoluted with the typing as it currently stands.
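For readers unfamiliar with the quoting concern being discussed, the general idea can be shown with only the standard library (this illustrates percent-encoding in general, not the lsst.resources API itself):

```python
from urllib.parse import quote, unquote

# A path segment containing a space is percent-encoded in a URI form...
quoted = quote("my file.fits")
# ...and unquoting recovers the plain filesystem-style name, which is
# what a join() that works on unquoted paths would operate on.
plain = unquote(quoted)
```

The question in the thread is essentially which of these two forms the intermediate string holds before it is rejoined into a ResourcePath.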

target_uri = determine_destination_for_retrieved_artifact(
destination, relative_path, preserve_path
)
target_uri.transfer_from(source_uri, transfer="copy", overwrite=overwrite)
Member

This loop is organized slightly differently from FileDatastore, in that it does the transfer in the main loop and can't report how many files it's about to transfer. The debug message was there before the transfer for cases where people have mistakenly asked for 10,000 datasets and are wondering why it's taking so long. We have had discussions in the past about refusing to do the transfers if there are too many refs, and also about implementing parallelism in the transfers (which is why it was written initially to do the transfers after all the database queries, so it could be modified to use futures).

I realize, looking at the original now, that it was never upgraded to use the bulk query interface _get_stored_records_associated_with_refs but instead uses the much slower per-ref API that we had originally. This is another slowdown -- we have had people wondering what is going on even with a few hundred datasets, so a reorganization of this implementation to have a bulk endpoint ("give me the payloads of N refs") with futures to download with 10 threads will likely be important at some point.

Contributor Author

This is intentional here, to some extent: because of URL expiration, we don't want to fetch a huge number of signed URLs upfront and then download all of the files at the end. If the files are large, the URLs will expire before we start downloading the later ones.

I think a bulk/parallel version of this would probably request like 10 URLs at a time from the server, instead of all of them.
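That batched approach could be sketched roughly as follows (`request_urls` and `download` are hypothetical stand-ins for the server endpoint and the per-file transfer; this is not the actual RemoteButler code):

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve_in_batches(refs, request_urls, download,
                        batch_size=10, max_workers=10):
    """Ask the server for a small batch of signed URLs, download that
    batch in parallel, then move on to the next batch, so each URL is
    used soon after it is issued and before it expires.
    """
    results = []
    for i in range(0, len(refs), batch_size):
        batch = refs[i:i + batch_size]
        urls = request_urls(batch)  # one server round-trip per batch
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map preserves input order within the batch.
            results.extend(pool.map(download, urls))
    return results
```

This keeps the window between URL signing and download bounded by one batch, while still allowing up to `max_workers` concurrent transfers.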

Member

Can you add a comment here to say that the reason is because of expiration?

Maybe also add a verbose log message after the loop to report how many files were transferred.

Add an implementation for RemoteButler.retrieveArtifacts matching the DirectButler behavior.

The code used to determine output file paths in FileDatastore.retrieveArtifacts() was factored out to a shared function.
@dhirving merged commit 75d9858 into main on Feb 22, 2024
18 checks passed
@dhirving deleted the tickets/DM-42947 branch on February 22, 2024 at 22:13