
DM-42947: Implement RemoteButler.retrieveArtifacts #964

Merged: 1 commit into main from tickets/DM-42947 on Feb 22, 2024
Conversation

@dhirving (Contributor) commented Feb 20, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

codecov bot commented Feb 20, 2024

Codecov Report

Attention: Patch coverage is 96.49123%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 88.49%. Comparing base (b06f0c5) to head (5804ff3).

Files Patch % Lines
python/lsst/daf/butler/datastores/fileDatastore.py 81.81% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #964      +/-   ##
==========================================
+ Coverage   88.47%   88.49%   +0.02%     
==========================================
  Files         309      310       +1     
  Lines       39740    39783      +43     
  Branches     8340     8348       +8     
==========================================
+ Hits        35159    35206      +47     
+ Misses       3368     3366       -2     
+ Partials     1213     1211       -2     


@dhirving force-pushed the tickets/DM-42947 branch 6 times, most recently from adf49d2 to 107bed3 on February 21, 2024 at 22:45
@dhirving marked this pull request as ready for review on February 21, 2024 at 23:03
@timj (Member) left a comment

Looks okay. I think there will need to be a refactor of it at some point in the future to be more efficient (we have had people complaining that it's slow in direct butler when the number of refs is reasonably large).


Parameters
----------
destination_directory : `ResourcePath`
Member

Suggested change
destination_directory : `ResourcePath`
destination_directory : `~lsst.resources.ResourcePath`

I don't think Sphinx resolves them otherwise, because ResourcePath is not part of daf_butler. It only really matters if this docstring turns up in the Sphinx docs, and in the longer term we keep dreaming that Sphinx will pick up the type annotations and we can remove the type from here...
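As an aside, a minimal sketch of the convention being suggested (the function name and docstring here are illustrative, not the actual daf_butler signature): in Sphinx/numpydoc cross-references, the fully qualified target lets Sphinx resolve a class defined in another package, and the `~` prefix makes it render as just the final component.

```python
def retrieve_artifacts(destination_directory):
    """Retrieve artifacts into a directory (illustrative only).

    Parameters
    ----------
    destination_directory : `~lsst.resources.ResourcePath`
        Fully qualified so Sphinx can resolve the cross-reference even
        though ResourcePath lives outside daf_butler; the ``~`` prefix
        makes it render as just ``ResourcePath``.
    """
```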

Contributor Author

FYI I asked Jonathan about this at JTM. The latest version of the documentation tooling does do that; Science Pipelines just has not upgraded to use it yet.


target_path: ResourcePathExpression
if preserve_path:
target_path = source_path
Member

I know this was copied from above but looking at it again it might be clearer from a typing perspective as:

target_path: str
if preserve_path:
    if source_path.isabs():
        target_path = source_path.relativeToPathRoot
    else:
        target_path = source_path.path
else:
    target_path = source_path.basename()

then target_path is a simple string and doesn't have to be a ResourcePath in one branch. Thoughts?
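For illustration, the suggested branch logic can be sketched as a runnable snippet using plain POSIX path strings in place of ResourcePath (the helper name is hypothetical; `posixpath.basename` and stripping the leading slash stand in for `ResourcePath.basename()` and `relativeToPathRoot`):

```python
import posixpath

def determine_target_path(source_path: str, preserve_path: bool) -> str:
    """Sketch of the suggested logic with plain POSIX path strings."""
    if preserve_path:
        if posixpath.isabs(source_path):
            # Stand-in for ResourcePath.relativeToPathRoot: the path
            # relative to its root, i.e. without the leading "/".
            return source_path.lstrip("/")
        return source_path
    # Stand-in for ResourcePath.basename(): just the final component.
    return posixpath.basename(source_path)
```

With preserve_path the directory structure survives (absolute paths become root-relative); without it only the filename is kept.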

Contributor Author

seems reasonable

Contributor Author

Actually, should it be unquoted_path instead? I'm not sure of the vagaries of ResourcePath quoting and unquoting. I might just leave it as it was, since otherwise it's going to get immediately converted back into a ResourcePath in join() anyway.

Member

Yes, join uses unquoted_path (so it does convert the resource path to a string inside there).

https://github.com/lsst/resources/blob/main/python/lsst/resources/_resourcePath.py#L742

Leaving it is fine, it is just a bit more convoluted with the typing as it currently stands.
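For readers unfamiliar with the quoting concern being discussed, the general idea can be shown with only the standard library (this illustrates percent-encoding in general, not the lsst.resources API itself):

```python
from urllib.parse import quote, unquote

# A path segment containing a space is percent-encoded in a URI form...
quoted = quote("my file.fits")
# ...and unquoting recovers the plain filesystem-style name, which is
# what a join() that works on unquoted paths would operate on.
plain = unquote(quoted)
```

The question in the thread is essentially which of these two forms the intermediate string holds before it is rejoined into a ResourcePath.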

target_uri = determine_destination_for_retrieved_artifact(
destination, relative_path, preserve_path
)
target_uri.transfer_from(source_uri, transfer="copy", overwrite=overwrite)
Member

This loop is organized slightly differently from FileDatastore, in that it does the transfer in the main loop and can't report how many files it's about to transfer. The debug message was there before the transfer for cases where people have mistakenly asked for 10,000 datasets and are wondering why it's taking so long. We have had discussions in the past about refusing to do the transfers if there are too many refs, and also about implementing parallelism in the transfers (which is why it was written initially to do the transfers after all the database queries, so it could be modified to use futures).

I realize, looking at the original now, that it was never upgraded to use the bulk query interface _get_stored_records_associated_with_refs but instead uses the much slower per-ref API that we had originally. This is another slowdown -- we have had people wondering what is going on even with a few hundred datasets, so a reorganization of this implementation to have a bulk endpoint ("give me the payloads of N refs") with futures to download with 10 threads will likely be important at some point.

Contributor Author

This is intentional here, to some extent: because of URL expiration, we don't want to fetch a huge number of signed URLs upfront and then download all of the files at the end. If the files are large, the URLs will expire before we start downloading the later ones.

I think a bulk/parallel version of this would probably request like 10 URLs at a time from the server, instead of all of them.
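That batched approach could be sketched roughly as follows (`request_urls` and `download` are hypothetical stand-ins for the server endpoint and the per-file transfer; this is not the actual RemoteButler code):

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve_in_batches(refs, request_urls, download,
                        batch_size=10, max_workers=10):
    """Ask the server for a small batch of signed URLs, download that
    batch in parallel, then move on to the next batch, so each URL is
    used soon after it is issued and before it expires.
    """
    results = []
    for i in range(0, len(refs), batch_size):
        batch = refs[i:i + batch_size]
        urls = request_urls(batch)  # one server round-trip per batch
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map preserves input order within the batch.
            results.extend(pool.map(download, urls))
    return results
```

This keeps the window between URL signing and download bounded by one batch, while still allowing up to `max_workers` concurrent transfers.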

Member

Can you add a comment here to say that the reason is because of expiration?

Maybe also add a verbose log message after the loop to report how many files were transferred.

Add an implementation for RemoteButler.retrieveArtifacts matching the DirectButler behavior.

The code used to determine output file paths in FileDatastore.retrieveArtifacts() was factored out to a shared function.
@dhirving merged commit 75d9858 into main on Feb 22, 2024
18 checks passed
@dhirving deleted the tickets/DM-42947 branch on February 22, 2024 at 22:13