DM-42947: Implement RemoteButler.retrieveArtifacts #964
@@ -0,0 +1,72 @@
# This file is part of daf_butler.
#
# Developed for the LSST Data Management System.
# This product includes software developed by the LSST Project
# (http://www.lsst.org).
# See the COPYRIGHT file at the top-level directory of this distribution
# for details of code ownership.
#
# This software is dual licensed under the GNU General Public License and also
# under a 3-clause BSD license. Recipients may choose which of these licenses
# to use; please see the files gpl-3.0.txt and/or bsd_license.txt,
# respectively. If you choose the GPL option then the following text applies
# (but note that there is still no warranty even if you opt for BSD instead):
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

from __future__ import annotations

__all__ = ("determine_destination_for_retrieved_artifact",)

from lsst.resources import ResourcePath, ResourcePathExpression


def determine_destination_for_retrieved_artifact(
    destination_directory: ResourcePath, source_path: ResourcePath, preserve_path: bool
) -> ResourcePath:
    """Determine destination path for an artifact retrieved from a datastore.

    Parameters
    ----------
    destination_directory : `ResourcePath`
        Path to the output directory where the file will be stored.
    source_path : `ResourcePath`
        Path to the source file to be transferred.  This may be relative to
        the datastore root, or an absolute path.
    preserve_path : `bool`
        If `True` the full path of the artifact within the datastore
        is preserved.  If `False` only the final file component of the path
        is used.

    Returns
    -------
    destination_uri : `~lsst.resources.ResourcePath`
        Absolute path to the output destination.
    """
    destination_directory = destination_directory.abspath()

    target_path: ResourcePathExpression
    if preserve_path:
        target_path = source_path
Review comment: I know this was copied from above, but looking at it again it might be clearer from a typing perspective as:

    target_path: str
    if preserve_path:
        if source_path.isabs():
            target_path = source_path.relativeToPathRoot
        else:
            target_path = source_path.path
    else:
        target_path = source_path.basename()

Then target_path is a simple string and does not have to be a ResourcePath in one branch. Thoughts?

Reply: Seems reasonable.

Reply: Actually, should it be

Reply: Yes, https://github.com/lsst/resources/blob/main/python/lsst/resources/_resourcePath.py#L742. Leaving it is fine; it is just a bit more convoluted with the typing as it currently stands.
        if target_path.isabs():
            # This is an absolute path to an external file.
            # Use the full path.
            target_path = target_path.relativeToPathRoot
    else:
        target_path = source_path.basename()

    target_uri = destination_directory.join(target_path).abspath()
    if target_uri.relative_to(destination_directory) is None:
        raise ValueError(f"File path attempts to escape destination directory: '{source_path}'")
    return target_uri
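The destination-resolution logic above can be illustrated with a standard-library sketch. This is not the lsst.resources API: posixpath strings stand in for ResourcePath objects, and `sketch_destination` is a hypothetical name used only for this illustration.

```python
import posixpath


def sketch_destination(destination_dir: str, source_path: str, preserve_path: bool) -> str:
    """Illustrative stand-in for determine_destination_for_retrieved_artifact,
    using plain POSIX path strings instead of lsst.resources.ResourcePath."""
    if preserve_path:
        # An absolute source path is made relative to the filesystem root so
        # its full hierarchy is recreated under the destination directory.
        target = source_path.lstrip("/")
    else:
        # Keep only the final file component.
        target = posixpath.basename(source_path)
    result = posixpath.normpath(posixpath.join(destination_dir, target))
    # Reject "../"-style paths that would escape the destination directory,
    # mirroring the relative_to() check in the real function.
    if not result.startswith(destination_dir.rstrip("/") + "/"):
        raise ValueError(f"File path attempts to escape destination directory: '{source_path}'")
    return result


# preserve_path=True keeps the relative hierarchy under the destination;
# preserve_path=False flattens to the basename; escaping paths raise.
```

A path such as `"../../etc/passwd"` normalizes to a location outside the destination directory and is rejected, which is the same safety property the `relative_to` check provides.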
@@ -40,6 +40,9 @@
 import httpx
 from lsst.daf.butler import __version__
+from lsst.daf.butler.datastores.file_datastore.retrieve_artifacts import (
+    determine_destination_for_retrieved_artifact,
+)
 from lsst.daf.butler.datastores.fileDatastoreClient import (
     FileDatastoreGetPayload,
     get_dataset_as_python_object,
@@ -298,10 +301,7 @@ def _get_file_info(
         if isinstance(datasetRefOrType, DatasetRef):
             if dataId is not None:
                 raise ValueError("DatasetRef given, cannot use dataId as well")
-            dataset_id = datasetRefOrType.id
-            response = self._get(f"get_file/{dataset_id}", expected_errors=(404,))
-            if response.status_code == 404:
-                raise FileNotFoundError(f"Dataset not found: {datasetRefOrType}")
+            return self._get_file_info_for_ref(datasetRefOrType)
         else:
             request = GetFileByDataIdRequestModel(
                 dataset_type_name=self._normalize_dataset_type_name(datasetRefOrType),
@@ -314,7 +314,12 @@ def _get_file_info(
                     f"Dataset not found with DataId: {dataId} DatasetType: {datasetRefOrType}"
                     f" collections: {collections}"
                 )
             return self._parse_model(response, GetFileResponseModel)

+    def _get_file_info_for_ref(self, ref: DatasetRef) -> GetFileResponseModel:
+        response = self._get(f"get_file/{ref.id}", expected_errors=(404,))
+        if response.status_code == 404:
+            raise FileNotFoundError(f"Dataset not found: {ref.id}")
+        return self._parse_model(response, GetFileResponseModel)

     def getURIs(
@@ -428,8 +433,28 @@ def retrieveArtifacts(
         preserve_path: bool = True,
         overwrite: bool = False,
     ) -> list[ResourcePath]:
-        # Docstring inherited.
-        raise NotImplementedError()
+        destination = ResourcePath(destination).abspath()
+        if not destination.isdir():
+            raise ValueError(f"Destination location must refer to a directory. Given {destination}.")
+
+        if transfer not in ("auto", "copy"):
+            raise ValueError("Only 'copy' and 'auto' transfer modes are supported.")
+
+        output_uris: list[ResourcePath] = []
+        for ref in refs:
+            file_info = _to_file_payload(self._get_file_info_for_ref(ref)).file_info
+            for file in file_info:
+                source_uri = ResourcePath(str(file.url))
+                relative_path = ResourcePath(file.datastoreRecords.path, forceAbsolute=False)
+                target_uri = determine_destination_for_retrieved_artifact(
+                    destination, relative_path, preserve_path
+                )
+                # Because signed URLs expire, we want to do the transfer soon
+                # after retrieving the URL.
+                target_uri.transfer_from(source_uri, transfer="copy", overwrite=overwrite)
Review comment: This loop is organized slightly different to the original. I realize looking at the original now that it was never upgraded to use the bulk query interface.

Reply: This is intentional here to some extent... because of URL expiration, we don't want to download a huge number of URLs upfront and then download all of them at the end. If the files are large, the URLs will expire before we start downloading the later ones. I think a bulk/parallel version of this would probably request something like 10 URLs at a time from the server, instead of all of them.

Reply: Can you add a comment here to say that the reason is because of expiration? Maybe also add a verbose log message after the loop to report how many files were transferred.
+                output_uris.append(target_uri)
+
+        return output_uris

     def exists(
         self,
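The reason the loop above requests each signed URL immediately before transferring it (rather than fetching all URLs upfront) can be sketched as follows. `get_signed_url` and `download` are hypothetical stand-ins for the server round-trip and for `ResourcePath.transfer_from`:

```python
from collections.abc import Callable, Iterable


def retrieve_one_at_a_time(
    refs: Iterable[str],
    get_signed_url: Callable[[str], str],
    download: Callable[[str], str],
) -> list[str]:
    """Sketch of the loop structure in retrieveArtifacts: each signed URL is
    requested immediately before its transfer, so a slow download of an early
    file cannot cause a later, pre-fetched URL to expire before it is used."""
    outputs = []
    for ref in refs:
        url = get_signed_url(ref)  # one server round-trip per dataset
        outputs.append(download(url))  # start the transfer while the URL is fresh
    return outputs
```

A bulk variant would batch the `get_signed_url` calls, trading extra round-trips for throughput; as the review thread notes, a reasonable middle ground is to request URLs in small batches so none sit unused long enough to expire.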
Review comment: I don't think Sphinx resolves them otherwise, because ResourcePath is not part of daf_butler. It only really matters if this docstring turns up in the Sphinx documentation, and in the longer term we keep dreaming that Sphinx will pick up the type annotations so we can remove the type from here...

Reply: FYI, I asked Jonathan about this at JTM. The latest version of the documentation tooling does do that; Science Pipelines just has not upgraded to use it yet.