Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-24475: Rewrite of Butler URI and major clean up of S3 usage #336

Merged
merged 45 commits into from Jul 30, 2020

Conversation

timj
Copy link
Member

@timj timj commented Jul 24, 2020

  • ButlerURI constructor is now a factory that returns a specific subclass for that scheme.
  • ButlerURI is now concrete and can check for existence of the resource and download it.
  • Add resource:// URI scheme for local python package resources.
  • Add http(s):// URI scheme for easy web access.
  • ButlerURI can now transfer from one URI to another, supports move and transactions.
  • Rewrite datastore.export to use ButlerURI thus enabling export to S3 and use in S3Datastore for export to S3 or local file.
  • Change all file access in Config class to use ButlerURI. Full support for !include and includeConfigs with remote resources. You can now create a config from a HTTP resource.
  • Rewrite datastore.ingest to use ButlerURI, enabling S3 and POSIX to ingest from all supported sources.
  • Butler.makeRepo now no longer special cases S3 but can write the output config to any supported location.

timj added 18 commits July 21, 2020 13:58
We really need to generalize the code that transfers one
file to another location so that ingest and export use
the same code and preferably can be shared amongst different
datastores. You can imagine ingest from S3 to local file or
export from S3 to local file or export from local file
to S3.  It is not scalable to put this code in multiple
places in multiple datastores.
Now you get different subclasses for schemeless, file
and generic URIs.  This simplifies some of the
logic for handling path components.
This now leaves us with a tiny bit of duplication
between schemeless and file fixups.
Defer converting to posix form until the end, thereby letting
us use os.sep everywhere.
ButlerURI.transfer_from(URI) now works for file, http, and S3.
It is a surprise to see a schemeless relative path become
a schemeless absolute path when asking for the directory
component. Changing this required a couple of tests to
be updated that were assuming absolute paths.
mypy gets confused that new does not return a ButlerURI
which leads to mypy thinking that we are calling
abstract methods.
Simple routine that special cases schemeless URIs.
Now unified resource access for files, S3, HTTP, and pkg_resources.
Much simplified I/O.

Still need to sort out dumpToFile
Removes a lot of code.
timj added 13 commits July 24, 2020 15:32
Absolute paths with a relative path should now work.
At some point we should go through and require that
datastore.root be a URI and not allow it to sort of
be a file.
Calling realpath on the destination for the transfer
is never the right thing to do. It was harmless when the
destination did not exist but in cases where the destination
existed and was a softlink that ended up somewhere else
it completely moved the output location.
on macOS the /var folder is a link to /private/var which
means that when you create a temp directory in a test
and put a file in it, once you readpath you end up
in a completely different location that is nowhere
near the test file that you are trying to use
with relsymlink.  Using a local tempdir fixes
this anomaly.
This enables posixDatastore to use it.
files containing ? were breaking because urllib.parse would
treat this as a URI with query parameters. We therefore need
to quote to protect this prior to parsing in some cases and
also unquote when referring to local file resources.
@timj
Copy link
Member Author

timj commented Jul 25, 2020

Some ingest tests included use of ??? in file names. This broke the URI parser. Therefore I just added URI quoting and unquoting into the mix. This found a few more bugs.

@timj timj requested a review from rra July 25, 2020 20:44
Copy link
Member

@rra rra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some nits, but overall this looks good to me. I will say that it all still feels rather complex, and I would be tempted to store some additional data to make some of the operations clearer. For example, I think several operations would be simpler if a ButlerFileURI created a Path object on instantiation and stashed it inside the object and used it for several of the APIs. But that may be gilding the lily, and you probably want to get on with other work.

The one thing that makes me nervous is that I'm worried that relative paths are still ill-defined (specifically, "relative to what?" since in at least one place it doesn't seem to be the current working directory). I don't know enough of the surrounding context to know if you need to retain the relativeness because the working directory will change over the lifetime of the ButlerURI. If not, I would be sorely tempted to eliminate relative paths by canonicalizing everything on instantiation, since you're carrying a ton of complexity in order to make relative paths absolute at the last possible moment.

It might be worth trying to do some explicit testing with Windows paths, since I'm not sure how drive letters and whatnot would interact with a lot of this file logic.

python/lsst/daf/butler/core/_butlerUri.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/core/_butlerUri.py Show resolved Hide resolved
python/lsst/daf/butler/core/_butlerUri.py Show resolved Hide resolved
python/lsst/daf/butler/core/_butlerUri.py Show resolved Hide resolved
Comment on lines +755 to +761
# if we have a relative path convert it to absolute
# relative to the supplied parent. This is solely to handle
# the case where the relative path includes ".." but somehow
# then goes back inside the directory of the parent
if not self.isabs():
childUri = other.join(self.path)
return childUri.relative_to(other)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic feels weird to me. A relative path is, by convention, relative to the current working directory, but here you're assuming that it's relative to the supplied path, which could produce far different results. Maybe what I'm asking for here is better documentation of what "relative" means in the context of a ButlerURI?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a relative path relative to an absolute path I'm trying to say that the answer is by definition the relative path unless it has ../ in it, in which case it goes up one level. I think that's a fairly well defined definition of relative. The current working directory is not relevant.

python/lsst/daf/butler/core/_butlerUri.py Show resolved Hide resolved
python/lsst/daf/butler/core/_butlerUri.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/core/_butlerUri.py Show resolved Hide resolved
python/lsst/daf/butler/core/config.py Outdated Show resolved Hide resolved
tests/test_uri.py Outdated Show resolved Hide resolved
@timj
Copy link
Member Author

timj commented Jul 29, 2020

Regarding the use of ntpath -- it's a good idea and I'm sure something will break in my os.path usage if I try it. I think though that I'd save that for a rainy day unless you have a trivial way of mocking sys.builtin_module_names so that os.path will magically turn into ntpath.

@timj timj merged commit e214c1f into master Jul 30, 2020
@timj timj deleted the tickets/DM-24475 branch July 30, 2020 03:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants