New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-24475: Rewrite of Butler URI and major clean up of S3 usage #336
Conversation
timj
commented
Jul 24, 2020
•
edited
edited
- ButlerURI constructor is now a factory that returns a specific subclass for that scheme.
- ButlerURI is now concrete and can check for existence of the resource and download it.
- Add resource:// URI scheme for local python package resources.
- Add http(s):// URI scheme for easy web access.
- ButlerURI can now transfer from one URI to another, supports move and transactions.
- Rewrite datastore.export to use ButlerURI thus enabling export to S3 and use in S3Datastore for export to S3 or local file.
- Change all file access in Config class to use ButlerURI. Full support for !include and includeConfigs with remote resources. You can now create a config from a HTTP resource.
- Rewrite datastore.ingest to use ButlerURI, enabling S3 and POSIX to ingest from all supported sources.
- Butler.makeRepo now no longer special cases S3 but can write the output config to any supported location.
We really need to generalize the code that transfers one file to another location so that ingest and export use the same code and preferably can be shared amongst different datastores. You can imagine ingest from S3 to local file or export from S3 to local file or export from local file to S3. It is not scalable to put this code in multiple places in multiple datastores.
Now you get different subclasses for schemeless, file and generic URIs. This simplifies some of the logic for handling path components.
This now leaves us with a tiny bit of duplication between schemeless and file fixups.
Defer converting to posix form until the end, thereby letting us use os.sep everywhere.
ButlerURI.transfer_from(URI) now works for file, http, and S3.
It is a surprise to see a schemeless relative path become a schemeless absolute path when asking for the directory component. Changing this required a couple of tests to be updated that were assuming absolute paths.
mypy gets confused that new does not return a ButlerURI which leads to mypy thinking that we are calling abstract methods.
Simple routine that special cases schemeless URIs.
Now unified resource access for files, S3, HTTP, and pkg_resources. Much simplified I/O. Still need to sort out dumpToFile
This helps in cases like tempfile.mkdtemp where the path returned does not include a trailing slash but is used as a string and not a ButlerURI.
This removes all the special casing for dumpToFile and dumpToS3 in Config.
Removes a lot of code.
Absolute paths with a relative path should now work.
At some point we should go through and require that datastore.root be a URI and not allow it to sort of be a file.
Calling realpath on the destination for the transfer is never the right thing to do. It was harmless when the destination did not exist but in cases where the destination existed and was a softlink that ended up somewhere else it completely moved the output location.
on macOS the /var folder is a link to /private/var which means that when you create a temp directory in a test and put a file in it, once you readpath you end up in a completely different location that is nowhere near the test file that you are trying to use with relsymlink. Using a local tempdir fixes this anomaly.
This enables posixDatastore to use it.
files containing ? were breaking because urllib.parse would treat this as a URI with query parameters. We therefore need to quote to protect this prior to parsing in some cases and also unquote when referring to local file resources.
Some ingest tests included use of ??? in file names. This broke the URI parser. Therefore I just added URI quoting and unquoting into the mix. This found a few more bugs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some nits, but overall this looks good to me. I will say that it all still feels rather complex, and I would be tempted to store some additional data to make some of the operations clearer. For example, I think several operations would be simpler if a ButlerFileURI
created a Path
object on instantiation and stashed it inside the object and used it for several of the APIs. But that may be gilding the lily, and you probably want to get on with other work.
The one thing that makes me nervous is that I'm worried that relative paths are still ill-defined (specifically, "relative to what?" since in at least one place it doesn't seem to be the current working directory). I don't know enough of the surrounding context to know if you need to retain the relativeness because the working directory will change over the lifetime of the ButlerURI
. If not, I would be sorely tempted to eliminate relative paths by canonicalizing everything on instantiation, since you're carrying a ton of complexity in order to make relative paths absolute at the last possible moment.
It might be worth trying to do some explicit testing with Windows paths, since I'm not sure how drive letters and whatnot would interact with a lot of this file logic.
# if we have a relative path convert it to absolute | ||
# relative to the supplied parent. This is solely to handle | ||
# the case where the relative path includes ".." but somehow | ||
# then goes back inside the directory of the parent | ||
if not self.isabs(): | ||
childUri = other.join(self.path) | ||
return childUri.relative_to(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic feels weird to me. A relative path is, by convention, relative to the current working directory, but here you're assuming that it's relative to the supplied path, which could produce far different results. Maybe what I'm asking for here is better documentation of what "relative" means in the context of a ButlerURI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a relative path relative to an absolute path I'm trying to say that the answer is by definition the relative path unless it has ../
in it, in which case it goes up one level. I think that's a fairly well defined definition of relative. The current working directory is not relevant.
Regarding the use of |
Rather than looking for a substring which is easy to miss at a glance.
This required that we also add mem:// support for in-memory datastores.