
Add hl.hadoop_scheme_supported function #10555

Merged
3 commits merged into hail-is:main from nawatts:hadoop-scheme-supported on Jun 8, 2021

Conversation

@nawatts (Contributor) commented Jun 4, 2021

Currently, the only way to check whether Hail can read URLs with a given scheme (gs, s3, etc.) is to attempt to read a URL with that scheme. However, the same exception type is thrown whether the scheme is unsupported, the file doesn't exist, or something else went wrong, so the error message is the only way to determine the cause.

This adds a hl.hadoop_scheme_supported function, which returns a boolean indicating whether a URL scheme is supported.
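For example (a usage sketch, not from the PR; whether a given scheme returns True depends on which file system backend and connectors are configured):

import hail as hl

hl.hadoop_scheme_supported("gs")   # e.g. True when a GCS-capable file system is configured
hl.hadoop_scheme_supported("s3")   # e.g. False if no S3 file system is available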

Discussed on Zulip: https://hail.zulipchat.com/#narrow/stream/123010-Hail-0.2E2.20support/topic/Get.20supported.20URL.20schemes

@johnc1231 (Contributor) left a comment

All file systems should support the file:// scheme. I'm not sure it makes sense for this function to return true for the empty string (which I read as meaning no scheme was specified, and therefore the file is read from local disk). I think those should maybe just be 'file'. I could be convinced it should accept both the empty string and 'file' as inputs if you want it to work on the empty string.

@@ -139,3 +139,6 @@ def rmtree(self, path: str):
         if self._is_local(path):
             rmtree(path)
         self.client.rm(path, recursive=True)
+
+    def supports_scheme(self, scheme: str) -> bool:
+        return scheme in ("gs", "")
@johnc1231 (Contributor):

Should also support file as a scheme.

@nawatts (Contributor, author):

For URLs that don't start with gs://, GoogleCloudStorageFS uses os methods and open, which don't support file:// URLs.

@@ -68,3 +68,6 @@ def remove(self, path: str):
 
     def rmtree(self, path: str):
         rmtree(path)
+
+    def supports_scheme(self, scheme: str) -> bool:
+        return scheme == ""
@johnc1231 (Contributor):

Should support file as a scheme.

@nawatts (Contributor, author):

LocalFS uses os methods and open, which don't support file:// URLs.

@johnc1231 (Contributor):

LocalFS supports it, tested:

import hail as hl
hl.init_local()
hl.import_vcf("file:///Users/johnc/Code/hail/hail/1kg/1kg.vcf.bgz")

@nawatts (Contributor, author) commented Jun 7, 2021

Sorry, I should have explained the rationale for returning true for the empty string.

I was thinking that you should be able to do something like this:

from urllib.parse import urlparse

url = "whatever:///path/to/file"
if hl.hadoop_scheme_supported(urlparse(url).scheme):
    with hl.hadoop_open(url) as f:
        ...

For paths without a scheme, the ParseResult's scheme is "".

>>> from urllib.parse import urlparse
>>> urlparse("/path/to/file.txt").scheme
''

> All file systems should support the file:// scheme.

HadoopFS is the only one that will actually work correctly if passed a file:// URL. For example, both LocalFS and GoogleCloudStorageFS use os.path.exists to implement exists. On macOS at least, os.path.exists("file:///test.txt") looks for test.txt in a directory named file: in the current working directory, not the root directory.
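A quick illustration (hypothetical path; os.path does no URL parsing, so the scheme is treated as ordinary path components):

import os

# "file:" is read as a relative directory name, so this checks
# ./file:/test.txt under the current working directory, not /test.txt.
os.path.exists("file:///test.txt")  # False unless ./file:/test.txt exists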

> I'm not sure it makes sense for this function to return true for the empty string (which I read as meaning no scheme was specified, and therefore the file is read from local disk). I think those should maybe just be 'file'. I could be convinced it should accept both the empty string and 'file' as inputs if you want it to work on the empty string.

For LocalFS and GoogleCloudStorageFS, an empty scheme will correspond to a path to a local file. For HadoopFS, an empty scheme causes Hadoop to use its configured default file system. When running Hail locally, this would be the local file system. On a Dataproc cluster, this would be HDFS. Returning true for the empty string in HadoopFS.supports_scheme is based on the assumption that there will always be a working default file system.
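To make that concrete (a sketch tying this back to the urlparse example above; fs.defaultFS is Hadoop's standard default-filesystem setting):

from urllib.parse import urlparse
import hail as hl

# "/path/to/file.txt" parses to scheme "", which each backend maps to its
# default file system: local disk for LocalFS and GoogleCloudStorageFS, and
# the configured fs.defaultFS (e.g. HDFS on Dataproc) for HadoopFS.
hl.hadoop_scheme_supported(urlparse("/path/to/file.txt").scheme)  # True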

@johnc1231 (Contributor):

Thanks for explaining and for adding this functionality; this all makes sense.

@nawatts (Contributor, author) commented Jun 7, 2021

Linking to the Zulip discussion for possible future reference.

https://hail.zulipchat.com/#narrow/stream/123011-Hail-Dev/topic/file.20schemes

@danking danking merged commit db0a231 into hail-is:main Jun 8, 2021
@nawatts nawatts deleted the hadoop-scheme-supported branch June 8, 2021 19:50