Add hl.hadoop_scheme_supported function #10555
Conversation
johnc1231 left a comment
All file systems should support the file:// scheme. I'm not sure it makes sense for this function to return true for the empty string (which I read as meaning no specified scheme, and therefore reading the file from local disk). I think those should maybe just be 'file'? I could be convinced it should accept both the empty string and 'file' as inputs if you want it to work on empty strings.
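For illustration, a `supports_scheme` that accepts both `'file'` and the empty string, as suggested above, might look like this sketch (`LocalLikeFS` is a hypothetical stand-in, not one of Hail's filesystem classes):

```python
# Hypothetical sketch: a filesystem that treats both "file" and ""
# (a bare local path with no scheme) as supported.
class LocalLikeFS:
    def supports_scheme(self, scheme: str) -> bool:
        # "" means the URL had no scheme at all; "file" is an explicit
        # file:// URL. Both refer to the local disk here.
        return scheme in ("file", "")

fs = LocalLikeFS()
print(fs.supports_scheme("file"))  # True
print(fs.supports_scheme(""))      # True
print(fs.supports_scheme("gs"))    # False
```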
```python
        self.client.rm(path, recursive=True)

    def supports_scheme(self, scheme: str) -> bool:
        return scheme in ("gs", "")
```
Should also support file as a scheme.
For URLs that don't start with gs://, GoogleCloudStorageFS uses os methods and open, which don't support file:// URLs.
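To see why, note that Python's built-in `open` treats a `file://` URL as a literal relative path rather than stripping the scheme; a quick stdlib check (not Hail code):

```python
from urllib.parse import urlparse

# open() does not understand URL schemes: "file:///tmp/x" is treated as a
# literal relative path beginning with a directory named "file:".
try:
    open("file:///tmp/definitely-missing")
except OSError as e:
    print(type(e).__name__)  # typically FileNotFoundError

# Stripping the scheme with urlparse yields a usable local path instead.
print(urlparse("file:///tmp/x").path)  # /tmp/x
```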
```python
        rmtree(path)

    def supports_scheme(self, scheme: str) -> bool:
        return scheme == ""
```
LocalFS uses os methods and open, which don't support file:// URLs.
LocalFS supports it, tested:

```python
import hail as hl
hl.init_local()
hl.import_vcf("file:///Users/johnc/Code/hail/hail/1kg/1kg.vcf.bgz")
```
Sorry, I should have explained the rationale for returning true for empty strings. I was thinking that you should be able to do something like this:

```python
from urllib.parse import urlparse

url = "whatever:///path/to/file"
if hl.hadoop_supports_scheme(urlparse(url).scheme):
    with hl.hadoop_open(url) as f:
        ...
```

For paths without a scheme, the ParseResult's scheme is `""`:

```python
>>> from urllib.parse import urlparse
>>> urlparse("/path/to/file.txt").scheme
''
```
HadoopFS is the only one that will actually work correctly if passed a `file://` URL.

For LocalFS and GoogleCloudStorageFS, an empty scheme will correspond to a path to a local file. For HadoopFS, an empty scheme causes Hadoop to use its configured default file system. When running Hail locally, this would be the local file system. On a Dataproc cluster, this would be HDFS. Returning true for the empty string in `supports_scheme` reflects this behavior.
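The per-filesystem behavior described above can be sketched as a small dispatch table. These are hypothetical stand-in classes mirroring the discussion, not Hail's actual implementations, and the exact scheme sets (especially for HadoopFS) are assumptions:

```python
# Stand-in classes illustrating the scheme support discussed above.
class LocalFS:
    def supports_scheme(self, scheme: str) -> bool:
        return scheme == ""          # bare local paths only

class GoogleCloudStorageFS:
    def supports_scheme(self, scheme: str) -> bool:
        return scheme in ("gs", "")  # gs:// URLs, plus bare local paths

class HadoopFS:
    def supports_scheme(self, scheme: str) -> bool:
        # Hadoop resolves "" via its configured default filesystem
        # (the local FS when running locally, HDFS on a Dataproc cluster).
        return scheme in ("file", "")

for fs in (LocalFS(), GoogleCloudStorageFS(), HadoopFS()):
    print(type(fs).__name__, fs.supports_scheme(""))
```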
Thanks for explaining and for adding this functionality; this all makes sense.
Linking to the Zulip discussion for possible future reference. https://hail.zulipchat.com/#narrow/stream/123011-Hail-Dev/topic/file.20schemes |
Currently, the only way to check if Hail can read URLs with a given scheme (`gs`, `s3`, etc.) is to attempt to read a URL with that scheme. However, the same exception type is thrown whether the scheme is not supported, the file doesn't exist, or something else went wrong; the error message is the only way to determine what happened.

This adds an `hl.hadoop_scheme_supported` function, which returns a boolean indicating whether or not a URL scheme is supported.

Discussed on Zulip: https://hail.zulipchat.com/#narrow/stream/123010-Hail-0.2E2.20support/topic/Get.20supported.20URL.20schemes
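A minimal sketch of the intended usage pattern. A stand-in `hadoop_scheme_supported` with an assumed scheme set replaces the real `hl.hadoop_scheme_supported`, so the example does not require a Hail installation:

```python
from urllib.parse import urlparse

# Stand-in for hl.hadoop_scheme_supported; the scheme set is assumed.
SUPPORTED_SCHEMES = {"", "file", "gs"}

def hadoop_scheme_supported(scheme: str) -> bool:
    return scheme in SUPPORTED_SCHEMES

def can_read(url: str) -> bool:
    # Check the scheme up front instead of attempting the read and
    # inspecting the error message afterwards.
    return hadoop_scheme_supported(urlparse(url).scheme)

print(can_read("gs://bucket/data.vcf"))   # True
print(can_read("ftp://host/data.vcf"))    # False
print(can_read("/local/path/data.vcf"))   # True (empty scheme)
```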
@johnc1231