[fs] support hfs.ls on a bucket #14176

Merged 14 commits into hail-is:main on Feb 6, 2024

Conversation

danking (Collaborator) commented Jan 18, 2024

Teaches `hfs.ls('gs://bucket/')` to list the files and directories at the top level of the bucket.

In `main` that command raises because this line of `_ls_no_glob` raises:

```python3
maybe_sb_and_t, maybe_contents = await asyncio.gather(
    self._size_bytes_and_time_modified_or_none(path), ls_as_dir()
)
```

In particular, `statfile` raises a cloud-specific, esoteric error about a malformed URL or empty
object names:

```python3
async def _size_bytes_and_time_modified_or_none(self, path: str) -> Optional[Tuple[int, float]]:
    try:
        # Hadoop semantics: creation time is used if the object has no notion of last modification time.
        file_status = await self.afs.statfile(path)
        return (await file_status.size(), file_status.time_modified().timestamp())
    except FileNotFoundError:
        return None
```
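
For concreteness, the failing call looked like this (illustrative; the bucket name is a placeholder):

```python3
import hailtop.fs as hfs

hfs.ls('gs://bucket/')  # raised a cloud-specific error on main; lists the bucket root after this PR
```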

I decided to add a subclass of `FileNotFoundError` that is self-describing: `IsABucketError`.
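
A minimal sketch of that exception (assumed shape; the class in the PR may carry a more descriptive message):

```python3
class IsABucketError(FileNotFoundError):
    """Raised when an operation that needs an object or directory is given a bare bucket URL."""
```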

I changed most methods to raise that error when given a bucket URL. The two interesting cases:

1. `isdir`. This raises an error, but I could also see it returning `True`: a bucket is like a
   directory whose path/name is empty.

2. `isfile`. This returns `False`, but I could also see it raising an error. This just seems
   convenient: we know the bucket is not a file, so we should say so (see the sketch below).
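
A sketch of the resulting behavior at a bucket root (illustrative; the `RouterAsyncFS` construction and context-manager usage here are assumptions, not tests from the PR):

```python3
from hailtop.aiotools.router_fs import RouterAsyncFS

async def demo():
    async with RouterAsyncFS() as fs:
        assert not await fs.isfile('gs://bucket')  # a bucket is not a file
        await fs.isdir('gs://bucket')              # raises IsABucketError
```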

---

Apparently `hfs.ls` had no current tests because the globbing system doesn't work with Azure
`https://` URLs. I fixed it to use `AsyncFSURL.with_new_path_component`, which is resilient to the
Azure `https` weirdness. However, I had to change `with_new_path_component` to treat an empty path
in a special way. I wanted this to hold:

```python3
actual = str(afs.parse_url('gs://bucket').with_new_path_component('bar'))
expected = 'gs://bucket/bar'
assert actual == expected
```

But `with_new_path_component` interacts badly with `GoogleAsyncFSURL.__str__` to return this:

```
'gs://bucket//bar'
```
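
A sketch of the special case (assumed shape; the real method lives on `AsyncFSURL` and may differ):

```python3
def with_new_path_component(self, component: str) -> 'AsyncFSURL':
    # Joining onto an empty path must not introduce a leading '/';
    # otherwise __str__ renders 'gs://bucket//bar' instead of
    # 'gs://bucket/bar'.
    new_path = component if self.path == '' else self.path + '/' + component
    return self.with_path(new_path)
```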

The first review thread is anchored on this hunk in `hailtop/aiocloud/aioaws/fs.py`:

```diff
@@ -490,6 +499,8 @@ async def create(self, url: str, *, retry_writes: bool = True) -> S3CreateManage
         # complete before the write can begin (unlike the current
         # code, that copies 128MB parts in 256KB chunks).
         bucket, name = self.get_bucket_and_name(url)
+        if name == '':
+            raise IsABucketError(url)
```

daniel-goldstein (Contributor):

nit: I think I'd rather have this handled with a decorator (or at least a separate method). Is it bad DX to have these methods accept an `S3AsyncFSURL` and have a decorator on top that converts `str` to the URL object, optionally erroring if there's no path?

danking (Collaborator, Author):

So, my main gripe with decorators is that they tend to mangle the types. I tried writing a decorator with precise types in this commit. I'm not sure why this fails:

```
$ pyright fs.py
/Users/dking/projects/hail/hail/python/hailtop/aiocloud/aioaws/fs.py
  /Users/dking/projects/hail/hail/python/hailtop/aiocloud/aioaws/fs.py:460:15 - error: "open" overrides method of same name in class "AsyncFS" with incompatible type "StrOrURLMethod[(), Coroutine[Any, Any, ReadableStream]]" (reportIncompatibleMethodOverride)
1 error, 0 warnings, 0 informations
```

It works fine though. I can read stuff from S3.

```
In [1]: from hailtop.aiocloud.aioaws import S3AsyncFS
   ...: fs = S3AsyncFS()
   ...: (await (await fs.open('s3://gnomad-public-us-east-1/release/4.0/constraint/README.txt')).read())[:100]
Out[1]: b'\xef\xbb\xbfConstraint field descriptions\r\n\r\nDescriptions of columns in the gnomAD v4 constraint metrics tsv.'
```

Maybe I'm overcomplicating what you're asking, but there seem to be two issues:

  1. If the URL is in the `open` method's signature, either we make `FS` generic (ugh) or we assert that the URL is the right URL everywhere.
  2. Preserving precise type information requires a fair bit of code (even if we can resolve (1)).

I was able to write a decorator that had no types and pyright didn't complain, but I also lost all type hinting on `fs.open`. IMO, between types and not repeating myself, I'd go with types.
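
For reference, the untyped variant might look something like this (a hypothetical sketch, not the code from the commit; `parse_url`, the `path` attribute, and the bucket guard are assumptions):

```python3
from functools import wraps

def accept_str_url(meth):
    # Hypothetical decorator: parse a str URL into a URL object and reject
    # bare bucket URLs before calling the wrapped method. functools.wraps
    # preserves the name and docstring, but without extra typing machinery
    # the precise signature of `meth` is erased, which is the typing
    # problem described above.
    @wraps(meth)
    async def wrapper(self, url: str, *args, **kwargs):
        parsed = self.parse_url(url)
        if parsed.path == '':
            raise IsABucketError(url)
        return await meth(self, parsed, *args, **kwargs)

    return wrapper
```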

danking (Collaborator, Author):

Let me try something using inheritance instead.

danking (Collaborator, Author):

Something along these lines may work and won't be too bad, but we'll need to play the same trick where the `RouterAsyncFSURL` has a reference to its file system. It would be a rather large change, and I (1) would rather not tie that to this PR and (2) have committed to limiting my programming substantially in this last month. What do you think of leaving this as a follow-up cleanup?

Commit d39557b

daniel-goldstein (Contributor):

hrmph, ya, while I don't hate the last option in principle, I'd rather not get that involved here. Can you just wrap the error-throwing in a method so that most of these call sites become one-liners and we can punt on the refactor?
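
One possible shape for that helper (hypothetical name; the merged code may differ):

```python3
from typing import Tuple

def get_bucket_and_name_or_raise(self, url: str) -> Tuple[str, str]:
    # Shared guard: methods that need an object name call this and
    # raise IsABucketError when given a bare bucket URL.
    bucket, name = self.get_bucket_and_name(url)
    if name == '':
        raise IsABucketError(url)
    return bucket, name
```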

danking (Collaborator, Author):

@daniel-goldstein OK, this was quite a few lines of code change, so worth a careful re-review.

The second review thread is anchored on this snippet:

```python3
return

bucket_url = str(base.with_path(''))
fs = RouterFS()
```

daniel-goldstein (Contributor):

this resource needs to be closed

danking (Collaborator, Author):

I made `RouterFS` a context manager and used that here. I also added an exit stack to `LocalBackend` and changed `ServiceBackend` to use its exit stack the same way as `LocalBackend` does.
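
Illustrative usage after that change (`bucket_url` as in the test snippet above):

```python3
with RouterFS() as fs:
    fs.ls(bucket_url)
# the filesystem is closed when the with-block exits
```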

danking dismissed daniel-goldstein's stale review on February 2, 2024 17:41:

Played some type golf trying to figure out what would work. See comments; I think there's a workable solution, but it's a large change to the FS stuff.

daniel-goldstein (Contributor) left a comment:

Addressed comment in thread. Let's just use a subroutine for now.

danking merged commit 671deef into hail-is:main on Feb 6, 2024. 8 checks passed.