[hailtop.fs] use parallelism to list directories #13253
Conversation
Qin He reported that listing a folder containing around 50k files took 1h15m. This new code takes ~14 seconds, which is about how long `gcloud storage ls` takes. There are two improvements:

1. Use `bounded_gather2`. The use of a semaphore in `bounded_gather2`, which is missing from `bounded_gather`, allows it to be used recursively. In particular, suppose we had a semaphore of 50. The outer `bounded_gather2` might need 20 slots to run its 20 paths in parallel. That leaves 30 slots of parallelism left over for its children. By passing the semaphore down, we let our children optimistically use some of that excess parallelism.

2. If we happen to have the `StatResult` for a particular object, we should never look it up again. In particular, getting the `StatResult` for every file in a directory can be done in O(1) requests. Getting the `StatResult` for each of those files individually (using their full paths) is necessarily O(N). If there was at least one glob and there are no `suffix_components`, then we can reuse the `StatResult`s that we learned while checking the glob pattern.

The latter point is perhaps clearer with examples:

1. `gs://foo/bar/baz`. Since there are no globs, we can make exactly one API request to list `gs://foo/bar/baz`.

2. `gs://foo/b*r/baz`. In this case, we must make one API request to list `gs://foo/`. This gives us a list of paths under that prefix. We check each path for conformance to the glob pattern `gs://foo/b*r`. For any path that matches, we must then list `<the matching path>/baz`, which may itself be a directory containing files. Overall, we make O(1) API requests to do the glob and then O(K) API requests to get the final `StatResult`s, where K is the number of paths matching the glob pattern.

3. `gs://foo/bar/b*z`. In this case, we must make one API request to list `gs://foo/bar/`. In `main`, we then throw away the `StatResult`s we got from that API request! Now we have to make O(K) requests to recover those `StatResult`s for all K paths that match the glob pattern. This PR just caches the `StatResult`s of the most recent globbing. If there is no suffix to later append, then we can just re-use the `StatResult`s we already have!
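For readers unfamiliar with the recursive-semaphore pattern in point 1, here is a minimal self-contained sketch (not Hail's actual implementation; `list_dir` and its fake 3-way directory tree are invented for illustration). One semaphore, created at the top, is passed down through every level of recursion, so total in-flight "API calls" across all levels stays bounded:

```python
import asyncio


async def list_dir(sema: asyncio.Semaphore, path: str, depth: int = 0) -> list:
    async with sema:
        # hold a slot only for the duration of our own "API call"
        await asyncio.sleep(0)  # stand-in for one list-objects request
        # pretend every directory has 3 children, two levels deep
        children = [] if depth == 2 else [f'{path}/{i}' for i in range(3)]
    if not children:
        return [path]
    # recurse WITHOUT holding our slot: children draw on the same shared
    # semaphore, so leftover parallelism flows down to them automatically
    results = await asyncio.gather(
        *(list_dir(sema, c, depth + 1) for c in children))
    return [p for sub in results for p in sub]


async def main() -> list:
    sema = asyncio.Semaphore(50)  # 50 slots shared by every recursion level
    return await list_dir(sema, 'gs://foo')


print(len(asyncio.run(main())))  # 3 children x 3 grandchildren = 9 leaves
```

Note that each task releases its slot before awaiting its children; holding a slot across the recursive gather could exhaust the semaphore with parents that are all waiting on children.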
Can you confirm there are existing tests for each of the code paths used in this code?
see comment about tests.
@@ -280,11 +280,13 @@ async def _async_ls(self,
                    *,
                    error_when_file_and_directory: bool = True,
                    _max_simultaneous_files: int = 50) -> List[StatResult]:
        sema = asyncio.Semaphore(_max_simultaneous_files)
Do we need to use `async with sema`?
Actually, I think this is fine: `bounded_gather2` does that.
cc: @daniel-goldstein since you've reviewed this before. Might be of interest.