
[hailtop.fs] use parallelism to list directories #13253

Merged: 1 commit into hail-is:main on Jul 18, 2023

Conversation

danking (Collaborator) commented on Jul 17, 2023

Qin He reported that listing a folder containing around 50k files took 1h15m. This new code takes
~16 seconds, which is about how long `gcloud storage ls` takes.

There are two improvements:

1. Use `bounded_gather2`. The use of a semaphore in `bounded_gather2`, which is missing from
   `bounded_gather`, allows it to be used recursively. In particular, suppose we had a semaphore of
   50. The outer `bounded_gather2` might need 20 slots to run its 20 paths in parallel. That leaves
   30 slots of parallelism left over for its children. By passing the semaphore down, we let our
   children optimistically use some of that excess parallelism. (A sketch of this idea follows this
   list.)

2. If we happen to have the `StatResult` for a particular object, we should never again look it up.
   In particular, getting the `StatResult` for every file in a directory can be done in O(1)
   requests, whereas getting the `StatResult` for each of those files individually (using their full
   paths) is necessarily O(N). If there was at least one glob and there are no `suffix_components`,
   then we can reuse the `StatResult`s that we learned when checking the glob pattern.
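To make point 1 concrete, here is a minimal, self-contained sketch of sharing one semaphore across
recursive listing calls. All names here (`TREE`, `list_prefix`, `ls`) are illustrative stand-ins,
not hailtop's actual API; the point is only that every level of recursion draws on the same pool of
slots:

```python
import asyncio
from typing import Dict, List

# Fake object store: prefix -> immediate children (illustrative only).
TREE: Dict[str, List[str]] = {
    'gs://foo/': ['gs://foo/bar/', 'gs://foo/baz'],
    'gs://foo/bar/': ['gs://foo/bar/a', 'gs://foo/bar/b'],
}

async def list_prefix(prefix: str) -> List[str]:
    await asyncio.sleep(0.01)  # stand-in for one API request
    return TREE.get(prefix, [])

async def ls(sema: asyncio.Semaphore, prefix: str) -> List[str]:
    async with sema:  # hold a slot only while the API request is in flight
        entries = await list_prefix(prefix)
    files = [e for e in entries if not e.endswith('/')]
    dirs = [e for e in entries if e.endswith('/')]
    # Recurse with the *same* semaphore: nested listings compete for the
    # leftover slots rather than each level getting its own pool of 50.
    nested = await asyncio.gather(*(ls(sema, d) for d in dirs))
    return files + [f for sub in nested for f in sub]

async def main() -> None:
    sema = asyncio.Semaphore(50)  # one shared bound on in-flight requests
    print(await ls(sema, 'gs://foo/'))

asyncio.run(main())
```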

The latter point is perhaps clearer with examples:

1. `gs://foo/bar/baz`. Since there are no globs, we can make exactly one API request to list
   `gs://foo/bar/baz`.

2. `gs://foo/b*r/baz`. In this case, we must make one API request to list `gs://foo/`. This gives us
   a list of paths under that prefix. We check each path for conformance to the glob pattern
   `gs://foo/b*r`. For any path that matches, we must then list `<the matching path>/baz`, which may
   itself be a directory containing files. Overall we make O(1) API requests to do the glob and then
   O(K) API requests to get the final `StatResult`s, where K is the number of paths matching the
   glob pattern.

3. `gs://foo/bar/b*z`. In this case, we must make one API request to list `gs://foo/bar/`. In
   `main`, we then throw away the `StatResult`s we got from that API request! Now we have to make
   O(K) requests to recover those `StatResult`s for all K paths that match the glob pattern. This PR
   just caches the `StatResult`s of the most recent globbing. If there is no suffix to later append,
   then we can just reuse the `StatResult`s we already have! (A sketch of this reuse follows these
   examples.)
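And a small sketch of the reuse in example 3, assuming (hypothetically) that a single listing call
already yields a `StatResult` per path; the dict-of-dicts below is a stand-in for those results:

```python
import fnmatch
from typing import Dict

# Pretend these came back from the one listing of gs://foo/bar/ (sizes are
# made up, and a real StatResult carries more than a size).
listing: Dict[str, Dict[str, int]] = {
    'gs://foo/bar/buzz': {'size_bytes': 10},
    'gs://foo/bar/boz': {'size_bytes': 20},
    'gs://foo/bar/qux': {'size_bytes': 30},
}

# The glob is in the last component and there is no suffix to append, so we
# can filter the cached results instead of issuing one request per match.
matches = {path: stat for path, stat in listing.items()
           if fnmatch.fnmatch(path, 'gs://foo/bar/b*z')}
print(matches)  # buzz and boz match: zero additional API requests
```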

cc: @daniel-goldstein since you've reviewed this before. Might be of interest.

jigold (Collaborator) commented on Jul 18, 2023

Can you confirm there are already existing tests for each of the code paths used in this code?

jigold (Collaborator) previously requested changes on Jul 18, 2023:

see comment about tests.

danking (Collaborator, Author) commented on Jul 18, 2023

`hadoop_ls` is implemented in terms of `FS.ls`, and there are ample tests of `hadoop_ls` in `test_utils.py`: see here and here. Some of the tests include (a sketch of the general shape of these tests follows the list):

1. No globbing at all (`test_hadoop_ls`)
2. Globbing in the second-to-last component (`test_hadoop_glob_heterogenous_structure`, `test_hadoop_ls_folder_glob`)
3. Globbing in the last component (`test_hadoop_ls_glob_1`, `test_hadoop_ls_glob_2`, `test_hadoop_ls_negated_group`)
4. Multiple globs before the last component (`test_hadoop_ls_two_folder_globs`)
5. Multiple globs including the last component (`test_hadoop_ls_prefix_folder_glob_qmarks`, `test_hadoop_ls_two_folder_globs_and_two_qmarks`, `test_hadoop_ls_one_folder_glob_and_qmarks_in_multiple_components`, `test_hadoop_ls_component_with_only_groups`)
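For the curious, a hedged sketch of that general shape; this is illustrative, not the actual code
in `test_utils.py`, and it assumes files `a1`, `a2`, and `b1` already exist under `prefix`:

```python
import hail as hl

def test_ls_last_component_glob(prefix: str) -> None:
    # hadoop_ls returns one dict per matching path; 'path' is among its keys.
    results = hl.hadoop_ls(f'{prefix}/a*')
    paths = sorted(r['path'] for r in results)
    assert paths == [f'{prefix}/a1', f'{prefix}/a2']
```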

```diff
@@ -280,11 +280,13 @@ async def _async_ls(self,
                     *,
                     error_when_file_and_directory: bool = True,
                     _max_simultaneous_files: int = 50) -> List[StatResult]:
+    sema = asyncio.Semaphore(_max_simultaneous_files)
```
A collaborator commented on this line:

Do we need to use `async with sema`?

A collaborator replied:

Actually, I think this is fine; `bounded_gather2` does that.
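For context, a minimal sketch of how a `bounded_gather2`-style helper can own the `async with sema`
bookkeeping itself, so call sites only create the semaphore and pass it down. The name and signature
here are assumptions for illustration, not hailtop's actual implementation:

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar('T')

async def bounded_gather2_sketch(sema: asyncio.Semaphore,
                                 *thunks: Callable[[], Awaitable[T]]) -> List[T]:
    async def run_one(thunk: Callable[[], Awaitable[T]]) -> T:
        async with sema:  # each task holds one slot only while it runs
            return await thunk()
    return await asyncio.gather(*(run_one(t) for t in thunks))
```

Because nested calls can receive the same semaphore, the total number of in-flight tasks stays
bounded by the semaphore's initial value no matter how deep the recursion goes.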

danking merged commit 044d6bb into hail-is:main on Jul 18, 2023
8 checks passed