
[hailtop.fs] use parallelism to list directories #13253

Merged: 1 commit into hail-is:main on Jul 18, 2023

Conversation

danking (Collaborator) commented on Jul 17, 2023

Qin He reported that listing a folder containing around 50k files took 1h15m. This new code takes
~16 seconds, which is about how long `gcloud storage ls` takes.

There are two improvements:

1. Use `bounded_gather2`. The use of a semaphore in `bounded_gather2`, which is missing from
   `bounded_gather`, allows it to be used recursively. In particular, suppose we had a semaphore of
   50. The outer `bounded_gather2` might need 20 slots to run its 20 paths in parallel. That leaves
   30 slots of parallelism left over for its children. By passing the semaphore down, we let our
   children optimistically use some of that excess parallelism. (A sketch of this idea follows this
   list.)

2. If we happen to have the `StatResult` for a particular object, we should never again look it up.
   In particular, getting the `StatResult` for every file in a directory can be done in O(1)
   requests, whereas getting the `StatResult` for each of those files individually (using their full
   paths) is necessarily O(N). If there was at least one glob and there are no `suffix_components`,
   then we can reuse the `StatResult`s that we learned when checking the glob pattern.
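To make point 1 concrete, here is a minimal, self-contained sketch of sharing one semaphore across
recursive listing calls. All names here (`TREE`, `list_prefix`, `ls`) are illustrative stand-ins,
not hailtop's actual API; the point is only that every level of recursion draws on the same pool of
slots:

```python
import asyncio
from typing import Dict, List

# Fake object store: prefix -> immediate children (illustrative only).
TREE: Dict[str, List[str]] = {
    'gs://foo/': ['gs://foo/bar/', 'gs://foo/baz'],
    'gs://foo/bar/': ['gs://foo/bar/a', 'gs://foo/bar/b'],
}

async def list_prefix(prefix: str) -> List[str]:
    await asyncio.sleep(0.01)  # stand-in for one API request
    return TREE.get(prefix, [])

async def ls(sema: asyncio.Semaphore, prefix: str) -> List[str]:
    async with sema:  # hold a slot only while the API request is in flight
        entries = await list_prefix(prefix)
    files = [e for e in entries if not e.endswith('/')]
    dirs = [e for e in entries if e.endswith('/')]
    # Recurse with the *same* semaphore: nested listings compete for the
    # leftover slots rather than each level getting its own pool of 50.
    nested = await asyncio.gather(*(ls(sema, d) for d in dirs))
    return files + [f for sub in nested for f in sub]

async def main() -> None:
    sema = asyncio.Semaphore(50)  # one shared bound on in-flight requests
    print(await ls(sema, 'gs://foo/'))

asyncio.run(main())
```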

The latter point is perhaps clearer with examples:

1. `gs://foo/bar/baz`. Since there are no globs, we can make exactly one API request to list
   `gs://foo/bar/baz`.

2. `gs://foo/b*r/baz`. In this case, we must make one API request to list `gs://foo/`. This gives us
   a list of paths under that prefix. We check each path for conformance to the glob pattern
   `gs://foo/b*r`. For any path that matches, we must then list `<the matching path>/baz`, which may
   itself be a directory containing files. Overall we make O(1) API requests to do the glob and then
   O(K) API requests to get the final `StatResult`s, where K is the number of paths matching the
   glob pattern.

3. `gs://foo/bar/b*z`. In this case, we must make one API request to list `gs://foo/bar/`. In
   `main`, we then throw away the `StatResult`s we got from that API request! Now we have to make
   O(K) requests to recover those `StatResult`s for all K paths that match the glob pattern. This PR
   just caches the `StatResult`s of the most recent globbing. If there is no suffix to later append,
   then we can just reuse the `StatResult`s we already have! (A sketch of this reuse follows these
   examples.)
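And a small sketch of the reuse in example 3, assuming (hypothetically) that a single listing call
already yields a `StatResult` per path; the dict-of-dicts below is a stand-in for those results:

```python
import fnmatch
from typing import Dict

# Pretend these came back from the one listing of gs://foo/bar/ (sizes are
# made up, and a real StatResult carries more than a size).
listing: Dict[str, Dict[str, int]] = {
    'gs://foo/bar/buzz': {'size_bytes': 10},
    'gs://foo/bar/boz': {'size_bytes': 20},
    'gs://foo/bar/qux': {'size_bytes': 30},
}

# The glob is in the last component and there is no suffix to append, so we
# can filter the cached results instead of issuing one request per match.
matches = {path: stat for path, stat in listing.items()
           if fnmatch.fnmatch(path, 'gs://foo/bar/b*z')}
print(matches)  # buzz and boz match: zero additional API requests
```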

cc: @daniel-goldstein since you've reviewed this before. Might be of interest.

jigold (Collaborator) commented on Jul 18, 2023

Can you confirm there are already existing tests for each of the code paths used in this code?

jigold (Collaborator) previously requested changes on Jul 18, 2023:

see comment about tests.

danking (Collaborator, Author) commented on Jul 18, 2023

`hadoop_ls` is implemented in terms of `FS.ls`, and there are ample tests of `hadoop_ls` in `test_utils.py`: see here and here. Some of the tests include (a sketch of the general shape of these tests follows the list):

1. No globbing at all (`test_hadoop_ls`)
2. Globbing in the second-to-last component (`test_hadoop_glob_heterogenous_structure`, `test_hadoop_ls_folder_glob`)
3. Globbing in the last component (`test_hadoop_ls_glob_1`, `test_hadoop_ls_glob_2`, `test_hadoop_ls_negated_group`)
4. Multiple globs before the last component (`test_hadoop_ls_two_folder_globs`)
5. Multiple globs including the last component (`test_hadoop_ls_prefix_folder_glob_qmarks`, `test_hadoop_ls_two_folder_globs_and_two_qmarks`, `test_hadoop_ls_one_folder_glob_and_qmarks_in_multiple_components`, `test_hadoop_ls_component_with_only_groups`)
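For the curious, a hedged sketch of that general shape; this is illustrative, not the actual code
in `test_utils.py`, and it assumes files `a1`, `a2`, and `b1` already exist under `prefix`:

```python
import hail as hl

def test_ls_last_component_glob(prefix: str) -> None:
    # hadoop_ls returns one dict per matching path; 'path' is among its keys.
    results = hl.hadoop_ls(f'{prefix}/a*')
    paths = sorted(r['path'] for r in results)
    assert paths == [f'{prefix}/a1', f'{prefix}/a2']
```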

```diff
@@ -280,11 +280,13 @@ async def _async_ls(self,
                     *,
                     error_when_file_and_directory: bool = True,
                     _max_simultaneous_files: int = 50) -> List[StatResult]:
+    sema = asyncio.Semaphore(_max_simultaneous_files)
```
A collaborator commented on this line:

Do we need to use `async with sema`?

A collaborator replied:

Actually, I think this is fine; `bounded_gather2` does that.
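For context, a minimal sketch of how a `bounded_gather2`-style helper can own the `async with sema`
bookkeeping itself, so call sites only create the semaphore and pass it down. The name and signature
here are assumptions for illustration, not hailtop's actual implementation:

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar('T')

async def bounded_gather2_sketch(sema: asyncio.Semaphore,
                                 *thunks: Callable[[], Awaitable[T]]) -> List[T]:
    async def run_one(thunk: Callable[[], Awaitable[T]]) -> T:
        async with sema:  # each task holds one slot only while it runs
            return await thunk()
    return await asyncio.gather(*(run_one(t) for t in thunks))
```

Because nested calls can receive the same semaphore, the total number of in-flight tasks stays
bounded by the semaphore's initial value no matter how deep the recursion goes.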

danking merged commit 044d6bb into hail-is:main on Jul 18, 2023
8 checks passed