
Add option to list all objects under an S3 location prefix. #3221

Open · wants to merge 4 commits into base: master
Conversation

willmostly (Contributor) commented Mar 24, 2020

Using hive.recursive-directories=true results in one S3 listObjects call per subdirectory under a partition location. These calls are made serially, which adds significant latency to short-duration queries. Setting hive.s3.use-pseudo-directories=false instead lists all objects under an S3 prefix with a single call.
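The tradeoff can be sketched with an in-memory model of S3 keys. All class and method names below are illustrative, not the PR's code; the real path goes through the AWS SDK's listObjects request, and S3 has no real directories, only keys and an optional delimiter.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: how a delimiter changes S3 listing behavior.
public class DelimiterSketch {
    static final String PATH_SEPARATOR = "/";

    // With a delimiter, only keys directly under the prefix are returned;
    // deeper keys collapse into "common prefixes" that each require
    // another listObjects call, one per pseudo-directory.
    static List<String> listWithDelimiter(List<String> allKeys, String prefix) {
        List<String> result = new ArrayList<>();
        for (String key : allKeys) {
            if (!key.startsWith(prefix)) {
                continue;
            }
            String rest = key.substring(prefix.length());
            int slash = rest.indexOf(PATH_SEPARATOR);
            String entry = slash < 0 ? key : prefix + rest.substring(0, slash + 1);
            if (!result.contains(entry)) {
                result.add(entry);
            }
        }
        return result;
    }

    // Without a delimiter, a single call returns every key under the prefix.
    static List<String> listRecursive(List<String> allKeys, String prefix) {
        List<String> result = new ArrayList<>();
        for (String key : allKeys) {
            if (key.startsWith(prefix)) {
                result.add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = List.of(
                "table/part=1/a.orc",
                "table/part=1/sub/b.orc",
                "table/part=1/sub/deep/c.orc");
        System.out.println(listWithDelimiter(keys, "table/part=1/"));
        // [table/part=1/a.orc, table/part=1/sub/] -> "sub/" needs another call
        System.out.println(listRecursive(keys, "table/part=1/"));
        // all three keys returned by one call
    }
}
```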

cla-bot added the cla-signed label Mar 24, 2020
Reviewed hunk:

```java
        .withRequesterPays(requesterPaysEnabled);
if (usePseudoDirectories) {
    request.setDelimiter(PATH_SEPARATOR);
}
```

findepi (Member) Mar 24, 2020

I understand the conditional removal of .setDelimiter(PATH_SEPARATOR) as a way to list all subdirectories in one shot, instead of hopping over directory levels. This is appropriate for directory listing, but only for tables/partitions when hive.recursive-directories is on.

  • shouldn't this new option be enabled only if that option is enabled?
  • shouldn't the FS layer be told, by the calling side, that all subdirectories are needed in the places where hive.recursive-directories actually matters? Then we wouldn't need the new toggle at all.

cc @electrum

rohangarg (Contributor) Mar 24, 2020

To add to what @findepi said: with the list caching done in CachingDirectoryLister, this approach can produce an inconsistent (and sometimes incorrect) view of the list results.
Further, doesn't the lazy split generation done currently already help in these scenarios?

willmostly (Author, Contributor) Mar 25, 2020

@findepi I reworked this to use hive.recursive-directories to control the behavior. Much cleaner this way.

@rohangarg - can you elaborate on when this could lead to an inconsistent list, and how that's different from walking the directories? I'm afraid I may be missing something there.

You're right that lazy split generation is good enough for most cases. The performance impact looks significant when (a) the number of subdirectories per partition is O(100), and (b) the target query time is O(1 second). But in general, reducing the number of calls to S3 should be a good thing.
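Back-of-envelope for that scenario, with an assumed per-call S3 latency (the 25 ms figure is an assumption for illustration, not a measurement from this PR):

```java
// Illustrative arithmetic only: the 25 ms round trip is assumed, not measured.
public class CallCountSketch {
    public static void main(String[] args) {
        int subdirectories = 100;     // O(100) subdirectories per partition
        double perCallMillis = 25.0;  // assumed S3 listObjects round trip
        double serialMillis = subdirectories * perCallMillis;
        System.out.printf("serial listing: %.0f ms, single flat call: %.0f ms%n",
                serialMillis, perCallMillis);
        // ~2500 ms of serial listing would dominate a ~1 s target query time.
    }
}
```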

rohangarg (Contributor) Mar 25, 2020

The example I was thinking of, where an error can occur:

  1. Set recursive listing to true and query a path: with the feature, the full set of nested objects is cached by the list-located-status call (in CachingDirectoryLister's cache object).
  2. Set recursive listing to false and query the same path: the nested objects should not appear in the listing result, yet the lister's cache still holds the complete nested listing.
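A minimal model of that staleness (names are hypothetical, not CachingDirectoryLister's actual code): a cache keyed by path alone cannot distinguish a recursive listing from a shallow one, so the recursive result from step 1 is replayed for the shallow request in step 2.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a listing cache keyed by path only.
public class ListingCacheSketch {
    private final Map<String, List<String>> cache = new HashMap<>();

    List<String> list(String path, boolean recursive, List<String> allKeys) {
        // The cache key omits the "recursive" flag, so a recursive result
        // cached earlier is returned verbatim for a later shallow request.
        return cache.computeIfAbsent(path, p ->
                allKeys.stream()
                        .filter(k -> k.startsWith(p))
                        .filter(k -> recursive || !k.substring(p.length()).contains("/"))
                        .toList());
    }

    public static void main(String[] args) {
        List<String> keys = List.of("dir/a.orc", "dir/sub/b.orc");
        ListingCacheSketch lister = new ListingCacheSketch();
        // Step 1: recursive listing populates the cache with nested keys.
        System.out.println(lister.list("dir/", true, keys));
        // Step 2: the shallow listing hits the cache and wrongly
        // returns the nested key as well.
        System.out.println(lister.list("dir/", false, keys));
    }
}
```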
dain (Member) left a comment

IIRC, Hive ignores directories that start with ., _, and maybe some others. How does this code deal with that?
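A sketch of the filter this implies, assuming the hidden-name convention dain describes (names starting with . or _, e.g. _SUCCESS). With a flat recursive listing, every path component must be checked, not just the leaf name. The helper below is illustrative, not the PR's code:

```java
import java.util.List;

// Illustrative hidden-path filter over a flat recursive listing.
public class HiddenPathFilter {
    // A key is hidden if any path component starts with '.' or '_'.
    static boolean isHidden(String key) {
        for (String component : key.split("/")) {
            if (component.startsWith(".") || component.startsWith("_")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> keys = List.of(
                "part=1/data.orc",
                "part=1/_SUCCESS",
                "part=1/.hidden/data.orc");
        keys.stream().filter(k -> !isHidden(k)).forEach(System.out::println);
        // only part=1/data.orc survives the filter
    }
}
```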

willmostly added 2 commits Mar 25, 2020
… objects under an S3 prefix in a single call.
electrum (Member) commented Mar 25, 2020

Rather than leaking details of S3 into the split loader, we should switch to the FS call that does recursion natively and implement it in the S3 FS: https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#listFiles-org.apache.hadoop.fs.Path-boolean-

With this call, we should be able to simplify the recursive code, since it’s all done by the file system.
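The Hadoop call referenced above has the shape listFiles(Path f, boolean recursive) and returns a RemoteIterator over LocatedFileStatus. A local analogue using java.nio (illustrative only; the real change would implement the Hadoop interface in the S3 FS) shows how a single recursive iterator removes the caller-side recursion:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Local analogue of FileSystem.listFiles(path, true): one call hands the
// caller every file at every depth, so the caller no longer recurses itself.
public class RecursiveListingSketch {
    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("listing-demo");
        Files.createDirectories(root.resolve("sub/deep"));
        Files.createFile(root.resolve("a.orc"));
        Files.createFile(root.resolve("sub/deep/b.orc"));

        // Single recursive traversal; no per-directory listing loop.
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile)
                    .map(root::relativize)
                    .sorted()
                    .forEach(System.out::println);
        }
    }
}
```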

willmostly (Author, Contributor) commented Mar 27, 2020

> IIRC, Hive ignores directories that start with ., _, and maybe some others. How does this code deal with that?

We can filter these directories and objects out during the listing, instead of in HiveFileIterator.

willmostly (Author, Contributor) commented Mar 27, 2020

> Rather than leaking details of S3 into the split loader, we should switch to this FS call that does recursive natively and implement it in S3 FS.

Sounds good to me.
