
Add option to list all objects under an S3 location prefix. #3221

Closed

Conversation

willmostly
Contributor

Using hive.recursive-directories=true results in one S3 listObjects call per subdirectory under a partition location. These calls are made serially, and the overhead can be significant for short-duration queries. Setting hive.s3.use-pseudo-directories=false allows one to instead list all objects under an S3 prefix with a single call.
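The difference comes down to S3's delimiter parameter. The sketch below is a simplified, self-contained illustration of that semantics (it simulates S3 over an in-memory key list; it is not the actual PrestoS3FileSystem code): with a "/" delimiter, one call returns only a prefix's immediate children, so recursion needs one call per subdirectory, while with no delimiter a single call returns every object under the prefix.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified illustration of S3 listing semantics (not the actual
// PrestoS3FileSystem code).
public class S3ListingSketch
{
    // One simulated listObjects call with a "/" delimiter: keys directly
    // under the prefix, plus collapsed "common prefixes" (directories).
    public static Set<String> listWithDelimiter(List<String> keys, String prefix)
    {
        Set<String> results = new LinkedHashSet<>();
        for (String key : keys) {
            if (!key.startsWith(prefix)) {
                continue;
            }
            String rest = key.substring(prefix.length());
            int slash = rest.indexOf('/');
            // Collapse everything below the first "/" into a common prefix
            results.add(slash < 0 ? key : prefix + rest.substring(0, slash + 1));
        }
        return results;
    }

    // One simulated listObjects call with no delimiter: every object
    // under the prefix comes back in a single flat listing.
    public static List<String> listWithoutDelimiter(List<String> keys, String prefix)
    {
        List<String> results = new ArrayList<>();
        for (String key : keys) {
            if (key.startsWith(prefix)) {
                results.add(key);
            }
        }
        return results;
    }
}
```

With the delimiter, each common prefix returned would require another listObjects call to descend into it; without it, the flat listing is complete after one call.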

@cla-bot cla-bot bot added the cla-signed label Mar 24, 2020
```java
.withRequesterPays(requesterPaysEnabled);
if (usePseudoDirectories) {
    request.setDelimiter(PATH_SEPARATOR);
```
@findepi
Member

I understand the conditional removal of .setDelimiter(PATH_SEPARATOR) as a way to list all subdirectories in one shot, instead of hopping over directory levels. This is appropriate for directory listing, but only for tables/partitions when hive.recursive-directories is on.

  • shouldn't this new option be enabled only if that option is enabled?
  • shouldn't the FS layer know it is being called from a place where hive.recursive-directories actually matters, i.e. shouldn't we modify the calling side to pass the information that all subdirectories are needed? Then we wouldn't need the new toggle at all.

cc @electrum

@rohangarg
Member

To add to what @findepi said, I think that with list caching done in CachingDirectoryLister, this approach can produce an inconsistent (and sometimes incorrect) view of the list results.
Further, doesn't the lazy split generation done currently help in these scenarios?

@willmostly
Contributor Author

@findepi I reworked this to use hive.recursive-directories to control the behavior. Much cleaner this way.

@rohangarg - can you elaborate on when this could lead to an inconsistent list, and how that's different from walking the directories? I'm afraid I may be missing something there.

You're right that lazy split generation is good enough for most cases. The performance impact looks significant when A) the number of subdirectories per partition is O(100), and B) the target query time is O(1 second). But in general, reducing the number of calls to S3 should be a good thing.

@rohangarg
Member

The example I was thinking of where an error can occur is:

  1. Set recursive listing to true and query a path: with this feature, the full set of nested objects is cached by the list located status call (in CachingDirectoryLister's cache object).
  2. Set recursive listing to false and query the same path: now the nested objects should not appear in the listing result, whereas the Lister's cache still has the complete nested listing stored.
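The failure mode above can be sketched in a few lines. This is a minimal illustration assuming a listing cache keyed by path alone; the class and method names are made up for the sketch and are not the actual CachingDirectoryLister code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the inconsistency: a listing cache keyed by path
// alone cannot distinguish a recursive listing from a shallow one, so
// a result cached under one setting is wrongly served under the other.
public class ListingCacheSketch
{
    private final Map<String, List<String>> cache = new HashMap<>();

    public List<String> list(String path, boolean recursive)
    {
        // The bug being illustrated: "recursive" is not part of the cache key
        return cache.computeIfAbsent(path, p -> listFromStorage(p, recursive));
    }

    private List<String> listFromStorage(String path, boolean recursive)
    {
        // Stand-in for the real file-system call
        return recursive
                ? List.of(path + "/nested/dir/file1", path + "/file2")
                : List.of(path + "/file2");
    }
}
```

A recursive query populates the cache with nested objects; a later non-recursive query for the same path then receives the stale recursive result instead of the shallow one.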

@dain
Member

IIRC, Hive ignores directories that start with ., _, and maybe some others. How does this code deal with that?

@electrum
Member

Rather than leaking details of S3 into the split loader, we should switch to this FS call that does recursive natively and implement it in S3 FS: https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#listFiles-org.apache.hadoop.fs.Path-boolean-

With this call, we should be able to simplify the recursive code, since it’s all done by the file system.
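The call shape electrum points at can be mocked up without a Hadoop dependency. The tiny interface below only mirrors the signature of Hadoop's FileSystem.listFiles(Path, boolean); the names are illustrative, and the lambda stands in for an S3-backed implementation that satisfies the recursive case with a single delimiter-less listing rather than a per-directory walk.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of moving recursion into the file system: callers ask for a
// recursive listing via a flag, and the S3 implementation can answer it
// with one flat listObjects call. Not the actual Hadoop FileSystem class.
public class RecursiveListingSketch
{
    interface SimpleFileSystem
    {
        Iterator<String> listFiles(String path, boolean recursive);
    }

    static SimpleFileSystem flatS3Listing(List<String> keys)
    {
        return (path, recursive) -> {
            List<String> result = new ArrayList<>();
            for (String key : keys) {
                // recursive: everything under the prefix;
                // shallow: direct children only (no further "/")
                if (key.startsWith(path) && (recursive || key.indexOf('/', path.length()) < 0)) {
                    result.add(key);
                }
            }
            return result.iterator();
        };
    }
}
```

With this shape the split loader simply iterates the result and never needs to know whether the file system walked directories or issued one flat call.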

@willmostly
Contributor Author

> IIRC, Hive ignores directories that start with ., _, and maybe some others. How does this code deal with that?

We can filter these directories and objects out during the listing itself, instead of in HiveFileIterator.
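The convention dain mentions (skipping names that begin with '.' or '_') could be applied as a predicate over the listed keys. A minimal sketch, with an illustrative helper name rather than the actual HiveFileIterator code:

```java
// Sketch of Hive's hidden-name convention: any path component starting
// with '.' or '_' marks the whole object as hidden. Helper name is
// illustrative, not the actual connector code.
public class HiddenFileSketch
{
    public static boolean isHidden(String key)
    {
        // Check every path component, so objects inside hidden
        // directories (e.g. "_temporary/part-0") are skipped too
        for (String component : key.split("/")) {
            if (component.startsWith(".") || component.startsWith("_")) {
                return true;
            }
        }
        return false;
    }
}
```

Checking every component matters for a flat recursive listing: a single-call listObjects result contains keys under hidden directories that a level-by-level walk would never have descended into.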

@willmostly
Contributor Author

> Rather than leaking details of S3 into the split loader, we should switch to this FS call that does recursive natively and implement it in S3 FS.

Sounds good to me.

@findepi
Member

findepi commented Aug 21, 2020

See also #4825

@colebow
Member

colebow commented Oct 19, 2022

👋 @willmostly - this PR is inactive and doesn't seem to be under development, and it might already be implemented. If you'd like to continue work on this at any point in the future, feel free to re-open.

@colebow colebow closed this Oct 19, 2022