
Add option to list all objects under an S3 location prefix. #3221

Open · wants to merge 4 commits into base: master
Conversation

willmostly (Contributor) commented Mar 24, 2020

Using hive.recursive-directories=true results in one S3 listObjects call per subdirectory under a partition location. These calls are made serially, which adds significant latency to short-duration queries. Setting hive.s3.use-pseudo-directories=false instead lists all objects under an S3 prefix with a single call.
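The tradeoff can be sketched with an in-memory model of S3 keys. All class and method names below are illustrative, not the PR's code; the real path goes through the AWS SDK's listObjects request, and S3 has no real directories, only keys and an optional delimiter.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: how a delimiter changes S3 listing behavior.
public class DelimiterSketch {
    static final String PATH_SEPARATOR = "/";

    // With a delimiter, only keys directly under the prefix are returned;
    // deeper keys collapse into "common prefixes" that each require
    // another listObjects call, one per pseudo-directory.
    static List<String> listWithDelimiter(List<String> allKeys, String prefix) {
        List<String> result = new ArrayList<>();
        for (String key : allKeys) {
            if (!key.startsWith(prefix)) {
                continue;
            }
            String rest = key.substring(prefix.length());
            int slash = rest.indexOf(PATH_SEPARATOR);
            String entry = slash < 0 ? key : prefix + rest.substring(0, slash + 1);
            if (!result.contains(entry)) {
                result.add(entry);
            }
        }
        return result;
    }

    // Without a delimiter, a single call returns every key under the prefix.
    static List<String> listRecursive(List<String> allKeys, String prefix) {
        List<String> result = new ArrayList<>();
        for (String key : allKeys) {
            if (key.startsWith(prefix)) {
                result.add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = List.of(
                "table/part=1/a.orc",
                "table/part=1/sub/b.orc",
                "table/part=1/sub/deep/c.orc");
        System.out.println(listWithDelimiter(keys, "table/part=1/"));
        // [table/part=1/a.orc, table/part=1/sub/] -> "sub/" needs another call
        System.out.println(listRecursive(keys, "table/part=1/"));
        // all three keys returned by one call
    }
}
```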

cla-bot added the cla-signed label Mar 24, 2020
Reviewed hunk:

```java
        .withRequesterPays(requesterPaysEnabled);
if (usePseudoDirectories) {
    request.setDelimiter(PATH_SEPARATOR);
}
```

findepi (Member) Mar 24, 2020

I understand the conditional removal of .setDelimiter(PATH_SEPARATOR) as a way to list all subdirectories in one shot, instead of hopping over directory levels. This is appropriate for directory listing, but only for tables/partitions when hive.recursive-directories is on.

  • shouldn't this new option be enabled only if that option is enabled?
  • shouldn't the FS layer be told, by the calling side, that all subdirectories are needed in the places where hive.recursive-directories actually matters? Then we wouldn't need the new toggle at all.

cc @electrum

rohangarg (Contributor) Mar 24, 2020

To add to what @findepi said: with the list caching done in CachingDirectoryLister, this approach can produce an inconsistent (and sometimes incorrect) view of the list results.
Further, doesn't the lazy split generation done currently already help in these scenarios?

willmostly (Author, Contributor) Mar 25, 2020

@findepi I reworked this to use hive.recursive-directories to control the behavior. Much cleaner this way.

@rohangarg - can you elaborate on when this could lead to an inconsistent list, and how that's different from walking the directories? I'm afraid I may be missing something there.

You're right that lazy split generation is good enough for most cases. The performance impact looks significant when (a) the number of subdirectories per partition is O(100), and (b) the target query time is O(1 second). But in general, reducing the number of calls to S3 should be a good thing.
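Back-of-envelope for that scenario, with an assumed per-call S3 latency (the 25 ms figure is an assumption for illustration, not a measurement from this PR):

```java
// Illustrative arithmetic only: the 25 ms round trip is assumed, not measured.
public class CallCountSketch {
    public static void main(String[] args) {
        int subdirectories = 100;     // O(100) subdirectories per partition
        double perCallMillis = 25.0;  // assumed S3 listObjects round trip
        double serialMillis = subdirectories * perCallMillis;
        System.out.printf("serial listing: %.0f ms, single flat call: %.0f ms%n",
                serialMillis, perCallMillis);
        // ~2500 ms of serial listing would dominate a ~1 s target query time.
    }
}
```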

rohangarg (Contributor) Mar 25, 2020

The example I was thinking of, where an error can occur:

  1. Set recursive listing to true and query a path: with the feature, the full set of nested objects is cached by the list-located-status call (in CachingDirectoryLister's cache object).
  2. Set recursive listing to false and query the same path: the nested objects should not appear in the listing result, yet the lister's cache still holds the complete nested listing.
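A minimal model of that staleness (names are hypothetical, not CachingDirectoryLister's actual code): a cache keyed by path alone cannot distinguish a recursive listing from a shallow one, so the recursive result from step 1 is replayed for the shallow request in step 2.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a listing cache keyed by path only.
public class ListingCacheSketch {
    private final Map<String, List<String>> cache = new HashMap<>();

    List<String> list(String path, boolean recursive, List<String> allKeys) {
        // The cache key omits the "recursive" flag, so a recursive result
        // cached earlier is returned verbatim for a later shallow request.
        return cache.computeIfAbsent(path, p ->
                allKeys.stream()
                        .filter(k -> k.startsWith(p))
                        .filter(k -> recursive || !k.substring(p.length()).contains("/"))
                        .toList());
    }

    public static void main(String[] args) {
        List<String> keys = List.of("dir/a.orc", "dir/sub/b.orc");
        ListingCacheSketch lister = new ListingCacheSketch();
        // Step 1: recursive listing populates the cache with nested keys.
        System.out.println(lister.list("dir/", true, keys));
        // Step 2: the shallow listing hits the cache and wrongly
        // returns the nested key as well.
        System.out.println(lister.list("dir/", false, keys));
    }
}
```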
dain (Member) left a comment

IIRC, Hive ignores directories that start with ., _, and maybe some others. How does this code deal with that?
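A sketch of the filter this implies, assuming the hidden-name convention dain describes (names starting with . or _, e.g. _SUCCESS). With a flat recursive listing, every path component must be checked, not just the leaf name. The helper below is illustrative, not the PR's code:

```java
import java.util.List;

// Illustrative hidden-path filter over a flat recursive listing.
public class HiddenPathFilter {
    // A key is hidden if any path component starts with '.' or '_'.
    static boolean isHidden(String key) {
        for (String component : key.split("/")) {
            if (component.startsWith(".") || component.startsWith("_")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> keys = List.of(
                "part=1/data.orc",
                "part=1/_SUCCESS",
                "part=1/.hidden/data.orc");
        keys.stream().filter(k -> !isHidden(k)).forEach(System.out::println);
        // only part=1/data.orc survives the filter
    }
}
```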

willmostly added 2 commits Mar 25, 2020
… objects under an S3 prefix in a single call.
electrum (Member) commented Mar 25, 2020

Rather than leaking details of S3 into the split loader, we should switch to the FS call that does recursion natively and implement it in the S3 FS: https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#listFiles-org.apache.hadoop.fs.Path-boolean-

With this call, we should be able to simplify the recursive code, since it’s all done by the file system.
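The Hadoop call referenced above has the shape listFiles(Path f, boolean recursive) and returns a RemoteIterator over LocatedFileStatus. A local analogue using java.nio (illustrative only; the real change would implement the Hadoop interface in the S3 FS) shows how a single recursive iterator removes the caller-side recursion:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Local analogue of FileSystem.listFiles(path, true): one call hands the
// caller every file at every depth, so the caller no longer recurses itself.
public class RecursiveListingSketch {
    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("listing-demo");
        Files.createDirectories(root.resolve("sub/deep"));
        Files.createFile(root.resolve("a.orc"));
        Files.createFile(root.resolve("sub/deep/b.orc"));

        // Single recursive traversal; no per-directory listing loop.
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile)
                    .map(root::relativize)
                    .sorted()
                    .forEach(System.out::println);
        }
    }
}
```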

willmostly (Author, Contributor) commented Mar 27, 2020

> IIRC, Hive ignores directories that start with ., _, and maybe some others. How does this code deal with that?

We can filter these directories and objects out during the listing, instead of in HiveFileIterator.

willmostly (Author, Contributor) commented Mar 27, 2020

> Rather than leaking details of S3 into the split loader, we should switch to this FS call that does recursive natively and implement it in S3 FS.

Sounds good to me.
