New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for query pushdown to S3 using S3Select #11033
Conversation
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks! |
Working on getting the CLA action item closed. |
@sameer2086 This only works for text files, am I correct? Do you plan to use other file formats in the future? |
presto-hive/src/main/java/com/facebook/presto/hive/HiveS3Optimizer.java
Outdated
Show resolved
Hide resolved
@sopel39 |
presto-hive/src/main/java/com/facebook/presto/hive/IonSqlQueryBuilder.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectCsvRecordReader.java
Show resolved
Hide resolved
549f31e
to
428443c
Compare
Addressed cr feedback. New revision is out. |
2f5e7c6
to
0fbb993
Compare
Just noting that I'm told (by FB OSPO) that "If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed." is inaccurate. Facebook's position (from CLA page) is: "If you are making contributions to our repositories on behalf of your company, then we will need a Corporate Contributor License Agreement (CLA) signed." Just as a note that your CLABot may be out of date. |
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectLineRecordReader.java
Outdated
Show resolved
Hide resolved
0fbb993
to
b457e31
Compare
Referring to our offline discussion -- some general thoughts. I am looking for someone to substitute me in the review here, as I have limited availability currently.
|
@findepi Agreed. Either TPC-H or TPC-DS against Text (with S3 select on/off) plus ORC as a reference. |
Thanks for the inputs findepi@ and kbajda@, we are working on the technical documents and benchmark numbers and will update the PR with the same. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a pass over non-test classes only. Here are some notes:
- This feature definitely needs proper documentation.
- There is code duplication in PrestoS3Client and PrestoS3FileSystem
- There is lots of log info statements (e.g., for each branch taken, etc.)
- We shouldn't introduce a new dependency just for retry support (see my comment about
RetryDriver
) - Class hierarchies can be given more thought (e.g., abstract classes and interfaces with only one subclass/implementation)
- I left many comments to give you an idea about how to make this PR more consistent with our coding conventions.
presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/HiveS3Optimizer.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/HiveS3Optimizer.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/HiveS3Optimizer.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectRecordCursorProvider.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectRecordCursorProvider.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectRecordCursorProvider.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectRecordCursorProvider.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectRecordCursorProvider.java
Outdated
Show resolved
Hide resolved
Thanks for the detailed code review @nezihyigitbasi, working on the feedback. Will respond to your questions soon. |
presto-hive/src/main/java/com/facebook/presto/hive/HiveS3Optimizer.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectLineRecordReader.java
Outdated
Show resolved
Hide resolved
Where's Nikki?
…On Thu, Jul 12, 2018, 6:43 PM Sameer Choudhary ***@***.***> wrote:
This change will allow Presto users to improve the performance of their
queries using S3Select.
The optimizer pushes down projections and predicate evaluation to S3 using
S3Optimizer. As a result
Presto doesn't need to download full S3 objects and only data required to
answer the user's query
is returned to Presto, thereby improving performance.
------------------------------
You can view, comment on, or merge this pull request online at:
#11033
Commit Summary
- Add support for S3Optimizer using S3Select
File Changes
- *M* pom.xml
<https://github.com/prestodb/presto/pull/11033/files#diff-0> (9)
- *M* presto-hive-hadoop2/pom.xml
<https://github.com/prestodb/presto/pull/11033/files#diff-1> (32)
- *M* presto-hive/pom.xml
<https://github.com/prestodb/presto/pull/11033/files#diff-2> (21)
- *M*
presto-hive/src/main/java/com/facebook/presto/hive/BackgroundHiveSplitLoader.java
<https://github.com/prestodb/presto/pull/11033/files#diff-3> (5)
- *M*
presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java
<https://github.com/prestodb/presto/pull/11033/files#diff-4> (15)
- *M*
presto-hive/src/main/java/com/facebook/presto/hive/HiveClientModule.java
<https://github.com/prestodb/presto/pull/11033/files#diff-5> (1)
- *M*
presto-hive/src/main/java/com/facebook/presto/hive/HivePageSourceProvider.java
<https://github.com/prestodb/presto/pull/11033/files#diff-6> (6)
- *M*
presto-hive/src/main/java/com/facebook/presto/hive/HivePartitionMetadata.java
<https://github.com/prestodb/presto/pull/11033/files#diff-7> (9)
- *A*
presto-hive/src/main/java/com/facebook/presto/hive/HiveS3Optimizer.java
<https://github.com/prestodb/presto/pull/11033/files#diff-8> (204)
- *M*
presto-hive/src/main/java/com/facebook/presto/hive/HiveSessionProperties.java
<https://github.com/prestodb/presto/pull/11033/files#diff-9> (11)
- *M* presto-hive/src/main/java/com/facebook/presto/hive/HiveSplit.java
<https://github.com/prestodb/presto/pull/11033/files#diff-10> (13)
- *M*
presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitManager.java
<https://github.com/prestodb/presto/pull/11033/files#diff-11> (24)
- *M*
presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitSource.java
<https://github.com/prestodb/presto/pull/11033/files#diff-12> (4)
- *M* presto-hive/src/main/java/com/facebook/presto/hive/HiveUtil.java
<https://github.com/prestodb/presto/pull/11033/files#diff-13> (20)
- *M*
presto-hive/src/main/java/com/facebook/presto/hive/InternalHiveSplit.java
<https://github.com/prestodb/presto/pull/11033/files#diff-14> (10)
- *A*
presto-hive/src/main/java/com/facebook/presto/hive/IonSqlQueryBuilder.java
<https://github.com/prestodb/presto/pull/11033/files#diff-15> (175)
- *A*
presto-hive/src/main/java/com/facebook/presto/hive/PrestoS3Client.java
<https://github.com/prestodb/presto/pull/11033/files#diff-16> (135)
- *A*
presto-hive/src/main/java/com/facebook/presto/hive/PrestoS3SelectClient.java
<https://github.com/prestodb/presto/pull/11033/files#diff-17> (97)
- *A*
presto-hive/src/main/java/com/facebook/presto/hive/QueryBuilder.java
<https://github.com/prestodb/presto/pull/11033/files#diff-18> (181)
- *A*
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectClient.java
<https://github.com/prestodb/presto/pull/11033/files#diff-19> (33)
- *A*
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectCsvRecordReader.java
<https://github.com/prestodb/presto/pull/11033/files#diff-20> (94)
- *A*
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectLineRecordReader.java
<https://github.com/prestodb/presto/pull/11033/files#diff-21> (200)
- *A*
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectRecordCursor.java
<https://github.com/prestodb/presto/pull/11033/files#diff-22> (201)
- *A*
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectRecordCursorProvider.java
<https://github.com/prestodb/presto/pull/11033/files#diff-23> (85)
- *M*
presto-hive/src/main/java/com/facebook/presto/hive/s3/HiveS3Config.java
<https://github.com/prestodb/presto/pull/11033/files#diff-24> (1)
- *M*
presto-hive/src/main/java/com/facebook/presto/hive/util/InternalHiveSplitFactory.java
<https://github.com/prestodb/presto/pull/11033/files#diff-25> (8)
- *M*
presto-hive/src/test/java/com/facebook/presto/hive/AbstractTestHiveClient.java
<https://github.com/prestodb/presto/pull/11033/files#diff-26> (3)
- *M*
presto-hive/src/test/java/com/facebook/presto/hive/AbstractTestHiveFileSystem.java
<https://github.com/prestodb/presto/pull/11033/files#diff-27> (3)
- *M*
presto-hive/src/test/java/com/facebook/presto/hive/HiveTestUtils.java
<https://github.com/prestodb/presto/pull/11033/files#diff-28> (13)
- *A*
presto-hive/src/test/java/com/facebook/presto/hive/MockAmazonS3.java
<https://github.com/prestodb/presto/pull/11033/files#diff-29> (1014)
- *M*
presto-hive/src/test/java/com/facebook/presto/hive/TestBackgroundHiveSplitLoader.java
<https://github.com/prestodb/presto/pull/11033/files#diff-30> (5)
- *M*
presto-hive/src/test/java/com/facebook/presto/hive/TestHiveClientConfig.java
<https://github.com/prestodb/presto/pull/11033/files#diff-31> (7)
- *M*
presto-hive/src/test/java/com/facebook/presto/hive/TestHivePageSink.java
<https://github.com/prestodb/presto/pull/11033/files#diff-32> (3)
- *M*
presto-hive/src/test/java/com/facebook/presto/hive/TestHiveSplit.java
<https://github.com/prestodb/presto/pull/11033/files#diff-33> (4)
- *M*
presto-hive/src/test/java/com/facebook/presto/hive/TestHiveSplitSource.java
<https://github.com/prestodb/presto/pull/11033/files#diff-34> (20)
- *A*
presto-hive/src/test/java/com/facebook/presto/hive/TestIonSqlQueryBuilder.java
<https://github.com/prestodb/presto/pull/11033/files#diff-35> (158)
- *M* presto-main/etc/catalog/hive.properties
<https://github.com/prestodb/presto/pull/11033/files#diff-36> (4)
Patch Links:
- https://github.com/prestodb/presto/pull/11033.patch
- https://github.com/prestodb/presto/pull/11033.diff
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#11033>, or mute the thread
<https://github.com/notifications/unsubscribe-auth/AWhhj2qSlq5oET5a4xNKHJen-sa-sp7Qks5uF_tNgaJpZM4VOLCF>
.
|
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks! |
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectCsvRecordReader.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/PrestoS3Client.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/PrestoS3Client.java
Outdated
Show resolved
Hide resolved
We ran TPCDS-100 tests and a small fraction of the queries are currently failing due to I had a quick discussion with nezih@ and the plan is to fix this bug before merging the code with master. I have updated the code as per the feedback and will upload the new revision soon. There will be follow-up revision to fix the identified bug.
|
b457e31
to
3b1c7c2
Compare
We were able to root cause the issue. It was caused due to default max number of file descriptors (4096) configured for the hosts. For some queries, Presto will send many S3Select requests on one node. For eg, Presto sends 8000+ requests for TPCDS query 4. Each time we send a HTTPS request, we will read the truststore file, and then 8000 requests means we open 8000 fd for the files. However, Linux has a limit on the max # of fd one user could open, which is set to 4096 for our test hosts, and this is the reason why we cannot read the truststore file, and see the above exception. Raising the limit on the hosts resolves the issue. We are now able to execute all TPCDS-100 and TPCDS-1000 queries.
We will add this configuration tuning in the documentation. |
This doesn't sound good. The truststore file rarely changes (and I doubt we're obliged to react to its changes anyway). |
@sameer2086 The technical doc looks good to me. I believe to capture all concern well. Feel free to translate that into the actual end-user documentation likely as a section under https://prestodb.io/docs/current/connector/hive.html#amazon-s3-configuration |
Thanks for reviewing the document @kbajda. Could you please point me to instructions on how to update Presto documentation. I don't have prior experience with that. |
@sameer2086 this is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. I made another pass and left some more comments.
In general I have two comments:
- We need proper user documentation.
- The test coverage is low, please add more tests to increase it.
presto-hive/src/main/java/com/facebook/presto/hive/BackgroundHiveSplitLoader.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/BackgroundHiveSplitLoader.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/HiveUtil.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectLineRecordReader.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/com/facebook/presto/hive/S3SelectRecordCursor.java
Show resolved
Hide resolved
@nezihyigitbasi @sameer2086 i am afraid we cannot run integration tests on Travis (can we, somehow?) |
I attached user documentation to the PR earlier. @kbajda reviewed that. We are currently working on incorporating that doc in https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/connector/hive.rst. Requesting you to please review the same and let us know incase any updates are required.
|
Thanks for the feedback @findepi. It makes a lot of sense to me to have automated tests in place to ensure we don't regress on AWS related Presto features. I will pass the feedback to the right teams and will request them to follow-up. Having said that I believe this shouldn't be a blocker for this PR, right? |
Right, it's not a blocker (unfortunately 😄 ). |
@sameer2086 were you able to make any progress on tests infra (#11033 (comment))? |
@findepi I am not actively working on the tests infra project. Although, I did forward the feedback to the concerned team and they will review it in their sprint planning. I don't have a date for this though. |
c6b6c36
to
d18f384
Compare
Published a new revision of the code after incorporating cr feedback. @nezihyigitbasi, requesting another review. We are adding more tests in this Sprint and will update the diff soon with the requested unit tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did another pass and it's getting close. My comments are mostly minor stuff. Thanks a lot for working on this PR and addressing all the comments.
- You have an internal amazon link in the commit message detail.
- I ran
TestIonSqlQueryBuilder
locally andtestDecimalColumns
test fails.
presto-hive/src/test/java/com/facebook/presto/hive/TestIonSqlQueryBuilder.java
Outdated
Show resolved
Hide resolved
presto-hive/src/test/java/com/facebook/presto/hive/TestIonSqlQueryBuilder.java
Outdated
Show resolved
Hide resolved
presto-hive/src/test/java/com/facebook/presto/hive/TestIonSqlQueryBuilder.java
Show resolved
Hide resolved
@@ -341,6 +344,52 @@ the ``org.apache.hadoop.conf.Configurable`` interface from the Hadoop Java API, | |||
will be passed in after the object instance is created and before it is asked to provision or retrieve any | |||
encryption keys. | |||
|
|||
S3SelectPushdown |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/cc @electrum any comments for the doc (and the PR)?
d18f384
to
bd95074
Compare
@nezihyigitbasi, published a new revision for review. Requested unit tests need to be added. We are working on it, and will update the PR with them sometime next week. |
53d6576
to
8242a4e
Compare
@nezihyigitbasi Published a new revision with the requested unit tests. |
8242a4e
to
5f7935e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sameer2086 First, thanks for addressing all these comments. I made another pass and left some more comments. Currently this PR has lots of comments and github has a hard time loading it. I think we should create a new PR (and link to this one) and continue working on that.
- Commit message details are longer than 72 characters and look weird on github.
- Commit message includes a link to an amazon cr (cr https://code.amazon.com/reviews/CR-3642036
), please remove it. - I have left comments for the docs, but looks like we may need to clean it up a bit more.
############################################################### | ||
By default, Presto uses PrestoS3FileSystem or EMRFS as its file system, and the setting ``fs.s3.maxConnections`` configuration specifies | ||
the maximum allowable client connections to Amazon S3 through EMRFS. By default, this is 5000. | ||
S3 Select Pushdown bypasses the filesystems when accessing Amazon S3 for predicate operations. In this case, the value of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a predicate operation
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filter condition in the where clause of a query.
presto-hive/src/test/java/com/facebook/presto/hive/TestS3SelectRecordCursor.java
Outdated
Show resolved
Hide resolved
presto-hive/src/test/java/com/facebook/presto/hive/TestS3SelectRecordCursor.java
Outdated
Show resolved
Hide resolved
presto-hive/src/test/java/com/facebook/presto/hive/TestS3SelectRecordCursor.java
Outdated
Show resolved
Hide resolved
5f7935e
to
6a71a5f
Compare
This PR is slow to work with due to large volume of comments. Thus, closing this and creating a new PR as per @nezihyigitbasi request. New PR Link: #11970 |
Change-Id: I1498d5d7de5dc9f29f13e1239dbf57b8fc8d8bd8
This change will allow Presto users to improve the performance of their queries using S3SelectPushdown.
It pushes down projections and predicate evaluations to S3. As a result
Presto doesn't need to download full S3 objects and only data required to answer the user's query
is returned to Presto, thereby improving performance.
S3SelectPushdown Technical Document: S3SelectPushdown.pdf
PR UPDATE
Closed this PR as it was slow to work with due to large volume of comments. Created a new PR to continue the work #11970