Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add default value of Glue GetPartitions MaxResults #3024

Merged
merged 1 commit into from
Mar 11, 2020
Merged

Add default value of Glue GetPartitions MaxResults #3024

merged 1 commit into from
Mar 11, 2020

Conversation

l1x
Copy link
Member

@l1x l1x commented Mar 6, 2020

Fixes #2996

As per our discussion with @findepi and @ebyhr here is the fix for using batched requests with AWS Glue.

@l1x
Copy link
Member Author

l1x commented Mar 6, 2020

Discussed in: #2996

@cla-bot cla-bot bot added the cla-signed label Mar 6, 2020
@l1x l1x requested a review from findepi March 6, 2020 21:06
@trinodb trinodb deleted a comment from cla-bot bot Mar 7, 2020
@l1x l1x requested a review from ebyhr March 7, 2020 20:41
@l1x l1x requested a review from sopel39 March 9, 2020 18:47
@l1x l1x removed request for findepi and sopel39 March 11, 2020 09:39
@sopel39 sopel39 merged commit 4f22b0e into trinodb:master Mar 11, 2020
@sopel39 sopel39 mentioned this pull request Mar 11, 2020
6 tasks
@l1x l1x deleted the adding-glue-metastore-get-partitions-max-results branch March 11, 2020 11:30
@hashhar
Copy link
Member

hashhar commented May 2, 2020

Thanks a lot @l1x. We are personally hit hard by this issue since we want to move to Presto from Athena but the planning performance was abysmal for tables with significant number of partitions (>10k). Just wanted to let you know I appreciate this.

PS: I came here when I myself was digging into this issue and came to the same conclusion that you did.

@hashhar
Copy link
Member

hashhar commented May 2, 2020

AWS doesn't recommend any value but in practice from a small script I made to test I can see that there are no downsides to setting this to the max of 1000. The total time with a smaller value like 128 vs the max value is appreciable (3x slower for fetching almost 8k partitions). You can simulate this in a very crude way using the aws-cli command.

aws glue get-partitions --database-name db --table-name table --expression "dt>='2020-02-01'" --page-size 128
aws glue get-partitions --database-name db --table-name table --expression "dt>='2020-02-01'" --page-size 1000

The aws cli internally implements the nextToken loop so that all results are returned.

@l1x
Copy link
Member Author

l1x commented Jun 9, 2020

@hashhar I am glad I could help! Let me know how it goes. We could bump the number even higher or make it configurable.

@hashhar
Copy link
Member

hashhar commented Jun 9, 2020

@l1x We are now running PrestoSQL on production and it haven't seen any practical issues with the smaller page size for queries scanning upto 50k partitions. Will keep an eye out.

aweisberg pushed a commit to v-jizhang/presto that referenced this pull request May 19, 2021
Cherry pick of trinodb/trino#3024 and
trinodb/trino#4938

Co-authored-by: Istvan <istvan@lambdainsight.com>
Co-authored-by: Ashhar Hasan <hashhar_dev@outlook.com>
pettyjamesm pushed a commit to prestodb/presto that referenced this pull request May 19, 2021
Cherry pick of trinodb/trino#3024 and
trinodb/trino#4938

Co-authored-by: Istvan <istvan@lambdainsight.com>
Co-authored-by: Ashhar Hasan <hashhar_dev@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Add support for MaxResults on Glue Hive Metastore
4 participants