Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize distinct aggregation on multi column #12024

Closed
kaka11chen opened this issue Dec 6, 2018 · 8 comments
Closed

Optimize distinct aggregation on multi column #12024

kaka11chen opened this issue Dec 6, 2018 · 8 comments
Labels

Comments

@kaka11chen
Copy link
Contributor

When query use distinct aggregation on multi columns.

select count(distinct ss_item_sk), count(distinct ss_store_sk) from tpcds_bin_partitioned_orc_1000.store_sales;
Result: It is very slow, cost 60 seconds in our perf-test env, regardless of use-mark-distinct.

If I change it to

select count(case when grouping_id=1 and ss_item_sk is not null then 1 else null end) as c0, count(case when grouping_id=2 and ss_store_sk is not null then 1 else null end) as c1 from (select grouping(ss_item_sk,ss_store_sk) AS grouping_id, ss_item_sk, ss_store_sk from tpcds_bin_partitioned_orc_1000.store_sales group by grouping sets (ss_item_sk, ss_store_sk))
Result: It only cost 20 seconds in our perf-test env.

I have read source code of presto, and found a rule optimization class SingleDistinctAggregationToGroupBy to handle distinct aggregation on single column case, but I didn't find the rule handle the case about multi columns.

There are similar things on Hive and Spark, such similar optimization has been implemented on these platforms.
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveExpandDistinctAggregatesRule.java
https://issues.apache.org/jira/browse/HIVE-10901
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

I am new comers of presto, Is it worth doing this optimization?

@findepi
Copy link
Contributor

findepi commented Dec 6, 2018

Did you try set session optimize_mixed_distinct_aggregations = true;?

@kaka11chen
Copy link
Contributor Author

@findepi It's still slow after set session optimize_mixed_distinct_aggregations = true;

@qqibrow
Copy link
Contributor

qqibrow commented Dec 6, 2018

It seems optimize_mixed_distinct_aggregations only works when count, count(distinct xxx) are in group by.

for example:

select count(distinct ss_item_sk), count(ss_item_sk) from tpcds_bin_partitioned_orc_1000.store_sales group by ss_store_sk

You can use approxmate_distinct in your first query if you can tolerant certain error rate. That will make the query plan architecture similar to your second query and therefore achieve similar performance.

@kaka11chen
Copy link
Contributor Author

@qqibrow thanks for your suggestion, but sometimes the business needs extract statistics.

@kaka11chen
Copy link
Contributor Author

@findepi It seems optimize_mixed_distinct_aggregations only works one distinct column by OptimizeMixedDistinctAggregations.java. What limitation about this only support one distinct column?

@martint
Copy link
Contributor

martint commented Dec 7, 2018

optimize_mixed_distinct_aggregations is not well supported. It's currently turned off because it produces incorrect results in certain scenarios (see #8894). We just haven't gotten around to fixing it and expanding support for more complex queries.

@kaka11chen
Copy link
Contributor Author

@martint Got it. Thanks, martint.

@stale
Copy link

stale bot commented Dec 6, 2020

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.

@stale stale bot added the stale label Dec 6, 2020
@stale stale bot closed this as completed Dec 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants