Optimize distinct aggregation on multi column #12024

kaka11chen · 2018-12-06T11:58:49Z

When query use distinct aggregation on multi columns.

select count(distinct ss_item_sk), count(distinct ss_store_sk) from tpcds_bin_partitioned_orc_1000.store_sales;
Result: It is very slow, cost 60 seconds in our perf-test env, regardless of use-mark-distinct.

If I change it to

select count(case when grouping_id=1 and ss_item_sk is not null then 1 else null end) as c0, count(case when grouping_id=2 and ss_store_sk is not null then 1 else null end) as c1 from (select grouping(ss_item_sk,ss_store_sk) AS grouping_id, ss_item_sk, ss_store_sk from tpcds_bin_partitioned_orc_1000.store_sales group by grouping sets (ss_item_sk, ss_store_sk))
Result: It only cost 20 seconds in our perf-test env.

I have read source code of presto, and found a rule optimization class SingleDistinctAggregationToGroupBy to handle distinct aggregation on single column case, but I didn't find the rule handle the case about multi columns.

There are similar things on Hive and Spark, such similar optimization has been implemented on these platforms.
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveExpandDistinctAggregatesRule.java
https://issues.apache.org/jira/browse/HIVE-10901
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

I am new comers of presto, Is it worth doing this optimization？

The text was updated successfully, but these errors were encountered:

findepi · 2018-12-06T13:55:25Z

Did you try set session optimize_mixed_distinct_aggregations = true;?

kaka11chen · 2018-12-06T15:26:16Z

@findepi It's still slow after set session optimize_mixed_distinct_aggregations = true;

qqibrow · 2018-12-06T18:56:05Z

It seems optimize_mixed_distinct_aggregations only works when count, count(distinct xxx) are in group by.

for example:

select count(distinct ss_item_sk), count(ss_item_sk) from tpcds_bin_partitioned_orc_1000.store_sales group by ss_store_sk

You can use approxmate_distinct in your first query if you can tolerant certain error rate. That will make the query plan architecture similar to your second query and therefore achieve similar performance.

kaka11chen · 2018-12-07T01:48:16Z

@qqibrow thanks for your suggestion, but sometimes the business needs extract statistics.

kaka11chen · 2018-12-07T06:30:50Z

@findepi It seems optimize_mixed_distinct_aggregations only works one distinct column by OptimizeMixedDistinctAggregations.java. What limitation about this only support one distinct column?

martint · 2018-12-07T07:37:37Z

optimize_mixed_distinct_aggregations is not well supported. It's currently turned off because it produces incorrect results in certain scenarios (see #8894). We just haven't gotten around to fixing it and expanding support for more complex queries.

kaka11chen · 2018-12-07T09:56:42Z

@martint Got it. Thanks, martint.

stale · 2020-12-06T23:56:12Z

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.

kaka11chen mentioned this issue Jan 7, 2019

Optimize distinct aggregation on multi columns. #12183

Closed

kaka11chen mentioned this issue Apr 10, 2019

Optimize distinct aggregation on multi column trinodb/trino#613

Open

stale bot added the stale label Dec 6, 2020

stale bot closed this as completed Dec 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize distinct aggregation on multi column #12024

Optimize distinct aggregation on multi column #12024

kaka11chen commented Dec 6, 2018

findepi commented Dec 6, 2018

kaka11chen commented Dec 6, 2018

qqibrow commented Dec 6, 2018

kaka11chen commented Dec 7, 2018

kaka11chen commented Dec 7, 2018

martint commented Dec 7, 2018

kaka11chen commented Dec 7, 2018

stale bot commented Dec 6, 2020

Optimize distinct aggregation on multi column #12024

Optimize distinct aggregation on multi column #12024

Comments

kaka11chen commented Dec 6, 2018

findepi commented Dec 6, 2018

kaka11chen commented Dec 6, 2018

qqibrow commented Dec 6, 2018

kaka11chen commented Dec 7, 2018

kaka11chen commented Dec 7, 2018

martint commented Dec 7, 2018

kaka11chen commented Dec 7, 2018

stale bot commented Dec 6, 2020