Optimize distinct aggregation on multi column #613

kaka11chen · 2019-04-10T09:55:38Z

Reference prestodb/presto#12024

When a query uses distinct aggregations on multiple columns, for example:

select count(distinct ss_item_sk), count(distinct ss_store_sk) from tpcds_bin_partitioned_orc_1000.store_sales;
Result: It is very slow, taking 60 seconds in our perf-test environment, regardless of the use-mark-distinct setting.

If I change it to

select
  count(case when grouping_id = 1 and ss_item_sk is not null then 1 else null end) as c0,
  count(case when grouping_id = 2 and ss_store_sk is not null then 1 else null end) as c1
from (
  select grouping(ss_item_sk, ss_store_sk) as grouping_id, ss_item_sk, ss_store_sk
  from tpcds_bin_partitioned_orc_1000.store_sales
  group by grouping sets (ss_item_sk, ss_store_sk)
)
Result: It takes only 20 seconds in our perf-test environment.
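To illustrate why the two queries should return the same counts, here is a minimal Python sketch that emulates both forms on a handful of made-up rows. It is not Presto code: the sample data and the function names `count_distinct_direct` and `count_distinct_grouping_sets` are assumptions for illustration only, and `None` stands in for SQL NULL.

```python
# Hypothetical sample rows of (ss_item_sk, ss_store_sk); None stands in for SQL NULL.
rows = [(1, 10), (1, 20), (2, 10), (None, 20), (3, None), (2, 10)]


def count_distinct_direct(rows):
    """Emulate count(distinct col) for each column; NULLs are not counted."""
    items = {item for item, _ in rows if item is not None}
    stores = {store for _, store in rows if store is not None}
    return len(items), len(stores)


def count_distinct_grouping_sets(rows):
    """Emulate GROUP BY GROUPING SETS (ss_item_sk, ss_store_sk) plus the outer counts."""
    # Inner query: one output row per distinct key within each grouping set.
    # grouping_id 1 = grouped by ss_item_sk (ss_store_sk is aggregated away),
    # grouping_id 2 = grouped by ss_store_sk (ss_item_sk is aggregated away).
    grouped = set()
    for item, store in rows:
        grouped.add((1, item, None))
        grouped.add((2, None, store))
    # Outer query: count the rows of each grouping set whose key is not NULL.
    c0 = sum(1 for gid, item, _ in grouped if gid == 1 and item is not None)
    c1 = sum(1 for gid, _, store in grouped if gid == 2 and store is not None)
    return c0, c1


print(count_distinct_direct(rows), count_distinct_grouping_sets(rows))
# prints: (3, 2) (3, 2)
```

The sketch only checks that the results agree; it says nothing about why the rewritten plan is faster in the engine.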

I have read the Presto source code and found an optimizer rule class, SingleDistinctAggregationToGroupBy, that handles distinct aggregation on a single column, but I did not find a rule that handles the multi-column case.

Hive and Spark have already implemented a similar optimization:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveExpandDistinctAggregatesRule.java
https://issues.apache.org/jira/browse/HIVE-10901
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

findepi · 2019-04-10T12:53:33Z

@kaka11chen, thanks for creating the issue!
For the record, I understand you're going to migrate your previous PR in this area (prestodb/presto#12183) to prestosql.

amoghmargoor · 2023-05-22T11:15:56Z

Hi @kaka11chen, do you plan to work on this PR (#624) in the near future?

kaka11chen mentioned this issue Apr 11, 2019

Optimize distinct aggregation on multiple columns #624

Closed

amoghmargoor self-assigned this Jun 6, 2023

lukasz-stec mentioned this issue Sep 19, 2023

Using more than one Count function in a group by query on Iceberg connector degrades performance #19072

Closed
