Support preferring exact partitioning #12495

arhimondr · 2022-05-20T15:02:23Z

By default, Trino partitions data on all columns available for partitioning. For a JOIN, it partitions on all columns that appear in the equality condition. For a GROUP BY, it partitions on all grouping keys.

However, because a partitioning (shuffle) operation is relatively expensive, Trino tries to avoid unnecessary repartitioning when possible.

For example, if data is already partitioned on column_a for some operation (such as a JOIN or a GROUP BY), Trino may decide not to repartition it for a subsequent operation whose partitioning keys include column_a along with other keys. Consider the query:

SELECT t1.column_a, t2.column_b, count(*)
FROM t1, t2
WHERE t1.column_a = t2.column_a
GROUP BY t1.column_a, t2.column_b

To join t1 with t2, Trino has to partition both tables on column_a. For the subsequent aggregation, Trino can either repartition on both column_a and column_b, or keep the existing partitioning on column_a and run the aggregation in place.
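The choice described above can be sketched as a small predicate. This is only an illustration of the reasoning, not Trino's planner code: the function name `needs_repartition` and the `exact` flag are hypothetical, and the check assumes the existing partitioning can be reused whenever its columns are a subset of the keys the next operation needs.

```python
def needs_repartition(existing_keys, required_keys, exact=False):
    """Return True if a new shuffle is needed before an operation on required_keys.

    Hypothetical helper: models the decision only, not Trino's optimizer.
    """
    existing = set(existing_keys)
    required = set(required_keys)
    if not existing:
        return True  # data is not partitioned at all
    if exact:
        # "Exact" preference: reuse only a partitioning on exactly the required keys.
        return existing != required
    # Rows that agree on all required keys also agree on any subset of them,
    # so a partitioning on a subset already co-locates every group.
    return not existing.issubset(required)


join_partitioning = ["column_a"]
group_by_keys = ["column_a", "column_b"]
print(needs_repartition(join_partitioning, group_by_keys))              # False
print(needs_repartition(join_partitioning, group_by_keys, exact=True))  # True
```

With the default behavior the aggregation runs in place on the column_a partitioning; with the exact preference an extra exchange on (column_a, column_b) is added.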

By default, Trino avoids the extra partitioning and runs the subsequent operation in place when possible. However, this introduces a risk of memory skew, because some values of column_a may have significantly more values of column_b than others.
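The skew can be illustrated with a small simulation. The data, the partition count of 8, and the CRC32-based hash are all assumptions made for the sketch; Trino's actual hashing and memory accounting differ. The point is only that partitioning on column_a alone puts every group of a "hot" column_a value on one partition.

```python
import zlib


def partition_of(key, n_partitions):
    # Deterministic stand-in hash (Python's built-in str hash is salted per process).
    return zlib.crc32(repr(key).encode()) % n_partitions


def groups_per_partition(rows, key_fn, n_partitions):
    """Count the distinct (column_a, column_b) groups held by each partition."""
    groups = {p: set() for p in range(n_partitions)}
    for a, b in rows:
        groups[partition_of(key_fn(a, b), n_partitions)].add((a, b))
    return [len(groups[p]) for p in range(n_partitions)]


# Hypothetical data: one "hot" column_a value with many column_b values,
# plus 99 ordinary column_a values with a single column_b value each.
rows = [("hot", b) for b in range(10_000)] + [(f"a{i}", 0) for i in range(99)]

by_a = groups_per_partition(rows, lambda a, b: a, 8)
by_a_b = groups_per_partition(rows, lambda a, b: (a, b), 8)
print("partitioned on column_a:          ", by_a)
print("partitioned on column_a, column_b:", by_a_b)
```

When partitioned on column_a only, a single partition has to hold the aggregation state for all 10,000 groups of the hot value; when partitioned on both keys, those groups are spread across partitions.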

In the future, Trino should be able to detect such conditions and apply runtime optimizations to fix the problem automatically.

Today, in the absence of such an optimization, it would be useful to let users instruct Trino to prefer adding an extra exchange to avoid potential skew.

An alternative would be to suggest that users add an identity projection to trick the optimizer:

SELECT t1.column_a + 0, t2.column_b, count(*)
FROM t1, t2
WHERE t1.column_a = t2.column_a
GROUP BY t1.column_a + 0, t2.column_b

However, this approach is fragile: it may silently stop working once the optimizer becomes better at detecting identity projections.

arhimondr · 2022-05-20T15:02:33Z

Here's an example PR from Presto: prestodb/presto#13354

arhimondr assigned linzebing May 24, 2022

linzebing mentioned this issue Jun 24, 2022

Support use-exact-partitioning #12967

Merged

losipiuk closed this as completed in #12967 Jun 27, 2022
