[Proposal] Adding pre-aggregation support in presto #11777

hellium01 · 2018-10-24T18:35:11Z

For many connectors, it is possible for us to provide pre-aggregated data either on the fly or from materialized table (these two don't differ significantly in engine's point of view). Be able to utilize that can help query performance a lot, especially if queries very likely will query same set of data repeatedly. To support that, we will need to allow engine to understand how the connectors pre-calculate columns, and how to rewrite the query to utilize intermediate results.

1. Representing pre-calculated columns:
To allow engine understand how connector pre-compute data, we should have a form to represent an operation tree which is a subset of presto expressions, i.e. we need to be able to represnt:

preCalculatedColumn1 =  approx_distinct(rawColumn1)

So, we can return a map from ColumnHandle to operation tree.

ColumnHandle("preCalculatedColumn1") -> FunctionCallOperation(SignatureFor("approx_distinct"), list.of(ColumnHandle("rawColumn1")))

Which subset is still open to discussion but as an example, we can represent function calls as FunctionCallOperation, columns as ColumnHandles ... etc.

2. Decomposing aggregation functions:
To rewrite aggregation functions to merge intermediate results, engine should now how to decompose functions. In presto, most of our aggregation functions are commutative semi-group aggregations (order insensitive), we just need to expand our annotations to mark the dependency in sub-functions (like the way @FunctionDependency works):

@AggregationFunction("approx_distinct")
class ApproximateCountDistinctAggregation
{
  @InputFunction(stateBuilder=ApproximateSetAggregation.class)
  @CombineFunction(stateMerger=MergeHyperLogLogAggregation.class)
  @OutputFunction(stateOuputer=MapCardinalityFunction.class)
}

thus, if we have preCalculatedColumn1 -> approx_set(c2), we can transform approx_distinct(c2) into cardinality(merge(preCalcualatedColumn1))

3. Rewrite algorithm (a simple approach):
Under current presto architecture, the fastest way to support rewrite is using layouts. We can represent grouping keys and pre-calcuated columns as localProperty(which we already do today for grouping keys). When we call getTableLayout, connector can return a list of pre-calculated layouts with localProperties:

Layout 1:
grouping keys: [c1]
precalculated columns: [ c3 -> functionCall(...), c4 -> ...., ...]
Layout 2:
grouping keys: [c1, c2]
precalculated columns: [ c3 -> functionCall(...), c4 -> ...., ...]
...
Layout N:
raw data table

When we have a sub query that is aggregation, we can scan through the layouts to determine if:

grouping keys of aggregate + un-pushed down predicate covers the grouping key provided by the layout
the desired columns are decomposable to simple expression of pre-calculated columns.

If we want to push down aggregation to connector, connector might need to return full combination of all possible grouping key sets/functions. In such case, we can provide information about upstream subtree to give hint to connector what to provide. On top of this, we can iterate on how we want to allow connectors participate more in the planning process.

4. More generic handling of materialized view
To support materialized view more generically, we should consider different approaches. One approach might be:

For each subtree, we will check if connector can provide candidate(s) to cover this subtree.
If there is candidate, we will attempt to rewrite the tree to utilize the candidate.
The simplest scenario is that we will rewrite the leaf node to tableScan.

5. References

Rewriting Queries with Arbitrary Aggregation Functions with Views. Sarah Cohen 2006.(http://leibniz.cs.huji.ac.il/tr/999.CS.pdf)
On the content of materialized aggregate views. Grumbach 2003. (https://www.sciencedirect.com/science/article/pii/S0022000002000338)
Rewriting Aggregate Queries with Description Logic. David Dehaan 2003. (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.3.6448&rep=rep1&type=pdf)

The text was updated successfully, but these errors were encountered:

hellium01 · 2018-10-24T18:36:26Z

cc @electrum , @martint , @tdcmeehan , @jessesleeping, @charlesmccarthyirl, @joeg

mbasmanova · 2018-11-13T23:01:04Z

@hellium01 Yi, would this require committing to an intermediate representation of a query? E.g. have a way to represent the query plan tree in a backwards compatible way?

hellium01 · 2018-11-19T21:32:47Z

We probably need a way to represent a subset of expression tree. To represent a query plan tree will be much more complicated (though will be helpful if we want address the problem in a more generic way: session 4).

tdcmeehan · 2018-11-21T20:00:01Z

(Memorializing a quick point I brought up offline; the scope of the comment is technical in nature and not a comment on the overall proposal.) The output function above is incorrect--it should be HyperLogLogFunctions#cardinality. But this brings up the question of how to represent the scalar function for the output function, most of them are represented as simple static functions and not classes that are easily referenced from annotations. Perhaps we could simply refer to the string function name, and in case it's overloaded do something clever like infer the expected type from the aggregation function to determine which overload to use. I suppose for the scope of this issue, we don't worry about the order of arguments to the scalar output function, as they are all simple single-input functions.

findepi · 2018-11-21T20:42:20Z

@tdcmeehan can you elaborate? To me, this looks like a middle part of some discussion. Can you add more context?
Does this refer to @mbasmanova 's question about query IR?

tdcmeehan · 2018-11-21T20:51:18Z

@findepi The observation is more related to syntax: if you look at the stateOuputer in the @OutputFunction annotation above, it refers to MapCardinalityFunction, which gives you the cardinality for a map. It actually should refer to HyperLogLogFunctions#cardinality, which gives you the cardinality for a HyperLogLog. But that begs the question of how one represents that, since you can't refer to the method name in an annotation in Java, at least in a type-safe way, just the class name.

findepi · 2018-11-23T14:44:52Z

For reference -- this has been discussed in the past at #4839

hellium01 · 2019-02-22T18:43:44Z

Close this in favor or #12368

pgagnon mentioned this issue Feb 7, 2019

Allow connectors to participate in query optimization trinodb/trino#18

Open

26 tasks

hellium01 closed this as completed Feb 22, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Proposal] Adding pre-aggregation support in presto #11777

[Proposal] Adding pre-aggregation support in presto #11777

hellium01 commented Oct 24, 2018 •

edited

Loading

hellium01 commented Oct 24, 2018 •

edited

Loading

mbasmanova commented Nov 13, 2018

hellium01 commented Nov 19, 2018

tdcmeehan commented Nov 21, 2018

findepi commented Nov 21, 2018

tdcmeehan commented Nov 21, 2018

findepi commented Nov 23, 2018

hellium01 commented Feb 22, 2019 •

edited

Loading

[Proposal] Adding pre-aggregation support in presto #11777

[Proposal] Adding pre-aggregation support in presto #11777

Comments

hellium01 commented Oct 24, 2018 • edited Loading

hellium01 commented Oct 24, 2018 • edited Loading

mbasmanova commented Nov 13, 2018

hellium01 commented Nov 19, 2018

tdcmeehan commented Nov 21, 2018

findepi commented Nov 21, 2018

tdcmeehan commented Nov 21, 2018

findepi commented Nov 23, 2018

hellium01 commented Feb 22, 2019 • edited Loading

hellium01 commented Oct 24, 2018 •

edited

Loading

hellium01 commented Oct 24, 2018 •

edited

Loading

hellium01 commented Feb 22, 2019 •

edited

Loading