Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] Adding pre-aggregation support in presto #11777

Closed
hellium01 opened this issue Oct 24, 2018 · 8 comments
Closed

[Proposal] Adding pre-aggregation support in presto #11777

hellium01 opened this issue Oct 24, 2018 · 8 comments

Comments

@hellium01
Copy link
Contributor

hellium01 commented Oct 24, 2018

For many connectors, it is possible for us to provide pre-aggregated data either on the fly or from materialized table (these two don't differ significantly in engine's point of view). Be able to utilize that can help query performance a lot, especially if queries very likely will query same set of data repeatedly. To support that, we will need to allow engine to understand how the connectors pre-calculate columns, and how to rewrite the query to utilize intermediate results.

1. Representing pre-calculated columns:
To allow engine understand how connector pre-compute data, we should have a form to represent an operation tree which is a subset of presto expressions, i.e. we need to be able to represnt:

preCalculatedColumn1 =  approx_distinct(rawColumn1)

So, we can return a map from ColumnHandle to operation tree.

ColumnHandle("preCalculatedColumn1") -> FunctionCallOperation(SignatureFor("approx_distinct"), list.of(ColumnHandle("rawColumn1")))

Which subset is still open to discussion but as an example, we can represent function calls as FunctionCallOperation, columns as ColumnHandles ... etc.

2. Decomposing aggregation functions:
To rewrite aggregation functions to merge intermediate results, engine should now how to decompose functions. In presto, most of our aggregation functions are commutative semi-group aggregations (order insensitive), we just need to expand our annotations to mark the dependency in sub-functions (like the way @FunctionDependency works):

@AggregationFunction("approx_distinct")
class ApproximateCountDistinctAggregation
{
  @InputFunction(stateBuilder=ApproximateSetAggregation.class)
  @CombineFunction(stateMerger=MergeHyperLogLogAggregation.class)
  @OutputFunction(stateOuputer=MapCardinalityFunction.class)
}

thus, if we have preCalculatedColumn1 -> approx_set(c2), we can transform approx_distinct(c2) into cardinality(merge(preCalcualatedColumn1))

3. Rewrite algorithm (a simple approach):
Under current presto architecture, the fastest way to support rewrite is using layouts. We can represent grouping keys and pre-calcuated columns as localProperty(which we already do today for grouping keys). When we call getTableLayout, connector can return a list of pre-calculated layouts with localProperties:

Layout 1:
grouping keys: [c1]
precalculated columns: [ c3 -> functionCall(...), c4 -> ...., ...]
Layout 2:
grouping keys: [c1, c2]
precalculated columns: [ c3 -> functionCall(...), c4 -> ...., ...]
...
Layout N:
raw data table 

When we have a sub query that is aggregation, we can scan through the layouts to determine if:

  1. grouping keys of aggregate + un-pushed down predicate covers the grouping key provided by the layout
  2. the desired columns are decomposable to simple expression of pre-calculated columns.

If we want to push down aggregation to connector, connector might need to return full combination of all possible grouping key sets/functions. In such case, we can provide information about upstream subtree to give hint to connector what to provide. On top of this, we can iterate on how we want to allow connectors participate more in the planning process.

4. More generic handling of materialized view
To support materialized view more generically, we should consider different approaches. One approach might be:

  1. For each subtree, we will check if connector can provide candidate(s) to cover this subtree.
  2. If there is candidate, we will attempt to rewrite the tree to utilize the candidate.
  3. The simplest scenario is that we will rewrite the leaf node to tableScan.

5. References

  1. Rewriting Queries with Arbitrary Aggregation Functions with Views. Sarah Cohen 2006.(http://leibniz.cs.huji.ac.il/tr/999.CS.pdf)
  2. On the content of materialized aggregate views. Grumbach 2003. (https://www.sciencedirect.com/science/article/pii/S0022000002000338)
  3. Rewriting Aggregate Queries with Description Logic. David Dehaan 2003. (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.3.6448&rep=rep1&type=pdf)
@hellium01
Copy link
Contributor Author

hellium01 commented Oct 24, 2018

@mbasmanova
Copy link
Contributor

@hellium01 Yi, would this require committing to an intermediate representation of a query? E.g. have a way to represent the query plan tree in a backwards compatible way?

@hellium01
Copy link
Contributor Author

We probably need a way to represent a subset of expression tree. To represent a query plan tree will be much more complicated (though will be helpful if we want address the problem in a more generic way: session 4).

@tdcmeehan
Copy link
Contributor

(Memorializing a quick point I brought up offline; the scope of the comment is technical in nature and not a comment on the overall proposal.) The output function above is incorrect--it should be HyperLogLogFunctions#cardinality. But this brings up the question of how to represent the scalar function for the output function, most of them are represented as simple static functions and not classes that are easily referenced from annotations. Perhaps we could simply refer to the string function name, and in case it's overloaded do something clever like infer the expected type from the aggregation function to determine which overload to use. I suppose for the scope of this issue, we don't worry about the order of arguments to the scalar output function, as they are all simple single-input functions.

@findepi
Copy link
Contributor

findepi commented Nov 21, 2018

@tdcmeehan can you elaborate? To me, this looks like a middle part of some discussion. Can you add more context?
Does this refer to @mbasmanova 's question about query IR?

@tdcmeehan
Copy link
Contributor

@findepi The observation is more related to syntax: if you look at the stateOuputer in the @OutputFunction annotation above, it refers to MapCardinalityFunction, which gives you the cardinality for a map. It actually should refer to HyperLogLogFunctions#cardinality, which gives you the cardinality for a HyperLogLog. But that begs the question of how one represents that, since you can't refer to the method name in an annotation in Java, at least in a type-safe way, just the class name.

@findepi
Copy link
Contributor

findepi commented Nov 23, 2018

For reference -- this has been discussed in the past at #4839

@hellium01
Copy link
Contributor Author

hellium01 commented Feb 22, 2019

Close this in favor or #12368

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants