[Proposal] Adding pre-aggregation support in presto #11777
Comments
@hellium01 Yi, would this require committing to an intermediate representation of a query? E.g. having a way to represent the query plan tree in a backwards-compatible way?
We probably need a way to represent a subset of the expression tree. Representing a query plan tree would be much more complicated (though it would be helpful if we want to address the problem in a more generic way: see section 4).
(Memorializing a quick point I brought up offline; the scope of the comment is technical in nature and not a comment on the overall proposal.) The output function above is incorrect--it should be
@tdcmeehan can you elaborate? To me, this looks like a middle part of some discussion. Can you add more context?
@findepi The observation is more related to syntax: if you look at the
For reference -- this has been discussed in the past at #4839
Closing this in favor of #12368
For many connectors, we can provide pre-aggregated data either on the fly or from a materialized table (from the engine's point of view, the two do not differ significantly). Being able to use such data can improve query performance considerably, especially when queries are likely to read the same set of data repeatedly. To support this, the engine needs to understand how connectors pre-calculate columns and how to rewrite a query to use the intermediate results.
1. Representing pre-calculated columns:
To allow the engine to understand how a connector pre-computes data, we need a form for representing an operation tree that is a subset of Presto expressions, i.e. we need to be able to represent:
So, we can return a map from ColumnHandle to an operation tree.
Which subset to support is still open to discussion, but as an example, we can represent function calls as FunctionCallOperation, columns as ColumnHandles, etc.
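As a rough illustration of the map from ColumnHandle to operation tree, here is a minimal Java sketch. Only the names FunctionCallOperation and ColumnHandle come from the proposal; the Operation interface, the ColumnReference leaf (a stand-in for a ColumnHandle), the use of plain strings as map keys, and the describePreCalculatedColumns method are assumptions made for the example, not existing Presto APIs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: none of these types are existing Presto classes.
final class OperationTreeSketch {
    /** A node in the restricted expression tree that a connector can report. */
    interface Operation {}

    /** Leaf node: a reference to a base column (stand-in for a ColumnHandle). */
    static final class ColumnReference implements Operation {
        final String columnName;

        ColumnReference(String columnName) {
            this.columnName = columnName;
        }

        @Override
        public String toString() {
            return columnName;
        }
    }

    /** Inner node: a named function applied to child operations. */
    static final class FunctionCallOperation implements Operation {
        final String functionName;
        final List<Operation> arguments;

        FunctionCallOperation(String functionName, Operation... arguments) {
            this.functionName = functionName;
            this.arguments = Collections.unmodifiableList(Arrays.asList(arguments));
        }

        @Override
        public String toString() {
            return functionName + arguments.stream()
                    .map(Object::toString)
                    .collect(Collectors.joining(", ", "(", ")"));
        }
    }

    /** What a connector might return: pre-calculated column name -> how it was computed. */
    static Map<String, Operation> describePreCalculatedColumns() {
        Map<String, Operation> columns = new LinkedHashMap<>();
        columns.put("preCalculatedColumn1",
                new FunctionCallOperation("approx_set", new ColumnReference("c2")));
        return columns;
    }

    public static void main(String[] args) {
        describePreCalculatedColumns().forEach(
                (column, operation) -> System.out.println(column + " -> " + operation));
    }
}
```

Keeping the tree to two node kinds is only one possible choice of subset; constants, casts, or filters would need further node types if the subset grows.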
2. Decomposing aggregation functions:
To rewrite aggregation functions so that they merge intermediate results, the engine should know how to decompose them. In Presto, most aggregation functions are commutative semigroup aggregations (order-insensitive), so we just need to extend our annotations to mark the dependency on sub-functions (similar to the way @FunctionDependency works):
Thus, if we have preCalculatedColumn1 -> approx_set(c2), we can transform approx_distinct(c2) into cardinality(merge(preCalculatedColumn1)).
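The rewrite described here can be sketched in Java as a lookup in a decomposition registry. The Decomposition class, the DECOMPOSITIONS registry, and the string-based expressions are hypothetical and chosen for brevity; in the proposal this information would come from annotations on the aggregation functions. The only decomposition registered is the one given above (approx_distinct as cardinality over merge over approx_set).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: the Decomposition type and the registry are illustrative, not Presto APIs.
final class DecompositionSketch {
    /** How an aggregation splits into an intermediate, a combine, and an output function. */
    static final class Decomposition {
        final String intermediate; // computed ahead of time by the connector
        final String combine;      // merges intermediate values
        final String output;       // turns the merged value into the final result

        Decomposition(String intermediate, String combine, String output) {
            this.intermediate = intermediate;
            this.combine = combine;
            this.output = output;
        }
    }

    static final Map<String, Decomposition> DECOMPOSITIONS = new HashMap<>();

    static {
        // The example from the proposal: approx_distinct(x) = cardinality(merge(approx_set(x))).
        DECOMPOSITIONS.put("approx_distinct", new Decomposition("approx_set", "merge", "cardinality"));
    }

    /**
     * Rewrites function(argument) over a pre-calculated column, if a matching one exists.
     * preCalculated maps an expression such as "approx_set(c2)" to the column that stores it.
     */
    static Optional<String> rewrite(String function, String argument, Map<String, String> preCalculated) {
        Decomposition decomposition = DECOMPOSITIONS.get(function);
        if (decomposition == null) {
            return Optional.empty(); // no known decomposition for this function
        }
        String column = preCalculated.get(decomposition.intermediate + "(" + argument + ")");
        if (column == null) {
            return Optional.empty(); // the connector did not pre-calculate the intermediate
        }
        return Optional.of(decomposition.output + "(" + decomposition.combine + "(" + column + "))");
    }

    public static void main(String[] args) {
        Map<String, String> preCalculated = new HashMap<>();
        preCalculated.put("approx_set(c2)", "preCalculatedColumn1");
        System.out.println(rewrite("approx_distinct", "c2", preCalculated).orElse("no rewrite"));
    }
}
```

Returning an empty Optional when either lookup fails lets the planner fall back to the original aggregation over the base column.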
3. Rewrite algorithm (a simple approach):
Under the current Presto architecture, the fastest way to support the rewrite is to use layouts. We can represent grouping keys and pre-calculated columns as localProperty (which we already do today for grouping keys). When we call getTableLayout, the connector can return a list of pre-calculated layouts with localProperties:
When we have a subquery that is an aggregation, we can scan through the layouts to determine if:
If we want to push the aggregation down to the connector, the connector might need to return the full combination of all possible grouping key sets and functions. In that case, we can pass information about the upstream subtree as a hint for what the connector should provide. On top of this, we can iterate on how to let connectors participate more in the planning process.
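The layout scan can be sketched in Java as follows. The exact conditions to check are not spelled out here, so the check below is an assumption: a layout is treated as usable when its grouping keys include every grouping key of the query (so the engine can roll up further by merging) and it provides every intermediate expression the rewritten aggregations need. The PreCalculatedLayout class and the canAnswer and choose methods are hypothetical, not part of the Presto SPI.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: the usability conditions below are assumed, not taken from the proposal.
final class LayoutMatchSketch {
    /** A pre-aggregated layout as a connector might describe it. */
    static final class PreCalculatedLayout {
        final String name;
        final Set<String> groupingKeys;
        final Set<String> preCalculated; // intermediate expressions, e.g. "approx_set(c2)"

        PreCalculatedLayout(String name, Set<String> groupingKeys, Set<String> preCalculated) {
            this.name = name;
            this.groupingKeys = groupingKeys;
            this.preCalculated = preCalculated;
        }
    }

    static Set<String> setOf(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    /** Assumed check: the layout is at least as fine-grained as the query and has all intermediates. */
    static boolean canAnswer(PreCalculatedLayout layout, Set<String> queryKeys, Set<String> requiredIntermediates) {
        return layout.groupingKeys.containsAll(queryKeys)
                && layout.preCalculated.containsAll(requiredIntermediates);
    }

    /** Scans the layouts in order and returns the first usable one, if any. */
    static Optional<PreCalculatedLayout> choose(
            List<PreCalculatedLayout> layouts, Set<String> queryKeys, Set<String> requiredIntermediates) {
        return layouts.stream()
                .filter(layout -> canAnswer(layout, queryKeys, requiredIntermediates))
                .findFirst();
    }

    public static void main(String[] args) {
        List<PreCalculatedLayout> layouts = Arrays.asList(
                new PreCalculatedLayout("by_c1", setOf("c1"), setOf("approx_set(c2)")),
                new PreCalculatedLayout("by_c1_c3", setOf("c1", "c3"), setOf("approx_set(c2)")));
        // Query shape: SELECT c3, approx_distinct(c2) ... GROUP BY c3
        System.out.println(choose(layouts, setOf("c3"), setOf("approx_set(c2)"))
                .map(layout -> layout.name)
                .orElse("no usable layout"));
    }
}
```

Taking the first match keeps the sketch short; a real planner would more likely compare the candidate layouts by cost.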
4. More generic handling of materialized view
To support materialized views more generically, we should consider different approaches. One approach might be:
5. References