Welcome to the TiSpark wiki!
For now, our skeleton code bypasses part of the original data source API in order to push calculations aggressively down into TiKV's coprocessor, and we do not use the stable Filter and pruning APIs since they seem too narrow for the coprocessor.
For calculation pushdown, we directly match subplans against the pattern Aggregate -> Filter/Projection -> TiRelation (for now, Limit and Sort are not considered, at least during May, for simplicity).
We extract Filters/Projections and translate them as follows:
- AttributeReference -> ColumnReference for coprocessor results
- Filter and Projection expressions -> test whether each expression is supported by TiKV and translate it into a proto expression; for the remaining unsupported part, generate a supplementary plan on top of TiRelation
- Aggregate -> split into a partial aggregate and a Spark aggregate: the partial part is translated into the proto request, and the Spark part is prepended on top of the Filter/Projection (if any, otherwise directly on TiRelation); for any aggregate that cannot be pushed down, we fall all the way back to the no-aggregate-pushdown logic
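The supported/unsupported splitting described above can be sketched as follows. This is a minimal illustration in Python, not TiSpark's actual code (which is Scala against Catalyst plans); the class names, the `SUPPORTED_OPS` set, and the helper functions are all hypothetical.

```python
# Hypothetical sketch of splitting predicates into a pushable part
# (translated to coprocessor proto expressions) and a residual part
# (kept in a supplementary Spark plan on top of TiRelation).

SUPPORTED_OPS = {"=", "<", ">", "+", "-"}  # example set of ops the coprocessor accepts

class Expr:
    """A tiny stand-in for a Catalyst expression tree node."""
    def __init__(self, op, children=()):
        self.op = op
        self.children = list(children)

def supported(expr):
    """An expression is pushable only if its entire subtree is supported."""
    if not expr.children:  # columns and literals are always representable
        return True
    return expr.op in SUPPORTED_OPS and all(supported(c) for c in expr.children)

def split_filters(predicates):
    """Split predicates into (pushed to coprocessor, residual for Spark)."""
    pushed = [p for p in predicates if supported(p)]
    residual = [p for p in predicates if not supported(p)]
    return pushed, residual

# Example: `a = 1` is pushable; `regexp(b, 'x')` stays Spark-side.
pushed, residual = split_filters([
    Expr("=", [Expr("a"), Expr("1")]),
    Expr("regexp", [Expr("b"), Expr("'x'")]),
])
```

The whole-subtree check matters: a predicate whose top operator is supported but whose child is not cannot be pushed, since the coprocessor would have no way to evaluate the child.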
The Spark aggregate itself should still have two phases, as in vanilla Spark, since our data is not partitioned by any hash key and we still need a shuffle stage (except in the streaming-aggregate case, which we do not consider for now during May). But one thing to consider, which might get the optimizer involved, is type promotion: for example, we need to promote sum(int) into sum(long), since the coprocessor result is likely already promoted to avoid overflow. This is likely to be done in the optimizer phase, or we bypass the original planning and do it ourselves.
Remaining work:
- partition split
- coprocessor for plan: translate the Spark plan into a coprocessor request
- finish aggregates (almost done)
- finish filters
- implement key-to-range conversion for types besides long
- how to integrate with Spark's existing stats & CBO
- choose index based on stats
- sort and limit
- sort-merge join
- consume existing ordering if reading by primary key
- Write and create table?
- Test with streaming
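For the key-to-range conversion mentioned in the list above, a sketch of the long case shows what extending it to other types entails: each key type needs an encoding whose byte order matches the type's value order, so that predicate bounds become scan ranges. The encoding below (flipping the sign bit of a big-endian 64-bit value) is illustrative, not necessarily the exact on-wire format:

```python
# Illustrative key-to-range conversion for signed 64-bit keys.
# Flipping the sign bit makes unsigned byte comparison match signed
# numeric order, so a predicate like low <= k < high becomes a byte range.
import struct

SIGN = 1 << 63
MASK = 2**64 - 1

def encode_long(v: int) -> bytes:
    # big-endian, with the sign bit flipped for order-preserving comparison
    return struct.pack(">Q", (v ^ SIGN) & MASK)

def long_range(low: int, high: int) -> tuple:
    """Convert a [low, high) predicate on a long key into a byte range."""
    return encode_long(low), encode_long(high)

start, end = long_range(-5, 10)
assert start < end  # byte order agrees with numeric order across the sign
```

Supporting other key types (strings, decimals, composite keys) means designing a similar order-preserving byte encoding per type, which is the open item in the list.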