executor: introduce a new execution framework for aggregate functions #6852

Merged
merged 24 commits into from Jun 29, 2018

@zz-jason
Member

commented Jun 17, 2018

What have you changed? (mandatory)

Introduce a new interface named AggFunc defined in executor/aggfuncs/aggfuncs.go to refactor the execution framework of aggregate functions. The main usage of the new execution framework is:

  1. use AllocPartialResult() to allocate the struct to store the partial result for every aggregate function
  2. use UpdatePartialResult() to update the partial result for every aggregate function, no matter whether the input is the original or partial data. The input partialBytes will be converted to the specific partial result struct before the update.
  3. use ResetPartialResult() to reset or reinitialize the partial result for every aggregate function. The input partialBytes will be converted to the specific partial result struct before reinitialization.
  4. use AppendFinalResult2Chunk() to finalize the partial result and append it to the input chk. The input partialBytes will be converted to the specific partial result struct before each group is finalized.
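
The four steps above can be sketched as follows. This is a minimal, self-contained illustration rather than the code in executor/aggfuncs/aggfuncs.go: Row, Chunk, avgFloat64, partialResult4Avg and aggregateGroups are hypothetical stand-ins, the sessionctx.Context argument is dropped, and PartialResult is modelled as an empty interface.

```go
package main

import "fmt"

// Row and Chunk are simplified stand-ins for chunk.Row and chunk.Chunk.
type Row struct{ Val float64 }

type Chunk struct{ Col []float64 }

// PartialResult is an opaque handle; each function converts it to its own struct.
type PartialResult interface{}

// AggFunc mirrors the four steps described above (signatures simplified).
type AggFunc interface {
	AllocPartialResult() PartialResult
	ResetPartialResult(pr PartialResult)
	UpdatePartialResult(rowsInGroup []Row, pr PartialResult) error
	AppendFinalResult2Chunk(pr PartialResult, chk *Chunk) error
}

// partialResult4Avg is the specific partial result struct for avgFloat64.
type partialResult4Avg struct {
	sum   float64
	count int64
}

// avgFloat64 is a hypothetical AVG implementation; it holds no per-group state.
type avgFloat64 struct{}

func (avgFloat64) AllocPartialResult() PartialResult { return &partialResult4Avg{} }

func (avgFloat64) ResetPartialResult(pr PartialResult) {
	p := pr.(*partialResult4Avg)
	p.sum, p.count = 0, 0
}

func (avgFloat64) UpdatePartialResult(rowsInGroup []Row, pr PartialResult) error {
	p := pr.(*partialResult4Avg)
	for _, r := range rowsInGroup {
		p.sum += r.Val
		p.count++
	}
	return nil
}

func (avgFloat64) AppendFinalResult2Chunk(pr PartialResult, chk *Chunk) error {
	p := pr.(*partialResult4Avg)
	if p.count == 0 {
		// Simplification: a real AVG would append NULL for an empty group.
		return fmt.Errorf("empty group")
	}
	chk.Col = append(chk.Col, p.sum/float64(p.count))
	return nil
}

// aggregateGroups runs one AggFunc over consecutive groups, reusing a single
// partial result: update, finalize into the chunk, then reset for the next group.
func aggregateGroups(f AggFunc, groups [][]Row) (*Chunk, error) {
	chk := &Chunk{}
	pr := f.AllocPartialResult()
	for _, g := range groups {
		if err := f.UpdatePartialResult(g, pr); err != nil {
			return nil, err
		}
		if err := f.AppendFinalResult2Chunk(pr, chk); err != nil {
			return nil, err
		}
		f.ResetPartialResult(pr)
	}
	return chk, nil
}

func main() {
	groups := [][]Row{{{1}, {2}, {3}}, {{10}, {20}}}
	chk, err := aggregateGroups(avgFloat64{}, groups)
	if err != nil {
		panic(err)
	}
	fmt.Println(chk.Col) // prints [2 15]
}
```

The executor owns the partial result and passes it into every call, so the aggregate function itself stays free of per-group state.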

The main improvements are:

  1. by calling UpdatePartialResult() with []chunk.Row, we can reduce the total number of function calls, which saves a lot of time. For stream aggregate, the input data for an aggregate function is stored sequentially in the input []chunk.Row, which can further improve CPU cache performance.
  2. by calling AllocPartialResult() to allocate the specific struct to store the partial result for every aggregate function, we can reduce the redundant memory usage in the old struct AggEvaluateContext.
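
To make the first improvement concrete, the sketch below contrasts one interface call per row with one interface call per batch of rows. It is only an illustration under simplified assumptions: Row, sumState, rowAgg, batchAgg and countingSum are hypothetical names, and the call counter exists purely to show how many dynamic dispatches each style needs.

```go
package main

import "fmt"

// Row is a simplified stand-in for chunk.Row.
type Row struct{ Val int64 }

// sumState is the partial result of a hypothetical SUM function.
type sumState struct{ sum int64 }

// rowAgg models the old style: one interface call per input row.
type rowAgg interface {
	Update(st *sumState, row Row)
}

// batchAgg models the new style: one interface call per batch of rows.
type batchAgg interface {
	UpdatePartialResult(st *sumState, rows []Row)
}

// countingSum implements both styles and counts how often it is invoked.
type countingSum struct{ calls int }

func (c *countingSum) Update(st *sumState, row Row) {
	c.calls++
	st.sum += row.Val
}

func (c *countingSum) UpdatePartialResult(st *sumState, rows []Row) {
	c.calls++
	for _, r := range rows { // tight loop over sequentially stored rows
		st.sum += r.Val
	}
}

// sumPerRow dispatches through the interface once for every row.
func sumPerRow(rows []Row) (sum int64, calls int) {
	f := &countingSum{}
	var agg rowAgg = f
	st := &sumState{}
	for _, r := range rows {
		agg.Update(st, r)
	}
	return st.sum, f.calls
}

// sumBatched dispatches once per batch; batchSize must be positive.
func sumBatched(rows []Row, batchSize int) (sum int64, calls int) {
	f := &countingSum{}
	var agg batchAgg = f
	st := &sumState{}
	for start := 0; start < len(rows); start += batchSize {
		end := start + batchSize
		if end > len(rows) {
			end = len(rows)
		}
		agg.UpdatePartialResult(st, rows[start:end])
	}
	return st.sum, f.calls
}

func main() {
	rows := make([]Row, 1000)
	for i := range rows {
		rows[i].Val = int64(i)
	}
	s1, c1 := sumPerRow(rows)
	s2, c2 := sumBatched(rows, 256)
	fmt.Println(s1, c1) // same sum, 1000 calls
	fmt.Println(s2, c2) // same sum, 4 calls
}
```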

Use aggfuncs.Build to create an AggFunc according to the AggFuncDesc. For now:

  1. only some implementations of AVG are supported
  2. the new execution framework is only used in the StreamAggExec, and only when possible
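
A rough sketch of how such a builder with a fallback could look is given below. It assumes that Build returns nil for any aggregate function it does not support yet, so that the executor keeps using the old framework in that case. The types here (AggFuncDesc, AggMode, ArgType, buildAll) are simplified, hypothetical stand-ins, and the output argument of the real aggfuncs.Build is omitted.

```go
package main

import "fmt"

// AggMode and ArgType are simplified stand-ins for the real mode and type enums.
type AggMode int

const (
	CompleteMode AggMode = iota
	Partial1Mode
	FinalMode
)

type ArgType int

const (
	TypeDecimal ArgType = iota
	TypeFloat64
)

// AggFuncDesc is a simplified stand-in for aggregation.AggFuncDesc.
type AggFuncDesc struct {
	Name    string
	Mode    AggMode
	ArgType ArgType
}

// AggFunc is reduced to a single method for this sketch.
type AggFunc interface{ Name() string }

type avgOriginal4Decimal struct{}

func (avgOriginal4Decimal) Name() string { return "avgOriginal4Decimal" }

// Build returns nil when no new-style implementation exists (an assumption).
func Build(desc AggFuncDesc) AggFunc {
	switch desc.Name {
	case "avg":
		return buildAvg(desc)
	}
	return nil
}

// buildAvg picks an implementation by mode first, then by argument type.
func buildAvg(desc AggFuncDesc) AggFunc {
	switch desc.Mode {
	case CompleteMode, Partial1Mode:
		switch desc.ArgType {
		case TypeDecimal:
			return avgOriginal4Decimal{}
		}
	}
	return nil
}

// buildAll returns nil unless every descriptor is supported, which signals
// the caller to fall back to the old execution framework.
func buildAll(descs []AggFuncDesc) []AggFunc {
	funcs := make([]AggFunc, 0, len(descs))
	for _, d := range descs {
		f := Build(d)
		if f == nil {
			return nil
		}
		funcs = append(funcs, f)
	}
	return funcs
}

func main() {
	supported := []AggFuncDesc{{Name: "avg", Mode: CompleteMode, ArgType: TypeDecimal}}
	mixed := append(supported, AggFuncDesc{Name: "sum", Mode: CompleteMode, ArgType: TypeDecimal})
	fmt.Println(buildAll(supported) != nil) // new framework can be used
	fmt.Println(buildAll(mixed) != nil)     // falls back to the old one
}
```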

What are the type of the changes (mandatory)?

  • Improvement (non-breaking change which is an improvement to an existing feature)

How has this PR been tested (mandatory)?

  • unit test
  • explain test

Does this PR affect documentation (docs/docs-cn) update? (optional)

No

Benchmark result if necessary (optional)

test sql:

mysql root@172.16.10.112:tpch> desc select avg(L_QUANTITY) from (select * from lineitem union all select * from lineitem) tmp;
+----------------+--------------+-------------------------------+------+-----------------------------------------------------+--------------+
| id             | parents      | children                      | task | operator info                                       | count        |
+----------------+--------------+-------------------------------+------+-----------------------------------------------------+--------------+
| TableScan_23   |              |                               | cop  | table:lineitem, range:[-inf,+inf], keep order:false | 59986052.00  |
| TableReader_24 | Union_21     |                               | root | data:TableScan_23                                   | 59986052.00  |
| TableScan_26   |              |                               | cop  | table:lineitem, range:[-inf,+inf], keep order:false | 59986052.00  |
| TableReader_27 | Union_21     |                               | root | data:TableScan_26                                   | 59986052.00  |
| Union_21       | StreamAgg_13 | TableReader_24,TableReader_27 | root |                                                     | 119972104.00 |
| StreamAgg_13   |              | Union_21                      | root | funcs:avg(tmp.l_quantity)                           | 1.00         |
+----------------+--------------+-------------------------------+------+-----------------------------------------------------+--------------+
6 rows in set
Time: 0.017s

Before this PR:

mysql root@172.16.10.112:tpch> select avg(L_QUANTITY) from (select * from lineitem union all select * from lineitem) tmp;
+-----------------+
| avg(L_QUANTITY) |
+-----------------+
| 25.501562       |
+-----------------+
1 row in set
Time: 54.508s

After this PR:

mysql root@172.16.10.112:tpch> select avg(L_QUANTITY) from (select * from lineitem union all select * from lineitem) tmp;
+-----------------+
| avg(L_QUANTITY) |
+-----------------+
| 25.501562       |
+-----------------+
1 row in set
Time: 27.767s

The performance gain is about 96%: the query time drops from 54.508s to 27.767s, a speedup of roughly 1.96x.

@zz-jason

Member Author

commented Jun 17, 2018

/run-all-tests

@winoros

Member

commented Jun 20, 2018

Are the Before and After results swapped?

@zz-jason

Member Author

commented Jun 20, 2018

@winoros updated

@zz-jason

Member Author

commented Jun 21, 2018

/run-all-tests

@zz-jason zz-jason added this to In progress in support intra-operator parallelism via automation Jun 21, 2018
@zz-jason

Member Author

commented Jun 22, 2018

@lysu PTAL

func (e *StreamAggExec) appendResult2Chunk(chk *chunk.Chunk) {
func (e *StreamAggExec) appendResult2Chunk(chk *chunk.Chunk) error {
if e.newAggFuncs != nil {
fmt.Printf("StreamAggExec.appendResult2Chunk: use new aggfunc\n")

@lamxTyler

lamxTyler Jun 22, 2018

Member

Remove the debug log.

zz-jason added 6 commits Jun 23, 2018
…nto dev/refactor-agg
@@ -100,6 +100,7 @@ func (e *avgDedup4Decimal) UpdatePartialResult(sctx sessionctx.Context, rowsInGr

type avgOriginal4Decimal struct {
baseAvgDecimal
deDuper map[types.MyDecimal]bool

@XuHuaiyu

XuHuaiyu Jun 25, 2018

Contributor

deDuper should be initialized
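
For context on this review comment: in Go, writing to a nil map panics at run time, while reading from one is safe. The sketch below shows the problem and the fix using a hypothetical dedupSum type; string keys stand in for types.MyDecimal, and none of these names come from the PR.

```go
package main

import "fmt"

// dedupSum sums distinct values only; string keys stand in for decimals.
type dedupSum struct {
	deDuper map[string]bool
	sum     int
}

// newDedupSum initializes the map up front, which is the fix requested above.
func newDedupSum() *dedupSum {
	return &dedupSum{deDuper: make(map[string]bool)}
}

// add accumulates val only the first time key is seen.
func (d *dedupSum) add(key string, val int) {
	if d.deDuper[key] { // reading a nil map is safe; writing is not
		return
	}
	d.deDuper[key] = true
	d.sum += val
}

// writeToNilMap reports whether adding to an uninitialized dedupSum panics.
func writeToNilMap() (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	var d dedupSum // deDuper is nil here
	d.add("1.5", 1)
	return false
}

func main() {
	d := newDedupSum()
	d.add("1.5", 15)
	d.add("1.5", 15) // duplicate, ignored
	d.add("2.0", 20)
	fmt.Println(d.sum)           // prints 35
	fmt.Println(writeToNilMap()) // prints true
}
```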

@@ -80,11 +81,20 @@ func buildAvg(aggFuncDesc *aggregation.AggFuncDesc, output []int) AggFunc {
case aggregation.CompleteMode, aggregation.Partial1Mode:
switch aggFuncDesc.Args[0].GetType().Tp {

@XuHuaiyu

XuHuaiyu Jun 25, 2018

Contributor

We should consider all the input types.
Would using EvalType here be better?

zz-jason added 2 commits Jun 25, 2018
e.AggFuncs = append(e.AggFuncs, aggDesc.GetAggFunc())
newAggFunc := aggfuncs.Build(aggDesc, []int{i})

@XuHuaiyu

XuHuaiyu Jun 26, 2018

Contributor

why do we need to pass a slice
since there is only one element in the slice?

// input PartialResult to the specific data structure which stores the
// partial result and then calculates the final result and append that
// final result to the chunk provided.
AppendFinalResult2Chunk(sctx sessionctx.Context, pr PartialResult, chk *chunk.Chunk) error

@XuHuaiyu

XuHuaiyu Jun 26, 2018

Contributor

s/ AppendFinalResult2Chunk/ GetFinalResult

@zz-jason

zz-jason Jun 26, 2018

Author Member

I prefer the original name, which indicates the result is appended to the output chunk

type baseAggFunc struct {
// input stores the input arguments for an aggregate function, we should
// call input.EvalXXX to get the actual input data for this function.
input []expression.Expression

@XuHuaiyu

XuHuaiyu Jun 26, 2018

Contributor
  1. s/ input/ args may be clearer.
  2. we do not need to define output as a slice,
    since we only use it to append the final result to a chunk.
@XuHuaiyu

Contributor

commented Jun 26, 2018

Do we need a GetPartialResult func, which may be used by mocktikv?

@zz-jason

Member Author

commented Jun 26, 2018

@XuHuaiyu If we decide to use it only in the final or complete mode, we don't need to add GetPartialResult; mocktikv can just use the original aggregate funcs.

@XuHuaiyu

Contributor

commented Jun 26, 2018

PTAL @coocood

@lysu

Member

commented Jun 28, 2018

/run-all-tests tidb-test=pr/559

zz-jason added 2 commits Jun 28, 2018

// for the new execution framework of aggregate functions
newAggFuncs []aggfuncs.AggFunc
partialResults []aggfuncs.PartialResult

@coocood

coocood Jun 29, 2018

Member

Why do we need to hold partialResults here instead of in each AggFunc?

@zz-jason

zz-jason Jun 29, 2018

Author Member

It's better to keep aggregate function implementations stateless. Otherwise we would have to allocate an aggregate function for every group, which is even worse when it is used in the hash aggregate operator.

@coocood

coocood Jun 29, 2018

Member

got it.
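
The stateless design discussed in this thread can be illustrated with a small sketch: a single aggregate function instance serves every group, and the operator keeps one partial result per group key. All names here (Row, sumFunc, partialResult4Sum, hashAgg) are hypothetical, and this is not the HashAggExec implementation.

```go
package main

import "fmt"

// Row is a simplified input row with a group key and a value.
type Row struct {
	Key string
	Val int64
}

// partialResult4Sum is the per-group state, owned by the executor.
type partialResult4Sum struct{ sum int64 }

// sumFunc has no fields: one instance can serve every group.
type sumFunc struct{}

func (sumFunc) AllocPartialResult() *partialResult4Sum { return &partialResult4Sum{} }

func (sumFunc) UpdatePartialResult(rows []Row, pr *partialResult4Sum) {
	for _, r := range rows {
		pr.sum += r.Val
	}
}

// hashAgg is a hypothetical hash aggregate operator that holds one shared
// function and a partial result per group key.
type hashAgg struct {
	fn             sumFunc
	partialResults map[string]*partialResult4Sum
}

func newHashAgg() *hashAgg {
	return &hashAgg{partialResults: make(map[string]*partialResult4Sum)}
}

// consume routes each row to the partial result of its group, allocating
// only a small state struct (not a new function) for unseen groups.
func (h *hashAgg) consume(rows []Row) {
	for i := range rows {
		pr, ok := h.partialResults[rows[i].Key]
		if !ok {
			pr = h.fn.AllocPartialResult()
			h.partialResults[rows[i].Key] = pr
		}
		h.fn.UpdatePartialResult(rows[i:i+1], pr)
	}
}

// results returns the final sum of every group.
func (h *hashAgg) results() map[string]int64 {
	out := make(map[string]int64, len(h.partialResults))
	for k, pr := range h.partialResults {
		out[k] = pr.sum
	}
	return out
}

func main() {
	h := newHashAgg()
	h.consume([]Row{{"a", 1}, {"b", 2}, {"a", 3}})
	fmt.Println(h.results())
}
```

If the state lived inside the function instead, the operator would need a separate function object per group key, which is the cost the comment above warns about.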

@XuHuaiyu

Contributor

commented Jun 29, 2018

LGTM

@coocood

Member

commented Jun 29, 2018

LGTM

@coocood coocood added status/LGT2 and removed status/LGT1 labels Jun 29, 2018
@zz-jason zz-jason merged commit 3c05d77 into pingcap:master Jun 29, 2018
4 checks passed
ci/circleci Your tests passed on CircleCI!
continuous-integration/travis-ci/pr The Travis CI build passed
jenkins-ci-tidb/build Jenkins job succeeded.
license/cla Contributor License Agreement is signed.
support intra-operator parallelism automation moved this from In progress to Done Jun 29, 2018
@zz-jason zz-jason deleted the zz-jason:dev/refactor-agg branch Jun 29, 2018
@mccxj mccxj referenced this pull request Sep 2, 2018