Avoid duplicated first_row agg func #8891

guo-shaoge · 2024-04-02T03:36:43Z

Enhancement

Check the following example:

drop table if exists t1;
create table t1(c1 int, c2 varchar(100), c3 int);
insert into t1 values(1, 'a', 1), (2, 'b', 2), ..., (10000, 'xxx', 10000);
select sum(c1), c2, c3 from t1 group by c2, c3;

In the current TiFlash HashAgg implementation, the internal computation proceeds as follows:

1. For each row, insert the serialized group-by key (c2 + c3) into the HashMap and update the agg state stored in the HashMap.
2. After all rows are handled, read the HashMap and copy its data into result Columns. The actual copies are:
   i. Copy result of sum(c1) to ColumnDecimal.
   ii. Copy result of first_row(c2) to ColumnString.
   iii. Copy result of first_row(c3) to ColumnInt.
   iv. Copy result of any(c2) to ColumnString.
   v. Copy result of c3 to ColumnInt.
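The procedure above can be sketched in a few lines of Python. This is a simplified model for illustration, not TiFlash's C++ code: the function name `hash_agg_current`, the `state` dict, and the use of plain lists as result columns are all assumptions, as is reading c3 back from the group key in the last copy.

```python
def hash_agg_current(rows):
    """Model of the current flow: rows are (c1, c2, c3), grouped by (c2, c3)."""
    hash_map = {}
    # Step 1: insert the group key and update the aggregate state per row.
    for c1, c2, c3 in rows:
        state = hash_map.get((c2, c3))
        if state is None:
            # first_row/any states remember the first value seen for the group.
            state = {"sum_c1": 0, "first_row_c2": c2,
                     "first_row_c3": c3, "any_c2": c2}
            hash_map[(c2, c3)] = state
        state["sum_c1"] += c1

    # Step 2: read the map and copy every result into its own output column.
    out = {"sum(c1)": [], "first_row(c2)": [], "first_row(c3)": [],
           "any(c2)": [], "c3": []}
    for (_, key_c3), state in hash_map.items():
        out["sum(c1)"].append(state["sum_c1"])              # 2.i
        out["first_row(c2)"].append(state["first_row_c2"])  # 2.ii
        out["first_row(c3)"].append(state["first_row_c3"])  # 2.iii
        out["any(c2)"].append(state["any_c2"])              # 2.iv
        out["c3"].append(key_c3)                            # 2.v (assumed: read from the key)
    return out


rows = [(1, "a", 1), (2, "b", 2), (3, "a", 1)]
result = hash_agg_current(rows)
```

In this model, five columns are filled per group even though only three of them hold distinct data.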

Steps 2.i, 2.ii, and 2.iii correspond to the select items sum(c1), c2, and c3.

Steps 2.iv and 2.v correspond to the group by columns c2 and c3.

However, the copy in 2.ii duplicates the one in 2.iv, and the copy in 2.iii duplicates the one in 2.v. So we can apply the following optimizations:

Eliminate the first_row(c3) agg func (a.k.a. 2.iii) and make a pointer that references c3 directly. This avoids both the computation of first_row(c3) and the copy of its result to the result ColumnInt.
Eliminate the any(c2) agg func (a.k.a. 2.iv) and make a pointer that references first_row(c2). This avoids both the computation of any(c2) and the copy of its result.
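A minimal sketch of the two eliminations, again as a simplified Python model rather than the actual TiFlash change. The name `hash_agg_optimized` and the use of shared list objects to stand in for "a pointer to reference" another column are assumptions made for illustration.

```python
def hash_agg_optimized(rows):
    """Model of the proposed flow: rows are (c1, c2, c3), grouped by (c2, c3)."""
    hash_map = {}
    for c1, c2, c3 in rows:
        state = hash_map.get((c2, c3))
        if state is None:
            # Only sum(c1) and first_row(c2) keep state;
            # first_row(c3) and any(c2) are eliminated.
            state = {"sum_c1": 0, "first_row_c2": c2}
            hash_map[(c2, c3)] = state
        state["sum_c1"] += c1

    # Only three columns are materialized from the map.
    sum_c1, first_row_c2, key_c3 = [], [], []
    for (_, c3), state in hash_map.items():
        sum_c1.append(state["sum_c1"])
        first_row_c2.append(state["first_row_c2"])
        key_c3.append(c3)

    # The eliminated outputs reference existing columns instead of owning a copy.
    return {
        "sum(c1)": sum_c1,
        "first_row(c2)": first_row_c2,
        "first_row(c3)": key_c3,   # alias of the c3 key column
        "any(c2)": first_row_c2,   # alias of first_row(c2)
        "c3": key_c3,
    }


rows = [(1, "a", 1), (2, "b", 2), (3, "a", 1)]
optimized = hash_agg_optimized(rows)
```

In this model the number of values copied out of the map drops from 5 per group to 3 per group, so with 10000 groups it falls from 50000 to 30000.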

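If c2 and c3 have a high NDV (number of distinct values), the HashMap holds many groups and each duplicated copy touches every one of them, so this optimization is significant.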

guo-shaoge added the type/enhancement Issue or PR for enhancement label Apr 2, 2024

guo-shaoge self-assigned this Apr 2, 2024

guo-shaoge changed the title ~~Avoid duplicated column copy for HashAgg~~ Avoid duplicated first_row agg func Apr 12, 2024

guo-shaoge mentioned this issue Apr 28, 2024

Optimize unnecessary column copy for HashAgg #8985

Merged

12 tasks

ti-chi-bot bot closed this as completed in #8985 May 29, 2024

ti-chi-bot bot added a commit that referenced this issue May 29, 2024

Optimize unnecessary column copy for HashAgg (#8985)

7c7b878

close #8891 Signed-off-by: guo-shaoge <shaoge1994@163.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
