Avoid duplicated first_row agg func #8891

Closed
guo-shaoge opened this issue Apr 2, 2024 · 0 comments · Fixed by #8985

guo-shaoge commented Apr 2, 2024

Enhancement

Check the following example:

drop table if exists t1;
create table t1(c1 int, c2 varchar(100), c3 int);
insert into t1 values(1, 'a', 1), (2, 'b', 2), ..., (10000, 'xxx', 10000);
select sum(c1), c2, c3 from t1 group by c2, c3;

In the current TiFlash HashAgg implementation, the computation proceeds as follows:

  1. For each row, serialize the group-by key (c2 + c3), insert it into the HashMap, and update the agg state stored in the HashMap.
  2. After all rows have been handled, read the HashMap and copy its data into result Columns. The actual copies are:
    i. Copy the result of sum(c1) to a ColumnDecimal.
    ii. Copy the result of first_row(c2) to a ColumnString.
    iii. Copy the result of first_row(c3) to a ColumnInt.
    iv. Copy the result of any(c2) to a ColumnString.
    v. Copy the result of c3 to a ColumnInt.
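
The two phases above can be modeled with a short Python sketch. This is a toy illustration and not TiFlash's code: a `dict` stands in for the HashMap, plain lists stand in for the result Columns, and the name `hash_agg_current` and the state field names are hypothetical.

```python
def hash_agg_current(rows):
    """Toy model of the build and read phases; not TiFlash's actual code."""
    # Phase 1: the group-by key (c2, c3) maps to one aggregate state per group.
    table = {}
    for c1, c2, c3 in rows:
        state = table.setdefault(
            (c2, c3),
            {"sum_c1": 0, "first_row_c2": c2, "first_row_c3": c3, "any_c2": c2},
        )
        state["sum_c1"] += c1

    # Phase 2: read the table and copy every value into its own result column.
    columns = {name: [] for name in
               ("sum(c1)", "first_row(c2)", "first_row(c3)", "any(c2)", "c3")}
    for (_, key_c3), state in table.items():
        columns["sum(c1)"].append(state["sum_c1"])              # 2.i
        columns["first_row(c2)"].append(state["first_row_c2"])  # 2.ii
        columns["first_row(c3)"].append(state["first_row_c3"])  # 2.iii
        columns["any(c2)"].append(state["any_c2"])              # 2.iv
        columns["c3"].append(key_c3)                            # 2.v, read from the key
    return columns


result = hash_agg_current([(1, "a", 1), (2, "b", 2), (3, "a", 1)])
```

In this model five separate lists are filled per group, even though two pairs of them always end up holding the same values.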

Copies 2.i, 2.ii and 2.iii correspond to the select items sum(c1), c2 and c3.

Copies 2.iv and 2.v correspond to the group by columns c2 and c3.

However, copy 2.ii duplicates 2.iv, and copy 2.iii duplicates 2.v. So we can do the following optimizations:

  1. Eliminate the first_row(c3) agg func (a.k.a. 2.iii) and make it a pointer that references c3 directly. This avoids both the computation of first_row(c3) and the copy of its result to a ColumnInt.
  2. Eliminate the any(c2) agg func (a.k.a. 2.iv) and make it a pointer that references first_row(c2). This avoids both the computation of any(c2) and the copy of its result to a ColumnString.
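
A minimal Python sketch of the proposed optimization follows, under the same toy assumptions: a `dict` stands in for the HashMap, lists stand in for Columns, and a "pointer" is modeled as two result names bound to the same list object. The function name `hash_agg_deduplicated` is hypothetical, and the change that eventually fixed this issue may be structured differently.

```python
def hash_agg_deduplicated(rows):
    """Toy model of the optimization: three columns are copied, two are aliases."""
    # Build phase: the state no longer carries first_row(c3) or any(c2).
    table = {}
    for c1, c2, c3 in rows:
        state = table.setdefault((c2, c3), {"sum_c1": 0, "first_row_c2": c2})
        state["sum_c1"] += c1

    # Read phase: only three columns are materialized.
    sum_c1, first_row_c2, key_c3 = [], [], []
    for (_, c3_value), state in table.items():
        sum_c1.append(state["sum_c1"])
        first_row_c2.append(state["first_row_c2"])
        key_c3.append(c3_value)

    return {
        "sum(c1)": sum_c1,
        "first_row(c2)": first_row_c2,
        "first_row(c3)": key_c3,    # alias of the c3 key column, no copy
        "any(c2)": first_row_c2,    # alias of first_row(c2), no copy
        "c3": key_c3,
    }


optimized = hash_agg_deduplicated([(1, "a", 1), (2, "b", 2), (3, "a", 1)])
```

The result still exposes all five names, so callers that look up a column by name keep working, but only three lists are ever filled.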

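If c2 and c3 have a high NDV (number of distinct values), this optimization is significant, because the number of groups, and therefore the number of values copied out of the HashMap, grows with the NDV.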
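
As a rough back-of-the-envelope illustration, the snippet below counts the values written while reading the HashMap. It assumes that all 10000 inserted rows in the example have distinct (c2, c3) pairs, which the elided insert statement does not confirm, and it counts every value as one unit even though string copies are likely more expensive than integer copies. The helper name `copied_values` is hypothetical.

```python
def copied_values(num_groups, columns_copied):
    """Number of per-group values written while reading the HashMap."""
    return num_groups * columns_copied


num_groups = 10_000                        # assumption: one group per inserted row
before = copied_values(num_groups, 5)      # copies 2.i to 2.v
after = copied_values(num_groups, 3)       # first_row(c3) and any(c2) eliminated
saved_fraction = (before - after) / before
```

Under these assumptions the read phase writes 30000 values instead of 50000, and the saving scales linearly with the number of groups.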

guo-shaoge added the type/enhancement (Issue or PR for enhancement) label on Apr 2, 2024
guo-shaoge self-assigned this on Apr 2, 2024
guo-shaoge changed the title from "Avoid duplicated column copy for HashAgg" to "Avoid duplicated first_row agg func" on Apr 12, 2024
ti-chi-bot bot added a commit that referenced this issue May 29, 2024
close #8891

Signed-off-by: guo-shaoge <shaoge1994@163.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>