Optimize unnecessary column copy for HashAgg #8985

Merged
merged 60 commits into from
May 29, 2024

@guo-shaoge guo-shaoge commented Apr 25, 2024

What problem does this PR solve?

Issue Number: close #8891

Problem Summary:
When a group-by key also appears in the select list (i.e. as a first_row agg func), TiFlash ends up with an extra agg func, which causes an unnecessary copy from the HashMap to the result column.

What is changed and how it works?

Basic idea:
[image]

Optimization-1 (with collation):
What: For a group-by key with a collation, there is always a first_row/any agg func for it that keeps the original data. So there is no need to copy such keys from the HashMap; a pointer that references the corresponding first_row/any result is enough.

How:

  1. Detect whether there is a first_row agg func for the key in the select list.
  2. If so, do not add an extra any agg func. If not, add the any agg func.
  3. Also set key_from_agg_func to indicate that this key is equivalent to the first_row/any agg func, so later stages can avoid copying this key from the HashMap. (See DAGExpressionAnalyzer::buildAggGroupBy.)
  4. If all keys are covered by first_row/any (rare, but it can happen), copying keys is skipped entirely (the template argument skip_serialize_key is true).
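
The planning steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `AggFunc`, `GroupKey`, and `planCollatedKeys` are hypothetical stand-ins, and only the field name `key_from_agg_func` is taken from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are not TiFlash's real types.
struct AggFunc
{
    std::string name; // e.g. "first_row", "any", "sum"
    std::string arg;  // name of the argument column
};

struct GroupKey
{
    std::string column;
    bool has_collation = false;
    // Index of the first_row/any agg func whose result equals this key, or -1.
    int key_from_agg_func = -1;
};

// For every collated key, reuse an existing first_row on the same column,
// otherwise append an 'any' agg func, and record the mapping on the key.
// Returns true only when every key is covered by such an agg func, which is
// the case where copying keys out of the hash map could be skipped.
inline bool planCollatedKeys(std::vector<GroupKey> & keys, std::vector<AggFunc> & aggs)
{
    bool all_covered = !keys.empty();
    for (auto & key : keys)
    {
        if (!key.has_collation)
        {
            all_covered = false; // this key still has to come from the hash map
            continue;
        }
        int found = -1;
        for (std::size_t i = 0; i < aggs.size(); ++i)
        {
            if (aggs[i].name == "first_row" && aggs[i].arg == key.column)
            {
                found = static_cast<int>(i);
                break;
            }
        }
        if (found < 0)
        {
            aggs.push_back({"any", key.column});
            found = static_cast<int>(aggs.size()) - 1;
        }
        assert(found >= 0);
        key.key_from_agg_func = found;
    }
    return all_covered;
}
```

In this sketch the return value plays the role that the description assigns to skip_serialize_key; how the real Aggregator consumes that flag is not shown here.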

Optimization-2 (no collation):
What: When the SQL query has an agg func like first_row(group_by_key_col) and group_by_key_col has no collation, the first_row agg func can be eliminated; a pointer that references the group-by key is enough.
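
A rough sketch of the "pointer instead of copy" idea for this case, assuming columns are held through shared pointers. `Column`, `ColumnPtr`, and `referenceKeysForFirstRow` are hypothetical names chosen for illustration and are not TiFlash's actual column interface.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical column representation used only for this illustration.
using Column = std::vector<std::string>;
using ColumnPtr = std::shared_ptr<const Column>;

// Each eliminated first_row(key) output is described only by the index of the
// group-by key it mirrors. The output shares the key column's storage, so no
// per-row copy is made.
inline std::vector<ColumnPtr> referenceKeysForFirstRow(
    const std::vector<ColumnPtr> & key_columns,
    const std::vector<std::size_t> & first_row_key_indexes)
{
    std::vector<ColumnPtr> outputs;
    outputs.reserve(first_row_key_indexes.size());
    for (std::size_t key_index : first_row_key_indexes)
    {
        assert(key_index < key_columns.size());
        outputs.push_back(key_columns[key_index]); // copies the pointer, not the data
    }
    return outputs;
}
```

For example, calling `referenceKeysForFirstRow(keys, {1})` yields a single output column that points to the same storage as `keys[1]`.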

Results

  1. 25% improvement
  2. workload:
    1. 20M rows, very high NDV.
    2. 3 varchar columns, 3 decimal columns, 3 int columns (that means Aggregator will use HashMethodSerialized)

before:
[image]

after:
[image]

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Signed-off-by: guo-shaoge <shaoge1994@163.com>
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 25, 2024
@guo-shaoge

/run-all-tests

@guo-shaoge

/test all

@guo-shaoge guo-shaoge changed the title from "Optimize duplicated agg func" to "Optimize unnecessary copy for HashAgg" Apr 29, 2024
@guo-shaoge

/test all

@guo-shaoge

/test all

@guo-shaoge guo-shaoge mentioned this pull request Apr 29, 2024
12 tasks
@guo-shaoge

/test all


@windtalker windtalker left a comment

LGTM

@SeaRise SeaRise self-requested a review May 28, 2024 10:19
@SeaRise SeaRise self-requested a review May 29, 2024 01:53

@SeaRise SeaRise left a comment

lgtm


ti-chi-bot bot commented May 29, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SeaRise, windtalker

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


ti-chi-bot bot commented May 29, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-05-28 03:00:13.985921633 +0000 UTC m=+2745367.743057204: ☑️ agreed by windtalker.
  • 2024-05-29 01:58:46.211107092 +0000 UTC m=+2828079.968242665: ☑️ agreed by SeaRise.

@SeaRise

SeaRise commented May 29, 2024

/hold

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 29, 2024
@SeaRise

SeaRise commented May 29, 2024

Comment /unhold to merge the PR~
@guo-shaoge

@guo-shaoge guo-shaoge removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 29, 2024
@guo-shaoge guo-shaoge added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 29, 2024
@guo-shaoge guo-shaoge removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 29, 2024
@guo-shaoge

/merge


ti-chi-bot bot commented May 29, 2024

@guo-shaoge: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

This triggers some heavy tests that do not always run when the PR is updated.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@guo-shaoge

/test pull-integration-test

@ti-chi-bot ti-chi-bot bot merged commit 7c7b878 into pingcap:master May 29, 2024
5 checks passed
@JaySon-Huang JaySon-Huang deleted the optimize_duplicated_agg_func branch May 29, 2024 13:04
Labels
approved lgtm release-note-none size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Avoid duplicated first_row agg func
4 participants