copr: enable region load balance for MPP #38117

Merged
merged 8 commits into from Sep 26, 2022

windtalker
Contributor

@windtalker windtalker commented Sep 23, 2022

Signed-off-by: xufei <xufeixw@mail.ustc.edu.cn>

What problem does this PR solve?

Issue Number: close #38113

Problem Summary:

The whole story is like this:

  1. At the very beginning, TiDB could access TiFlash only in cop mode. In cop mode, each cop request accesses data in a single region. If the region has multiple TiFlash replicas, each access to the region chooses the next replica to serve it, so the load is balanced across TiFlash nodes.
  2. After MPP and BatchCop were introduced, TiDB could access TiFlash in BatchCop/MPP mode. Unlike cop mode, BatchCop/MPP mode does not access data by region; instead, it accesses data by TiFlash node. That is, each BatchCop/MPP request reads a batch of regions on one TiFlash node. BatchCop/MPP greatly reduces the number of RPC calls, but it also has problems, especially when some TiFlash nodes are temporarily unavailable: in cop mode, TiDB can simply retry with the next replica, while in BatchCop/MPP mode the cost of a retry is unacceptable because each BatchCop/MPP request may contain hundreds or even thousands of regions. So, to avoid sending requests to an unavailable TiFlash node, if one replica of a TiFlash region is available, TiDB always uses that replica for BatchCop/MPP requests (by setting loadBalance to false here).
  3. After disabling per-region load balance, we found that the MPP load is unbalanced even if the TiFlash table has multiple replicas. There are two levels of imbalance:
  • Intra-query imbalance: consider a query like select * from t. Assume the cluster has 2 TiFlash nodes, t has two TiFlash replicas, and t contains 100 regions. It is possible that the query accesses regions on only one TiFlash node, and the other TiFlash node is completely ignored.
  • Inter-query imbalance: again consider a query like select * from t on a cluster with 2 TiFlash nodes, where t has two TiFlash replicas. This time assume t contains only 1 region and the query concurrency is 100. It is then possible that all 100 queries read from one TiFlash node.
  4. To solve the intra-query imbalance, we introduced balanceBatchCopTask, which balances region access across TiFlash nodes for each query.
  5. In some tests, we found that even with per-region load balance disabled, there is still a risk of accessing an unavailable TiFlash node. To solve the problem completely, we came up with a new solution: when constructing an MPP request, TiDB checks the availability of each TiFlash node and constructs the MPP request only on live nodes.
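The inter-query case in the list above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `region` type, `pickReplica`, `simulate`, and the store names are hypothetical and are not TiDB's actual region cache code. It compares always reusing the same replica (load balance off) with rotating across replicas on each access (load balance on) for a table with a single region.

```go
package main

import "fmt"

// region is a simplified, hypothetical stand-in for a cached region
// together with the stores that hold its TiFlash replicas.
type region struct {
	replicas []string // store addresses holding a TiFlash replica
	next     int      // index of the replica to use on the next access
}

// pickReplica returns the store that serves the next request for r.
// With loadBalance=false the first replica is reused every time;
// with loadBalance=true the choice rotates round-robin across replicas.
func (r *region) pickReplica(loadBalance bool) string {
	if !loadBalance {
		return r.replicas[0]
	}
	s := r.replicas[r.next%len(r.replicas)]
	r.next++
	return s
}

// simulate issues n single-region queries against a table with two
// replicas and counts how many queries each store serves.
func simulate(n int, loadBalance bool) map[string]int {
	r := &region{replicas: []string{"tiflash-0", "tiflash-1"}}
	hits := make(map[string]int)
	for i := 0; i < n; i++ {
		hits[r.pickReplica(loadBalance)]++
	}
	return hits
}

func main() {
	fmt.Println("load balance off:", simulate(100, false))
	fmt.Println("load balance on: ", simulate(100, true))
}
```

With load balance off, every one of the 100 queries lands on the same store; with it on, the two stores split the queries evenly. A per-query balancer such as balanceBatchCopTask cannot help in this case, because each query has only one region to place.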

So, as we can see, because per-region load balance is disabled, BatchCop/MPP still suffers from the inter-query imbalance problem. But for MPP requests, the unavailable-node problem no longer exists, so we do not actually need to disable per-region load balance. This PR enables per-region load balance for MPP requests.
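As a rough illustration of why the two mechanisms combine safely, the sketch below filters out stores that are not alive and then rotates among the remaining replicas of each region. The names `regionInfo`, `buildMPPTasks`, and the `alive` map are assumptions made for this example; the actual change in this PR lives in TiDB's copr package and may be structured differently.

```go
package main

import "fmt"

// regionInfo is a hypothetical, simplified region descriptor.
type regionInfo struct {
	id       int
	replicas []string // stores holding a TiFlash replica of this region
	next     int      // rotation counter shared across queries
}

// buildMPPTasks groups region IDs by the store chosen to read them.
// Only stores marked alive are eligible (the availability check), and
// among eligible replicas the choice rotates per region (load balance).
func buildMPPTasks(regions []*regionInfo, alive map[string]bool) (map[string][]int, error) {
	tasks := make(map[string][]int)
	for _, r := range regions {
		var live []string
		for _, s := range r.replicas {
			if alive[s] {
				live = append(live, s)
			}
		}
		if len(live) == 0 {
			return nil, fmt.Errorf("region %d has no live TiFlash replica", r.id)
		}
		store := live[r.next%len(live)]
		r.next++
		tasks[store] = append(tasks[store], r.id)
	}
	return tasks, nil
}

func main() {
	r := &regionInfo{id: 1, replicas: []string{"tiflash-0", "tiflash-1"}}
	alive := map[string]bool{"tiflash-0": true, "tiflash-1": true}

	// Consecutive queries on the same single-region table alternate stores.
	for q := 0; q < 4; q++ {
		tasks, err := buildMPPTasks([]*regionInfo{r}, alive)
		if err != nil {
			panic(err)
		}
		fmt.Println("query", q, "->", tasks)
	}

	// Once a store is reported down, it is never chosen.
	alive["tiflash-1"] = false
	tasks, err := buildMPPTasks([]*regionInfo{r}, alive)
	if err != nil {
		panic(err)
	}
	fmt.Println("tiflash-1 down ->", tasks)
}
```

In this sketch the rotation only ever sees live stores, so turning load balance back on does not reintroduce the risk of sending a large request to an unavailable node.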

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot
Member

ti-chi-bot commented Sep 23, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • guo-shaoge
  • wshwsh12

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added do-not-merge/invalid-title release-note-none size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 23, 2022
@windtalker windtalker changed the title enable region load balance for MPP copr: enable region load balance for MPP Sep 23, 2022
@hawkingrei
Member

/run-check_dev

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Sep 23, 2022
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Sep 23, 2022
@windtalker
Contributor Author

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: cd04fb40fd37ff5b3f4ec3d1e31f2964ea0e010a

@ti-chi-bot ti-chi-bot added status/can-merge Indicates a PR has been approved by a committer. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed status/can-merge Indicates a PR has been approved by a committer. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 23, 2022
@windtalker
Contributor Author

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: daea317a9f8de617787a24499bded38d221478f3

@ti-chi-bot ti-chi-bot added status/can-merge Indicates a PR has been approved by a committer. and removed status/can-merge Indicates a PR has been approved by a committer. labels Sep 23, 2022
@windtalker
Contributor Author

/run-unit-test

Signed-off-by: xufei <xufeixw@mail.ustc.edu.cn>
Signed-off-by: xufei <xufeixw@mail.ustc.edu.cn>
Signed-off-by: xufei <xufeixw@mail.ustc.edu.cn>
Signed-off-by: xufei <xufeixw@mail.ustc.edu.cn>
Signed-off-by: xufei <xufeixw@mail.ustc.edu.cn>
Signed-off-by: xufei <xufeixw@mail.ustc.edu.cn>
@ti-chi-bot ti-chi-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 26, 2022
Signed-off-by: xufei <xufeixw@mail.ustc.edu.cn>
Signed-off-by: xufei <xufeixw@mail.ustc.edu.cn>
@windtalker
Contributor Author

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: e9c1447

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Sep 26, 2022
@ti-chi-bot ti-chi-bot merged commit 21cfb9d into pingcap:master Sep 26, 2022
@sre-bot
Contributor

sre-bot commented Sep 26, 2022

TiDB MergeCI notify

🔴 Bad News! New failing [1] after this pr merged.
These new failed integration tests seem to be caused by the current PR, please try to fix these new failed integration tests, thanks!

| CI Name | Result | Duration | Compare with Parent commit |
| --- | --- | --- | --- |
| idc-jenkins-ci-tidb/tics-test | 🟥 failed 1, success 0, total 1 | 6 min 15 sec | New failing |
| idc-jenkins-ci/integration-cdc-test | ✅ all 37 tests passed | 26 min | Fixed |
| idc-jenkins-ci-tidb/integration-ddl-test | 🟢 all 6 tests passed | 31 min | Existing passed |
| idc-jenkins-ci-tidb/integration-common-test | 🟢 all 17 tests passed | 10 min | Existing passed |
| idc-jenkins-ci-tidb/common-test | 🟢 all 11 tests passed | 8 min 34 sec | Existing passed |
| idc-jenkins-ci-tidb/sqllogic-test-2 | 🟢 all 28 tests passed | 4 min 10 sec | Existing passed |
| idc-jenkins-ci-tidb/sqllogic-test-1 | 🟢 all 26 tests passed | 3 min 52 sec | Existing passed |
| idc-jenkins-ci-tidb/integration-compatibility-test | 🟢 all 1 tests passed | 3 min 11 sec | Existing passed |
| idc-jenkins-ci-tidb/mybatis-test | 🟢 all 1 tests passed | 2 min 54 sec | Existing passed |
| idc-jenkins-ci-tidb/plugin-test | 🟢 build success, plugin test success | 4 min | Existing passed |

@windtalker windtalker deleted the mpp_inter_query_load_balance branch September 26, 2022 13:50
Labels
release-note-none size/M Denotes a PR that changes 30-99 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MPP query may not be balanced between TiFlash nodes
6 participants