schedule: fix a thread-safe bug and improve code #1719

Merged: 14 commits into pingcap:master from Luffbee:fix-op-controller on Sep 6, 2019

@Luffbee
Contributor

commented Aug 30, 2019

What problem does this PR solve?

  1. There is a thread-safety bug that may remove operators unexpectedly; see issue #1716.
  2. There is suspicious code in pollNeedDispathOperator: when the region of an operator has disappeared, it is ignored. The code that handles this case is in RaftCluster.checkOperator, far away from where the problem arises.

What is changed and how it works?

  1. This PR uses the second possible solution mentioned in the issue: check that the operator is still the same one before deleting it.
  2. It moves the logic from RaftCluster.checkOperator into pollNeedDispathOperator and removes some unused functions.
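
The "check that the operator is the same before deleting" idea from item 1 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Operator`, `Controller`, `SetOperator`, `GetOperator`, and `RemoveOperator` are hypothetical names, and the assumed race (one goroutine still holding a stale operator while another has already replaced it for the same region) is a guess at the shape of issue #1716, which is not reproduced here.

```go
package main

import (
	"fmt"
	"sync"
)

// Operator is a hypothetical stand-in for a scheduling operator.
type Operator struct {
	RegionID uint64
	Desc     string
}

// Controller is a hypothetical, simplified operator controller.
type Controller struct {
	mu        sync.Mutex
	operators map[uint64]*Operator // keyed by region ID
}

// NewController returns an empty controller.
func NewController() *Controller {
	return &Controller{operators: make(map[uint64]*Operator)}
}

// SetOperator registers op for its region, replacing any existing operator.
func (c *Controller) SetOperator(op *Operator) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.operators[op.RegionID] = op
}

// GetOperator returns the operator registered for regionID, or nil.
func (c *Controller) GetOperator(regionID uint64) *Operator {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.operators[regionID]
}

// RemoveOperator deletes op only if it is still the operator registered
// for its region (compared by pointer identity). It reports whether a
// deletion happened, so a stale caller cannot remove a newer operator.
func (c *Controller) RemoveOperator(op *Operator) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cur, ok := c.operators[op.RegionID]; !ok || cur != op {
		return false
	}
	delete(c.operators, op.RegionID)
	return true
}

func main() {
	c := NewController()

	stale := &Operator{RegionID: 1, Desc: "balance-region"}
	c.SetOperator(stale)

	// Assumed scenario: another goroutine replaces the operator for the
	// same region while the first caller still holds the old pointer.
	fresh := &Operator{RegionID: 1, Desc: "transfer-leader"}
	c.SetOperator(fresh)

	fmt.Println(c.RemoveOperator(stale)) // false: stale operator, nothing removed
	fmt.Println(c.GetOperator(1).Desc)   // transfer-leader
	fmt.Println(c.RemoveOperator(fresh)) // true
}
```

Keying the removal on region ID alone would let the stale caller delete whatever operator currently occupies the slot; comparing against the stored pointer under the same lock makes the check and the delete one atomic step.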

Check List

Tests

  • Unit test
@codecov-io

commented Aug 30, 2019

Codecov Report

Merging #1719 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1719      +/-   ##
==========================================
+ Coverage   76.92%   76.95%   +0.02%     
==========================================
  Files         161      161              
  Lines       15734    15724      -10     
==========================================
- Hits        12103    12100       -3     
+ Misses       2613     2610       -3     
+ Partials     1018     1014       -4
| Impacted Files | Coverage Δ |
| --- | --- |
| server/cluster.go | 83.91% <ø> (+1.06%) ⬆️ |
| server/schedule/operator_controller.go | 90.88% <100%> (+1.05%) ⬆️ |
| server/handler.go | 50.13% <100%> (+0.52%) ⬆️ |
| pkg/metricutil/metricutil.go | 90.62% <0%> (-9.38%) ⬇️ |
| server/schedulers/shuffle_hot_region.go | 58.97% <0%> (-6.42%) ⬇️ |
| server/region_syncer/client.go | 78.94% <0%> (-3.95%) ⬇️ |
| client/client.go | 69.76% <0%> (-0.65%) ⬇️ |
| server/schedule/operator/operator.go | 85.58% <0%> (+0.36%) ⬆️ |
| server/member/leader.go | 78.06% <0%> (+1.02%) ⬆️ |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f2487bf...6a4ce79. Read the comment docs.

@nolouch
nolouch approved these changes Sep 2, 2019
Member

left a comment

LGTM

@nolouch

Member

commented Sep 4, 2019

server/handler.go (review thread resolved)
@rleungx
rleungx approved these changes Sep 5, 2019

@rleungx rleungx added the CanMerge label Sep 5, 2019

@sre-bot

commented Sep 5, 2019

/run-all-tests

@sre-bot

commented Sep 5, 2019

@Luffbee merge failed.

@Luffbee

Contributor Author

commented Sep 5, 2019

/rebuild

@Luffbee

Contributor Author

commented Sep 5, 2019

/run-all-tests

@nolouch

Member

commented Sep 5, 2019

/run-integration-common-test

@nolouch

Member

commented Sep 5, 2019

/run-integration-common-test

@nolouch

Member

commented Sep 6, 2019

/merge

@rleungx

Member

commented Sep 6, 2019

@Luffbee Please solve the conflicts.

@rleungx

Member

commented Sep 6, 2019

/merge

@sre-bot

commented Sep 6, 2019

/run-all-tests

@sre-bot sre-bot merged commit b66ba44 into pingcap:master Sep 6, 2019

8 checks passed

ci/circleci: Your tests passed on CircleCI!
continuous-integration/travis-ci/pr: The Travis CI build passed
idc-jenkins-ci-pd/integration-common-test: Jenkins job succeeded.
idc-jenkins-ci-pd/integration-compatibility-test: Jenkins job succeeded.
idc-jenkins-ci-pd/integration-ddl-test: Jenkins job succeeded.
idc-jenkins-ci/build: Jenkins job succeeded.
idc-jenkins-ci/test: Jenkins job succeeded.
license/cla: Contributor License Agreement is signed.

@sre-bot

commented Sep 6, 2019

cherry pick to release-3.0 failed

Luffbee added a commit that referenced this pull request Sep 9, 2019
[new-hotspot-scheduler] add operator influence and more metrics (#1713)
* *: unify get store function everywhere (#1671)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* remove unnecessary parentheses

*  server: use leader lease to determine tso service validity (#1676)

Signed-off-by: disksing <i@disksing.com>

* change internal stat values to float64

* add pending operator influence

* add metrics of pending influence

* fix metrics

* fix panic

* adjust pending influence of balanceHotWrite

* change weight of pending influence

* test: fix tests (#1696)

* test: fix region syncer test

Signed-off-by: disksing <i@disksing.com>

* decrease region rolling window; store pending influence in scheduler

* add config-check flag for pd-server (#1695)

Signed-off-by: cwen0 <cwenyin0@gmail.com>

* decrease possiblility transfer hot write leader

* change pending influence weight

* add unstarted op metrics

* add logs for debug

* add log for debug

* add logs for debug

* add logs for debug

* add logs for debug

* add logs for debug

* add logs for debug

* add logs for debug

* Revert "add logs for debug"

This reverts commit e74c7a9.

* add metrics for hotspot operators

* operator: rewrite move region related functions (#1667)

* add metrics for pending operators

* *: support setting endKey for ScanRange (#1700)

Signed-off-by: disksing <i@disksing.com>

* fix bug

* fix bug

* fix bug

* fix metrics thread-safe bug

* fix logic bug

* *: reduce some unnecessary parameters (#1698)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* schedule: Do not send an operator of a region wth a stale epoch (#1659)

* schedule: Do not send an operator of a region wth a stale epoch

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: check the version changed by the operator self

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: fix unit test

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: fix to avoid dispatching a stale opstep

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: refactor "ConsumeConfVer() int" to "ExpectConfVerChange() bool"

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: fix typo in comment

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* fix typo

Co-Authored-By: Ryan Leung <rleungx@gmail.com>

* dispatch: fix unittest

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: refine format

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* server: fix the dead lock in scatter region (#1706)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* add drop time for operator

* use IsDropped to recognize canceled ops

* try to fix trans leader burst

* try to fix trans leader burst

* add zombie influence

* change select src dst strategy; improve op_controller

* change select src strategy

* fix bug

* tools: fix set namespace in pd-ctl (#1701)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* tools: fix parse url without http prefix (#1703)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* tests: support deadlock detection in make test (#1704)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* Makefile: fix failpoint enable (#1722)

Signed-off-by: nolouch <nolouch@gmail.com>

* checker: fix the issue that a region does not merge to the sibling with smaller size (#1723)

Signed-off-by: disksing <i@disksing.com>

* tools: balance region simulator (#1708)

* scheduler: do not remove the operator when the step does not finish (#1715)

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* operator: fix the AddLearner config version judgment (#1732)

Signed-off-by: nolouch <nolouch@gmail.com>

* tools: fix TLS in pd control (#1729)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* syncer: support TLS for region syncer (#1728)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* schedule: fix a thread-safe bug and improve code (#1719)

@Luffbee Luffbee deleted the Luffbee:fix-op-controller branch Sep 11, 2019

sre-bot added a commit that referenced this pull request Sep 11, 2019
Luffbee added a commit that referenced this pull request Sep 11, 2019
[new-hotspot-scheduler] merge master (#1752)
* *: unify get store function everywhere (#1671)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

*  server: use leader lease to determine tso service validity (#1676)

Signed-off-by: disksing <i@disksing.com>

* test: fix tests (#1696)

* test: fix region syncer test

Signed-off-by: disksing <i@disksing.com>

* add config-check flag for pd-server (#1695)

Signed-off-by: cwen0 <cwenyin0@gmail.com>

* operator: rewrite move region related functions (#1667)

* *: support setting endKey for ScanRange (#1700)

Signed-off-by: disksing <i@disksing.com>

* *: reduce some unnecessary parameters (#1698)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* schedule: Do not send an operator of a region wth a stale epoch (#1659)

* schedule: Do not send an operator of a region wth a stale epoch

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: check the version changed by the operator self

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: fix unit test

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: fix to avoid dispatching a stale opstep

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: refactor "ConsumeConfVer() int" to "ExpectConfVerChange() bool"

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: fix typo in comment

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* fix typo

Co-Authored-By: Ryan Leung <rleungx@gmail.com>

* dispatch: fix unittest

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: refine format

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* server: fix the dead lock in scatter region (#1706)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* tools: fix set namespace in pd-ctl (#1701)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* tools: fix parse url without http prefix (#1703)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* tests: support deadlock detection in make test (#1704)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* Makefile: fix failpoint enable (#1722)

Signed-off-by: nolouch <nolouch@gmail.com>

* checker: fix the issue that a region does not merge to the sibling with smaller size (#1723)

Signed-off-by: disksing <i@disksing.com>

* tools: balance region simulator (#1708)

* scheduler: do not remove the operator when the step does not finish (#1715)

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* operator: fix the AddLearner config version judgment (#1732)

Signed-off-by: nolouch <nolouch@gmail.com>

* tools: fix TLS in pd control (#1729)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* syncer: support TLS for region syncer (#1728)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* schedule: fix a thread-safe bug and improve code (#1719)

* statistics: fix region flow calculation (#1688)

Signed-off-by: jiyingtk <jiyingtk@mail.ustc.edu.cn>

* makefile: improve deadlock-enable/disable (#1736)

* api: fix missing keys statistic in region information (#1741)

Signed-off-by: nolouch <nolouch@gmail.com>

* *: update go version to 1.13 (#1742)

Signed-off-by: disksing <i@disksing.com>

* coordinator: add the operator cost time in log field (#1748)

Signed-off-by: nolouch <nolouch@gmail.com>