Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler: fix region scatter may transfer leader to removed peer #1482

Merged
merged 11 commits into from Apr 2, 2019

Conversation

Projects
None yet
4 participants
@nolouch
Copy link
Member

commented Mar 27, 2019

Signed-off-by: nolouch nolouch@gmail.com

What problem does this PR solve?

Somtimes region scatter may timeout, it cause by:

  • Leader already changed but scatter-region step still use old leader
  • Leader may transfer to a removed peer

log like:

[region 2109921] operator timeout: scatter-region (kind:leader,region,admin, region:2109921(5233,677), createAt:2019-03-26 14:29:14.885583251 +0800 CST m=+1888726.900848985, currentStep:10, steps:[add learner peer 2109925 on store 5 promote learner peer 2109925 on store 5 to voter remove peer on store 8 transfer leader from store 0 to store 5 add learner peer 2109926 on store 6 promote learner peer 2109926 on store 6 to voter remove peer on store 4 transfer leader from store 0 to store 6 add learner peer 2109927 on store 1 promote learner peer 2109927 on store 1 to voter transfer leader from store 7 to store 8 remove peer on store 7 transfer leader from store 0 to store 1]) timeout

add  store 5
remove store 8
transfer to 5
add store 6
remove store 4
transfer to 6
add  store 1 
transfer from 7 to 8
remove  store 7
transfer to 1

the origin leader is on store 7.

What is changed and how it works?

Fix the operator create logic.

Check List

Tests

  • Unit test
scheduler: fix region scatter may transfer leader to removed peer
Signed-off-by: nolouch <nolouch@gmail.com>
@disksing

This comment has been minimized.

Copy link
Member

commented Mar 27, 2019

@nolouch The log you provided does not transfer leader to removed peer. The steps are:

- add 5
- transfer 2
- remove 1
- transfer 5
- add 4
- remove 2
- transfer 4

@nolouch nolouch changed the title scheduler: fix region scatter may transfer leader to removed peer scheduler: fix region scatter may transfer leader from removed peer Mar 27, 2019

@nolouch nolouch changed the title scheduler: fix region scatter may transfer leader from removed peer scheduler: fix region scatter may transfer leader to removed peer Mar 27, 2019

@nolouch

This comment has been minimized.

Copy link
Member Author

commented Mar 27, 2019

@disksing I update the log, this one is pick from test cluster.

nolouch added some commits Mar 27, 2019

fix
Signed-off-by: nolouch <nolouch@gmail.com>
@rleungx

This comment has been minimized.

Copy link
Member

commented Mar 28, 2019

@nolouch CI failed.

@nolouch nolouch added the DNM label Mar 28, 2019

*: random pick leader
Signed-off-by: nolouch <nolouch@gmail.com>

@nolouch nolouch removed the DNM label Apr 1, 2019

nolouch added some commits Apr 1, 2019

clean up
Signed-off-by: nolouch <nolouch@gmail.com>
fix
Signed-off-by: nolouch <nolouch@gmail.com>
@codecov-io

This comment has been minimized.

Copy link

commented Apr 1, 2019

Codecov Report

❗️ No coverage uploaded for pull request base (master@56f4100). Click here to learn what that means.
The diff coverage is 78.02%.

Impacted file tree graph

@@           Coverage Diff            @@
##             master   #1482   +/-   ##
========================================
  Coverage          ?   67.4%           
========================================
  Files             ?     158           
  Lines             ?   15454           
  Branches          ?       0           
========================================
  Hits              ?   10417           
  Misses            ?    4092           
  Partials          ?     945
Impacted Files Coverage Δ
server/grpc_service.go 53.75% <0%> (ø)
server/schedule/mockcluster.go 83.41% <0%> (ø)
server/handler.go 55.58% <33.33%> (ø)
server/schedule/operator.go 86.26% <68.42%> (ø)
server/schedule/region_scatterer.go 88.14% <90.47%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 56f4100...4bf7f45. Read the comment docs.

@nolouch

This comment has been minimized.

Copy link
Member Author

commented Apr 1, 2019

steps := make([]OperatorStep, 0, len(targetPeers)*2+1)
deferSteps := make([]OperatorStep, 0, 2)
var kind OperatorKind
sameLeader := targetLeaderPeer.GetStoreId() == origin.GetLeader().GetStoreId()

This comment has been minimized.

Copy link
@rleungx

rleungx Apr 2, 2019

Member

How about using a variable for origin.GetLeader().GetStoreId()


storeIDs := origin.GetStoreIds()
steps := make([]OperatorStep, 0, len(targetPeers)*2+1)
deferSteps := make([]OperatorStep, 0, 2)

This comment has been minimized.

Copy link
@rleungx

rleungx Apr 2, 2019

Member

It's better to add some comments to explain why we put these steps in the end.

address comments
Signed-off-by: nolouch <nolouch@gmail.com>
@rleungx

rleungx approved these changes Apr 2, 2019

@nolouch nolouch merged commit 2d3e5fa into pingcap:master Apr 2, 2019

3 of 4 checks passed

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
ci/circleci Your tests passed on CircleCI!
Details
idc-jenkins-ci/build Jenkins job succeeded.
Details
license/cla Contributor License Agreement is signed.
Details

@nolouch nolouch deleted the nolouch:fix-scatter branch Apr 2, 2019

nolouch added a commit to nolouch/pd that referenced this pull request Apr 2, 2019

scheduler: fix region scatter may transfer leader to removed peer (pi…
…ngcap#1482)

* scheduler: fix region scatter may transfer leader to removed peer

Signed-off-by: nolouch <nolouch@gmail.com>

nolouch added a commit that referenced this pull request Apr 3, 2019

*: cherry pick some fixes to release 2.1 (#1490)
* scheduler: fix region scatter may transfer leader to removed peer (#1482)

* scheduler: fix region scatter may transfer leader to removed peer

Signed-off-by: nolouch <nolouch@gmail.com>

* fix get hot store (#1487)

Signed-off-by: Ryan Leung <rleungx@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.