PD automatic failover #57

Closed
weekface opened this issue Aug 27, 2018 · 2 comments · Fixed by #74

weekface commented Aug 27, 2018

When a PD peer has been unhealthy for a while (this can be obtained from the PD cluster's /pd/health API), the operator should (sketched in the code below):

  1. mark this peer as a failure member
  2. invoke the deleteMember API to remove this member from the PD cluster
  3. increase the replicas to add a new PD peer
  4. repeatedly delete the PVC and Pod of this PD peer so that the StatefulSet creates a new PD peer with the same ordinal, without reusing the tombstoned PV
  5. decrease the replicas once all members are ready again

part of #47
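
A minimal Go sketch of this loop, assuming hypothetical types and helpers (pdFailover, PDClient, deletePVCAndPod, and the TidbCluster fields here are stand-ins, not the actual tidb-operator API):

```go
package failover

type PDMember struct {
	Name   string
	ID     uint64
	Health bool // derived from PD's /pd/health endpoint
}

type TidbCluster struct {
	Replicas       int32
	Members        []PDMember
	FailureMembers map[string]uint64 // step 1: member name -> member ID
}

// PDClient abstracts the PD cluster API used by the operator.
type PDClient interface {
	DeleteMember(name string) error
}

type pdFailover struct {
	pd PDClient
	// deletePVCAndPod is retried by the control loop until the
	// StatefulSet has recreated the Pod on a fresh PV (step 4).
	deletePVCAndPod func(podName string) error
}

func (f *pdFailover) Failover(tc *TidbCluster) error {
	for _, m := range tc.Members {
		if m.Health {
			continue
		}
		// 1. Mark this peer as a failure member.
		tc.FailureMembers[m.Name] = m.ID

		// 2. Remove the member from the PD cluster via the deleteMember API.
		if err := f.pd.DeleteMember(m.Name); err != nil {
			return err
		}

		// 3. Increase replicas so the StatefulSet adds a new PD peer.
		tc.Replicas++

		// 4. Delete the failed peer's PVC and Pod; the StatefulSet
		// recreates the Pod with the same ordinal on a fresh PV instead
		// of the tombstoned one.
		if err := f.deletePVCAndPod(m.Name); err != nil {
			return err
		}
	}
	// 5. A later reconcile pass decreases Replicas again once every
	// member reports healthy.
	return nil
}
```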

@weekface weekface added the enhancement New feature or request label Aug 27, 2018
@weekface weekface added this to the v0.2 milestone Aug 27, 2018
@weekface weekface self-assigned this Aug 27, 2018

weekface commented Aug 31, 2018

I am working on this feature and encountered an issue during test verification:

PD's StatefulSet podManagementPolicy is OrderedReady, so if the ordinal-0 Pod is not ready, the StatefulSet cannot be scaled out, which means failover is impossible.

I will change the PD podManagementPolicy to Parallel before implementing PD failover; that will be a separate PR.
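
For reference, a minimal sketch of that change when building the PD StatefulSet spec (only the PodManagementPolicy field matters here; newPDStatefulSet is a hypothetical constructor and the surrounding fields are elided):

```go
package member

import appsv1 "k8s.io/api/apps/v1"

// newPDStatefulSet builds the PD StatefulSet with Parallel pod
// management so an unready ordinal-0 Pod no longer blocks scale-out.
func newPDStatefulSet() *appsv1.StatefulSet {
	return &appsv1.StatefulSet{
		Spec: appsv1.StatefulSetSpec{
			// The default is OrderedReadyPodManagement, which waits for
			// each Pod to be Running and Ready before creating the next.
			PodManagementPolicy: appsv1.ParallelPodManagement,
			// ... replicas, selector, template, etc. elided ...
		},
	}
}
```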

weekface commented Aug 31, 2018

Another issue: with the current PD start script, a failed-over ordinal-0 PD starts as a brand-new PD cluster, because whenever the ordinal is 0 the script starts PD with the initial-cluster option and a new, empty data dir.

Solution:

Change the PD podManagementPolicy to Parallel.

Add a bootstrapping annotation key to PD's ordinal-0 Pod to indicate whether the cluster is bootstrapping, and a replicas annotation to record the replica count of the current PD cluster:

  • if no PVC with an ordinal greater than 0 exists, set bootstrapping to true
  • otherwise, set bootstrapping to false

Modify the PD start script (see the sketch after this list):

  • if the ordinal is greater than 0 and the cluster is bootstrapping, verify that the cluster is working before starting
  • if the ordinal is 0 and bootstrapping is true, start this member with the initial-cluster option; otherwise, start it with the join option
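
A hedged Go rendering of that start decision (in practice this logic lives in a shell start script inside the Pod; pdStartArgs and waitUntilClusterWorking are hypothetical helpers):

```go
package member

import "strings"

// waitUntilClusterWorking is a placeholder for polling PD's /pd/health
// endpoint until the cluster reports healthy.
func waitUntilClusterWorking() error { return nil }

// pdStartArgs decides how a PD member with the given ordinal should
// start, based on the bootstrapping annotation described above.
func pdStartArgs(ordinal int, bootstrapping bool, peers []string) ([]string, error) {
	if ordinal > 0 && bootstrapping {
		// While the cluster is bootstrapping, non-zero ordinals must
		// verify the cluster is working before they start.
		if err := waitUntilClusterWorking(); err != nil {
			return nil, err
		}
	}
	if ordinal == 0 && bootstrapping {
		// Ordinal 0 of a brand-new cluster bootstraps with initial-cluster.
		return []string{"--initial-cluster=" + strings.Join(peers, ",")}, nil
	}
	// Every other case (including a replacement ordinal-0 Pod created
	// during failover) joins the existing cluster with the join option
	// instead of bootstrapping a new one.
	return []string{"--join=" + strings.Join(peers, ",")}, nil
}
```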

@tennix @onlymellb @xiaojingchen PTAL

@weekface weekface added the status/WIP Issue/PR is being worked on label Sep 1, 2018
@weekface weekface mentioned this issue Sep 3, 2018
yahonda pushed a commit that referenced this issue Dec 27, 2021
zh: add supplementary notes in restore
yahonda pushed a commit that referenced this issue Dec 27, 2021
* Update backup-to-s3.md

* en: add IAM ENV for S3 backup and restore

* update from #57

* Apply suggestions from code review

Co-authored-by: Lilian Lee <lilin@pingcap.com>