PD automatic failover #57

Closed
weekface opened this issue Aug 27, 2018 · 2 comments · Fixed by #74

weekface commented Aug 27, 2018

When a PD peer has been unhealthy for a while (this can be obtained from the PD cluster's /pd/health API), the operator should (sketched in the code below):

  1. mark this peer as a failure member
  2. invoke the deleteMember API to remove this member from the PD cluster
  3. increase the replicas to add a new PD peer
  4. repeatedly delete the PVC and Pod of this PD peer so that the StatefulSet creates a new PD peer with the same ordinal, without reusing the tombstoned PV
  5. decrease the replicas once all members are ready again

part of #47
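
A minimal Go sketch of this loop, assuming hypothetical types and helpers (pdFailover, PDClient, deletePVCAndPod, and the TidbCluster fields here are stand-ins, not the actual tidb-operator API):

```go
package failover

type PDMember struct {
	Name   string
	ID     uint64
	Health bool // derived from PD's /pd/health endpoint
}

type TidbCluster struct {
	Replicas       int32
	Members        []PDMember
	FailureMembers map[string]uint64 // step 1: member name -> member ID
}

// PDClient abstracts the PD cluster API used by the operator.
type PDClient interface {
	DeleteMember(name string) error
}

type pdFailover struct {
	pd PDClient
	// deletePVCAndPod is retried by the control loop until the
	// StatefulSet has recreated the Pod on a fresh PV (step 4).
	deletePVCAndPod func(podName string) error
}

func (f *pdFailover) Failover(tc *TidbCluster) error {
	for _, m := range tc.Members {
		if m.Health {
			continue
		}
		// 1. Mark this peer as a failure member.
		tc.FailureMembers[m.Name] = m.ID

		// 2. Remove the member from the PD cluster via the deleteMember API.
		if err := f.pd.DeleteMember(m.Name); err != nil {
			return err
		}

		// 3. Increase replicas so the StatefulSet adds a new PD peer.
		tc.Replicas++

		// 4. Delete the failed peer's PVC and Pod; the StatefulSet
		// recreates the Pod with the same ordinal on a fresh PV instead
		// of the tombstoned one.
		if err := f.deletePVCAndPod(m.Name); err != nil {
			return err
		}
	}
	// 5. A later reconcile pass decreases Replicas again once every
	// member reports healthy.
	return nil
}
```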

@weekface weekface added the enhancement New feature or request label Aug 27, 2018
@weekface weekface added this to the v0.2 milestone Aug 27, 2018
@weekface weekface self-assigned this Aug 27, 2018

weekface commented Aug 31, 2018

I am working on this feature and encountered an issue during test verification:

PD's StatefulSet podManagementPolicy is OrderedReady, so if the ordinal-0 Pod is not ready, the StatefulSet cannot be scaled out, which means failover is impossible.

I will change the PD podManagementPolicy to Parallel before implementing PD failover; that will be a separate PR.
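
For reference, a minimal sketch of that change when building the PD StatefulSet spec (only the PodManagementPolicy field matters here; newPDStatefulSet is a hypothetical constructor and the surrounding fields are elided):

```go
package member

import appsv1 "k8s.io/api/apps/v1"

// newPDStatefulSet builds the PD StatefulSet with Parallel pod
// management so an unready ordinal-0 Pod no longer blocks scale-out.
func newPDStatefulSet() *appsv1.StatefulSet {
	return &appsv1.StatefulSet{
		Spec: appsv1.StatefulSetSpec{
			// The default is OrderedReadyPodManagement, which waits for
			// each Pod to be Running and Ready before creating the next.
			PodManagementPolicy: appsv1.ParallelPodManagement,
			// ... replicas, selector, template, etc. elided ...
		},
	}
}
```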

weekface commented Aug 31, 2018

Another issue: with the current PD start script, a failed-over ordinal-0 PD starts as a brand-new PD cluster, because whenever the ordinal is 0 the script starts PD with the initial-cluster option and a new, empty data dir.

Solution:

Change the PD podManagementPolicy to Parallel.

Add a bootstrapping annotation key to PD's ordinal-0 Pod to indicate whether the cluster is bootstrapping, and a replicas annotation to record the replica count of the current PD cluster:

  • if no PVC with an ordinal greater than 0 exists, set bootstrapping to true
  • otherwise, set bootstrapping to false

Modify the PD start script (see the sketch after this list):

  • if the ordinal is greater than 0 and the cluster is bootstrapping, verify that the cluster is working before starting
  • if the ordinal is 0 and bootstrapping is true, start this member with the initial-cluster option; otherwise, start it with the join option
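
A hedged Go rendering of that start decision (in practice this logic lives in a shell start script inside the Pod; pdStartArgs and waitUntilClusterWorking are hypothetical helpers):

```go
package member

import "strings"

// waitUntilClusterWorking is a placeholder for polling PD's /pd/health
// endpoint until the cluster reports healthy.
func waitUntilClusterWorking() error { return nil }

// pdStartArgs decides how a PD member with the given ordinal should
// start, based on the bootstrapping annotation described above.
func pdStartArgs(ordinal int, bootstrapping bool, peers []string) ([]string, error) {
	if ordinal > 0 && bootstrapping {
		// While the cluster is bootstrapping, non-zero ordinals must
		// verify the cluster is working before they start.
		if err := waitUntilClusterWorking(); err != nil {
			return nil, err
		}
	}
	if ordinal == 0 && bootstrapping {
		// Ordinal 0 of a brand-new cluster bootstraps with initial-cluster.
		return []string{"--initial-cluster=" + strings.Join(peers, ",")}, nil
	}
	// Every other case (including a replacement ordinal-0 Pod created
	// during failover) joins the existing cluster with the join option
	// instead of bootstrapping a new one.
	return []string{"--join=" + strings.Join(peers, ",")}, nil
}
```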

@tennix @onlymellb @xiaojingchen PTAL

@weekface weekface added the status/WIP Issue/PR is being worked on label Sep 1, 2018
@weekface weekface mentioned this issue Sep 3, 2018
yahonda pushed a commit that referenced this issue Dec 27, 2021
zh: add supplementary notes in restore
yahonda pushed a commit that referenced this issue Dec 27, 2021
* Update backup-to-s3.md

* en: add IAM ENV for S3 backup and restore

* update from #57

* Apply suggestions from code review

Co-authored-by: Lilian Lee <lilin@pingcap.com>