
Implement possibility of manual failover to specific node #11

Closed
CyberDem0n opened this issue Aug 27, 2015 · 3 comments
Comments

@CyberDem0n (Member)

No description provided.

@jberkus (Contributor) commented Sep 1, 2015

So, I'm specifically working on this. It seems like there are two potential paths forward:

  1. write the information about the manual failover to the shared information service (SIS)
  2. send signals to the API on each node about the failover

SIS Approach:
Advantages:

  • client only needs to access the SIS
  • easy to integrate with SIS-based monitoring
  • consistency with other signals we may want to pass via the SIS, e.g. a "nogovernor" flag
  • allows signalling other replicas not to try to grab the master flag
  • requires only a very simple client script

Disadvantages:

  • very asynchronous; hard to know when failover is complete except by polling
  • zero troubleshooting info if failover doesn't happen
  • requires admins to write to the SIS, which is a new pattern

Node API approach
Advantages:

  • synchronous: can find out how failover is going immediately
  • doesn't require touching the main failover loop
  • makes targeting specific nodes fairly obvious

Disadvantages:

  • requires securing the node API, since it could now be used to change databases
  • the client would need to be heavyweight; it may have to connect to all database nodes to make the failover go as planned
  • possible synchronization issues between data in the SIS and what individual nodes are doing

Overall, it seems to me that doing this via the SIS makes more sense. Discussion?
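The "very simple client script" claimed as an advantage above really could be a single HTTP PUT. A minimal sketch, assuming etcd's v2 HTTP API as the SIS; the key layout (`/service/<scope>/new-master`) and function names are hypothetical, not anything Patroni actually defines:

```python
# Sketch of the SIS-approach client script. Assumes etcd (v2 HTTP API)
# as the SIS; the key layout below is hypothetical.
import urllib.parse
import urllib.request

def build_failover_request(scope, target_node):
    """Build the etcd key path and request body for a manual-failover request."""
    key = f"/v2/keys/service/{scope}/new-master"  # hypothetical key layout
    body = urllib.parse.urlencode({"value": target_node})
    return key, body

def request_failover(etcd_url, scope, target_node):
    """PUT the new-master key; the governor loops then pick it up by polling."""
    key, body = build_failover_request(scope, target_node)
    req = urllib.request.Request(etcd_url + key, data=body.encode(), method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.status in (200, 201)
```

Note that this script returns as soon as the key is written, which is exactly the asynchrony listed as a disadvantage: confirming that the failover actually completed still requires polling the SIS.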

@jberkus (Contributor) commented Sep 1, 2015

The SIS approach seems like it needs only one extra piece of information: a "new-master" key for the cluster repo. The way it would work is:

  1. manual client writes the "new-master" key.
  2. the current master, at the beginning of its governor loop, checks for a new-master key. If the new-master key doesn't name itself, it shuts down the PostgreSQL server and releases the master lock key.
  3. at the beginning of each replica's loop is a check for the new-master key. If present, this key preempts all other failover logic, and the named server starts trying to acquire the master lock and become the master.
  4. once the new replica has become the master, it removes the new-master key.
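The four steps above boil down to one extra check at the top of each governor loop. A minimal sketch with a plain dict standing in for the SIS client; every name here (`sis`, `demote`, `try_promote`, the key names) is hypothetical:

```python
# Sketch of steps 2-4 above. A dict stands in for the SIS client;
# demote/try_promote stand in for the actual PostgreSQL actions.

def master_loop_step(sis, my_name, demote):
    """Step 2: the current master checks the new-master key each loop."""
    target = sis.get("new-master")
    if target is not None and target != my_name:
        demote()                       # shut down the PostgreSQL server
        sis.pop("master-lock", None)   # release the master lock key
        return "demoted"
    return "master"

def replica_loop_step(sis, my_name, try_promote):
    """Steps 3-4: the named replica takes the lock; other replicas hold off."""
    target = sis.get("new-master")
    if target is None:
        return "normal-failover-logic"
    if target != my_name:
        return "wait"                  # key preempts ordinary failover logic
    if try_promote():                  # acquire master lock and promote
        del sis["new-master"]          # step 4: clean up the key
        return "promoted"
    return "retry"
```

Issue (c) below shows up directly in this sketch: if `try_promote` keeps failing after the old master has already demoted, the cluster is stuck on "retry" with no master until something intervenes.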

Issues/additions:

a. the old master needs to check the new master's metadata in the SIS, to make sure that node is able to take over, before shutting down.

b. we need some way to tell the user why a manual failover did not occur, if it fails.

c. if the new-master is unable to promote after the old master is shut down, what should the system do? And how?

@CyberDem0n (Member, Author)

This feature is covered by: #56, #67 and #82

lukedirtwalker added a commit to lukedirtwalker/patroni that referenced this issue Mar 29, 2019
Simulate leader node failure by killing leader consul and patroni.