
Implement possibility of manual failover to specific node #11

Closed
CyberDem0n opened this issue Aug 27, 2015 · 3 comments
Comments

@CyberDem0n (Member)

No description provided.

@jberkus (Contributor) commented Sep 1, 2015

So, I'm specifically working on this. It seems like there are two potential paths forward:

  1. write the information about the manual failover to the shared information service (SIS)
  2. send signals to the API on each node about the failover

SIS Approach:
Advantages:

  • client only needs to access the SIS
  • easy to integrate with SIS-based monitoring
  • consistency with other signals we may want to pass via the SIS, e.g. a "nogovernor" flag
  • allows signalling other replicas not to try to grab the master flag
  • requires only a very simple client script

Disadvantages:

  • very asynchronous; hard to know when failover is complete except by polling
  • zero troubleshooting info if failover doesn't happen
  • requires admins to write to the SIS, which is a new pattern

Node API approach
Advantages:

  • synchronous: can find out how failover is going immediately
  • doesn't require touching the main failover loop
  • makes targeting specific nodes fairly obvious

Disadvantages:

  • requires securing the node API, since it could now be used to change databases
  • the client would need to be heavyweight; it may have to connect to all database nodes to make the failover go as planned
  • possible synchronization issues between data in the SIS and what individual nodes are doing

Overall, it seems to me that doing this via the SIS makes more sense. Discussion?
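The "very simple client script" claimed as an advantage above really could be a single HTTP PUT. A minimal sketch, assuming etcd's v2 HTTP API as the SIS; the key layout (`/service/<scope>/new-master`) and function names are hypothetical, not anything Patroni actually defines:

```python
# Sketch of the SIS-approach client script. Assumes etcd (v2 HTTP API)
# as the SIS; the key layout below is hypothetical.
import urllib.parse
import urllib.request

def build_failover_request(scope, target_node):
    """Build the etcd key path and request body for a manual-failover request."""
    key = f"/v2/keys/service/{scope}/new-master"  # hypothetical key layout
    body = urllib.parse.urlencode({"value": target_node})
    return key, body

def request_failover(etcd_url, scope, target_node):
    """PUT the new-master key; the governor loops then pick it up by polling."""
    key, body = build_failover_request(scope, target_node)
    req = urllib.request.Request(etcd_url + key, data=body.encode(), method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.status in (200, 201)
```

Note that this script returns as soon as the key is written, which is exactly the asynchrony listed as a disadvantage: confirming that the failover actually completed still requires polling the SIS.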

@jberkus (Contributor) commented Sep 1, 2015

The SIS approach seems like it needs only one extra piece of information: a "new-master" key for the cluster repo. The way it would work is:

  1. manual client writes the "new-master" key.
  2. the current master, at the beginning of its governor loop, checks for a new-master key. If the new-master key doesn't name itself, it shuts down the PostgreSQL server and releases the master lock key.
  3. at the beginning of each replica's loop is a check for the new-master key. If present, this key preempts all other failover logic, and the named server starts trying to acquire the master lock and become the master.
  4. once the new replica has become the master, it removes the new-master key.
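The four steps above boil down to one extra check at the top of each governor loop. A minimal sketch with a plain dict standing in for the SIS client; every name here (`sis`, `demote`, `try_promote`, the key names) is hypothetical:

```python
# Sketch of steps 2-4 above. A dict stands in for the SIS client;
# demote/try_promote stand in for the actual PostgreSQL actions.

def master_loop_step(sis, my_name, demote):
    """Step 2: the current master checks the new-master key each loop."""
    target = sis.get("new-master")
    if target is not None and target != my_name:
        demote()                       # shut down the PostgreSQL server
        sis.pop("master-lock", None)   # release the master lock key
        return "demoted"
    return "master"

def replica_loop_step(sis, my_name, try_promote):
    """Steps 3-4: the named replica takes the lock; other replicas hold off."""
    target = sis.get("new-master")
    if target is None:
        return "normal-failover-logic"
    if target != my_name:
        return "wait"                  # key preempts ordinary failover logic
    if try_promote():                  # acquire master lock and promote
        del sis["new-master"]          # step 4: clean up the key
        return "promoted"
    return "retry"
```

Issue (c) below shows up directly in this sketch: if `try_promote` keeps failing after the old master has already demoted, the cluster is stuck on "retry" with no master until something intervenes.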

Issues/additions:

a. the old master needs to check the new master's metadata in the SIS, to make sure that node is able to take over, before shutting down.

b. we need some way to tell the user why a manual failover did not occur, if it fails.

c. if the new-master is unable to promote after the old master is shut down, what should the system do? And how?

@CyberDem0n (Member, Author)

This feature is covered by: #56, #67 and #82

lukedirtwalker added a commit to lukedirtwalker/patroni that referenced this issue Mar 29, 2019
Simulate leader node failure by killing leader consul and patroni.