If all server nodes go down, they can't be clustered again #526
Comments
You are probably having the servers gracefully leave, which removes them from the Raft peer set. Once 2 of the servers are removed, you lose quorum and the cluster goes into an outage. Outage recovery is done via: http://consul.io/docs/guides/outage.html If you hard stop them, or they crash, or power fails, a new leader will be elected when they restart. A graceful leave of all servers will cause an outage.
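For reference, the manual recovery described in that guide (on the pre-0.7 Consul versions discussed in this thread) works by stopping every server and writing the full server list into `raft/peers.json` under each server's data dir before restarting. The addresses below are placeholders, not from this thread; 8300 is Consul's default server (Raft) port:

```json
[
  "10.0.1.10:8300",
  "10.0.1.11:8300",
  "10.0.1.12:8300"
]
```

Every server must get the same file, and all of them should be restarted afterwards.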
I'm not sure about the original poster, but this issue still seems to be very much in play. Your explanation makes sense, but clearing the peers file didn't bring the cluster back for me.
@dellis23 I'm not sure what your situation is, but there are two different mechanisms for outage recovery, both covered in the outage guide. Touching the peers file does not do anything, nor does killing the processes once the outage is already happening. Hope that helps!
Thanks @armon. I think the other component to this was that the serf local snapshot had marked the servers as having left. This was due to me misreading "gracefully leaving" as being something that you would want to do when shutting down a server. I assumed this was the preferred way -- that the server would come back up just as gracefully when restarted.
@dellis23 It is a bit confusing, I agree. Graceful leave means "I intend to leave this cluster, and when I do, do not mark it as a failure." This means the node and all of its services are deregistered instead of being marked as failed. For servers, they are also removed from the Raft peer set to avoid a quorum loss.
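In config terms, whether a shutdown signal triggers a graceful leave is controlled by two agent options. This is a sketch only: both options exist in the 0.5.x/0.6.x versions discussed here, but their defaults have changed across Consul releases, so check the docs for your version. For a server you generally want a stop to be a hard stop:

```json
{
  "server": true,
  "leave_on_terminate": false,
  "skip_leave_on_interrupt": true
}
```

With `leave_on_terminate` false, SIGTERM (what most init systems send on stop) does not trigger a graceful leave; with `skip_leave_on_interrupt` true, neither does Ctrl-C (SIGINT). The server then comes back into the Raft peer set on restart instead of having removed itself.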
I was able to get around this issue by first stopping consul, then deleting the entire raft folder (the folder that holds peers.json) and restarting consul.
@thedjEJ That will cause complete data loss, and is not really recommended
@armon In my testing, whenever the peers.json file is empty (it contains "null"), deleting the folder and then starting each server again seems to bring the servers back to quorum. I have not seen files other than peers.json in the raft folder, though. Is it also inadvisable to delete only the peers.json file when it is null? The servers seem to discover each other correctly once this is done. I am on Consul 0.5.0.
@thedjEJ The raft folder holds more than peers.json: the Raft log and snapshots live there too, so deleting it discards the cluster's state data, not just the peer set.
Thanks @armon. I think I understand this better now, and nuking the directory is DEFINITELY not a good idea. Not leaving gracefully is the issue though, and I will look into not letting the servers do this, especially when they might once again join the cluster.
Removing /var/lib/consul/raft/* on all consul hosts helped. |
I am seeing this issue with 0.6.1 right now. If I have three machines with identical configs, except that one has -bootstrap-expect=3 and it's the last one to boot (luck of the draw), the cluster never forms.

I have a crummy systemd script (this is on Ubuntu 15.10) to start the system, but the shutdown has no special command, so, as per above, consul is leaving gracefully. If I systemctl stop consul on all three machines, then start the first one (with -bootstrap-expect) first, then the other two, the "No cluster leader" situation persists. I really, really like consul, but I can't babysit a startup rendezvous problem.

The only reliable way I can get this to work is to manually start all three servers without a start_join list, then tell one to join the other two with consul join (two IPs). Can anyone give me some guidance? I'd be glad to post any sanitized log/config file etc. TIA, G.
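For what it's worth, a minimal systemd unit along these lines (the paths and flags are assumptions, not taken from the poster's setup) avoids an accidental graceful leave on stop: by default Consul only leaves gracefully on SIGINT, and systemd's default KillSignal is SIGTERM, which Consul treats as a hard stop unless leave_on_terminate is set:

```
[Unit]
Description=Consul server agent
After=network.target

[Service]
ExecStart=/usr/local/bin/consul agent -config-dir=/etc/consul.d
; SIGTERM is a hard stop for Consul unless leave_on_terminate=true,
; so the server stays in the Raft peer set across restarts
KillSignal=SIGTERM
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

The important part is not adding an ExecStop that runs `consul leave` on a server node.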
It's these messages that make my debugging nose twitch:
Update: And when I run consul members I get three nodes of type Server that are Alive... but endless "No cluster leader" messages in the three logs.
It should be noted that the machines don't have a private network and I am NOT using the -WAN option. I'm looking at that now. I just nuked /tmp/consul/* and restarted them, and did a manual join and they're still confused. Poor guys! |
Update: I got it working, as per pwilczynskiclearcode's answer:
That's the only config I can get up and working. The start_join setting leads to server election loop land.
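The referenced answer isn't quoted above, so as a hedged sketch only: the usual fix for this startup rendezvous problem is retry_join, which keeps retrying the join until it succeeds instead of failing once at boot the way start_join can when peers aren't up yet. The addresses are placeholders:

```json
{
  "server": true,
  "bootstrap_expect": 3,
  "retry_join": ["10.0.1.10", "10.0.1.11", "10.0.1.12"]
}
```

With identical configs on all three servers, boot order then stops mattering: each server retries until enough peers are up for bootstrap_expect to be satisfied and an election to happen.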
My consul version is 0.4.1. Here is my consul config file:
Now contest01~contest03 form a cluster. If I stop them all and then start them again, they will never be clustered again. No matter how I restart them, or call the join method, they just can't be clustered, unless I clear all the data in the data dir.
The log is like this: