Consul server got stuck in "leaving" state on some other servers of the same cluster after maintenance #13379
Comments
Hey @usovamaria Quick question to help me understand this more. When you say
Does this mean that after some period of time, the |
Hi @Amier3 So, answering your questions:
@Amier3 Hi! Any news here? Still affecting us :(
@hsimon-hashicorp Hello, as you mentioned in another ticket that Amier is no longer with HashiCorp, would you be able to help get this one rerouted too?
@jkirschner-hashicorp Hi. I don't know who to mention here, but we are experiencing even more problems related to this issue after upgrading to Consul v1.14.3. Even short-term network flaps affect us this way. E.g., this happens when there's maintenance on the host-level network of the hosts that run our virtual machines.
Hi. We did some research on related issues, and here is what we discovered. A similar issue was reported on 1.9.5: link. It looks like the root cause is in the serf component, where there are also a couple of reported issues. Here's a PR created for this case, but nothing has happened with it. @rboyer was tagged in this issue; maybe you could investigate this?
Hey @david-yu, @jkirschner-hashicorp, @mkeeler, @rboyer, Sorry for the ping, but it looks like this issue's gone a bit quiet, and the problem seems to still be there. Any chance you guys can help get this on the schedule? Thanks!
Overview of the Issue
During maintenance, some servers can leave a multi-server cluster (e.g. by being shut down or by losing network connectivity through iptables). We're experiencing a bug where re-joined servers have the 'leaving' status on some servers of the cluster while other servers mark them as followers. This seems to be a bug where the force-leave operation is not applied.
Reproduction Steps
Steps to reproduce this issue, e.g.:
With consul operator raft list-peers on neighbours, some servers see the server in the 'leaving' state while others see it as a follower. The stuck server thinks that it is a follower: it reports "Initiating push/pull sync with" for wan/lan, and everything can seem to be ok.
Consul logs for normally re-joined server and failed server on other servers
re-joined server
During the maintenance the server was in state "left", as if it had been force-left by other servers, and it successfully re-joined the cluster.
The second server was not force-left, but during the maintenance other servers got the messages "pinging server failed" and "connection timed out". Moreover, after some period of time the message "Rebalanced servers, new active server" appears on the healthy servers.
Operating system and Environment details
Ubuntu 20.04, Consul v1.9.5
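
The per-server comparison described in the reproduction steps could be automated roughly as follows. This is only a sketch: it assumes the state each server reports for every other server has already been collected into a dictionary (for example by running the CLI on each server and parsing its output, which is not shown here). The server names and the function name are hypothetical.

```python
from collections import defaultdict


def find_status_disagreements(views):
    """Return members whose reported state differs between observers.

    `views` maps an observing server name to a dict of
    {member name: state string as seen by that observer}.
    """
    seen = defaultdict(dict)  # member -> {observer: state}
    for observer, members in views.items():
        for member, state in members.items():
            seen[member][observer] = state
    # Keep only members for which observers report more than one distinct state.
    return {
        member: by_observer
        for member, by_observer in seen.items()
        if len(set(by_observer.values())) > 1
    }


# Hypothetical data mirroring the report: server-1 still sees server-3 as
# 'leaving', while server-2 and server-3 itself see it as a follower.
views = {
    "server-1": {"server-1": "leader", "server-2": "follower", "server-3": "leaving"},
    "server-2": {"server-1": "leader", "server-2": "follower", "server-3": "follower"},
    "server-3": {"server-1": "leader", "server-2": "follower", "server-3": "follower"},
}

for member, by_observer in find_status_disagreements(views).items():
    print(member, by_observer)
```

Running a check like this after each maintenance window would flag a stuck server even when its own logs look healthy.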