Consul server got stuck in "leaving" state on some other servers of the same cluster after maintenance #13379

Open
usovamaria opened this issue Jun 7, 2022 · 7 comments
Labels
theme/federation-usability Anything related to Federation type/bug Feature does not function as expected

Comments

@usovamaria

Overview of the Issue

During maintenance, some servers may leave a multi-server cluster (e.g. through a shutdown, or by losing network connectivity when ports are closed with iptables). We are hitting a bug where, after such a server re-joins, some servers in the cluster show it with leaving status while other servers mark it as a follower. This seems to be a bug that occurs when the force-leave operation is not applied.

Reproduction Steps

Steps to reproduce this issue:

  1. Create a cluster with 9 server nodes.
  2. Shut down one server, or close all of its ports using iptables, and wait.
  3. Turn the server back on, or reopen the ports.
  4. Check consul operator raft list-peers on the neighbouring servers. Some of them see the server in the 'leaving' state while others see it as a follower. The stuck server itself thinks it is a follower.
  5. Check the Consul logs on the neighbours. They contain "Initiating push/pull sync with" messages for both WAN and LAN, and everything can appear to be fine.
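
The comparison in step 4 can be automated. Below is a minimal Python sketch, assuming the output of the listing command has already been collected from each neighbour as text. The function names, the host names and the sample tables are all hypothetical and made up for illustration (they are not captured from a real cluster), and the parser assumes a whitespace-aligned table whose cells contain no spaces.

```python
from collections import defaultdict


def parse_table(text, key_col="Node", value_col="State"):
    """Parse a whitespace-aligned CLI table into {key: value}.

    Assumption: the first non-empty line is the header and no cell
    contains spaces.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    key_idx = header.index(key_col)
    value_idx = header.index(value_col)
    result = {}
    for line in lines[1:]:
        cells = line.split()
        result[cells[key_idx]] = cells[value_idx]
    return result


def find_disagreements(views):
    """Take {observer: {node: state}} and return {node: {state: [observers]}}
    for every node on whose state the observers do not agree."""
    seen = defaultdict(lambda: defaultdict(list))
    for observer, table in views.items():
        for node, state in table.items():
            seen[node][state].append(observer)
    return {
        node: {state: sorted(obs) for state, obs in states.items()}
        for node, states in seen.items()
        if len(states) > 1
    }


# Made-up sample tables standing in for what two neighbours report.
VIEW_FROM_B = """
Node      ID    Address        State     Voter  RaftProtocol
server-a  id-a  10.0.0.1:8300  follower  true   3
server-b  id-b  10.0.0.2:8300  leader    true   3
"""
VIEW_FROM_E = """
Node      ID    Address        State     Voter  RaftProtocol
server-a  id-a  10.0.0.1:8300  leaving   true   3
server-b  id-b  10.0.0.2:8300  leader    true   3
"""

views = {
    "server-b": parse_table(VIEW_FROM_B),
    "server-e": parse_table(VIEW_FROM_E),
}
print(find_disagreements(views))
```

Any node that appears in the result is one that the neighbours disagree about, which is the symptom described in this issue.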

Consul logs for normally re-joined server and failed server on other servers

re-joined server
Jun 01 17:31:24 normal_server consul[39700]:     2022-06-01T17:31:24.719+0300 [ERROR] agent.server: failed to reconcile member: member=“{re_joining_server_info}” error="leadership lost while committing log"
Jun 01 17:31:26 normal_server  consul[39700]:     2022-06-01T17:31:26.655+0300 [INFO]  agent.server: member joined, marking health alive: member=re_joining_server
Jun 01 17:31:36 normal_server  consul[39700]:     2022-06-01T17:31:36.653+0300 [INFO]  agent.server.autopilot: Promoting server: id=id address=ip_address:8300 name=re_joining_server
Jun 01 17:31:41 normal_server  consul[39700]:     2022-06-01T17:31:41.137+0300 [DEBUG] agent.server.memberlist.wan: memberlist: Initiating push/pull sync with: re_joining_server
Jun 01 17:39:39 normal_server consul[39700]:     2022-06-01T17:39:39.043+0300 [INFO]  agent.server: New leader elected: payload=re_joining_server
Jun 01 17:40:43 normal_server  consul[39700]:     2022-06-01T17:40:43.517+0300 [DEBUG] agent.router.manager: Rebalanced servers, new active server: number_of_servers=3 active_server="re_joining_server"

During the maintenance, this server was in the "left" state, as if it had been force-left by the other servers, and it then re-joined the cluster successfully.
The second server was not force-left; instead, during the maintenance the other servers logged "pinging server failed" and "connection timed out" messages for it. Moreover, after some time the healthy servers logged the message "Rebalanced servers, new active server".
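
To line up what each neighbour logged about a given member, the log lines can be reduced to structured events. The following is a rough Python sketch, assuming the line layout seen in the excerpt above (a syslog prefix followed by timestamp, level, component and message). The regular expression and the function names are my own and hypothetical, not part of Consul.

```python
import re

# Matches the Consul part of a line, ignoring the syslog prefix.
LOG_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{4})\s+"
    r"\[(?P<level>[A-Z]+)\]\s+"
    r"(?P<component>[\w.]+):\s+"
    r"(?P<message>.*)"
)


def parse_line(line):
    """Return a dict with ts, level, component and message, or None."""
    match = LOG_RE.search(line)
    return match.groupdict() if match else None


def events_for(lines, member):
    """Keep only the parsed events whose message mentions the member."""
    parsed = (parse_line(line) for line in lines)
    return [event for event in parsed if event and member in event["message"]]


# Two lines copied from the excerpt above, plus one line that does not parse.
sample = [
    "Jun 01 17:31:26 normal_server  consul[39700]:     2022-06-01T17:31:26.655+0300 [INFO]  agent.server: member joined, marking health alive: member=re_joining_server",
    "Jun 01 17:39:39 normal_server consul[39700]:     2022-06-01T17:39:39.043+0300 [INFO]  agent.server: New leader elected: payload=re_joining_server",
    "a line without a Consul log record",
]

for event in events_for(sample, "re_joining_server"):
    print(event["ts"], event["level"], event["component"], event["message"])
```

Running the same filter over the logs of a server that sees the member as a follower and one that sees it as leaving should make it easier to spot which event is missing on the second one.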

Operating system and Environment details

Ubuntu 20.04, Consul v1.9.5

@Amier3 Amier3 added the type/bug Feature does not function as expected label Jun 7, 2022
@Amier3
Contributor

Amier3 commented Jun 8, 2022

Hey @usovamaria

Quick question to help me understand this more. When you say

Moreover, after some period of time there's message Rebalanced servers, new active server on the healthy servers.

Does this mean that after some period of time, the leaving server was successfully added back into the cluster? Or were the other servers' logs saying "Rebalanced servers" even though the leaving server wasn't successfully back in the cluster?

@Amier3 Amier3 added the theme/federation-usability Anything related to Federation label Jun 8, 2022
@usovamaria
Author

Hi @Amier3
Yeah, the description was a bit confusing. The case is:
One server (let's call it A-server) loses connectivity and re-joins the cluster later. The Consul logs on this server, and consul operator raft list-peers run on it, show that the server is fine and has successfully re-joined the cluster.
Servers B, C and D acknowledge the re-join and agree that the cluster has a leader and followers. But servers E and F mark A-server as leaving. Once A-server is restarted (using service consul restart), servers E and F mark A-server as a follower.

So, answering your questions:

  1. No, the leaving server can't self-heal(?) and complete the re-join process on some of the other healthy servers.
  2. Yes, all of the servers (even those that marked A-server as leaving) were successfully rebalanced according to the Consul logs and could 'see' each other.

@usovamaria
Author

@Amier3 Hi! Any news here? Still affecting us :(

@maxb

maxb commented Dec 15, 2022

@hsimon-hashicorp Hello, since you mentioned in another ticket that Amier is no longer with HashiCorp, would you be able to help get this one rerouted too?

@usovamaria
Author

@jkirschner-hashicorp Hi. I don't know who to mention here, but we have been experiencing even more problems related to this issue since upgrading to Consul v1.14.3.

Even short-term network flaps affect us this way. For example, this happens during maintenance on the host-level network, because our virtual machines run on those hosts.

@usovamaria
Author

Hi. We did some research on related issues, and here is what we discovered:

Similar issue reported on 1.9.5: link
Inconsistent behaviour on 1.4.4: link

It looks like the root cause is in the serf component, where a couple of issues have been reported as well:
one of them

Here's a PR created for this case, but nothing has happened with it. @rboyer was tagged in that issue; maybe you could investigate this?

@kemko

kemko commented Dec 8, 2023

Hey @david-yu, @jkirschner-hashicorp, @mkeeler, @rboyer,

Sorry for the ping, but it looks like this issue's gone a bit quiet, and the problem seems to still be there. Any chance you guys can help get this on the schedule? Thanks!
