Non voting servers get elected as leader in a mixed OSS/Enterprise version cluster #6979

Closed
tionebsalocin opened this issue Dec 23, 2019 · 6 comments
Labels
theme/internals (Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics), type/docs (Documentation needs to be created/updated/clarified)

Comments

@tionebsalocin (Contributor)

Overview of the Issue

In a cluster running the open source version of Consul, I added an Enterprise server with the non_voting_server option. After a few elections, this server ended up as leader of the cluster, which was obviously not expected.
This happened in production while we were progressively introducing the Enterprise server.
The non-voting option was set precisely to avoid any impact at the beginning of the migration.
Note: the logs and info below come from a test platform on which we reproduced the issue.

Reproduction Steps

Steps to reproduce this issue:

  1. Create a cluster with 3 client nodes and 3 server nodes: 2 servers running Consul 1.5.1 OSS and 1 running Consul 1.5.1 Enterprise
  2. Add the non_voting_server option to the Enterprise server's config file and restart Consul on that server (see the config sketch after this list)
  3. Stop Consul on one of the two OSS servers to force an election
  4. Repeat until the Enterprise server gets elected
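
For reference, a minimal sketch of the Enterprise server configuration used in step 2, assuming a JSON config file such as /etc/consul.d/server.json (the path is only an example); non_voting_server is the Consul Enterprise option this issue is about:

{
  "server": true,
  "non_voting_server": true
}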

Consul info for both OSS and ENT Server

OSS Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = criteo8
	revision = 
	version = 1.5.1
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 172.21.185.218:8300
	server = true
raft:
	applied_index = 179
	commit_index = 179
	fsm_pending = 0
	last_contact = 0
	last_log_index = 179
	last_log_term = 24
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Nonvoter ID:3199fa0a-f10b-c3af-37f0-2ad73553769a Address:172.21.180.196:8300} {Suffrage:Voter ID:a411cf76-a4f9-ae24-3afd-5e642763eed0 Address:172.21.185.218:8300} {Suffrage:Voter ID:4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f Address:172.21.233.101:8300}]
	latest_configuration_index = 173
	num_peers = 1
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 24
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 100
	max_procs = 2
	os = linux
	version = go1.11.13
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 9
	failed = 0
	health_score = 2
	intent_queue = 0
	left = 0
	member_time = 12
	members = 6
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 4
	members = 3
	query_queue = 0
	query_time = 1
Enterprise Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = criteo9
	revision = 
	version = 1.5.1+ent
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 172.21.180.196:8300
	server = true
license:
	customer = permanent
	expiration_time = 2049-12-15 13:31:19.868306147 +0000 UTC
	features = Automated Backups, Automated Upgrades, Enhanced Read Scalability, Network Segments, Redundancy Zone, Advanced Network Federation
	id = permanent
	install_id = *
	issue_time = 2019-12-23 13:31:19.868306147 +0000 UTC
	package = premium
	product = consul
	start_time = 2019-12-23 13:26:19.868306147 +0000 UTC
raft:
	applied_index = 429
	commit_index = 429
	fsm_pending = 0
	last_contact = 0
	last_log_index = 429
	last_log_term = 36
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:a411cf76-a4f9-ae24-3afd-5e642763eed0 Address:172.21.185.218:8300} {Suffrage:Voter ID:4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f Address:172.21.233.101:8300} {Suffrage:Voter ID:3199fa0a-f10b-c3af-37f0-2ad73553769a Address:172.21.180.196:8300}]
	latest_configuration_index = 295
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 36
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 117
	max_procs = 2
	os = linux
	version = go1.13.4
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 14
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 21
	members = 6
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 5
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

In this example: 3 VMs running Ubuntu, with 2 servers on Consul 1.5.1 OSS and 1 on Consul 1.5.1 Enterprise.
OSS servers are:
consul-server000-dc0 build=1.5.1criteo8 role=consul
consul-server001-dc0 build=1.5.1criteo8 role=consul
ENT server is:
consul-server002-dc0 build=1.5.1+entcriteo9 nonvoter=1 role=consul

consul members + info fragment:

consul-server000-dc0  172.21.185.218:8301  alive   server  1.5.1criteo8      2         dc0  <all>
consul-server001-dc0  172.21.233.101:8301  alive   server  1.5.1criteo8      2         dc0  <all>
consul-server002-dc0  172.21.180.196:8301  alive   server  1.5.1+entcriteo9  2         dc0  <all>
	leader = false
	leader_addr = 172.21.180.196:8300

consul members detailed for these servers

Node                  Address              Status  Tags
consul-server000-dc0  172.21.185.218:8301  alive   acls=0,build=1.5.1criteo8:,criteo_grpc_enabled=true,dc=dc0,expect=3,grpc_enabled=true,id=a411cf76-a4f9-ae24-3afd-5e642763eed0,port=8300,raft_vsn=3,role=consul,segment=<all>,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
consul-server001-dc0  172.21.233.101:8301  alive   acls=0,build=1.5.1criteo8:,criteo_grpc_enabled=true,dc=dc0,expect=3,grpc_enabled=true,id=4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f,port=8300,raft_vsn=3,role=consul,segment=<all>,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
consul-server002-dc0  172.21.180.196:8301  alive   acls=0,build=1.5.1+entcriteo9:,criteo_grpc_enabled=true,dc=dc0,expect=3,grpc_enabled=true,id=3199fa0a-f10b-c3af-37f0-2ad73553769a,nonvoter=1,port=8300,raft_vsn=3,role=consul,segment=<all>,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302

Non voting server is leader!

Log Fragments

Election log

2019/12/23 13:31:55 [WARN] raft: Heartbeat timeout from "172.21.185.218:8300" reached, starting election
2019/12/23 13:31:55 [INFO] raft: Node at 172.21.180.196:8300 [Candidate] entering Candidate state in term 35
2019/12/23 13:31:55 [ERR] raft: Failed to make RequestVote RPC to {Voter a411cf76-a4f9-ae24-3afd-5e642763eed0 172.21.185.218:8300}: EOF
2019/12/23 13:31:55 [DEBUG] raft: Votes needed: 2
2019/12/23 13:31:55 [DEBUG] raft: Vote granted from 3199fa0a-f10b-c3af-37f0-2ad73553769a in term 35. Tally: 1
2019/12/23 13:31:56 [WARN] raft: Election timeout reached, restarting election
2019/12/23 13:31:56 [INFO] raft: Node at 172.21.180.196:8300 [Candidate] entering Candidate state in term 36
2019/12/23 13:31:56 [ERR] raft: Failed to make RequestVote RPC to {Voter a411cf76-a4f9-ae24-3afd-5e642763eed0 172.21.185.218:8300}: dial tcp <nil>->172.21.185.218:8300: connect: connection refused
2019/12/23 13:31:56 [DEBUG] raft: Votes needed: 2
2019/12/23 13:31:56 [DEBUG] raft: Vote granted from 3199fa0a-f10b-c3af-37f0-2ad73553769a in term 36. Tally: 1
2019/12/23 13:31:56 [DEBUG] raft: Vote granted from 4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f in term 36. Tally: 2
2019/12/23 13:31:56 [INFO] raft: Election won. Tally: 2
2019/12/23 13:31:56 [INFO] raft: Node at 172.21.180.196:8300 [Leader] entering Leader state
2019/12/23 13:31:56 [INFO] raft: Added peer a411cf76-a4f9-ae24-3afd-5e642763eed0, starting replication
2019/12/23 13:31:56 [INFO] raft: Added peer 4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f, starting replication
2019/12/23 13:31:56 [INFO] consul: cluster leadership acquired
2019/12/23 13:31:56 [INFO] consul: New leader elected: consul-server002-dc0

Non voting server gets elected

@hanshasselberg (Member) commented Jan 13, 2020

There are two places in raft where a node enters Candidate state:

Does that seem likely to you?

@hanshasselberg (Member)

hashicorp/raft#398 makes pickServer return only voters, which fixes the second issue I mentioned.

https://github.com/hashicorp/raft/blob/72a6ed385dac0c8ee1628d12a9f0bb99abf39f19/raft.go#L1785 is another problem I filed a ticket for: #7031 (because it was introduced with 1.5.2).

I checked the first issue again:

https://github.com/hashicorp/raft/blob/72a6ed385dac0c8ee1628d12a9f0bb99abf39f19/raft.go#L216 is the problem: I don't see how non-voters are prevented from getting there. There should be a guard in place that prevents non-voters from entering that state.

That said, I am not sure anymore that this is the problem: there is a check just before it that contains !hasVote, which verifies that the node is a voter. So I am no longer sure what the problem here actually is.
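
To illustrate the kind of guard described above, here is a minimal, self-contained sketch; the types and function names are hypothetical and this is not the hashicorp/raft code, only the shape of the check (after a heartbeat timeout, only a voter may campaign):

package main

import "fmt"

// Hypothetical, simplified types; the real hashicorp/raft implementation differs.
type suffrage int

const (
	voter suffrage = iota
	nonvoter
)

type server struct {
	name     string
	suffrage suffrage
}

// onHeartbeatTimeout models the decision point discussed above: after losing
// contact with the leader, only a voter should enter Candidate state and
// request votes; a non-voter should simply stay a follower.
func (s *server) onHeartbeatTimeout() string {
	if s.suffrage != voter {
		return s.name + ": staying follower (non-voters must not campaign)"
	}
	return s.name + ": entering Candidate state and requesting votes"
}

func main() {
	fmt.Println((&server{name: "consul-server002-dc0", suffrage: nonvoter}).onHeartbeatTimeout())
	fmt.Println((&server{name: "consul-server000-dc0", suffrage: voter}).onHeartbeatTimeout())
}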

@pierresouchay (Contributor)

I saw the same issue today: a non-voter became visible as leader for a few seconds, then another server took over the role (Consul 1.6.5).

@pierresouchay (Contributor)

We have hit the issue several times recently.

E.g.: 5 voter servers + 6 non-voters.
When we reinstall non-voters:

  1. The non-voter (node-id=machineID22), let's call it consul22, leaves the cluster with a consul leave, reboots, and its image is erased
  2. The machine is reinstalled with a new name, let's call it machine22, a consul-agent configuration, and the same node-id: machineID22
  3. Once fully reinstalled, the machine is renamed "consul22", still with the same ID, and we also add the configuration { "non_voter": true }
  4. At that point, consul members in the cluster shows both machine22 (marked as dead) and consul22 (marked as alive)
  5. Once started and caught up on replication, consul22 starts voting and triggers lots of elections; performance degrades
  6. At that point, consul operator raft list-peers shows consul22 as voting while it should not
  7. We do a consul leave on consul22
  8. We restart Consul on consul22 => consul operator raft list-peers now shows it as non-voting (good!)
  9. Still, consul22 triggers elections and performance degrades
  10. The only way to fix the problem reliably is to do a consul leave on consul22, keep Consul shut down there, and do a service consul restart on all voting servers one by one so they no longer see the old machine22 in their consul members; once none of them sees machine22 in the member list, the situation is back to normal (a rough sketch of this sequence follows below)

So, I suspect that the other Consul servers see machine22[nodeID=22] dead without the tag non_voter=1, decide that consul22[nodeID=22] alive with the tag non_voter=true is the same machine, and mix up the tags.

So, as long as machine22[nodeID=22] is still in the member list, the other servers do not consider consul22 a non-voter.
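
For reference, a rough sketch of the workaround sequence from step 10, assuming the host names used above (consul22 for the reinstalled non-voter) and a consul service managed with service(8); adjust names to your environment:

# on the reinstalled non-voter: leave the cluster and keep the agent stopped
consul leave
service consul stop

# on each voting server, one at a time: restart the agent so it forgets the stale machine22 entry
service consul restart

# on any voter: verify the stale member is gone before starting consul22 again
consul members | grep machine22      # should return nothing
consul operator raft list-peers      # consul22 should not be listed as a voter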

@pierresouchay (Contributor) commented Jul 24, 2020

An example from the real world (caused by a misconfiguration at start):

consul28-par                                                   10.236.91.29:8301    left    acls=1,build=1.6.6+entcriteo1:,dc=par,expect=3,id=d67f58ed-f589-8f63-830a-2fe6b9f277b4,nonvoter=1,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
consul28-par.central.criteo.prod                               10.236.91.29:8301    alive   acls=1,build=1.6.6+entcriteo1:,dc=par,expect=3,id=d67f58ed-f589-8f63-830a-2fe6b9f277b4,nonvoter=1,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
48-df-37-42-78-40.central.criteo.prod                          10.236.91.29:8301    left    acls=1,build=1.6.6criteo1:,dc=par,id=d67f58ed-f589-8f63-830a-2fe6b9f277b4,role=node,segment=<default>,vsn=2,vsn_max=3,vsn_min=2

Of course, we can see that 48-df-37-42-78-40.central.criteo.prod does not have the nonvoter flag, and I suspect that is the root cause.

=> Until I restart all Consul voters (in order to remove the stale consul28-par and 48-df-37-42-78-40.central.criteo.prod entries), I keep getting these messages:

2020/07/24 12:33:20 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:23 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:29 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:34 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:38 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:43 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:48 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:52 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300

@ChipV223 (Contributor)

Hi all!

I brought this issue up the other day with Consul Engineering and we determined that this is basically working as intended: OSS servers aren't aware of Enterprise-specific functionality, so the two should never be mixed. Given that, I'll close this one out. But do reach out if you have any other questions!
