Non voting servers get elected as leader in a mixed OSS/Enterprise version cluster #6979

Closed
tionebsalocin opened this issue Dec 23, 2019 · 6 comments
Labels
theme/internals (Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics), type/docs (Documentation needs to be created/updated/clarified)

Comments

@tionebsalocin (Contributor)

Overview of the Issue

In a cluster running the open source version of Consul, I added an Enterprise server with the non_voting_server option. After a few elections, this server ended up as leader of the cluster, which was obviously not expected.
This happened in production while we were progressively introducing the Enterprise server.
The non-voting option was set precisely to avoid any impact at the beginning of the migration.
Note: the logs and info below come from a test platform on which we reproduced the issue.

Reproduction Steps

Steps to reproduce this issue:

  1. Create a cluster with 3 client nodes and 3 server nodes: 2 servers running Consul 1.5.1 OSS and 1 running Consul 1.5.1 Enterprise
  2. Add the non_voting_server option to the Enterprise server's config file and restart Consul on that server (see the config sketch after this list)
  3. Stop Consul on one of the two OSS servers to force an election
  4. Repeat until the Enterprise server gets elected
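
For reference, a minimal sketch of the Enterprise server configuration used in step 2, assuming a JSON config file such as /etc/consul.d/server.json (the path is only an example); non_voting_server is the Consul Enterprise option this issue is about:

{
  "server": true,
  "non_voting_server": true
}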

Consul info for both OSS and ENT Server

OSS Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = criteo8
	revision = 
	version = 1.5.1
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 172.21.185.218:8300
	server = true
raft:
	applied_index = 179
	commit_index = 179
	fsm_pending = 0
	last_contact = 0
	last_log_index = 179
	last_log_term = 24
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Nonvoter ID:3199fa0a-f10b-c3af-37f0-2ad73553769a Address:172.21.180.196:8300} {Suffrage:Voter ID:a411cf76-a4f9-ae24-3afd-5e642763eed0 Address:172.21.185.218:8300} {Suffrage:Voter ID:4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f Address:172.21.233.101:8300}]
	latest_configuration_index = 173
	num_peers = 1
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 24
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 100
	max_procs = 2
	os = linux
	version = go1.11.13
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 9
	failed = 0
	health_score = 2
	intent_queue = 0
	left = 0
	member_time = 12
	members = 6
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 4
	members = 3
	query_queue = 0
	query_time = 1
Enterprise Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = criteo9
	revision = 
	version = 1.5.1+ent
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 172.21.180.196:8300
	server = true
license:
	customer = permanent
	expiration_time = 2049-12-15 13:31:19.868306147 +0000 UTC
	features = Automated Backups, Automated Upgrades, Enhanced Read Scalability, Network Segments, Redundancy Zone, Advanced Network Federation
	id = permanent
	install_id = *
	issue_time = 2019-12-23 13:31:19.868306147 +0000 UTC
	package = premium
	product = consul
	start_time = 2019-12-23 13:26:19.868306147 +0000 UTC
raft:
	applied_index = 429
	commit_index = 429
	fsm_pending = 0
	last_contact = 0
	last_log_index = 429
	last_log_term = 36
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:a411cf76-a4f9-ae24-3afd-5e642763eed0 Address:172.21.185.218:8300} {Suffrage:Voter ID:4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f Address:172.21.233.101:8300} {Suffrage:Voter ID:3199fa0a-f10b-c3af-37f0-2ad73553769a Address:172.21.180.196:8300}]
	latest_configuration_index = 295
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 36
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 117
	max_procs = 2
	os = linux
	version = go1.13.4
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 14
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 21
	members = 6
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 5
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

In this example: 3 VMs running Ubuntu, with 2 servers on Consul 1.5.1 OSS and 1 on Consul 1.5.1 Enterprise.
OSS servers are:
consul-server000-dc0 build=1.5.1criteo8 role=consul
consul-server001-dc0 build=1.5.1criteo8 role=consul
ENT server is:
consul-server002-dc0 build=1.5.1+entcriteo9 nonvoter=1 role=consul

consul members + info fragment:

consul-server000-dc0  172.21.185.218:8301  alive   server  1.5.1criteo8      2         dc0  <all>
consul-server001-dc0  172.21.233.101:8301  alive   server  1.5.1criteo8      2         dc0  <all>
consul-server002-dc0  172.21.180.196:8301  alive   server  1.5.1+entcriteo9  2         dc0  <all>
	leader = false
	leader_addr = 172.21.180.196:8300

consul members detailed for these servers

Node                  Address              Status  Tags
consul-server000-dc0  172.21.185.218:8301  alive   acls=0,build=1.5.1criteo8:,criteo_grpc_enabled=true,dc=dc0,expect=3,grpc_enabled=true,id=a411cf76-a4f9-ae24-3afd-5e642763eed0,port=8300,raft_vsn=3,role=consul,segment=<all>,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
consul-server001-dc0  172.21.233.101:8301  alive   acls=0,build=1.5.1criteo8:,criteo_grpc_enabled=true,dc=dc0,expect=3,grpc_enabled=true,id=4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f,port=8300,raft_vsn=3,role=consul,segment=<all>,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
consul-server002-dc0  172.21.180.196:8301  alive   acls=0,build=1.5.1+entcriteo9:,criteo_grpc_enabled=true,dc=dc0,expect=3,grpc_enabled=true,id=3199fa0a-f10b-c3af-37f0-2ad73553769a,nonvoter=1,port=8300,raft_vsn=3,role=consul,segment=<all>,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302

Non voting server is leader!

Log Fragments

Election log

2019/12/23 13:31:55 [WARN] raft: Heartbeat timeout from "172.21.185.218:8300" reached, starting election
2019/12/23 13:31:55 [INFO] raft: Node at 172.21.180.196:8300 [Candidate] entering Candidate state in term 35
2019/12/23 13:31:55 [ERR] raft: Failed to make RequestVote RPC to {Voter a411cf76-a4f9-ae24-3afd-5e642763eed0 172.21.185.218:8300}: EOF
2019/12/23 13:31:55 [DEBUG] raft: Votes needed: 2
2019/12/23 13:31:55 [DEBUG] raft: Vote granted from 3199fa0a-f10b-c3af-37f0-2ad73553769a in term 35. Tally: 1
2019/12/23 13:31:56 [WARN] raft: Election timeout reached, restarting election
2019/12/23 13:31:56 [INFO] raft: Node at 172.21.180.196:8300 [Candidate] entering Candidate state in term 36
2019/12/23 13:31:56 [ERR] raft: Failed to make RequestVote RPC to {Voter a411cf76-a4f9-ae24-3afd-5e642763eed0 172.21.185.218:8300}: dial tcp <nil>->172.21.185.218:8300: connect: connection refused
2019/12/23 13:31:56 [DEBUG] raft: Votes needed: 2
2019/12/23 13:31:56 [DEBUG] raft: Vote granted from 3199fa0a-f10b-c3af-37f0-2ad73553769a in term 36. Tally: 1
2019/12/23 13:31:56 [DEBUG] raft: Vote granted from 4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f in term 36. Tally: 2
2019/12/23 13:31:56 [INFO] raft: Election won. Tally: 2
2019/12/23 13:31:56 [INFO] raft: Node at 172.21.180.196:8300 [Leader] entering Leader state
2019/12/23 13:31:56 [INFO] raft: Added peer a411cf76-a4f9-ae24-3afd-5e642763eed0, starting replication
2019/12/23 13:31:56 [INFO] raft: Added peer 4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f, starting replication
2019/12/23 13:31:56 [INFO] consul: cluster leadership acquired
2019/12/23 13:31:56 [INFO] consul: New leader elected: consul-server002-dc0

Non voting server gets elected

@hanshasselberg (Member) commented Jan 13, 2020

There are two places in raft where a node enters Candidate state:

Does that seem likely to you?

@hanshasselberg (Member)

hashicorp/raft#398 makes pickServer return only voters, which fixes the second issue I mentioned.

https://github.com/hashicorp/raft/blob/72a6ed385dac0c8ee1628d12a9f0bb99abf39f19/raft.go#L1785 is another problem I filed a ticket for: #7031 (because it was introduced with 1.5.2).

I checked the first issue again:

https://github.com/hashicorp/raft/blob/72a6ed385dac0c8ee1628d12a9f0bb99abf39f19/raft.go#L216 is the problem: I don't see how non-voters are prevented from getting there. There should be a guard in place that prevents non-voters from entering that state.

That said, I am not sure anymore that this is the problem: there is a check just before it that contains !hasVote, which verifies that the node is a voter. So I am no longer sure what the problem here actually is.
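
To illustrate the kind of guard described above, here is a minimal, self-contained sketch; the types and function names are hypothetical and this is not the hashicorp/raft code, only the shape of the check (after a heartbeat timeout, only a voter may campaign):

package main

import "fmt"

// Hypothetical, simplified types; the real hashicorp/raft implementation differs.
type suffrage int

const (
	voter suffrage = iota
	nonvoter
)

type server struct {
	name     string
	suffrage suffrage
}

// onHeartbeatTimeout models the decision point discussed above: after losing
// contact with the leader, only a voter should enter Candidate state and
// request votes; a non-voter should simply stay a follower.
func (s *server) onHeartbeatTimeout() string {
	if s.suffrage != voter {
		return s.name + ": staying follower (non-voters must not campaign)"
	}
	return s.name + ": entering Candidate state and requesting votes"
}

func main() {
	fmt.Println((&server{name: "consul-server002-dc0", suffrage: nonvoter}).onHeartbeatTimeout())
	fmt.Println((&server{name: "consul-server000-dc0", suffrage: voter}).onHeartbeatTimeout())
}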

@pierresouchay (Contributor)

I saw the same issue today: a non-voter became visible as leader for a few seconds, then another server took over the role (Consul 1.6.5).

@pierresouchay (Contributor)

We have hit the issue several times recently.

E.g.: 5 voter servers + 6 non-voters.
When we reinstall non-voters:

  1. The non-voter (node-id=machineID22), let's call it consul22, leaves the cluster with a consul leave, reboots, and its image is erased
  2. The machine is reinstalled with a new name, let's call it machine22, a consul-agent configuration, and the same node-id: machineID22
  3. Once fully reinstalled, the machine is renamed "consul22", still with the same ID, and we also add the configuration { "non_voter": true }
  4. At that point, consul members in the cluster shows both machine22 (marked as dead) and consul22 (marked as alive)
  5. Once started and caught up on replication, consul22 starts voting and triggers lots of elections; performance degrades
  6. At that point, consul operator raft list-peers shows consul22 as voting while it should not
  7. We do a consul leave on consul22
  8. We restart Consul on consul22 => consul operator raft list-peers now shows it as non-voting (good!)
  9. Still, consul22 triggers elections and performance degrades
  10. The only way to fix the problem reliably is to do a consul leave on consul22, keep Consul shut down there, and do a service consul restart on all voting servers one by one so they no longer see the old machine22 in their consul members; once none of them sees machine22 in the member list, the situation is back to normal (a rough sketch of this sequence follows below)

So, I suspect that the other Consul servers see machine22[nodeID=22] dead without the tag non_voter=1, decide that consul22[nodeID=22] alive with the tag non_voter=true is the same machine, and mix up the tags.

So, as long as machine22[nodeID=22] is still in the member list, the other servers do not consider consul22 a non-voter.
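
For reference, a rough sketch of the workaround sequence from step 10, assuming the host names used above (consul22 for the reinstalled non-voter) and a consul service managed with service(8); adjust names to your environment:

# on the reinstalled non-voter: leave the cluster and keep the agent stopped
consul leave
service consul stop

# on each voting server, one at a time: restart the agent so it forgets the stale machine22 entry
service consul restart

# on any voter: verify the stale member is gone before starting consul22 again
consul members | grep machine22      # should return nothing
consul operator raft list-peers      # consul22 should not be listed as a voter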

@pierresouchay (Contributor) commented Jul 24, 2020

An example from the real world (caused by a misconfiguration at start):

consul28-par                                                   10.236.91.29:8301    left    acls=1,build=1.6.6+entcriteo1:,dc=par,expect=3,id=d67f58ed-f589-8f63-830a-2fe6b9f277b4,nonvoter=1,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
consul28-par.central.criteo.prod                               10.236.91.29:8301    alive   acls=1,build=1.6.6+entcriteo1:,dc=par,expect=3,id=d67f58ed-f589-8f63-830a-2fe6b9f277b4,nonvoter=1,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
48-df-37-42-78-40.central.criteo.prod                          10.236.91.29:8301    left    acls=1,build=1.6.6criteo1:,dc=par,id=d67f58ed-f589-8f63-830a-2fe6b9f277b4,role=node,segment=<default>,vsn=2,vsn_max=3,vsn_min=2

Of course, we can see that 48-df-37-42-78-40.central.criteo.prod does not have the nonvoter flag, and I suspect that is the root cause.

=> Until I restart all Consul voters (in order to remove the stale consul28-par and 48-df-37-42-78-40.central.criteo.prod entries), I keep getting these messages:

2020/07/24 12:33:20 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:23 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:29 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:34 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:38 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:43 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:48 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:52 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300

@ChipV223 (Contributor)

Hi all!

I brought this issue up the other day with Consul Engineering and we determined that this is basically working as intended: OSS servers aren't aware of Enterprise-specific functionality, so the two should never be mixed. Given that, I'll close this one out. But do reach out if you have any other questions!
