Conflicts appear when changing node_name on agents #3974

kamaradclimber · 2018-03-20T12:57:55Z

Description of the Issue (and unexpected/desired result)

When changing the node_name of an agent, we observe conflicts in the consul server logs.
It is also possible (not reproduced yet, but observed twice in our production) that consul servers end up blocked when several agents have changed their names.

Reproduction steps

Change the node_name setting in a consul agent's configuration.
Restart the consul agent.

On the consul server:

2018/03/20 12:53:56 [INFO] serf: EventMemberJoin: consul-relay01-test2-pa4.central.criteo.preprod 10.224.45.123
2018/03/20 12:53:56 [INFO] consul: member 'consul-relay01-test2-pa4.central.criteo.preprod' joined, marking health alive
2018/03/20 12:53:56 [WARN] consul.fsm: EnsureRegistration failed: failed inserting node: node ID "4d1dac9f-8977-4258-4aca-fa254c9f48da" for node "consul-relay01-test2-pa4.central.criteo.preprod" aliases existing node "consul-relay01-pa4.central.criteo.preprod"
2018/03/20 12:54:00 [WARN] consul.fsm: EnsureRegistration failed: failed inserting node: node ID "4d1dac9f-8977-4258-4aca-fa254c9f48da" for node "consul-relay01-test2-pa4.central.criteo.preprod" aliases existing node "consul-relay01-pa4.central.criteo.preprod"
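
To see which renamed agents are affected, the warning lines above can be scanned for the node ID and the two names involved. The following is a minimal Python sketch, assuming the log format is exactly the one shown in this report; `find_alias_conflicts` is a hypothetical helper and not part of Consul.

```python
import re

# Matches the "aliases existing node" warning as it appears in the log excerpt above.
ALIAS_RE = re.compile(
    r'node ID "(?P<node_id>[^"]+)" for node "(?P<new_name>[^"]+)" '
    r'aliases existing node "(?P<old_name>[^"]+)"'
)


def find_alias_conflicts(log_lines):
    """Return unique (node_id, new_name, old_name) tuples in first-seen order."""
    seen = []
    for line in log_lines:
        match = ALIAS_RE.search(line)
        if match is None:
            continue  # not an alias warning
        conflict = (
            match.group("node_id"),
            match.group("new_name"),
            match.group("old_name"),
        )
        if conflict not in seen:
            seen.append(conflict)
    return seen
```

Because the server repeats the warning every few seconds, the helper deduplicates conflicts rather than returning one entry per log line.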

Output of consul members:

consul-relay01-pa4.central.criteo.preprod                10.224.45.123:8301  failed  client  1.0.6  2         pa4  <default>
consul-relay01-test2-pa4.central.criteo.preprod          10.224.45.123:8301  alive   client  1.0.6  2         pa4  <default>
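
The two rows above share one address, with the old name marked failed and the new one alive. A small Python sketch for finding such pairs in `consul members` output follows; it assumes the whitespace-separated column layout shown here (name, address, status, ...), and `duplicate_addresses` is a hypothetical helper name.

```python
from collections import defaultdict


def duplicate_addresses(members_output):
    """Map each address that appears under several node names to its (name, status) pairs."""
    by_addr = defaultdict(list)
    for line in members_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "Node":
            continue  # skip blank lines and the header row
        name, addr, status = fields[0], fields[1], fields[2]
        by_addr[addr].append((name, status))
    # Keep only addresses claimed by more than one node name.
    return {addr: nodes for addr, nodes in by_addr.items() if len(nodes) > 1}
```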

`consul version` for both Client and Server

Client: 1.0.6
Server: 1.0.6 (with some patches)

`consul info` for both Client and Server

Client:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 3
	services = 4
build:
	prerelease = 
	revision = 
	version = 1.0.6
consul:
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 32
	goroutines = 46
	max_procs = 2
	os = linux
	version = go1.9.4
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 3715
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1133006
	members = 453
	query_queue = 0
	query_time = 1929

Server:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 2
	services = 3
build:
	prerelease = criteo5
	revision = 
	version = 1.0.6
consul:
	bootstrap = false
	known_datacenters = 2
	leader = true
	leader_addr = 10.224.47.92:8300
	server = true
raft:
	applied_index = 83407159
	commit_index = 83407159
	fsm_pending = 0
	last_contact = 0
	last_log_index = 83407159
	last_log_term = 98
	last_snapshot_index = 83403466
	last_snapshot_term = 98
	latest_configuration = [{Suffrage:Voter ID:4fd4772d-e3cd-ebd6-731f-c7e6431ce284 Address:10.224.46.86:8300} {Suffrage:Voter ID:1bfb896b-ee04-520c-10e0-8b382dc0c832 Address:10.224.47.92:8300} {Suffrage:Voter ID:f5e7d35e-66b4-a4b1-b85b-5747af533b58 Address:10.224.47.83:8300}]
	latest_configuration_index = 77270117
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 98
runtime:
	arch = amd64
	cpu_count = 32
	goroutines = 2592
	max_procs = 31
	os = linux
	version = go1.10
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 3715
	failed = 2
	health_score = 0
	intent_queue = 0
	left = 1
	member_time = 1133006
	members = 456
	query_queue = 0
	query_time = 1929
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 6511
	members = 6
	query_queue = 0
	query_time = 1

Operating system and Environment details

centos7.3


kamaradclimber · 2018-03-20T13:00:23Z

There are many existing (closed) tickets about allowing consul servers to change IP address; most of them were closed thanks to the use of raft protocol version 3 (and possibly the node ID).

pierresouchay · 2018-03-27T06:49:37Z

@hashicorp: what is the reason for restricting changes to node names now that there is a node ID?

Is it simply because the implementation was complicated (updating existing services/checks...), or is it to be defensive and avoid clashes?

banks · 2018-03-27T14:26:48Z

What is the use case for having a client agent's name change without its ID changing? There may well be one, but it's worth understanding why it's needed before considering the change, which at least has some subtleties to think through.

The main value of that on servers is that they have persistent state and participate in raft, where identity and state both matter for correctness. My guess is we didn't extend renaming to work for client agents just because it's not very clear why you would need to rename a client agent (e.g. change hostname) without also just letting it get a new ID (i.e. wiping its persistent state).

I could be wrong, but I don't think there is any problem if a client agent leaves and comes back with a different name AND ID but the same IP, right?

guidoiaquinti · 2018-03-27T15:00:17Z

Personally I don't have a strong use case, but I can report that we had a "mini-incident" due to this bug. We also experienced the blocked consul servers behaviour that @kamaradclimber mentioned.

We usually don't rename nodes, but we ended up hitting this issue due to a race condition in our provisioning pipeline (the consul process was started before the hostname was properly rendered).

I think it would be nice to have consul handle this event gracefully.

kamaradclimber · 2018-03-27T18:13:54Z

On our cluster, consul node name is the fqdn of the machine. Some of our users change their domain name, leading to an attempt to change consul node name.

kamaradclimber · 2018-03-27T18:15:09Z

As a side note, we don't touch the node ID and let consul generate it using its deterministic method.

shantanugadgil · 2018-04-18T15:44:11Z

For my on-premises solution, I have a cron job which names the machine based on its IP address and the Proxmox VMID.

I am currently on v1.0.7.

If a machine is offline for a few days, it gets a new ip and the name change goes through smoothly.

Recently I changed the naming scheme a little bit.

A VM which had been off for a few months (Consul 0.9) came online with the old naming scheme.

After updating the Consul agent and updating the cron files, I had two entries in my consul members output, one with the old name and one with the new name.

I just let it be and the next day, the old name was gone from the list.

pierresouchay · 2018-04-19T16:25:30Z

@shantanugadgil
Thank you for sharing your approach and experience!
Unfortunately, we hit the issue several times, causing various production issues, and fixing it requires manual intervention on our side, which is painful. (We even had cases where we could not fix it without waiting for a few hours.)
Since the node now has an ID, I think its name could be changed without too much trouble.

kamaradclimber · 2018-06-14T12:59:46Z

@banks, would you have feedback on this issue?
@pierresouchay we might want to include #3983 in our next consul build to check the improvement for our use case

banks · 2018-07-12T15:24:58Z

We had to revert #3983 as it caused problems in testing, and we discovered it's a breaking change which we can't include in the current release cycle.

We still think this is close and will add some extra details about what we need to do to get this into 1.3.

pierresouchay · 2018-07-12T21:11:06Z

@banks Ok, I'll give you more details about our incident as well

…#4413

This change allows any well-behaved recent agent with an ID to be renamed safely, i.e. without taking the name of another node, using case-insensitive comparison.

Deprecated behaviour warning
----------------------------

For backward compatibility, it is still possible, however, to "take" the name of another node by not providing any ID.

Note that when no ID is provided, it is possible to have two nodes whose names differ only by case, e.g. myNode and mynode, which might lead to DB corruption on the Consul server side and to the server not restarting properly.

See hashicorp#3983 and hashicorp#4399 for context about this change.

Disabling registration of nodes without IDs, as specified in hashicorp#4414, should probably be the way to go eventually.
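
The rule described in this commit message can be illustrated with a short Python sketch. It is a simplified model, not Consul's actual code: the catalog is assumed to be a plain dict from node ID to node name, only nodes that have IDs are modelled, and `can_register` is a hypothetical name.

```python
def can_register(catalog, node_id, name):
    """Return True if the node with `node_id` may register under, or rename itself to, `name`.

    catalog: dict mapping node ID -> current node name (nodes with IDs only).
    A name is refused when a *different* node ID already holds it,
    comparing names case-insensitively.
    """
    wanted = name.lower()
    for other_id, other_name in catalog.items():
        if other_id != node_id and other_name.lower() == wanted:
            return False  # the name belongs to another node
    return True
```

Under this model a node may always change its own name, including changing only the case, but it can never take over a name that differs from another node's name only by case.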

pierresouchay · 2018-08-10T15:34:25Z

\o/

shantanugadgil · 2018-08-10T16:40:56Z

Finally!!! 👍👍👍

rafaelmagu · 2018-10-18T00:36:20Z

Is this fixed in 1.3.0? Because I keep getting errors similar to this all over my stack:

Node name bastion-03f55798ed842fc0e is reserved by node a76399da-e1b0-c1ed-d426-e7ac892ef6c2 with name bastion-03f55798ed842fc0e

Sometimes it goes away with a service restart, sometimes it doesn't.

pierresouchay mentioned this issue Mar 26, 2018

Allow changing Node names since Node now have IDs #3983

Merged

mkeeler closed this as completed in #3983 Jul 11, 2018

banks reopened this Jul 12, 2018

pierresouchay mentioned this issue Jul 16, 2018

Node renaming - Support for allow_node_renaming #4399

Closed

pierresouchay mentioned this issue Jul 18, 2018

Allow to rename nodes with IDs, will fix #3974 and #4413 #4415

Merged

pearkes added the type/bug (Feature does not function as expected) and waiting-pr-merge labels Jul 26, 2018

mkeeler closed this as completed in ef3b81a Aug 10, 2018
