Conflicts appears when changing node_name on agents #3974

Closed
kamaradclimber opened this issue Mar 20, 2018 · 14 comments · Fixed by #3983
Labels
type/bug Feature does not function as expected

Comments

@kamaradclimber
Contributor

kamaradclimber commented Mar 20, 2018

Description of the Issue (and unexpected/desired result)

  • When changing the node_name of an agent, we observe conflicts in the Consul server logs.
  • It is also possible (not reproduced yet, but observed twice in our production environment) that Consul servers end up blocked when several agents have changed their names.

Reproduction steps

  • Change the node_name setting in a Consul agent's configuration
  • Restart the Consul agent

On the consul server:

2018/03/20 12:53:56 [INFO] serf: EventMemberJoin: consul-relay01-test2-pa4.central.criteo.preprod 10.224.45.123
2018/03/20 12:53:56 [INFO] consul: member 'consul-relay01-test2-pa4.central.criteo.preprod' joined, marking health alive
2018/03/20 12:53:56 [WARN] consul.fsm: EnsureRegistration failed: failed inserting node: node ID "4d1dac9f-8977-4258-4aca-fa254c9f48da" for node "consul-relay01-test2-pa4.central.criteo.preprod" aliases existing node "consul-relay01-pa4.central.criteo.preprod"
2018/03/20 12:54:00 [WARN] consul.fsm: EnsureRegistration failed: failed inserting node: node ID "4d1dac9f-8977-4258-4aca-fa254c9f48da" for node "consul-relay01-test2-pa4.central.criteo.preprod" aliases existing node "consul-relay01-pa4.central.criteo.preprod"
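
For anyone who wants to count these warnings on their own servers, here is a minimal sketch that extracts the node ID, the new name and the old name from lines shaped like the two WARN lines above. The regular expression is modelled only on those two lines, and the function names are hypothetical; other Consul versions may word the message differently.

```python
import re

# Assumption: the message wording matches the WARN lines quoted in this issue.
ALIAS_RE = re.compile(
    r'node ID "(?P<node_id>[0-9a-f-]+)" for node "(?P<new_name>[^"]+)" '
    r'aliases existing node "(?P<old_name>[^"]+)"'
)


def parse_alias_conflict(line):
    """Return (node_id, new_name, old_name) for an alias warning, else None."""
    match = ALIAS_RE.search(line)
    if match is None:
        return None
    return match.group("node_id"), match.group("new_name"), match.group("old_name")


def summarize_conflicts(lines):
    """Count how often each (node_id, new_name, old_name) conflict repeats."""
    counts = {}
    for line in lines:
        conflict = parse_alias_conflict(line)
        if conflict is not None:
            counts[conflict] = counts.get(conflict, 0) + 1
    return counts
```

Feeding the server log through summarize_conflicts gives one entry per renamed agent, which makes it easier to see how many agents are stuck in this state.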

Output of consul members:

consul-relay01-pa4.central.criteo.preprod                10.224.45.123:8301  failed  client  1.0.6  2         pa4  <default>
consul-relay01-test2-pa4.central.criteo.preprod          10.224.45.123:8301  alive   client  1.0.6  2         pa4  <default>
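
A quick way to spot this situation across a large cluster is to group the members by address and look for addresses claimed by more than one name. This is only a sketch: it assumes whitespace-separated columns with the name first, the address second and the status third, as in the two rows above, and the function name is hypothetical.

```python
from collections import defaultdict


def members_sharing_address(members_output):
    """Return {address: [(name, status), ...]} for addresses used by 2+ names.

    Assumption: each row is whitespace-separated with the node name, the
    address and the status as its first three columns.
    """
    by_address = defaultdict(list)
    for row in members_output.splitlines():
        fields = row.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        name, address, status = fields[0], fields[1], fields[2]
        by_address[address].append((name, status))
    # Keep only addresses that appear under more than one member name.
    return {addr: entries for addr, entries in by_address.items() if len(entries) > 1}
```

On the output above, the only address returned is 10.224.45.123:8301, with the old name in the failed state and the new name alive.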

consul version for both Client and Server

Client: 1.0.6
Server: 1.0.6 (with some patches)

consul info for both Client and Server

Client:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 3
	services = 4
build:
	prerelease = 
	revision = 
	version = 1.0.6
consul:
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 32
	goroutines = 46
	max_procs = 2
	os = linux
	version = go1.9.4
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 3715
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1133006
	members = 453
	query_queue = 0
	query_time = 1929

Server:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 2
	services = 3
build:
	prerelease = criteo5
	revision = 
	version = 1.0.6
consul:
	bootstrap = false
	known_datacenters = 2
	leader = true
	leader_addr = 10.224.47.92:8300
	server = true
raft:
	applied_index = 83407159
	commit_index = 83407159
	fsm_pending = 0
	last_contact = 0
	last_log_index = 83407159
	last_log_term = 98
	last_snapshot_index = 83403466
	last_snapshot_term = 98
	latest_configuration = [{Suffrage:Voter ID:4fd4772d-e3cd-ebd6-731f-c7e6431ce284 Address:10.224.46.86:8300} {Suffrage:Voter ID:1bfb896b-ee04-520c-10e0-8b382dc0c832 Address:10.224.47.92:8300} {Suffrage:Voter ID:f5e7d35e-66b4-a4b1-b85b-5747af533b58 Address:10.224.47.83:8300}]
	latest_configuration_index = 77270117
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 98
runtime:
	arch = amd64
	cpu_count = 32
	goroutines = 2592
	max_procs = 31
	os = linux
	version = go1.10
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 3715
	failed = 2
	health_score = 0
	intent_queue = 0
	left = 1
	member_time = 1133006
	members = 456
	query_queue = 0
	query_time = 1929
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 6511
	members = 6
	query_queue = 0
	query_time = 1

Operating system and Environment details

centos7.3

@kamaradclimber
Contributor Author

There are many existing (closed) tickets about allowing Consul servers to change IP address; most of them were closed thanks to Raft protocol version 3 (and possibly the node ID).

@pierresouchay
Contributor

@hashicorp: what is the reason for restricting changes to node names now that there is a node ID?

Is it simply because the implementation was complicated (updating existing services/checks...), or is it defensive, to avoid clashes?

@banks
Member

banks commented Mar 27, 2018

What is the use case for having a client agent's name change without its ID changing? There may well be one, but it's worth understanding why it's needed before considering the change, which at least has some subtleties to think through.

The main value of that on servers is that they have persistent state and participate in raft, where identity and state both matter for correctness. My guess is we didn't extend renaming to work for client agents just because it's not very clear why you would need to rename a client agent (e.g. change hostname) without also just letting it get a new ID (i.e. wiping its persistent state).

I could be wrong, but I don't think there is any problem if a client agent leaves and comes back with a different name AND ID but the same IP, right?

@guidoiaquinti
Contributor

guidoiaquinti commented Mar 27, 2018

Personally I don't have a strong use case, but I can report that we had a "mini-incident" due to this bug. We also experienced the blocked Consul servers behaviour that @kamaradclimber mentioned.

We usually don't rename nodes, but we ended up hitting this issue due to a race condition in our provisioning pipeline (the consul process was started before the hostname was properly rendered).

I think it would be nice to have Consul handle this event gracefully.

@kamaradclimber
Contributor Author

On our cluster, the Consul node name is the FQDN of the machine. Some of our users change their domain name, leading to an attempt to change the Consul node name.

@kamaradclimber
Contributor Author

As a side note, we don't touch the node ID and let Consul generate it using its deterministic method.

@shantanugadgil
Contributor

For my on-premises setup, I have a cron job which names the machine based on its IP address and the Proxmox VMID.

I am currently on v1.0.7.

If a machine is offline for a few days, it gets a new IP and the name change goes through smoothly.

Recently I changed the naming scheme a little bit.

A VM which had been off for a few months (Consul 0.9) came online with the old naming scheme.

After updating the Consul agent and the cron files, I had two entries in my consul members output, one with the old name and one with the new name.

I just let it be, and the next day the old name was gone from the list.

@pierresouchay
Contributor

@shantanugadgil
Thank you for sharing your approach and experience!
Unfortunately, we have hit this issue several times, causing various production issues, and fixing it requires manual intervention on our side, which is painful. (We even had cases where we could not fix it without waiting for a few hours.)
Since the node now has an ID, I think its name could be changed without too much trouble.

@kamaradclimber
Contributor Author

kamaradclimber commented Jun 14, 2018

@banks, would you have feedback on this issue?
@pierresouchay we might want to include #3983 in our next Consul build to check the improvement for our use case.

@banks
Member

banks commented Jul 12, 2018

We had to revert #3983 as it caused problems in testing, and we discovered it's a breaking change which we can't include in the current release cycle.

We still think this is close and will add some extra details about what we need to do to get this into 1.3.

@banks banks reopened this Jul 12, 2018
@pierresouchay
Contributor

@banks OK, I'll give you more details about our incident as well.

pierresouchay added a commit to pierresouchay/consul that referenced this issue Jul 18, 2018
…#4413

This change allows any well-behaved recent agent with an ID to be
renamed safely, i.e. without taking the name of another one, using
case-insensitive comparison.

Deprecated behaviour warning
----------------------------

Due to backward compatibility, it is however still possible to
"take" the name of another node by not providing any ID.

Note that when not providing any ID, it is possible to have 2 nodes
with names that differ only by case, e.g. myNode and mynode,
which might lead to DB corruption on the Consul server side and
prevent the server from restarting properly.

See hashicorp#3983 and hashicorp#4399 for context about this change.

Disabling registration of nodes without IDs as specified in hashicorp#4414
should probably be the way to go eventually.
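
To make the rule in this commit message easier to follow, here is a rough sketch of the check as described in words above. It is not the actual Consul implementation: the function name, the (name, node_id) data model and the returned messages are all hypothetical, and the handling of an existing holder that has no ID is a guess, since the commit message does not cover that case.

```python
def check_registration(nodes, name, node_id=None):
    """Decide whether (name, node_id) may be registered.

    `nodes` is a list of (name, node_id) pairs already in the catalog;
    node_id is None for nodes that registered without an ID.
    Returns (allowed, reason). Illustrative only, not Consul's code.
    """
    if node_id is None:
        # Legacy path from the commit message: without an ID there is no
        # protection, so the registration can "take" an existing name.
        return True, "legacy registration without ID"
    for existing_name, existing_id in nodes:
        # Names are compared case-insensitively, as the commit describes.
        same_name = existing_name.lower() == name.lower()
        if not same_name:
            continue
        if existing_id is None:
            # Assumption: a holder without an ID does not reserve the name.
            continue
        if existing_id != node_id:
            return False, "name %s is reserved by node %s" % (existing_name, existing_id)
    # Either the name is free, or it already belongs to this node ID,
    # which covers both a plain re-registration and a rename.
    return True, "ok"
```

With this model, an agent keeping its ID can move to a free name, while a different ID asking for mynode is rejected if myNode is already held by another ID.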
@pearkes pearkes added type/bug Feature does not function as expected waiting-pr-merge labels Jul 26, 2018
@pierresouchay
Contributor

\o/

@shantanugadgil
Contributor

Finally!!! 👍👍👍

@rafaelmagu

Is this fixed in 1.3.0? Because I keep getting errors similar to this all over my stack:

Node name bastion-03f55798ed842fc0e is reserved by node a76399da-e1b0-c1ed-d426-e7ac892ef6c2 with name bastion-03f55798ed842fc0e

Sometimes it goes away with a service restart, sometimes it doesn't.

7 participants