
Failed to join: Member has conflicting node ID #3070

Closed
eladitzhakian opened this issue May 24, 2017 · 22 comments
Labels
type/enhancement Proposed improvement or new feature

Comments

@eladitzhakian

I'm using dockerized Consul agents and servers, version 0.8.1, and I'm also using Terraform to manage the infrastructure.

Whenever I relaunch (destroy + create) a machine, or simply reboot it, the Consul agent fails to rejoin the cluster with this error:

* Failed to join 10.136.60.53: Member 'abee16f194f7' has conflicting node ID 'ac5ec7af-5238-4c18-988a-0385d0a0f477' with this agent's ID

The Consul servers show the corresponding error:

 2017/05/24 13:32:43 [INFO] consul.fsm: EnsureRegistration failed: failed inserting node: node ID "ac5ec7af-5238-4c18-988a-0385d0a0f477" for node "abee16f194f7" aliases existing node "8c38f1b65cea"

I tried everything the mailing list suggests (roughly the commands sketched at the end of this comment):

  1. use randomly generated node-id
  2. use a constant unique node-id
  3. delete the agent's data dir
  4. force leave the failing/missing nodes
  5. use disable host node id

Nothing works.

Lastly I downgraded to 0.7.5 (where uniqueness is not enforced) and the agent was able to rejoin the cluster. What are my options?
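Roughly, the commands behind those attempts look like this (a sketch only: the data-dir path and placeholder UUID are illustrative, the join address and node name are just the ones from the log above):

  # 1. randomly generated node ID at startup
  consul agent -node-id=$(uuidgen | awk '{print tolower($0)}') -data-dir=/consul/data -retry-join=10.136.60.53
  # 2. constant unique node ID for this machine (placeholder UUID)
  consul agent -node-id=<fixed-uuid-for-this-machine> -data-dir=/consul/data -retry-join=10.136.60.53
  # 3. delete the agent's data dir before starting
  rm -rf /consul/data/*
  # 4. force-leave the failed member, by node name (not IP)
  consul force-leave abee16f194f7
  # 5. don't derive the node ID from the host
  consul agent -disable-host-node-id -data-dir=/consul/data -retry-join=10.136.60.53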

@slackpad
Contributor

Hi @eladitzhakian this'll be generally covered by #1580 once that's complete, but there should be a way to get this working in the meantime.

Option 4 should work no matter what - are you sure the force-leave is actually working? It's by node name, not IP. To understand this better, what's preserved when you restart or relaunch a machine? Is it the same node name before and after?

@slackpad slackpad added the waiting-reply Waiting on response from Original Poster or another individual in the thread label May 25, 2017
@eladitzhakian
Author

Thanks for your reply @slackpad.
I'm sure force leave is working; I'm using the node name and tailing the servers' logs. What I generally do is try to force-leave all the failing nodes, but they keep coming back with a new ID. The servers attempt to reconnect and re-register them for some reason.

Usually the same node name is preserved, but as I said, it doesn't help when I force a new node ID or ask Consul to disable the host node ID. I'm getting the same error with the new node ID :/

@eladitzhakian eladitzhakian changed the title from "Failed to join: Memeber has conflicting node ID" to "Failed to join: Member has conflicting node ID" May 30, 2017
@slackpad
Contributor

Hmm do you have a simple repro case that'll show the force-leave not working? I'm not sure what's happening with this one.

@eladitzhakian
Author

I'm afraid this is not easily reproduced, but it does happen on both clusters I'm maintaining. Each cluster has 3 Consul server nodes and about 20 nodes running a single agent, each with a Registrator container running next to it. Everything is dockerized.

@nrvale0
Contributor

nrvale0 commented Jun 16, 2017

@eladitzhakian I've been seeing a similar issue when deploying with the Kubernetes Helm chart. There are two places I've noticed possible conflicts:

  1. Where gopsutil computes the same Consul node ID for two Pods/containers because they are running on the same underlying k8s worker node/minion.
  2. Because the Consul Pod is running in k8s as a StatefulSet with PersistentVolumeClaims for Consul storage, a rebuild of the Consul stack can result in the new Pods re-using volumes and data directories from a previous StatefulSet.

You didn't mention whether you are running on k8s, but I suspect this might be a problem for other cluster managers/orchestration platforms with similar concepts of StatefulSets and PersistentVolumes.
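If re-used volumes are suspected, one quick check is whether the data directory already carries a cached ID (the path below assumes the stock Consul Docker image layout; adjust it to whatever volume mount is in use):

  cat /consul/data/node-id   # cached node ID left behind by a previous Pod
  rm /consul/data/node-id    # the agent generates a fresh ID on the next start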

@eladitzhakian
Author

@nrvale0 thanks, but I'm not using k8s or any orchestration manager :/

@hehailong5

Hi All,

The same is observed in 0.8.4, and I am clustering Consul in Docker Swarm using a dedicated overlay network.

@slackpad, what's the possible workaround for this issue in the current release? And what's the schedule for delivering the fix?

Thanks!

@nrvale0
Contributor

nrvale0 commented Jun 23, 2017

@hehailong5 helm/charts#1289 references a config option for Consul: -disable-host-node-id :: https://www.consul.io/docs/agent/options.html#_disable_host_node_id

I'm testing the use of that option in the Kubernetes Helm chart for Consul now and it's been working fine, but I've also not spent a lot of time digging into it to fully understand any downstream consequences. I should also point out that the host node ID gets cached if the container is using a persistent storage volume, so transitioning to -disable-host-node-id likely involves destroying/cleaning the container's previous storage volumes. The OP says that still does not solve it for him 100% of the time. I've not yet seen the combo of -disable-host-node-id + a fresh volume fail.

@preetapan
Member

@hehailong5, as @nrvale0 points out, you can set disable_host_node_id to true.

There's an open ticket for making that the default behavior in the 0.9.0 release: #3171

There shouldn't be any other downstream implications - it will generate a random node ID if that option is set. As long as you don't have any monitoring/Nagios alerts that rely on the values of node IDs in your environment, you should be fine.
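As a minimal sketch, assuming a JSON config file is used (the file name and paths are only examples; the same thing can be passed as the -disable-host-node-id CLI flag):

  # contents of e.g. /etc/consul.d/node-id.json
  { "disable_host_node_id": true }

  # then start the agent pointing at that config dir (other flags as usual)
  consul agent -config-dir=/etc/consul.d -data-dir=/consul/data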

@hehailong5

Oops, we already have such an option to prevent this from happening. Will try that out, thank you guys!

@slackpad slackpad added this to James in Consul 0.9.0 Jun 25, 2017
@slackpad slackpad moved this from James to Done in Consul 0.9.0 Jun 25, 2017
@slackpad slackpad moved this from Done to James in Consul 0.9.0 Jun 25, 2017
@slackpad slackpad moved this from James to Operator Usability in Consul 0.9.0 Jun 28, 2017
@mkielar

mkielar commented Jul 17, 2017

We have a very similar setup to the original issue by @eladitzhakian (Terraform + Consul running as an ECS task). What's different is that we're using an ECS Service to handle task deployment, so when we change the Consul agent task definition, no EC2 recreation happens.

Importantly, the tasks had the /consul/data directory mounted from the EC2 host, so that the node ID would be persistent.

Still, redeployment of the task causes the master to reject clients with a similar message:

Failed to join 172.17.41.250: Member '89ce6716ce9d' has conflicting node ID 'f6a14a58-0a19-e34e-815b-d3b55084803d' with member 'e3d06a586533'

What's weird in this situation is that this error message has nothing to do with the node I'm getting the log from. It looks like node A cannot join the cluster because nodes B and C have conflicting node IDs.

The workaround that worked for me was to use the -disable-host-node-id option together with one of the below (a rough container invocation is sketched after this list):

  1. either do a rm -rf /consul/data/* inside the Docker container before running Consul,
  2. or don't mount /consul/data at all.

Both ultimately force the node ID to be randomly picked each time the Docker container is started.
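A rough sketch of option 1 as a container invocation (the image tag, entrypoint handling, and join address are assumptions for illustration, not our exact task definition):

  docker run --entrypoint sh consul:0.9.0 -c \
    'rm -rf /consul/data/* && exec consul agent -disable-host-node-id -data-dir=/consul/data -retry-join=172.17.41.250'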

@slackpad slackpad removed this from Operator Usability in Consul 0.9.0 Jul 18, 2017
@slackpad slackpad added this to Triaged / Sorted in Consul 0.9.1 Jul 18, 2017
@slackpad slackpad moved this from Triaged / Sorted to Community in Consul 0.9.1 Jul 18, 2017
@slackpad slackpad moved this from Community to Triaged / Sorted in Consul 0.9.1 Jul 18, 2017
@relmos

relmos commented Jul 18, 2017

This also happens in my environment, on version 0.8.5. A conflict between any two nodes blocks every other node from joining. I pretty much had to clean the data, revert back to 0.7, and restore the KV store from backup.

I would really emphasize this behavior in the documentation for the benefit of anyone considering an upgrade.

@slackpad slackpad removed this from Triaged / Sorted in Consul 0.9.1 Aug 7, 2017
@sebamontini

Same thing here... I think this should be a big warning in the "breaking changes" part of the changelog. The config doc says that in 0.8.5 disable_host_node_id is set to true by default, but it never warns you that before then the node ID wasn't being checked.

This is a MAJOR issue when working with AMIs: you need to ensure that when you build the AMI of the client (or even the Consul servers) you stop the Consul agent and remove /var/consul/node-id... or else any instance launched from that AMI (e.g. an autoscaling group launching instances of the same app when it needs to scale) will break, since all of them will have the same ID.

One way to fix this would be to ensure that when the agent starts it removes/replaces the node-id file with a new ID (like it would do with a pidfile or something like that).

@slackpad
Contributor

Hi @sebamontini that's an interesting consequence of this. In general it's not a good idea to bake a data-dir into your AMIs since there are other things in there that could also cause issues like Raft data, Serf snapshots, etc. I'd definitely recommend that your AMI build process either doesn't start Consul to populate the data-dir, or shuts down Consul at the end and clears it out. There should be no need to keep anything other than Consul's configuration files in the AMI.
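For example, the tail end of an AMI build might look something like this (the systemd unit name and paths are assumptions; match them to your install):

  systemctl stop consul    # stop the agent that was started during provisioning/testing
  rm -rf /var/consul/*     # drops node-id, raft/, serf/ snapshots, etc.
  # only the configuration files (e.g. under /etc/consul.d) need to ship in the image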

@sebamontini

@slackpad it wasn't part of the plan; the automated process that bakes the AMI first provisions the vanilla instance and then installs everything, and we wanted to start up Consul to test that everything was running properly. We will add one more step to ensure we clear all the data.
Thanks.

@blalor
Contributor

blalor commented Aug 11, 2017

I'm running into this problem as well, trying to set up a 6 node "cluster" on my laptop, to test the migration from 0.7.5 to 0.9.2. I've got a script that sets up all 6 consul instances with unique data dirs, ports, etc. With 0.7.5 I can get a fully-working cluster. When I attempt to upgrade one client to 0.9.2 -- even with -disable-host-node-id=true or a unique node-id -- the new version fails to join the cluster because the other clients have conflicting node IDs. There's no way I can see to force a specific node ID in 0.7.5.

This only affects my proof-of-concept, not a production workload, but it does mean I'm gonna have to spin up a fleet of VMs, or just provision some real instances. :-(

@sebamontini

I think the problem, @blalor, is that in 0.7.5 the default was to use the host-based node ID, so when you upgrade to 0.9.x all 6 nodes are using the same ID. You could either delete the node-id data before upgrading each node, or create the 0.7.5 cluster with --disable-host-node-id=true.

@blalor
Contributor

blalor commented Aug 11, 2017

That option's not supported in 0.7.5.

flag provided but not defined: -disable-host-node-id

@Evertvz

Evertvz commented Aug 23, 2017

I'm having sort of the same issue. My servers and agents are on 0.7.4. All the agents are Docker containers, so all the agent IDs are the same.
When trying to upgrade to 0.9.2, Consul refuses to start because other nodes have conflicting IDs.
The option to disable host node IDs was added in 0.8.1, but that version already checks for conflicting node IDs.

Long story short, there is no Consul version that will generate a random node ID without also checking for conflicts. That would have allowed me to upgrade in two steps.

This is happening in our dev, test and production environment.

EDIT: I've decided to get rid of Consul inside the containers and solve the issue as described here: https://medium.com/zendesk-engineering/making-docker-and-consul-get-along-5fceda1d52b9
This allows me to phase Consul out of the containers and, as a second step, upgrade to v0.9.2 or higher. I'm putting it here in case it can help someone else.

@slackpad slackpad removed the waiting-reply Waiting on response from Original Poster or another individual in the thread label Aug 28, 2017
@slackpad
Contributor

@blalor for your use case in dev you can start each old 0.7.5 agent with something like -node-id=$(uuidgen | awk '{print tolower($0)}') to give it a unique ID. That option was available back then. You could also inject a UUID into each agent's data directory in a file called node-id, which will get picked up.
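Roughly, for each of the six local agents (the data-dir paths here are just examples):

  # pass a unique ID on the command line:
  consul agent -node-id=$(uuidgen | awk '{print tolower($0)}') -data-dir=/tmp/consul-agent1
  # or pre-seed the data directory before starting the agent:
  uuidgen | awk '{print tolower($0)}' > /tmp/consul-agent1/node-id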

@slackpad
Contributor

slackpad commented Aug 29, 2017

I'm thinking the best way to make this easier for operators is to only enforce unique host IDs for Consul agents running version 0.8.5 or later (that's when we made host-based IDs opt-in). This would be a small code change, and helps interoperability for folks that are skipping several major versions. If you have large pools of older agents this gets to be a pain.

@slackpad slackpad added the type/enhancement Proposed improvement or new feature label Aug 29, 2017
@slackpad slackpad added this to the 0.9.3 milestone Aug 29, 2017
@blalor
Contributor

blalor commented Aug 29, 2017

Thanks, @slackpad. I got past my issue and have completely upgraded to 0.9.2.
