
Failed to join: Member has conflicting node ID #3070

Closed
eladitzhakian opened this issue May 24, 2017 · 22 comments
Labels
type/enhancement Proposed improvement or new feature

Comments

@eladitzhakian

I'm using dockerized Consul agents and servers, version 0.8.1, and I'm also using Terraform to manage the infrastructure.

Whenever I relaunch (destroy + create) a machine, or simply reboot it, the Consul agent fails to rejoin the cluster with this error:

* Failed to join 10.136.60.53: Member 'abee16f194f7' has conflicting node ID 'ac5ec7af-5238-4c18-988a-0385d0a0f477' with this agent's ID

The Consul servers show the corresponding error:

 2017/05/24 13:32:43 [INFO] consul.fsm: EnsureRegistration failed: failed inserting node: node ID "ac5ec7af-5238-4c18-988a-0385d0a0f477" for node "abee16f194f7" aliases existing node "8c38f1b65cea"

I tried everything the mailing list suggests (roughly the commands sketched at the end of this comment):

  1. use randomly generated node-id
  2. use a constant unique node-id
  3. delete the agent's data dir
  4. force leave the failing/missing nodes
  5. use disable host node id

Nothing works.

Lastly I downgraded to 0.7.5 (where uniqueness is not enforced) and the agent was able to rejoin the cluster. What are my options?
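Roughly, the commands behind those attempts look like this (a sketch only: the data-dir path and placeholder UUID are illustrative, the join address and node name are just the ones from the log above):

  # 1. randomly generated node ID at startup
  consul agent -node-id=$(uuidgen | awk '{print tolower($0)}') -data-dir=/consul/data -retry-join=10.136.60.53
  # 2. constant unique node ID for this machine (placeholder UUID)
  consul agent -node-id=<fixed-uuid-for-this-machine> -data-dir=/consul/data -retry-join=10.136.60.53
  # 3. delete the agent's data dir before starting
  rm -rf /consul/data/*
  # 4. force-leave the failed member, by node name (not IP)
  consul force-leave abee16f194f7
  # 5. don't derive the node ID from the host
  consul agent -disable-host-node-id -data-dir=/consul/data -retry-join=10.136.60.53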

@slackpad
Contributor

Hi @eladitzhakian this'll be generally covered by #1580 once that's complete, but there should be a way to get this working in the meantime.

Option 4 should work no matter what - are you sure the force-leave is actually working? It's by node name, not IP. To understand this better, what's preserved when you restart or relaunch a machine? Is it the same node name before and after?

@slackpad slackpad added the waiting-reply Waiting on response from Original Poster or another individual in the thread label May 25, 2017
@eladitzhakian
Author

Thanks for your reply @slackpad.
I'm sure force leave is working; I'm using the node name and tailing the servers' logs. What I generally do is try to force-leave all the failing nodes, but they keep coming back with a new ID. The servers attempt to reconnect and re-register them for some reason.

Usually the same node name is preserved, but as I said, it doesn't help when I force a new node ID or ask Consul to disable the host node ID. I'm getting the same error with the new node ID :/

@eladitzhakian eladitzhakian changed the title from "Failed to join: Memeber has conflicting node ID" to "Failed to join: Member has conflicting node ID" May 30, 2017
@slackpad
Contributor

Hmm do you have a simple repro case that'll show the force-leave not working? I'm not sure what's happening with this one.

@eladitzhakian
Author

I'm afraid this is not easily reproduced, but it does happen on both clusters I'm maintaining. Each cluster has 3 Consul server nodes and about 20 nodes running a single agent, each with a Registrator container running next to it. Everything is dockerized.

@nrvale0
Contributor

nrvale0 commented Jun 16, 2017

@eladitzhakian I've been seeing a similar issue when deploying with the Kubernetes Helm chart. There are two places I've noticed possible conflicts:

  1. Where gopsutil computes the same Consul node ID for two Pods/containers because they are running on the same underlying k8s worker node/minion.
  2. Because the Consul Pod is running in k8s as a StatefulSet with PersistentVolumeClaims for Consul storage, a rebuild of the Consul stack can result in the new Pods re-using volumes and data directories from a previous StatefulSet.

You didn't mention whether you are running on k8s, but I suspect this might be a problem for other cluster managers/orchestration platforms with similar concepts of StatefulSets and PersistentVolumes.
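If re-used volumes are suspected, one quick check is whether the data directory already carries a cached ID (the path below assumes the stock Consul Docker image layout; adjust it to whatever volume mount is in use):

  cat /consul/data/node-id   # cached node ID left behind by a previous Pod
  rm /consul/data/node-id    # the agent generates a fresh ID on the next start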

@eladitzhakian
Author

@nrvale0 thanks, but I'm not using k8s or any orchestration manager :/

@hehailong5

Hi All,

The same is observed in 0.8.4, and I am clustering Consul in Docker Swarm using a dedicated overlay network.

@slackpad, what's the possible workaround for this issue in the current release? And what's the schedule for delivering the fix?

Thanks!

@nrvale0
Contributor

nrvale0 commented Jun 23, 2017

@hehailong5 helm/charts#1289 references a config option for Consul: -disable-host-node-id :: https://www.consul.io/docs/agent/options.html#_disable_host_node_id

I'm testing the use of that option in the Kubernetes Helm chart for Consul now and it's been working fine, but I've also not spent a lot of time digging into it to fully understand any downstream consequences. I should also point out that the host node ID gets cached if the container is using a persistent storage volume, so transitioning to -disable-host-node-id likely involves destroying/cleaning the container's previous storage volumes. The OP says that still does not solve it for him 100% of the time. I've not yet seen the combo of -disable-host-node-id + a fresh volume fail.

@preetapan
Member

@hehailong5, as @nrvale0 points out, you can set disable_host_node_id to true.

There's an open ticket for making that the default behavior in the 0.9.0 release: #3171

There shouldn't be any other downstream implications - it will generate a random node ID if that option is set. As long as you don't have any monitoring/Nagios alerts that rely on the values of node IDs in your environment, you should be fine.
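As a minimal sketch, assuming a JSON config file is used (the file name and paths are only examples; the same thing can be passed as the -disable-host-node-id CLI flag):

  # contents of e.g. /etc/consul.d/node-id.json
  { "disable_host_node_id": true }

  # then start the agent pointing at that config dir (other flags as usual)
  consul agent -config-dir=/etc/consul.d -data-dir=/consul/data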

@hehailong5

Oops, we already have such an option to prevent this from happening. Will try that out, thank you guys!

@slackpad slackpad added this to James in Consul 0.9.0 Jun 25, 2017
@slackpad slackpad moved this from James to Done in Consul 0.9.0 Jun 25, 2017
@slackpad slackpad moved this from Done to James in Consul 0.9.0 Jun 25, 2017
@slackpad slackpad moved this from James to Operator Usability in Consul 0.9.0 Jun 28, 2017
@mkielar

mkielar commented Jul 17, 2017

We have a very similar setup to the original issue by @eladitzhakian (Terraform + Consul running as an ECS task). What's different is that we're using an ECS Service to handle task deployment, so when we change the Consul agent task definition, no EC2 recreation happens.

Importantly, the tasks had the /consul/data directory mounted from the EC2 host, so that the node ID would be persistent.

Still, redeployment of the task causes the master to reject clients with a similar message:

Failed to join 172.17.41.250: Member '89ce6716ce9d' has conflicting node ID 'f6a14a58-0a19-e34e-815b-d3b55084803d' with member 'e3d06a586533'

What's weird in this situation is that this error message has nothing to do with the node I'm getting the log from. It looks like node A cannot join the cluster because nodes B and C have conflicting node IDs.

The workaround that worked for me was to use the -disable-host-node-id option together with one of the below (a rough container invocation is sketched after this list):

  1. either do a rm -rf /consul/data/* inside the Docker container before running Consul,
  2. or don't mount /consul/data at all.

Both ultimately force the node ID to be randomly picked each time the Docker container is started.
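A rough sketch of option 1 as a container invocation (the image tag, entrypoint handling, and join address are assumptions for illustration, not our exact task definition):

  docker run --entrypoint sh consul:0.9.0 -c \
    'rm -rf /consul/data/* && exec consul agent -disable-host-node-id -data-dir=/consul/data -retry-join=172.17.41.250'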

@slackpad slackpad removed this from Operator Usability in Consul 0.9.0 Jul 18, 2017
@slackpad slackpad added this to Triaged / Sorted in Consul 0.9.1 Jul 18, 2017
@slackpad slackpad moved this from Triaged / Sorted to Community in Consul 0.9.1 Jul 18, 2017
@slackpad slackpad moved this from Community to Triaged / Sorted in Consul 0.9.1 Jul 18, 2017
@relmos

relmos commented Jul 18, 2017

This also happens in my environment, on version 0.8.5. A conflict between any two nodes blocks every other node from joining. I pretty much had to clean the data, revert back to 0.7, and restore the KV store from backup.

I would really emphasize this behavior in the documentation for the benefit of anyone considering an upgrade.

@slackpad slackpad removed this from Triaged / Sorted in Consul 0.9.1 Aug 7, 2017
@sebamontini

Same thing here... I think this should be a big warning in the "breaking changes" part of the changelog. The config doc says that in 0.8.5 disable_host_node_id is set to true by default, but it never warns you that before then the node ID wasn't being checked.

This is a MAJOR issue when working with AMIs: you need to ensure that when you build the AMI of the client (or even the Consul servers) you stop the Consul agent and remove /var/consul/node-id... or else any instance launched from that AMI (e.g. an autoscaling group launching instances of the same app when it needs to scale) will break, since all of them will have the same ID.

One way to fix this would be to ensure that when the agent starts it removes/replaces the node-id file with a new ID (like it would do with a pidfile or something like that).

@slackpad
Contributor

Hi @sebamontini that's an interesting consequence of this. In general it's not a good idea to bake a data-dir into your AMIs since there are other things in there that could also cause issues like Raft data, Serf snapshots, etc. I'd definitely recommend that your AMI build process either doesn't start Consul to populate the data-dir, or shuts down Consul at the end and clears it out. There should be no need to keep anything other than Consul's configuration files in the AMI.
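For example, the tail end of an AMI build might look something like this (the systemd unit name and paths are assumptions; match them to your install):

  systemctl stop consul    # stop the agent that was started during provisioning/testing
  rm -rf /var/consul/*     # drops node-id, raft/, serf/ snapshots, etc.
  # only the configuration files (e.g. under /etc/consul.d) need to ship in the image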

@sebamontini

@slackpad it wasn't part of the plan; the automated process that bakes the AMI first provisions the vanilla instance and then installs everything, and we wanted to start up Consul to test that everything was running properly. We will add one more step to ensure we clear all the data.
Thanks.

@blalor
Contributor

blalor commented Aug 11, 2017

I'm running into this problem as well, trying to set up a 6 node "cluster" on my laptop, to test the migration from 0.7.5 to 0.9.2. I've got a script that sets up all 6 consul instances with unique data dirs, ports, etc. With 0.7.5 I can get a fully-working cluster. When I attempt to upgrade one client to 0.9.2 -- even with -disable-host-node-id=true or a unique node-id -- the new version fails to join the cluster because the other clients have conflicting node IDs. There's no way I can see to force a specific node ID in 0.7.5.

This only affects my proof-of-concept, not a production workload, but it does mean I'm gonna have to spin up a fleet of VMs, or just provision some real instances. :-(

@sebamontini

I think the problem, @blalor, is that in 0.7.5 the default was to use the host-based node ID, so when you upgrade to 0.9.x all 6 nodes are using the same ID. You could either delete the node-id data before upgrading each node, or create the 0.7.5 cluster with --disable-host-node-id=true.

@blalor
Contributor

blalor commented Aug 11, 2017

That option's not supported in 0.7.5.

flag provided but not defined: -disable-host-node-id

@Evertvz

Evertvz commented Aug 23, 2017

I'm having sort of the same issue. My servers and agents are on 0.7.4. All the agents are Docker containers, so all the agent IDs are the same.
When trying to upgrade to 0.9.2, Consul refuses to start because other nodes have conflicting IDs.
The option to disable host node IDs was added in 0.8.1, but that version already checks for conflicting node IDs.

Long story short, there is no Consul version that will generate a random node ID without also checking for conflicts. That would have allowed me to upgrade in two steps.

This is happening in our dev, test and production environment.

EDIT: I've decided to get rid of Consul inside the containers and solve the issue as described here: https://medium.com/zendesk-engineering/making-docker-and-consul-get-along-5fceda1d52b9
This allows me to phase Consul out of the containers and, as a second step, upgrade to v0.9.2 or higher. I'm putting it here in case it can help someone else.

@slackpad slackpad removed the waiting-reply Waiting on response from Original Poster or another individual in the thread label Aug 28, 2017
@slackpad
Contributor

@blalor for your use case in dev you can start each old 0.7.5 agent with something like -node-id=$(uuidgen | awk '{print tolower($0)}') to give it a unique ID. That option was available back then. You could also inject a UUID into each agent's data directory in a file called node-id, which will get picked up.
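Roughly, for each of the six local agents (the data-dir paths here are just examples):

  # pass a unique ID on the command line:
  consul agent -node-id=$(uuidgen | awk '{print tolower($0)}') -data-dir=/tmp/consul-agent1
  # or pre-seed the data directory before starting the agent:
  uuidgen | awk '{print tolower($0)}' > /tmp/consul-agent1/node-id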

@slackpad
Contributor

slackpad commented Aug 29, 2017

I'm thinking the best way to make this easier for operators is to only enforce unique host IDs for Consul agents running version 0.8.5 or later (that's when we made host-based IDs opt-in). This would be a small code change, and helps interoperability for folks that are skipping several major versions. If you have large pools of older agents this gets to be a pain.

@slackpad slackpad added the type/enhancement Proposed improvement or new feature label Aug 29, 2017
@slackpad slackpad added this to the 0.9.3 milestone Aug 29, 2017
@blalor
Contributor

blalor commented Aug 29, 2017

Thanks, @slackpad. I got past my issue and have completely upgraded to 0.9.2.
