Raft cannot handle node changing IP address #457

Closed
armon opened this Issue Nov 6, 2014 · 26 comments

armon (Member) commented Nov 6, 2014

If a server node changes its IP address, the gossip layer will properly handle the update, but the Raft peer set will not be updated. This will cause replication errors and potentially an outage.

This can be triggered by restarting a Docker container running a Consul server without doing a graceful leave.

rocketraman commented Nov 6, 2014

The reproduction recipe is here: https://groups.google.com/d/msg/consul-tool/pWj3rHdgdqY/PMXCywgXo28J.

That uses the progrium/consul docker container, which does not have leave_on_terminate=true set by default (currently). If the author accepts and fixes gliderlabs/docker-consul#34 then that will no longer be the case.

petemounce commented Jan 16, 2015

We use short-lived Windows instances - most last no more than a day. We refer to our instances by a logical name (e.g. web_001 through web_33). As instances come and go, we re-use the logical names to fill gaps before adding more at the top end.

This means nodes will come and go with different IPs, but the same node names, and it sounds like we'll be affected by this issue. As a workaround, should we inject some uniqueness into the node-name that we pick for the consul agent (such as the AWS instance ID)?

However, we'd prefer not to have to. We use logical names in the first place for 2 reasons:

  • it's easier to refer to nodes this way!
  • graphite's whisper storage backend pre-allocates space on disk when new metrics are seen. We avoid needing to reap storage by re-using logical names. This allows us to have continuity in our metrics and create alerts/dashboards more simply.

armon (Member) commented Jan 20, 2015

@petemounce This will not affect your case. It only applies when the server nodes themselves change IPs but not their node names. The clients can change IPs all day :)

blalor (Member) commented Feb 23, 2015

What is the proper way to change the IP (or the advertise_addr config option) on a server? Is there one?

armon (Member) commented Feb 23, 2015

It cannot be done currently. You need to gracefully remove the server first, then re-add it; Consul can't handle the in-place address change.

blalor (Member) commented Feb 23, 2015

So, “consul leave” on the host, change the IP or advertise_addr, then restart? That seems to confuse the agents in the cluster, which continue to show the old IP and a state of “left”.

armon (Member) commented Feb 23, 2015

Assuming the node name is the same, they shouldn't be confused. The IP address should update on the clients. If the node name changes, they will be confused since it looks like a different node. But effectively yes, the node is leaving and then re-joining with new configuration.

blalor (Member) commented Feb 24, 2015

That doesn't seem to be the case, unfortunately. I modified the config for node consul-000.us-east-1.aws.test.example.com to use a specific advertise_addr; leaving and re-joining with the new config is resulting in lots of

2015/02/24 00:05:10 [WARN] memberlist: Refuting a suspect message (from: consul-000.us-east-1.aws.test.example.com)

on the node I just modified, and messages like this on the other servers and agents in the cluster even minutes after reconfiguring:

2015/02/24 00:06:48 [INFO] serf: EventMemberJoin: consul-000.us-east-1.aws.test.example.com 11.222.33.444
2015/02/24 00:06:48 [INFO] consul: adding server consul-000.us-east-1.aws.test.example.com (Addr: 11.222.33.444:8300) (DC: us-east-1_aws_test)
2015/02/24 00:07:04 [INFO] serf: EventMemberFailed: consul-000.us-east-1.aws.test.example.com 11.222.33.444
2015/02/24 00:07:04 [INFO] consul: removing server consul-000.us-east-1.aws.test.example.com (Addr: 11.222.33.444:8300) (DC: us-east-1_aws_test)

armon (Member) commented Feb 24, 2015

@blalor Can you provide the DEBUG-level logs from that machine and maybe one other machine? This looks slightly different from the issue in this ticket. The ticket is that the Raft peer set cannot handle an IP update of a server, while this looks like a separate issue of join/fail events not converging.

blalor (Member) commented Feb 24, 2015

https://gist.github.com/blalor/60539004449c35fc079a

consul_debug.000 is for server node consul-000 which had its IP address changed from 10.130.0.248 to 11.222.33.444. consul_debug.001 is for server node consul-001 whose configuration was unchanged save for enabling debug logging.

armon (Member) commented Feb 24, 2015

@blalor It looks like consul-001 is unable to ping (directly or indirectly) consul-000:

[INFO] memberlist: Suspect consul-000.us-east-1.aws.test.example.com has failed, no acks received

This could mean there is some network issue preventing UDP packets between them, which is causing the flapping. Could you investigate possible network issues?

blalor (Member) commented Feb 24, 2015

Not anymore; I’ve rebuilt that cluster. :-)

discordianfish (Contributor) commented Apr 2, 2015

Assuming all servers have leave_on_terminate set, what are the clients supposed to do when the complete cluster is gone? Should they try to reconnect via the DNS name?
Then I'd at least have a workaround for this.

jakubzytka commented Dec 1, 2015

I'm determined to introduce IP change support in Consul.

I've hacked the code to allow that and it seems to work. I'd like to agree with you on the design of the final solution so that, possibly, my pull request could be integrated with mainline Consul.
@armon Please let me know your comments and concerns.

The requirements:

  • allow "old-style" behavior - identification of nodes by their IP address
  • allow identification of nodes by some unique (cluster-wide) identifier;
  • it is not required to provide online IP change support (i.e. you need to at least restart the agent to use a new IP)

OK, so here's the idea:

  • use node name as a "node address" (consistency with serf, web API etc.)
  • keep this node address in RaftLayer and in serverParts
  • use serf-based node address resolver in RaftLayer::Dial and ConnPool::getNewConn to resolve node address to proper IP when creating a new connection

Please note that no reverse resolution (IP->node address) is required.
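
For illustration only, here is a minimal sketch of that resolver idea. All names here (nameResolver, DialNode) are made up for the example and are not the actual Consul code; the point is just that the name-to-IP mapping is consulted at connection time, so a new IP is picked up on the next dial.

// Minimal sketch of a serf-fed name resolver used at dial time.
// Illustrative only; not the actual Consul/Raft transport code.
package resolver

import (
    "fmt"
    "net"
    "sync"
    "time"
)

// nameResolver maps logical node names to their last known TCP address,
// updated from serf membership events.
type nameResolver struct {
    mu    sync.RWMutex
    addrs map[string]string // node name -> "ip:port"
}

// Update would be called on serf member join/update events.
func (r *nameResolver) Update(nodeName, addr string) {
    r.mu.Lock()
    defer r.mu.Unlock()
    if r.addrs == nil {
        r.addrs = make(map[string]string)
    }
    r.addrs[nodeName] = addr
}

// DialNode resolves a node name to its current address right before
// connecting, so an IP change is picked up on the next connection attempt.
func (r *nameResolver) DialNode(nodeName string, timeout time.Duration) (net.Conn, error) {
    r.mu.RLock()
    addr, ok := r.addrs[nodeName]
    r.mu.RUnlock()
    if !ok {
        return nil, fmt.Errorf("no known address for node %q", nodeName)
    }
    return net.DialTimeout("tcp", addr, timeout)
}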

Correctness:
Obviously, there is a question of whether such an approach is correct.

Assumptions:

  • the Raft algorithm doesn't require a reliable network (i.e. network delays, partitions, packet loss, duplication, and re-ordering are allowed)
  • after code inspection I believe hashicorp's Raft implementation doesn't require a reliable network either
  • Consul doesn't identify a message's sender by its remote address (which is by nature an IP); instead, the sender's node identification is passed in the message (if needed); in general, messages are valid or not regardless of who sends them - it is their content that matters
  • nodes are identified (in Raft, for RPC) by their address, but there is no requirement that this is a TCP address. Thus, it can be an arbitrary node address without affecting Raft/consistency.

Observations:

  • If IP addresses are never re-used, the approach is correct, because an IP change is effectively seen as a transient network problem.
  • If an IP address is re-used after some reasonably long time, the scenario reduces to a transient network problem as well.

This is good enough for me, because that covers real-world scenarios I need to handle.

However, I believe that even in the case of rapid IP address changes the approach stays correct. The new case to consider is when messages reach a different destination than intended because the serf data is not up to date. Still, because it is the message content that matters, not the sender, all invalid requests will be dropped (even now there must be support for handling stray or delayed messages). There is a risk that some valid requests are dropped, but this affects only efficiency, not correctness.

Obviously, this is hardly a proof of correctness. I do not intend to perform formal verification though. Is it good enough for you?

I've looked over the web API and I think this change doesn't affect it. I hope I haven't broken anything.

discordianfish (Contributor) commented Dec 2, 2015

FWIW, my workaround worked OK-ish - until a node hard-crashes and you need to replace it.
@jakubzytka's design sounds reasonable to me, but I'm wondering what happens if you end up with two nodes using the same node name.

jakubzytka commented Dec 2, 2015

Two nodes cannot have the same node name; that's a serf requirement.
Right now an error is logged by serf (and the cluster doesn't form, I guess) should such a thing happen:

    2015/12/02 12:08:06 [ERR] memberlist: Conflicting address for blahblah. Mine: 192.168.9.3:8301 Theirs: 192.168.9.1:8301
    2015/12/02 12:08:06 [ERR] serf: Node name conflicts with another node at 192.168.9.1:8301. Names must be unique! (Resolution enabled: false)

deltaroe commented Jan 28, 2016

This would be far less of an issue in my deployment if I had a mechanism to kick out dead raft peers belonging to nodes that serf thinks are alive again.

In my environment, changing a running server's IP isn't the issue; it's that if a server node fails, there's a decent chance someone will not follow procedure and will re-launch it with the same name but a different IP address, without force-leaving the failed node first. Serf will think everything is OK and all nodes will show as alive, but there's an orphaned raft peer lying around.

Detecting the orphaned raft peer is easy enough with a monitoring system, by comparing the number of raft peers with the number of Consul servers. When that alert triggers, it would normally be a simple matter of issuing a force-leave command for the failed node; however, force-leave currently requires the node being evicted to still exist in serf. If someone doesn't follow procedure and re-launches a failed node under the old name (and EC2 assigns it a different IP address), the only option is to bring the entire cluster down to update the peers.json file.

If the force-leave command could be extended (or a new command added) to be able to kick out an orphaned raft peer without having to shut down everything, this becomes much less of an issue, for me at least.
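
For illustration, here is a rough sketch of that peer-count check as a standalone program. It assumes the Consul HTTP API via the Go client (github.com/hashicorp/consul/api), that server members carry the serf tag role=consul, and that serf status 1 means "alive"; treat it as a sketch rather than a drop-in monitor.

// Sketch: alert when the Raft peer count disagrees with the number of
// live Consul servers reported by serf.
package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Raft peer set as seen by the cluster.
    peers, err := client.Status().Peers()
    if err != nil {
        log.Fatal(err)
    }

    // Live server members as seen by serf.
    members, err := client.Agent().Members(false)
    if err != nil {
        log.Fatal(err)
    }
    servers := 0
    for _, m := range members {
        if m.Tags["role"] == "consul" && m.Status == 1 { // assumption: 1 == alive
            servers++
        }
    }

    if len(peers) != servers {
        fmt.Printf("ALERT: %d raft peers but %d live servers (possible orphaned peer)\n",
            len(peers), servers)
    }
}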

jakubzytka commented Feb 1, 2016

@deltaroe You can work around your issue by scripting the startup rather than relying on a manual procedure. Just check and persist the IP when starting Consul, and on every restart re-check that IP; if it changed, remove the old data and start the node with a new name. Or, alternatively, use node names that contain the IP. You'll never have the same node name for different IPs, and you will be able to remove stray peers with force-leave.
The problem (for me) is that both these approaches require a quorum of nodes to be alive, whereas my solution works when there is no quorum.
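
A rough sketch of that scripted startup, under stated assumptions: the file paths, the node-name scheme, and the way the advertise IP is picked are all illustrative choices, not anything Consul does itself.

// Sketch: detect an IP change across restarts and reset state accordingly.
package main

import (
    "fmt"
    "log"
    "net"
    "os"
    "strings"
)

const (
    ipFile  = "/var/lib/consul-bootstrap/last-ip" // hypothetical path
    dataDir = "/var/lib/consul"                   // hypothetical Consul data dir
)

// advertiseIP picks the address this agent would advertise; here simply the
// first non-loopback IPv4 address on the host.
func advertiseIP() (string, error) {
    addrs, err := net.InterfaceAddrs()
    if err != nil {
        return "", err
    }
    for _, a := range addrs {
        if ipnet, ok := a.(*net.IPNet); ok && !ipnet.IP.IsLoopback() && ipnet.IP.To4() != nil {
            return ipnet.IP.String(), nil
        }
    }
    return "", fmt.Errorf("no suitable address found")
}

func main() {
    current, err := advertiseIP()
    if err != nil {
        log.Fatal(err)
    }

    previous := ""
    if b, err := os.ReadFile(ipFile); err == nil {
        previous = strings.TrimSpace(string(b))
    }

    if previous != "" && previous != current {
        // The IP changed since the last run: wipe the old state so the agent
        // starts fresh instead of dragging the stale peer entry along.
        if err := os.RemoveAll(dataDir); err != nil {
            log.Fatal(err)
        }
    }
    if err := os.WriteFile(ipFile, []byte(current+"\n"), 0o644); err != nil {
        log.Fatal(err)
    }

    // Embed the IP in the node name so a stale peer is always distinguishable
    // and can be removed with force-leave.
    nodeName := "consul-" + strings.ReplaceAll(current, ".", "-")
    fmt.Println("exec: consul agent -server -node=" + nodeName + " -advertise=" + current)
}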

mpalmer commented Apr 5, 2016

I don't suppose someone from Hashicorp could give @jakubzytka some feedback on the proposed design? This problem has just bitten us badly, and although I'm implementing workarounds, it'd be nice if this problem was solved in consul.

slackpad (Contributor) commented Apr 5, 2016

@mpalmer sorry you got bit by this.

We are currently working on some improvements for Raft's management of config changes but we want to be super careful we do this in the best way. We are currently leaning towards adding a cluster-wide GUID that comes from the memberlist layer and is used to track identity regardless of IP and node name, so we are working through the implications of that.
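
Purely as an illustration of that direction (not the actual implementation), the idea is that each node generates and persists a random ID once and then advertises it through the gossip layer, so identity survives both IP and node-name changes. The path and tag name below are assumptions.

// Sketch: a stable per-node identifier, generated once and persisted.
package main

import (
    "crypto/rand"
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

// loadOrCreateNodeID returns a persistent random identifier for this node.
func loadOrCreateNodeID(dataDir string) (string, error) {
    path := filepath.Join(dataDir, "node-id")
    if b, err := os.ReadFile(path); err == nil {
        return strings.TrimSpace(string(b)), nil
    }
    buf := make([]byte, 16)
    if _, err := rand.Read(buf); err != nil {
        return "", err
    }
    id := fmt.Sprintf("%x-%x-%x-%x-%x", buf[0:4], buf[4:6], buf[6:8], buf[8:10], buf[10:16])
    if err := os.WriteFile(path, []byte(id+"\n"), 0o600); err != nil {
        return "", err
    }
    return id, nil
}

func main() {
    id, err := loadOrCreateNodeID("/var/lib/consul") // hypothetical data dir
    if err != nil {
        panic(err)
    }
    // The ID would then be carried as a gossip tag (e.g. tags["id"] = id) and
    // the Raft peer set keyed on it instead of the raw IP address.
    fmt.Println("node id:", id)
}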

jakubzytka commented Apr 5, 2016

@mpalmer If you badly need a solution you can try my patched Consul. It handles changing the IP address of a node as long as the node name stays the same.
The code is available at https://github.com/jakubzytka/consul/tree/ipChangeSupport
The branch is based on consul v0.6 if I remember correctly, but I guess it should apply cleanly over the newest version.
We've been using it in "staging" for a few months without issues.

sakshigeminisys commented Jul 6, 2016

Hi Guys

I am running a containerized single-node Consul cluster with volumes attached. When I bring this single-node cluster up for the first time, the leader is elected successfully and I see the following entry added to peers.json:
["172.17.4.162:8300"] - this is correct, as 172.17.4.162 is the IP of my container.
Now I remove this container and make sure that it exits gracefully, as I have set "leave_on_terminate": true in my configuration file. After the exit, peers.json returns null. Up to this point, everything seems good. Now, when I restart the single-node cluster, the IP assigned to the new container is 172.17.4.170, and ["172.17.4.170:8300"] is added to peers.json successfully; this is the only value existing in peers.json.
In spite of this, the Consul deployment fails. The new node somehow tries to connect to the previous IP 172.17.4.162, which has already been deleted from peers.json. Here are the logs:
2016/07/06 20:10:50 [INFO] serf: EventMemberJoin: consul 172.17.4.170
2016/07/06 20:10:50 [INFO] serf: EventMemberJoin: consul.dc1 172.17.4.170
2016/07/06 20:10:50 [INFO] raft: Node at 172.17.4.170:8300 [Follower] entering Follower state
2016/07/06 20:10:50 [INFO] consul: adding LAN server consul (Addr: 172.17.4.170:8300) (DC: dc1)
2016/07/06 20:10:50 [INFO] consul: adding WAN server consul.dc1 (Addr: 172.17.4.170:8300) (DC: dc1)
2016/07/06 20:10:50 [ERR] agent: failed to sync remote state: No cluster leader
2016/07/06 20:10:52 [WARN] raft: Heartbeat timeout reached, starting election
2016/07/06 20:10:52 [INFO] raft: Node at 172.17.4.170:8300 [Candidate] entering Candidate state
2016/07/06 20:10:52 [INFO] raft: Election won. Tally: 1
2016/07/06 20:10:52 [INFO] raft: Node at 172.17.4.170:8300 [Leader] entering Leader state
2016/07/06 20:10:52 [INFO] consul: cluster leadership acquired
2016/07/06 20:10:52 [INFO] consul: New leader elected: consul
2016/07/06 20:10:52 [INFO] raft: Disabling EnableSingleNode (bootstrap)
2016/07/06 20:10:52 [INFO] raft: Added peer 172.17.4.162:8300, starting replication
2016/07/06 20:10:52 [INFO] raft: Removed peer 172.17.4.162:8300, stopping replication (Index: 18)
2016/07/06 20:10:52 [INFO] consul: member 'consul' joined, marking health alive
2016/07/06 20:10:53 [INFO] agent: Synced service 'consul'
2016/07/06 20:10:55 [ERR] raft: Failed to heartbeat to 172.17.4.162:8300: dial tcp 172.17.4.162:8300: getsockopt: no route to host
2016/07/06 20:10:55 [ERR] raft: Failed to AppendEntries to 172.17.4.162:8300: dial tcp 172.17.4.162:8300: getsockopt: no route to host
2016/07/06 20:10:58 [ERR] raft: Failed to heartbeat to 172.17.4.162:8300: dial tcp 172.17.4.162:8300: getsockopt: no route to host

Could anyone help me find the reason?

viovanov added commits to SUSE/scf that referenced this issue on Nov 3-4, 2016

Fix consul so there's one consul name per ip
Based on the following it looks like there should be a 1-1 relationship between a node name and its IP address:
hashicorp/consul#457

slackpad added this to the 0.8.0 milestone Nov 22, 2016

Hronom commented Jan 26, 2017

Any progress on this?
I use Consul in single-node mode. When the container restarts, the IP address changes and Consul cannot start because it remembers its previous IP address.
Is there a way to run Consul reliably in single-node mode?
My config (Docker Compose):

version: '2'
services:
  consul:
    image: consul:0.7.2
    ports:
      - "8500:8500"
      - "8600:8600/tcp"
      - "8600:8600/udp"
    # https://github.com/hashicorp/consul/issues/166#issuecomment-233711577
    command: agent -server -bootstrap -ui -client 0.0.0.0

yellowmegaman commented Feb 3, 2017

Highly interested as well. When using Docker swarm mode, I get new IPs almost every time.

slackpad (Contributor) commented May 3, 2017

Closing this in favor of #1580.

slackpad closed this May 3, 2017
