client gets repeatedly evicted from cluster #352
Comments
@imkira This looks like an issue with UDP messages not being properly delivered. This pattern of suspected failure / re-join flapping continuously is exactly what is expected if the client is not handling the UDP messages. Verify that the UDP messages are going through on the UDP port (8301 by default).
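A quick way to check this (a sketch; the interface name and agent address are placeholders for your own setup):

```sh
# Watch for Serf LAN gossip arriving on the default UDP port
sudo tcpdump -n -i eth0 udp port 8301

# Or push a test datagram at the agent's Serf port from another node
echo ping | nc -u -w1 <agent-ip> 8301
```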
@armon thanks for the prompt reply. The weird part is that, if I had a firewall problem, it shouldn't go away even when I restart the client consul, right? (By "restart" I mean the process itself, not the machine where it runs.) But the problem does go away. I have also noticed that this problem happens not after "1 day of operation" like I guessed before, but rather when I put my dev machine to sleep. Anyway, this is my dev machine and it shouldn't happen (I suppose) in production.

I am using docker for running consul1, consul2 and consul3 (servers) and exposing 8300, 8301, 8302, 8400, 8500 and 8600/udp. Docker is running on top of boot2docker (since I am running it on OSX) and the client agent is running on the host machine itself. I have no firewall settings blocking the docker VM (boot2docker) or the docker instances from the host, and the host is able to contact the docker instances at the exposed ports listed above. It appears I should also have 8301/udp listed as you said, but that doesn't seem to be the root of this problem, since it runs fine for hours until I put the machine to sleep.

I haven't looked at the part of the implementation where you are "handling the UDP messages", but I am assuming client agents listen on 8301/udp and wait for gossip messages(?). If that is true, might it be that the UDP server running on the client agent gets somewhat "confused" after waking from sleep?
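For reference, publishing the Serf LAN UDP port as well would look something like this (a sketch built from the port list above; image and flags follow the progrium/consul examples later in the thread):

```sh
docker run -d --name consul1 \
  -p 8300:8300 -p 8301:8301 -p 8301:8301/udp -p 8302:8302 \
  -p 8400:8400 -p 8500:8500 -p 8600:8600/udp \
  progrium/consul -server -bootstrap
```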
Ahhh okay, this clears it up. There is a known issue with Docker's networking that causes this. Some kind of ARP cache causes improper routing to take place and breaks the UDP routing. Not sure what exactly the issue is, but we've noticed this happens any time there is a rapid setup/teardown of Docker networking. The only known work-around is to "sleep" at least 3-5 minutes between the teardown and re-join of the network.
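In scripted environments that tear a container down and bring it right back up, that work-around amounts to something like this (container name assumed):

```sh
docker stop consul
sleep 300   # give stale ARP/conntrack entries time to expire (3-5 minutes)
docker start consul
```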
Well that explains it all!
I've also had such a case, but it turned out it wasn't a UDP problem.
An alternative solution if you're using Docker might be to use host networking for the Consul container. This should (I believe) prevent any use of NAT and conntrack.
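That would look roughly like this (a sketch; with --net=host the -p port mappings become unnecessary, since the agent binds directly to the host's interfaces):

```sh
docker run -d --net=host --name consul \
  progrium/consul -server -bootstrap
```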
So does consul still perform correctly when having these issues? Could you explain a fix for it if not? I'm a little confused.
@outrunthewolf these messages come from way down in the gossip layer (memberlist), indicating a network problem between nodes. If nodes aren't able to successfully send ping/ack messages to one another over UDP, then you will likely see members flapping between dead/alive state. I would try the suggestions above from @arthurbarr and @dennybaa and see if you can get any mileage out of either solution, or as @armon suggested, if you are doing rapid teardown/startup of docker networking, you might want to add a pause in between. Let us know if you are still having trouble!
I've also solved this problem in the past by stopping consul on the problem node, forcibly deregistering the problem node using the consul dashboard, and restarting consul on the node.
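Roughly the same recovery done from the CLI instead of the dashboard (the node name is a placeholder):

```sh
consul leave                    # on the problem node: gracefully stop the agent
consul force-leave <node-name>  # from a healthy node: force-deregister it
# ...then restart the agent on the problem node
```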
I also noticed that if the network interface whose IP address consul is announcing to the other nodes changes address, consul stops working. Only a restart fixes it.
If you are on CoreOS, running
@brycekahle & @dennybaa Thank you! This needs to be shared far and wide. :)
A scriptable option on CoreOS:
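Presumably the same conntrack container invocation quoted verbatim later in this thread:

```sh
# Flush the host's conntrack table from a privileged container
docker run --net=host --privileged --rm cap10morgan/conntrack -F
```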
Hi @cap10morgan, is the Dockerfile for the conntrack image available somewhere? Thanks!
@jlordiales Yep, I've made it an automated build.
@cap10morgan Thanks man! Really appreciate it
Hello, same issue here; it keeps adding and removing server nodes. I'm running consul within a docker (1.4.1) container with 3 server agent nodes (staging3, staging4, staging5).

EDIT: attempted to run conntrack (`docker run --net=host --privileged --rm cap10morgan/conntrack -F`) on my 3 nodes and restarted consul; same issue.
This is my port configuration:

```
progrium/consul:latest   "/bin/start -server   47 minutes ago   Up 46 minutes   53/tcp, 172.17.42.1:53->53/udp, 192.168.1.190:8300->8300/tcp, 192.168.1.190:8301->8301/tcp, 192.168.1.190:8301->8301/udp, 192.168.1.190:8302->8302/tcp, 192.168.1.190:8302->8302/udp, 192.168.1.190:8400->8400/tcp, 192.168.1.190:8500->8500/tcp   consul
```

Any hint?
@florentvaldelievre You can use my docker conntrack container. See my comment above.
Thanks @cap10morgan. Quoted from the page:

> Quickly restarting a node using the same IP issue
> When testing a cluster scenario, you may kill a container and restart it again on the same host and see that it has trouble re-joining the cluster. There is an issue when you restart a node as a new container with the same published ports that will cause heartbeats to fail and the node will flap. This is an ARP table caching problem. If you wait about 3 minutes before starting again, it should work fine. You can also manually reset the cache.

Questions

Thanks
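For reference, manually resetting the caches the quoted passage mentions would look roughly like this on the Docker host (a sketch; exact tooling varies by distro):

```sh
conntrack -F         # flush the connection-tracking table
ip neigh flush all   # flush the ARP/neighbour cache
```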
Hacky or not, your option 3 fixes it. That's what I'm doing. Otherwise ask Jeff Lindsay to bake it into the progrium/consul Docker image. :)
We're building this into a new version of the Consul container, coming soon. If anybody wants to submit a PR against the current one, that's fine too.
I ran into similar problems. I'm on Centos 6.4, Docker 1.6.

The first host:

```sh
$ export HOST_IP=10.241.232.14
$ docker run -d --name consul -h $(hostname --fqdn) -p 8500:8500 -p 8400:8400 -p 8300:8300 -p 8301:8301 -p 8301:8301/udp -p 8302:8302 -p 8302:8302/udp -p 8600:53 -p 8600:53/udp progrium/consul -server -advertise=$HOST_IP -bootstrap -ui-dir /ui
```

Second one:

```sh
$ export HOST_IP=10.241.232.13
$ docker run -d --name consul -h $(hostname --fqdn) -p 8500:8500 -p 8400:8400 -p 8300:8300 -p 8301:8301 -p 8301:8301/udp -p 8302:8302 -p 8302:8302/udp -p 8600:53 -p 8600:53/udp progrium/consul -server -join 10.241.232.14 -advertise=$HOST_IP -ui-dir /ui
```

Falling into this issue after the
I'm seeing this problem as well. Is there a ticket with docker open about this bug?
@ekristen moby/moby#8795 is tracking the issue.
```sh
apt-get install conntrack
conntrack -F
```

This seemed to work for me.
I think I have hit the same issue; here is my observation. I am running a container with consul installed inside. Then I hit the issue. I use tcpdump to watch the UDP traffic... On the host's docker0 interface, I bet there was an existing conntrack tuple that occupied sport 51086, hence 1026 got allocated. This seems to mess the gossip protocol up. When I assign another port, say 41086 instead of 51086, and restart the container, everything is back to normal. The tcpdump would look like this:

Is it possible that the source port being different from the serf_lan configuration will mess the gossip protocol up? (Just for the record: the DNAT is working; it maps ETH0:51086 to CONTAINER:51086 correctly.)

I didn't try conntrack -F yet. I don't have a repro now; maybe I should try that next time.
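A hedged sketch of how one might check for such a stale tuple (port numbers taken from the description above; adjust to your own serf_lan setting):

```sh
# List conntrack entries involving the configured Serf source port
conntrack -L -p udp | grep 51086

# Compare source ports seen on the bridge against the configured one
tcpdump -n -i docker0 udp port 51086
```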
Oh, here is how I can reproduce it: I restart my container quickly, using maestro-ng stop and start. I am hoping Consul could change its gossip protocol to accept "ping" from a different source port than the one configured.
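Per the progrium/consul docs quoted above, the same flapping can presumably be reproduced with plain docker by re-creating a container immediately with the same published ports (a sketch; run flags elided):

```sh
docker rm -f consul
docker run -d --name consul ...   # same published ports, started right away
```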
Hi, I don't know exactly how to reproduce this, only that it occurs frequently when clients run for as long as 1 day of operation.
My setup is like the following:
When the bug manifests itself, what happens is that the client (client1)'s serfHealth check is changed back and forth between `critical` and `passing` every few seconds (5 or so). I don't know if this is a client bug or a server bug, but if I restart the client, this endless loop stops manifesting itself and the situation goes back to normal.

I am attaching the logs of each agent if that helps understand what's going on:
consul1 (server agent):
consul2 (server agent):
consul3 (server agent):
client1 (client agent):
The above lines are a copy-paste of the logs for a specific time frame. client1's log looks quite short and seems to only mention consul1, but it also mentions messages from consul2 and consul3 being refuted if I wait long enough.
I am running the server agents with:
For consul2 and consul3 I append `-join=172.17.0.29` (the address of consul1). And I am running client1 with:
I think this is unrelated, but I am using the HTTP API to contact client1 and ask it to join consul1 (172.17.0.29). I do this several times a day, and I do it even when client1 is already joined.
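Presumably something like this against the local agent's HTTP API (a sketch assuming the default port; newer Consul versions require PUT on this endpoint):

```sh
curl -X PUT http://127.0.0.1:8500/v1/agent/join/172.17.0.29
```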
Do you know what is going on?
Please let me know if you need more info on tracking this one down.