
ARP Cache purging documentation #26

Open
adamhadani opened this issue Oct 9, 2014 · 22 comments

Comments

@adamhadani

The README mentions "You can also manually reset the cache." If there is a reproducible procedure, can you please supply the exact command line that performs this? Is this something that only needs to be run on the host that went through a Consul restart, or on the other hosts in the cluster as well? Does it need to be run in the Docker container's network namespace (e.g. using nsenter --net), or simply on the host?
I've been running into this issue, and trying the following things on the host doesn't seem to help (either after stopping docker-consul and/or the Docker daemon, or while either one of them is still running). The only thing that works is waiting a few minutes, but that's not an acceptable solution for my environment at the moment.

ip -s -s neigh flush all
or
nsenter --target ${CONSUL_DOCKER_PID} --net ip -s -s neigh flush all
or
arp -i docker0 -d <docker0_ip_addr_of_consul_container>
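For reference, a minimal sketch of inspecting the neighbour (ARP) table before flushing, to confirm a stale entry is actually the culprit. The interface name docker0 and the container IP are assumptions; adjust them for your setup.

```shell
#!/bin/sh
# Sketch: look up the neighbour-table entry for the container's IP.
# A state of STALE or FAILED suggests the cached MAC is out of date.
# IFACE and CONTAINER_IP are assumptions; override them for your host.
IFACE=${IFACE:-docker0}
CONTAINER_IP=${CONTAINER_IP:-172.17.0.2}

show_entry() {
  ip neigh show dev "$IFACE" | grep -F "$CONTAINER_IP" \
    || echo "no entry for $CONTAINER_IP"
}
```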
@sfitts

sfitts commented Nov 13, 2014

I'm experiencing the same issue. Running Ubuntu on EC2 and nothing other than waiting seems to do the trick. Some pointers on what you found that worked would be great.

@pfcarrier

The documentation refers to a gratuitous ARP reply sent to Consul's local network segment. That allows all neighbouring hosts to update their ARP tables with the new MAC, restoring connectivity.

Try running that from the container (through docker exec, or, as per the OP, using nsenter):
arping -U -c2 -A -s {IP_OF_YOUR_CONTAINER} -I eth0 0.0.0.0
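For scripting this, a small sketch of issuing that gratuitous ARP from inside the container via docker exec. The container name consul and interface eth0 are assumptions; the command is echoed by default (dry-run) so you can inspect it before executing.

```shell
#!/bin/sh
# Sketch: send a gratuitous ARP from inside the container so neighbours
# refresh their cached MAC. RUNNER defaults to echo (dry-run);
# set RUNNER= to actually execute the docker command.
RUNNER=${RUNNER-echo}

send_garp() {
  name=$1 ip=$2 iface=${3:-eth0}
  # -U: unsolicited ARP, -A: ARP reply mode, -c2: send two packets
  $RUNNER docker exec "$name" arping -U -c2 -A -s "$ip" -I "$iface" 0.0.0.0
}
```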

Also, it seems this ARP cache issue has been addressed directly upstream in Docker:

@sfitts

sfitts commented Dec 6, 2014

FWIW, I just ran into this in our deployed system and unfortunately the arping didn't do the trick. Of course there could be other factors, but just thought I'd let folks know.

@pikeas

pikeas commented Dec 14, 2014

So...what's the right way to work around this?

arie-benichou added a commit to cnam-recolnat/portal that referenced this issue Jan 23, 2015
@johnrengelman

I can't seem to make any of the above workarounds work. The only thing that works for me is to shut the container down for 5 minutes and then start it again.

Docker: 1.5.0
Progrium/Consul: 53a7b829dd6f

@pfcarrier

@johnrengelman, wondering: can you try hard-coding the MAC of the container with --mac-address, or maybe using --net=host, in your docker run for Consul?
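A hedged sketch of what those two docker run variants might look like. The image name progrium/consul comes from the thread; the MAC value is a placeholder, and the commands are echoed by default (dry-run).

```shell
#!/bin/sh
# Sketch of the two suggested variants; dry-run by default (RUNNER=echo).
RUNNER=${RUNNER-echo}

# Variant 1: pin the container's MAC so neighbours' ARP caches stay valid
# across restarts. 02:42:ac:11:00:02 is a placeholder locally-administered MAC.
run_pinned_mac() {
  $RUNNER docker run -d --name consul --mac-address 02:42:ac:11:00:02 progrium/consul
}

# Variant 2: skip the bridge entirely; the container shares the host's
# network stack, so no per-container MAC/ARP entry exists to go stale.
run_host_net() {
  $RUNNER docker run -d --name consul --net=host progrium/consul
}
```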

@johnrengelman

I'll give those a try, hopefully today; I'm off on something else at the moment.

@efuquen

efuquen commented Mar 3, 2015

@johnrengelman were you able to try the suggested mitigations?
@pfcarrier to clear up the issues you referenced: were those fixes part of Docker 1.5.0? Meaning @johnrengelman's issue still persists despite them?

It seems we've been running into this issue too; we initially opened a ticket on Consul: hashicorp/consul#738

What I really need to know is if there is any valid workaround or if this issue has been addressed in a newer version of docker (we're running 1.4.1). This has been a persistent problem for us, making it difficult to try and use our consul cluster for some more interesting applications, where uptime is more critical. In the end we're probably going to be forced to go another route.

@epipho

epipho commented Mar 16, 2015

I am still seeing this, even with docker 1.5. Using --net=host, while not ideal, does appear to sidestep this issue.

@moolen

moolen commented Mar 17, 2015

This works for me as a temporary workaround: just remove all containers with:
docker rm $(docker ps -a -q)

then rebuild and run.

@cap10morgan

I spent a large chunk of the day digging into this. I flushed every ARP cache from here to Timbuktu with no improvement.

However, this worked for me: hashicorp/consul#352 (comment)

Try conntrack -F on the docker host where you want to quickly bounce a consul container (after docker stop but before the next docker run). The new container synced up with the cluster after that.
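The bounce sequence described above can be sketched as a small helper. The container name and the docker rm step are assumptions; commands are echoed by default (dry-run).

```shell
#!/bin/sh
# Sketch: stop the old container, flush connection-tracking state, then
# start a fresh container. Dry-run by default; set RUNNER= to execute.
RUNNER=${RUNNER-echo}

bounce_consul() {
  name=${1:-consul}
  $RUNNER docker stop "$name"
  $RUNNER docker rm "$name"     # assumed: clear the old container so the name is free
  $RUNNER conntrack -F          # flush all connection-tracking entries
  $RUNNER docker run -d --name "$name" progrium/consul
}
```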

@cap10morgan

If anyone else is on CoreOS like me, you can use my Docker image to do this:

docker run --net=host --privileged --rm cap10morgan/conntrack -F

@pm-vitaly-fedyunin

Upvote. This thing is really annoying. conntrack -F is not helping at all on Ubuntu 14.04 LTS hosts.

@jwierzbo

👍 conntrack -F doesn't work for me either, on Ubuntu 12.04 LTS...

@morcmarc

+1, had the same issue running CentOS. Ended up dropping docker-consul for now and installing things the old way. Happy to help fix the issue, although *nix networking is not my cup of tea.

@c4po

c4po commented Jul 30, 2015

I'm on docker 1.7.
--net=host --privileged=true works for me on CentOS7.

@mindscratch

I'm on Docker 1.9.0 with CentOS 7; running Consul with --net=host --privileged=true hasn't helped. Also, running conntrack -F hasn't helped.

@twhart

twhart commented Nov 18, 2015

I'd like to point out that the best working solution I could find was removing everything Docker-related completely:

Kill all containers

sudo docker kill $(sudo docker ps -a -q)

Delete all containers

sudo docker rm $(sudo docker ps -a -q)

Delete all images

sudo docker rmi $(sudo docker images -q)

Then I run
sudo conntrack -F
sleep 60

then bring Docker back up.

Hope this helps

@andyshinn
Contributor

Can someone explain what the actionable item is for this issue (is it still an issue?) and give an updated way to reproduce the issue (docker-compose.yml)?

@alexw23

alexw23 commented Dec 27, 2015

Removing the data dir on the host worked for me... it also deleted all my Vault secrets 😂, which I think might be why @twhart got it working.

@anroots

anroots commented Feb 5, 2016

This is still relevant. The best (and about the only working) workaround so far is to manually clear the UDP conntrack entries before re-starting (or re-running) the Consul container after it was stopped (for a redeploy, for example).

I've added the following to my deployment script after tearing down the old Consul server container and before starting the new container:

sudo docker run --net=host --privileged --rm cap10morgan/conntrack -D -p udp
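That redeploy step can be sketched as a helper for a deployment script. The image name comes from the thread; the command is echoed by default (dry-run).

```shell
#!/bin/sh
# Sketch: after tearing down the old Consul container and before starting
# the new one, delete only the UDP conntrack entries. Dry-run by default.
RUNNER=${RUNNER-echo}

flush_udp_conntrack() {
  # -D deletes matching entries; -p udp restricts the delete to UDP flows,
  # which covers Consul's gossip traffic without touching TCP state.
  $RUNNER docker run --net=host --privileged --rm cap10morgan/conntrack -D -p udp
}
```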

@fcjbispo

fcjbispo commented Jul 1, 2016

I've faced this issue running a small cluster of Ambari/HDP services. Ambari uses Consul as a nameserver to coordinate the members of the cluster. Even after the cluster members shut down, they can't start up correctly, because Consul either has a different IP or is unable to register new members.
To make it possible for all cluster members to find it, I run Consul exposing the DNS port on the IP of docker0 (-p 172.17.0.1:53:53) and set the nameserver in resolv.conf on the Ambari agents to this IP.
To restore Consul's normal behaviour, I just stopped it, removed it, and started it again before restarting all cluster members.
It's not the best solution, but it works.
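A sketch of that setup, under stated assumptions: the container name, the image (progrium/consul from the thread), and the /udp and /tcp port qualifiers are mine; the bridge IP 172.17.0.1 matches the comment. Commands are echoed by default (dry-run).

```shell
#!/bin/sh
# Sketch: publish Consul's DNS port on the docker0 address so Ambari
# agents can use it as their nameserver. Dry-run by default.
RUNNER=${RUNNER-echo}

run_consul_dns() {
  $RUNNER docker run -d --name consul \
    -p 172.17.0.1:53:53/udp -p 172.17.0.1:53:53/tcp \
    progrium/consul
  # then point the agents at it, e.g. add to each agent's /etc/resolv.conf:
  #   nameserver 172.17.0.1
}
```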
