Some veths created by Docker are not assigned a master bridge #26492
ping @mavenugo
I was able to reproduce as well. I've run 500 containers with
This is reproducible by running a bunch of containers in sequence on current CoreOS beta or alpha. Eventually one will fail. When the master is not set, the container can't ping its own gateway interface, so this will do it:
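The exact command was not captured in this thread, but the reproduction described above can be sketched as a generic retry harness (the function name is mine; with Docker, the check would be something like `docker run --rm busybox ping -c 1 -W 2 172.17.0.1`):

```shell
# Run a command up to N times and report the first iteration that
# fails; prints the failing iteration number and returns non-zero.
first_failure() {
  n="$1"; shift
  i=1
  while [ "$i" -le "$n" ]; do
    "$@" >/dev/null 2>&1 || { echo "$i"; return 1; }
    i=$((i + 1))
  done
  return 0
}

# Example use for this issue (not run here):
#   first_failure 500 docker run --rm busybox ping -c 1 -W 2 172.17.0.1
```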
So far, I've seen that the requests set up in
I looked into this a bit more, and it appears to be due to a race condition between creating the veth interfaces and adding the host interface to the bridge. I wrote a proof-of-concept fix that waits for the kernel to notify that the veth interfaces are running. See dm0-/libnetwork@4343ba4c21f1a121f9e867efda3231a61dc5565e. I've run a couple thousand containers with it and did not have any network problems. Can someone else verify this?
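The PoC fix waits for a kernel (netlink) notification that the veth is up before attaching the host side to the bridge. A rough shell approximation of that idea, polling sysfs instead of listening on netlink (the function name and the sysfs-base parameter are mine; the parameter only exists so the function can be exercised outside /sys):

```shell
# Wait until the kernel reports the given interface as operationally
# up, by polling /sys/class/net/<iface>/operstate. Returns non-zero
# after ~5 seconds of waiting.
wait_for_link_up() {
  iface="$1"
  base="${2:-/sys/class/net}"
  tries=0
  while [ "$(cat "$base/$iface/operstate" 2>/dev/null)" != "up" ]; do
    tries=$((tries + 1))
    [ "$tries" -ge 50 ] && return 1
    sleep 0.1
  done
  return 0
}
```

This is only an approximation: polling can still miss fast state transitions, which is why the actual fix subscribes to netlink events instead.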
Sorry, I updated the cause on the pull request, but not this issue. Here's what can reproduce the issue: the error is triggered by a race condition on systems using systemd-networkd. Bridge networking seems to work fine in two scenarios:
Thanks @dm0- for the analysis. I have not been able to reproduce the issue on an Ubuntu machine, but that is consistent with your explanation in my case. It seems it is possible to control, with configuration files, whether networkd should manage a link based on the link type. I would rather document your finding in the Docker documentation than modify the Docker code to be resilient to networkd interference. That way users can decide whether it is worth configuring their networkd or shutting it down, depending on their setup.
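For reference, a sketch of the configuration approach mentioned here: a networkd unit that marks veth links as unmanaged, so networkd cannot race with Docker's bridge setup. The file name is illustrative, and `Unmanaged=` requires a reasonably recent systemd:

```ini
# /etc/systemd/network/05-docker-veth.network (example path)
# Tell networkd to leave veth devices alone entirely.
[Match]
Driver=veth

[Link]
Unmanaged=yes
```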
I looked into networkd earlier today, and for the record, this happens when it brings up an interface: Delete that block, and there is no issue.
@dm0- Do I need to update my systemd package through the package manager to solve this problem on my current system? Or is there a workaround I can apply to avoid this issue? If so, which file do I need to edit, and what content needs to be replaced? In my case, I installed Docker 1.12.3 from scratch (a few minutes ago) on 4 nodes, and executed the following commands:

# manager1:
docker swarm init
# manager2:
docker swarm join --token MYTOKEN-MANAGER 10.0.1.100:2377
# manager3:
docker swarm join --token MYTOKEN-MANAGER 10.0.1.100:2377
# worker1:
docker swarm join --token MYTOKEN-WORKER 10.0.1.100:2377
#manager1:
docker network create \
--driver overlay \
infra
docker service create \
--name=viz \
--publish=8080:8080/tcp \
--constraint=node.role==manager \
--network=infra \
--mount=type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock \
manomarks/visualizer
docker service create \
--name=busybox \
--network=infra \
busybox

The viz service starts successfully, but the busybox service startup enters a loop:
This is a piece of the Docker log from one of the nodes, generated by the docker-engine daemon when swarm tries to start the service:

Dec 13 23:30:08 ip-10-0-20-48 kernel: [ 1653.611994] aufs au_opts_verify:1597:dockerd[13364]: dirperm1 breaks the protection by the permission bits on the lower branch
Dec 13 23:30:08 ip-10-0-20-48 kernel: [ 1653.639362] aufs au_opts_verify:1597:dockerd[13364]: dirperm1 breaks the protection by the permission bits on the lower branch
Dec 13 23:30:10 ip-10-0-20-48 kernel: [ 1656.113789] aufs au_opts_verify:1597:dockerd[14025]: dirperm1 breaks the protection by the permission bits on the lower branch
Dec 13 23:30:10 ip-10-0-20-48 kernel: [ 1656.120385] IPVS: Creating netns size=2192 id=63
Dec 13 23:30:10 ip-10-0-20-48 kernel: [ 1656.136840] br0: renamed from ov-000101-b29gr
Dec 13 23:30:10 ip-10-0-20-48 systemd-udevd[14514]: Could not generate persistent MAC address for vx-000101-b29gr: No such file or directory
Dec 13 23:30:10 ip-10-0-20-48 kernel: [ 1656.168396] vxlan1: renamed from vx-000101-b29gr
Dec 13 23:30:11 ip-10-0-20-48 systemd-udevd[14541]: Could not generate persistent MAC address for veth21f8b53: No such file or directory
Dec 13 23:30:11 ip-10-0-20-48 systemd-udevd[14540]: Could not generate persistent MAC address for veth34c132c: No such file or directory
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.184273] device vxlan1 entered promiscuous mode
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.184432] br0: port 1(vxlan1) entered forwarding state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.184437] br0: port 1(vxlan1) entered forwarding state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.200380] veth2: renamed from veth21f8b53
Dec 13 23:30:11 ip-10-0-20-48 systemd-udevd[14585]: Could not generate persistent MAC address for vethd539d43: No such file or directory
Dec 13 23:30:11 ip-10-0-20-48 systemd-udevd[14586]: Could not generate persistent MAC address for vethc117fcb: No such file or directory
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.208307] device veth2 entered promiscuous mode
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.208418] IPv6: ADDRCONF(NETDEV_UP): veth2: link is not ready
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.208421] br0: port 2(veth2) entered forwarding state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.208426] br0: port 2(veth2) entered forwarding state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.209891] device vethc117fcb entered promiscuous mode
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.209956] IPv6: ADDRCONF(NETDEV_UP): vethc117fcb: link is not ready
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.209960] docker_gwbridge: port 2(vethc117fcb) entered forwarding state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.209967] docker_gwbridge: port 2(vethc117fcb) entered forwarding state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.242273] IPVS: Creating netns size=2192 id=64
Dec 13 23:30:11 ip-10-0-20-48 dockerd[6616]: time="2016-12-13T23:30:11Z" level=info msg="Firewalld running: false"
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.316499] eth0: renamed from veth34c132c
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.336338] IPv6: ADDRCONF(NETDEV_CHANGE): veth2: link becomes ready
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.336379] docker_gwbridge: port 2(vethc117fcb) entered disabled state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.396454] eth1: renamed from vethd539d43
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.416302] IPv6: ADDRCONF(NETDEV_CHANGE): vethc117fcb: link becomes ready
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.416324] docker_gwbridge: port 2(vethc117fcb) entered forwarding state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.416333] docker_gwbridge: port 2(vethc117fcb) entered forwarding state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.505537] br0: port 2(veth2) entered disabled state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.505567] br0: port 1(vxlan1) entered disabled state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.507886] ov-000101-b29gr: renamed from br0
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.516216] device veth2 left promiscuous mode
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.516237] ov-000101-b29gr: port 2(veth2) entered disabled state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.528139] device vxlan1 left promiscuous mode
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.528154] ov-000101-b29gr: port 1(vxlan1) entered disabled state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.566518] vx-000101-b29gr: renamed from vxlan1
Dec 13 23:30:11 ip-10-0-20-48 dockerd[6616]: message repeated 2 times: [ time="2016-12-13T23:30:11Z" level=info msg="Firewalld running: false"]
Dec 13 23:30:11 ip-10-0-20-48 systemd-udevd[14722]: Could not generate persistent MAC address for vx-000101-b29gr: No such file or directory
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.595248] veth21f8b53: renamed from veth2
Dec 13 23:30:11 ip-10-0-20-48 systemd-udevd[14743]: Could not generate persistent MAC address for veth21f8b53: No such file or directory
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.666459] veth34c132c: renamed from eth0
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.701051] docker_gwbridge: port 2(vethc117fcb) entered disabled state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.701105] vethd539d43: renamed from eth1
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.746335] docker_gwbridge: port 2(vethc117fcb) entered disabled state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.749061] device vethc117fcb left promiscuous mode
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.749065] docker_gwbridge: port 2(vethc117fcb) entered disabled state
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.856097] IPVS: __ip_vs_del_service: enter
Dec 13 23:30:11 ip-10-0-20-48 kernel: [ 1656.856100] IPVS: __ip_vs_del_service: enter

I've tested this with Docker 1.12.4 too, and I can confirm that it keeps occurring. So I think it isn't related to the Docker engine itself.
@galindro Yes, I have sent two different methods of addressing the issue to upstream systemd: systemd/systemd#4228 and systemd/systemd#4809. Both have been merged, but they are not in a release yet, so you would have to build a patched package manually if you want to use them. See coreos/coreos-overlay#2300 for example configuration files using the first method. If your problem is caused by the same issue (networkd matching Docker veths), then the workaround is to rewrite your
@galindro the
Guys, thank you very much for the quick reply!
Hi guys, is there any quick solution to this in 2018?
Could this cause an AWS instance to become inaccessible over SSH? These are the last few lines I see in the log, and I can't SSH to the instance anymore:
Just an update on my case: after a restart, these entries are no longer present in the log and the instance works.
I'm having the exact same issue. Is there a way to solve this other than restarting?
Having the same issue: the SSH server refuses to start, and I'm seeing errors like "port4 entered blocking state".
That works for me:

sudo kill -9 $(sudo service docker status | grep 'Main PID' | grep '(dockerd)' | grep -o -E "[0-9]+") && sudo service docker start &

Afterwards:

docker-compose up
# OR
docker-compose down && docker-compose up
Output of docker version:

Output of docker info:

Additional environment details (AWS, VirtualBox, physical, etc.):
Azure VM

Steps to reproduce the issue:

Describe the results you received:
Some veths don't get assigned a master bridge and so are unreachable:

Notice veth1df4092 and veth267617a.

Describe the results you expected:
All the veths should have the master bridge defined.
Additional information you deem important (e.g. issue happens only occasionally):
The issue happens occasionally; I only noticed it after the latest CoreOS beta update. Here are logs of affected veths:
Here are logs of correctly operating veths:
Notice that in the logs of the incorrect veths we have the line "Link readded", which is not present in the logs of the correct veths. On the other hand, the line "entered forwarding state" is only present for correct veths. Maybe some kind of race condition?
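A quick way to spot such orphaned veths on a host is to filter `ip -o link` output for veth lines lacking a `master` attribute (the function name is mine, and the filtering is a simple grep over the one-line-per-link format):

```shell
# Read `ip -o link` output on stdin and print veth lines that have
# no "master <bridge>" attribute, i.e. veths not attached to a bridge.
find_orphan_veths() {
  grep 'veth' | grep -v ' master ' || true
}

# Typical use on a host:
#   ip -o link show type veth | find_orphan_veths
```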