kernel crash after "unregister_netdevice: waiting for lo to become free. Usage count = 3" #5618

Open
tankywoo opened this Issue May 6, 2014 · 440 comments


tankywoo commented May 6, 2014

This happens when I log in to the container, and I can't quit with Ctrl-C.

My system is Ubuntu 12.04, kernel is 3.8.0-25-generic.

docker version:

root@wutq-docker:~# docker version
Client version: 0.10.0
Client API version: 1.10
Go version (client): go1.2.1
Git commit (client): dc9c28f
Server version: 0.10.0
Server API version: 1.10
Git commit (server): dc9c28f
Go version (server): go1.2.1
Last stable version: 0.10.0

I used the script https://raw.githubusercontent.com/dotcloud/docker/master/contrib/check-config.sh to check my configuration, and everything looks fine.
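
For reference, this is roughly how the check can be run (a sketch; fetching with wget is an assumption, and the script auto-detects the running kernel's config):

# download and run docker's kernel config checker
wget https://raw.githubusercontent.com/dotcloud/docker/master/contrib/check-config.sh
bash check-config.sh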

I watched the syslog and found these messages:

May  6 11:30:33 wutq-docker kernel: [62365.889369] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:30:44 wutq-docker kernel: [62376.108277] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:30:54 wutq-docker kernel: [62386.327156] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:02 wutq-docker kernel: [62394.423920] INFO: task docker:1024 blocked for more than 120 seconds.
May  6 11:31:02 wutq-docker kernel: [62394.424175] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  6 11:31:02 wutq-docker kernel: [62394.424505] docker          D 0000000000000001     0  1024      1 0x00000004
May  6 11:31:02 wutq-docker kernel: [62394.424511]  ffff880077793cb0 0000000000000082 ffffffffffffff04 ffffffff816df509
May  6 11:31:02 wutq-docker kernel: [62394.424517]  ffff880077793fd8 ffff880077793fd8 ffff880077793fd8 0000000000013f40
May  6 11:31:02 wutq-docker kernel: [62394.424521]  ffff88007c461740 ffff880076b1dd00 000080d081f06880 ffffffff81cbbda0
May  6 11:31:02 wutq-docker kernel: [62394.424526] Call Trace:                                                         
May  6 11:31:02 wutq-docker kernel: [62394.424668]  [<ffffffff816df509>] ? __slab_alloc+0x28a/0x2b2
May  6 11:31:02 wutq-docker kernel: [62394.424700]  [<ffffffff816f1849>] schedule+0x29/0x70
May  6 11:31:02 wutq-docker kernel: [62394.424705]  [<ffffffff816f1afe>] schedule_preempt_disabled+0xe/0x10
May  6 11:31:02 wutq-docker kernel: [62394.424710]  [<ffffffff816f0777>] __mutex_lock_slowpath+0xd7/0x150
May  6 11:31:02 wutq-docker kernel: [62394.424715]  [<ffffffff815dc809>] ? copy_net_ns+0x69/0x130
May  6 11:31:02 wutq-docker kernel: [62394.424719]  [<ffffffff815dc0b1>] ? net_alloc_generic+0x21/0x30
May  6 11:31:02 wutq-docker kernel: [62394.424724]  [<ffffffff816f038a>] mutex_lock+0x2a/0x50
May  6 11:31:02 wutq-docker kernel: [62394.424727]  [<ffffffff815dc82c>] copy_net_ns+0x8c/0x130
May  6 11:31:02 wutq-docker kernel: [62394.424733]  [<ffffffff81084851>] create_new_namespaces+0x101/0x1b0
May  6 11:31:02 wutq-docker kernel: [62394.424737]  [<ffffffff81084a33>] copy_namespaces+0xa3/0xe0
May  6 11:31:02 wutq-docker kernel: [62394.424742]  [<ffffffff81057a60>] ? dup_mm+0x140/0x240
May  6 11:31:02 wutq-docker kernel: [62394.424746]  [<ffffffff81058294>] copy_process.part.22+0x6f4/0xe60
May  6 11:31:02 wutq-docker kernel: [62394.424752]  [<ffffffff812da406>] ? security_file_alloc+0x16/0x20
May  6 11:31:02 wutq-docker kernel: [62394.424758]  [<ffffffff8119d118>] ? get_empty_filp+0x88/0x180
May  6 11:31:02 wutq-docker kernel: [62394.424762]  [<ffffffff81058a80>] copy_process+0x80/0x90
May  6 11:31:02 wutq-docker kernel: [62394.424766]  [<ffffffff81058b7c>] do_fork+0x9c/0x230
May  6 11:31:02 wutq-docker kernel: [62394.424769]  [<ffffffff816f277e>] ? _raw_spin_lock+0xe/0x20
May  6 11:31:02 wutq-docker kernel: [62394.424774]  [<ffffffff811b9185>] ? __fd_install+0x55/0x70
May  6 11:31:02 wutq-docker kernel: [62394.424777]  [<ffffffff81058d96>] sys_clone+0x16/0x20
May  6 11:31:02 wutq-docker kernel: [62394.424782]  [<ffffffff816fb939>] stub_clone+0x69/0x90
May  6 11:31:02 wutq-docker kernel: [62394.424786]  [<ffffffff816fb5dd>] ? system_call_fastpath+0x1a/0x1f
May  6 11:31:04 wutq-docker kernel: [62396.466223] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:14 wutq-docker kernel: [62406.689132] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:25 wutq-docker kernel: [62416.908036] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:35 wutq-docker kernel: [62427.126927] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:45 wutq-docker kernel: [62437.345860] unregister_netdevice: waiting for lo to become free. Usage count = 3

After this happened, I opened another terminal and killed the process, then restarted docker, but it hung as well.

I rebooted the host, and it still displayed those messages for several minutes during shutdown:
(screenshot: 2014-05-06 11:49:27)

drpancake commented May 23, 2014

I'm seeing a very similar issue for eth0. Ubuntu 12.04 also.

I have to power cycle the machine. From /var/log/kern.log:

May 22 19:26:08 box kernel: [596765.670275] device veth5070 entered promiscuous mode
May 22 19:26:08 box kernel: [596765.680630] IPv6: ADDRCONF(NETDEV_UP): veth5070: link is not ready
May 22 19:26:08 box kernel: [596765.700561] IPv6: ADDRCONF(NETDEV_CHANGE): veth5070: link becomes ready
May 22 19:26:08 box kernel: [596765.700628] docker0: port 7(veth5070) entered forwarding state
May 22 19:26:08 box kernel: [596765.700638] docker0: port 7(veth5070) entered forwarding state
May 22 19:26:19 box kernel: [596777.386084] [FW DBLOCK] IN=docker0 OUT= PHYSIN=veth5070 MAC=56:84:7a:fe:97:99:9e:df:a7:3f:23:42:08:00 SRC=172.17.0.8 DST=172.17.42.1 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=170 DF PROTO=TCP SPT=51615 DPT=13162 WINDOW=14600 RES=0x00 SYN URGP=0
May 22 19:26:21 box kernel: [596779.371993] [FW DBLOCK] IN=docker0 OUT= PHYSIN=veth5070 MAC=56:84:7a:fe:97:99:9e:df:a7:3f:23:42:08:00 SRC=172.17.0.8 DST=172.17.42.1 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=549 DF PROTO=TCP SPT=46878 DPT=12518 WINDOW=14600 RES=0x00 SYN URGP=0
May 22 19:26:23 box kernel: [596780.704031] docker0: port 7(veth5070) entered forwarding state
May 22 19:27:13 box kernel: [596831.359999] docker0: port 7(veth5070) entered disabled state
May 22 19:27:13 box kernel: [596831.361329] device veth5070 left promiscuous mode
May 22 19:27:13 box kernel: [596831.361333] docker0: port 7(veth5070) entered disabled state
May 22 19:27:24 box kernel: [596841.516039] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
May 22 19:27:34 box kernel: [596851.756060] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
May 22 19:27:44 box kernel: [596861.772101] unregister_netdevice: waiting for eth0 to become free. Usage count = 1

egasimus commented Jun 4, 2014

Hey, this just started happening for me as well.

Docker version:

Client version: 0.11.1
Client API version: 1.11
Go version (client): go1.2.1
Git commit (client): fb99f99
Server version: 0.11.1
Server API version: 1.11
Git commit (server): fb99f99
Go version (server): go1.2.1
Last stable version: 0.11.1

Kernel log: http://pastebin.com/TubCy1tG

System details:
Running Ubuntu 14.04 LTS with patched kernel (3.14.3-rt4). Yet to see it happen with the default linux-3.13.0-27-generic kernel. What's funny, though, is that when this happens, all my terminal windows freeze, letting me type a few characters at most before that. The same fate befalls any new ones I open, too - and I end up needing to power cycle my poor laptop just like the good doctor above. For the record, I'm running fish shell in urxvt or xterm in xmonad. Haven't checked if it affects plain bash.

egasimus commented Jun 5, 2014

This might be relevant:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1065434#yui_3_10_3_1_1401948176063_2050

Copying a fairly large amount of data over the network inside a container
and then exiting the container can trigger a missing decrement in the per
cpu reference count on a network device.

Sure enough, one of the times this happened for me was right after apt-getting a package with a ton of dependencies.
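
A rough way to exercise that pattern, for anyone who wants to try to reproduce it (the image and package below are just examples, not a guaranteed trigger):

# pull a fair amount of data over the network inside a container, then exit and remove it
docker run --rm ubuntu:14.04 bash -c 'apt-get update && apt-get install -y build-essential'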

drpancake commented Jun 5, 2014

Upgrading from Ubuntu 12.04.3 to 14.04 fixed this for me without any other changes.


unclejack added the kernel label Jul 16, 2014

csabahenk commented Jul 22, 2014

I experience this on RHEL7, 3.10.0-123.4.2.el7.x86_64

egasimus commented Jul 22, 2014

I've noticed the same thing happening with my VirtualBox virtual network interfaces when I'm running 3.14-rt4. It's supposed to be fixed in vanilla 3.13 or something.

spiffytech commented Jul 25, 2014

@egasimus Same here - I pulled in hundreds of MB of data before killing the container, then got this error.

spiffytech commented Jul 25, 2014

I upgraded to Debian kernel 3.14 and the problem appears to have gone away. Looks like the problem existed in some kernels < 3.5, was fixed in 3.5, regressed in 3.6, and was patched somewhere between 3.12 and 3.14. https://bugzilla.redhat.com/show_bug.cgi?id=880394

egasimus commented Jul 27, 2014

@spiffytech Do you have any idea where I can report this regarding the realtime kernel flavour? I think they're only releasing a RT patch for every other version, and would really hate to see 3.16-rt come out with this still broken. :/

EDIT: Filed it at kernel.org.

Contributor

ibuildthecloud commented Dec 22, 2014

I'm getting this on Ubuntu 14.10 running 3.18.1. The kernel log shows:

Dec 21 22:49:31 inotmac kernel: [15225.866600] unregister_netdevice: waiting for lo to become free. Usage count = 2
Dec 21 22:49:40 inotmac kernel: [15235.179263] INFO: task docker:19599 blocked for more than 120 seconds.
Dec 21 22:49:40 inotmac kernel: [15235.179268]       Tainted: G           OE  3.18.1-031801-generic #201412170637
Dec 21 22:49:40 inotmac kernel: [15235.179269] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 21 22:49:40 inotmac kernel: [15235.179271] docker          D 0000000000000001     0 19599      1 0x00000000
Dec 21 22:49:40 inotmac kernel: [15235.179275]  ffff8802082abcc0 0000000000000086 ffff880235c3b700 00000000ffffffff
Dec 21 22:49:40 inotmac kernel: [15235.179277]  ffff8802082abfd8 0000000000013640 ffff8800288f2300 0000000000013640
Dec 21 22:49:40 inotmac kernel: [15235.179280]  ffff880232cf0000 ffff8801a467c600 ffffffff81f9d4b8 ffffffff81cd9c60
Dec 21 22:49:40 inotmac kernel: [15235.179282] Call Trace:
Dec 21 22:49:40 inotmac kernel: [15235.179289]  [<ffffffff817af549>] schedule+0x29/0x70
Dec 21 22:49:40 inotmac kernel: [15235.179292]  [<ffffffff817af88e>] schedule_preempt_disabled+0xe/0x10
Dec 21 22:49:40 inotmac kernel: [15235.179296]  [<ffffffff817b1545>] __mutex_lock_slowpath+0x95/0x100
Dec 21 22:49:40 inotmac kernel: [15235.179299]  [<ffffffff8168d5c9>] ? copy_net_ns+0x69/0x150
Dec 21 22:49:40 inotmac kernel: [15235.179302]  [<ffffffff817b15d3>] mutex_lock+0x23/0x37
Dec 21 22:49:40 inotmac kernel: [15235.179305]  [<ffffffff8168d5f8>] copy_net_ns+0x98/0x150
Dec 21 22:49:40 inotmac kernel: [15235.179308]  [<ffffffff810941f1>] create_new_namespaces+0x101/0x1b0
Dec 21 22:49:40 inotmac kernel: [15235.179311]  [<ffffffff8109432b>] copy_namespaces+0x8b/0xa0
Dec 21 22:49:40 inotmac kernel: [15235.179315]  [<ffffffff81073458>] copy_process.part.28+0x828/0xed0
Dec 21 22:49:40 inotmac kernel: [15235.179318]  [<ffffffff811f157f>] ? get_empty_filp+0xcf/0x1c0
Dec 21 22:49:40 inotmac kernel: [15235.179320]  [<ffffffff81073b80>] copy_process+0x80/0x90
Dec 21 22:49:40 inotmac kernel: [15235.179323]  [<ffffffff81073ca2>] do_fork+0x62/0x280
Dec 21 22:49:40 inotmac kernel: [15235.179326]  [<ffffffff8120cfc0>] ? get_unused_fd_flags+0x30/0x40
Dec 21 22:49:40 inotmac kernel: [15235.179329]  [<ffffffff8120d028>] ? __fd_install+0x58/0x70
Dec 21 22:49:40 inotmac kernel: [15235.179331]  [<ffffffff81073f46>] SyS_clone+0x16/0x20
Dec 21 22:49:40 inotmac kernel: [15235.179334]  [<ffffffff817b3ab9>] stub_clone+0x69/0x90
Dec 21 22:49:40 inotmac kernel: [15235.179336]  [<ffffffff817b376d>] ? system_call_fastpath+0x16/0x1b
Dec 21 22:49:41 inotmac kernel: [15235.950976] unregister_netdevice: waiting for lo to become free. Usage count = 2
Dec 21 22:49:51 inotmac kernel: [15246.059346] unregister_netdevice: waiting for lo to become free. Usage count = 2

I'll send docker version/info once the system isn't frozen anymore :)

sbward commented Dec 23, 2014

We're seeing this issue as well. Ubuntu 14.04, 3.13.0-37-generic

jbalonso commented Dec 29, 2014

On Ubuntu 14.04 server, my team has found that downgrading from 3.13.0-40-generic to 3.13.0-32-generic "resolves" the issue. Given @sbward's observation, that would put the regression after 3.13.0-32-generic and before (or including) 3.13.0-37-generic.

I'll add that, in our case, we sometimes see a negative usage count.
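
For anyone wanting to try the same downgrade, this is roughly what it looks like (exact package names are assumptions based on the versions above):

# install the older kernel alongside the current one
sudo apt-get install linux-image-3.13.0-32-generic linux-headers-3.13.0-32-generic
# then pick 3.13.0-32-generic from the GRUB "Advanced options" menu on reboot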

Contributor

rsampaio commented Jan 15, 2015

FWIW, we hit this bug running lxc on the trusty kernel (3.13.0-40-generic #69-Ubuntu). The message appears in dmesg, followed by this stack trace:

[27211131.602869] INFO: task lxc-start:26342 blocked for more than 120 seconds.
[27211131.602874]       Not tainted 3.13.0-40-generic #69-Ubuntu
[27211131.602877] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27211131.602881] lxc-start       D 0000000000000001     0 26342      1 0x00000080
[27211131.602883]  ffff88000d001d40 0000000000000282 ffff88001aa21800 ffff88000d001fd8
[27211131.602886]  0000000000014480 0000000000014480 ffff88001aa21800 ffffffff81cdb760
[27211131.602888]  ffffffff81cdb764 ffff88001aa21800 00000000ffffffff ffffffff81cdb768
[27211131.602891] Call Trace:
[27211131.602894]  [<ffffffff81723b69>] schedule_preempt_disabled+0x29/0x70
[27211131.602897]  [<ffffffff817259d5>] __mutex_lock_slowpath+0x135/0x1b0
[27211131.602900]  [<ffffffff811a2679>] ? __kmalloc+0x1e9/0x230
[27211131.602903]  [<ffffffff81725a6f>] mutex_lock+0x1f/0x2f
[27211131.602905]  [<ffffffff8161c2c1>] copy_net_ns+0x71/0x130
[27211131.602908]  [<ffffffff8108f889>] create_new_namespaces+0xf9/0x180
[27211131.602910]  [<ffffffff8108f983>] copy_namespaces+0x73/0xa0
[27211131.602912]  [<ffffffff81065b16>] copy_process.part.26+0x9a6/0x16b0
[27211131.602915]  [<ffffffff810669f5>] do_fork+0xd5/0x340
[27211131.602917]  [<ffffffff810c8e8d>] ? call_rcu_sched+0x1d/0x20
[27211131.602919]  [<ffffffff81066ce6>] SyS_clone+0x16/0x20
[27211131.602921]  [<ffffffff81730089>] stub_clone+0x69/0x90
[27211131.602923]  [<ffffffff8172fd2d>] ? system_call_fastpath+0x1a/0x1f
MrMMorris commented Mar 16, 2015

Ran into this on Ubuntu 14.04 and Debian jessie w/ kernel 3.16.x.

Docker command:

docker run -t -i -v /data/sitespeed.io:/sitespeed.io/results company/dockerfiles:sitespeed.io-latest --name "Superbrowse"

This seems like a pretty bad issue...

MrMMorris commented Mar 17, 2015

@jbalonso even with 3.13.0-32-generic I get the error after only a few successful runs 😭

Contributor

rsampaio commented Mar 17, 2015

@MrMMorris could you share a reproducer script using publicly available images?

Contributor

unclejack commented Mar 18, 2015

Everyone who's seeing this error is running a distribution kernel package that's too old and lacks the fixes for this particular problem.

If you run into this problem, make sure you run apt-get update && apt-get dist-upgrade -y and reboot your system. If you're on Digital Ocean, you also need to select the kernel version which was just installed during the update because they don't use the latest kernel automatically (see https://digitalocean.uservoice.com/forums/136585-digitalocean/suggestions/2814988-give-option-to-use-the-droplet-s-own-bootloader).

CentOS/RHEL/Fedora/Scientific Linux users need to keep their systems updated using yum update and reboot after installing the updates.
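In short, something along these lines (adjust for your distribution):

# Debian/Ubuntu
sudo apt-get update && sudo apt-get dist-upgrade -y
sudo reboot
# CentOS/RHEL/Fedora/Scientific Linux
sudo yum update -y
sudo reboot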

When reporting this problem, please make sure your system is fully patched and up to date with the latest stable updates (no manually installed experimental/testing/alpha/beta/rc packages) provided by your distribution's vendor.

MrMMorris commented Mar 18, 2015

@unclejack

I ran apt-get update && apt-get dist-upgrade -y

ubuntu 14.04 3.13.0-46-generic

Still get the error after only one docker run

I can create an AMI for reproducing if needed

Contributor

unclejack commented Mar 18, 2015

@MrMMorris Thank you for confirming it's still a problem with the latest kernel package on Ubuntu 14.04.

MrMMorris commented Mar 18, 2015

Anything else I can do to help, let me know! 😄

Contributor

rsampaio commented Mar 18, 2015

@MrMMorris if you can provide a reproducer, there is a bug open for Ubuntu and it would be much appreciated: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1403152

MrMMorris commented Mar 18, 2015

@rsampaio if I have time today, I will definitely get that for you!

fxposter commented Mar 23, 2015

This problem also appears on 3.16(.7) on both Debian 7 and Debian 8: docker#9605 (comment). Rebooting the server is the only way to fix this for now.

chrisjstevenson commented Apr 27, 2015

Seeing this issue on RHEL 6.6 with kernel 2.6.32-504.8.1.el6.x86_64 when starting some docker containers (not all containers)
kernel:unregister_netdevice: waiting for lo to become free. Usage count = -1

Again, rebooting the server seems to be the only solution at this time

popsikle commented May 12, 2015

Also seeing this on CoreOS (647.0.0) with kernel 3.19.3.

Rebooting is also the only solution I have found.

fxposter commented May 20, 2015

Tested Debian jessie with sid's kernel (4.0.2) - the problem remains.

popsikle commented Jun 19, 2015

Anyone seeing this issue running non-ubuntu containers?

fxposter commented Jun 19, 2015

Yes. Debian ones.
On 19 June 2015 at 19:01, "popsikle" notifications@github.com wrote:

Anyone seeing this issue running non-ubuntu containers?

Contributor

unclejack commented Jun 20, 2015

This is a kernel issue, not an image-related issue. Swapping one image for another won't make this problem any better or worse.

techniq commented Jul 17, 2015

Experiencing issue on Debian Jessie on a BeagleBone Black running 4.1.2-bone12 kernel

igorastds commented Jul 17, 2015

Experiencing this after switching from 4.1.2 to 4.2-rc2 (using a git build of 1.8.0).
Deleting /var/lib/docker/* doesn't solve the problem.
Switching back to 4.1.2 solves the problem.

Also, VirtualBox has the same issue, and there's a patch for v5.0.0 (back-ported to v4) which supposedly changes something in the kernel driver part; it may be worth looking at to understand the problem.

fxposter commented Jul 22, 2015

This is the fix in VirtualBox: https://www.virtualbox.org/attachment/ticket/12264/diff_unregister_netdev
They don't actually modify the kernel, just their kernel module.

nazar-pc commented Jul 24, 2015

Also having this issue with 4.2-rc2:

unregister_netdevice: waiting for vethf1738d3 to become free. Usage count = 1

nazar-pc commented Jul 24, 2015

Just compiled 4.2-rc3; it seems to work again.

Contributor

LK4D4 commented Jul 24, 2015

@nazar-pc Thanks for info. Just hit it with 4.1.3, was pretty upset
@techniq same here, pretty bad kernel bug. I wonder if we should report it to be backported to 4.1 tree.

feisuzhu commented Jul 30, 2015

Linux docker13 3.19.0-22-generic #22-Ubuntu SMP Tue Jun 16 17:15:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Kernel from Ubuntu 15.04, same issue

Contributor

LK4D4 commented Jul 30, 2015

I saw it with 4.2-rc3 as well. It's not just one bug about device leakage :) I can reproduce it on any kernel >= 4.1 under high load.

wuming5569 commented Mar 9, 2018

"unregister_netdevice: waiting for eth0 to become free. Usage count = 1" still happens even though I've upgraded the kernel to 4.4.118 and Docker to 17.09.1-ce. Maybe I should try disabling ipv6 at the kernel level. Hope it could work.

scher200 commented Mar 9, 2018

@wuming5569 please let me know if it worked for you with that version of linux

4admin2root commented Mar 10, 2018

@wuming5569 Maybe. Upgrading to kernel 4.4.114 fixes "unregister_netdevice: waiting for lo to become free. Usage count = 1", but not "unregister_netdevice: waiting for eth0 to become free. Usage count = 1".
I tested this in production.
@ddstreet this is feedback, any help?

Member

rn commented Mar 10, 2018

@wuming5569 as mentioned above, the messages themselves are benign, but they may eventually lead to the kernel hanging. Does your kernel hang, and if so, what is your network pattern, i.e. what type of networking do your containers do?

soglad commented Mar 14, 2018

Experienced the same issue on CentOS. My kernel is 3.10.0-693.17.1.el7.x86_64, but I didn't get a similar stack trace in syslog.

danielefranceschi commented Mar 27, 2018

Same on Centos7 kernel 3.10.0-514.21.1.el7.x86_64 and docker 18.03.0-ce

alexhexabeam commented Mar 27, 2018

@danielefranceschi I recommend you upgrade to the latest CentOS kernel (at least 3.10.0-693). It won't solve the issue, but it seems to be much less frequent. With kernels 3.10.0-327 and 3.10.0-514 we were seeing the stack trace, but from memory I don't think we've seen any of those on 3.10.0-693.

danielefranceschi commented Mar 28, 2018

@alexhexabeam 3.10.0-693 seems to work flawlessly, thanks :)

LeonanCarvalho commented Apr 3, 2018

Same on CentOS7 kernel 4.16.0-1.el7.elrepo.x86_64 and docker 18.03.0-ce

It worked for weeks before the crash, and when I tried to bring it back up, it got completely stuck.

The problem also happened with kernel 3.10.0-693.21.1.el7

marckamerbeek commented Apr 4, 2018

I can confirm it also happens on:

Linux 3.10.0-693.17.1.el7.x86_64
Red Hat Enterprise Linux Server release 7.4 (Maipo)

I can reproduce it by doing "service docker restart" while having a certain amount of load.
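
Roughly, the reproduction looks like this (the load-generating containers below are just an example, not our exact workload):

# start some containers doing network traffic, then restart the daemon underneath them
for i in $(seq 1 20); do
  docker run -d --name load$i busybox sh -c 'while true; do wget -q -O /dev/null http://example.com || true; sleep 1; done'
done
sudo service docker restart
# watch for the message
dmesg | grep unregister_netdevice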

Xuexiang825 commented Apr 11, 2018

@wuming5569 Have you fixed this issue? What's your network type? We have been troubled by this issue for weeks.
Do you have a WeChat account?

Sherweb-turing-pipeline commented Apr 12, 2018

@4admin2root, given the fix you mentioned, https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.4.114,

is it safe to disable the userland proxy for the docker daemon if a sufficiently recent kernel is installed? It is not very clear whether it stems from

#8356
#11185

since both are older than the kernel fix.

Thank you
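
For reference, the setting in question is the daemon's userland-proxy option; a sketch of turning it off (the file path and restart command are the typical ones, and any existing daemon.json should be merged rather than overwritten):

# /etc/docker/daemon.json
echo '{ "userland-proxy": false }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker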

sampsonhuo commented Apr 18, 2018

We have been troubled by this issue for weeks.
Linux 3.10.0-693.17.1.el7.x86_64
CentOS Linux release 7.4.1708 (Core)

dElogics commented May 4, 2018

Can anyone confirm whether the latest 4.14 kernel has this issue? It seems like it does not; no one around the Internet seems to have faced this issue with the 4.14 kernel.

dimm0 commented May 4, 2018

I see this in 4.15.15-1 kernel, Centos7

dElogics commented May 7, 2018

Looking at the change logs, https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.15.8 has a fix for SCTP, but not TCP. So you may like to try the latest 4.14.

spronin-aurea commented Jun 4, 2018

  • even 4.15.18 does not help with this bug
  • disabling ipv6 does not help as well

We have now upgraded to 4.16.13 and are observing. This bug was hitting us on one node only, approximately once per week.

qrpike commented Jun 4, 2018

scher200 commented Jun 4, 2018

for me, most of the time the bug shows up after redeploying the same project/network again

spronin-aurea commented Jun 4, 2018

@qrpike you are right, we tried only sysctl. Let me try with grub. Thanks!
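
For clarity, the two approaches being compared (the grub file path and regeneration commands below are the usual ones, adjust for your distro):

# sysctl approach (what we had tried):
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
# boot-loader approach: add ipv6.disable=1 to the kernel command line, e.g. in /etc/default/grub
#   GRUB_CMDLINE_LINUX="... ipv6.disable=1"
# then regenerate the grub config and reboot
sudo update-grub    # Debian/Ubuntu; on CentOS: grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot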

dElogics commented Jun 19, 2018

4.9.88 Debian kernel. Reproducible.

komljen commented Jun 19, 2018

@qrpike you are right, we tried only sysctl. Let me try with grub. Thanks!

In my case disabling ipv6 didn't make any difference.

qrpike commented Jun 19, 2018

@spronin-aurea Did disabling ipv6 at boot loader help?

komljen commented Jun 19, 2018

@qrpike can you tell us about the nodes you are using if disabling ipv6 helped in your case? Kernel version, k8s version, CNI, docker version etc.

qrpike commented Jun 19, 2018

@komljen I have been using CoreOS for the past 2 years (since ~version 1000) without a single incident. I haven't tried it recently, but if I do not disable ipv6 the bug happens.

deimosfr commented Jun 19, 2018

On my side, I'm using CoreOS too, ipv6 disabled with grub and still getting the issue

qrpike commented Jun 19, 2018

@deimosfr I'm currently using PXE boot for all my nodes:

      DEFAULT menu.c32
      prompt 0
      timeout 50
      MENU TITLE PXE Boot Blade 1
      label coreos
              menu label CoreOS ( blade 1 )
              kernel coreos/coreos_production_pxe.vmlinuz
              append initrd=coreos/coreos_production_pxe_image.cpio.gz ipv6.disable=1 net.ifnames=1 biosdevname=0 elevator=deadline cloud-config-url=http://HOST_PRIV_IP:8888/coreos-cloud-config.yml?host=1 root=LABEL=ROOT rootflags=noatime,discard,rw,seclabel,nodiratime

However, my main node that is the PXE host is also CoreOS and boots from disk, and does not have the issue either.

dElogics commented Jun 19, 2018

What kernel versions are you guys running?

deimosfr commented Jun 19, 2018

The ones where I got the issue were on 4.14.32-coreos and earlier. I have not encountered this issue yet on 4.14.42-coreos.

wallewuli commented Jul 2, 2018

Centos 7.5 with 4.17.3-1 kernel, still got the issue.

Env :
kubernetes 1.10.4
Docker 13.1
with Flannel network plugin.

Log :
[ 89.790907] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 89.798523] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 89.799623] cni0: port 8(vethb8a93c6f) entered blocking state
[ 89.800547] cni0: port 8(vethb8a93c6f) entered disabled state
[ 89.801471] device vethb8a93c6f entered promiscuous mode
[ 89.802323] cni0: port 8(vethb8a93c6f) entered blocking state
[ 89.803200] cni0: port 8(vethb8a93c6f) entered forwarding state

kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

Now:
The node IP is still reachable, but no network services (like ssh) can be used...

Blub commented Jul 2, 2018

The symptoms here are similar to a lot of reports in various other places. All having to do with network namespaces. Could the people running into this please see if unshare -n hangs, and if so, from another terminal, do cat /proc/$pid/stack of the unshare process to see if it hangs in copy_net_ns()? This seems to be a common denominator for many of the issues including some backtraces found here. Between 4.16 and 4.18 there have been a number of patches by Kirill Tkhai refactoring the involved locking a lot. The affected distro/kernel package maintainers should probably look into applying/backporting them to stable kernels and see if that helps.
See also: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779678
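
To spell out the requested check (a sketch):

# terminal 1 (creating a network namespace needs root); returns immediately on a healthy system
sudo unshare -n true
# terminal 2, if it hangs: dump the kernel stack of the stuck unshare process
pid=$(pgrep -f 'unshare -n' | head -n 1)
sudo cat /proc/$pid/stack    # if it shows copy_net_ns(), it matches the pattern described above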

cassiussa commented Jul 3, 2018

@Blub

sudo cat /proc/122355/stack
[<ffffffff8157f6e2>] copy_net_ns+0xa2/0x180
[<ffffffff810b7519>] create_new_namespaces+0xf9/0x180
[<ffffffff810b775a>] unshare_nsproxy_namespaces+0x5a/0xc0
[<ffffffff81088983>] SyS_unshare+0x193/0x300
[<ffffffff816b8c6b>] tracesys+0x97/0xbd
[<ffffffffffffffff>] 0xffffffffffffffff

Blub commented Jul 4, 2018

Given the locking changes in 4.18, it would be good to test the current 4.18-rc, especially if you can trigger it more or less reliably; from what I've seen, for many people changing kernel versions also changed the likelihood of this happening a lot.

komljen commented Jul 4, 2018

I had this issue with Kubernetes, and after switching to the latest CoreOS stable release - 1745.7.0 - the issue is gone:

  • kernel: 4.14.48
  • docker: 18.03.1

PengBAI commented Jul 5, 2018

same issue on CentOS 7

  • kernel: 4.11.1-1.el7.elrepo.x86_64
  • docker: 17.12.0-ce

Can anyone give us a working version, please!
