kernel crash after "unregister_netdevice: waiting for lo to become free. Usage count = 3" #5618

Open
tankywoo opened this Issue May 6, 2014 · 453 comments

@tankywoo

tankywoo commented May 6, 2014

This happens when I log in to the container, and I can't quit with Ctrl-C.

My system is Ubuntu 12.04, kernel is 3.8.0-25-generic.

docker version:

root@wutq-docker:~# docker version
Client version: 0.10.0
Client API version: 1.10
Go version (client): go1.2.1
Git commit (client): dc9c28f
Server version: 0.10.0
Server API version: 1.10
Git commit (server): dc9c28f
Go version (server): go1.2.1
Last stable version: 0.10.0

I have used the script https://raw.githubusercontent.com/dotcloud/docker/master/contrib/check-config.sh to check my configuration, and everything looks right.

I watched the syslog and found these messages:

May  6 11:30:33 wutq-docker kernel: [62365.889369] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:30:44 wutq-docker kernel: [62376.108277] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:30:54 wutq-docker kernel: [62386.327156] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:02 wutq-docker kernel: [62394.423920] INFO: task docker:1024 blocked for more than 120 seconds.
May  6 11:31:02 wutq-docker kernel: [62394.424175] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  6 11:31:02 wutq-docker kernel: [62394.424505] docker          D 0000000000000001     0  1024      1 0x00000004
May  6 11:31:02 wutq-docker kernel: [62394.424511]  ffff880077793cb0 0000000000000082 ffffffffffffff04 ffffffff816df509
May  6 11:31:02 wutq-docker kernel: [62394.424517]  ffff880077793fd8 ffff880077793fd8 ffff880077793fd8 0000000000013f40
May  6 11:31:02 wutq-docker kernel: [62394.424521]  ffff88007c461740 ffff880076b1dd00 000080d081f06880 ffffffff81cbbda0
May  6 11:31:02 wutq-docker kernel: [62394.424526] Call Trace:                                                         
May  6 11:31:02 wutq-docker kernel: [62394.424668]  [<ffffffff816df509>] ? __slab_alloc+0x28a/0x2b2
May  6 11:31:02 wutq-docker kernel: [62394.424700]  [<ffffffff816f1849>] schedule+0x29/0x70
May  6 11:31:02 wutq-docker kernel: [62394.424705]  [<ffffffff816f1afe>] schedule_preempt_disabled+0xe/0x10
May  6 11:31:02 wutq-docker kernel: [62394.424710]  [<ffffffff816f0777>] __mutex_lock_slowpath+0xd7/0x150
May  6 11:31:02 wutq-docker kernel: [62394.424715]  [<ffffffff815dc809>] ? copy_net_ns+0x69/0x130
May  6 11:31:02 wutq-docker kernel: [62394.424719]  [<ffffffff815dc0b1>] ? net_alloc_generic+0x21/0x30
May  6 11:31:02 wutq-docker kernel: [62394.424724]  [<ffffffff816f038a>] mutex_lock+0x2a/0x50
May  6 11:31:02 wutq-docker kernel: [62394.424727]  [<ffffffff815dc82c>] copy_net_ns+0x8c/0x130
May  6 11:31:02 wutq-docker kernel: [62394.424733]  [<ffffffff81084851>] create_new_namespaces+0x101/0x1b0
May  6 11:31:02 wutq-docker kernel: [62394.424737]  [<ffffffff81084a33>] copy_namespaces+0xa3/0xe0
May  6 11:31:02 wutq-docker kernel: [62394.424742]  [<ffffffff81057a60>] ? dup_mm+0x140/0x240
May  6 11:31:02 wutq-docker kernel: [62394.424746]  [<ffffffff81058294>] copy_process.part.22+0x6f4/0xe60
May  6 11:31:02 wutq-docker kernel: [62394.424752]  [<ffffffff812da406>] ? security_file_alloc+0x16/0x20
May  6 11:31:02 wutq-docker kernel: [62394.424758]  [<ffffffff8119d118>] ? get_empty_filp+0x88/0x180
May  6 11:31:02 wutq-docker kernel: [62394.424762]  [<ffffffff81058a80>] copy_process+0x80/0x90
May  6 11:31:02 wutq-docker kernel: [62394.424766]  [<ffffffff81058b7c>] do_fork+0x9c/0x230
May  6 11:31:02 wutq-docker kernel: [62394.424769]  [<ffffffff816f277e>] ? _raw_spin_lock+0xe/0x20
May  6 11:31:02 wutq-docker kernel: [62394.424774]  [<ffffffff811b9185>] ? __fd_install+0x55/0x70
May  6 11:31:02 wutq-docker kernel: [62394.424777]  [<ffffffff81058d96>] sys_clone+0x16/0x20
May  6 11:31:02 wutq-docker kernel: [62394.424782]  [<ffffffff816fb939>] stub_clone+0x69/0x90
May  6 11:31:02 wutq-docker kernel: [62394.424786]  [<ffffffff816fb5dd>] ? system_call_fastpath+0x1a/0x1f
May  6 11:31:04 wutq-docker kernel: [62396.466223] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:14 wutq-docker kernel: [62406.689132] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:25 wutq-docker kernel: [62416.908036] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:35 wutq-docker kernel: [62427.126927] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:45 wutq-docker kernel: [62437.345860] unregister_netdevice: waiting for lo to become free. Usage count = 3

After this happened, I opened another terminal and killed the process, and then restarted Docker, but it hung.

I rebooted the host, and it still displayed those messages for several minutes during shutdown:
screen shot 2014-05-06 at 11 49 27

@drpancake

drpancake commented May 23, 2014

I'm seeing a very similar issue for eth0. Ubuntu 12.04 also.

I have to power cycle the machine. From /var/log/kern.log:

May 22 19:26:08 box kernel: [596765.670275] device veth5070 entered promiscuous mode
May 22 19:26:08 box kernel: [596765.680630] IPv6: ADDRCONF(NETDEV_UP): veth5070: link is not ready
May 22 19:26:08 box kernel: [596765.700561] IPv6: ADDRCONF(NETDEV_CHANGE): veth5070: link becomes ready
May 22 19:26:08 box kernel: [596765.700628] docker0: port 7(veth5070) entered forwarding state
May 22 19:26:08 box kernel: [596765.700638] docker0: port 7(veth5070) entered forwarding state
May 22 19:26:19 box kernel: [596777.386084] [FW DBLOCK] IN=docker0 OUT= PHYSIN=veth5070 MAC=56:84:7a:fe:97:99:9e:df:a7:3f:23:42:08:00 SRC=172.17.0.8 DST=172.17.42.1 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=170 DF PROTO=TCP SPT=51615 DPT=13162 WINDOW=14600 RES=0x00 SYN URGP=0
May 22 19:26:21 box kernel: [596779.371993] [FW DBLOCK] IN=docker0 OUT= PHYSIN=veth5070 MAC=56:84:7a:fe:97:99:9e:df:a7:3f:23:42:08:00 SRC=172.17.0.8 DST=172.17.42.1 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=549 DF PROTO=TCP SPT=46878 DPT=12518 WINDOW=14600 RES=0x00 SYN URGP=0
May 22 19:26:23 box kernel: [596780.704031] docker0: port 7(veth5070) entered forwarding state
May 22 19:27:13 box kernel: [596831.359999] docker0: port 7(veth5070) entered disabled state
May 22 19:27:13 box kernel: [596831.361329] device veth5070 left promiscuous mode
May 22 19:27:13 box kernel: [596831.361333] docker0: port 7(veth5070) entered disabled state
May 22 19:27:24 box kernel: [596841.516039] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
May 22 19:27:34 box kernel: [596851.756060] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
May 22 19:27:44 box kernel: [596861.772101] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
@egasimus

egasimus commented Jun 4, 2014

Hey, this just started happening for me as well.

Docker version:

Client version: 0.11.1
Client API version: 1.11
Go version (client): go1.2.1
Git commit (client): fb99f99
Server version: 0.11.1
Server API version: 1.11
Git commit (server): fb99f99
Go version (server): go1.2.1
Last stable version: 0.11.1

Kernel log: http://pastebin.com/TubCy1tG

System details:
Running Ubuntu 14.04 LTS with a patched kernel (3.14.3-rt4). I have yet to see it happen with the default linux-3.13.0-27-generic kernel. What's funny, though, is that when this happens, all my terminal windows freeze, letting me type a few characters at most before that. The same fate befalls any new ones I open, too - and I end up needing to power cycle my poor laptop just like the good doctor above. For the record, I'm running fish shell in urxvt or xterm in xmonad. Haven't checked if it affects plain bash.

@egasimus

egasimus commented Jun 5, 2014

This might be relevant:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1065434#yui_3_10_3_1_1401948176063_2050

Copying a fairly large amount of data over the network inside a container
and then exiting the container can trigger a missing decrement in the per
cpu reference count on a network device.

Sure enough, one of the times this happened for me was right after apt-getting a package with a ton of dependencies.
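
A rough way to exercise that code path, going by the Launchpad description above (only a sketch, not a confirmed reproducer; the image, URL and loop count are placeholders):

# Pull a lot of data over the network inside a throwaway container, let the
# container exit, and watch the kernel log for the unregister_netdevice message.
for i in $(seq 1 20); do
    docker run --rm busybox wget -q -O /dev/null http://example.com/large-file.bin   # any large file
    dmesg | tail -n 20 | grep -i unregister_netdevice && break
done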

@drpancake

drpancake commented Jun 5, 2014

Upgrading from Ubuntu 12.04.3 to 14.04 fixed this for me without any other changes.

@unclejack unclejack added the kernel label Jul 16, 2014

@csabahenk

csabahenk commented Jul 22, 2014

I experience this on RHEL7, 3.10.0-123.4.2.el7.x86_64

@egasimus

egasimus commented Jul 22, 2014

I've noticed the same thing happening with my VirtualBox virtual network interfaces when I'm running 3.14-rt4. It's supposed to be fixed in vanilla 3.13 or something.

@spiffytech

spiffytech commented Jul 25, 2014

@egasimus Same here - I pulled in hundreds of MB of data before killing the container, then got this error.

@spiffytech

spiffytech commented Jul 25, 2014

I upgraded to Debian kernel 3.14 and the problem appears to have gone away. Looks like the problem existed in some kernels < 3.5, was fixed in 3.5, regressed in 3.6, and was patched again somewhere between 3.12 and 3.14. https://bugzilla.redhat.com/show_bug.cgi?id=880394

@egasimus

egasimus commented Jul 27, 2014

@spiffytech Do you have any idea where I can report this regarding the realtime kernel flavour? I think they're only releasing a RT patch for every other version, and would really hate to see 3.16-rt come out with this still broken. :/

EDIT: Filed it at kernel.org.

@ibuildthecloud

Contributor

ibuildthecloud commented Dec 22, 2014

I'm getting this on Ubuntu 14.10 running kernel 3.18.1. The kernel log shows:

Dec 21 22:49:31 inotmac kernel: [15225.866600] unregister_netdevice: waiting for lo to become free. Usage count = 2
Dec 21 22:49:40 inotmac kernel: [15235.179263] INFO: task docker:19599 blocked for more than 120 seconds.
Dec 21 22:49:40 inotmac kernel: [15235.179268]       Tainted: G           OE  3.18.1-031801-generic #201412170637
Dec 21 22:49:40 inotmac kernel: [15235.179269] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 21 22:49:40 inotmac kernel: [15235.179271] docker          D 0000000000000001     0 19599      1 0x00000000
Dec 21 22:49:40 inotmac kernel: [15235.179275]  ffff8802082abcc0 0000000000000086 ffff880235c3b700 00000000ffffffff
Dec 21 22:49:40 inotmac kernel: [15235.179277]  ffff8802082abfd8 0000000000013640 ffff8800288f2300 0000000000013640
Dec 21 22:49:40 inotmac kernel: [15235.179280]  ffff880232cf0000 ffff8801a467c600 ffffffff81f9d4b8 ffffffff81cd9c60
Dec 21 22:49:40 inotmac kernel: [15235.179282] Call Trace:
Dec 21 22:49:40 inotmac kernel: [15235.179289]  [<ffffffff817af549>] schedule+0x29/0x70
Dec 21 22:49:40 inotmac kernel: [15235.179292]  [<ffffffff817af88e>] schedule_preempt_disabled+0xe/0x10
Dec 21 22:49:40 inotmac kernel: [15235.179296]  [<ffffffff817b1545>] __mutex_lock_slowpath+0x95/0x100
Dec 21 22:49:40 inotmac kernel: [15235.179299]  [<ffffffff8168d5c9>] ? copy_net_ns+0x69/0x150
Dec 21 22:49:40 inotmac kernel: [15235.179302]  [<ffffffff817b15d3>] mutex_lock+0x23/0x37
Dec 21 22:49:40 inotmac kernel: [15235.179305]  [<ffffffff8168d5f8>] copy_net_ns+0x98/0x150
Dec 21 22:49:40 inotmac kernel: [15235.179308]  [<ffffffff810941f1>] create_new_namespaces+0x101/0x1b0
Dec 21 22:49:40 inotmac kernel: [15235.179311]  [<ffffffff8109432b>] copy_namespaces+0x8b/0xa0
Dec 21 22:49:40 inotmac kernel: [15235.179315]  [<ffffffff81073458>] copy_process.part.28+0x828/0xed0
Dec 21 22:49:40 inotmac kernel: [15235.179318]  [<ffffffff811f157f>] ? get_empty_filp+0xcf/0x1c0
Dec 21 22:49:40 inotmac kernel: [15235.179320]  [<ffffffff81073b80>] copy_process+0x80/0x90
Dec 21 22:49:40 inotmac kernel: [15235.179323]  [<ffffffff81073ca2>] do_fork+0x62/0x280
Dec 21 22:49:40 inotmac kernel: [15235.179326]  [<ffffffff8120cfc0>] ? get_unused_fd_flags+0x30/0x40
Dec 21 22:49:40 inotmac kernel: [15235.179329]  [<ffffffff8120d028>] ? __fd_install+0x58/0x70
Dec 21 22:49:40 inotmac kernel: [15235.179331]  [<ffffffff81073f46>] SyS_clone+0x16/0x20
Dec 21 22:49:40 inotmac kernel: [15235.179334]  [<ffffffff817b3ab9>] stub_clone+0x69/0x90
Dec 21 22:49:40 inotmac kernel: [15235.179336]  [<ffffffff817b376d>] ? system_call_fastpath+0x16/0x1b
Dec 21 22:49:41 inotmac kernel: [15235.950976] unregister_netdevice: waiting for lo to become free. Usage count = 2
Dec 21 22:49:51 inotmac kernel: [15246.059346] unregister_netdevice: waiting for lo to become free. Usage count = 2

I'll send docker version/info once the system isn't frozen anymore :)

@sbward

sbward commented Dec 23, 2014

We're seeing this issue as well. Ubuntu 14.04, 3.13.0-37-generic

@jbalonso

jbalonso commented Dec 29, 2014

On Ubuntu 14.04 server, my team has found that downgrading from 3.13.0-40-generic to 3.13.0-32-generic "resolves" the issue. Given @sbward's observation, that would put the regression after 3.13.0-32-generic and before (or including) 3.13.0-37-generic.

I'll add that, in our case, we sometimes see a negative usage count.
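
For anyone who wants to compare kernels the same way, a minimal sketch of booting an older Ubuntu kernel alongside the current one (package names assumed to follow Ubuntu's usual linux-image-VERSION-generic pattern; adjust to the versions your archive actually carries):

# Install the older kernel, reboot into it via GRUB's "Advanced options" menu,
# then confirm which kernel is running.
sudo apt-get install linux-image-3.13.0-32-generic linux-image-extra-3.13.0-32-generic
sudo reboot
# after reboot:
uname -r    # should report 3.13.0-32-generic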

@rsampaio

Contributor

rsampaio commented Jan 15, 2015

FWIW, we hit this bug running LXC on the trusty kernel (3.13.0-40-generic #69-Ubuntu); the message appears in dmesg followed by this stack trace:

[27211131.602869] INFO: task lxc-start:26342 blocked for more than 120 seconds.
[27211131.602874]       Not tainted 3.13.0-40-generic #69-Ubuntu
[27211131.602877] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27211131.602881] lxc-start       D 0000000000000001     0 26342      1 0x00000080
[27211131.602883]  ffff88000d001d40 0000000000000282 ffff88001aa21800 ffff88000d001fd8
[27211131.602886]  0000000000014480 0000000000014480 ffff88001aa21800 ffffffff81cdb760
[27211131.602888]  ffffffff81cdb764 ffff88001aa21800 00000000ffffffff ffffffff81cdb768
[27211131.602891] Call Trace:
[27211131.602894]  [<ffffffff81723b69>] schedule_preempt_disabled+0x29/0x70
[27211131.602897]  [<ffffffff817259d5>] __mutex_lock_slowpath+0x135/0x1b0
[27211131.602900]  [<ffffffff811a2679>] ? __kmalloc+0x1e9/0x230
[27211131.602903]  [<ffffffff81725a6f>] mutex_lock+0x1f/0x2f
[27211131.602905]  [<ffffffff8161c2c1>] copy_net_ns+0x71/0x130
[27211131.602908]  [<ffffffff8108f889>] create_new_namespaces+0xf9/0x180
[27211131.602910]  [<ffffffff8108f983>] copy_namespaces+0x73/0xa0
[27211131.602912]  [<ffffffff81065b16>] copy_process.part.26+0x9a6/0x16b0
[27211131.602915]  [<ffffffff810669f5>] do_fork+0xd5/0x340
[27211131.602917]  [<ffffffff810c8e8d>] ? call_rcu_sched+0x1d/0x20
[27211131.602919]  [<ffffffff81066ce6>] SyS_clone+0x16/0x20
[27211131.602921]  [<ffffffff81730089>] stub_clone+0x69/0x90
[27211131.602923]  [<ffffffff8172fd2d>] ? system_call_fastpath+0x1a/0x1f
@MrMMorris

MrMMorris commented Mar 16, 2015

Ran into this on Ubuntu 14.04 and Debian jessie w/ kernel 3.16.x.

Docker command:

docker run -t -i -v /data/sitespeed.io:/sitespeed.io/results company/dockerfiles:sitespeed.io-latest --name "Superbrowse"

This seems like a pretty bad issue...

@MrMMorris

MrMMorris commented Mar 17, 2015

@jbalonso even with 3.13.0-32-generic I get the error after only a few successful runs 😭

@rsampaio

Contributor

rsampaio commented Mar 17, 2015

@MrMMorris could you share a reproducer script using publicly available images?

@unclejack

Contributor

unclejack commented Mar 18, 2015

Everyone who's seeing this error is running a distribution kernel package that's far too old and lacks the fixes for this particular problem.

If you run into this problem, make sure you run apt-get update && apt-get dist-upgrade -y and reboot your system. If you're on Digital Ocean, you also need to select the kernel version which was just installed during the update because they don't use the latest kernel automatically (see https://digitalocean.uservoice.com/forums/136585-digitalocean/suggestions/2814988-give-option-to-use-the-droplet-s-own-bootloader).

CentOS/RHEL/Fedora/Scientific Linux users need to keep their systems updated using yum update and reboot after installing the updates.

When reporting this problem, please make sure your system is fully patched and up to date with the latest stable updates (no manually installed experimental/testing/alpha/beta/rc packages) provided by your distribution's vendor.
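
For reference, a sketch of the update path described above; the commands are the standard distribution tools, but the exact kernel versions you end up with depend on your vendor:

# Debian/Ubuntu
sudo apt-get update && sudo apt-get dist-upgrade -y
sudo reboot

# CentOS/RHEL/Fedora/Scientific Linux
sudo yum update -y
sudo reboot

# afterwards, verify which kernel is actually running
uname -r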

@MrMMorris

MrMMorris commented Mar 18, 2015

@unclejack

I ran apt-get update && apt-get dist-upgrade -y

ubuntu 14.04 3.13.0-46-generic

Still get the error after only one docker run

I can create an AMI for reproducing if needed

@unclejack

Contributor

unclejack commented Mar 18, 2015

@MrMMorris Thank you for confirming it's still a problem with the latest kernel package on Ubuntu 14.04.

@MrMMorris

MrMMorris commented Mar 18, 2015

Anything else I can do to help, let me know! 😄

@rsampaio

Contributor

rsampaio commented Mar 18, 2015

@MrMMorris if you can provide a reproducer there is a bug opened for Ubuntu and it will be much appreciated: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1403152

@MrMMorris

MrMMorris commented Mar 18, 2015

@rsampaio if I have time today, I will definitely get that for you!

@fxposter

fxposter commented Mar 23, 2015

This problem also appears on 3.16(.7) on both Debian 7 and Debian 8: #9605 (comment). Rebooting the server is the only way to fix this for now.

@chrisjstevenson

chrisjstevenson commented Apr 27, 2015

Seeing this issue on RHEL 6.6 with kernel 2.6.32-504.8.1.el6.x86_64 when starting some docker containers (not all containers)
kernel:unregister_netdevice: waiting for lo to become free. Usage count = -1

Again, rebooting the server seems to be the only solution at this time

@popsikle

popsikle commented May 12, 2015

Also seeing this on CoreOS (647.0.0) with kernel 3.19.3.

Rebooting is also the only solution I have found.

@fxposter

fxposter commented May 20, 2015

Tested Debian jessie with sid's kernel (4.0.2) - the problem remains.

@popsikle

popsikle commented Jun 19, 2015

Anyone seeing this issue running non-ubuntu containers?

@fxposter

fxposter commented Jun 19, 2015

Yes. Debian ones.

@unclejack

Contributor

unclejack commented Jun 20, 2015

This is a kernel issue, not an image-related issue. Switching one image for another won't improve this problem or make it worse.

@techniq

techniq commented Jul 17, 2015

Experiencing issue on Debian Jessie on a BeagleBone Black running 4.1.2-bone12 kernel

@igorastds

igorastds commented Jul 17, 2015

Experiencing this after switching from 4.1.2 to 4.2-rc2 (using a git build of 1.8.0).
Deleting /var/lib/docker/* doesn't solve the problem.
Switching back to 4.1.2 solves the problem.

Also, VirtualBox has the same issue and there's a patch for v5.0.0 (back-ported to v4) which supposedly changes something in the kernel driver part; it's worth a look to understand the problem.

@fxposter

fxposter commented Jul 22, 2015

This is the fix in the VirtualBox: https://www.virtualbox.org/attachment/ticket/12264/diff_unregister_netdev
They don't actually modify the kernel, just their kernel module.

@nazar-pc

nazar-pc commented Jul 24, 2015

Also having this issue with 4.2-rc2:

unregister_netdevice: waiting for vethf1738d3 to become free. Usage count = 1

@nazar-pc

nazar-pc commented Jul 24, 2015

Just compiled 4.2-RC3, seems to work again

@LK4D4

Contributor

LK4D4 commented Jul 24, 2015

@nazar-pc Thanks for the info. Just hit it with 4.1.3; was pretty upset.
@techniq same here, pretty bad kernel bug. I wonder if we should report it so it gets backported to the 4.1 tree.

@feisuzhu

feisuzhu commented Jul 30, 2015

Linux docker13 3.19.0-22-generic #22-Ubuntu SMP Tue Jun 16 17:15:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Kernel from Ubuntu 15.04, same issue

@LK4D4

Contributor

LK4D4 commented Jul 30, 2015

I saw it with 4.2-rc3 as well. There is more than one bug behind this device leakage :) I can reproduce it on any kernel >= 4.1 under high load.

@dElogics

dElogics commented May 4, 2018

Can anyone confirm whether the latest 4.14 kernel has this issue? It seems it does not; I haven't seen anyone around the Internet report this issue with the 4.14 kernel.

@dimm0

dimm0 commented May 4, 2018

I see this with the 4.15.15-1 kernel on CentOS 7.

@dElogics

dElogics commented May 7, 2018

Looking at the change logs, https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.15.8 has a fix for SCTP, but not TCP. So you may want to try the latest 4.14.

@spronin-aurea

spronin-aurea commented Jun 4, 2018

  • even 4.15.18 does not help with this bug
  • disabling ipv6 does not help either

We have now upgraded to 4.16.13 and are observing. This bug was hitting us on one node only, approximately once per week.

@qrpike

qrpike commented Jun 4, 2018

@scher200

scher200 commented Jun 4, 2018

For me, the bug most often shows up after redeploying the same project/network again.

@spronin-aurea

spronin-aurea commented Jun 4, 2018

@qrpike you are right, we tried only sysctl. Let me try with grub. Thanks!
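
For anyone else comparing the two approaches being discussed here, a sketch (assuming a GRUB-based distro; the CoreOS/PXE equivalent is the ipv6.disable=1 kernel argument shown in a later comment):

# Runtime-only (sysctl) - what was tried first:
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1

# Boot-time (kernel parameter) - disables the IPv6 stack entirely:
# add ipv6.disable=1 to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
sudo update-grub    # or grub2-mkconfig -o /boot/grub2/grub.cfg on RHEL-family systems
sudo reboot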

@dElogics

dElogics commented Jun 19, 2018

4.9.88 Debian kernel. Reproducible.

@komljen

komljen commented Jun 19, 2018

@qrpike you are right, we tried only sysctl. Let me try with grub. Thanks!

In my case disabling ipv6 didn't make any difference.

@qrpike

qrpike commented Jun 19, 2018

@spronin-aurea Did disabling ipv6 at boot loader help?

@komljen

komljen commented Jun 19, 2018

@qrpike can you tell us about the nodes you are using, since disabling ipv6 helped in your case? Kernel version, k8s version, CNI, Docker version, etc.

@qrpike

qrpike commented Jun 19, 2018

@komljen I have been using CoreOS for the past 2 years without a single incident, since around version 1000. I haven't tried it recently, but if I do not disable ipv6 the bug happens.

@deimosfr

deimosfr commented Jun 19, 2018

On my side, I'm using CoreOS too, with ipv6 disabled via grub, and I'm still getting the issue.

@qrpike

qrpike commented Jun 19, 2018

@deimosfr I'm currently using PXE boot for all my nodes:

      DEFAULT menu.c32
      prompt 0
      timeout 50
      MENU TITLE PXE Boot Blade 1
      label coreos
              menu label CoreOS ( blade 1 )
              kernel coreos/coreos_production_pxe.vmlinuz
              append initrd=coreos/coreos_production_pxe_image.cpio.gz ipv6.disable=1 net.ifnames=1 biosdevname=0 elevator=deadline cloud-config-url=http://HOST_PRIV_IP:8888/coreos-cloud-config.yml?host=1 root=LABEL=ROOT rootflags=noatime,discard,rw,seclabel,nodiratime

However, my main node that is the PXE host is also CoreOS and boots from disk, and does not have the issue either.

@dElogics

dElogics commented Jun 19, 2018

What kernel versions are you guys running?

@deimosfr

deimosfr commented Jun 19, 2018

The ones where I got the issue were on 4.14.32-coreos and earlier. I have not encountered this issue yet on 4.14.42-coreos.

@wallewuli

wallewuli commented Jul 2, 2018

Centos 7.5 with 4.17.3-1 kernel, still got the issue.

Env :
kubernetes 1.10.4
Docker 13.1
with Flannel network plugin.

Log :
[ 89.790907] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 89.798523] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 89.799623] cni0: port 8(vethb8a93c6f) entered blocking state
[ 89.800547] cni0: port 8(vethb8a93c6f) entered disabled state
[ 89.801471] device vethb8a93c6f entered promiscuous mode
[ 89.802323] cni0: port 8(vethb8a93c6f) entered blocking state
[ 89.803200] cni0: port 8(vethb8a93c6f) entered forwarding state

kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

Now:
The node IP is still reachable, but no network services (such as ssh) can be used...

@Blub

Blub commented Jul 2, 2018

The symptoms here are similar to a lot of reports in various other places. All having to do with network namespaces. Could the people running into this please see if unshare -n hangs, and if so, from another terminal, do cat /proc/$pid/stack of the unshare process to see if it hangs in copy_net_ns()? This seems to be a common denominator for many of the issues including some backtraces found here. Between 4.16 and 4.18 there have been a number of patches by Kirill Tkhai refactoring the involved locking a lot. The affected distro/kernel package maintainers should probably look into applying/backporting them to stable kernels and see if that helps.
See also: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779678
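
A small sketch of that check, for anyone who wants to run it (as root; the 10-second wait is an arbitrary threshold):

# If creating a new network namespace is deadlocked, unshare never returns
# and its kernel stack should show copy_net_ns().
unshare -n true &
pid=$!
sleep 10
if kill -0 "$pid" 2>/dev/null; then
    echo "unshare -n still blocked after 10s, kernel stack:"
    cat /proc/"$pid"/stack
else
    echo "unshare -n completed normally"
fi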

@cassiussa

cassiussa commented Jul 3, 2018

@Blub

sudo cat /proc/122355/stack
[<ffffffff8157f6e2>] copy_net_ns+0xa2/0x180
[<ffffffff810b7519>] create_new_namespaces+0xf9/0x180
[<ffffffff810b775a>] unshare_nsproxy_namespaces+0x5a/0xc0
[<ffffffff81088983>] SyS_unshare+0x193/0x300
[<ffffffff816b8c6b>] tracesys+0x97/0xbd
[<ffffffffffffffff>] 0xffffffffffffffff
@Blub

Blub commented Jul 4, 2018

Given the locking changes in 4.18 it would be good to test the current 4.18rc, especially if you can trigger it more or less reliably, as from what I've seen there are many people where changing kernel versions also changed the likelihood of this happening a lot.

@komljen

komljen commented Jul 4, 2018

I had this issue with Kubernetes, and after switching to the latest CoreOS stable release - 1745.7.0 - the issue is gone:

  • kernel: 4.14.48
  • docker: 18.03.1
@PengBAI

PengBAI commented Jul 5, 2018

Same issue on CentOS 7:

  • kernel: 4.11.1-1.el7.elrepo.x86_64
  • docker: 17.12.0-ce
@zihaoyu

zihaoyu commented Jul 9, 2018

@Blub Seeing the same on CoreOS 1688.5.3, kernel 4.14.32

ip-10-72-101-86 core # cat /proc/59515/stack
[<ffffffff9a4df14e>] copy_net_ns+0xae/0x200
[<ffffffff9a09519c>] create_new_namespaces+0x11c/0x1b0
[<ffffffff9a0953a9>] unshare_nsproxy_namespaces+0x59/0xb0
[<ffffffff9a07418d>] SyS_unshare+0x1ed/0x3b0
[<ffffffff9a003977>] do_syscall_64+0x67/0x120
[<ffffffff9a800081>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<ffffffffffffffff>] 0xffffffffffffffff
@Blub

Blub commented Jul 10, 2018

In theory there may be one or more other traces somewhere containing one of the functions from net_namespace.c locking the net_mutex (cleanup_net, net_ns_barrier, net_ns_init, {,un}register_pernet_{subsys,device}). For stable kernels it would of course be much easier if there was one particular thing deadlocking in a way that could be fixed, than backporting all the locking changes from 4.18. But so far I haven't seen a trace leading to the root cause. I don't know if it'll help, but maybe other /proc/*/stacks with the above functions are visible when the issue appears?
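
One rough way to scan for such traces while the issue is occurring (as root; the pattern below also matches unregister_pernet_*):

# Print the kernel stack of any task currently sitting in the
# net_namespace.c locking paths mentioned above.
for f in /proc/[0-9]*/stack; do
    if grep -q -E 'copy_net_ns|cleanup_net|net_ns_barrier|register_pernet' "$f" 2>/dev/null; then
        echo "== $f ($(cat "${f%/stack}/comm" 2>/dev/null))"
        cat "$f"
    fi
done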

@baykier

baykier commented Jul 19, 2018

Same issue! My env is Debian 8.
debian-docker
docker

@anthraxn8b

anthraxn8b commented Jul 19, 2018

RHEL, SWARM, 18.03.0-ce

  1. Connecting to manager node via ssh

  2. Manually starting a container on a manager node:

    sudo docker run -it -v /import:/temp/eximport -v /home/myUser:/temp/exhome docker.repo.myHost/fedora:23 /bin/bash

  3. After some time doing nothing:

    [root@8a9857c25919 myDir]#
    Message from syslogd@se1-shub-t002 at Jul 19 11:56:03 ...
    kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

After a few minutes I am back on the console of the manager node and the started container is no longer running.

Does this describe the same issue or is this another "problem suite"?

THX in advance!

UPDATE
This also happens directly on the ssh console (on the swarm manager bash).

UPDATE
Host machine (one manager node in the swarm):
Linux [MACHINENNAME] 3.10.0-514.2.2.el7.x86_64 #1 SMP Wed Nov 16 13:15:13 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
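
A quick way to tell whether this is the transient message or a real hang (the distinction dElogics draws just below) is to keep watching the kernel log and the container state for a few minutes; a sketch, assuming a dmesg that supports -w:

# The message repeating every ~10 seconds for many minutes, together with a
# container stuck in this state, is the hang discussed in this issue.
dmesg -w | grep unregister_netdevice
# in another terminal:
docker ps -a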

@dElogics

dElogics commented Jul 19, 2018

If it does not resolve itself after some time, then it's a different problem.

@mazengxie

mazengxie commented Jul 24, 2018

Same on CentOS 7.5, kernel 3.10.0-693.el7.x86_64 and Docker 1.13.1.

@opskumu opskumu referenced this issue Jul 24, 2018

Open

学习周报 (weekly study report) #19

@che-vgik

che-vgik commented Jul 25, 2018

The same problem OEL 7.5
uname -a
4.1.12-124.16.1.el7uek.x86_64 #2 SMP Mon Jun 11 20:09:51 PDT 2018 x86_64 x86_64 x86_64 GNU/Linux
docker info
Containers: 9
Running: 5
Paused: 0
Stopped: 4
Images: 6
Server Version: 17.06.2-ol

dmesg
[2238374.718889] unregister_netdevice: waiting for lo to become free. Usage count = 1
[2238384.762813] unregister_netdevice: waiting for lo to become free. Usage count = 1
[2238392.792585] eth0: renamed from vethbed6d59

@thaJeztah

thaJeztah commented Jul 25, 2018

Member

(repeating this #5618 (comment) here again, because GitHub is hiding old comments)

If you are arriving here

The issue being discussed here is a kernel bug and has not yet been fully fixed. Some patches went in the kernel that fix some occurrences of this issue, but others are not yet resolved.

There are a number of options that may help for some situations, but not for all (again; it's most likely a combination of issues that trigger the same error)

The "unregister_netdevice: waiting for lo to become free" error itself is not the bug

It's the kernel crash after it that's a bug (see below)

Do not leave "I have this too" comments

"I have this too" does not help resolving the bug. only leave a comment if you have information that may help resolve the issue (in which case; providing a patch to the kernel upstream may be the best step).

If you want to let us know you have this issue too, use the "thumbs up" button on the top description:
screen shot 2017-03-09 at 16 12 17

If you want to stay informed on updates use the subscribe button.

screen shot 2017-03-09 at 16 11 03

Every comment here sends an e-mail / notification to over 3000 people. I don't want to lock the conversation on this issue, because it's not resolved yet, but I may be forced to if you ignore this.

I will be removing comments that don't add useful information in order to (slightly) shorten the thread

If you want to help resolving this issue

  • Read the whole thread, including those comments that are hidden; it's long, and github hides comments (so you'll have to click to make those visible again). There's a lot if information present in this thread already that could possibly help you

screen shot 2018-07-25 at 15 18 14

  • Read this comment #5618 (comment) (and comments around that time) for information that can be helpful:

To be clear, the message itself is benign, it's the kernel crash after the messages reported by the OP which is not.

The comment in the code, where this message is coming from, explains what's happening. Basically every user, such as the IP stack) of a network device (such as the end of veth pair inside a container) increments a reference count in the network device structure when it is using the network device. When the device is removed (e,g. when the container is removed) each user is notified so that they can do some cleanup (e.g. closing open sockets etc) before decrementing the reference count. Because this cleanup can take some time, especially under heavy load (lot's of interface, a lot of connections etc), the kernel may print the message here once in a while.

If a user of network device never decrements the reference count, some other part of the kernel will determine that the task waiting for the cleanup is stuck and it will crash. It is only this crash which indicates a kernel bug (some user, via some code path, did not decrement the reference count). There have been several such bugs and they have been fixed in modern kernel (and possibly back ported to older ones). I have written quite a few stress tests (and continue writing them) to trigger such crashes but have not been able to reproduce on modern kernels (i do however the above message).

** Please only report on this issue if your kernel actually crashes**, and then we would be very interested in:

  • kernel version (output of uname -r)
  • Linux distribution/version
  • Are you on the latest kernel version of your Linux vendor?
  • Network setup (bridge, overlay, IPv4, IPv6, etc)
  • Description of the workload (what type of containers, what type of network load, etc)
  • And ideally a simple reproduction

Thanks!

Member

thaJeztah commented Jul 25, 2018

(repeating this #5618 (comment) here again, because GitHub is hiding old comments)

If you are arriving here

The issue being discussed here is a kernel bug and has not yet been fully fixed. Some patches went in the kernel that fix some occurrences of this issue, but others are not yet resolved.

There are a number of options that may help for some situations, but not for all (again; it's most likely a combination of issues that trigger the same error)

The "unregister_netdevice: waiting for lo to become free" error itself is not the bug

It's the kernel crash that follows it that is the bug (see below)

Do not leave "I have this too" comments

"I have this too" does not help resolving the bug. only leave a comment if you have information that may help resolve the issue (in which case; providing a patch to the kernel upstream may be the best step).

If you want to let us know that you have this issue too, use the "thumbs up" button on the top description:
[screenshot: the "thumbs up" reaction button]

If you want to stay informed on updates use the subscribe button.

[screenshot: the subscribe button]

Every comment here sends an e-mail/notification to over 3000 people. I don't want to lock the conversation on this issue, because it's not resolved yet, but I may be forced to if you ignore this.

I will be removing comments that don't add useful information in order to (slightly) shorten the thread

If you want to help resolving this issue

  • Read the whole thread, including those comments that are hidden; it's long, and GitHub hides comments (so you'll have to click to make them visible again). There is a lot of information in this thread already that could possibly help you.

[screenshot: GitHub's hidden-comments notice in the thread]

  • Read this comment #5618 (comment) (and comments around that time) for information that can be helpful:

To be clear, the message itself is benign; it's the kernel crash after the messages reported by the OP which is not.

The comment in the code where this message comes from explains what's happening. Basically, every user (such as the IP stack) of a network device (such as the end of a veth pair inside a container) increments a reference count in the network device structure while it is using the device. When the device is removed (e.g. when the container is removed), each user is notified so that it can do some cleanup (e.g. closing open sockets) before decrementing the reference count. Because this cleanup can take some time, especially under heavy load (lots of interfaces, a lot of connections, etc.), the kernel may print the message here once in a while.

If a user of a network device never decrements the reference count, some other part of the kernel will determine that the task waiting for the cleanup is stuck and it will crash. It is only this crash which indicates a kernel bug (some user, via some code path, did not decrement the reference count). There have been several such bugs and they have been fixed in modern kernels (and possibly backported to older ones). I have written quite a few stress tests (and continue writing them) to trigger such crashes, but have not been able to reproduce them on modern kernels (I do, however, see the above message).
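
For anyone who wants to try reproducing this locally, below is a rough stress sketch in the spirit of those tests. It is not the actual test suite, just a hypothetical loop that churns containers so veth pairs and network namespaces are torn down under load; it assumes a working docker CLI and the alpine image, and the container names are made up:

# Hypothetical stress loop (not the actual test suite): churn containers so
# that veth pairs and network namespaces are created and torn down rapidly.
# Assumes a working `docker` CLI and the `alpine` image; names are made up.
for i in $(seq 1 50); do
  docker run -d --rm --name "netstress-$i" alpine sleep 30 >/dev/null
done

# Stop them all at once; teardown of many namespaces is when the
# "waiting for ... to become free" message tends to appear in dmesg.
docker ps -q --filter "name=netstress-" | xargs -r docker stop >/dev/null

# Watch the kernel log for the message (Ctrl-C to stop watching).
dmesg --follow | grep --line-buffered unregister_netdevice

Seeing the message alone during such a run is expected; only a subsequent hung-task crash would be a new data point for this issue.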

**Please only report on this issue if your kernel actually crashes**, and then we would be very interested in the following (a small collection sketch follows the list):

  • kernel version (output of uname -r)
  • Linux distribution/version
  • Are you on the latest kernel version of your Linux vendor?
  • Network setup (bridge, overlay, IPv4, IPv6, etc)
  • Description of the workload (what type of containers, what type of network load, etc)
  • And ideally a simple reproduction
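
For convenience, a minimal collection sketch for the items above (assuming standard tools such as uname, /etc/os-release and the docker CLI are available; the output file name is arbitrary):

# Gather the requested details into one file; adjust for your distribution.
{
  echo "== kernel version =="
  uname -r
  echo "== distribution =="
  cat /etc/os-release
  echo "== docker version / info =="
  docker version
  docker info
  echo "== network setup =="
  docker network ls
  ip link show
} > issue-5618-report.txt 2>&1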

Thanks!

@dElogics


dElogics commented Aug 7, 2018

Are you running Docker under any limits, e.g. ulimits, cgroups, etc.?

Newer systemd has a default limit even if you didn't set one. I set things to unlimited, and the issue hasn't occurred since (I've been watching for 31 days).
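
To see which limits systemd currently applies to the daemon, a quick check (assuming the unit is named docker.service) is:

# Show the resource limits systemd applies to the Docker daemon.
systemctl show docker.service --property=LimitNOFILE,LimitNPROC,LimitCORE,TasksMax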

@johvann


johvann commented Aug 7, 2018

I had the same issue in many environments, and my workaround was to stop the firewall. It has not happened again, for now.

RHEL 7.5 - 3.10.0-862.3.2.el7.x86_64
Docker 1.13
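
For reference, the commands corresponding to this workaround would look roughly like the following (a sketch assuming the firewall in question is firewalld on RHEL 7; note that this disables the host firewall entirely, so weigh the security impact first):

# Hypothetical sketch of the workaround above; assumes firewalld on RHEL 7.
# This disables the host firewall entirely, so consider the security impact.
sudo systemctl stop firewalld
sudo systemctl disable firewalld
# Restart Docker so it re-creates its iptables rules without firewalld.
sudo systemctl restart docker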

@alexhexabeam


alexhexabeam commented Aug 7, 2018

@dElogics What version of systemd is considered "newer"? Is this default limit enabled in the CentOS 7.5 systemd?

Also, when you ask if we're running docker under any limits, do you mean the docker daemon, or the individual containers?

@dElogics


dElogics commented Aug 11, 2018

The Docker daemon. The systemd version is the one in Debian 9 (232-25).

Not sure about RHEL, but I've personally seen this issue on RHEL too. I set LimitNOFILE=1048576, LimitNPROC=infinity, LimitCORE=infinity and TasksMax=infinity (see the drop-in sketch below).
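
A sketch of what such a systemd drop-in could look like (assuming the unit is named docker.service; the drop-in file name is arbitrary):

# Hypothetical drop-in raising the limits mentioned above.
# Assumes the unit is named docker.service; the file name is arbitrary.
mkdir -p /etc/systemd/system/docker.service.d
cat <<'EOF' > /etc/systemd/system/docker.service.d/limits.conf
[Service]
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
EOF
# Reload unit files and restart the daemon for the new limits to take effect.
systemctl daemon-reload
systemctl restart docker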

@xugl


xugl commented Aug 14, 2018

kernel: unregister_netdevice: waiting for eth0 to become free. Usage count = 3
kernel: 4.4.146-1.el7.elrepo.x86_64
Linux version: CentOS Linux release 7.4.1708 (Core)
bridge mode

I had the same issue; what can I do?
