kernel crash after "unregister_netdevice: waiting for lo to become free. Usage count = 3" #5618

tankywoo opened this issue May 6, 2014 · 521 comments



@tankywoo tankywoo commented May 6, 2014

This happens when I log in to the container, and I can't quit with Ctrl-C.

My system is Ubuntu 12.04, kernel is 3.8.0-25-generic.

docker version:

root@wutq-docker:~# docker version
Client version: 0.10.0
Client API version: 1.10
Go version (client): go1.2.1
Git commit (client): dc9c28f
Server version: 0.10.0
Server API version: 1.10
Git commit (server): dc9c28f
Go version (server): go1.2.1
Last stable version: 0.10.0

I used the script https://raw.githubusercontent.com/dotcloud/docker/master/contrib/check-config.sh to check my configuration, and everything looked fine.
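
(For anyone who wants to run the same check, it amounts to roughly this:)

# download the kernel-config checker referenced above and run it against the running kernel
curl -fsSL https://raw.githubusercontent.com/dotcloud/docker/master/contrib/check-config.sh -o check-config.sh
chmod +x check-config.sh
sudo ./check-config.sh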

I watched the syslog and found these messages:

May  6 11:30:33 wutq-docker kernel: [62365.889369] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:30:44 wutq-docker kernel: [62376.108277] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:30:54 wutq-docker kernel: [62386.327156] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:02 wutq-docker kernel: [62394.423920] INFO: task docker:1024 blocked for more than 120 seconds.
May  6 11:31:02 wutq-docker kernel: [62394.424175] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  6 11:31:02 wutq-docker kernel: [62394.424505] docker          D 0000000000000001     0  1024      1 0x00000004
May  6 11:31:02 wutq-docker kernel: [62394.424511]  ffff880077793cb0 0000000000000082 ffffffffffffff04 ffffffff816df509
May  6 11:31:02 wutq-docker kernel: [62394.424517]  ffff880077793fd8 ffff880077793fd8 ffff880077793fd8 0000000000013f40
May  6 11:31:02 wutq-docker kernel: [62394.424521]  ffff88007c461740 ffff880076b1dd00 000080d081f06880 ffffffff81cbbda0
May  6 11:31:02 wutq-docker kernel: [62394.424526] Call Trace:                                                         
May  6 11:31:02 wutq-docker kernel: [62394.424668]  [<ffffffff816df509>] ? __slab_alloc+0x28a/0x2b2
May  6 11:31:02 wutq-docker kernel: [62394.424700]  [<ffffffff816f1849>] schedule+0x29/0x70
May  6 11:31:02 wutq-docker kernel: [62394.424705]  [<ffffffff816f1afe>] schedule_preempt_disabled+0xe/0x10
May  6 11:31:02 wutq-docker kernel: [62394.424710]  [<ffffffff816f0777>] __mutex_lock_slowpath+0xd7/0x150
May  6 11:31:02 wutq-docker kernel: [62394.424715]  [<ffffffff815dc809>] ? copy_net_ns+0x69/0x130
May  6 11:31:02 wutq-docker kernel: [62394.424719]  [<ffffffff815dc0b1>] ? net_alloc_generic+0x21/0x30
May  6 11:31:02 wutq-docker kernel: [62394.424724]  [<ffffffff816f038a>] mutex_lock+0x2a/0x50
May  6 11:31:02 wutq-docker kernel: [62394.424727]  [<ffffffff815dc82c>] copy_net_ns+0x8c/0x130
May  6 11:31:02 wutq-docker kernel: [62394.424733]  [<ffffffff81084851>] create_new_namespaces+0x101/0x1b0
May  6 11:31:02 wutq-docker kernel: [62394.424737]  [<ffffffff81084a33>] copy_namespaces+0xa3/0xe0
May  6 11:31:02 wutq-docker kernel: [62394.424742]  [<ffffffff81057a60>] ? dup_mm+0x140/0x240
May  6 11:31:02 wutq-docker kernel: [62394.424746]  [<ffffffff81058294>] copy_process.part.22+0x6f4/0xe60
May  6 11:31:02 wutq-docker kernel: [62394.424752]  [<ffffffff812da406>] ? security_file_alloc+0x16/0x20
May  6 11:31:02 wutq-docker kernel: [62394.424758]  [<ffffffff8119d118>] ? get_empty_filp+0x88/0x180
May  6 11:31:02 wutq-docker kernel: [62394.424762]  [<ffffffff81058a80>] copy_process+0x80/0x90
May  6 11:31:02 wutq-docker kernel: [62394.424766]  [<ffffffff81058b7c>] do_fork+0x9c/0x230
May  6 11:31:02 wutq-docker kernel: [62394.424769]  [<ffffffff816f277e>] ? _raw_spin_lock+0xe/0x20
May  6 11:31:02 wutq-docker kernel: [62394.424774]  [<ffffffff811b9185>] ? __fd_install+0x55/0x70
May  6 11:31:02 wutq-docker kernel: [62394.424777]  [<ffffffff81058d96>] sys_clone+0x16/0x20
May  6 11:31:02 wutq-docker kernel: [62394.424782]  [<ffffffff816fb939>] stub_clone+0x69/0x90
May  6 11:31:02 wutq-docker kernel: [62394.424786]  [<ffffffff816fb5dd>] ? system_call_fastpath+0x1a/0x1f
May  6 11:31:04 wutq-docker kernel: [62396.466223] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:14 wutq-docker kernel: [62406.689132] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:25 wutq-docker kernel: [62416.908036] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:35 wutq-docker kernel: [62427.126927] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:45 wutq-docker kernel: [62437.345860] unregister_netdevice: waiting for lo to become free. Usage count = 3

After this happened, I opened another terminal and killed the process, then restarted Docker, but it just hung.

I rebooted the host, and it still displayed those messages for several minutes during shutdown:
[screenshot: screen shot 2014-05-06 at 11 49 27]


@drpancake drpancake commented May 23, 2014

I'm seeing a very similar issue for eth0. Ubuntu 12.04 also.

I have to power cycle the machine. From /var/log/kern.log:

May 22 19:26:08 box kernel: [596765.670275] device veth5070 entered promiscuous mode
May 22 19:26:08 box kernel: [596765.680630] IPv6: ADDRCONF(NETDEV_UP): veth5070: link is not ready
May 22 19:26:08 box kernel: [596765.700561] IPv6: ADDRCONF(NETDEV_CHANGE): veth5070: link becomes ready
May 22 19:26:08 box kernel: [596765.700628] docker0: port 7(veth5070) entered forwarding state
May 22 19:26:08 box kernel: [596765.700638] docker0: port 7(veth5070) entered forwarding state
May 22 19:26:19 box kernel: [596777.386084] [FW DBLOCK] IN=docker0 OUT= PHYSIN=veth5070 MAC=56:84:7a:fe:97:99:9e:df:a7:3f:23:42:08:00 SRC=172.17.0.8 DST=172.17.42.1 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=170 DF PROTO=TCP SPT=51615 DPT=13162 WINDOW=14600 RES=0x00 SYN URGP=0
May 22 19:26:21 box kernel: [596779.371993] [FW DBLOCK] IN=docker0 OUT= PHYSIN=veth5070 MAC=56:84:7a:fe:97:99:9e:df:a7:3f:23:42:08:00 SRC=172.17.0.8 DST=172.17.42.1 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=549 DF PROTO=TCP SPT=46878 DPT=12518 WINDOW=14600 RES=0x00 SYN URGP=0
May 22 19:26:23 box kernel: [596780.704031] docker0: port 7(veth5070) entered forwarding state
May 22 19:27:13 box kernel: [596831.359999] docker0: port 7(veth5070) entered disabled state
May 22 19:27:13 box kernel: [596831.361329] device veth5070 left promiscuous mode
May 22 19:27:13 box kernel: [596831.361333] docker0: port 7(veth5070) entered disabled state
May 22 19:27:24 box kernel: [596841.516039] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
May 22 19:27:34 box kernel: [596851.756060] unregister_netdevice: waiting for eth0 to become free. Usage count = 1
May 22 19:27:44 box kernel: [596861.772101] unregister_netdevice: waiting for eth0 to become free. Usage count = 1

@egasimus egasimus commented Jun 4, 2014

Hey, this just started happening for me as well.

Docker version:

Client version: 0.11.1
Client API version: 1.11
Go version (client): go1.2.1
Git commit (client): fb99f99
Server version: 0.11.1
Server API version: 1.11
Git commit (server): fb99f99
Go version (server): go1.2.1
Last stable version: 0.11.1

Kernel log: http://pastebin.com/TubCy1tG

System details:
Running Ubuntu 14.04 LTS with a patched kernel (3.14.3-rt4). I have yet to see it happen with the default linux-3.13.0-27-generic kernel. What's funny, though, is that when this happens, all my terminal windows freeze, letting me type a few characters at most before that. The same fate befalls any new ones I open, too - and I end up needing to power cycle my poor laptop, just like the good doctor above. For the record, I'm running the fish shell in urxvt or xterm under xmonad. I haven't checked whether it affects plain bash.


@egasimus egasimus commented Jun 5, 2014

This might be relevant:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1065434#yui_3_10_3_1_1401948176063_2050

Copying a fairly large amount of data over the network inside a container
and then exiting the container can trigger a missing decrement in the per
cpu reference count on a network device.

Sure enough, one of the times this happened for me was right after apt-getting a package with a ton of dependencies.
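
A hypothetical workload of that shape (image and package chosen purely for illustration) would be something like:

# pull a lot of data over the network inside a container, then tear it down;
# exiting the container is what appears to trigger the leaked reference
docker run --rm ubuntu:14.04 bash -c "apt-get update && apt-get install -y build-essential"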


@drpancake drpancake commented Jun 5, 2014

Upgrading from Ubuntu 12.04.3 to 14.04 fixed this for me without any other changes.

@unclejack unclejack added the kernel label Jul 16, 2014

@csabahenk csabahenk commented Jul 22, 2014

I'm experiencing this on RHEL 7, kernel 3.10.0-123.4.2.el7.x86_64.


@egasimus egasimus commented Jul 22, 2014

I've noticed the same thing happening with my VirtualBox virtual network interfaces when I'm running 3.14-rt4. It's supposed to be fixed in vanilla 3.13 or something.


@spiffytech spiffytech commented Jul 25, 2014

@egasimus Same here - I pulled in hundreds of MB of data before killing the container, then got this error.


@spiffytech spiffytech commented Jul 25, 2014

I upgraded to Debian kernel 3.14 and the problem appears to have gone away. It looks like the problem existed in some kernels < 3.5, was fixed in 3.5, regressed in 3.6, and was patched again somewhere between 3.12 and 3.14. https://bugzilla.redhat.com/show_bug.cgi?id=880394


@egasimus egasimus commented Jul 27, 2014

@spiffytech Do you have any idea where I can report this regarding the realtime kernel flavour? I think they're only releasing a RT patch for every other version, and would really hate to see 3.16-rt come out with this still broken. :/

EDIT: Filed it at kernel.org.


@ibuildthecloud ibuildthecloud commented Dec 22, 2014

I'm getting this on Ubuntu 14.10 running a 3.18.1 kernel. The kernel log shows:

Dec 21 22:49:31 inotmac kernel: [15225.866600] unregister_netdevice: waiting for lo to become free. Usage count = 2
Dec 21 22:49:40 inotmac kernel: [15235.179263] INFO: task docker:19599 blocked for more than 120 seconds.
Dec 21 22:49:40 inotmac kernel: [15235.179268]       Tainted: G           OE  3.18.1-031801-generic #201412170637
Dec 21 22:49:40 inotmac kernel: [15235.179269] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 21 22:49:40 inotmac kernel: [15235.179271] docker          D 0000000000000001     0 19599      1 0x00000000
Dec 21 22:49:40 inotmac kernel: [15235.179275]  ffff8802082abcc0 0000000000000086 ffff880235c3b700 00000000ffffffff
Dec 21 22:49:40 inotmac kernel: [15235.179277]  ffff8802082abfd8 0000000000013640 ffff8800288f2300 0000000000013640
Dec 21 22:49:40 inotmac kernel: [15235.179280]  ffff880232cf0000 ffff8801a467c600 ffffffff81f9d4b8 ffffffff81cd9c60
Dec 21 22:49:40 inotmac kernel: [15235.179282] Call Trace:
Dec 21 22:49:40 inotmac kernel: [15235.179289]  [<ffffffff817af549>] schedule+0x29/0x70
Dec 21 22:49:40 inotmac kernel: [15235.179292]  [<ffffffff817af88e>] schedule_preempt_disabled+0xe/0x10
Dec 21 22:49:40 inotmac kernel: [15235.179296]  [<ffffffff817b1545>] __mutex_lock_slowpath+0x95/0x100
Dec 21 22:49:40 inotmac kernel: [15235.179299]  [<ffffffff8168d5c9>] ? copy_net_ns+0x69/0x150
Dec 21 22:49:40 inotmac kernel: [15235.179302]  [<ffffffff817b15d3>] mutex_lock+0x23/0x37
Dec 21 22:49:40 inotmac kernel: [15235.179305]  [<ffffffff8168d5f8>] copy_net_ns+0x98/0x150
Dec 21 22:49:40 inotmac kernel: [15235.179308]  [<ffffffff810941f1>] create_new_namespaces+0x101/0x1b0
Dec 21 22:49:40 inotmac kernel: [15235.179311]  [<ffffffff8109432b>] copy_namespaces+0x8b/0xa0
Dec 21 22:49:40 inotmac kernel: [15235.179315]  [<ffffffff81073458>] copy_process.part.28+0x828/0xed0
Dec 21 22:49:40 inotmac kernel: [15235.179318]  [<ffffffff811f157f>] ? get_empty_filp+0xcf/0x1c0
Dec 21 22:49:40 inotmac kernel: [15235.179320]  [<ffffffff81073b80>] copy_process+0x80/0x90
Dec 21 22:49:40 inotmac kernel: [15235.179323]  [<ffffffff81073ca2>] do_fork+0x62/0x280
Dec 21 22:49:40 inotmac kernel: [15235.179326]  [<ffffffff8120cfc0>] ? get_unused_fd_flags+0x30/0x40
Dec 21 22:49:40 inotmac kernel: [15235.179329]  [<ffffffff8120d028>] ? __fd_install+0x58/0x70
Dec 21 22:49:40 inotmac kernel: [15235.179331]  [<ffffffff81073f46>] SyS_clone+0x16/0x20
Dec 21 22:49:40 inotmac kernel: [15235.179334]  [<ffffffff817b3ab9>] stub_clone+0x69/0x90
Dec 21 22:49:40 inotmac kernel: [15235.179336]  [<ffffffff817b376d>] ? system_call_fastpath+0x16/0x1b
Dec 21 22:49:41 inotmac kernel: [15235.950976] unregister_netdevice: waiting for lo to become free. Usage count = 2
Dec 21 22:49:51 inotmac kernel: [15246.059346] unregister_netdevice: waiting for lo to become free. Usage count = 2

I'll send docker version/info once the system isn't frozen anymore :)


@sbward sbward commented Dec 23, 2014

We're seeing this issue as well. Ubuntu 14.04, 3.13.0-37-generic


@jbalonso jbalonso commented Dec 29, 2014

On Ubuntu 14.04 server, my team has found that downgrading from 3.13.0-40-generic to 3.13.0-32-generic "resolves" the issue. Given @sbward's observation, that would put the regression after 3.13.0-32-generic and before (or including) 3.13.0-37-generic.

I'll add that, in our case, we sometimes see a negative usage count.
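
For anyone wanting to try the same downgrade, it is roughly the following (assuming the older image is still available in the Ubuntu archive):

# install the older kernel alongside the current one, then reboot into it
sudo apt-get install linux-image-3.13.0-32-generic linux-headers-3.13.0-32-generic
sudo reboot
# at the GRUB menu, pick "Advanced options for Ubuntu" and select 3.13.0-32-generic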


@rsampaio rsampaio commented Jan 15, 2015

FWIW, we hit this bug running LXC on the trusty kernel (3.13.0-40-generic #69-Ubuntu). The message appears in dmesg, followed by this stack trace:

[27211131.602869] INFO: task lxc-start:26342 blocked for more than 120 seconds.
[27211131.602874]       Not tainted 3.13.0-40-generic #69-Ubuntu
[27211131.602877] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27211131.602881] lxc-start       D 0000000000000001     0 26342      1 0x00000080
[27211131.602883]  ffff88000d001d40 0000000000000282 ffff88001aa21800 ffff88000d001fd8
[27211131.602886]  0000000000014480 0000000000014480 ffff88001aa21800 ffffffff81cdb760
[27211131.602888]  ffffffff81cdb764 ffff88001aa21800 00000000ffffffff ffffffff81cdb768
[27211131.602891] Call Trace:
[27211131.602894]  [<ffffffff81723b69>] schedule_preempt_disabled+0x29/0x70
[27211131.602897]  [<ffffffff817259d5>] __mutex_lock_slowpath+0x135/0x1b0
[27211131.602900]  [<ffffffff811a2679>] ? __kmalloc+0x1e9/0x230
[27211131.602903]  [<ffffffff81725a6f>] mutex_lock+0x1f/0x2f
[27211131.602905]  [<ffffffff8161c2c1>] copy_net_ns+0x71/0x130
[27211131.602908]  [<ffffffff8108f889>] create_new_namespaces+0xf9/0x180
[27211131.602910]  [<ffffffff8108f983>] copy_namespaces+0x73/0xa0
[27211131.602912]  [<ffffffff81065b16>] copy_process.part.26+0x9a6/0x16b0
[27211131.602915]  [<ffffffff810669f5>] do_fork+0xd5/0x340
[27211131.602917]  [<ffffffff810c8e8d>] ? call_rcu_sched+0x1d/0x20
[27211131.602919]  [<ffffffff81066ce6>] SyS_clone+0x16/0x20
[27211131.602921]  [<ffffffff81730089>] stub_clone+0x69/0x90
[27211131.602923]  [<ffffffff8172fd2d>] ? system_call_fastpath+0x1a/0x1f

@MrMMorris MrMMorris commented Mar 16, 2015

Ran into this on Ubuntu 14.04 and Debian jessie w/ kernel 3.16.x.

Docker command:

docker run -t -i -v /data/sitespeed.io:/sitespeed.io/results company/dockerfiles:sitespeed.io-latest --name "Superbrowse"

This seems like a pretty bad issue...


@MrMMorris MrMMorris commented Mar 17, 2015

@jbalonso even with 3.13.0-32-generic I get the error after only a few successful runs 😭


@rsampaio rsampaio commented Mar 17, 2015

@MrMMorris could you share a reproducer script using publicly available images?


@unclejack unclejack commented Mar 18, 2015

Everyone who's seeing this error is running a distribution kernel package that's far too old and lacks the fixes for this particular problem.

If you run into this problem, make sure you run apt-get update && apt-get dist-upgrade -y and reboot your system. If you're on Digital Ocean, you also need to select the kernel version which was just installed during the update because they don't use the latest kernel automatically (see https://digitalocean.uservoice.com/forums/136585-digitalocean/suggestions/2814988-give-option-to-use-the-droplet-s-own-bootloader).

CentOS/RHEL/Fedora/Scientific Linux users need to keep their systems updated using yum update and reboot after installing the updates.

When reporting this problem, please make sure your system is fully patched and up to date with the latest stable updates (no manually installed experimental/testing/alpha/beta/rc packages) provided by your distribution's vendor.
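
In shell terms, the suggested procedure is roughly:

# Ubuntu/Debian: pull in the latest stable kernel package, then reboot
sudo apt-get update && sudo apt-get dist-upgrade -y && sudo reboot

# CentOS/RHEL/Fedora/Scientific Linux
sudo yum update -y && sudo reboot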


@MrMMorris MrMMorris commented Mar 18, 2015

@unclejack

I ran apt-get update && apt-get dist-upgrade -y

Ubuntu 14.04, 3.13.0-46-generic

I still get the error after only one docker run.

I can create an AMI for reproducing if needed


@unclejack unclejack commented Mar 18, 2015

@MrMMorris Thank you for confirming it's still a problem with the latest kernel package on Ubuntu 14.04.


@MrMMorris MrMMorris commented Mar 18, 2015

Anything else I can do to help, let me know! 😄


@rsampaio rsampaio commented Mar 18, 2015

@MrMMorris if you can provide a reproducer, there is a bug open for Ubuntu and it would be much appreciated: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1403152


@MrMMorris MrMMorris commented Mar 18, 2015

@rsampaio if I have time today, I will definitely get that for you!


@fxposter fxposter commented Mar 23, 2015

This problem also appears on 3.16(.7) on both Debian 7 and Debian 8: #9605 (comment). Rebooting the server is the only way to fix this for now.


@chrisjstevenson chrisjstevenson commented Apr 27, 2015

Seeing this issue on RHEL 6.6 with kernel 2.6.32-504.8.1.el6.x86_64 when starting some docker containers (not all containers)
kernel:unregister_netdevice: waiting for lo to become free. Usage count = -1

Again, rebooting the server seems to be the only solution at this time


@popsikle popsikle commented May 12, 2015

Also seeing this on CoreOS (647.0.0) with kernel 3.19.3.

Rebooting is also the only solution I have found.


@fxposter fxposter commented May 20, 2015

Tested Debian jessie with sid's kernel (4.0.2) - the problem remains.


@popsikle popsikle commented Jun 19, 2015

Anyone seeing this issue running non-ubuntu containers?


@fxposter fxposter commented Jun 19, 2015

Yes. Debian ones.
On June 19, 2015 at 19:01, "popsikle" (notifications@github.com) wrote:

Anyone seeing this issue running non-ubuntu containers?




@commixon commixon commented May 22, 2019

Thanks a lot @pmoust. I'll try them out. In any case I'll also try to patch kpatch to work with Debian (as a side project) and post updates here for anyone interested.


@dElogics dElogics commented May 23, 2019

@ethercflow @2rs2ts we are also running Debian. I have encountered a lot of issues trying to get kpatch-build to work. If I manage to find a workaround I'll keep you posted. In any case, does anybody have any other solution? Is it kernel version 4.15 or 4.19 that mitigates the problem? I have been trying to find the answer for the past week and still have not managed to.

You may upgrade to 4.19. It's in the backports.
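
A rough sketch of that upgrade path on Debian stretch (the backports repository line and the linux-image-amd64 metapackage are the usual choices; adjust for your setup):

# enable stretch-backports and install the newer kernel from it
echo "deb http://deb.debian.org/debian stretch-backports main" | sudo tee /etc/apt/sources.list.d/stretch-backports.list
sudo apt-get update
sudo apt-get -t stretch-backports install -y linux-image-amd64
sudo reboot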

BTW it's been a year for us here. ;)


@2rs2ts 2rs2ts commented May 24, 2019

We actually tried the 4.19 from backports, but it had some major regressions in other areas (the EC2 instances would just randomly reboot and then networking would be broken on startup). I guess we'll have to deal with this until the next stable.


@commixon commixon commented May 27, 2019

@2rs2ts For the past 4 days we have been using 4.19 from backports (on EC2) and have not seen any problems at all. The kernel crash issue has not appeared and everything else seems fine as well. I don't believe it makes any difference, but we based our Debian image on the one provided by kops (https://github.com/kubernetes/kops/blob/master/docs/images.md#debian). We updated the kernel in this image rather than the stock Debian one.


@goudai goudai commented May 31, 2019

Friends, I have been running the 4.19 kernel stably for half a year. I hope you can enjoy that stability as well.


@yaoshenshen yaoshenshen commented Jun 5, 2019

I have a container with ports 80 and 443 open. Roughly every two weeks, access to the container's ports 80 and 443 from another computer starts being denied.

CentOS 7.3, kernel version:
Linux browser1 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[root@browser1 ~]# docker version
Client:
Version: 18.06.3-ce
API version: 1.38
Go version: go1.10.4
Git commit: d7080c1
Built: Wed Feb 20 02:24:22 2019
OS/Arch: linux/amd64
Experimental: false

Server:
Engine:
Version: 18.06.3-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: d7080c1
Built: Wed Feb 20 02:25:33 2019
OS/Arch: linux/amd64
Experimental: false
[root@browser1 ~]#

dmesg:

[1063959.636785] unregister_netdevice: waiting for lo to become free. Usage count = 1
[1071340.887512] br-af29e1edc1b8: port 5(vethc2ac4f8) entered disabled state
[1071340.891753] br-af29e1edc1b8: port 5(vethc2ac4f8) entered disabled state
[1071340.895118] device vethc2ac4f8 left promiscuous mode
[1071340.895138] br-af29e1edc1b8: port 5(vethc2ac4f8) entered disabled state
[1071340.990505] device veth5e4f161 entered promiscuous mode
[1071340.990897] IPv6: ADDRCONF(NETDEV_UP): veth5e4f161: link is not ready
[1071340.990904] br-af29e1edc1b8: port 5(veth5e4f161) entered forwarding state
[1071340.990924] br-af29e1edc1b8: port 5(veth5e4f161) entered forwarding state
[1071341.231405] IPv6: ADDRCONF(NETDEV_CHANGE): veth5e4f161: link becomes ready
[1071355.991701] br-af29e1edc1b8: port 5(veth5e4f161) entered forwarding state
[1071551.533907] br-af29e1edc1b8: port 5(veth5e4f161) entered disabled state
[1071551.537564] br-af29e1edc1b8: port 5(veth5e4f161) entered disabled state
[1071551.540295] device veth5e4f161 left promiscuous mode
[1071551.540313] br-af29e1edc1b8: port 5(veth5e4f161) entered disabled state
[1071551.570924] device veth8fd3a0a entered promiscuous mode
[1071551.571550] IPv6: ADDRCONF(NETDEV_UP): veth8fd3a0a: link is not ready
[1071551.571556] br-af29e1edc1b8: port 5(veth8fd3a0a) entered forwarding state
[1071551.571582] br-af29e1edc1b8: port 5(veth8fd3a0a) entered forwarding state
[1071551.841656] IPv6: ADDRCONF(NETDEV_CHANGE): veth8fd3a0a: link becomes ready
[1071566.613998] br-af29e1edc1b8: port 5(veth8fd3a0a) entered forwarding state
[1071923.465082] br-af29e1edc1b8: port 5(veth8fd3a0a) entered disabled state
[1071923.470215] br-af29e1edc1b8: port 5(veth8fd3a0a) entered disabled state
[1071923.472888] device veth8fd3a0a left promiscuous mode
[1071923.472904] br-af29e1edc1b8: port 5(veth8fd3a0a) entered disabled state
[1071923.505580] device veth9e693ae entered promiscuous mode
[1071923.505919] IPv6: ADDRCONF(NETDEV_UP): veth9e693ae: link is not ready
[1071923.505925] br-af29e1edc1b8: port 5(veth9e693ae) entered forwarding state
[1071923.505944] br-af29e1edc1b8: port 5(veth9e693ae) entered forwarding state
[1071923.781658] IPv6: ADDRCONF(NETDEV_CHANGE): veth9e693ae: link becomes ready
[1071938.515044] br-af29e1edc1b8: port 5(veth9e693ae) entered forwarding state


@peidong-hu peidong-hu commented Jul 16, 2019

Anyone saw this bug with 4.19?

Yes. I have the issue on kernel 4.19.4-1.e17.elrep.x86_64


@fsadykov fsadykov commented Jul 19, 2019

Hello,

I am also seeing this error. Do we have any solution for this issue? Kernel 3.10.0-514.26.2.el7.x86_64

[username@ip-10-1-4-64 ~]$
Message from syslogd@ip-10-1-4-64 at Jul 19 10:50:01 ...
 kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

Message from syslogd@ip-10-1-4-64 at Jul 19 10:50:48 ...
 kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

@marc-gist marc-gist commented Jul 21, 2019

This issue is still happening :( No updates/ideas on how to fix it?


@NIXKnight NIXKnight commented Sep 1, 2019

Happening on Debian Stretch. I was trying to update my Jenkins container via Ansible when this happened.


@Aleishus Aleishus commented Sep 18, 2019

This issue has been solved by this commit:
torvalds/linux@ee60ad2
Using kpatch:

# fetch the backported route.c fix and build a live patch module from it
curl -SOL https://raw.githubusercontent.com/Aleishus/kdt/master/kpatchs/route.patch
kpatch-build -t vmlinux route.patch
# install the module for the running kernel (UNAME is assumed to be the kernel release, i.e. $(uname -r))
mkdir -p /var/lib/kpatch/${UNAME}
cp -a livepatch-route.ko /var/lib/kpatch/${UNAME}
# reload kpatch and confirm the live patch is listed
systemctl restart kpatch
kpatch list

@dElogics dElogics commented Oct 7, 2019

This issue has been solved by this commit :
torvalds/linux@ee60ad2
Using kpatch

curl -SOL https://raw.githubusercontent.com/Aleishus/kdt/master/kpatchs/route.patch
kpatch-build -t vmlinux route.patch 
mkdir -p /var/lib/kpatch/${UNAME} 
cp -a livepatch-route.ko /var/lib/kpatch/${UNAME}
systemctl restart kpatch
kpatch list

This fix must be in 4.19.30 onwards.


@pmoust pmoust commented Oct 7, 2019

I am not sure torvalds/linux@ee60ad2 is the definitive fix for it - we've seen this on 4.4.0 AFAIR, whereas torvalds/linux@deed49d was only added in 4.5.0.


@w-simon w-simon commented Nov 4, 2019

We've reproduced the same bug using a diagnostic kernel that had delays artificially inserted to make PMTU discovery exception routes hit this window.

  1. Debugging kernel patch:
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a0163c5..6b9e7ee 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -133,6 +133,8 @@

 static int ip_min_valid_pmtu __read_mostly	= IPV4_MIN_MTU;

+static int ref_leak_test;
+
/*
  *	Interface to generic destination cache.
  */
@@ -1599,6 +1601,9 @@ static void ip_del_fnhe(struct fib_nh *nh, __be32 daddr)
 	fnhe = rcu_dereference_protected(*fnhe_p, lockdep_is_held(&fnhe_lock));
	while (fnhe) {
 		if (fnhe->fnhe_daddr == daddr) {
+			if (ref_leak_test)
+				pr_info("XXX pid: %d, %s: fib_nh:%p, fnhe:%p, daddr:%x\n",
+					current->pid,  __func__, nh, fnhe, daddr);
 			rcu_assign_pointer(*fnhe_p, rcu_dereference_protected(
 				fnhe->fnhe_next, lockdep_is_held(&fnhe_lock)));
 			fnhe_flush_routes(fnhe);
@@ -2145,10 +2150,14 @@ static struct rtable *__mkroute_output(const struct fib_result *res,

		fnhe = find_exception(nh, fl4->daddr);
 		if (fnhe) {
+			if (ref_leak_test)
+				pr_info("XXX pid: %d, found fnhe :%p\n", current->pid, fnhe);
 			prth = &fnhe->fnhe_rth_output;
 			rth = rcu_dereference(*prth);
 			if (rth && rth->dst.expires &&
 			    time_after(jiffies, rth->dst.expires)) {
+				if (ref_leak_test)
+					pr_info("eXX pid: %d, del fnhe :%p\n", current->pid, fnhe);
				ip_del_fnhe(nh, fl4->daddr);
 				fnhe = NULL;
 			} else {
@@ -2204,6 +2213,14 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 #endif
 	}

+	if (fnhe && ref_leak_test) {
+		unsigned long  time_out;
+
+		time_out = jiffies + ref_leak_test;
+		while (time_before(jiffies, time_out))
+			cpu_relax();
+		pr_info("XXX pid: %d, reuse fnhe :%p\n", current->pid, fnhe);
+	}
 	rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
 	if (lwtunnel_output_redirect(rth->dst.lwtstate))
 		rth->dst.output = lwtunnel_output;
@@ -2733,6 +2750,13 @@ static int ipv4_sysctl_rtcache_flush(struct ctl_table *__ctl, int write,
		.proc_handler	= proc_dointvec,
	},
 	{
+		.procname	= "ref_leak_test",
+		.data		= &ref_leak_test,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
		.procname	= "max_size",
		.data		= &ip_rt_max_size,
 		.maxlen		= sizeof(int),
  2. User mode script:

ref_leak_test_begin.sh:

#!/bin/bash

# constructing a basic network with netns
# client <-->gateway <--> server
ip netns add svr
ip netns add gw
ip netns add cli

ip netns exec gw sysctl net.ipv4.ip_forward=1

ip link add svr-veth type veth peer name svrgw-veth
ip link add cli-veth type veth peer name cligw-veth

ip link set svr-veth netns svr
ip link set svrgw-veth netns gw
ip link set cligw-veth netns gw
ip link set cli-veth netns cli

ip netns exec svr ifconfig svr-veth 192.168.123.1
ip netns exec gw ifconfig svrgw-veth 192.168.123.254
ip netns exec gw ifconfig cligw-veth 10.0.123.254
ip netns exec cli ifconfig cli-veth 10.0.123.1

ip netns exec cli route add default gw 10.0.123.254
ip netns exec svr route add default gw 192.168.123.254

# constructing a concurrent-access scenario with netperf
nohup ip netns exec svr  netserver -L 192.168.123.1

nohup ip netns exec cli  netperf -H 192.168.123.1 -l 300 &
nohup ip netns exec cli  netperf -H 192.168.123.1 -l 300 &
nohup ip netns exec cli  netperf -H 192.168.123.1 -l 300 &
nohup ip netns exec cli  netperf -H 192.168.123.1 -l 300 &

# Add delay
echo 3000 > /proc/sys/net/ipv4/route/ref_leak_test

# making PMTU discovery exception routes
echo 1 >  /proc/sys/net/ipv4/route/mtu_expires
for((i=1;i<=60;i++));
do
  for j in 1400  1300 1100 1000
  do
	echo "set mtu to "$j;
	ip netns exec svr ifconfig  svr-veth  mtu $j;
	ip netns exec cli ifconfig  cli-veth  mtu $j;
	ip netns exec gw ifconfig svrgw-veth  mtu $j;
	ip netns exec gw ifconfig cligw-veth  mtu $j;
	sleep 2;
  done
done

ref_leak_test_end.sh:

#!/bin/bash

echo 0 > /proc/sys/net/ipv4/route/ref_leak_test

pkill netserver
pkill netperf

ip netns exec cli ifconfig cli-veth down
ip netns exec gw ifconfig svrgw-veth down
ip netns exec gw ifconfig cligw-veth down
ip netns exec svr ifconfig svr-veth down

ip netns del svr
ip netns del gw
ip netns del cli

The test process:

  • first load the debug kernel,
  • then run ref_leak_test_begin.sh,
  • wait a few seconds, run ref_leak_test_end.sh,
  • and finally you can observe the error.
[root@iZuf6h1kfgutxc3el68z2lZ test]# bash ref_leak_test_begin.sh
net.ipv4.ip_forward = 1
nohup: ignoring input and appending output to ‘nohup.out’
nohup: set mtu to 1400
appending output to ‘nohup.out’
nohup: appending output to ‘nohup.out’
nohup: appending output to ‘nohup.out’
nohup: appending output to ‘nohup.out’
set mtu to 1300
set mtu to 1100
set mtu to 1000
set mtu to 1400
set mtu to 1300
set mtu to 1100
^C
[root@iZuf6h1kfgutxc3el68z2lZ test]# bash ref_leak_test_end.sh
[root@iZuf6h1kfgutxc3el68z2lZ test]#
Message from syslogd@iZuf6h1kfgutxc3el68z2lZ at Nov  4 20:29:43 ...
 kernel:unregister_netdevice: waiting for cli-veth to become free. Usage count = 1

After some testing, torvalds/linux@ee60ad2 can indeed fix this bug.


@GuomeiQtl GuomeiQtl commented Feb 20, 2020

Anyone saw this bug with 4.19?

same


@steelcowboy steelcowboy commented Feb 23, 2020

Yes, on Debian! Is there any way to suppress it?


@steelcowboy steelcowboy commented Mar 21, 2020

Found out my Docker logs are also being spammed. Kernel 5.4.0, Docker 19.03.8:

Mar 21 18:46:14 host.mysite.com dockerd[16544]: time="2020-03-21T18:46:14.127275161Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 21 18:45:13 host.mysite.com dockerd[16544]: time="2020-03-21T18:45:13.642050333Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 21 18:44:13 host.mysite.com dockerd[16544]: time="2020-03-21T18:44:13.161364216Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 21 18:43:12 host.mysite.com dockerd[16544]: time="2020-03-21T18:43:12.714725302Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

@steelcowboy steelcowboy commented Mar 27, 2020

I finally found out how to suppress these messages btw. From this question on StackExchange, I commented out this line in /etc/rsyslog.conf:

# Everybody gets emergency messages
#*.emerg                    :omusrmsg:*

Very nuclear option, but at least now my system is usable again!


@SuperSandro2000 SuperSandro2000 commented Sep 6, 2020

@steelcowboy You can configure rsyslog to discard only those annoying messages instead of all emergency messages, which is more desirable.

I wrote the following into /etc/rsyslog.d/40-unreigster-netdevice.conf and restarted rsyslog with systemctl restart rsyslog.

# match frequent, irrelevant emergency messages generated by Docker when transferring large amounts of data through the network
:msg,contains,"unregister_netdevice: waiting for lo to become free. Usage count = 1" /dev/null

# discard matching messages
& stop
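
To check the rule is active, you can inject a matching line by hand (assuming logger from util-linux is available) and confirm it no longer shows up in syslog or on your terminals:

# emit a fake emergency message that matches the filter above;
# with the rule in place it is discarded instead of being broadcast
logger -p user.emerg "unregister_netdevice: waiting for lo to become free. Usage count = 1"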

@hedza06 hedza06 commented Oct 20, 2020

Any news here?


@StruggleYang StruggleYang commented Dec 29, 2020

kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

Any news here?


@w-simon w-simon commented Dec 29, 2020

kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

Any news here?

This patch has fixed this problem:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ee60ad219f5c7c4fb2f047f88037770063ef785f

We have analyzed it at the following link, where the problem can also be reproduced:
https://github.com/w-simon/kernel_debug_notes/blob/master/reproduce_a_leaked_dst_entry


@Xyaren Xyaren commented Feb 5, 2021

This is still happening for me on the Ubuntu 5.8.0-41.46-generic kernel (5.8.18).
