Kernel panic during build #2960

Closed
Tranquility opened this Issue Nov 29, 2013 · 53 comments

Tranquility (Contributor) commented Nov 29, 2013

I have had at least two kernel panics with stable Docker versions (0.6.7, 0.7). I am running kernel 3.12.0 with genpatches and AUFS patches.

[screenshot attached: img_20131128_104318]

tianon (Member) commented Nov 30, 2013

I'll add again here that I've seen this same panic on three different machines: two AUFS and one device-mapper. I was running 3.11 on one of the AUFS machines, and 3.10.17 on the other two (but curiously enough I didn't see it on 3.10.7 - perhaps that was just good luck, because it doesn't seem to be consistent). I'm configuring one of my machines for kdump right now, so hopefully I'll get a kdump next time.

alexlarsson (Contributor) commented Dec 2, 2013

I see this regularly too on container exit:
https://bugzilla.redhat.com/show_bug.cgi?id=1015989

crosbymichael (Contributor) commented Dec 14, 2013

@alexlarsson Do you have any insights on this issue?

alexlarsson (Contributor) commented Dec 16, 2013

@crosbymichael There is a potential fix linked from the Red Hat bug above; I have not had time to try it, though.

alexlarsson (Contributor) commented Dec 17, 2013

I tried to reproduce this, but it's kind of hard. I wrote a script that launches lots of containers, but on a freshly booted machine it completely fails to trigger this. However, on my devel workstation, which had been running for a few days, it triggered almost instantly when I ran the script. So it seems to be a combination of a namespace exit and something else.
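
A minimal container-churn loop along those lines might look like the following (a sketch only, not the actual script; the busybox image and iteration count are arbitrary placeholders):

#!/bin/bash
# Sketch: cycle through many short-lived containers so that each exit
# tears down a network namespace and its conntrack state.
ITERATIONS=${1:-1000}

for i in $(seq "$ITERATIONS"); do
    id=$(docker run -d busybox true)   # start a container that exits immediately
    docker wait "$id" > /dev/null      # wait for it to finish
    docker rm "$id" > /dev/null        # removing it completes the namespace cleanup
    echo "cycled container $i"
done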

mschulkind commented Jan 23, 2014

I'm having a similar issue, although I'm not totally sure if it's the same crash, since I'm not sure how to capture the oops. Nothing shows up in the logs after reboot, and kdump looks pretty scary to try to set up.

I'm on Gentoo, using device-mapper, and running the 3.12.7-gentoo kernel, which already has the fix from the linked Red Hat bug. I regularly get the crash when running a 'docker build' command.

alexlarsson (Contributor) commented Jan 30, 2014

https://bugzilla.redhat.com/show_bug.cgi?id=1015989#c18 mentions a possible fix that is in 3.13-rc1, so it would be interesting to know if anyone sees this on 3.13.

mschulkind commented Jan 31, 2014

Unfortunately that doesn't seem to have helped much. I'm on 3.13.1-gentoo, which definitely has that fix, and just experienced a crash.

pnasrat (Contributor) commented Feb 28, 2014

@mschulkind do you have a capture of the panic you got?

mschulkind commented Mar 2, 2014

I don't. I'm not totally sure how to capture it when the panic happens while X is running. I can try to reproduce without X running, though.

renato-zannon (Contributor) commented Mar 27, 2014

I'm hitting this on 3.13.7, on Arch Linux, btrfs + native. I've hit this 3 times only today, a true flow killer :(

It's a bit unpredictable though. I haven't been able to correlate with high load, or low memory, or with any particular build step in my app.

Are there any particular bits of information from the trace that I should be looking to gather the next time this happens? Should I look into how to enable kdump? Is this the correct place to ask for help with this? :)

alexlarsson (Contributor) commented Mar 27, 2014

I'm not really a kernel guy, so I don't know how to debug this further. I know one thing though: if we could figure out a way to reproduce it more easily, I could get the right people to look at it.

However, I've had a hard time reproducing this. It seems to be triggered on container exit, so I created some scripts that just spawned lots of containers. However, the script could run for thousands of containers on my newly booted laptop without any sign of the crash. Then I ran it on my desktop, which had been running for at least a week doing a lot of random stuff. It crashed in less than 100 container exits. Then I rebooted and ran it again, but was unable to get any crash...

So, it is triggered by container exits in combination with something else, and we need to figure out what. It seems so unpredictable though... The only thing I've seen is that it seems more likely to happen if I'm running something playing audio in the background.

alexlarsson (Contributor) commented Mar 27, 2014

I asked in the docker meeting today for people who have seen this, and a bunch of people had never seen it and some had. One thing that seemed to be consistent with not seeing the panic is running the kernel in a VM. So, maybe this only triggers on bare metal.

renato-zannon (Contributor) commented Mar 27, 2014

For those on Arch who are hitting this and want a workaround: I'm now using the linux-lts kernel from the official repo (3.10.34-1-lts), and haven't bumped into this issue yet.

Interesting point about the virtualization... It might have to do with some code path that is not taken via the virtualization drivers, or that the virtualization overhead doesn't allow the same levels of concurrency.

This weekend I will try to come up with some way to reproduce this consistently.
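
For reference, switching to the LTS kernel on Arch is roughly the following (a sketch; it assumes a GRUB setup and that the LTS kernel suits your hardware):

# Install the LTS kernel and headers alongside the regular kernel.
sudo pacman -S linux-lts linux-lts-headers

# Regenerate the GRUB config so the LTS entry shows up, then reboot into it.
sudo grub-mkconfig -o /boot/grub/grub.cfg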

thaJeztah (Member) commented Mar 27, 2014

Maybe virtualisation itself is not the common factor, but people may be shutting down their virtual machines more often? VMs can be quite hungry on resources and if I don't need them, I'll shut them down. As @alexlarsson mentioned, the problem occurred quite soon on a computer that has been running for a longer period.

rohansingh commented Apr 1, 2014

I saw some commits in 3.14 that I was hoping would resolve this. Unfortunately, after upgrading a machine to 3.14 to test that hypothesis, it seems like that's not the case. Still seeing the same race condition.

renato-zannon (Contributor) commented Apr 1, 2014

@rohansingh Have you been able to reproduce this consistently? I'm not seeing this behavior anymore (with the same 3.13.7 from before), even while actively trying to trigger it.

rohansingh commented Apr 1, 2014

@riccieri Fairly consistently. To clarify, I haven't been actively trying to repro. Instead I'm just monitoring a machine that other users are using to do builds, largely during business hours. Previously it was running 3.13.0, and is now at 3.14.0.

I see the machine reboot due to this issue every few hours, and so far three times today since upgrading to 3.14.

eandre commented Apr 3, 2014

To build on @rohansingh's comment, this happens on both VMs and real hardware for us, and quite consistently on both types of machines.

jpoimboe (Contributor) commented Apr 3, 2014

This is a hard panic to debug because it happens in the nf_conntrack destroy path. Can anybody recreate with kdump enabled and provide a kdump? That would probably improve our chances of fixing it.
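
For anyone who hasn't set kdump up before, the rough procedure on an Ubuntu-style system is along these lines (a hedged sketch; package names, the crashkernel= sizing and dump locations vary by distribution):

# Install the crash-dump tooling (pulls in kdump-tools and kexec-tools).
sudo apt-get install linux-crashdump

# Reserve memory for the capture kernel by adding crashkernel= to the boot
# command line, e.g. in /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="... crashkernel=384M-:128M"
sudo update-grub
sudo reboot

# After the reboot, confirm that a crash kernel is loaded.
cat /sys/kernel/kexec_crash_loaded   # should print 1

# On the next panic the dump should land under /var/crash/, ready to be
# examined with the 'crash' utility and matching kernel debug symbols.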

rohansingh commented Apr 3, 2014

@jpoimboe Unfortunately the machine on which we see this happening pretty often is a virtual machine in EC2 running under PV-GRUB, so a kdump is not possible. I'll work with @eandre on seeing if we can reproduce this on a machine where we can kdump.

renato-zannon (Contributor) commented Apr 3, 2014

I will look into enabling kdump on my dev machine (which is where I was getting the panic before), so that if I'm lucky (!!!) enough to stumble on the crash again, I'll be able to report back with more info.

joelmoss commented Apr 8, 2014

OK, so we also get kernel panics using anything later than Docker 0.7.2. I just tested 0.9.1 and still get the panics; 0.7.2 is fine.

Is anyone on the docker team able to verify this please?

jamtur01 (Contributor) commented Apr 8, 2014

@joelmoss Can you elaborate please? Does the panic occur on build or container exit or elsewhere? Also what platform and kernel release are you running? Thanks!

joelmoss commented Apr 8, 2014

The problem is that we have been unable to pin down when exactly it happens, so I couldn't say what action causes the panic.

We are on Ubuntu 13.10 (GNU/Linux 3.11.0-15-generic x86_64)

unclejack (Contributor) commented Apr 8, 2014

@joelmoss Please update the system and the kernel. You are likely running into a kernel bug. Ubuntu has updated packages for the 3.11.0 kernel, and you can get them by installing updates.

jamtur01 (Contributor) commented Apr 8, 2014

Thanks @joelmoss - Can you capture any output with kdump?

rohansingh commented Apr 8, 2014

@unclejack

Ubuntu has updated packages for the 3.11.0 kernel and you can install them by installing updates.

We've seen this on the 3.13 and 3.14 kernels provided by Ubuntu as well, so if @joelmoss is hitting the same issue, upgrading is unlikely to help.

@joelmoss Even if you don't have a kdump output, do you have the stacktrace from the system log so that we can verify that it's the same nf_conntrack issue?
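
If the oops made it to disk before the machine went down, something along these lines may dig it out (a sketch; a hard panic often never reaches the log files, in which case a serial console or netconsole is the usual fallback):

# Search the persisted kernel logs from previous boots for the conntrack oops.
grep -B5 -A40 "nf_nat_cleanup_conntrack" /var/log/kern.log /var/log/kern.log.1

# On systems with persistent systemd journaling, the previous boot's kernel
# messages can be read directly.
journalctl -k -b -1 | grep -B5 -A40 "nf_nat_cleanup_conntrack"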

unclejack (Contributor) commented Apr 8, 2014

@rohansingh I haven't said that you're not encountering the issue on 3.13 and 3.14. I've only said that upgrading to the latest 3.11 kernel packages and keeping the system up to date is a good idea. I'm using the latest 3.11 on some systems and I'm not running into this particular problem, that's why I've recommended it.

rohansingh commented Apr 8, 2014

By the way, here is a text version of a similar stacktrace to complement the screenshot above:

[16314069.877834] BUG: unable to handle kernel paging request at ffffc900029fdb58
[16314069.877857] IP: [<ffffffffa0289200>] nf_nat_cleanup_conntrack+0x40/0x70 [nf_nat]
[16314069.877870] PGD 1b6426067 PUD 1b6427067 PMD 1019e5067 PTE 0
[16314069.877879] Oops: 0002 [#1] SMP 
[16314069.877886] Modules linked in: nf_conntrack_netlink nfnetlink veth xt_addrtype xt_conntrack xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_tcpudp iptable_filter ip_tables x_tables nfsd auth_rpcgss nfs_acl aufs nfs lockd sunrpc bridge fscache 8021q garp stp mrp llc intel_rapl crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack xen_kbdfront xen_fbfront syscopyarea sysfillrect sysimgblt fb_sys_fops raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq [last unloaded: ipmi_devintf]
[16314069.877968] CPU: 2 PID: 97 Comm: kworker/u16:1 Not tainted 3.13.0-18-generic #38-Ubuntu
[16314069.877982] Workqueue: netns cleanup_net
[16314069.877987] task: ffff8801affd17f0 ti: ffff8801affc4000 task.ti: ffff8801affc4000
[16314069.877994] RIP: e030:[<ffffffffa0289200>]  [<ffffffffa0289200>] nf_nat_cleanup_conntrack+0x40/0x70 [nf_nat]
[16314069.878005] RSP: e02b:ffff8801affc5cb8  EFLAGS: 00010246
[16314069.878010] RAX: 0000000000000000 RBX: ffff880004a5ce08 RCX: ffff8800b5a8e988
[16314069.878016] RDX: ffffc900029fdb58 RSI: 0000000037d437d2 RDI: ffffffffa028c4c0
[16314069.878022] RBP: ffff8801affc5cc0 R08: 0000000000000200 R09: 0000000000000000
[16314069.878029] R10: ffffea0005150940 R11: ffffffff812247fd R12: ffff880004a5cd80
[16314069.878035] R13: ffff8800d24e6750 R14: ffff8800d24e6758 R15: ffff8800b5a8e000
[16314069.878046] FS:  00007f8a03029700(0000) GS:ffff8801bec80000(0000) knlGS:0000000000000000
[16314069.878052] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[16314069.878057] CR2: ffffc900029fdb58 CR3: 00000000d702a000 CR4: 0000000000002660
[16314069.878064] Stack:
[16314069.878068]  0000000000000001 ffff8801affc5ce8 ffffffffa00995a4 ffff8800d24e6750
[16314069.878078]  ffff8800b5a8e000 ffffffffa007e2c0 ffff8801affc5d08 ffffffffa00912d5
[16314069.878088]  ffff8800d24e6750 ffff8800b5a8e000 ffff8801affc5d28 ffffffffa00927b4
[16314069.878097] Call Trace:
[16314069.878112]  [<ffffffffa00995a4>] __nf_ct_ext_destroy+0x44/0x60 [nf_conntrack]
[16314069.878125]  [<ffffffffa00912d5>] nf_conntrack_free+0x25/0x60 [nf_conntrack]
[16314069.878136]  [<ffffffffa00927b4>] destroy_conntrack+0xb4/0x110 [nf_conntrack]
[16314069.878149]  [<ffffffffa0096260>] ? nf_conntrack_helper_fini+0x30/0x30 [nf_conntrack]
[16314069.878159]  [<ffffffff81645767>] nf_conntrack_destroy+0x17/0x20
[16314069.878170]  [<ffffffffa009223b>] nf_ct_iterate_cleanup+0x12b/0x150 [nf_conntrack]
[16314069.878183]  [<ffffffffa009653d>] nf_ct_l3proto_pernet_unregister+0x1d/0x20 [nf_conntrack]
[16314069.878194]  [<ffffffffa007c309>] ipv4_net_exit+0x19/0x50 [nf_conntrack_ipv4]
[16314069.878202]  [<ffffffff8160e549>] ops_exit_list.isra.1+0x39/0x60
[16314069.878210]  [<ffffffff8160edd0>] cleanup_net+0x110/0x250
[16314069.878221]  [<ffffffff810824a2>] process_one_work+0x182/0x450
[16314069.878228]  [<ffffffff81083241>] worker_thread+0x121/0x410
[16314069.878235]  [<ffffffff81083120>] ? rescuer_thread+0x3e0/0x3e0
[16314069.878243]  [<ffffffff81089ed2>] kthread+0xd2/0xf0
[16314069.878249]  [<ffffffff81089e00>] ? kthread_create_on_node+0x190/0x190
[16314069.878258]  [<ffffffff817219bc>] ret_from_fork+0x7c/0xb0
[16314069.878264]  [<ffffffff81089e00>] ? kthread_create_on_node+0x190/0x190
[16314069.878269] Code: 53 0f b6 58 11 84 db 74 45 48 01 c3 74 40 48 83 7b 10 00 74 39 48 c7 c7 c0 c4 28 a0 e8 3a fe 48 e1 48 8b 03 48 8b 53 08 48 85 c0 <48> 89 02 74 04 48 89 50 08 48 b8 00 02 20 00 00 00 ad de 48 c7 
[16314069.878332] RIP  [<ffffffffa0289200>] nf_nat_cleanup_conntrack+0x40/0x70 [nf_nat]
[16314069.878341]  RSP <ffff8801affc5cb8>
[16314069.878345] CR2: ffffc900029fdb58
[16314069.878353] ---[ end trace 98cfb73f60c69903 ]---

unclejack (Contributor) commented Apr 10, 2014

@rohansingh Could you provide more details about the host where you can reproduce this?
Knowing the particular network setup (VPN, openvswitch, bridges, any network hardware offloading engines, etc) and some approximate steps you've taken to reproduce that would help.

I couldn't reproduce this so far, so I think getting the system into the right state to make it panic during a build is related to some sequence of events which isn't very common.

If you have a sequence of steps you follow to get it to crash, could you let us know how to reproduce this, please?
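
For anyone unsure what to include, a quick snapshot along these lines covers most of what is being asked for (a sketch; adjust to whichever tools are installed):

# Kernel, distribution and Docker versions.
uname -a
docker version
docker info

# Bridges, interfaces and NAT rules in play.
ip link show
brctl show
iptables -t nat -L -n -v

# Conntrack/NAT modules that are loaded.
lsmod | grep -E 'nf_conntrack|nf_nat'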

rohansingh commented Apr 11, 2014

@unclejack In terms of hardware and network setup, this is a paravirtual machine on EC2.

The general procedure we have for reproducing this is to kick off a build process that starts 16 parallel containers to run various integration tests. The issue occurs intermittently, around two minutes after the containers are stopped.

Unfortunately the situation isn't that great in terms of reproducibility, in that it's tied up with a bunch of internal code and build tools. Right now I'm trying to simplify that down to a simple script for reproducing the issue, which I hope to finish and be able to provide in the next couple days.
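
A stripped-down version of that procedure, with the internal build tooling replaced by a placeholder workload, might look like this (a sketch; the busybox image and sleep workload stand in for the real integration-test containers):

#!/bin/bash
# Start a batch of containers in parallel, stop them all, then wait out the
# deferred netns/conntrack cleanup that follows a couple of minutes later.
N=16
ids=""

for i in $(seq "$N"); do
    ids="$ids $(docker run -d busybox sleep 600)"
done

docker stop $ids
docker rm $ids

# The reported panics show up roughly two minutes after the containers are
# stopped, so keep the machine idle and watch the console for a while.
sleep 180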

konobi commented Apr 23, 2014

I've been seeing this error too, outside of Docker, with plain ol' LXC.

So far it seems to be a combination of SMP, LXC and using NAT over a bridge(?).

I think I have an idea of what's going on, but due to local hardware issues, I'm unable to get a kernel dump. Does someone have a recent one around that I can take a look at?

unclejack (Contributor) commented Apr 23, 2014

@rohansingh Did you make progress with building something to be used to reproduce this problem?

rohansingh commented Apr 24, 2014

@unclejack Negative. Currently unable to reproduce outside of a specific set of EC2 instances, and not for lack of effort.

rohansingh commented May 9, 2014

I'm now able to consistently reproduce this issue on physical hardware and produce a kernel crash dump by running part of a build process for a non-public project. The next step is to isolate exactly what we're doing in that project that causes this and produce a shareable crash dump that doesn't contain proprietary data.

yosifkit (Contributor) commented May 22, 2014

This happened right when I did a docker kill on a container that was created during docker build (apt-get install specifically).

$ docker version
Client version: 0.11.1
Client API version: 1.11
Go version (client): go1.2
Git commit (client): fb99f99
Server version: 0.11.1
Server API version: 1.11
Git commit (server): fb99f99
Go version (server): go1.2
Last stable version: 0.11.1
$ docker info
Containers: 3
Images: 29
Storage Driver: devicemapper
 Pool Name: docker-8:19-19268241-pool
 Data file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata file: /var/lib/docker/devicemapper/devicemapper/metadata
 Data Space Used: 3165.4 Mb
 Data Space Total: 102400.0 Mb
 Metadata Space Used: 3.0 Mb
 Metadata Space Total: 2048.0 Mb
Execution Driver: native-0.2
Kernel Version: 3.12.13-gentoo
$ uname -a
Linux minas-morgul 3.12.13-gentoo #2 SMP Mon May 12 10:07:16 MDT 2014 x86_64 AMD Phenom(tm) II X6 1090T Processor AuthenticAMD GNU/Linux

gdm85 (Contributor) commented May 26, 2014

@rohansingh any progress on your efforts to isolate the root cause?

unclejack added the kernel label May 28, 2014

gdm85 (Contributor) commented Jun 3, 2014

I am still getting this crash:

[screenshot attached: crash]

The host is a Xen VM as far as I know, and this did NOT happen during a build...

Any ideas how to fix this? It's happening with Ubuntu 14, I would like to know which patches we need to push upstream for a fix.

Update: I think this might be the upstream kernel bug: https://bugzilla.kernel.org/show_bug.cgi?id=65191

Other trackers:

There is no fix yet apparently :(

wwadge commented Jun 3, 2014

I had this too, solved by going to 3.10.34.

rohansingh commented Jun 3, 2014

@gdm85 Some progress, but nothing quite useful yet. Note that I'm no longer working on this issue personally, but have a teammate who is. Here are our findings so far:

  • As you and @yosifkit have discovered, this doesn't actually occur during builds. Rather, it occurs sometime after containers are stopped or killed and conntrack cleanup is occurring.
  • Newer kernels (contrary to reports by others) don't seem to solve this. We've been consistently reproducing with 3.13.0.
  • We have now been able to reproduce this a few times using docker-stress rather than any internal build processes. This puts us a lot closer to having crash dumps and other detailed information that we can share with the community.

Apologies for not having anything more concrete, but we're still working on it.

renato-zannon (Contributor) commented Jun 3, 2014

Newer kernels (contrary to reports by others) don't seem to solve this. We've been consistently reproducing with 3.13.0.

Have you tried with 3.14.x? I used to have this almost once a day, and now it hasn't happened to me in months (with no change in workflow). Of course that doesn't mean the bug is fixed, but it might at least have become less likely to trigger on later kernels.

EDIT: @rohansingh how long does it usually take for you to hit a failure with docker-stress? I could try it out on my machine to see if I can reproduce it in a reasonable amount of time.

gdm85 (Contributor) commented Jun 3, 2014

@rohansingh thanks for your feedback; it is indeed a blocker for any production usage.

The only workaround I can think of is to somehow serialize the killing of containers, to reduce the overlap of multiple conntrack cleanups... but this would be just a hack, and not even guaranteed to completely address the issue.
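
As a rough illustration of that hack, stopping containers one at a time with a pause in between might look like this (a sketch only; it narrows the window where cleanups overlap, but guarantees nothing):

#!/bin/bash
# Stop running containers sequentially rather than in parallel, pausing
# between each so the per-namespace conntrack cleanups don't overlap.
for id in $(docker ps -q); do
    docker stop "$id"
    sleep 10   # arbitrary delay to let the namespace teardown settle
done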

konobi commented Jun 3, 2014

Though I'm not a Docker user, we were also seeing the same with libvirt+LXC.

Workaround:
We had a bridge interface (virbr0) that we weren't even using. Once we removed the extraneous bridge, we haven't seen this issue again. It seems that even having that bridge around for NAT purposes causes everything to get connection-tracked, regardless of whether or not there's actually any NAT going on.
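
For a libvirt setup like that, dropping the unused default NAT bridge is roughly the following (a sketch; only do this if nothing actually depends on virbr0):

# Check whether anything is still attached to the bridge.
brctl show virbr0

# Tear down libvirt's default NAT network (which owns virbr0) and keep it
# from coming back on the next boot.
virsh net-destroy default
virsh net-autostart default --disable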

gdm85 (Contributor) commented Jun 6, 2014

There is now a (tentative) patch upstream.

If somebody is already compiling their own kernel, maybe they can give this a spin?

rsampaio (Contributor) commented Jun 10, 2014

I can confirm that the patch posted on the upstream bug prevents the crash with a pure-LXC test case.

gdm85 (Contributor) commented Jun 10, 2014

@rsampaio nice to hear that! I patched the kernel for Ubuntu 14.04 LTS and am going to publish Dockerfiles and .debs shortly.

gdm85 (Contributor) commented Jun 10, 2014

For people interested in testing the first and second of the two patches available upstream: patched Ubuntu .deb packages are in release v0.1.0 and release v0.2.0.

You can build the same packages I did by using this script to debootstrap Trusty and then my Dockerfile for a kernel builder image.

UPDATE: I have now built both patched kernels and am testing the second one under intense container start/kill.
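
For anyone who would rather rebuild the packages locally instead of using the pre-built .debs, the usual Ubuntu kernel rebuild goes roughly like this (a sketch; the patch file name is a placeholder and deb-src entries must be enabled in sources.list):

# Fetch the Ubuntu kernel source and its build dependencies.
apt-get source linux-image-$(uname -r)
sudo apt-get build-dep linux-image-$(uname -r)
cd linux-*/

# Apply the conntrack fix from the upstream bug (placeholder file name).
patch -p1 < nf-nat-conntrack-fix.patch

# Build the binary kernel packages; the resulting .debs land one level up.
fakeroot debian/rules clean
fakeroot debian/rules binary-headers binary-generic
sudo dpkg -i ../linux-image-*.deb ../linux-headers-*.deb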

unclejack (Contributor) commented Jun 18, 2014

As @f0 commented on #6439:

this seems like the fix for the problem https://bugzilla.kernel.org/show_bug.cgi?id=65191

gdm85 (Contributor) commented Jun 24, 2014

I've been running the patched kernel for 12 days now, and I can confirm the issue is gone.

Now if upstream would merge that patch, this bug could be closed and the pressure would be on individual distro maintainers instead.

tianon (Member) commented Jun 27, 2014

gregkh commented Jul 7, 2014

Will be in the next round of stable kernel releases, so you can mark this one closed.

unclejack (Contributor) commented Jul 7, 2014

@gregkh Thanks!
