Kernel <3.8.0 panics on lxc-start with 1-core/low-memory VM #407

Closed
creack opened this Issue Apr 14, 2013 · 30 comments

Contributor

creack commented Apr 14, 2013

The issue does not occur on an 8-core machine.

The lxc-start process causes a kernel panic while waiting for the child process to return.

It has something to do with the unmount/lock/namespace handling. The panic output is difficult to capture.

On the latest version, this happens after about 5 hello-world runs.
On older versions, it takes longer.

I'll push a script that reproduces the issue.

@creack creack added a commit that referenced this issue Apr 14, 2013

@creack creack Add a script to help reproduce #407 1ec6c22
Contributor

creack commented Apr 14, 2013

I managed to reproduce the issue on 2ee3db6. I need to update the test script to handle the old docker/dockerd scheme in order to go further.

Collaborator

shykes commented Apr 14, 2013

Attaching screenshots from the VirtualBox console output. Unfortunately the output is incomplete.

Steps to reproduce:

  1. Run docker in daemon mode
  2. Run: for i in $(seq 100); do docker run base echo hello world; done

The command causing the crash will print the intended output ("hello world"), then crash before returning.

Screenshot 1: visible immediately.

Screenshot 2: appears 2-5 seconds after screenshot 1, then is re-printed every 3-5 seconds.

creack was assigned Apr 15, 2013

Collaborator

shykes commented Apr 15, 2013

This is a blocker for 0.2.

My best guess is some sort of interaction between aufs and lxc-start - maybe we unmount the rootfs too early, for example?

Collaborator

shykes commented Apr 15, 2013

@creack can you share the exact steps to reproduce with maximum certainty? That way we can all help with debugging, by each trying different revisions.

Contributor

creack commented Apr 15, 2013

I pushed my script in contrib/crashTest.go

You need to update the docker path and just 'go run crashTest.go'.
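
For reference, here is a minimal sketch of the kind of loop such a crash-test script could run (the real contrib/crashTest.go may differ; the dockerBinary path and the iteration count below are assumptions):

package main

import (
	"log"
	"os/exec"
)

// dockerBinary is an assumption: point it at your locally built docker binary.
const dockerBinary = "/usr/local/bin/docker"

func main() {
	// Repeatedly spawn short-lived containers; on affected kernels the host
	// eventually panics while lxc-start tears down the container namespaces.
	for i := 0; i < 10000; i++ {
		out, err := exec.Command(dockerBinary, "run", "base", "echo", "hello", "world").CombinedOutput()
		if err != nil {
			log.Fatalf("run %d failed: %v\n%s", i, err, out)
		}
		log.Printf("run %d: %s", i, out)
	}
}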


Collaborator

shykes commented Apr 15, 2013

Thanks. What is the current range of good / bad revisions that you identified?


Contributor

jpetazzo commented Apr 15, 2013

Things to try:

  • reproduce with the hardened kernel (s3://get.docker.io/kernels/linux-headers-3.2.40-grsec-dotcloud_42~tethys_amd64.deb)
  • reproduce in such a way that we actually get the full backtrace (e.g. in a Xen VM on our test machines at the office :-))
  • if the problem can be triggered in a Xen VM, extract the backtrace of the kernel (starting point: xenctx)

You mentioned that the problem happened on UP machines but not SMP. If
that's indeed the case, try with 1 core but with SMP code anyway (IIRC,
kernel option noreplace-smp).

Contributor

unclejack commented Apr 15, 2013

Memory use increases after every docker run. It looks like aufs has some kind of problem or there's some other problem within the kernel.

I've just tried the script posted above with 10000 runs, and I was able to get 3.8.7 with aufs3 to start swapping on a machine with 3GB of RAM. Memory never got released after running this script; it just kept growing.
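
A quick way to watch for that kind of leak while a run loop is going is to sample MemFree between runs; a minimal sketch (the 10-second interval is arbitrary):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// memFreeKB parses the MemFree line out of /proc/meminfo (value in kB).
func memFreeKB() (int64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		if strings.HasPrefix(s.Text(), "MemFree:") {
			fields := strings.Fields(s.Text()) // e.g. ["MemFree:", "333512", "kB"]
			if len(fields) >= 2 {
				return strconv.ParseInt(fields[1], 10, 64)
			}
		}
	}
	return 0, fmt.Errorf("MemFree not found in /proc/meminfo")
}

func main() {
	// Log MemFree every 10 seconds; a steady downward trend across thousands
	// of container runs points at memory that is never given back.
	for {
		kb, err := memFreeKB()
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("%s MemFree: %d kB\n", time.Now().Format(time.RFC1123), kb)
		time.Sleep(10 * time.Second)
	}
}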

Contributor

creack commented Apr 16, 2013

I installed a fresh Ubuntu 13 with kernel 3.8.0 and wasn't able to reproduce (I let the script run for ~1 hour).
However, as @unclejack said, it leaks.

Contributor

creack commented Apr 16, 2013

After a lot of tests, I am pretty sure the leaks are due to #197.

Contributor

unclejack commented Apr 18, 2013

I've performed a few tests to try to reproduce this on 12.04 with stock kernels.
It neither crashed nor leaked.

Docker was downloaded from docker.io to keep things simple.

docker version
Version:0.1.4
Git Commit:
uname -rv
3.2.0-40-generic #64-Ubuntu SMP Mon Mar 25 21:22:10 UTC 2013
cat /proc/cpuinfo | grep processor
processor       : 0
cat /proc/meminfo | grep Total
MemTotal:         496260 kB
SwapTotal:             0 kB

memory after first test w/ 100 runs & before second test

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 333512  29196  54776    0    0   214    38   41  213  3  2 94  1

memory after the second test w/ 100 runs

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 324360  30968  56548    0    0   182    46   47  283  4  2 93  1
Contributor

creack commented Apr 18, 2013

Do you perform your test with @shykes' command or with my script (in /contrib/crashTest.go)?
What version of the kernel and lxc are you using?

Contributor

unclejack commented Apr 18, 2013

@creack I was trying the command @shykes has posted earlier.

I'll try your crashTest script as well.

lxc is the standard one from Ubuntu 12.04.

I'm getting the same issue; I captured the output from VirtualBox here: https://gist.github.com/robknight/5430280 - it's pretty much the same thing reported by @shykes earlier. I'm running Docker inside the standard Vagrant box, with an OS X 10.8 host.

For me this doesn't seem to have much to do with how long the container runs. I'm trying to build an image using docker-build, and my build succeeds maybe 25% of the time, while the other 75% results in the above crash, after which the Vagrant box becomes unresponsive and has to be restarted.

My docker-build changefile only has two lines:
from base:latest
copy dist/dbx.tar /tmp/dbx.tar

The file referenced here definitely exists, and the build does succeed sometimes.

Identical behaviour occurs when using a different base image, e.g. centos.

I'm also getting kernel panics running docker 0.1.5, 0.1.6, and 0.1.7 on Ubuntu 12.10 (Linux 3.5.0-27), on a bare-metal Dell Latitude D830 with an Intel Core 2 Duo and 4GB of RAM.

Reproduced by running the example multiple (<20) times:

docker run base echo hello world

Screen photos (docker 0.1.7):
https://f.cloud.github.com/assets/361379/406625/f4a5a682-aaa8-11e2-8add-2c965f5758b9.jpg
https://f.cloud.github.com/assets/361379/406627/0581c620-aaa9-11e2-9f3d-18f0ec82aae6.jpg

Collaborator

shykes commented Apr 23, 2013

It seems that for the time being Docker requires Linux 3.8 or later. This is unfortunate, but it seems earlier versions just can't handle spawning too many short-lived namespaced processes, and we couldn't pinpoint the exact change which caused the bug to strike more frequently...

Docker now issues a warning on Linux kernels <3.8.

shykes closed this Apr 23, 2013

Contributor

jpetazzo commented Apr 23, 2013

The screenshot posted by @barryaustin shows that it's exactly the same problem with bare metal. That's very useful, because it rules out many potential side effects caused by virtualization.

Are we sure we want to close this issue? People running Ubuntu in production will very probably run 12.04 LTS rather than 12.10 or 13.04, and 12.04 LTS may never support 3.8.

Collaborator

shykes commented Apr 23, 2013

I don't mind keeping it open, but that would imply that there's something we can do other than upgrading the kernel. Do you have any suggestions?


Contributor

jpetazzo commented Apr 23, 2013

My plan would look like this:

  • reproduce the issue using only lxc-start commands
  • escalate to the lxc mailing list
  • reproduce the issue using only basic namespace code (unshare or just clone syscalls; a sketch follows this list)
  • escalate to the kernel mailing list


Collaborator

shykes commented Apr 23, 2013

I agree this would be great. I re-opened the issue and removed it from 0.2.

Want to lead the charge? Let me know and I'll assign to you.

shykes reopened this Apr 23, 2013

Contributor

lopter commented Apr 24, 2013

Per @creack's request, here is what happens for me on apt-get update && apt-get install:

https://gist.github.com/lopter/5449001#file-dmesg-log

I'm running Docker in daemon mode on Ubuntu 12.04 in Virtualbox:

louis@dotcloud-docker:~$ uname -a
Linux dotcloud-docker 3.2.0-40-generic #64-Ubuntu SMP Mon Mar 25 21:22:10 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
louis@dotcloud-docker:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.04.2 LTS
Release:        12.04
Codename:       precise
louis@dotcloud-docker:~$ 
Contributor

unclejack commented Apr 25, 2013

I've just tried the exact same setup as @lopter. It didn't crash at all, not even with the CPU limit set to 40%.

However, I was able to make the system leak memory when the memory cgroup didn't get mounted. That seems to happen about every other boot on Ubuntu in VirtualBox. This didn't seem to break the system even after running the script @shykes posted above with 10000 runs.

I used the precise64 box and updated the system to the latest kernel (3.2.0-40).

Contributor

unclejack commented Apr 25, 2013

It locked up by the time it reached the 1355th run with the script posted by @shykes.

[ 2692.120088] BUG: soft lockup - CPU#0 stuck for 23s! [lxc-start:27038]
[ 2692.122073] Modules linked in: veth aufs xt_addrtype vboxvideo(O) drm vboxsf(O) ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables bridge stp vesafb ppdev i2c_piix4 psmouse serio_raw vboxguest(O) nfsd parport_pc nfs lockd fscache auth_rpcgss nfs_acl mac_hid sunrpc lp parport ext2
[ 2692.123992] CPU 0
[ 2692.124019] Modules linked in: veth aufs xt_addrtype vboxvideo(O) drm vboxsf(O) ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables bridge stp vesafb ppdev i2c_piix4 psmouse serio_raw vboxguest(O) nfsd parport_pc nfs lockd fscache auth_rpcgss nfs_acl mac_hid sunrpc lp parport ext2
[ 2692.124053]
[ 2692.124053] Pid: 27038, comm: lxc-start Tainted: G      D    O 3.2.0-40-generic #64-Ubuntu innotek GmbH VirtualBox/VirtualBox
[ 2692.124053] RIP: 0010:[<ffffffff8103ebd5>]  [<ffffffff8103ebd5>] __ticket_spin_lock+0x25/0x30
[ 2692.124053] RSP: 0018:ffff8800157937b8  EFLAGS: 00000297
[ 2692.124053] RAX: 000000000000ca9e RBX: ffffffff8112525c RCX: 0000000100045ada
[ 2692.124053] RDX: 000000000000ca9f RSI: ffffffff8117a8e0 RDI: ffff880017c10950
[ 2692.124053] RBP: ffff8800157937b8 R08: 0000000000000001 R09: 0000000000000000
[ 2692.124053] R10: ffff880014e69410 R11: 0000000000000001 R12: 000000018200017f
[ 2692.124053] R13: ffff880012c02940 R14: 000000000000000c R15: 000000000000000c
[ 2692.124053] FS:  0000000000000000(0000) GS:ffff880017c00000(0000) knlGS:0000000000000000
[ 2692.124053] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2692.124053] CR2: ffff880117c00001 CR3: 0000000001c05000 CR4: 00000000000006f0
[ 2692.124053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2692.124053] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2692.124053] Process lxc-start (pid: 27038, threadinfo ffff880015792000, task ffff880016ff1700)
[ 2692.124053] Stack:
[ 2692.124053]  ffff8800157937c8 ffffffff8119712e ffff8800157937e8 ffffffff81198f5d
[ 2692.124053]  ffff880014e69400 0000000000000010 ffff8800157937f8 ffffffff8119904f
[ 2692.124053]  ffff880015793848 ffffffff8117ad83 ffff880014e69410 ffff880016efbf00
[ 2692.124053] Call Trace:
[ 2692.124053]  [<ffffffff8119712e>] vfsmount_lock_local_lock+0x1e/0x30
[ 2692.124053]  [<ffffffff81198f5d>] mntput_no_expire+0x1d/0xf0
[ 2692.124053]  [<ffffffff8119904f>] mntput+0x1f/0x30
[ 2692.124053]  [<ffffffff8117ad83>] __fput+0x153/0x210
[ 2692.124053]  [<ffffffff8117ae65>] fput+0x25/0x30
[ 2692.124053]  [<ffffffff81065a89>] removed_exe_file_vma+0x39/0x50
[ 2692.124053]  [<ffffffff81143039>] remove_vma+0x89/0x90
[ 2692.124053]  [<ffffffff81145b38>] exit_mmap+0xe8/0x140
[ 2692.124053]  [<ffffffff81065b42>] mmput.part.16+0x42/0x130
[ 2692.124053]  [<ffffffff81065c59>] mmput+0x29/0x30
[ 2692.124053]  [<ffffffff8106c5f3>] exit_mm+0x113/0x130
[ 2692.124053]  [<ffffffff810e5555>] ? taskstats_exit+0x45/0x240
[ 2692.124053]  [<ffffffff8165e785>] ? _raw_spin_lock_irq+0x15/0x20
[ 2692.124053]  [<ffffffff8106c77e>] do_exit+0x16e/0x450
[ 2692.124053]  [<ffffffff8165f620>] oops_end+0xb0/0xf0
[ 2692.124053]  [<ffffffff81644907>] no_context+0x150/0x15d
[ 2692.124053]  [<ffffffff81644adf>] __bad_area_nosemaphore+0x1cb/0x1ea
[ 2692.124053]  [<ffffffff816441e4>] ? pud_offset+0x1a/0x20
[ 2692.124053]  [<ffffffff81644b11>] bad_area_nosemaphore+0x13/0x15
[ 2692.124053]  [<ffffffff81662266>] do_page_fault+0x426/0x520
[ 2692.124053]  [<ffffffff81323730>] ? zlib_inflate+0x1320/0x16d0
[ 2692.124053]  [<ffffffff81318c41>] ? vsnprintf+0x461/0x600
[ 2692.124053]  [<ffffffff8165ebf5>] page_fault+0x25/0x30
[ 2692.124053]  [<ffffffff81198f68>] ? mntput_no_expire+0x28/0xf0
[ 2692.124053]  [<ffffffff81198f5d>] ? mntput_no_expire+0x1d/0xf0
[ 2692.124053]  [<ffffffff8119904f>] mntput+0x1f/0x30
[ 2692.124053]  [<ffffffff8119addc>] kern_unmount+0x2c/0x40
[ 2692.124053]  [<ffffffff811d9ca5>] pid_ns_release_proc+0x15/0x20
[ 2692.124053]  [<ffffffff811de8f9>] proc_flush_task+0x89/0xa0
[ 2692.124053]  [<ffffffff8106b1e3>] release_task+0x33/0x130
[ 2692.124053]  [<ffffffff8131b1cd>] ? __put_user_4+0x1d/0x30
[ 2692.124053]  [<ffffffff8106b77e>] wait_task_zombie+0x49e/0x5f0
[ 2692.124053]  [<ffffffff8106b9d3>] wait_consider_task.part.9+0x103/0x170
[ 2692.124053]  [<ffffffff8106baa5>] wait_consider_task+0x65/0x70
[ 2692.124053]  [<ffffffff8106bbb1>] do_wait+0x101/0x260
[ 2692.124053]  [<ffffffff8106cf00>] sys_wait4+0xa0/0xf0
[ 2692.124053]  [<ffffffff8106a700>] ? wait_task_continued+0x170/0x170
[ 2692.124053]  [<ffffffff81666a82>] system_call_fastpath+0x16/0x1b
[ 2692.124053] Code: 90 90 90 90 90 90 55 b8 00 00 01 00 48 89 e5 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 13 66 0f 1f 84 00 00 00 00 00 f3 90 0f b7 07 <66> 39 d0 75 f6 5d c3 0f 1f 40 00 8b 17 55 31 c0 48 89 e5 89 d1
[ 2692.124053] Call Trace:
[ 2692.124053]  [<ffffffff8119712e>] vfsmount_lock_local_lock+0x1e/0x30
[ 2692.124053]  [<ffffffff81198f5d>] mntput_no_expire+0x1d/0xf0
[ 2692.124053]  [<ffffffff8119904f>] mntput+0x1f/0x30
[ 2692.124053]  [<ffffffff8117ad83>] __fput+0x153/0x210
[ 2692.124053]  [<ffffffff8117ae65>] fput+0x25/0x30
[ 2692.124053]  [<ffffffff81065a89>] removed_exe_file_vma+0x39/0x50
[ 2692.124053]  [<ffffffff81143039>] remove_vma+0x89/0x90
[ 2692.124053]  [<ffffffff81145b38>] exit_mmap+0xe8/0x140
[ 2692.124053]  [<ffffffff81065b42>] mmput.part.16+0x42/0x130
[ 2692.124053]  [<ffffffff81065c59>] mmput+0x29/0x30
[ 2692.124053]  [<ffffffff8106c5f3>] exit_mm+0x113/0x130
[ 2692.124053]  [<ffffffff810e5555>] ? taskstats_exit+0x45/0x240
[ 2692.124053]  [<ffffffff8165e785>] ? _raw_spin_lock_irq+0x15/0x20
[ 2692.124053]  [<ffffffff8106c77e>] do_exit+0x16e/0x450
[ 2692.124053]  [<ffffffff8165f620>] oops_end+0xb0/0xf0
[ 2692.124053]  [<ffffffff81644907>] no_context+0x150/0x15d
[ 2692.124053]  [<ffffffff81644adf>] __bad_area_nosemaphore+0x1cb/0x1ea
[ 2692.124053]  [<ffffffff816441e4>] ? pud_offset+0x1a/0x20
[ 2692.124053]  [<ffffffff81644b11>] bad_area_nosemaphore+0x13/0x15
[ 2692.124053]  [<ffffffff81662266>] do_page_fault+0x426/0x520
[ 2692.124053]  [<ffffffff81323730>] ? zlib_inflate+0x1320/0x16d0
[ 2692.124053]  [<ffffffff81318c41>] ? vsnprintf+0x461/0x600
[ 2692.124053]  [<ffffffff8165ebf5>] page_fault+0x25/0x30
[ 2692.124053]  [<ffffffff81198f68>] ? mntput_no_expire+0x28/0xf0
[ 2692.124053]  [<ffffffff81198f5d>] ? mntput_no_expire+0x1d/0xf0
[ 2692.124053]  [<ffffffff8119904f>] mntput+0x1f/0x30
[ 2692.124053]  [<ffffffff8119addc>] kern_unmount+0x2c/0x40
[ 2692.124053]  [<ffffffff811d9ca5>] pid_ns_release_proc+0x15/0x20
[ 2692.124053]  [<ffffffff811de8f9>] proc_flush_task+0x89/0xa0
[ 2692.124053]  [<ffffffff8106b1e3>] release_task+0x33/0x130
[ 2692.124053]  [<ffffffff8131b1cd>] ? __put_user_4+0x1d/0x30
[ 2692.124053]  [<ffffffff8106b77e>] wait_task_zombie+0x49e/0x5f0
[ 2692.124053]  [<ffffffff8106b9d3>] wait_consider_task.part.9+0x103/0x170
[ 2692.124053]  [<ffffffff8106baa5>] wait_consider_task+0x65/0x70
[ 2692.124053]  [<ffffffff8106bbb1>] do_wait+0x101/0x260
[ 2692.124053]  [<ffffffff8106cf00>] sys_wait4+0xa0/0xf0
[ 2692.124053]  [<ffffffff8106a700>] ? wait_task_continued+0x170/0x170
[ 2692.124053]  [<ffffffff81666a82>] system_call_fastpath+0x16/0x1b
vagrant@precise64:~$ dmesg | less
[ 2720.112029]  [<ffffffff81644b11>] bad_area_nosemaphore+0x13/0x15
[ 2720.112029]  [<ffffffff81662266>] do_page_fault+0x426/0x520
[ 2720.112029]  [<ffffffff81323730>] ? zlib_inflate+0x1320/0x16d0
[ 2720.112029]  [<ffffffff81318c41>] ? vsnprintf+0x461/0x600
[ 2720.112029]  [<ffffffff8165ebf5>] page_fault+0x25/0x30
[ 2720.112029]  [<ffffffff81198f68>] ? mntput_no_expire+0x28/0xf0
[ 2720.112029]  [<ffffffff81198f5d>] ? mntput_no_expire+0x1d/0xf0
[ 2720.112029]  [<ffffffff8119904f>] mntput+0x1f/0x30
[ 2720.112029]  [<ffffffff8119addc>] kern_unmount+0x2c/0x40
[ 2720.112029]  [<ffffffff811d9ca5>] pid_ns_release_proc+0x15/0x20
[ 2720.112029]  [<ffffffff811de8f9>] proc_flush_task+0x89/0xa0
[ 2720.112029]  [<ffffffff8106b1e3>] release_task+0x33/0x130
[ 2720.112029]  [<ffffffff8131b1cd>] ? __put_user_4+0x1d/0x30
[ 2720.112029]  [<ffffffff8106b77e>] wait_task_zombie+0x49e/0x5f0
[ 2720.112029]  [<ffffffff8106b9d3>] wait_consider_task.part.9+0x103/0x170
[ 2720.112029]  [<ffffffff8106baa5>] wait_consider_task+0x65/0x70
[ 2720.112029]  [<ffffffff8106bbb1>] do_wait+0x101/0x260
[ 2720.112029]  [<ffffffff8106cf00>] sys_wait4+0xa0/0xf0
[ 2720.112029]  [<ffffffff8106a700>] ? wait_task_continued+0x170/0x170
[ 2720.112029]  [<ffffffff81666a82>] system_call_fastpath+0x16/0x1b
[ 2720.112029] Code: 90 90 90 90 90 90 90 90 90 55 b8 00 00 01 00 48 89 e5 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 13 66 0f 1f 84 00 00 00 00 00 f3 90 <0f> b7 07 66 39 d0 75 f6 5d c3 0f 1f 40 00 8b 17 55 31 c0 48 89
[ 2720.112029] Call Trace:
[ 2720.112029]  [<ffffffff8119712e>] vfsmount_lock_local_lock+0x1e/0x30
[ 2720.112029]  [<ffffffff81198f5d>] mntput_no_expire+0x1d/0xf0
[ 2720.112029]  [<ffffffff8119904f>] mntput+0x1f/0x30
[ 2720.112029]  [<ffffffff8117ad83>] __fput+0x153/0x210
[ 2720.112029]  [<ffffffff8117ae65>] fput+0x25/0x30
[ 2720.112029]  [<ffffffff81065a89>] removed_exe_file_vma+0x39/0x50
[ 2720.112029]  [<ffffffff81143039>] remove_vma+0x89/0x90
[ 2720.112029]  [<ffffffff81145b38>] exit_mmap+0xe8/0x140
[ 2720.112029]  [<ffffffff81065b42>] mmput.part.16+0x42/0x130
[ 2720.112029]  [<ffffffff81065c59>] mmput+0x29/0x30
[ 2720.112029]  [<ffffffff8106c5f3>] exit_mm+0x113/0x130
[ 2720.112029]  [<ffffffff810e5555>] ? taskstats_exit+0x45/0x240
[ 2720.112029]  [<ffffffff8165e785>] ? _raw_spin_lock_irq+0x15/0x20
[ 2720.112029]  [<ffffffff8106c77e>] do_exit+0x16e/0x450
[ 2720.112029]  [<ffffffff8165f620>] oops_end+0xb0/0xf0
[ 2720.112029]  [<ffffffff81644907>] no_context+0x150/0x15d
[ 2720.112029]  [<ffffffff81644adf>] __bad_area_nosemaphore+0x1cb/0x1ea
[ 2720.112029]  [<ffffffff816441e4>] ? pud_offset+0x1a/0x20
[ 2720.112029]  [<ffffffff81644b11>] bad_area_nosemaphore+0x13/0x15
[ 2720.112029]  [<ffffffff81662266>] do_page_fault+0x426/0x520
[ 2720.112029]  [<ffffffff81323730>] ? zlib_inflate+0x1320/0x16d0
[ 2720.112029]  [<ffffffff81318c41>] ? vsnprintf+0x461/0x600
[ 2720.112029]  [<ffffffff8165ebf5>] page_fault+0x25/0x30
[ 2720.112029]  [<ffffffff81198f68>] ? mntput_no_expire+0x28/0xf0
[ 2720.112029]  [<ffffffff81198f5d>] ? mntput_no_expire+0x1d/0xf0
[ 2720.112029]  [<ffffffff8119904f>] mntput+0x1f/0x30
[ 2720.112029]  [<ffffffff8119addc>] kern_unmount+0x2c/0x40
[ 2720.112029]  [<ffffffff811d9ca5>] pid_ns_release_proc+0x15/0x20
[ 2720.112029]  [<ffffffff811de8f9>] proc_flush_task+0x89/0xa0
[ 2720.112029]  [<ffffffff8106b1e3>] release_task+0x33/0x130
[ 2720.112029]  [<ffffffff8131b1cd>] ? __put_user_4+0x1d/0x30
[ 2720.112029]  [<ffffffff8106b77e>] wait_task_zombie+0x49e/0x5f0
[ 2720.112029]  [<ffffffff8106b9d3>] wait_consider_task.part.9+0x103/0x170
[ 2720.112029]  [<ffffffff8106baa5>] wait_consider_task+0x65/0x70
[ 2720.112029]  [<ffffffff8106bbb1>] do_wait+0x101/0x260
[ 2720.112029]  [<ffffffff8106cf00>] sys_wait4+0xa0/0xf0
[ 2720.112029]  [<ffffffff8106a700>] ? wait_task_continued+0x170/0x170
[ 2720.112029]  [<ffffffff81666a82>] system_call_fastpath+0x16/0x1b
Contributor

paulhammond commented May 23, 2013

As another data point, I just ran the crashTest.go script on a Debian Wheezy VM running under VirtualBox on a dual-core i7 MacBook Air:

$ docker version
Version: 0.3.2
Git Commit: e289308
Kernel: 3.2.0-4-amd64
WARNING: No memory limit support
WARNING: No swap limit support

$ uname -rv
3.2.0-4-amd64 #1 SMP Debian 3.2.41-2

$ cat /proc/cpuinfo | grep processor
processor   : 0

The script has been running for an hour without crashing, and has done just over 10,000 runs. Memory usage remained constant throughout (with between 4 and 6MB free out of 250MB the whole time).

$ sudo /usr/local/go/bin/go run crashTest.go 
2013/05/23 00:05:54 WARNING: You are running linux kernel version 3.2.0-4-amd64, which might be unstable running docker. Please upgrade your kernel to 3.8.0.
2013/05/23 00:05:54 WARNING: cgroup mountpoint not found for memory
2013/05/23 00:05:54 Listening for RCLI/tcp on 127.0.0.1:4242
2013/05/23 00:05:54 docker run base echo 3
2013/05/23 00:05:54 docker run base echo 4
...
2013/05/23 01:05:59 docker run base echo 10153
2013/05/23 01:05:59 docker run base echo 10154

I think this means either:

  • I've made a mistake somewhere
  • The bug is a regression introduced after 3.2.0
  • The bug only affects the Ubuntu kernel tree

I hope some progress can be made on this issue. One of the things that I like about Docker is how easy it is to get started; requiring a 3.8 kernel makes that much harder in many environments.

Collaborator

shykes commented May 23, 2013

Broadening kernel support is going to be a major priority for us in June. I'm also frustrated by the current kernel situation!


ghost commented Jun 7, 2013

Hi guys, sorry I wasn't aware of this issue when I did my write-up for OStatic. I'll update the post with a link back here. I experienced what looks to be this bug on Ubuntu 12.04 LTS, 64-bit.

I had something very similar in the past, and it was related to the hardware/kernel pair. Please test it on different CPU families if you can, to look for correlations.

Collaborator

shykes commented Jul 18, 2013

I'm closing this since it's not immediately actionable.

keeb closed this Jul 21, 2013

Contributor

jpetazzo commented Oct 7, 2013

For the record, Josh Poimboeuf found something which might be related to this:

I did some digging. These panics seem to be caused by some race
conditions related to removing a container's mounts. I was easily able
to recreate with:

for i in $(seq 1 100); do docker run -i -t -d ubuntu bash; done | xargs docker kill

The fixes needed for RHEL 6.4 (based on 2.6.32) are in the following two
upstream kernel commits:

  • "45a68628d37222e655219febce9e91b6484789b2" (fixed in 2.6.39)
  • "17cf22c33e1f1b5e435469c84e43872579497653" (fixed in 3.8)

@jpoimboe jpoimboe added a commit to jpoimboe/docker that referenced this issue Dec 2, 2013

@jpoimboe jpoimboe add env variable to disable kernel version warning
Allow the user to set DOCKER_NOWARN_KERNEL_VERSION=1 to disable the
warning for RHEL 6.5 and other distributions that don't exhibit the
panics described in moby#407.
e4aba11
Contributor

gdm85 commented May 15, 2014

Can we keep this open? I've still verified this with 0.11 and kernel 3.2.0 (2 processors).

https://gist.github.com/gdm85/9328ae13653e5683adda

Contributor

unclejack commented May 15, 2014

@gdm85 No, this will stay closed. Kernels older than 3.8 aren't supported. That means technical support isn't provided and you might run into unexpected behavior, even if it seems like it's working.

The only exception is the kernel provided by RHEL 6 (2.6.32xxxxxx), which was patched and improved to work properly with Docker.

Supporting Docker on kernels older than 3.8 isn't going to happen. Please upgrade your kernel.

creack removed their assignment Jul 24, 2014

@shykes shykes pushed a commit to shykes/docker-dev that referenced this issue Oct 2, 2014

@jpoimboe jpoimboe add env variable to disable kernel version warning
Allow the user to set DOCKER_NOWARN_KERNEL_VERSION=1 to disable the
warning for RHEL 6.5 and other distributions that don't exhibit the
panics described in moby/moby#407.
2c2a655

@sbasyal sbasyal added a commit to sbasyal/docker that referenced this issue Apr 15, 2015

@sbasyal sbasyal The link to issue 407 was broken
The link to issue 407 was broken. The old link was: moby#407 
The link must be: moby#407
db23693

@sbasyal sbasyal added a commit to sbasyal/docker that referenced this issue Apr 15, 2015

@sbasyal sbasyal The link to issue 407 was broken
The link to issue 407 was broken. The old link was: moby#407
The link must be: moby#407

Signed-off-by: Sabin Basyal <sabin.basyal@gmail.com>
6860c75