Kernel panic on Debian 8 with Docker 1.12 #29397

Closed
jmcollin78 opened this Issue Dec 14, 2016 · 15 comments

jmcollin78 commented Dec 14, 2016

Description

On a Debian 8 box in an OpenStack Juno cloud, I have Docker 1.12 installed and run a PostgreSQL image. This instance regularly crashes with a kernel panic.

Steps to reproduce the issue:

  1. Create an instance on an OpenStack Juno cloud
  2. Start the Docker image provided here: paunin/postgresql-cluster-pgsql:latest
  3. Wait for the crash

Describe the results you received:
The instance crashes with a kernel panic and these logs:

[11176.153139] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
[11176.155778] IP: [<ffffffff810a10d4>] check_preempt_wakeup+0xd4/0x1d0
[11176.157013] PGD bb2d8067 PUD b9ac2067 PMD 0 
[11176.157013] Oops: 0000 [#1] SMP 
[11176.157013] Modules linked in: ipt_MASQUERADE xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack bridge stp llc aufs(C) joydev hid_generic usbhid hid ppdev crc32_pclmul aesni_intel evdev aes_x86_64 lrw gf128mul glue_helper ablk_helper ttm cryptd drm_kms_helper serio_raw virtio_balloon parport_pc drm pvpanic parport processor i2c_piix4 thermal_sys i2c_core button autofs4 ext4 crc16 mbcache jbd2 ata_generic virtio_blk virtio_net ata_piix uhci_hcd crct10dif_pclmul crct10dif_common ehci_hcd crc32c_intel psmouse libata virtio_pci usbcore virtio_ring scsi_mod virtio usb_common floppy
[11176.157013] CPU: 0 PID: 23662 Comm: exe Tainted: G         C    3.16.0-4-amd64 #1 Debian 3.16.36-1+deb8u2
[11176.157013] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[11176.157013] task: ffff8800bad30210 ti: ffff8800badc4000 task.ti: ffff8800badc4000
[11176.157013] RIP: 0010:[<ffffffff810a10d4>]  [<ffffffff810a10d4>] check_preempt_wakeup+0xd4/0x1d0
[11176.157013] RSP: 0000:ffff88013fc03e58  EFLAGS: 00010097
[11176.157013] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000008
[11176.157013] RDX: 0000000000000000 RSI: ffff88013a87e210 RDI: ffff8800bb75e800
[11176.157013] RBP: ffff88007f8f0340 R08: ffffffff816108c0 R09: 0000000000000001
[11176.157013] R10: 0000000000020022 R11: 0000000000000010 R12: ffff8800bad30210
[11176.157013] R13: ffff88013fc12f40 R14: 0000000000000000 R15: 0000000000000000
[11176.157013] FS:  00007fe9bdc48700(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[11176.157013] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11176.157013] CR2: 0000000000000078 CR3: 00000000bad3e000 CR4: 00000000000406f0
[11176.157013] Stack:
[11176.157013]  ffffffff8109ffe2 ffff88013fc12f40 ffff88013a87e210 ffff88013fc12f40
[11176.157013]  0000000000000046 0000000000000000 0000000000000000 ffffffff81095b75
[11176.157013]  ffff88013a87e210 ffffffff81095ba4 ffff88013a87e210 ffff88013fc12f40
[11176.157013] Call Trace:
[11176.157013]  <IRQ> 
[11176.157013]  [<ffffffff8109ffe2>] ? enqueue_task_fair+0x7f2/0xe20
[11176.157013]  [<ffffffff81095b75>] ? check_preempt_curr+0x85/0xa0
[11176.157013]  [<ffffffff81095ba4>] ? ttwu_do_wakeup+0x14/0xf0
[11176.157013]  [<ffffffff81098176>] ? try_to_wake_up+0x1b6/0x2f0
[11176.157013]  [<ffffffff8108bfe0>] ? hrtimer_get_res+0x50/0x50
[11176.157013]  [<ffffffff8108bffe>] ? hrtimer_wakeup+0x1e/0x30
[11176.157013]  [<ffffffff8108c667>] ? __run_hrtimer+0x67/0x210
[11176.157013]  [<ffffffff8108ca69>] ? hrtimer_interrupt+0xe9/0x220
[11176.157013]  [<ffffffff8151b46b>] ? smp_apic_timer_interrupt+0x3b/0x50
[11176.157013]  [<ffffffff815194fd>] ? apic_timer_interrupt+0x6d/0x80
[11176.157013]  <EOI> 
[11176.157013] Code: 0f 1f 80 00 00 00 00 83 e8 01 48 8b 5b 70 39 d0 75 f5 48 8b 7d 78 48 3b 7b 78 74 15 0f 1f 00 48 8b 6d 70 48 8b 5b 70 48 8b 7d 78 <48> 3b 7b 78 75 ee 48 85 ff 74 e9 e8 8c cb ff ff 48 85 db 0f 84 
[11176.157013] RIP  [<ffffffff810a10d4>] check_preempt_wakeup+0xd4/0x1d0
[11176.157013]  RSP <ffff88013fc03e58>
[11176.157013] CR2: 0000000000000078
[11176.157013] ---[ end trace 90a4d010673f1243 ]---
[11176.157013] Kernel panic - not syncing: Fatal exception in interrupt
[11176.157013] Shutting down cpus with NMI
[11176.157013] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[11176.157013] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

Describe the results you expected:
No crash

Additional information you deem important (e.g. issue happens only occasionally):

root$ uname -a 
Linux etg-dbs-temp-pgcluster-0 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux

Output of docker version:

root$ docker version 
Client:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 21:39:14 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 21:39:14 2016
 OS/Arch:      linux/amd64

Output of docker info:

root$ docker info 
Containers: 4
 Running: 3
 Paused: 0
 Stopped: 1
Images: 4
Server Version: 1.12.3
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 34
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 3.16.0-4-amd64
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 3.873 GiB
Name: xxxxxxx
ID: BADA:AGVI:24J2:LEUY:4JPS:2LB7:QFSY:TBGQ:25XJ:DQC7:7ZLW:NTAD
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Http Proxy: http://10.228.11.142:3128
No Proxy: localhost,127.0.0.1,minint.fr
Registry: https://index.docker.io/v1/
WARNING: No memory limit support
WARNING: No swap limit support
WARNING: No kernel memory limit support
WARNING: No oom kill disable support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
Insecure Registries:
 xxxxxxxx:80
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):
Openstack Juno environment.

justincormack Dec 16, 2016

Contributor

A kernel crash is not a problem we can fix in Docker. It means you need to fix the kernel, or possibly the virtualisation, as it is an emulated OpenStack machine. I recommend testing the same kernel on a different VM and/or physical hardware, and testing a more recent kernel (e.g. from Debian backports).

jmcollin78 Dec 16, 2016

I understand your point of view, but only instances with Docker hit these kernel panics. The same instance in the same cloud with the same OS but without Docker doesn't crash. So I guess something in Docker causes this kernel panic, and it should be fixed or worked around.

cpuguy83 Dec 16, 2016

Contributor

@jmcollin78 Sure, something Docker is doing may be triggering the issue, but that doesn't mean docker is the cause or even that docker could work around it.

From the stack trace it looks like hardware (virtualized or otherwise) issues... but @justincormack would probably know more about that (being a Xen maintainer).

hdimitriou Jan 23, 2017

@cpuguy83, @justincormack I understand your point of view, but if kernel 3.16 crashes with Docker and you cannot do anything about it, then just state that Docker is not compatible with it.
Because nothing is said about it, many people end up using it, and when they try to scale they hit a wall they cannot climb.
The whole Docker approach to this subject is faulty: in the commercial support policy (https://success.docker.com/Policies/Compatibility_Matrix) Debian is not listed among the supported systems, but nowhere in the official documents can anyone find the reason. Instead you find a page on how to install on Debian.

Really, for a non-startup company it's so much easier to stop using Docker than to change distribution, and this is plain sad for both your effort and ours.

ijc Jan 23, 2017

Contributor

The stack trace and kernel version here look identical to Debian bug #847360 to me. I'd suggest subscribing to that bug and perhaps posting there about your use case and reproduction steps (that bug seems rather light on those to me).
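
For comparing two oops reports like this, a minimal Python sketch can pull out the faulting function and the ordered call-trace symbols so they can be diffed. The helper names `faulting_symbol` and `trace_symbols` are hypothetical, and the regular expression assumes the `[<address>] symbol+0xoff/0xlen` line format seen in the logs in this thread; other kernel versions may print traces differently.

```python
import re

# Matches entries such as
#   [<ffffffff81095b75>] ? check_preempt_curr+0x85/0xa0
# and captures only the symbol name (assumed format, taken from this thread).
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def faulting_symbol(oops_text):
    """Return the symbol on the first 'IP:' (or 'RIP:') line, or None."""
    for line in oops_text.splitlines():
        if "IP:" in line:
            match = FRAME_RE.search(line)
            if match:
                return match.group(1)
    return None


def trace_symbols(oops_text):
    """Return call-trace symbols in order, from 'Call Trace:' up to 'Code:'."""
    symbols = []
    in_trace = False
    for line in oops_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace and "Code:" in line:
            break  # the trailing RIP summary after 'Code:' is not a frame
        if in_trace:
            match = FRAME_RE.search(line)
            if match:
                symbols.append(match.group(1))
    return symbols


if __name__ == "__main__":
    # A few lines copied from the report above.
    sample = """\
[11176.155778] IP: [<ffffffff810a10d4>] check_preempt_wakeup+0xd4/0x1d0
[11176.157013] Call Trace:
[11176.157013]  <IRQ>
[11176.157013]  [<ffffffff8109ffe2>] ? enqueue_task_fair+0x7f2/0xe20
[11176.157013]  [<ffffffff81095b75>] ? check_preempt_curr+0x85/0xa0
[11176.157013]  <EOI>
[11176.157013] Code: 0f 1f 80
"""
    print(faulting_symbol(sample), trace_symbols(sample))
```

Two reports with the same faulting symbol and the same symbol sequence are only candidates for being the same bug; the kernel version and offsets still need to be compared by hand.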

cpuguy83 Jan 23, 2017

Contributor

@hdimitriou You say "regularly" in the original post. What does this mean? When the container is running after some time the kernel panics? The kernel panics exactly when the container starts?

I understand it's frustrating to run into an issue like this that seemingly blocks everything you are trying to do.

hdimitriou Jan 23, 2017

@cpuguy83 I did not write the original ticket and I do not want to hijack it. After suffering from such an issue repeatedly, I have noticed a significant number of tickets that refer to panics with kernel 3.16. I haven't seen the issue resolved in any of those tickets without moving to a newer kernel.
As a result, I wonder whether you should note somewhere that there are unsolved issues when running Docker under the 3.16 kernel, for people who are considering production usage.

Sorry again for taking attention from the original issue.

jmcollin78 Jan 23, 2017

Hi @hdimitriou, @cpuguy83, @justincormack, thanks for your efforts to help with this issue. I have subscribed to the Debian kernel bug as suggested. "Regularly" means that after a certain time (not at startup), the VM randomly stops with this kernel panic.
The VM can run for 10 days without trouble and then stop one day. Nothing in particular is noticed on those days.
It could be the MongoDB container, the PostgreSQL container, or another container that crashes.
The only common thing I notice is that all crashing VMs have an OpenStack volume mounted into the container.
Other containers without a mounted volume don't crash (as far as I can see).

cpuguy83 Jan 24, 2017

Contributor

Also note I found this: kubernetes/kubernetes#23253 (comment)

jmcollin78 Feb 2, 2017

Maybe this will also help: #13940
I will try upgrading the Linux kernel to 3.19.

thiesschneider Mar 3, 2017

Did it solve your problem?

jmcollin78 Mar 4, 2017

I upgraded to kernel 4.9 and it solved my problem.
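
For anyone auditing a fleet for the kernel series discussed here, a minimal Python sketch can parse a release string such as the `uname -r` output shown earlier. The function names and the `reported` default are assumptions for illustration: the only series this thread reports as crashing is 3.16, and the only data point for a fix is this reporter's move to 4.9.

```python
import platform
import re


def kernel_series(release):
    """Parse a release string like '3.16.0-4-amd64' into (major, minor)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def is_reported_series(release, reported=((3, 16),)):
    """True if the release belongs to a series reported as crashing in this thread.

    The default only lists 3.16; it is not a statement about which
    kernels are actually affected.
    """
    return kernel_series(release) in reported


if __name__ == "__main__":
    current = platform.release()  # same string as `uname -r` on Linux
    print(current, is_reported_series(current))
```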

c0deright Jul 14, 2017

I had a similar kernel panic

IP: [<ffffffffc06a1a2b>] au_write_pre+0x8b/0x110 [aufs]
[..]
Call Trace:
 [<ffffffffc06a229c>] aufs_write_iter+0x4c/0x100 [aufs]
 [<ffffffffc06a2250>] ? aufs_splice_write+0x110/0x110 [aufs]
 [<ffffffff8125fa7a>] aio_run_iocb+0x26a/0x2d0
 [<ffffffff812f644c>] ? jbd2_complete_transaction+0x5c/0xa0
 [<ffffffff811b4ecd>] ? kzfree+0x2d/0x40
 [<ffffffff811ee2ba>] ? kfree+0x13a/0x150
 [<ffffffff8126088b>] ? do_io_submit+0x19b/0x500
 [<ffffffff8126094f>] do_io_submit+0x25f/0x500
 [<ffffffff81210fe0>] ? __fput+0x190/0x220
 [<ffffffff81260c00>] SyS_io_submit+0x10/0x20
 [<ffffffff81840b72>] entry_SYSCALL_64_fastpath+0x16/0x71

and

NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [mysqld:10353]

that I was able to reproduce on Ubuntu 16.04 LTS running percona-server-5.6 (mysql) dockerized when my.cnf setting innodb_flush_method = O_DIRECT was active.

The kernel panic and docker crashes went away the second I disabled the innodb_flush_method option.

Might be related?

cpuguy83 Jul 14, 2017

Contributor

Could be related, but I would make sure percona is not writing to aufs (or overlayfs).
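
One way to check this is to look up which mounted filesystem backs the data directory, as seen from inside the container. The following is a minimal Python sketch, assuming the usual `/proc/mounts` layout (device, mount point, filesystem type, options); the helper name `fstype_for_path` and the set of union filesystem names are assumptions for illustration, and mount points containing escaped spaces are not handled.

```python
# Filesystem type names treated as union/overlay storage (assumed list).
UNION_FS = {"aufs", "overlay", "overlayfs"}


def fstype_for_path(path, mounts_text):
    """Return (mount_point, fstype) for the longest mount point containing path.

    mounts_text is the content of /proc/mounts. Returns None if nothing matches.
    """
    best = None
    normalized = path.rstrip("/") + "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if normalized.startswith(prefix):
            # '>=' lets a later mount on the same mount point win.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fstype)
    return best


def writes_to_union_fs(path, mounts_text):
    """True if path is backed by one of the filesystems in UNION_FS."""
    found = fstype_for_path(path, mounts_text)
    return found is not None and found[1] in UNION_FS


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        mounts = handle.read()
    # /var/lib/mysql is the datadir path mentioned in this thread.
    print(fstype_for_path("/var/lib/mysql", mounts))
```

Run inside the container, a data directory on a volume would be expected to show the volume's own filesystem, while a directory in the container's writable layer would resolve to the root mount.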

c0deright Jul 14, 2017

Sorry, forgot to mention that I tried to test the concept outlined at https://about.zoosk.com/en/engineering-blog/test-databases-docker-containers/ with a 20GB dataset.

I intentionally modified the percona image as outlined in the article so that the mysql datadir was not moved out into a docker volume (/var/lib/mysql) but remained inside the container (/data).

aufs doesn't seem to play nicely with big data written with O_DIRECT.

h4llm3n referenced this issue Aug 2, 2017

Closed

mailcow freeze #485
