Kernel Oops #2684

mgmartin · 2014-09-10T13:51:32Z

I hit this kernel oops last night. This was running latest as of 2014-09-09 on a 3.16.2 kernel.

Sep 09 22:22:35 server kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
Sep 09 22:22:35 server kernel: IP: [<ffffffff8150f3c9>] __mutex_unlock_slowpath+0x29/0x40
Sep 09 22:22:35 server kernel: PGD 0 
Sep 09 22:22:35 server kernel: Oops: 0000 [#1] SMP 
Sep 09 22:22:35 server kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver snd_hda_codec_hdmi iTCO_wdt iTCO_vendor_support gpio_ich evdev mac_hid corete
Sep 09 22:22:35 server kernel:  serio zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) nvidia(PO) drm i2c_core
Sep 09 22:22:35 server kernel: CPU: 0 PID: 909 Comm: z_null_iss/0 Tainted: P           O  3.16.2-mgm #1
Sep 09 22:22:35 server kernel: Hardware name: Gigabyte Technology Co., Ltd. EX58-UD5/EX58-UD5, BIOS F13 01/10/2012
Sep 09 22:22:35 server kernel: task: ffff8805019b4670 ti: ffff8800d7a5c000 task.ti: ffff8800d7a5c000
Sep 09 22:22:35 server kernel: RIP: 0010:[<ffffffff8150f3c9>]  [<ffffffff8150f3c9>] __mutex_unlock_slowpath+0x29/0x40
Sep 09 22:22:35 server kernel: RSP: 0018:ffff8800d7a5fd50  EFLAGS: 00010217
Sep 09 22:22:35 server kernel: RAX: 0000000000000000 RBX: ffff8800d0ec3e20 RCX: 0000000000000000
Sep 09 22:22:35 server kernel: RDX: ffff8800d0ec3e28 RSI: 0000000000000246 RDI: ffff8800d0ec3e24
Sep 09 22:22:35 server kernel: RBP: ffff8800d7a5fd58 R08: ffff88051fc14400 R09: 0000000000000001
Sep 09 22:22:35 server kernel: R10: 0000000000015ab9 R11: 0000000000000010 R12: 0000000000000000
Sep 09 22:22:35 server kernel: R13: ffff8805019b4670 R14: ffff8804f649f800 R15: ffff8800d0ec3e20
Sep 09 22:22:35 server kernel: FS:  0000000000000000(0000) GS:ffff88051fc00000(0000) knlGS:0000000000000000
Sep 09 22:22:35 server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Sep 09 22:22:35 server kernel: CR2: 0000000000000010 CR3: 0000000001811000 CR4: 00000000000007f0
Sep 09 22:22:35 server kernel: Stack:
Sep 09 22:22:35 server kernel:  ffff8800d0ec3b10 ffff8800d7a5fd68 ffffffff8150f3fb ffff8800d7a5fdc8
Sep 09 22:22:35 server kernel:  ffffffffa0d0cd69 ffffffffa0d0c526 ffff8805019b4670 ffff8800d0ec3e20
Sep 09 22:22:35 server kernel:  00000000019b4670 0000000000000000 0000000000200000 ffff8800d0ec3b10
Sep 09 22:22:35 server kernel: Call Trace:
Sep 09 22:22:35 server kernel:  [<ffffffff8150f3fb>] mutex_unlock+0x1b/0x20
Sep 09 22:22:35 server kernel:  [<ffffffffa0d0cd69>] zio_done+0x5a9/0xdc0 [zfs]
Sep 09 22:22:35 server kernel:  [<ffffffffa0d0c526>] ? zio_ready+0x176/0x410 [zfs]
Sep 09 22:22:35 server kernel:  [<ffffffffa0d08e55>] zio_execute+0xb5/0x150 [zfs]
Sep 09 22:22:35 server kernel:  [<ffffffffa0bb0437>] taskq_thread+0x267/0x510 [spl]
Sep 09 22:22:35 server kernel:  [<ffffffff8109eb80>] ? wake_up_process+0x50/0x50
Sep 09 22:22:35 server kernel:  [<ffffffffa0bb01d0>] ? taskq_cancel_id+0x200/0x200 [spl]
Sep 09 22:22:35 server kernel:  [<ffffffff8108e38a>] kthread+0xea/0x100
Sep 09 22:22:35 server kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0
Sep 09 22:22:35 server kernel:  [<ffffffff815112fc>] ret_from_fork+0x7c/0xb0
Sep 09 22:22:35 server kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0
Sep 09 22:22:35 server kernel: Code: 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 fb c7 07 01 00 00 00 48 8d 7f 04 e8 c4 17 00 00 48 8b 43 08 48 8d 53 08 48 39 d0 74
Sep 09 22:22:35 server kernel: RIP  [<ffffffff8150f3c9>] __mutex_unlock_slowpath+0x29/0x40
Sep 09 22:22:35 server kernel:  RSP <ffff8800d7a5fd50>
Sep 09 22:22:35 server kernel: CR2: 0000000000000010
Sep 09 22:22:35 server kernel: ---[ end trace 3c47c5859faee18a ]---


behlendorf · 2014-09-10T15:21:39Z

There's a good chance this is related to #2523.

mgmartin · 2014-09-11T00:24:33Z

Thanks. I've installed your commit 59e9418, so I'll run with that in place for now and watch it.

edillmann · 2014-10-18T17:26:29Z

I was hit by the same bug while taking a snapshot:

http://pastebin.com/mbzrf785


behlendorf · 2014-12-19T19:00:41Z

This issue, which is a duplicate of #2523, was resolved by the following commit. Full details can be found in the commit message and the related LWN article.

openzfs/spl@a3c1eb7 mutex: force serialization on mutex_exit() to fix races

Commit: openzfs/zfs@a3c1eb7
From: Chunwei Chen <tuxoko@gmail.com>
Date: Fri, 19 Dec 2014 11:31:59 +0800
Subject: mutex: force serialization on mutex_exit() to fix races

It is known that mutexes in Linux are not safe when using them to synchronize the freeing of the object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on mutex_exit() with a spin lock, making the mutex safe by sacrificing a bit of performance and memory overhead.

This issue most commonly manifests itself as a deadlock in the zio pipeline caused by a process spinning on the damaged mutex. Similar deadlocks have been reported for the dbuf->db_mtx mutex. And it can also cause a NULL dereference or bad paging request under the right circumstances.

This issue and many like it are linked off the openzfs/zfs#2523 issue. Specifically this fix resolves at least the following outstanding issues:

openzfs/zfs#401
openzfs/zfs#2523
openzfs/zfs#2679
openzfs/zfs#2684
openzfs/zfs#2704
openzfs/zfs#2708
openzfs/zfs#2517
openzfs/zfs#2827
openzfs/zfs#2850
openzfs/zfs#2891
openzfs/zfs#2897
openzfs/zfs#2247
openzfs/zfs#2939

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Backported-by: Darik Horn <dajhorn@vanadac.com>
Closes #421

Conflicts:
include/sys/mutex.h
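The serialization idea described in the commit message can be sketched in userspace C. This is only an illustration of the pattern, not the SPL code: smutex_t, fake_zio_t, completer(), wait_and_free() and run_demo() are hypothetical names, and pthread mutexes stand in for kernel mutexes even though they do not necessarily share the same hazard. The point is the shape of smutex_exit(): the unlock is wrapped in a spin lock, so the thread that will free the object cannot get through its own exit until the other thread has completely finished touching the mutex.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical wrapper: a sleeping mutex plus a spin lock that
 * serializes every caller of smutex_exit(). */
typedef struct {
    pthread_mutex_t    m_mutex; /* the lock callers actually contend on */
    pthread_spinlock_t m_lock;  /* held only for the duration of unlock */
} smutex_t;

static void smutex_init(smutex_t *mp)
{
    pthread_mutex_init(&mp->m_mutex, NULL);
    pthread_spin_init(&mp->m_lock, PTHREAD_PROCESS_PRIVATE);
}

static void smutex_destroy(smutex_t *mp)
{
    pthread_spin_destroy(&mp->m_lock);
    pthread_mutex_destroy(&mp->m_mutex);
}

static void smutex_enter(smutex_t *mp)
{
    pthread_mutex_lock(&mp->m_mutex);
}

static void smutex_exit(smutex_t *mp)
{
    /* A second thread that acquires m_mutex the instant it is released
     * still has to wait here in its own exit until we are done. */
    pthread_spin_lock(&mp->m_lock);
    pthread_mutex_unlock(&mp->m_mutex);
    pthread_spin_unlock(&mp->m_lock);
}

/* Stand-in for a zio: the lock is embedded in the object it guards. */
typedef struct {
    smutex_t       io_lock;
    pthread_cond_t io_cv;
    int            io_done;
} fake_zio_t;

/* Plays the zio_done() role: mark completion and wake the waiter. */
static void *completer(void *arg)
{
    fake_zio_t *zio = arg;

    smutex_enter(&zio->io_lock);
    zio->io_done = 1;
    pthread_cond_broadcast(&zio->io_cv);
    smutex_exit(&zio->io_lock);
    return NULL; /* zio must not be touched after this point */
}

/* Plays the zio_wait() role: wait for completion, then free the object. */
static int wait_and_free(fake_zio_t *zio)
{
    int done;

    smutex_enter(&zio->io_lock);
    while (!zio->io_done)
        pthread_cond_wait(&zio->io_cv, &zio->io_lock.m_mutex);
    assert(zio->io_done == 1);
    done = zio->io_done;
    smutex_exit(&zio->io_lock); /* blocks until completer's exit is over */

    pthread_cond_destroy(&zio->io_cv);
    smutex_destroy(&zio->io_lock);
    free(zio);
    return done;
}

/* Returns 1 on success, -1 if setup failed. */
int run_demo(void)
{
    pthread_t   tid;
    int         done;
    fake_zio_t *zio = malloc(sizeof(*zio));

    if (zio == NULL)
        return -1;
    smutex_init(&zio->io_lock);
    pthread_cond_init(&zio->io_cv, NULL);
    zio->io_done = 0;

    if (pthread_create(&tid, NULL, completer, zio) != 0) {
        pthread_cond_destroy(&zio->io_cv);
        smutex_destroy(&zio->io_lock);
        free(zio);
        return -1;
    }
    done = wait_and_free(zio);
    pthread_join(tid, NULL);
    return done;
}
```

Calling run_demo() from a main() and building with -pthread exercises one complete wait/complete/free cycle. The cost matches what the commit message states: one extra spin lock per mutex and one extra lock/unlock pair on every exit.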


behlendorf added the Bug label Sep 11, 2014

behlendorf added this to the 0.6.4 milestone Sep 11, 2014

behlendorf added Bug - Major and removed Bug labels Oct 19, 2014

behlendorf closed this as completed Dec 19, 2014
