Kernel Oops #2684

Closed
mgmartin opened this issue Sep 10, 2014 · 4 comments

@mgmartin

I hit this kernel oops last night. I was running the latest code as of 2014-09-09 on a 3.16.2 kernel.

Sep 09 22:22:35 server kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
Sep 09 22:22:35 server kernel: IP: [<ffffffff8150f3c9>] __mutex_unlock_slowpath+0x29/0x40
Sep 09 22:22:35 server kernel: PGD 0 
Sep 09 22:22:35 server kernel: Oops: 0000 [#1] SMP 
Sep 09 22:22:35 server kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver snd_hda_codec_hdmi iTCO_wdt iTCO_vendor_support gpio_ich evdev mac_hid corete
Sep 09 22:22:35 server kernel:  serio zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) nvidia(PO) drm i2c_core
Sep 09 22:22:35 server kernel: CPU: 0 PID: 909 Comm: z_null_iss/0 Tainted: P           O  3.16.2-mgm #1
Sep 09 22:22:35 server kernel: Hardware name: Gigabyte Technology Co., Ltd. EX58-UD5/EX58-UD5, BIOS F13 01/10/2012
Sep 09 22:22:35 server kernel: task: ffff8805019b4670 ti: ffff8800d7a5c000 task.ti: ffff8800d7a5c000
Sep 09 22:22:35 server kernel: RIP: 0010:[<ffffffff8150f3c9>]  [<ffffffff8150f3c9>] __mutex_unlock_slowpath+0x29/0x40
Sep 09 22:22:35 server kernel: RSP: 0018:ffff8800d7a5fd50  EFLAGS: 00010217
Sep 09 22:22:35 server kernel: RAX: 0000000000000000 RBX: ffff8800d0ec3e20 RCX: 0000000000000000
Sep 09 22:22:35 server kernel: RDX: ffff8800d0ec3e28 RSI: 0000000000000246 RDI: ffff8800d0ec3e24
Sep 09 22:22:35 server kernel: RBP: ffff8800d7a5fd58 R08: ffff88051fc14400 R09: 0000000000000001
Sep 09 22:22:35 server kernel: R10: 0000000000015ab9 R11: 0000000000000010 R12: 0000000000000000
Sep 09 22:22:35 server kernel: R13: ffff8805019b4670 R14: ffff8804f649f800 R15: ffff8800d0ec3e20
Sep 09 22:22:35 server kernel: FS:  0000000000000000(0000) GS:ffff88051fc00000(0000) knlGS:0000000000000000
Sep 09 22:22:35 server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Sep 09 22:22:35 server kernel: CR2: 0000000000000010 CR3: 0000000001811000 CR4: 00000000000007f0
Sep 09 22:22:35 server kernel: Stack:
Sep 09 22:22:35 server kernel:  ffff8800d0ec3b10 ffff8800d7a5fd68 ffffffff8150f3fb ffff8800d7a5fdc8
Sep 09 22:22:35 server kernel:  ffffffffa0d0cd69 ffffffffa0d0c526 ffff8805019b4670 ffff8800d0ec3e20
Sep 09 22:22:35 server kernel:  00000000019b4670 0000000000000000 0000000000200000 ffff8800d0ec3b10
Sep 09 22:22:35 server kernel: Call Trace:
Sep 09 22:22:35 server kernel:  [<ffffffff8150f3fb>] mutex_unlock+0x1b/0x20
Sep 09 22:22:35 server kernel:  [<ffffffffa0d0cd69>] zio_done+0x5a9/0xdc0 [zfs]
Sep 09 22:22:35 server kernel:  [<ffffffffa0d0c526>] ? zio_ready+0x176/0x410 [zfs]
Sep 09 22:22:35 server kernel:  [<ffffffffa0d08e55>] zio_execute+0xb5/0x150 [zfs]
Sep 09 22:22:35 server kernel:  [<ffffffffa0bb0437>] taskq_thread+0x267/0x510 [spl]
Sep 09 22:22:35 server kernel:  [<ffffffff8109eb80>] ? wake_up_process+0x50/0x50
Sep 09 22:22:35 server kernel:  [<ffffffffa0bb01d0>] ? taskq_cancel_id+0x200/0x200 [spl]
Sep 09 22:22:35 server kernel:  [<ffffffff8108e38a>] kthread+0xea/0x100
Sep 09 22:22:35 server kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0
Sep 09 22:22:35 server kernel:  [<ffffffff815112fc>] ret_from_fork+0x7c/0xb0
Sep 09 22:22:35 server kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0
Sep 09 22:22:35 server kernel: Code: 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 fb c7 07 01 00 00 00 48 8d 7f 04 e8 c4 17 00 00 48 8b 43 08 48 8d 53 08 48 39 d0 74
Sep 09 22:22:35 server kernel: RIP  [<ffffffff8150f3c9>] __mutex_unlock_slowpath+0x29/0x40
Sep 09 22:22:35 server kernel:  RSP <ffff8800d7a5fd50>
Sep 09 22:22:35 server kernel: CR2: 0000000000000010
Sep 09 22:22:35 server kernel: ---[ end trace 3c47c5859faee18a ]---
@behlendorf
Contributor

There's a good chance this is related to #2523.

@mgmartin
Author

Thanks. I've installed your commit 59e9418, so I'll run with that in place for now and watch it.

behlendorf added the Bug label Sep 11, 2014
behlendorf added this to the 0.6.4 milestone Sep 11, 2014
@edillmann
Contributor

I was hit by the same bug while taking a snapshot:

http://pastebin.com/mbzrf785

behlendorf pushed a commit to openzfs/spl that referenced this issue Dec 19, 2014
It is known that mutexes in Linux are not safe when used to
synchronize the freeing of the object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race
condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing
  between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on
mutex_exit() with a spin lock, making the mutex safe at the cost of
a little performance and some memory overhead.

This issue most commonly manifests itself as a deadlock in the zio
pipeline caused by a process spinning on the damaged mutex.  Similar
deadlocks have been reported for the dbuf->db_mtx mutex.  And it can
also cause a NULL dereference or bad paging request under the right
circumstances.

This issue and many like it are linked off the openzfs/zfs#2523
issue.  Specifically this fix resolves at least the following
outstanding issues:

openzfs/zfs#401
openzfs/zfs#2523
openzfs/zfs#2679
openzfs/zfs#2684
openzfs/zfs#2704
openzfs/zfs#2708
openzfs/zfs#2517
openzfs/zfs#2827
openzfs/zfs#2850
openzfs/zfs#2891
openzfs/zfs#2897
openzfs/zfs#2247
openzfs/zfs#2939

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #421
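
The pattern the commit message describes, a lock embedded in the very object that one of its users frees right after unlocking, can be sketched in user space roughly as follows. This is only an illustration of the shape of the hand-off between a waiter and a completer: `demo_io_t`, `demo_io_done()`, `demo_io_wait()` and `demo_run()` are hypothetical names, not ZFS code, and pthread mutexes are used here as a stand-in for the kernel mutex. It does not reproduce the kernel race; it only shows where the hazard would sit.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a zio: the lock lives inside the object it guards. */
typedef struct demo_io {
    pthread_mutex_t io_lock;
    pthread_cond_t  io_cv;
    bool            io_done;
    int             io_error;
} demo_io_t;

/* Completion side, loosely modelled on the role the commit assigns to zio_done(). */
static void *demo_io_done(void *arg)
{
    demo_io_t *io = arg;

    pthread_mutex_lock(&io->io_lock);
    io->io_done = true;
    pthread_cond_broadcast(&io->io_cv);
    pthread_mutex_unlock(&io->io_lock);
    /*
     * Hazard point: once the lock is released the waiter may take it,
     * drop it, and free `io`. If the unlock above were still touching
     * the lock at that moment, it would be touching freed memory.
     */
    return NULL;
}

/* Waiting side, loosely modelled on zio_wait(): wait for completion, then free. */
static int demo_io_wait(demo_io_t *io)
{
    int error;

    pthread_mutex_lock(&io->io_lock);
    while (!io->io_done)
        pthread_cond_wait(&io->io_cv, &io->io_lock);
    error = io->io_error;
    pthread_mutex_unlock(&io->io_lock);

    /* The object, including its embedded lock, goes away here. */
    pthread_cond_destroy(&io->io_cv);
    pthread_mutex_destroy(&io->io_lock);
    free(io);
    return error;
}

/* Runs one hand-off; returns the I/O error code (0 here) or -1 on setup failure. */
int demo_run(void)
{
    pthread_t completer;
    demo_io_t *io = calloc(1, sizeof(*io));

    if (io == NULL)
        return -1;
    pthread_mutex_init(&io->io_lock, NULL);
    pthread_cond_init(&io->io_cv, NULL);

    if (pthread_create(&completer, NULL, demo_io_done, io) != 0) {
        pthread_cond_destroy(&io->io_cv);
        pthread_mutex_destroy(&io->io_lock);
        free(io);
        return -1;
    }

    int error = demo_io_wait(io);
    pthread_join(completer, NULL);
    return error;
}
```

Calling `demo_run()` from a `main()` exercises one complete hand-off. The comment in `demo_io_done()` marks the window the LWN article is concerned with.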
@behlendorf
Contributor

This issue, which is a duplicate of #2523, was resolved by the following commit. Full details can be found in the commit message and the related LWN article.

openzfs/spl@a3c1eb7 mutex: force serialization on mutex_exit() to fix races
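
For readers who want a feel for what "force serialization on mutex_exit()" means, here is a minimal user-space sketch under stated assumptions. It is not the SPL implementation: `sketch_mutex_t`, `sketch_mutex_enter()`, `sketch_mutex_exit()` and `sketch_run()` are hypothetical names, a pthread mutex stands in for the kernel mutex, and a C11 `atomic_flag` stands in for the kernel spin lock. The idea shown is that every exit holds a small spin lock across the unlock, so a later lock holder cannot finish its own exit (and go on to free the object) until the earlier exit has completely finished touching the mutex.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical wrapper type; not the SPL kmutex_t definition. */
typedef struct sketch_mutex {
    pthread_mutex_t m_mutex; /* the sleeping lock callers contend on */
    atomic_flag     m_lock;  /* minimal spin lock that serializes exits */
} sketch_mutex_t;

static void sketch_mutex_init(sketch_mutex_t *mp)
{
    pthread_mutex_init(&mp->m_mutex, NULL);
    atomic_flag_clear(&mp->m_lock);
}

static void sketch_mutex_enter(sketch_mutex_t *mp)
{
    pthread_mutex_lock(&mp->m_mutex);
}

static void sketch_mutex_exit(sketch_mutex_t *mp)
{
    /* Spin until no other thread is still inside its own exit. */
    while (atomic_flag_test_and_set_explicit(&mp->m_lock, memory_order_acquire))
        ;
    pthread_mutex_unlock(&mp->m_mutex);
    /* Last access to the object by this thread: release the exit lock. */
    atomic_flag_clear_explicit(&mp->m_lock, memory_order_release);
}

#define SKETCH_THREADS 4
#define SKETCH_ITERS   10000

typedef struct sketch_counter {
    sketch_mutex_t lock;
    long           value;
} sketch_counter_t;

static void *sketch_worker(void *arg)
{
    sketch_counter_t *c = arg;

    for (int i = 0; i < SKETCH_ITERS; i++) {
        sketch_mutex_enter(&c->lock);
        c->value++;
        sketch_mutex_exit(&c->lock);
    }
    return NULL;
}

/* Returns the final counter value, or -1 if a thread could not be started. */
long sketch_run(void)
{
    sketch_counter_t c = { .value = 0 };
    pthread_t tids[SKETCH_THREADS];
    int started = 0;

    sketch_mutex_init(&c.lock);
    while (started < SKETCH_THREADS &&
           pthread_create(&tids[started], NULL, sketch_worker, &c) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    pthread_mutex_destroy(&c.lock.m_mutex);

    return started == SKETCH_THREADS ? c.value : -1;
}
```

`sketch_run()` only checks that the wrapper still provides mutual exclusion. The trade-off matches the one the commit message names: an extra word of memory per mutex and an extra atomic operation pair on every exit.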

dajhorn added a commit to zfsonlinux/pkg-spl that referenced this issue Dec 20, 2014
Commit: openzfs/zfs@a3c1eb7
From: Chunwei Chen <tuxoko@gmail.com>
Date: Fri, 19 Dec 2014 11:31:59 +0800
Subject: mutex: force serialization on mutex_exit() to fix races

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Backported-by: Darik Horn <dajhorn@vanadac.com>
Closes #421

Conflicts:
        include/sys/mutex.h
behlendorf pushed a commit to openzfs/spl that referenced this issue Dec 23, 2014
ryao pushed a commit to ryao/spl that referenced this issue Feb 19, 2015
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes openzfs#421

Conflicts:
	include/sys/mutex.h