
Illumos #5244 zio pipeline callers should explicitly invoke next stage #2828

Closed
wants to merge 1 commit

Conversation

@ryao (Contributor) commented Oct 24, 2014

References:
https://www.illumos.org/projects/illumos-gate//issues/5244
https://reviews.csiden.org/r/119/diff/#

Porting Notes:

  1. The unported "2932 support crash dumps to raidz, etc. pools" caused a
    merge conflict due to a copyright difference in module/zfs/vdev_raidz.c.
  2. The unported "4128 disks in zpools never go away when pulled" and
    additional Linux-specific changes caused merge conflicts in
    module/zfs/vdev_disk.c.
  3. Our changes to use the TQ_PUSHPAGE extension and vdev_file_taskq
    caused merge conflicts in module/zfs/vdev_file.c. The taskq import that
    I plan to send in the future would have prevented this particular
    conflict.

Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: Adam Leventhal ahl@delphix.com
Reviewed by: Alex Reece alex.reece@delphix.com
Reviewed by: Christopher Siden christopher.siden@delphix.com
Ported-by: Richard Yao richard.yao@clusterhq.com
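
For readers following along, here is a minimal sketch of the control-flow change being ported. Apart from zio_interrupt(), SET_ERROR(), and the ZIO_PIPELINE_* return codes, every name below (the stage functions and helpers) is a hypothetical stand-in for the real code in module/zfs/zio.c and the vdev backends, not the actual patch:

    /* Before: the stage's return value told the pipeline dispatcher
     * whether to advance (ZIO_PIPELINE_CONTINUE) or wait for an
     * asynchronous completion (ZIO_PIPELINE_STOP). */
    static int
    vdev_example_io_start_old(zio_t *zio)
    {
            submit_io(zio);                 /* hypothetical I/O submission */
            return (ZIO_PIPELINE_STOP);
    }

    /* After: io_start stages return void, and whichever code path
     * finishes the I/O explicitly invokes the next stage itself. */
    static void
    vdev_example_io_start_new(zio_t *zio)
    {
            if (!vdev_ready(zio->io_vd)) {  /* hypothetical readiness check */
                    zio->io_error = SET_ERROR(ENXIO);
                    zio_interrupt(zio);     /* explicitly advance the pipeline */
                    return;
            }
            submit_io(zio);  /* its completion handler calls zio_interrupt() */
    }

The point of the change is that the pipeline no longer advances implicitly on a stage's behalf; the code that knows the I/O is finished is the code that kicks the zio into its next stage.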

@adilger (Contributor) commented Dec 8, 2014

Is there something specific holding this patch back from landing? I see that the upstream Illumos patch that @ryao was referencing has been merged. This is causing a fair number of testing failures with Lustre at OST unmount time (https://jira.hpdd.intel.com/browse/LU-5242) so it would be great to get this fixed. This is one of the last issues blocking our ability to enforce review-zfs testing for every Lustre patch submission.

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Dec 13, 2014

Illumos 5244 - zio pipeline callers should explicitly invoke next stage openzfs#2828
@behlendorf (Contributor)

@adilger have you verified that this patch resolves your issue? I know there was some speculation that this would resolve #2523 but frankly that was just a (wrong) hunch. And in fact the real root cause of #2523 has been identified and a patch proposed.

That said, the only thing holding up this patch is getting it reviewed and then merged. It's on my short list but I've been heads down on the kmem rework for the last week or so.

@behlendorf added this to the 0.6.4 milestone Dec 19, 2014
@behlendorf modified the milestones: 0.6.5, 0.6.4 Feb 6, 2015
@behlendorf (Contributor)

Bumping to 0.6.5; this isn't critical, and delaying it will allow us to pull in the Illumos dependencies.

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Mar 12, 2015

Illumos 5244 - zio pipeline callers should explicitly invoke next stage openzfs#2828

* files that needed updating:
- module/zfs/vdev_disk.c
- module/zfs/vdev_file.c
kernelOfTruth added a commit to kernelOfTruth/zfs that referenced this pull request Mar 13, 2015
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Mar 13, 2015

Illumos 5244 - zio pipeline callers should explicitly invoke next stage openzfs#2828

* files that needed updating:
- module/zfs/vdev_disk.c
- module/zfs/vdev_file.c

(2nd attempt,
~kernelOfTruth)
kernelOfTruth added a commit to kernelOfTruth/zfs that referenced this pull request Mar 13, 2015
Revert "Illumos 5244 - zio pipeline callers should explicitly invoke next stage openzfs#2828"

This reverts commit 9291279.

conflicts with:

commit 92119cc
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Sun Jul 13 14:35:19 2014 -0400

    Mark IO pipeline with PF_FSTRANS

    In order to avoid deadlocking in the IO pipeline it is critical that
    pageout be avoided during direct memory reclaim.  This ensures that
    the pipeline threads can always make forward progress and never end
    up blocking on a DMU transaction.  For this very reason Linux now
    provides the PF_FSTRANS flag which may be set in the process context.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Not sure where that commit came from; it's not in master (?!), but commits
were always fetched from master /sadpanda
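
For context on the conflicting commit quoted above, here is a minimal sketch of the PF_FSTRANS mechanism it describes, assuming direct manipulation of the Linux task flag as kernel filesystems of that era did; the function name and stage body are hypothetical, not the actual patch:

    static void
    example_pipeline_work(void *arg)
    {
            /* Remember whether PF_FSTRANS was already set for this task. */
            unsigned long pflags = current->flags & PF_FSTRANS;

            /* Mark the thread as being inside a filesystem transaction so
             * direct reclaim triggered by allocations below skips pageout
             * and the thread can never block on a DMU transaction. */
            current->flags |= PF_FSTRANS;

            do_stage(arg);                  /* hypothetical stage body */

            /* Restore the caller's original flag state. */
            current->flags = (current->flags & ~PF_FSTRANS) | pflags;
    }
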
@eolson78

I recently ran into an issue that looks very similar to the issues in this pull request and the referenced issues. I am currently running CentOS 6.5 with kernel 3.14.4 and ZFS 0.6.3.1, plus some additional pull requests recommended by @ryao, including the mutex serialization fix to SPL authored by @tuxoko. My system is an AWS HVM instance with 15 vCPUs and 30GB of RAM, a single 1TB device, and a single zpool serving a heavy NFS workload. We were running a 320GB L2ARC made up of 2x 160GB ephemeral storage devices, and we were also running ZFS send and receive for replication to another machine. The initial complaint was of NFS clients being disconnected after TXG sync had gotten stuck. The dmesg output looks like this:

INFO: task nfsd:2777 blocked for more than 120 seconds.
      Tainted: P           O 3.14.4 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd            D ffffffff8181d000     0  2777      2 0x00000000
 ffff88075b0d9c48 0000000000000046 ffff88075b0d9c04 ffff88075b0d8010
 0000000000014440 0000000000014440 ffff88075c424210 ffff880768818fb0
 ffff88075b0d9d38 ffffffffa069d0c4 ffffffffa069d0c0 00000000ffffffff
Call Trace:
 [<ffffffff816442f9>] schedule+0x29/0x70
 [<ffffffff8164465e>] schedule_preempt_disabled+0xe/0x10
 [<ffffffff81646024>] __mutex_lock_slowpath+0x1a4/0x240
 [<ffffffff816460e3>] mutex_lock+0x23/0x40
 [<ffffffffa0689705>] nfs4_lock_state+0x15/0x20 [nfsd]
 [<ffffffffa068b559>] nfsd4_renew+0x49/0xb0 [nfsd]
 [<ffffffffa067fc71>] nfsd4_proc_compound+0x3f1/0x580 [nfsd]
 [<ffffffffa066d258>] nfsd_dispatch+0xa8/0x1b0 [nfsd]
 [<ffffffffa062297c>] svc_process_common+0x30c/0x530 [sunrpc]
 [<ffffffffa066d930>] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
 [<ffffffffa0622ef2>] svc_process+0x112/0x160 [sunrpc]
 [<ffffffffa066d9ef>] nfsd+0xbf/0x130 [nfsd]
 [<ffffffffa066d930>] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
 [<ffffffff81098dee>] kthread+0xce/0xf0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff8165213c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
INFO: task nfsd:2778 blocked for more than 120 seconds.
      Tainted: P           O 3.14.4 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd            D ffffffff8181d000     0  2778      2 0x00000000
 ffff88075b1f9c48 0000000000000046 ffff88075b1f9c04 ffff88075b1f8010
 0000000000014440 0000000000014440 ffff88075b1f6ff0 ffff8807688101d0
 ffff88075b1f9d38 ffffffffa069d0c4 ffffffffa069d0c0 00000000ffffffff
Call Trace:
 [<ffffffff816442f9>] schedule+0x29/0x70
 [<ffffffff8164465e>] schedule_preempt_disabled+0xe/0x10
 [<ffffffff81646024>] __mutex_lock_slowpath+0x1a4/0x240
 [<ffffffff816460e3>] mutex_lock+0x23/0x40
 [<ffffffffa0689705>] nfs4_lock_state+0x15/0x20 [nfsd]
 [<ffffffffa068b559>] nfsd4_renew+0x49/0xb0 [nfsd]
 [<ffffffffa067fc71>] nfsd4_proc_compound+0x3f1/0x580 [nfsd]
 [<ffffffffa066d258>] nfsd_dispatch+0xa8/0x1b0 [nfsd]
 [<ffffffffa062297c>] svc_process_common+0x30c/0x530 [sunrpc]
 [<ffffffffa066d930>] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
 [<ffffffffa0622ef2>] svc_process+0x112/0x160 [sunrpc]
 [<ffffffffa066d9ef>] nfsd+0xbf/0x130 [nfsd]
 [<ffffffffa066d930>] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
 [<ffffffff81098dee>] kthread+0xce/0xf0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff8165213c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
INFO: task nfsd:2779 blocked for more than 120 seconds.
      Tainted: P           O 3.14.4 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd            D ffffffff8181d000     0  2779      2 0x00000000
 ffff88075b2f5c28 0000000000000046 bb28ddfd00000006 ffff88075b2f4010
 0000000000014440 0000000000014440 ffff88075b1f6250 ffff8807688101d0
 ffff88075b2f5c48 ffffffffa069d0c4 ffffffffa069d0c0 00000000ffffffff
Call Trace:
 [<ffffffff816442f9>] schedule+0x29/0x70
 [<ffffffff8164465e>] schedule_preempt_disabled+0xe/0x10
 [<ffffffff81646024>] __mutex_lock_slowpath+0x1a4/0x240
 [<ffffffffa0677047>] ? nfsd_setuser+0x117/0x2b0 [nfsd]
 [<ffffffff816460e3>] mutex_lock+0x23/0x40
 [<ffffffffa0689705>] nfs4_lock_state+0x15/0x20 [nfsd]
 [<ffffffffa068c557>] nfsd4_close+0x67/0x220 [nfsd]
 [<ffffffffa06801da>] ? nfsd4_encode_operation+0xda/0xf0 [nfsd]
 [<ffffffffa067fc71>] nfsd4_proc_compound+0x3f1/0x580 [nfsd]
 [<ffffffffa066d258>] nfsd_dispatch+0xa8/0x1b0 [nfsd]
 [<ffffffffa062297c>] svc_process_common+0x30c/0x530 [sunrpc]
 [<ffffffffa066d930>] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
 [<ffffffffa0622ef2>] svc_process+0x112/0x160 [sunrpc]
 [<ffffffffa066d9ef>] nfsd+0xbf/0x130 [nfsd]
 [<ffffffffa066d930>] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
 [<ffffffff81098dee>] kthread+0xce/0xf0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff8165213c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
INFO: task nfsd:2780 blocked for more than 120 seconds.
      Tainted: P           O 3.14.4 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd            D ffffffff8181d000     0  2780      2 0x00000000
 ffff88075b3f5c48 0000000000000046 ffff88075b3f5c04 ffff88075b3f4010
 0000000000014440 0000000000014440 ffff88075b3f3030 ffff880768818fb0
 ffff88075b3f5d38 ffffffffa069d0c4 ffffffffa069d0c0 00000000ffffffff
Call Trace:
 [<ffffffff816442f9>] schedule+0x29/0x70
 [<ffffffff8164465e>] schedule_preempt_disabled+0xe/0x10
 [<ffffffff81646024>] __mutex_lock_slowpath+0x1a4/0x240
 [<ffffffffa062ca4a>] ? sunrpc_cache_lookup+0x7a/0x2c0 [sunrpc]
 [<ffffffff816460e3>] mutex_lock+0x23/0x40
 [<ffffffffa0689705>] nfs4_lock_state+0x15/0x20 [nfsd]
 [<ffffffffa068b559>] nfsd4_renew+0x49/0xb0 [nfsd]
 [<ffffffffa067fc71>] nfsd4_proc_compound+0x3f1/0x580 [nfsd]
 [<ffffffffa066d258>] nfsd_dispatch+0xa8/0x1b0 [nfsd]
 [<ffffffffa062297c>] svc_process_common+0x30c/0x530 [sunrpc]
 [<ffffffffa066d930>] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
 [<ffffffffa0622ef2>] svc_process+0x112/0x160 [sunrpc]
 [<ffffffffa066d9ef>] nfsd+0xbf/0x130 [nfsd]
 [<ffffffffa066d930>] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
 [<ffffffff81098dee>] kthread+0xce/0xf0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff8165213c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
INFO: task nfsd:2782 blocked for more than 120 seconds.
      Tainted: P           O 3.14.4 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd            D ffffffff8181d000     0  2782      2 0x00000000
 ffff8800ea6ada48 0000000000000046 ffff8800ea6ab070 ffff8800ea6ac010
 0000000000014440 0000000000014440 ffff8800ea6ab070 ffff880768800150
 ffff8800ea6ada58 ffff88075c1ee2b8 ffff88075c1ee208 ffff8800ea6ada88
Call Trace:
 [<ffffffff816442f9>] schedule+0x29/0x70
 [<ffffffffa01983f5>] cv_wait_common+0x135/0x1c0 [spl]
 [<ffffffff810bc060>] ? bit_waitqueue+0xe0/0xe0
 [<ffffffffa01984d5>] __cv_wait+0x15/0x20 [spl]
 [<ffffffffa02db033>] txg_wait_synced+0xb3/0x1a0 [zfs]
 [<ffffffffa0321008>] zil_commit_writer+0x138/0x240 [zfs]
 [<ffffffffa03211d0>] zil_commit+0xc0/0x100 [zfs]
 [<ffffffffa0314daf>] zfs_fsync+0x8f/0x110 [zfs]
 [<ffffffffa032c47d>] zpl_commit_metadata+0x4d/0x70 [zfs]
 [<ffffffffa0671949>] commit_metadata+0x39/0x50 [nfsd]
 [<ffffffffa067445f>] do_nfsd_create+0x3ff/0x580 [nfsd]
 [<ffffffffa067ea51>] do_open_lookup+0xc1/0x2a0 [nfsd]
 [<ffffffffa067f7a7>] nfsd4_open+0x397/0x470 [nfsd]
 [<ffffffffa06801da>] ? nfsd4_encode_operation+0xda/0xf0 [nfsd]
 [<ffffffffa067fc71>] nfsd4_proc_compound+0x3f1/0x580 [nfsd]
 [<ffffffffa066d258>] nfsd_dispatch+0xa8/0x1b0 [nfsd]
 [<ffffffffa062297c>] svc_process_common+0x30c/0x530 [sunrpc]
 [<ffffffffa066d930>] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
 [<ffffffffa0622ef2>] svc_process+0x112/0x160 [sunrpc]
 [<ffffffffa066d9ef>] nfsd+0xbf/0x130 [nfsd]
 [<ffffffffa066d930>] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
 [<ffffffff81098dee>] kthread+0xce/0xf0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff8165213c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
INFO: task nfsd:2784 blocked for more than 120 seconds.
      Tainted: P           O 3.14.4 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd            D ffffffff8181d000     0  2784      2 0x00000000
 ffff88075a9f1c28 0000000000000046 971f9a1600000006 ffff88075a9f0010
 0000000000014440 0000000000014440 ffff88075a9ef0b0 ffff8807688101d0
 ffff880025148350 ffffffffa069d0c4 ffffffffa069d0c0 00000000ffffffff
Call Trace:
 [<ffffffff816442f9>] schedule+0x29/0x70
 [<ffffffff8164465e>] schedule_preempt_disabled+0xe/0x10
 [<ffffffff81646024>] __mutex_lock_slowpath+0x1a4/0x240
 [<ffffffffa0677047>] ? nfsd_setuser+0x117/0x2b0 [nfsd]
 [<ffffffff816460e3>] mutex_lock+0x23/0x40
 [<ffffffffa0689705>] nfs4_lock_state+0x15/0x20 [nfsd]
 [<ffffffffa067f4ef>] nfsd4_open+0xdf/0x470 [nfsd]
 [<ffffffffa06801da>] ? nfsd4_encode_operation+0xda/0xf0 [nfsd]
 [<ffffffffa067fc71>] nfsd4_proc_compound+0x3f1/0x580 [nfsd]
 [<ffffffffa066d258>] nfsd_dispatch+0xa8/0x1b0 [nfsd]
 [<ffffffffa062297c>] svc_process_common+0x30c/0x530 [sunrpc]
 [<ffffffffa066d930>] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
 [<ffffffffa0622ef2>] svc_process+0x112/0x160 [sunrpc]
 [<ffffffffa066d9ef>] nfsd+0xbf/0x130 [nfsd]
 [<ffffffffa066d930>] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
 [<ffffffff81098dee>] kthread+0xce/0xf0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff8165213c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
INFO: task txg_sync:3390 blocked for more than 120 seconds.
      Tainted: P           O 3.14.4 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
txg_sync        D ffffffff8181d000     0  3390      2 0x00000000
 ffff8800e9a3fb58 0000000000000046 ffffc9000a32e400 ffff8800e9a3e010
 0000000000014440 0000000000014440 ffff88075260aef0 ffff880768800ef0
 ffff8800e9a3fb58 ffff88078e634440 ffff88075260aef0 ffff8800e9a3fbb8
Call Trace:
 [<ffffffff816442f9>] schedule+0x29/0x70
 [<ffffffff816443cc>] io_schedule+0x8c/0xd0
 [<ffffffffa0198380>] cv_wait_common+0xc0/0x1c0 [spl]
 [<ffffffffa03249d0>] ? zio_reexecute+0x300/0x300 [zfs]
 [<ffffffff810bc060>] ? bit_waitqueue+0xe0/0xe0
 [<ffffffffa0198498>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa0324c0b>] zio_wait+0xfb/0x1c0 [zfs]
 [<ffffffffa02b2433>] dsl_pool_sync+0x2c3/0x440 [zfs]
 [<ffffffffa02c6b56>] spa_sync+0x336/0x860 [zfs]
 [<ffffffff810e049c>] ? ktime_get_ts+0x4c/0xf0
 [<ffffffffa02dbaa9>] txg_sync_thread+0x379/0x4a0 [zfs]
 [<ffffffffa02db730>] ? txg_dispatch_callbacks+0x180/0x180 [zfs]
 [<ffffffffa0190370>] ? __thread_create+0x360/0x360 [spl]
 [<ffffffffa01903e8>] thread_generic_wrapper+0x78/0x90 [spl]
 [<ffffffffa0190370>] ? __thread_create+0x360/0x360 [spl]
 [<ffffffff81098dee>] kthread+0xce/0xf0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff8165213c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
INFO: task kworker/u256:2:85051 blocked for more than 120 seconds.
      Tainted: P           O 3.14.4 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u256:2  D ffffffff8181d000     0 85051      2 0x00000000
Workqueue: nfsd4 laundromat_main [nfsd]
 ffff880199385c68 0000000000000046 0000000200000002 ffff880199384010
 0000000000014440 0000000000014440 ffff8807670b10b0 ffffffff81c13480
 0000000000014440 ffffffffa069d0c4 ffffffffa069d0c0 00000000ffffffff
Call Trace:
 [<ffffffff816442f9>] schedule+0x29/0x70
 [<ffffffff8164465e>] schedule_preempt_disabled+0xe/0x10
 [<ffffffff81646024>] __mutex_lock_slowpath+0x1a4/0x240
 [<ffffffff816460e3>] mutex_lock+0x23/0x40
 [<ffffffffa0689705>] nfs4_lock_state+0x15/0x20 [nfsd]
 [<ffffffffa068e0cd>] nfs4_laundromat+0x2d/0x380 [nfsd]
 [<ffffffff810b49af>] ? idle_balance+0x13f/0x1a0
 [<ffffffffa068e43d>] laundromat_main+0x1d/0x40 [nfsd]
 [<ffffffff810908ce>] process_one_work+0x18e/0x470
 [<ffffffff810931c3>] worker_thread+0x123/0x520
 [<ffffffff81643e15>] ? __schedule+0x3c5/0x720
 [<ffffffff810930a0>] ? manage_workers+0x170/0x170
 [<ffffffff81098dee>] kthread+0xce/0xf0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff8165213c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
INFO: task zfs:121884 blocked for more than 120 seconds.
      Tainted: P           O 3.14.4 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
zfs             D ffffffff8181d000     0 121884 121883 0x00000000
 ffff880197315a98 0000000000000086 ffff880197315ba8 ffff880197314010
 0000000000014440 0000000000014440 ffff8801974e2e70 ffff880768820250
 ffff880197315aa8 ffff88075c1ee2b8 ffff88075c1ee208 ffff880197315ad8
Call Trace:
 [<ffffffff816442f9>] schedule+0x29/0x70
 [<ffffffffa01983f5>] cv_wait_common+0x135/0x1c0 [spl]
 [<ffffffff810bc060>] ? bit_waitqueue+0xe0/0xe0
 [<ffffffffa018ce1b>] ? spl_kmem_free+0x2b/0x40 [spl]
 [<ffffffffa01984d5>] __cv_wait+0x15/0x20 [spl]
 [<ffffffffa02db033>] txg_wait_synced+0xb3/0x1a0 [zfs]
 [<ffffffffa02a9370>] ? dsl_dataset_snapshot_sync+0x170/0x170 [zfs]
 [<ffffffffa02a9200>] ? dsl_dataset_hold+0x210/0x210 [zfs]
 [<ffffffffa02b9f9f>] dsl_sync_task+0x14f/0x240 [zfs]
 [<ffffffffa02a9370>] ? dsl_dataset_snapshot_sync+0x170/0x170 [zfs]
 [<ffffffffa02a9200>] ? dsl_dataset_hold+0x210/0x210 [zfs]
 [<ffffffffa02a63fd>] dsl_dataset_snapshot+0xed/0x250 [zfs]
 [<ffffffff811b396c>] ? __kmalloc_node+0x3c/0x50
 [<ffffffffa018eec7>] ? spl_kmem_alloc_node+0x77/0x150 [spl]
 [<ffffffffa018f493>] ? kmem_alloc_debug+0x53/0x230 [spl]
 [<ffffffffa0306129>] zfs_ioc_snapshot+0x1c9/0x200 [zfs]
 [<ffffffffa0306547>] zfsdev_ioctl+0x1c7/0x4d0 [zfs]
 [<ffffffff811e4843>] do_vfs_ioctl+0x73/0x380
 [<ffffffff811e4bf1>] SyS_ioctl+0xa1/0xb0
 [<ffffffff816521e9>] system_call_fastpath+0x16/0x1b
INFO: task nfsd:2777 blocked for more than 120 seconds.
      Tainted: P           O 3.14.4 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd            D ffffffff8181d000     0  2777      2 0x00000000
 ffff88075b0d9c48 0000000000000046 ffff88075b0d9c04 ffff88075b0d8010
 0000000000014440 0000000000014440 ffff88075c424210 ffff880768818fb0
 ffff88075b0d9d38 ffffffffa069d0c4 ffffffffa069d0c0 00000000ffffffff
Call Trace:
 [<ffffffff816442f9>] schedule+0x29/0x70
 [<ffffffff8164465e>] schedule_preempt_disabled+0xe/0x10
 [<ffffffff81646024>] __mutex_lock_slowpath+0x1a4/0x240
 [<ffffffff816460e3>] mutex_lock+0x23/0x40
 [<ffffffffa0689705>] nfs4_lock_state+0x15/0x20 [nfsd]
 [<ffffffffa068b559>] nfsd4_renew+0x49/0xb0 [nfsd]
 [<ffffffffa067fc71>] nfsd4_proc_compound+0x3f1/0x580 [nfsd]
 [<ffffffffa066d258>] nfsd_dispatch+0xa8/0x1b0 [nfsd]
 [<ffffffffa062297c>] svc_process_common+0x30c/0x530 [sunrpc]
 [<ffffffffa066d930>] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
 [<ffffffffa0622ef2>] svc_process+0x112/0x160 [sunrpc]
 [<ffffffffa066d9ef>] nfsd+0xbf/0x130 [nfsd]
 [<ffffffffa066d930>] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
 [<ffffffff81098dee>] kthread+0xce/0xf0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff8165213c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81098d20>] ? kthread_freezable_should_stop+0x70/0x70
nfsd: peername failed (err 107)!
nfsd: peername failed (err 107)!
nfsd: peername failed (err 107)!
nfsd: peername failed (err 107)!
nfsd: peername failed (err 107)!
nfsd: peername failed (err 107)!
nfsd: peername failed (err 107)!
nfsd: peername failed (err 107)!
nfsd: peername failed (err 107)!
nfsd: peername failed (err 107)!

I started working with @dweeezil and @ryao after we were able to successfully reproduce the issue. I am happy to report that, after numerous reproductions, we applied this patch and the issue appears to be resolved at this point.

@behlendorf (Contributor)

@ryao it would be great if you could rebase this on master so we could get a fresh round of tests in before it's merged.

@ryao force-pushed the illumos-5244 branch 3 times, most recently from e81dc3d to 8e8589a (April 27, 2015 20:31)
@eolson78

So I may have spoken too soon about this pull request solving the issue completely. After multiple hours of running with this patch, we have seen the following new issue:

Apr 26 22:19:10 ip-50-0-0-122 kernel: INFO: task zfs:5492 blocked for more than 120 seconds.
Apr 26 22:19:10 ip-50-0-0-122 kernel: Tainted: P O 3.14.4 #1
Apr 26 22:19:10 ip-50-0-0-122 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 26 22:19:10 ip-50-0-0-122 kernel: zfs D ffffffff8181d000 0 5492 5491 0x00000000
Apr 26 22:19:10 ip-50-0-0-122 kernel: ffff8800b98e7a98 0000000000000082 ffff8800b98e7ba8 ffff8800b98e6010
Apr 26 22:19:10 ip-50-0-0-122 kernel: 0000000000014440 0000000000014440 ffff8800788be350 ffff880768820ff0
Apr 26 22:19:10 ip-50-0-0-122 kernel: ffff8800b98e7aa8 ffff880742104ab8 ffff880742104a08 ffff8800b98e7ad8
Apr 26 22:19:10 ip-50-0-0-122 kernel: Call Trace:
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] schedule+0x29/0x70
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] cv_wait_common+0x135/0x1c0 [spl]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] ? bit_waitqueue+0xe0/0xe0
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] ? spl_kmem_free+0x2b/0x40 [spl]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] __cv_wait+0x15/0x20 [spl]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] txg_wait_synced+0xb3/0x1a0 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] ? dsl_dataset_snapshot_sync+0x170/0x170 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] ? dsl_dataset_hold+0x210/0x210 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] dsl_sync_task+0x14f/0x240 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] ? dsl_dataset_snapshot_sync+0x170/0x170 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] ? dsl_dataset_hold+0x210/0x210 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] dsl_dataset_snapshot+0xed/0x250 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] ? __kmalloc_node+0x3c/0x50
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] ? spl_kmem_alloc_node+0x77/0x150 [spl]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] ? kmem_alloc_debug+0x53/0x230 [spl]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] zfs_ioc_snapshot+0x1c9/0x200 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] zfsdev_ioctl+0x1c7/0x4d0 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] do_vfs_ioctl+0x73/0x380
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] SyS_ioctl+0xa1/0xb0
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] system_call_fastpath+0x16/0x1b
Apr 26 22:19:10 ip-50-0-0-122 kernel: INFO: task zfs:5653 blocked for more than 120 seconds.
Apr 26 22:19:10 ip-50-0-0-122 kernel: Tainted: P O 3.14.4 #1
Apr 26 22:19:10 ip-50-0-0-122 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 26 22:19:10 ip-50-0-0-122 kernel: zfs D ffffffff8181d000 0 5653 5652 0x00000000
Apr 26 22:19:10 ip-50-0-0-122 kernel: ffff8803a6e9dbc8 0000000000000086 0000000000000018 ffff8803a6e9c010
Apr 26 22:19:10 ip-50-0-0-122 kernel: 0000000000014440 0000000000014440 ffff88053837ef70 ffff8807688322d0
Apr 26 22:19:10 ip-50-0-0-122 kernel: ffff8803a6e9dbd8 ffff880742104ab8 ffff880742104a08 ffff8803a6e9dc08
Apr 26 22:19:10 ip-50-0-0-122 kernel: Call Trace:
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] schedule+0x29/0x70
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] cv_wait_common+0x135/0x1c0 [spl]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] ? bit_waitqueue+0xe0/0xe0
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] ? spl_kmem_free+0x2b/0x40 [spl]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] __cv_wait+0x15/0x20 [spl]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] txg_wait_synced+0xb3/0x1a0 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_snapshot_check_impl+0xb0/0xb0 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_head_sync+0xc0/0xc0 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] dsl_sync_task+0x14f/0x240 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_snapshot_check_impl+0xb0/0xb0 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_head_sync+0xc0/0xc0 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] dsl_destroy_snapshots_nvl+0x8b/0xf0 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] zfs_ioc_destroy_snaps+0xff/0x120 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] zfsdev_ioctl+0x1c7/0x4d0 [zfs]
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] do_vfs_ioctl+0x73/0x380
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] SyS_ioctl+0xa1/0xb0
Apr 26 22:19:10 ip-50-0-0-122 kernel: [] system_call_fastpath+0x16/0x1b
Apr 26 22:31:10 ip-50-0-0-122 kernel: INFO: task zfs:9148 blocked for more than 120 seconds.
Apr 26 22:31:10 ip-50-0-0-122 kernel: Tainted: P O 3.14.4 #1
Apr 26 22:31:10 ip-50-0-0-122 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 26 22:31:10 ip-50-0-0-122 kernel: zfs D ffffffff8181d000 0 9148 9146 0x00000000
Apr 26 22:31:10 ip-50-0-0-122 kernel: ffff88026ef1fbd8 0000000000000082 0000000000000001 ffff88026ef1e010
Apr 26 22:31:10 ip-50-0-0-122 kernel: 0000000000014440 0000000000014440 ffff8800bb6ff130 ffff880768820250
Apr 26 22:31:10 ip-50-0-0-122 kernel: ffff88026ef1fbe8 ffff880742104ab8 ffff880742104a08 ffff88026ef1fc18
Apr 26 22:31:10 ip-50-0-0-122 kernel: Call Trace:
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] schedule+0x29/0x70
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] cv_wait_common+0x135/0x1c0 [spl]
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] ? bit_waitqueue+0xe0/0xe0
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] __cv_wait+0x15/0x20 [spl]
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] txg_wait_synced+0xb3/0x1a0 [zfs]
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] ? dsl_dataset_user_hold_check_one+0xd0/0xd0 [zfs]
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] ? dsl_onexit_hold_cleanup+0x110/0x110 [zfs]
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] dsl_sync_task+0x14f/0x240 [zfs]
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] ? dsl_dataset_user_hold_check_one+0xd0/0xd0 [zfs]
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] ? dsl_onexit_hold_cleanup+0x110/0x110 [zfs]
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] dsl_dataset_user_hold+0x99/0x100 [zfs]
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] zfs_ioc_hold+0x86/0xb0 [zfs]
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] zfsdev_ioctl+0x1c7/0x4d0 [zfs]
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] do_vfs_ioctl+0x73/0x380
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] SyS_ioctl+0xa1/0xb0
Apr 26 22:31:10 ip-50-0-0-122 kernel: [] system_call_fastpath+0x16/0x1b
Apr 26 22:33:10 ip-50-0-0-122 kernel: INFO: task txg_sync:28380 blocked for more than 120 seconds.
Apr 26 22:33:10 ip-50-0-0-122 kernel: Tainted: P O 3.14.4 #1
Apr 26 22:33:10 ip-50-0-0-122 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 26 22:33:10 ip-50-0-0-122 kernel: txg_sync D 0000000000000005 0 28380 2 0x00000000
Apr 26 22:33:10 ip-50-0-0-122 kernel: ffff88072e857b58 0000000000000046 ffffc9001ef5c400 ffff88072e856010
Apr 26 22:33:10 ip-50-0-0-122 kernel: 0000000000014440 0000000000014440 ffff88072e850110 ffff880767c70fb0
Apr 26 22:33:10 ip-50-0-0-122 kernel: ffff88072e857b58 ffff88078e6b4440 ffff88072e850110 ffff88072e857bb8
Apr 26 22:33:10 ip-50-0-0-122 kernel: Call Trace:
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] schedule+0x29/0x70
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] io_schedule+0x8c/0xd0
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] cv_wait_common+0xc0/0x1c0 [spl]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] ? zio_reexecute+0x300/0x300 [zfs]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] ? bit_waitqueue+0xe0/0xe0
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] __cv_wait_io+0x18/0x20 [spl]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] zio_wait+0xfb/0x1c0 [zfs]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] dsl_pool_sync+0x2c3/0x440 [zfs]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] spa_sync+0x336/0x860 [zfs]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] ? ktime_get_ts+0x4c/0xf0
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] txg_sync_thread+0x379/0x4a0 [zfs]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] ? txg_dispatch_callbacks+0x180/0x180 [zfs]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] ? __thread_create+0x360/0x360 [spl]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] thread_generic_wrapper+0x78/0x90 [spl]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] ? __thread_create+0x360/0x360 [spl]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] kthread+0xce/0xf0
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] ret_from_fork+0x7c/0xb0
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Apr 26 22:33:10 ip-50-0-0-122 kernel: INFO: task zfs:9148 blocked for more than 120 seconds.
Apr 26 22:33:10 ip-50-0-0-122 kernel: Tainted: P O 3.14.4 #1
Apr 26 22:33:10 ip-50-0-0-122 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 26 22:33:10 ip-50-0-0-122 kernel: zfs D ffffffff8181d000 0 9148 9146 0x00000000
Apr 26 22:33:10 ip-50-0-0-122 kernel: ffff88026ef1fbd8 0000000000000082 0000000000000001 ffff88026ef1e010
Apr 26 22:33:10 ip-50-0-0-122 kernel: 0000000000014440 0000000000014440 ffff8800bb6ff130 ffff880768820250
Apr 26 22:33:10 ip-50-0-0-122 kernel: ffff88026ef1fbe8 ffff880742104ab8 ffff880742104a08 ffff88026ef1fc18
Apr 26 22:33:10 ip-50-0-0-122 kernel: Call Trace:
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] schedule+0x29/0x70
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] cv_wait_common+0x135/0x1c0 [spl]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] ? bit_waitqueue+0xe0/0xe0
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] __cv_wait+0x15/0x20 [spl]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] txg_wait_synced+0xb3/0x1a0 [zfs]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] ? dsl_dataset_user_hold_check_one+0xd0/0xd0 [zfs]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] ? dsl_onexit_hold_cleanup+0x110/0x110 [zfs]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] dsl_sync_task+0x14f/0x240 [zfs]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] ? dsl_dataset_user_hold_check_one+0xd0/0xd0 [zfs]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] ? dsl_onexit_hold_cleanup+0x110/0x110 [zfs]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] dsl_dataset_user_hold+0x99/0x100 [zfs]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] zfs_ioc_hold+0x86/0xb0 [zfs]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] zfsdev_ioctl+0x1c7/0x4d0 [zfs]
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] do_vfs_ioctl+0x73/0x380
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] SyS_ioctl+0xa1/0xb0
Apr 26 22:33:10 ip-50-0-0-122 kernel: [] system_call_fastpath+0x16/0x1b
Apr 26 22:34:51 ip-50-0-0-122 dhclient[2410]: DHCPREQUEST on eth0 to 50.0.0.1 port 67 (xid=0x5f8fb5f4)
Apr 26 22:34:51 ip-50-0-0-122 dhclient[2410]: DHCPACK from 50.0.0.1 (xid=0x5f8fb5f4)
Apr 26 22:34:53 ip-50-0-0-122 dhclient[2410]: bound to 50.0.0.122 -- renewal in 1563 seconds.
Apr 26 22:37:10 ip-50-0-0-122 kernel: INFO: task zpool:10408 blocked for more than 120 seconds.
Apr 26 22:37:10 ip-50-0-0-122 kernel: Tainted: P O 3.14.4 #1
Apr 26 22:37:10 ip-50-0-0-122 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 26 22:37:10 ip-50-0-0-122 kernel: zpool D ffffffff8181d000 0 10408 10407 0x00000000
Apr 26 22:37:10 ip-50-0-0-122 kernel: ffff8800050e3ae8 0000000000000086 ffff8800050e3af8 ffff8800050e2010
Apr 26 22:37:10 ip-50-0-0-122 kernel: 0000000000014440 0000000000014440 ffff88075c1defb0 ffff880768818210
Apr 26 22:37:10 ip-50-0-0-122 kernel: 0000000100000000 ffffffffa038be44 ffffffffa038be40 00000000ffffffff
Apr 26 22:37:10 ip-50-0-0-122 kernel: Call Trace:
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] schedule+0x29/0x70
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] schedule_preempt_disabled+0xe/0x10
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] __mutex_lock_slowpath+0x1a4/0x240
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] mutex_lock+0x23/0x40
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] zfsdev_open+0x24/0x100 [zfs]
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] misc_open+0xb5/0x170
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] ? cdev_put+0x30/0x30
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] chrdev_open+0xe1/0x1b0
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] ? security_file_open+0x7a/0x80
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] do_dentry_open+0x25e/0x2f0
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] finish_open+0x35/0x50
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] do_last+0x4de/0x7f0
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] path_openat+0xc4/0x480
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] ? __handle_mm_fault+0x196/0x360
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] do_filp_open+0x4a/0xa0
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] ? __alloc_fd+0xac/0x150
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] do_sys_open+0x11a/0x230
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] ? do_page_fault+0x36/0x50
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] SyS_open+0x1e/0x20
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] system_call_fastpath+0x16/0x1b
Apr 26 22:37:10 ip-50-0-0-122 kernel: INFO: task zpool:10558 blocked for more than 120 seconds.
Apr 26 22:37:10 ip-50-0-0-122 kernel: Tainted: P O 3.14.4 #1
Apr 26 22:37:10 ip-50-0-0-122 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 26 22:37:10 ip-50-0-0-122 kernel: zpool D ffffffff8181d000 0 10558 10557 0x00000000
Apr 26 22:37:10 ip-50-0-0-122 kernel: ffff880014541b18 0000000000000086 0000000100000000 ffff880014540010
Apr 26 22:37:10 ip-50-0-0-122 kernel: 0000000000014440 0000000000014440 ffff88072732a0d0 ffff880768820250
Apr 26 22:37:10 ip-50-0-0-122 kernel: 0000000300000040 ffffffff81cebec4 ffffffff81cebec0 00000000ffffffff
Apr 26 22:37:10 ip-50-0-0-122 kernel: Call Trace:
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] schedule+0x29/0x70
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] schedule_preempt_disabled+0xe/0x10
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] __mutex_lock_slowpath+0x1a4/0x240
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] ? __alloc_pages_nodemask+0x15f/0x3d0
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] ? cdev_put+0x30/0x30
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] mutex_lock+0x23/0x40
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] misc_open+0x2b/0x170
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] ? cdev_put+0x30/0x30
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] chrdev_open+0xe1/0x1b0
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] ? security_file_open+0x7a/0x80
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] do_dentry_open+0x25e/0x2f0
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] finish_open+0x35/0x50
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] do_last+0x4de/0x7f0
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] path_openat+0xc4/0x480
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] ? __handle_mm_fault+0x196/0x360
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] do_filp_open+0x4a/0xa0
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] ? __alloc_fd+0xac/0x150
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] do_sys_open+0x11a/0x230
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] ? do_page_fault+0x36/0x50
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] SyS_open+0x1e/0x20
Apr 26 22:37:10 ip-50-0-0-122 kernel: [] system_call_fastpath+0x16/0x1b
Apr 26 22:45:10 ip-50-0-0-122 kernel: INFO: task zfs:11498 blocked for more than 120 seconds.
Apr 26 22:45:10 ip-50-0-0-122 kernel: Tainted: P O 3.14.4 #1
Apr 26 22:45:10 ip-50-0-0-122 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 26 22:45:10 ip-50-0-0-122 kernel: zfs D ffffffff8181d000 0 11498 11497 0x00000000
Apr 26 22:45:10 ip-50-0-0-122 kernel: ffff880062213bc8 0000000000000086 0000000000000018 ffff880062212010
Apr 26 22:45:10 ip-50-0-0-122 kernel: 0000000000014440 0000000000014440 ffff8807583d2390 ffff880768820ff0
Apr 26 22:45:10 ip-50-0-0-122 kernel: ffff880062213bd8 ffff880742104ab8 ffff880742104a08 ffff880062213c08
Apr 26 22:45:10 ip-50-0-0-122 kernel: Call Trace:
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] schedule+0x29/0x70
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] cv_wait_common+0x135/0x1c0 [spl]
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] ? bit_waitqueue+0xe0/0xe0
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] ? spl_kmem_free+0x2b/0x40 [spl]
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] __cv_wait+0x15/0x20 [spl]
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] txg_wait_synced+0xb3/0x1a0 [zfs]
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_snapshot_check_impl+0xb0/0xb0 [zfs]
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_head_sync+0xc0/0xc0 [zfs]
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] dsl_sync_task+0x14f/0x240 [zfs]
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_snapshot_check_impl+0xb0/0xb0 [zfs]
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_head_sync+0xc0/0xc0 [zfs]
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] dsl_destroy_snapshots_nvl+0x8b/0xf0 [zfs]
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] zfs_ioc_destroy_snaps+0xff/0x120 [zfs]
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] zfsdev_ioctl+0x1c7/0x4d0 [zfs]
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] do_vfs_ioctl+0x73/0x380
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] SyS_ioctl+0xa1/0xb0
Apr 26 22:45:10 ip-50-0-0-122 kernel: [] system_call_fastpath+0x16/0x1b
Apr 26 22:53:10 ip-50-0-0-122 kernel: INFO: task zfs:13790 blocked for more than 120 seconds.
Apr 26 22:53:10 ip-50-0-0-122 kernel: Tainted: P O 3.14.4 #1
Apr 26 22:53:10 ip-50-0-0-122 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 26 22:53:10 ip-50-0-0-122 kernel: zfs D ffffffff8181d000 0 13790 13789 0x00000000
Apr 26 22:53:10 ip-50-0-0-122 kernel: ffff88069b093bc8 0000000000000082 0000000000000018 ffff88069b092010
Apr 26 22:53:10 ip-50-0-0-122 kernel: 0000000000014440 0000000000014440 ffff88000ac3b170 ffff88076882a290
Apr 26 22:53:10 ip-50-0-0-122 kernel: ffff88069b093bd8 ffff880742104ab8 ffff880742104a08 ffff88069b093c08
Apr 26 22:53:10 ip-50-0-0-122 kernel: Call Trace:
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] schedule+0x29/0x70
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] cv_wait_common+0x135/0x1c0 [spl]
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] ? bit_waitqueue+0xe0/0xe0
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] ? spl_kmem_free+0x2b/0x40 [spl]
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] __cv_wait+0x15/0x20 [spl]
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] txg_wait_synced+0xb3/0x1a0 [zfs]
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_snapshot_check_impl+0xb0/0xb0 [zfs]
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_head_sync+0xc0/0xc0 [zfs]
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] dsl_sync_task+0x14f/0x240 [zfs]
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_snapshot_check_impl+0xb0/0xb0 [zfs]
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_head_sync+0xc0/0xc0 [zfs]
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] dsl_destroy_snapshots_nvl+0x8b/0xf0 [zfs]
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] zfs_ioc_destroy_snaps+0xff/0x120 [zfs]
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] zfsdev_ioctl+0x1c7/0x4d0 [zfs]
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] do_vfs_ioctl+0x73/0x380
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] SyS_ioctl+0xa1/0xb0
Apr 26 22:53:10 ip-50-0-0-122 kernel: [] system_call_fastpath+0x16/0x1b
Apr 26 22:55:10 ip-50-0-0-122 kernel: INFO: task zfs:13790 blocked for more than 120 seconds.
Apr 26 22:55:10 ip-50-0-0-122 kernel: Tainted: P O 3.14.4 #1
Apr 26 22:55:10 ip-50-0-0-122 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 26 22:55:10 ip-50-0-0-122 kernel: zfs D ffffffff8181d000 0 13790 13789 0x00000000
Apr 26 22:55:10 ip-50-0-0-122 kernel: ffff88069b093bc8 0000000000000082 0000000000000018 ffff88069b092010
Apr 26 22:55:10 ip-50-0-0-122 kernel: 0000000000014440 0000000000014440 ffff88000ac3b170 ffff88076882a290
Apr 26 22:55:10 ip-50-0-0-122 kernel: ffff88069b093bd8 ffff880742104ab8 ffff880742104a08 ffff88069b093c08
Apr 26 22:55:10 ip-50-0-0-122 kernel: Call Trace:
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] schedule+0x29/0x70
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] cv_wait_common+0x135/0x1c0 [spl]
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] ? bit_waitqueue+0xe0/0xe0
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] ? spl_kmem_free+0x2b/0x40 [spl]
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] __cv_wait+0x15/0x20 [spl]
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] txg_wait_synced+0xb3/0x1a0 [zfs]
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_snapshot_check_impl+0xb0/0xb0 [zfs]
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_head_sync+0xc0/0xc0 [zfs]
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] dsl_sync_task+0x14f/0x240 [zfs]
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_snapshot_check_impl+0xb0/0xb0 [zfs]
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] ? dsl_destroy_head_sync+0xc0/0xc0 [zfs]
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] dsl_destroy_snapshots_nvl+0x8b/0xf0 [zfs]
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] zfs_ioc_destroy_snaps+0xff/0x120 [zfs]
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] zfsdev_ioctl+0x1c7/0x4d0 [zfs]
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] do_vfs_ioctl+0x73/0x380
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] SyS_ioctl+0xa1/0xb0
Apr 26 22:55:10 ip-50-0-0-122 kernel: [] system_call_fastpath+0x16/0x1b

@ryao (Contributor, Author) commented Apr 29, 2015

@behlendorf I have been focused on developing the ability to upstream code changes to Illumos this past week, so I have relied solely on the buildbot for testing fixes to merge conflicts in vdev_file.c. The latest push resolved the merge conflicts correctly and is largely passing the buildbot. The two failures are an issue with the Ubuntu repository and a spurious failure in ztest. I am going to rebase on latest master and do another push. Hopefully all tests will be green with no spurious failures this time.

@eolson78's second issue appears to be #2060, which is resolved by #3348.

Illumos 5244 - zio pipeline callers should explicitly invoke next stage

References:
https://www.illumos.org/issues/5244
https://reviews.csiden.org/r/119/diff/#

Porting Notes:
1. The unported "2932 support crash dumps to raidz, etc. pools" caused a
merge conflict due to a copyright difference in module/zfs/vdev_raidz.c.
2. The unported "4128 disks in zpools never go away when pulled" and
additional Linux-specific changes caused merge conflicts in
module/zfs/vdev_disk.c.
3. Our changes to use the TQ_PUSHPAGE extension and vdev_file_taskq
caused merge conflicts in module/zfs/vdev_file.c. The taskq import that
I plan to send in the future would have prevented this particular
conflict.

Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: Adam Leventhal ahl@delphix.com
Reviewed by: Alex Reece alex.reece@delphix.com
Reviewed by: Christopher Siden christopher.siden@delphix.com
Ported-by: Richard Yao <richard.yao@clusterhq.com>
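
As a point of reference for porting note 3 above, the Linux port hands file-vdev I/O to a dedicated taskq; the following is only a sketch of that dispatch under the ZoL-era SPL taskq API, and the exact arguments in module/zfs/vdev_file.c may differ:

    /* vdev_file_taskq and TQ_PUSHPAGE are the ZoL names the porting note
     * refers to; vdev_file_io_strategy is the worker that performs the
     * actual read/write against the backing file. */
    VERIFY3U(taskq_dispatch(vdev_file_taskq, vdev_file_io_strategy,
        zio, TQ_PUSHPAGE), !=, 0);
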
@eolson78

@ryao is correct on this. We have been able to reproduce the issue addressed by this pull request, and I can confirm that it does in fact 100% solve this issue. I can also confirm that #3348 solves the second issue I hit, which was the same as #2060.

@behlendorf (Contributor)

@eolson78 thanks for the follow up. This fix has now been merged.

dasjoe pushed a commit to dasjoe/zfs that referenced this pull request May 24, 2015
Illumos 5244 - zio pipeline callers should explicitly invoke next stage

5244 zio pipeline callers should explicitly invoke next stage
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5244
  illumos/illumos-gate@738f37b

Porting Notes:

1. The unported "2932 support crash dumps to raidz, etc. pools"
   caused a merge conflict due to a copyright difference in
   module/zfs/vdev_raidz.c.
2. The unported "4128 disks in zpools never go away when pulled"
   and additional Linux-specific changes caused merge conflicts in
   module/zfs/vdev_disk.c.

Ported-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#2828