
lock order inversion between zvol_open() and dsl_pool_sync()...zvol_rename_minors() #3681

Closed
bprotopopov opened this issue Aug 11, 2015 · 33 comments

@bprotopopov
Contributor

There appears to be a problem with the following two code paths executed concurrently:

  1. resulting from the zfs rename on zvol snapshot, taking place in txg_sync_thread()

#0 [ffff8802a6e7ba20] schedule at ffffffff815299d0
#1 [ffff8802a6e7baf8] __mutex_lock_slowpath at ffffffff8152b3a6
#2 [ffff8802a6e7bb68] mutex_lock at ffffffff8152aecb
#3 [ffff8802a6e7bb88] zvol_rename_minors at ffffffffa02fe634 [zfs]
#4 [ffff8802a6e7bbd8] dsl_dir_rename_sync at ffffffffa027c04d [zfs]
#5 [ffff8802a6e7bc48] dsl_sync_task_sync at ffffffffa02850f2 [zfs]
#6 [ffff8802a6e7bc78] dsl_pool_sync at ffffffffa027d31b [zfs]
#7 [ffff8802a6e7bcf8] spa_sync at ffffffffa0293587 [zfs]
#8 [ffff8802a6e7bdc8] txg_sync_thread at ffffffffa02a890b [zfs]
#9 [ffff8802a6e7beb8] thread_generic_wrapper at ffffffffa0169978 [spl]
#10 [ffff8802a6e7bee8] kthread at ffffffff8109e78e
#11 [ffff8802a6e7bf48] kernel_thread at ffffffff8100c28a

  2. resulting from opening a zvol for I/O

#0 [ffff88020e53d9b8] schedule at ffffffff815299d0
#1 [ffff88020e53da90] cv_wait_common at ffffffffa0171e65 [spl]
#2 [ffff88020e53db20] __cv_wait at ffffffffa0171f75 [spl]
#3 [ffff88020e53db30] rrw_enter_read at ffffffffa028df7b [zfs]
#4 [ffff88020e53db60] rrw_enter at ffffffffa028e0d0 [zfs]
#5 [ffff88020e53db70] dsl_pool_config_enter at ffffffffa027c24d [zfs]
#6 [ffff88020e53db80] dsl_pool_hold at ffffffffa027c2da [zfs]
#7 [ffff88020e53dbc0] dmu_objset_own at ffffffffa025e1a0 [zfs]
#8 [ffff88020e53dc20] zvol_open at ffffffffa02fd4fb [zfs]
#9 [ffff88020e53dc90] __blkdev_get at ffffffff811cbc4e
#10 [ffff88020e53dcf0] blkdev_get at ffffffff811cbf70
#11 [ffff88020e53dd00] blkdev_open at ffffffff811cbff1
#12 [ffff88020e53dd30] __dentry_open at ffffffff8118b1ba
#13 [ffff88020e53dd90] nameidata_to_filp at ffffffff8118b524
#14 [ffff88020e53ddb0] do_filp_open at ffffffff811a12d0
#15 [ffff88020e53df20] do_sys_open at ffffffff8118af77
#16 [ffff88020e53df70] sys_open at ffffffff8118b080
#17 [ffff88020e53df80] system_call_fastpath at ffffffff8100b0f2

The first code path takes the dsl pool config lock and then tries to take the zvol_state_lock, while the second takes the zvol_state_lock and then tries to get the dsl pool config lock.

It is interesting that zvol_first_open(), called from zvol_open(), takes care to obtain the spa_namespace_lock with a trylock, and fails if it cannot, in order to avoid the lock order inversion between the spa_namespace_lock and the zvol_state_lock.

Yet the txg_sync_thread does not seem to hold the spa_namespace_lock when calling dsl_pool_sync(), nor later, before trying to get the zvol_state_lock. That is why I believe this defensive check is not effective in this particular case.
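For clarity, here is a rough sketch of the two conflicting acquisition orders implied by the stacks above, paraphrased around the actual lock primitives rather than the literal code:

/*
 * Path 1: txg_sync_thread -> dsl_pool_sync -> dsl_dir_rename_sync
 *
 *     rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG);  pool config lock held by the sync task
 *     mutex_enter(&zvol_state_lock);                      zvol_rename_minors() blocks here
 *
 * Path 2: zvol_open -> dmu_objset_own -> dsl_pool_hold
 *
 *     mutex_enter(&zvol_state_lock);                      held across the open
 *     rrw_enter(&dp->dp_config_rwlock, RW_READER, FTAG);  dsl_pool_config_enter() blocks here
 *
 * Each thread holds the lock the other one wants: a classic AB/BA inversion.
 */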

@bprotopopov
Contributor Author

I am seriously considering running the zvol_*_minor()-related functions invoked from dsl_dataset.c, spa.c, and zfs_ioctl.c through a taskq, in order to isolate them from any locking conditions in the callers.

@behlendorf
Contributor

@bprotopopov what about changing the mutex to an rwlock? The zvol_state_lock is there primarily to protect against addition to and removal from the list. For places like zvol_rename_minors(), where the list is just traversed, it would be safe to use a read lock. A proper mutex to lock the individual zvol_state_t members during modification would probably be a good idea as well.

@bprotopopov
Contributor Author

@behlendorf I believe zvol_first_open() would then have to take the rwlock as a writer whereas the txg_sync_thread would take it as a reader, still resulting in the same deadlock, or am I missing something?

I have prototyped the asynchronous (in a taskq) zvol minor operations, and I think the code is much simpler as far as the analysis of possible deadlocks goes. I also think that with this approach there is no longer a need to try-lock the spa_namespace_lock in the zvol_open() path (because everybody now takes the zvol_state_lock first and then other locks). Given that the udev-related side effects of the minor operations are asynchronous as well, would this approach seem like a more straightforward way to get things done?
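For reference, a minimal sketch of what such a dispatch path could look like using the SPL taskq API; the zvol_task_t structure, the helper names, and the module-scope taskq here are hypothetical illustrations, not the actual patch:

#include <sys/taskq.h>
#include <sys/kmem.h>

typedef struct zvol_task {
	char zt_pool[MAXNAMELEN];	/* pool name copied for the worker */
} zvol_task_t;

static taskq_t *zvol_taskq;	/* single-threaded; created at module load in this sketch */

static void
zvol_task_create_minors(void *arg)
{
	zvol_task_t *zt = arg;

	/*
	 * Runs with no caller locks held, so it is free to take
	 * zvol_state_lock and then the dsl/spa locks without inverting
	 * the order seen in the stacks above.
	 */
	(void) zvol_create_minors(zt->zt_pool);
	kmem_free(zt, sizeof (zvol_task_t));
}

void
zvol_async_create_minors(const char *pool)
{
	zvol_task_t *zt = kmem_alloc(sizeof (zvol_task_t), KM_PUSHPAGE);

	(void) strlcpy(zt->zt_pool, pool, sizeof (zt->zt_pool));
	/* A single-threaded taskq preserves the dispatch order of the operations. */
	(void) taskq_dispatch(zvol_taskq, zvol_task_create_minors, zt, TQ_SLEEP);
}

Whether the taskq should be global or per pool is the design question discussed below.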

@behlendorf
Contributor

@bprotopopov perhaps the race would be smaller, but I think you're right, it would still exist. I'm definitely not opposed to doing the minor creation asynchronously; I was just tossing out another possible idea. I'd love to see your patch for this once you're happy with it.

In fact, what would be ideal is if you could build up a small patch stack with this change and your other zvol improvements. Then we could get them all reviewed in one go and merged.

@bprotopopov
Contributor Author

@behlendorf sure thing

behlendorf added this to the 0.7.0 milestone Aug 31, 2015
@bprotopopov
Contributor Author

Hope to be done by end of week

@bprotopopov
Contributor Author

@behlendorf, hope you can help: what is the proper way to build SPL and ZFS to enable asserts and other debug checks?

I did make distclean, ./autogen.sh, ./configure --enable-debug, make rpm-utils && make rpm-kmod, and yum localinstall the RPMs for SPL first, then the same thing for ZFS.

Yet I get a panic
...
SPLError: 2477:0:(list.h:77:list_create()) ASSERTION(size >= offset + sizeof(list_node_t)) failed
SPLError: 2477:0:(list.h:77:list_create()) SPL PANIC
SPL: Showing stack for process 2477
Pid: 2477, comm: modprobe Tainted: P --------------- 2.6.32-504.23.4.el6.x86_64 #1
Call Trace:
[] ? spl_debug_dumpstack+0x46/0x60 [spl]
[] ? spl_debug_bug+0x81/0xd0 [spl]
[] ? spl_PANIC+0xba/0xf0 [spl]
[] ? kmem_alloc_debug+0x251/0x490 [spl]
[] ? spl_kmem_cache_create+0x331/0xf50 [spl]
[] ? __thread_create+0x13f/0x390 [spl]
[] ? mutex_lock+0x1e/0x50
[] ? zfs_znode_cache_constructor+0x0/0x1b0 [zfs]
[] ? spl__init+0x0/0x20 [zfs]
[] ? zvol_init+0x88/0x290 [zfs]
[] ? spl__init+0x0/0x20 [zfs]
[] ? spl__init+0x0/0x20 [zfs]
[] ? _init+0x22/0xb10 [zfs]
[] ? spl__init+0x0/0x20 [zfs]
[] ? spl__init+0x13/0x20 [zfs]
[] ? do_one_initcall+0x3c/0x1d0
[] ? sys_init_module+0xe1/0x250
[] ? system_call_fastpath+0x16/0x1b

What am I missing?
Should I do make && make install instead of using RPMs?

@behlendorf
Contributor

I'm not sure I follow the question. Passing --enable-debug as you did will enable the assertions, and it looks like you're hitting one, so that issue will need to be resolved.

@bprotopopov
Contributor Author

Well, the assertion looks like a mismatch of type sizes between the SPL and ZFS modules, e.g. as if one were built with debug support and the other without.

Thanks for confirming the procedure, it must be something in my environment.


bprotopopov added a commit to bprotopopov/zfs that referenced this issue Sep 21, 2015
…() and

dsl_pool_sync()...zvol_rename_minors()

Execute zvol_*_minor(s) functions asynchronously in taskqs (one
taskq per pool) in order to avoid having to deal with the locking
state of the caller.
bprotopopov added a commit to bprotopopov/zfs that referenced this issue Sep 21, 2015
…n() and

dsl_pool_sync()...zvol_rename_minors()

Execute zvol_*_minor(s) functions asynchronously in taskqs (one
taskq per pool) in order to avoid having to deal with the locking
state of the caller.
@bprotopopov
Contributor Author

@behlendorf, I wanted to clarify regarding the other zvol improvements you mentioned including with this bugfix; I can only think of the zvol minors scalability issue (#2217); I rebased that fix as well and will put it in the pull request shortly.

Please let me know if that's what you meant.

@behlendorf
Contributor

Thanks for linking to your patches and working on this. I think this improvement will help us close out a lot of long standing issues. I took a quick look at the patch and I have a few questions.

  • Why is it necessary to create a taskq per pool on demand rather than a global one at module load? That definitely complicates the code, and I'm not sure what advantage it has over a single global taskq in zvol.c. Are you concerned a deadlock is possible if multiple pools share the same taskq? Or do you feel you'll need more threads when a system contains many pools? If it's the latter, I'd suggest just using TASKQ_DYNAMIC and allowing a large number of threads to be created if they're needed.
  • It feels to me like the make_async_arg() logic should be pulled into zvol_async_dispatch(). That would allow all the helper functions to be simplified.
  • KM_SLEEP can now be used safely; KM_PUSHPAGE isn't required.
  • Did you resolve your ASSERT issue?

@bprotopopov
Contributor Author

Hi, @behlendorf,
I am still refining the code a bit; to answer your questions:

  1. taskq per pool - I was trying to make the minor-related actions for different pools independent; thinking more about this, it might be overkill, since the implementations take the zvol_state_lock, which is global across pools. I don't mind reworking the code with a single queue and perhaps TASKQ_DYNAMIC.
  2. I will look into that as well - I have no preference.
  3. perhaps because I was testing with the older code first, I did see the warnings about using KM_SLEEP in the sync context, so I looked elsewhere in the zvol.c code, saw KM_PUSHPAGE being used exclusively, and followed that example - again, no preference here - do you think I should change to KM_SLEEP?
  4. I did resolve the assert; it was due to a programming error I am not willing to discuss in public :)

@bprotopopov
Contributor Author

Hi, @behlendorf,

sorry for the confusion; I remembered why a taskq per pool (barring the limits on concurrency imposed by the zvol_state_lock): the taskqs must be single-threaded to guarantee the ordering of request processing, and it felt a bit unsettling to limit all the minor-related work for all the pools to a single thread.

At present, the minor-related work can originate from many user contexts, and at least some of that processing happens concurrently (e.g. dmu_objset_find() - the actual reading from disk). So, I did not want to serialize I/O for multiple pools because that could result in various undesirable dependencies between pools (e.g. hung I/O on one pool hanging minor processing on other pools, etc.). I am not sure how to achieve this concurrency while guaranteeing the ordering of request processing without multiple queues.

It also seemed that for a single pool, it should be OK to have a single thread executing requests sequentially, since the consecutive dmu_objset_find()s will likely co-benefit from I/O caching. In my testing, I did not notice any issues with performance.

@bprotopopov
Contributor Author

@behlendorf, I have moved the make_async_arg() logic into zvol_async_dispatch(). I held off on KM_SLEEP vs. KM_PUSHPAGE, and I think it is a good idea to keep a queue per pool as discussed above. What's your take?

bprotopopov added a commit to bprotopopov/zfs that referenced this issue Sep 22, 2015
…n() and

dsl_pool_sync()...zvol_rename_minors()

Execute zvol_*_minor(s) functions asynchronously in taskqs (one
taskq per pool) in order to avoid having to deal with the locking
state of the caller.
@behlendorf
Contributor

@bprotopopov I see, that makes sense. For example, we wouldn't want it to be possible for a minor removal to be handled before its matching minor create, so we're limited to a single thread to keep things simple and sequential. I suspect a single thread for the entire system would be more than enough, but it's completely reasonable to create one per pool just in case. However, in that case I'd suggest hanging the taskq directly off the spa_t, creating it right away in spa_activate(), and then destroying it in spa_deactivate() to keep things simple. You'll need to decide if it's worth the added complexity.
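A minimal sketch of that suggestion, assuming a hypothetical spa_zvol_taskq field hung off the spa_t (the field and taskq names here are illustrative):

/* In spa_activate(): create the per-pool, single-threaded zvol taskq. */
spa->spa_zvol_taskq = taskq_create("z_zvol", 1, maxclsyspri,
    1, INT_MAX, TASKQ_PREPOPULATE);

/* In spa_deactivate(): wait for any outstanding tasks, then tear it down. */
if (spa->spa_zvol_taskq != NULL) {
	taskq_wait(spa->spa_zvol_taskq);
	taskq_destroy(spa->spa_zvol_taskq);
	spa->spa_zvol_taskq = NULL;
}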

The KM_SLEEP change is safe to make, so I'd go ahead with it. If it weren't, nothing would work in the latest code!

@bprotopopov
Contributor Author

@behlendorf, clarification: should I replace all KM_PUSHPAGE with KM_SLEEP in zvol.c?

@bprotopopov
Contributor Author

@behlendorf, I looked into following your advice regarding creating the queues in spa_activate(); that function is expected to succeed (it has no return value), yet taskq_create() could return NULL - many places in the code check for a non-NULL return. So, I could make taskq creation "best effort" and log a WARNING message if creation fails.

@bprotopopov
Contributor Author

I think it will be extremely rare that we cannot create a taskq at pool activation, and if we can't, there are other, more pressing issues to attend to.

bprotopopov added a commit to bprotopopov/zfs that referenced this issue Sep 23, 2015
…n() and

dsl_pool_sync()...zvol_rename_minors()

Execute zvol_*_minor(s) functions asynchronously in taskqs (one
taskq per pool) in order to avoid having to deal with the locking
state of the caller.
bprotopopov added a commit to bprotopopov/zfs that referenced this issue Sep 23, 2015
…n() and

dsl_pool_sync()...zvol_rename_minors()

Remove trylock of spa_namespace_lock as it is no longer needed;
the code paths that take zvol_state_lock no longer take
spa_namespace_lock prior to taking zvol_state_lock
bprotopopov added a commit to bprotopopov/zfs that referenced this issue Sep 23, 2015
…n() and

dsl_pool_sync()...zvol_rename_minors()

Create/destroy per-pool zvol asynchronous work taskqs in
spa_activate()/spa_deactivate(), remove explicit calls to
create/destroy taskqs at pool create/import/export/destroy.
Remove the zvol taskq list and rwlock in zvol.c
@bprotopopov
Contributor Author

So, if one creates task queues in spa_activate(), one later has to get at those queues at work dispatch time. One simple way of doing so (spa_open()) seems to deadlock (again) in zvol_rename/zvol_create paths while trying to get the spa_namespace_lock in spa_open_common().

On the other hand, I am not sure there is another way of getting at the taskq without modifying the callers of the zvol_*_minor() functions. I have the commits shown above.

The original code, with creation of taskqs on demand, dealt with this explicitly by keeping a list of queues for opened pools. Getting at those queues on demand seems to be problematic.

@behlendorf
Contributor

should I replace all KM_PUSHPAGE with KM_SLEEP in zvol.c ?

Yes.

I think it will be extremely rare if we cannot create a taskq at pool activation

Right, it should be exceptionally unlikely. And in fact this assumption is already made by spa_create_zio_taskqs() which is called by spa_activate().

Getting those queues on demand seems to be problematic.

I assumed the relevant spa_t pointer would already be available for all the zvol_*_minor() callers. If it is, then passing it as an argument seems entirely reasonable to me and is what I had in mind. As for spa_open(), we should avoid using that if possible; that route leads to more locking headaches and deadlocks, as you discovered.

@bprotopopov
Contributor Author

Unfortunately, I found that associating task queues with the pool structures has a major issue.

It does not work for the cases of pool export/destruction, where we also need to remove the minors for those pools after they are exported/destroyed (both imply removal from the pool namespace). That removal implies that a valid pointer to the pool structure can no longer be obtained.

In fact, were the tests to not deadlock, I would have quickly found that zvol_async_remove_minors() in

static int
zfs_ioc_pool_destroy(zfs_cmd_t *zc)
{
	int error;

	zfs_log_history(zc);
	error = spa_destroy(zc->zc_name);
	if (error == 0)
		zvol_async_remove_minors(zc->zc_name);
	return (error);
}

is guaranteed to fail, because spa_destroy() has already called spa_remove() on the pool that would contain the pointer to the task queue.

I think that given the ideas we have tried, it is advisable to go with the original approach of explicit tracking of per-pool queues in the zvol code (commit e390e94). The additional complexity seems minimal compared to the concerns raised by interacting with the pool calls and structures.
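For context, the explicit-tracking approach keeps the per-pool taskqs in a list inside zvol.c, along the lines of the sketch below; the structure and names here are illustrative, not necessarily the exact contents of commit e390e94:

typedef struct zvol_pool_taskq {
	list_node_t	zpt_node;
	char		zpt_pool[MAXNAMELEN];
	taskq_t		*zpt_tq;	/* single-threaded taskq for this pool */
} zvol_pool_taskq_t;

static list_t zvol_pool_taskqs;		/* protected by zvol_taskq_rwlock */
static krwlock_t zvol_taskq_rwlock;

/* Look up the taskq for a pool before dispatching; creation/removal handled elsewhere. */
static taskq_t *
zvol_pool_taskq_lookup(const char *pool)
{
	zvol_pool_taskq_t *zpt;

	rw_enter(&zvol_taskq_rwlock, RW_READER);
	for (zpt = list_head(&zvol_pool_taskqs); zpt != NULL;
	    zpt = list_next(&zvol_pool_taskqs, zpt)) {
		if (strcmp(zpt->zpt_pool, pool) == 0)
			break;
	}
	rw_exit(&zvol_taskq_rwlock);

	return (zpt != NULL ? zpt->zpt_tq : NULL);
}

Because nothing here depends on the spa_t lifetime, the queues remain reachable even after pool export/destroy has removed the spa from the namespace.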

bprotopopov added a commit to bprotopopov/zfs that referenced this issue Sep 24, 2015
…n() and

dsl_pool_sync()...zvol_rename_minors()

Execute zvol_*_minor(s) functions asynchronously in taskqs (one
taskq per pool) in order to avoid having to deal with the locking
state of the caller.
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Mar 10, 2016
zfsonlinux issue openzfs#3681 - lock order inversion between zvol_open() and
dsl_pool_sync()...zvol_rename_minors()

Remove trylock of spa_namespace_lock as it is no longer needed when
zvol minor operations are performed in a separate context with no
prior locking state; the spa_namespace_lock is no longer held
when bdev->bd_mutex or zfs_state_lock might be taken in the code
paths originating from the zvol minor operation callbacks.

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#3681
nedbass pushed a commit that referenced this issue Mar 23, 2016
zfsonlinux issue #2217 - zvol minor operations: check snapdev
property before traversing snapshots of a dataset

zfsonlinux issue #3681 - lock order inversion between zvol_open()
and dsl_pool_sync()...zvol_rename_minors()

Create a per-pool zvol taskq for asynchronous zvol tasks.
There are a few key design decisions to be aware of.

* Each taskq must be single threaded to ensure tasks are always
  processed in the order in which they were dispatched.

* There is a taskq per-pool in order to keep the pools independent.
  This way if one pool is suspended it will not impact another.

* The preferred location to dispatch a zvol minor task is a sync
  task.  In this context there is easy access to the spa_t and
  minimal error handling is required because the sync task must
  succeed.

Support for asynchronous zvol minor operations address issue #3681.

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2217
Closes #3678
Closes #3681
nedbass pushed a commit that referenced this issue Mar 23, 2016
zfsonlinux issue #3681 - lock order inversion between zvol_open() and
dsl_pool_sync()...zvol_rename_minors()

Remove trylock of spa_namespace_lock as it is no longer needed when
zvol minor operations are performed in a separate context with no
prior locking state; the spa_namespace_lock is no longer held
when bdev->bd_mutex or zfs_state_lock might be taken in the code
paths originating from the zvol minor operation callbacks.

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3681
behlendorf modified the milestones: 0.6.5.6, 0.7.0 Mar 23, 2016
@beloncfy

beloncfy commented Oct 3, 2018

This issue still remains unresolved. I just encountered this error last night on ZoL 0.7.11 during send and receive; this is an all-SSD system running CentOS 7.5 x64.

Oct 2 15:03:03 node2 kernel: INFO: task z_zvol:880 blocked for more than 120 seconds.
Oct 2 15:03:03 node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 2 15:03:03 node2 kernel: z_zvol D ffff9a6508a2bf40 0 880 2 0x00000000
Oct 2 15:03:03 node2 kernel: Call Trace:
Oct 2 15:03:03 node2 kernel: [] schedule_preempt_disabled+0x29/0x70
Oct 2 15:03:03 node2 kernel: [] __mutex_lock_slowpath+0xc7/0x1d0
Oct 2 15:03:03 node2 kernel: [] mutex_lock+0x1f/0x2f
Oct 2 15:03:03 node2 kernel: [] dmu_objset_from_ds+0x3d/0x170 [zfs]
Oct 2 15:03:03 node2 kernel: [] dmu_objset_hold+0x86/0xd0 [zfs]
Oct 2 15:03:03 node2 kernel: [] dsl_prop_get+0x44/0xa0 [zfs]
Oct 2 15:03:03 node2 kernel: [] dsl_prop_get_integer+0x1e/0x20 [zfs]
Oct 2 15:03:03 node2 kernel: [] zvol_create_minors_cb+0x3b/0x140 [zfs]
Oct 2 15:03:03 node2 kernel: [] ? rrw_exit+0x60/0x170 [zfs]
Oct 2 15:03:03 node2 kernel: [] ? zvol_prefetch_minors_impl+0x90/0x90 [zfs]
Oct 2 15:03:03 node2 kernel: [] dmu_objset_find_impl+0x115/0x430 [zfs]
Oct 2 15:03:03 node2 kernel: [] ? zvol_prefetch_minors_impl+0x90/0x90 [zfs]
Oct 2 15:03:03 node2 kernel: [] ? zvol_prefetch_minors_impl+0x90/0x90 [zfs]
Oct 2 15:03:03 node2 kernel: [] dmu_objset_find+0x58/0x90 [zfs]
Oct 2 15:03:03 node2 kernel: [] zvol_task_cb+0x51b/0x5a0 [zfs]
Oct 2 15:03:03 node2 kernel: [] ? spl_kmem_free+0x35/0x40 [spl]
Oct 2 15:03:03 node2 kernel: [] taskq_thread+0x2ac/0x4f0 [spl]
Oct 2 15:03:03 node2 kernel: [] ? wake_up_state+0x20/0x20
Oct 2 15:03:03 node2 kernel: [] ? taskq_thread_spawn+0x60/0x60 [spl]
Oct 2 15:03:03 node2 kernel: [] kthread+0xd1/0xe0
Oct 2 15:03:03 node2 kernel: [] ? insert_kthread_work+0x40/0x40
Oct 2 15:03:03 node2 kernel: [] ret_from_fork_nospec_begin+0x21/0x21
Oct 2 15:03:03 node2 kernel: [] ? insert_kthread_work+0x40/0x40
Oct 2 15:03:03 node2 kernel: INFO: task txg_sync:1106 blocked for more than 120 seconds.
Oct 2 15:03:03 node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 2 15:03:03 node2 kernel: txg_sync D ffff9a64ff2ebf40 0 1106 2 0x00000000
Oct 2 15:03:03 node2 kernel: Call Trace:
Oct 2 15:03:03 node2 kernel: [] schedule+0x29/0x70
Oct 2 15:03:03 node2 kernel: [] cv_wait_common+0x125/0x150 [spl]
Oct 2 15:03:03 node2 kernel: [] ? wake_up_atomic_t+0x30/0x30
Oct 2 15:03:03 node2 kernel: [] __cv_wait+0x15/0x20 [spl]
Oct 2 15:03:03 node2 kernel: [] rrw_enter_write+0x45/0xb0 [zfs]
Oct 2 15:03:03 node2 kernel: [] rrw_enter+0x13/0x30 [zfs]
Oct 2 15:03:03 node2 kernel: [] dsl_sync_task_sync+0x8d/0x130 [zfs]
Oct 2 15:03:03 node2 kernel: [] dsl_pool_sync+0x2eb/0x440 [zfs]
Oct 2 15:03:03 node2 kernel: [] spa_sync+0x437/0xd90 [zfs]
Oct 2 15:03:03 node2 kernel: [] ? default_wake_function+0x12/0x20
Oct 2 15:03:03 node2 kernel: [] ? __mutex_unlock_slowpath+0x70/0x90
Oct 2 15:03:03 node2 kernel: [] txg_sync_thread+0x301/0x510 [zfs]
Oct 2 15:03:03 node2 kernel: [] ? txg_fini+0x2a0/0x2a0 [zfs]
Oct 2 15:03:03 node2 kernel: [] thread_generic_wrapper+0x73/0x80 [spl]
Oct 2 15:03:03 node2 kernel: [] ? __thread_exit+0x20/0x20 [spl]
Oct 2 15:03:03 node2 kernel: [] kthread+0xd1/0xe0
Oct 2 15:03:03 node2 kernel: [] ? insert_kthread_work+0x40/0x40
Oct 2 15:03:03 node2 kernel: [] ret_from_fork_nospec_begin+0x21/0x21
Oct 2 15:03:03 node2 kernel: [] ? insert_kthread_work+0x40/0x40
Oct 2 15:03:03 node2 kernel: INFO: task zfs:13696 blocked for more than 120 seconds.
Oct 2 15:03:03 node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 2 15:03:03 node2 kernel: zfs D ffff9a635e731fa0 0 13696 13695 0x00000080

@bprotopopov
Contributor Author

It seems to me that the right direction here is to move the minor management code outside of any external locking domains.

@behlendorf, what do you think about running the minor management tasks in a dedicated thread instead of the pool taskqs?

@behlendorf
Contributor

Adding a dedicated thread would be fine, but it's not clear to me from the stacks above that it actually would resolve the issue. We already do this all asynchronously in a taskq thread, and in most cases without blocking or waiting. Could you explain further what you had in mind?

@bprotopopov
Contributor Author

@behlendorf, I am referring to the pool config lock taken prior to running the minor management code. An external thread would not need to do that, and the lock order inversion would not occur, unless I am missing something.

@behlendorf
Contributor

Could you propose a patch with what you're proposing? I think that would probably be clearest.

@bprotopopov
Contributor Author

Will try to get to this as soon as I can.

@bprotopopov
Contributor Author

@beloncfy, could you please clarify how you see clear evidence of a deadlock (between the two contexts whose traces are shown)?

@beloncfy

@beloncfy, could you please clarify how you see clear evidence of a deadlock (between the two contexts whose traces are shown)?

There was zero I/O (read or write) on my ZFS, my server load average increased to > 200 without any CPU usage, and the zfs list command was hanging as well.

It has happened a few times on my other servers as well, with different versions of ZoL.

I guess it was triggered by the zfs receive; my servers were doing send and receive quite aggressively (every 2 minutes) and concurrently (more than 5 volumes simultaneously). The system also has some VMs running on it.

I have tried reducing the send and receive frequency to 20 minutes and the concurrency to 1 volume per job, and so far so good.

@bprotopopov
Contributor Author

@beloncfy, I get a sense that there are more threads that might be involved; can you dump thread traces with crash, perhaps, and attach them to the ticket?

ahrens added a commit to ahrens/zfs that referenced this issue Jan 24, 2020
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE).  This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...).  These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.

After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream).  This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:

1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset.  This
causes the 2nd receive to fail.

2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl).  This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.

3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.

To address these problems, this change synchronously creates the minor
node.  To avoid the lock ordering problems that the asynchrony was
introduced to fix (see openzfs#3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.

Implementation notes:

We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys.  The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.

Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set.  We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.

There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems.  E.g. changing the volmode or snapdev
properties.  These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations.  In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.

The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem.  However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External issue: DLPX-65948
Closes openzfs#7863
ahrens added a commit to ahrens/zfs that referenced this issue Jan 24, 2020
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE).  This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...).  These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.

After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream).  This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:

1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset.  This
causes the 2nd receive to fail.

2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl).  This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.

3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.

To address these problems, this change synchronously creates the minor
node.  To avoid the lock ordering problems that the asynchrony was
introduced to fix (see openzfs#3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.

Implementation notes:

We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys.  The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.

Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set.  We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.

There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems.  E.g. changing the volmode or snapdev
properties.  These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations.  In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.

The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem.  However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes openzfs#7863
ahrens added a commit to ahrens/zfs that referenced this issue Jan 24, 2020
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE).  This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...).  These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.

After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream).  This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:

1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset.  This
causes the 2nd receive to fail.

2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl).  This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.

3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.

To address these problems, this change synchronously creates the minor
node.  To avoid the lock ordering problems that the asynchrony was
introduced to fix (see openzfs#3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.

Implementation notes:

We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys.  The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.

Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set.  We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.

There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems.  E.g. changing the volmode or snapdev
properties.  These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations.  In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.

The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem.  However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes openzfs#7863
ahrens added a commit to ahrens/zfs that referenced this issue Jan 24, 2020
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE).  This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...).  These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.

After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream).  This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:

1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset.  This
causes the 2nd receive to fail.

2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl).  This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.

3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.

To address these problems, this change synchronously creates the minor
node.  To avoid the lock ordering problems that the asynchrony was
introduced to fix (see openzfs#3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.

Implementation notes:

We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys.  The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.

Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set.  We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.

There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems.  E.g. changing the volmode or snapdev
properties.  These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations.  In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.

The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem.  However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes openzfs#7863
behlendorf pushed a commit that referenced this issue Feb 3, 2020
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE).  This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...).  These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.

After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream).  This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:

1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset.  This
causes the 2nd receive to fail.

2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl).  This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.

3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.

To address these problems, this change synchronously creates the minor
node.  To avoid the lock ordering problems that the asynchrony was
introduced to fix (see #3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.

Implementation notes:

We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys.  The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.

Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set.  We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.

There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems.  E.g. changing the volmode or snapdev
properties.  These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations.  In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.

The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem.  However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes #7863
Closes #9885
ahrens added a commit to ahrens/zfs that referenced this issue Feb 14, 2020
…eceiveTest failed due to "DelphixFatalException: unexpected ZFS error: EBUSY on..."

Cherry-pick of the following commit from master.  For the backport, I
had to move some of the code around because the zvol os-specific
restructuring is not in 6.0.

async zvol minor node creation interferes with receive

When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE).  This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...).  These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.

After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream).  This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:

1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset.  This
causes the 2nd receive to fail.

2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl).  This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.

3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.

To address these problems, this change synchronously creates the minor
node.  To avoid the lock ordering problems that the asynchrony was
introduced to fix (see openzfs#3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.

Implementation notes:

We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys.  The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.

Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set.  We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.

There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems.  E.g. changing the volmode or snapdev
properties.  These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations.  In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.

The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem.  However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes openzfs#7863
Closes openzfs#9885
uqs pushed a commit to freebsd/freebsd-src that referenced this issue Jun 12, 2020
With this change all ZVOL updates are initiated from the SPA sync
context instead of a mix of the sync and open contexts.  The updates are
queued to be applied by a dedicated thread in the original order.  This
should ensure that ZVOLs always accurately reflect the corresponding
datasets.  ZFS ioctl operations wait on the mentioned thread to complete
its work.  Thus, the illusion of the synchronous ZVOL update is
preserved.  At the same time, the SPA sync thread never blocks on ZVOL
related operations avoiding problems like reported in bug 203864.

This change is based on earlier work in the same direction: D7179 and
D14669 by Anthoine Bourgeois.  D7179 tried to perform ZVOL operations
in the open context and that opened races between them.  D14669 uses a
design very similar to this change but with different implementation
details.

This change also heavily borrows from similar code in ZoL, but there are
many differences too.  See:
- openzfs/zfs@a0bd735
- openzfs/zfs#3681
- openzfs/zfs#2217

PR:		203864
MFC after:	5 weeks
Sponsored by:	CyberSecure
Differential Revision: https://reviews.freebsd.org/D23478


git-svn-id: svn+ssh://svn.freebsd.org/base/head@362047 ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
uqs pushed a commit to freebsd/freebsd-src that referenced this issue Jun 12, 2020
With this change all ZVOL updates are initiated from the SPA sync
context instead of a mix of the sync and open contexts.  The updates are
queued to be applied by a dedicated thread in the original order.  This
should ensure that ZVOLs always accurately reflect the corresponding
datasets.  ZFS ioctl operations wait on the mentioned thread to complete
its work.  Thus, the illusion of the synchronous ZVOL update is
preserved.  At the same time, the SPA sync thread never blocks on ZVOL
related operations avoiding problems like reported in bug 203864.

This change is based on earlier work in the same direction: D7179 and
D14669 by Anthoine Bourgeois.  D7179 tried to perform ZVOL operations
in the open context and that opened races between them.  D14669 uses a
design very similar to this change but with different implementation
details.

This change also heavily borrows from similar code in ZoL, but there are
many differences too.  See:
- openzfs/zfs@a0bd735
- openzfs/zfs#3681
- openzfs/zfs#2217

PR:		203864
MFC after:	5 weeks
Sponsored by:	CyberSecure
Differential Revision: https://reviews.freebsd.org/D23478
brooksdavis pushed a commit to CTSRD-CHERI/cheribsd that referenced this issue Jun 12, 2020
With this change all ZVOL updates are initiated from the SPA sync
context instead of a mix of the sync and open contexts.  The updates are
queued to be applied by a dedicated thread in the original order.  This
should ensure that ZVOLs always accurately reflect the corresponding
datasets.  ZFS ioctl operations wait on the mentioned thread to complete
its work.  Thus, the illusion of the synchronous ZVOL update is
preserved.  At the same time, the SPA sync thread never blocks on ZVOL
related operations avoiding problems like reported in bug 203864.

This change is based on earlier work in the same direction: D7179 and
D14669 by Anthoine Bourgeois.  D7179 tried to perform ZVOL operations
in the open context and that opened races between them.  D14669 uses a
design very similar to this change but with different implementation
details.

This change also heavily borrows from similar code in ZoL, but there are
many differences too.  See:
- openzfs/zfs@a0bd735
- openzfs/zfs#3681
- openzfs/zfs#2217

PR:		203864
MFC after:	5 weeks
Sponsored by:	CyberSecure
Differential Revision: https://reviews.freebsd.org/D23478
mat813 pushed a commit to mat813/freebsd that referenced this issue Jun 15, 2020
With this change all ZVOL updates are initiated from the SPA sync
context instead of a mix of the sync and open contexts.  The updates are
queued to be applied by a dedicated thread in the original order.  This
should ensure that ZVOLs always accurately reflect the corresponding
datasets.  ZFS ioctl operations wait on the mentioned thread to complete
its work.  Thus, the illusion of the synchronous ZVOL update is
preserved.  At the same time, the SPA sync thread never blocks on ZVOL
related operations avoiding problems like reported in bug 203864.

This change is based on earlier work in the same direction: D7179 and
D14669 by Anthoine Bourgeois.  D7179 tried to perform ZVOL operations
in the open context and that opened races between them.  D14669 uses a
design very similar to this change but with different implementation
details.

This change also heavily borrows from similar code in ZoL, but there are
many differences too.  See:
- openzfs/zfs@a0bd735
- openzfs/zfs#3681
- openzfs/zfs#2217

PR:		203864
MFC after:	5 weeks
Sponsored by:	CyberSecure
Differential Revision: https://reviews.freebsd.org/D23478


git-svn-id: https://svn.freebsd.org/base/head@362047 ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
shivankgarg98 pushed a commit to shivankgarg98/freebsd that referenced this issue Jun 15, 2020
With this change all ZVOL updates are initiated from the SPA sync
context instead of a mix of the sync and open contexts.  The updates are
queued to be applied by a dedicated thread in the original order.  This
should ensure that ZVOLs always accurately reflect the corresponding
datasets.  ZFS ioctl operations wait on the mentioned thread to complete
its work.  Thus, the illusion of the synchronous ZVOL update is
preserved.  At the same time, the SPA sync thread never blocks on ZVOL
related operations avoiding problems like reported in bug 203864.

This change is based on earlier work in the same direction: D7179 and
D14669 by Anthoine Bourgeois.  D7179 tried to perform ZVOL operations
in the open context and that opened races between them.  D14669 uses a
design very similar to this change but with different implementation
details.

This change also heavily borrows from similar code in ZoL, but there are
many differences too.  See:
- openzfs/zfs@a0bd735
- openzfs/zfs#3681
- openzfs/zfs#2217

PR:		203864
MFC after:	5 weeks
Sponsored by:	CyberSecure
Differential Revision: https://reviews.freebsd.org/D23478
jsai20 pushed a commit to jsai20/zfs that referenced this issue Mar 30, 2021
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE).  This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...).  These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.

After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream).  This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:

1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset.  This
causes the 2nd receive to fail.

2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl).  This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.

3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.

To address these problems, this change synchronously creates the minor
node.  To avoid the lock ordering problems that the asynchrony was
introduced to fix (see openzfs#3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.

Implementation notes:

We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys.  The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.

Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set.  We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.

There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems.  E.g. changing the volmode or snapdev
properties.  These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations.  In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.

The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem.  However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes openzfs#7863
Closes openzfs#9885