
issues #2217, #3681 - set of commits dealing with zvol_*minor*() processing #3830

Closed
wants to merge 2 commits

Conversation

bprotopopov
Contributor

Two issues are being addressed: the performance of dmu_objset_find() scans called from zvol_*minor*() operations, and a deadlock caused by lock order inversion when zvol_*minor*() work runs in the context of various callers with complex pre-existing locking state (multiple locks taken in a specific order).

The first issue is addressed by performing the dmu_objset_find() scan in two stages:

  1. scan for zvols only
  2. check the zvol's snapdev property and scan its snapshots only if snapdev is 'visible' (see the sketch below)
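
A minimal sketch of this two-stage approach, assuming the usual ZoL dmu_objset_find()/dsl_prop_get_integer() interfaces; the *_sketch callback names are hypothetical, and this is not the code in these commits:

#include <sys/dmu_objset.h>
#include <sys/dsl_prop.h>
#include <sys/fs/zfs.h>
#include <sys/zvol.h>

/* Stage 2 helper: create a minor for a single snapshot. */
static int
zvol_create_snap_minor_cb_sketch(const char *snapname, void *arg)
{
	(void) zvol_create_minor(snapname);
	return (0);
}

/* Stage 1 callback: invoked once per dataset found in the pool. */
static int
zvol_create_minor_cb_sketch(const char *dsname, void *arg)
{
	uint64_t snapdev;

	/* Create the minor for the zvol itself. */
	(void) zvol_create_minor(dsname);

	/* Traverse this zvol's snapshots only when they should be visible. */
	if (dsl_prop_get_integer(dsname, "snapdev", &snapdev, NULL) != 0 ||
	    snapdev != ZFS_SNAPDEV_VISIBLE)
		return (0);

	return (dmu_objset_find((char *)dsname,
	    zvol_create_snap_minor_cb_sketch, NULL, DS_FIND_SNAPSHOTS));
}

The top-level scan would then call dmu_objset_find() with DS_FIND_CHILDREN and the stage-1 callback, so the expensive snapshot traversal is skipped for every zvol whose snapdev is 'hidden'.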

The second issue is resolved by performing zvol_*minor*() operations asynchronously (via a taskq) in a context that is free of any pre-existing locking state. This allows for much simpler deadlock analysis and a (hopefully) deadlock-free implementation. The implementation includes two commits: commit 3586fa9 with the bulk of the asynchronous work support, and commit bf7f92e that removes the trylock() of the spa_namespace_lock, which is no longer needed.
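
To make the asynchronous model concrete, here is a hedged sketch of a per-pool, single-threaded taskq and a dispatch helper, assuming the SPL taskq and kmem APIs; the structure name and the *_sketch functions are made up for illustration and are not the commit's code:

#include <sys/taskq.h>
#include <sys/kmem.h>
#include <sys/sysmacros.h>
#include <sys/zvol.h>

typedef struct zvol_task_sketch {
	char	zt_name[MAXNAMELEN];	/* dataset the operation applies to */
} zvol_task_sketch_t;

/* One single-threaded taskq per pool: pools stay independent, and tasks
 * are processed strictly in dispatch order (create/rename/remove must
 * not be reordered). */
static taskq_t *
zvol_taskq_create_sketch(void)
{
	return (taskq_create("z_zvol", 1, maxclsyspri, 1, INT_MAX,
	    TASKQ_PREPOPULATE));
}

/* Worker: runs in taskq context, with no caller locks held. */
static void
zvol_minors_create_task_sketch(void *arg)
{
	zvol_task_sketch_t *task = arg;

	(void) zvol_create_minor(task->zt_name);
	kmem_free(task, sizeof (*task));
}

/* Dispatch helper: the caller (e.g. a sync task) queues the work and returns. */
static void
zvol_minors_create_async_sketch(taskq_t *tq, const char *name)
{
	zvol_task_sketch_t *task = kmem_alloc(sizeof (*task), KM_SLEEP);

	(void) strlcpy(task->zt_name, name, sizeof (task->zt_name));
	(void) taskq_dispatch(tq, zvol_minors_create_task_sketch, task,
	    TQ_SLEEP);
}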

@odoucet

odoucet commented Jan 21, 2016

Hi @bprotopopov, any update on this pull request? I can help test it if you rebase it on the current master.

@kernelOfTruth
Contributor

@odoucet a not-so-recent rebase is at #4131

@kernelOfTruth
Contributor

@odoucet keep in mind that there might be some issues lurking in these patches in their current state (sempervictus had them in his patch stack and saw possible data integrity or corruption issues), so unless the buildbot test runs pass I wouldn't use them in production or even for testing

@bprotopopov
Contributor Author

Sure, I will rebase asap. Will go through the discussion to see the source of potential data integrity concerns (in openzfs/spl#514 I presume).

@bprotopopov
Contributor Author

I would be very interested in seeing an example of such failures. I've been using these patches for a while and did not observe any issues to date.


@kernelOfTruth
Contributor

@bprotopopov there's an error message referencing one of these commits from @sempervictus at openzfs/spl#514 (comment), openzfs/spl#514 (comment)

particularly openzfs/spl#514 (comment)

it might be an incompatibility with the ABD2 patch stack, so when not combining it with the ABD2 patch stack on top of zfs master it is presumably safe (unless the buildbots state otherwise)

Sorry for making it sound like these patches eat data for breakfast ;)

No blame was originally intended

@bprotopopov
Contributor Author

Not a problem, @kernelOfTruth :)

So, this patch deals with interfacing with Linux udev - creating, removing, renaming device links for zvols and snapshots. Originally, this was done synchronously, but that approach had issues, e.g. deadlocks.

Since Linux udev is already asynchronous, I thought it would make sense to propagate that asynchrony to the zvol_*minor*() operations and, in doing so, simplify resolution of the locking issues. If there are issues with, say, not being able to dispatch some operations, then at worst this could result in missing or dangling device links for zvols or snapshots, but to my knowledge it could not cause zvol data integrity issues.

The error message you are referring to is intended to indicate that there was an operation "create minor" dispatched for a zvol in a pool, but when the processing commenced, the pool was already gone (perhaps exported or destroyed). Yet looking at the error message, I can see that the pool name in the argument seems wrong, perhaps corrupted - pool names cannot contain '@' characters to my knowledge.

If there is a way to reproduce this issue, I would like to look into it further to see what is going on.

Best regards.

@behlendorf
Contributor

@bprotopopov thanks for improving this. I'd like to spend some time going over it, and it would be helpful if you could rebase it against master.

@bprotopopov
Contributor Author

Rebased against the tip of master and ran quick tests (ztest).

@bprotopopov
Contributor Author

@behlendorf, I noticed some failures in zconfig.sh reported above, so I tried to run it, and based on the error messages I got, I conclude that the test makes some assumptions about where the build artifacts are. In my case, they are not there (I simply built and installed the RPMs: make rpm-utils rpm-kmod, yum localinstall *).

Can you please explain how to build spl/zfs so that zconfig.sh knows where to find the zfs/spl modules to load?

@bprotopopov
Contributor Author

Ah, never mind, @behlendorf, if I run

# /usr/share/zfs/zconfig.sh 

the test finds the modules.

@bprotopopov
Contributor Author

I have CentOS 6.5 handy; it runs the test cleanly:

[root@centos-6 ~]# cat /etc/redhat-release 
CentOS release 6.5 (Final)
[root@centos-6 ~]# /usr/share/zfs/zconfig.sh -c -s10
1    persistent zpool.cache             Pass
2    scan disks for pools to import     Pass
3    zpool import/export device         Pass
4    zpool insmod/rmmod device          Pass
5    zvol+ext2 volume                   Pass
6    zvol+ext2 snapshot                 Pass
7    zvol+ext2 clone                    Pass
8    zfs send/receive                   Pass
9    zpool events                       Pass
10   zpool add/remove vdev              Skip
[root@centos-6 ~]# 

@behlendorf
Contributor

@bprotopopov all the tests are run by first installing the zfs packages into the image so it's just like a real user system, then grabbing and running a given test script from the zfsonlinux/zfs-buildbot repository, which is shown at the top of the test output in buildbot. This way you should be able to easily reproduce a failure on your system.

In this case wget the bb-test-zconfig.sh and try running it locally.

wget https://raw.githubusercontent.com/zfsonlinux/zfs-buildbot/master/scripts/bb-test-zconfig.sh
./bb-test-zconfig.sh

You can run some of these scripts in-tree, but be aware they often depend on having the udev rules installed to create the devices. In this case it looks like a slight change in behavior may be causing the issue.

@bprotopopov
Contributor Author

Ran the test on CentOS 6.7; it runs clean.

[root@centos-6 build]# /usr/share/zfs/zconfig.sh -c -v -s10
Destroying 
1    persistent zpool.cache             Pass
2    scan disks for pools to import     Pass
3    zpool import/export device         Pass
4    zpool insmod/rmmod device          Pass
5    zvol+ext2 volume                   Pass
6    zvol+ext2 snapshot                 Pass
7    zvol+ext2 clone                    Pass
8    zfs send/receive                   Pass
9    zpool events                       Pass
10   zpool add/remove vdev              Skip
[root@centos-6 build]# cat /etc/redhat-release 
CentOS release 6.7 (Final)
[root@centos-6 build]#

Not sure why it does not pass in the automated runs.

@bprotopopov
Contributor Author

Hi, @behlendorf
so I did wget and ran the tests:

[build@centos-6 ~]$ ./bb-test-zconfig.sh 
+ TEST_ZCONFIG_OPTIONS=-c
+ sudo -E dmesg -c
+ sudo -E zfs.sh
+ CHILD=510
+ wait 510
+ sudo -E zconfig.sh -c
1    persistent zpool.cache             Pass
2    scan disks for pools to import     Pass
3    zpool import/export device         Pass
4    zpool insmod/rmmod device          Pass
5    zvol+ext2 volume                   Pass
6    zvol+ext2 snapshot                 Pass
7    zvol+ext2 clone                    Pass
8    zfs send/receive                   Pass
9    zpool events                       Pass
10   zpool add/remove vdev              Pass
+ RESULT=0
+ exit 0
+ cleanup
+ dmesg
+ sudo -E zfs.sh -u
+ dmesg
[build@centos-6 ~]$ 

I accidentally left test 10 enabled; it passed as well. I am running the tests on a VM. Could this be an issue?

@behlendorf
Contributor

@bprotopopov the buildbot runs all the tests in VMs. Let me pull the patch and see if I can reproduce it locally.

@behlendorf
Contributor

I was able to reproduce the issue under CentOS 7.1. It appears that despite snapdev=visible being set, the snapshots were not created under /dev/zvol/ after import, which makes sense since exit code 11 indicates missing devices according to the script.

$ sudo /usr/share/zfs/zconfig.sh -c
1    persistent zpool.cache             Pass
2    scan disks for pools to import     Pass
3    zpool import/export device      Fail (11)
[behlendo@ovirt-guest-221 zfs]$ ls /dev/zvol/tank/
clone  clone-part1  clone-part2  volume  volume-part1  volume-part2

I'm not sure why you weren't able to reproduce this, nor have I carefully looked at the patch to determine why this happens.

@bprotopopov
Contributor Author

@behlendorf, yes, that seems strange. I'll look through the script to see what could be going on.

@bprotopopov
Contributor Author

@behlendorf, I believe I understand what might be happening here.

My changes make the scanning for zvols/snapshots that are to be presented with device links asynchronous, i.e., asynchronous with respect to the completion of zpool import, for instance. So, even if one does
udevadm trigger
udevadm settle
one is not guaranteed to see the links right after the import. If, for some reason, the udevadm commands run before the scan for the zvols/snapshots and the generation of udev events is complete, the udevadm commands have nothing to do, yet the device links are not there.

Looking at test_3(), when snapdev is set the first time, a few commands run before the test uses the links, and it works; but the second time, the test tries to use the device links right after import, and that may fail if the snapdev_set() dispatch is delayed.

The two reasons I did this are as follows. First, I wanted to make all the minor-related operations asynchronous and to cleanly deal with the deadlock between zvol_open() and zfs_rename() (as well as other similar deadlocks); the zvol_*minor*() and zvol_set_snapdev() operations call dmu_objset_find(), which takes all sorts of locks in all sorts of orders. Second, doing such things asynchronously speeds up operations like import quite significantly if the pool has a large DSL namespace and lots of zvols/snapshots all over the place.

On the other hand, this does seem to have an unexpected effect on the guarantees provided by udevadm settle. (Those guarantees, of course, are conditional on synchronous generation of udev events by previously run commands, as illustrated by the failure in question.) To address this, I can think of a way to implement an analog of udev trigger/settle for the zvol_*minor*() and zvol_set_snapdev() operations, e.g. 'zpool settle [poolname]', that could be run before udevadm trigger/settle. The former would guarantee that all previously posted udev event generation requests have been processed and the events generated, while the latter would guarantee that all the rules for those events were processed.
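
A minimal sketch of what such a settle primitive could look like, assuming the SPL taskq_wait() barrier and the per-pool zvol taskq described above; spa_zvol_taskq_sketch() is a hypothetical accessor, not an existing function:

#include <sys/taskq.h>
#include <sys/spa.h>

extern taskq_t *spa_zvol_taskq_sketch(spa_t *spa);	/* hypothetical accessor */

/* Block until every zvol minor task dispatched so far for this pool has
 * completed, so that all corresponding udev events have been generated. */
void
zvol_minors_settle_sketch(spa_t *spa)
{
	taskq_t *tq = spa_zvol_taskq_sketch(spa);

	if (tq != NULL)
		taskq_wait(tq);
}

Running such a barrier before udevadm trigger/settle would restore the old synchronous guarantee for scripts like zconfig.sh.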

What do you think?

@behlendorf
Contributor

@bprotopopov I think we definitely want to leave this as an asynchronous operation for all the reasons you mentioned. Anyone using Linux these days should already be familiar with the fact that block devices can, and do, appear asynchronously when they're good and ready. If that's all that's going on here I think it would be reasonable to update the zconfig test case to account for this possible behavior.

However, I don't think that's all that's going on here. It's been almost two hours now and the VM where I reproduced the issue still hasn't created those devices. On top of that, it's fairly clear the kernel never registered them because they don't appear under /sys/block/.

It feels more like snapdev was somehow set to hidden when the devices should have been registered. Sorry, I should have included this in my original comment.

As an aside, if you need to, you can modify scripts/zconfig.sh in your patch and it'll test using that version.

@bprotopopov
Contributor Author

Yes, two hours is definitely a problem. Is there a way I can get the dmesg or system logs? Is the snapdev property still set on the zvol?
Anyway, I need to figure out how to reproduce this. I will put a delay in the taskq worker thread code to see if this widens the race window.

@behlendorf
Contributor

The console link in the buildbot results will have the dmesg output for that test. I'll try to take a closer look into what's happening tomorrow since I was able to reproduce it.

@bprotopopov
Contributor Author

@behlendorf, I was able to reproduce the issue after inserting a 1-second delay in the worker thread function, effectively delaying processing of the async requests. I am looking into it.

@bprotopopov
Contributor Author

@behlendorf, this is what I have come up with so far.
I have addressed the dev link removal and re-creation, but I decided to leave the dmu_objset_find_dp() call in the sync context, thinking that the common use cases are not affected by this.
I put the changes in a separate commit for easy review.

@bprotopopov force-pushed the issue-3681 branch 3 times, most recently from 2e4fc08 to 418852a on February 23, 2016, 21:39
@bprotopopov
Contributor Author

Added Copyright lines in the comments.

@odoucet

odoucet commented Feb 24, 2016

I can see build issues, but I guess this is not related to this specific commit. With some work already committed on master, can we expect this PR to be committed as well? Are there any deadlocks remaining?

@behlendorf
Contributor

@odoucet yes, that's the plan, hopefully after one last round of review. @bprotopopov can you do a quick rebase onto master and drop the one patch which has already been merged?

@bprotopopov
Contributor Author

@behlendorf, done; the only conflict was in the comments. I can also squash 54718f5 with d2b67a3 if that is preferred.

@bprotopopov
Contributor Author

I think it does make sense to squash. I only kept the commits separate for review purposes.

@bprotopopov
Contributor Author

@behlendorf I have rebased, squashed, and dropped the extra patch.

@bprotopopov
Contributor Author

Pushing to prompt the test cycle.

@bprotopopov
Contributor Author

@behlendorf, which branch in zfsonlinux/zfs should I be rebasing against? Is it master or a release branch?

@odoucet

odoucet commented Mar 3, 2016

You need to rebase on the master branch.

@bprotopopov
Contributor Author

OK, thanks, @odoucet.

property before traversing snapshots of a dataset

zfsonlinux issue openzfs#3681 - lock order inversion between zvol_open()
and dsl_pool_sync()...zvol_rename_minors()

Add support for asynchronous zvol minor operations.

Create a per-pool zvol taskq for asynchronous zvol tasks.
There are a few key design decisions to be aware of.

* Each taskq must be single threaded to ensure tasks are always
  processed in the order in which they were dispatched.

* There is a taskq per-pool in order to keep the pools independent.
  This way if one pool is suspended it will not impact another.

* The preferred location to dispatch a zvol minor task is a sync
  task.  In this context there is easy access to the spa_t and
  minimal error handling is required because the sync task must
  succeed.

Support for asynchronous zvol minor operations addresses issue openzfs#3681.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
…n() and

dsl_pool_sync()...zvol_rename_minors()

Remove trylock of spa_namespace_lock as it is no longer needed when
zvol minor operations are performed in a separate context with no
prior locking state; the spa_namespace_lock is no longer held
when bdev->bd_mutex or zfs_state_lock might be taken in the code
paths originating from the zvol minor operation callbacks.
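
To illustrate the pattern being removed, a hedged before/after sketch with invented function names (this is not the commit's diff; the locking calls assume the usual kmutex API and the spa_namespace_lock declared in sys/spa.h):

#include <sys/spa.h>	/* extern kmutex_t spa_namespace_lock */
#include <sys/mutex.h>

/* Before: the caller might already hold bdev->bd_mutex or zfs_state_lock,
 * so only a non-blocking attempt on spa_namespace_lock was safe. */
static int
zvol_minor_op_inline_sketch(void)
{
	if (!mutex_tryenter(&spa_namespace_lock))
		return (EBUSY);		/* back off rather than risk a deadlock */
	/* ... create/remove/rename the minor ... */
	mutex_exit(&spa_namespace_lock);
	return (0);
}

/* After: the work runs from the per-pool taskq thread, which enters with
 * no locks held, so a plain blocking acquisition cannot invert lock order. */
static void
zvol_minor_op_taskq_sketch(void *arg)
{
	mutex_enter(&spa_namespace_lock);
	/* ... create/remove/rename the minor ... */
	mutex_exit(&spa_namespace_lock);
}
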
@behlendorf
Contributor

@bprotopopov thanks for rebasing these patches. I made another pass looking over the latest patches and they look good to me. I'm going to run them through some additional manual testing, but assuming I don't find anything and you don't want to make any additional changes, I believe they're ready to merge.

@behlendorf
Contributor

@bprotopopov everything LGTM. I'll get these updated patches merged, thanks for all your hard work on this!

@behlendorf
Contributor

Merged as:

1ee159f Fix lock order inversion with zvol_open()
a0bd735 Add support for asynchronous zvol minor operations

@behlendorf closed this Mar 10, 2016
@bprotopopov
Contributor Author

It was nice working with you on this, @behlendorf!
