Improve rate at which new zvols are processed #8615

Merged 1 commit into openzfs:master on May 4, 2019

Conversation

jgallag88
Contributor

Motivation and Context

The kernel function which adds new zvols as disks to the system, add_disk(), briefly opens and closes the zvol as part of its work. Closing a zvol involves waiting for two txgs to sync. This, combined with the fact that the taskq processing new zvols is single threaded, can make this processing of new zvols quite slow.

Issue #8526

Description

Waiting for these txgs to sync is only necessary if the zvol has been written to, which is not the case during add_disk(). This change adds tracking of whether a zvol has been opened in write mode so that we can skip the txg_wait_synced() calls when they are unnecessary.

One of the txg_wait_synced() calls happens in zil_close(). To prevent this wait, this change avoids opening a ZIL for the zvol until the zvol is opened in write mode, so that the call to zil_close() can be skipped entirely when the zvol is closed. It might also be possible to prevent zil_close() from calling txg_wait_synced() if no data has been written to the ZIL. However, zil_close() calls zil_commit() which dirties the ZIL even if nothing has been written to it yet. Not being too familiar with how the ZIL works, I wasn't sure how best to prevent that, so I opted to avoid opening a ZIL until it was needed.
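
Roughly, the idea looks like the following sketch (illustrative only, not the exact diff; the helper names here are made up, while ZVOL_OPENED_WR, zil_open(), zil_close(), and txg_wait_synced() are the real symbols discussed in this PR, and locking/error handling are omitted):

/*
 * Illustrative sketch only: open a ZIL lazily when the zvol is first
 * opened for writing, and skip the txg waits on the last close unless
 * the zvol was ever opened for writing.
 */
static void
zvol_open_sketch(zvol_state_t *zv, fmode_t flag)
{
	if ((flag & FMODE_WRITE) && !(zv->zv_flags & ZVOL_OPENED_WR)) {
		zv->zv_flags |= ZVOL_OPENED_WR;
		zv->zv_zilog = zil_open(zv->zv_objset, zvol_get_data);
	}
}

static void
zvol_last_close_sketch(zvol_state_t *zv)
{
	if (zv->zv_flags & ZVOL_OPENED_WR) {
		/* zil_close() itself waits for a txg to sync. */
		zil_close(zv->zv_zilog);
		zv->zv_zilog = NULL;
		/* Only needed if the zvol may have been written to. */
		txg_wait_synced(dmu_objset_pool(zv->zv_objset), 0);
	}
}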

This change also fixes the flags passed to blkdev_get_by_path() by vdev_disk_open() to be FMODE_READ | FMODE_WRITE | FMODE_EXCL instead of just FMODE_EXCL. The flags were being incorrectly calculated because we were using the wrong version of vdev_bdev_mode(). Without this change tests which create a zpool on top of a zvol would cause crashes because we would write to the zvol even though we hadn't passed in the FMODE_WRITE flag when we opened it.
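
For reference, this is roughly the mapping the fix restores (a sketch only, assuming the helper keeps its current shape in vdev_disk.c; the call-site details may differ):

/*
 * Sketch of the intended translation from the spa open mode to the
 * fmode_t flags handed to blkdev_get_by_path(); illustrative only.
 */
static fmode_t
vdev_bdev_mode_sketch(int smode)
{
	fmode_t mode = 0;

	if (smode & FREAD)
		mode |= FMODE_READ;
	if (smode & FWRITE)
		mode |= FMODE_WRITE;

	return (mode);
}

/*
 * vdev_disk_open() then requests exclusive ownership on top of that,
 * ending up with FMODE_READ | FMODE_WRITE | FMODE_EXCL for a normal
 * read-write pool open, e.g.:
 *
 *	bdev = blkdev_get_by_path(v->vdev_path,
 *	    vdev_bdev_mode_sketch(spa_mode(v->vdev_spa)) | FMODE_EXCL,
 *	    zfs_vdev_holder);
 */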

How Has This Been Tested?

Tested by creating a large number of zvols in succession in a pool with a moderate write load and then waiting for all of the links to appear in /dev/zvol:

for i in `seq 1 100`; do sudo zfs create -V 8M rpool/zvol$i & done; time while [[ $(find /dev/zvol/rpool | wc -l) -lt 100 ]]; do sleep .05; done

This test ran ~500x faster with this change than it did without (5sec vs 41min).

Tested that writes to a zvol are still handled correctly when the zvol is opened multiple times concurrently in different modes, using variations of the following:

$ zfs create -V 1M rpool/vol1
$ dd if=/dev/urandom of=test.dat bs=1M count=1
# Open zvol in read mode, then in write mode. Then open in write mode again and write to it.
$ { { dd if=test.dat of=/dev/zvol/rpool/vol1; } >/dev/zvol/rpool/vol1; } </dev/zvol/rpool/vol1
$ cmp test.dat /dev/zvol/rpool/vol1 || echo "Do not match"

Also used bpftrace to confirm that the correct flags (FMODE_READ | FMODE_WRITE | FMODE_EXCL) are being passed into blkdev_get_by_path().

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Apr 11, 2019
Contributor

@behlendorf behlendorf left a comment


Thanks, that's a nice improvement!

One interesting thing I noticed that the CI caught was that apparently we somehow issued a flush to the block device when ZVOL_RDONLY wasn't set, but neither was ZVOL_OPENED_WR. This resulted in the following stack which caused several of the bots to fail.

Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
task: ffff9b3752d18000 task.stack: ffffbb1c01154000
RIP: 0010:zil_commit+0x5/0x50 [zfs]
RSP: 0018:ffffbb1c01157db0 EFLAGS: 00010206
RAX: 0000000000040801 RBX: ffff9b37d1b99a00 RCX:
 zvol_request+0x163/0x300 [zfs]
 generic_make_request+0x123/0x300
 submit_bio+0x6c/0x140
 submit_bio_wait+0x57/0x80
 blkdev_issue_flush+0x7c/0xb0
 blkdev_fsync+0x2f/0x40
 do_fsync+0x38/0x60
 SyS_fsync+0xc/0x10

* need to open a ZIL.
*/
zv->zv_flags |= ZVOL_OPENED_WR;
zv->zv_zilog = zil_open(zv->zv_objset, zvol_get_data);

I think it would be a good idea to add an assertion to zvol_write() and zvol_discard() that ZVOL_OPENED_WR is set (or that zv->zv_zilog != NULL) since they may call zil_commit().
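
Something along these lines, for example (illustrative only; the zv_request_t field names are assumed and the surrounding code is elided):

static void
zvol_write(void *arg)
{
	zv_request_t *zvr = arg;
	zvol_state_t *zv = zvr->zv;

	/*
	 * A write-type request implies the zvol was opened for writing,
	 * so the ZIL must already be open before zil_commit() is reached.
	 */
	ASSERT(zv->zv_flags & ZVOL_OPENED_WR);
	ASSERT3P(zv->zv_zilog, !=, NULL);

	/* ... existing write handling ... */
}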

@jgallag88
Contributor Author

One interesting thing I noticed that the CI caught was that apparently we somehow issued a flush to the block device when ZVOL_RDONLY wasn't set, but neither was ZVOL_OPENED_WR.

Thanks for pointing that out. I'm not sure why running the zfs-tests on my machine didn't hit this issue, but I can reproduce it easily manually by opening the zvol read-only and then calling fsync() on it.

@jgallag88 jgallag88 added Status: Revision Needed Changes are required for the PR to be accepted and removed Status: Code Review Needed Ready for review and testing labels Apr 13, 2019
@behlendorf
Contributor

Thanks for addressing the fsync() issue; it looks like there's at least one more similar NULL dereference, though. The Amazon 2 bot hit the same kind of situation in zil_replaying().

It looks like the call path was zvol_write()->zvol_log_write()->zil_replaying() when zv->zv_zilog == NULL. A similar situation may be possible with zvol_log_truncate()->zil_replaying().

@jgallag88
Contributor Author

It turns out that a device can be written to even if it has never been opened with FMODE_WRITE. If a device is partitioned and a partition is opened multiple times, the kernel will only open the device itself once. If the partition is opened with a different mode the second time it is opened, the device will never know about the different mode.

For example, in the following scenario the partition is opened twice, first read-only and then write-only. The zvol rpool/vol1 will only be opened once, with FMODE_READ. It will then be written to even though FMODE_WRITE was never passed to zvol_open.

$ { dd if=/dev/urandom of=/dev/zvol/rpool/vol1-part1 bs=1K count=1; } < /dev/zvol/rpool/vol1-part1

The logic that controls this behavior is in __blkdev_get() in the kernel.

Given that this is the way the kernel behaves, I've made the call to zil_open a bit lazier. Now it gets called in zvol_request the first time we get asked to do a write for a particular volume. That way we don't need to rely on the mode flags passed to zvol_open.
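
Roughly, the new logic looks like this (a sketch, not the exact diff; the helper name is mine, and the locking around zv_zilog is discussed below and omitted here):

static void
zvol_ensure_zilog_sketch(zvol_state_t *zv)
{
	/* Open the ZIL on the first write-type request for this volume. */
	if (zv->zv_zilog == NULL)
		zv->zv_zilog = zil_open(zv->zv_objset, zvol_get_data);
}

/*
 * Called from zvol_request() before a write, discard, or flush bio is
 * handled, e.g.:
 *
 *	if (bio_data_dir(bio) == WRITE)
 *		zvol_ensure_zilog_sketch(zv);
 */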

@codecov

codecov bot commented May 1, 2019

Codecov Report

Merging #8615 into master will decrease coverage by 0.59%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8615      +/-   ##
==========================================
- Coverage   79.34%   78.75%    -0.6%     
==========================================
  Files         262      381     +119     
  Lines       77786   117594   +39808     
==========================================
+ Hits        61723    92606   +30883     
- Misses      16063    24988    +8925
Flag Coverage Δ
#kernel 79.29% <100%> (-0.06%) ⬇️
#user 67.18% <ø> (?)


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2a15c00...f6c0ad0.

@jgallag88 jgallag88 added Status: Code Review Needed Ready for review and testing and removed Status: Revision Needed Changes are required for the PR to be accepted labels May 1, 2019
Contributor

@behlendorf behlendorf left a comment


I've made the call to zil_open a bit lazier. Now it gets called in zvol_request the first time we get asked to do a write for a particular volume.

This may be lazier, but I think the resulting patch is much better and easier to reason about. My only concern is taking the zv->zv_state_lock mutex in the main, relatively hot, zvol_request() call path.

What if instead we protected zv->zv_zilog with the existing zv->zv_suspend_lock? It's already taken as a reader in zvol_request(), so it'd be easy to add the zv->zv_zilog == NULL check and only promote it to a writer for the zil_open(). zvol_setup_zv() and zvol_shutdown_zv() already assert that it's held. What do you think? Something like this, as a sketch (not a tested patch):
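
/* In zvol_request(), which already takes zv_suspend_lock as a reader: */
rw_enter(&zv->zv_suspend_lock, RW_READER);

if (zv->zv_zilog == NULL) {
	/* Promote to writer only when the ZIL actually needs opening. */
	rw_exit(&zv->zv_suspend_lock);
	rw_enter(&zv->zv_suspend_lock, RW_WRITER);
	/* Recheck; another thread may have opened the ZIL meanwhile. */
	if (zv->zv_zilog == NULL)
		zv->zv_zilog = zil_open(zv->zv_objset, zvol_get_data);
	rw_downgrade(&zv->zv_suspend_lock);
}

/* ... dispatch the write with the suspend lock held as a reader ... */
rw_exit(&zv->zv_suspend_lock);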

The kernel function which adds new zvols as disks to the system,
add_disk(), briefly opens and closes the zvol as part of its work.
Closing a zvol involves waiting for two txgs to sync. This, combined
with the fact that the taskq processing new zvols is single threaded,
makes this processing of new zvols slow.

Waiting for these txgs to sync is only necessary if the zvol has been
written to, which is not the case during add_disk(). This change adds
tracking of whether a zvol has been written to so that we can skip the
txg_wait_synced() calls when they are unnecessary.

This change also fixes the flags passed to blkdev_get_by_path() by
vdev_disk_open() to be FMODE_READ | FMODE_WRITE | FMODE_EXCL instead of
just FMODE_EXCL. The flags were being incorrectly calculated because
we were using the wrong version of vdev_bdev_mode().

Signed-off-by: John Gallagher <john.gallagher@delphix.com>
@jgallag88
Contributor Author

@behlendorf That makes sense to me. I've updated the PR with your suggestion.

Contributor

@behlendorf behlendorf left a comment


Looks good. Does this alternate version of the PR result in a similar speed up to the original? (I'd expect it to). If we can get a second reviewer, let's see about getting this pretty small change in before 0.8.

@jgallag88
Contributor Author

Does this alternate version of the PR result in a similar speed up to the original? (I'd expect it to)

Yes, it does.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing Status: Post 0.8.0 labels May 2, 2019
@behlendorf behlendorf merged commit 1eacf2b into openzfs:master May 4, 2019
allanjude pushed a commit to allanjude/zfs that referenced this pull request Jun 7, 2019
The kernel function which adds new zvols as disks to the system,
add_disk(), briefly opens and closes the zvol as part of its work.
Closing a zvol involves waiting for two txgs to sync. This, combined
with the fact that the taskq processing new zvols is single threaded,
makes this processing of new zvols slow.

Waiting for these txgs to sync is only necessary if the zvol has been
written to, which is not the case during add_disk(). This change adds
tracking of whether a zvol has been written to so that we can skip the
txg_wait_synced() calls when they are unnecessary.

This change also fixes the flags passed to blkdev_get_by_path() by
vdev_disk_open() to be FMODE_READ | FMODE_WRITE | FMODE_EXCL instead of
just FMODE_EXCL. The flags were being incorrectly calculated because
we were using the wrong version of vdev_bdev_mode().

Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes openzfs#8526 
Closes openzfs#8615
allanjude pushed a commit to allanjude/zfs that referenced this pull request Jun 15, 2019