OpenZFS: 'vdev initialize' feature #8230
Conversation
@behlendorf I'll get this reviewed over the pre-new-year weekend.
How is it different from the illumos version? I have tested the illumos version on DilOS as well, with some help from @grwilson :)
@ikozhukhov the change is basically the same as applied to illumos. @dweeezil made minimal changes to the test cases and a few other tweaks to make it Linux friendly. There are two follow-up commits 73526ba and 41158e8 which improve CLI performance when initializing more than a single vdev. This was an important use case for us since we'd like to use this functionality to shake down new hardware for pools with tens or hundreds of vdevs. Commit 2849493 was also added to address a cosmetic issue for those who manage their pools using the CLI.
Force-pushed from 026837e to 9e4a98f
@ahrens refreshed with your requested change. @kpande thanks for catching that, fixed. @grwilson in the interest of making it easy to upstream and review the performance improvement for multiple vdevs I've squashed your changes and mine into a single commit and added both of our signed-off-bys. If you'd rather keep them separate just let me know and I can split them up again.
Codecov Report
@@ Coverage Diff @@
## master #8230 +/- ##
==========================================
+ Coverage 78.46% 78.6% +0.13%
==========================================
Files 379 380 +1
Lines 114924 115651 +727
==========================================
+ Hits 90176 90907 +731
+ Misses 24748 24744 -4
Continue to review full report at Codecov.
Force-pushed from 14e0136 to cf3a5bb
Not counting some minor nits, and the fact that the "zpool_initialize" test cases seem to be prone to race conditions with devices smaller than 4G (at least on my home builder), this LGTM, thanks!
LGTM
PROBLEM
========
The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.
SOLUTION
=========
This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
- new subcommand: zpool initialize [-cs] <pool> [<vdev> ...]
- start, suspend, or cancel initialization
- Creates new open-context thread for each vdev
- Thread iterates through all metaslabs in this vdev
- Each metaslab (see the sketch after this list):
- select a metaslab
- load the metaslab
- mark the metaslab as being zeroed
- walk all free ranges within that metaslab and translate
them to ranges on the leaf vdev
- issue a "zeroing" I/O on the leaf vdev that corresponds to
a free range on the metaslab we're working on
- continue until all free ranges for this metaslab have been
"zeroed"
- reset/unmark the metaslab being zeroed
- if more metaslabs exist, then repeat above tasks.
- if no more metaslabs, then we're done.
- progress for the initialization is stored on-disk in the vdev’s
leaf zap object. The following information is stored:
- the last offset that has been initialized
- the state of the initialization process (i.e. active,
suspended, or canceled)
- the start time for the initialization
- progress is reported via the zpool status command and shows
information for each of the vdevs that are initializing
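For readers skimming the design above, here is a minimal, self-contained C sketch of the per-vdev initializing thread's metaslab loop. The struct and helper names (free_range_t, issue_zeroing_io, and so on) are toy stand-ins for illustration only, not the actual ZFS symbols.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy stand-ins for the real ZFS vdev/metaslab structures. */
typedef struct free_range {
	uint64_t		 start;		/* offset within the metaslab */
	uint64_t		 size;
	struct free_range	*next;
} free_range_t;

typedef struct metaslab {
	uint64_t		 ms_start;	/* metaslab offset on the leaf vdev */
	free_range_t		*free_list;	/* free ranges in this metaslab */
	int			 initializing;	/* "being zeroed" marker */
	struct metaslab		*next;
} metaslab_t;

/* Placeholder for issuing a "zeroing" write to the leaf vdev. */
static void
issue_zeroing_io(uint64_t leaf_offset, uint64_t size)
{
	printf("zeroing %llu bytes at leaf offset %llu\n",
	    (unsigned long long)size, (unsigned long long)leaf_offset);
}

/* One pass of the per-vdev initializing thread described above. */
static void
initialize_vdev(metaslab_t *metaslabs)
{
	for (metaslab_t *ms = metaslabs; ms != NULL; ms = ms->next) {
		ms->initializing = 1;	/* mark the metaslab as being zeroed */
		for (free_range_t *fr = ms->free_list; fr != NULL; fr = fr->next) {
			/* Translate the metaslab-relative range to the leaf vdev. */
			issue_zeroing_io(ms->ms_start + fr->start, fr->size);
		}
		ms->initializing = 0;	/* reset/unmark, move to the next metaslab */
	}
	/* No more metaslabs: initialization of this vdev is complete. */
}
```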
Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
written by "zpool initialize".
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: openzfs/openzfs@c3963210eb
PROBLEM
========
When invoking "zpool initialize" on a pool the command will create a thread to initialize each disk. Unfortunately, it does this serially across many transaction groups, which can result in commands taking a long time to return to the user and may appear hung. The same thing is true when trying to suspend/cancel the operation.
SOLUTION
=========
This change refactors the way we invoke the initialize interface to ensure we can start or stop the initialization in just a few transaction groups.
When stopping or cancelling a vdev initialization, perform it in two phases. First signal each vdev initialization thread that it should exit, then after all threads have been signaled wait for them to exit. On a pool with 40 leaf vdevs this reduces the vdev initialize stop/cancel time from ~10 minutes to under a second. The reason for this is that spa_vdev_initialize() no longer needs to wait on multiple full TXGs per leaf vdev being stopped.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
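The two-phase stop/cancel described above can be illustrated with a small, self-contained pthreads sketch. The types and function below are illustrative stand-ins, not the real spa_vdev_initialize() code.

```c
#include <pthread.h>
#include <stdbool.h>

/* Toy per-vdev initializing-thread handle; not the real ZFS structures. */
typedef struct vdev_init {
	pthread_t	thread;		/* the initializing thread */
	volatile bool	exit_wanted;	/* polled by the thread's work loop */
} vdev_init_t;

/*
 * Two-phase stop/cancel: first signal every initializing thread to exit,
 * then wait for all of them.  Signaling and waiting per vdev in a single
 * loop would instead serialize on each thread's (potentially slow) exit.
 */
static void
stop_all_initializing(vdev_init_t *vds, int nvdevs)
{
	for (int i = 0; i < nvdevs; i++)	/* phase 1: signal */
		vds[i].exit_wanted = true;

	for (int i = 0; i < nvdevs; i++)	/* phase 2: wait */
		(void) pthread_join(vds[i].thread, NULL);
}
```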
Only display the full details of the vdev initialization state in 'zpool status' output when requested with the -i option. By default display '(initializing)' after vdevs when they are being actively initialized. This is consistent with the established precedent of appending '(resilvering)', etc. and fits within the default 80-column terminal width, making it easy to read.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The contents of the user-space-provided input "initialize_vdevs" nvlist must be validated to ensure all values are uint64s. This is done in zfs_ioc_pool_initialize() in order to keep all of these checks in a single location. Additionally, update the innvl and outnvl comments to match the formatting used for all other new-style ioctls.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
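A hedged sketch of that kind of input check using the public libnvpair iteration API; the function name here is illustrative, and the real check lives in zfs_ioc_pool_initialize().

```c
#include <libnvpair.h>
#include <errno.h>

/*
 * Reject the request unless every value in the user-supplied nvlist is a
 * uint64.  Illustrative only; the actual validation is performed inside
 * zfs_ioc_pool_initialize().
 */
static int
validate_initialize_vdevs(nvlist_t *initialize_vdevs)
{
	for (nvpair_t *pair = nvlist_next_nvpair(initialize_vdevs, NULL);
	    pair != NULL;
	    pair = nvlist_next_nvpair(initialize_vdevs, pair)) {
		if (nvpair_type(pair) != DATA_TYPE_UINT64)
			return (EINVAL);
	}
	return (0);
}
```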
Force-pushed from cf3a5bb to a858920
Rebased and updated to address the remaining nits. I can squash those additional fixes in the final merge but for the moment kept them separate for easier review.
* Add the following module options which were missing: zfs_vdev_initializing_min_active, zfs_vdev_initializing_max_active, zfs_vdev_removal_min_active, and zfs_vdev_removal_max_active.
* Move spa_open() down in zfs_ioc_pool_initialize() to simplify cleanup.
* Fix duplicate option handling.
* Update the man page to make it clear the options are mutually exclusive.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Force-pushed from a858920 to 0c4c341
PROBLEM
========
When invoking "zpool initialize" on a pool the command will create a thread to initialize each disk. Unfortunately, it does this serially across many transaction groups, which can result in commands taking a long time to return to the user and may appear hung. The same thing is true when trying to suspend/cancel the operation.
SOLUTION
=========
This change refactors the way we invoke the initialize interface to ensure we can start or stop the initialization in just a few transaction groups.
When stopping or cancelling a vdev initialization, perform it in two phases. First signal each vdev initialization thread that it should exit, then after all threads have been signaled wait for them to exit. On a pool with 40 leaf vdevs this reduces the vdev initialize stop/cancel time from ~10 minutes to under a second. The reason for this is that spa_vdev_initialize() no longer needs to wait on multiple full TXGs per leaf vdev being stopped.
This commit additionally adds some missing checks for the passed "initialize_vdevs" input nvlist. The contents of the user-provided input "initialize_vdevs" nvlist must be validated to ensure all values are uint64s. This is done in zfs_ioc_pool_initialize() in order to keep all of these checks in a single location. Updated the innvl and outnvl comments to match the formatting used for all other new-style ioctls.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes openzfs#8230
Only display the full details of the vdev initialization state in 'zpool status' output when requested with the -i option. By default display '(initializing)' after vdevs when they are being actively initialized. This is consistent with the established precedent of appending '(resilvering)', etc. and fits within the default 80-column terminal width, making it easy to read.
Additionally, updated the 'zpool initialize' documentation to make it clear the options are mutually exclusive, but allow duplicate options like all other zfs/zpool commands.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#8230
Motivation and Context
Updated version of vdev initialize. This PR builds on the work done by @grwilson and @dweeezil in #7708, #7954, and #7955 to wrap up the remaining issues preventing this feature from being integrated. The following two commits have been added:
2849493 Add 'zpool status -i' option
41158e8 Speed up vdev initialize stop/cancel
Commit 2849493 is entirely cosmetic but addresses a usability concern for administrators who manage their pools primarily via the CLI. The zpool status command was updated to only output the 'vdev initialization' status when the -i option is provided.
Commit 41158e8 resolves the more substantial issue. When initializing a pool with multiple leaf vdevs, stopping/cancelling the operation could result in the spa_namespace_lock being held for many minutes, during which time no other CLI commands could be run, making the pool appear to be hung. This was resolved by restructuring the stop/cancel code.
Description
This feature introduces a way to 'initialize' the disks at install or in the background to make sure we don't incur this first read penalty.
When an entire LUN is added to ZFS, we make all space available immediately, and allow ZFS to find unallocated space and zero it out. This works with concurrent writes to arbitrary offsets, ensuring that we don't zero out something that has been (or is in the middle of being) written. This scheme can also be applied to existing pools (affecting only free regions on the vdev). Detailed design:
new subcommand: zpool initialize [-cs] <pool> [<vdev> ...]
Creates new open-context thread for each vdev
Thread iterates through all metaslabs in this vdev
Each metaslab: select and load the metaslab, mark it as being zeroed, walk all free ranges within it and translate them to ranges on the leaf vdev, issue a "zeroing" I/O on the leaf vdev for each free range, then unmark the metaslab and repeat until no metaslabs remain.
Progress for the initialization is stored on-disk in the vdev's leaf zap object. The following information is stored: the last offset that has been initialized, the state of the initialization process (active, suspended, or canceled), and the start time for the initialization.
Progress is reported via the zpool status command and shows information for each of the vdevs that are initializing.
How Has This Been Tested?
ztest without failures
Types of changes
Checklist:
Signed-off-by.