
OpenZFS: 'vdev initialize' feature #8230

Closed
behlendorf wants to merge 5 commits into master from vdev_initialize

Conversation

Contributor

@behlendorf commented Dec 27, 2018

Motivation and Context

Updated version of vdev initialize. This PR builds on the work done by @grwilson and @dweeezil in
#7708, #7954, and #7955 to wrap up the remaining issues preventing this feature from being integrated. The following two commits have been added:

2849493 Add 'zpool status -i' option
41158e8 Speed up vdev initialize stop/cancel

Commit 2849493 is entirely cosmetic but addresses a usability concern for administrators who manage their pools primarily via the CLI. The zpool status command was updated to show the full vdev initialization details only when the -i option is provided.

Commit 41158e8 resolves the more substantial issue. When initializing a pool with multiple leaf vdevs, stopping or cancelling the operation could result in the spa_namespace_lock being held for many minutes, during which time no other CLI commands could be run and the pool appeared to be hung. This was resolved by restructuring the stop/cancel code, as sketched below.
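Conceptually, the restructured path signals every per-vdev initialize thread first and only then waits for all of them, so the expensive synchronization is paid once for the whole pool rather than once per leaf vdev. Below is a minimal user-space sketch of that two-phase idea; the types and names (leaf_vdev_t, initialize_stop_all()) are hypothetical and are not the actual spa_vdev_initialize() code.

```c
/*
 * Hedged sketch only: two-phase stop of per-vdev initialize threads.
 * Phase one requests that every thread exit without blocking; phase
 * two waits for all of them, so the caller blocks once rather than
 * once per leaf vdev.
 */
#include <pthread.h>

typedef struct leaf_vdev {
	pthread_t	lv_thread;	/* per-vdev initialize thread */
	volatile int	lv_exit_wanted;	/* set when the thread should exit */
} leaf_vdev_t;

static void
initialize_stop_all(leaf_vdev_t *leaves, int nleaves)
{
	/* Phase 1: signal every initialize thread to exit. */
	for (int i = 0; i < nleaves; i++)
		leaves[i].lv_exit_wanted = 1;

	/* Phase 2: wait for each thread to observe the flag and exit. */
	for (int i = 0; i < nleaves; i++)
		(void) pthread_join(leaves[i].lv_thread, NULL);
}
```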

Description

This feature introduces a way to 'initialize' the disks at install or in the background to make sure we don't incur this first read penalty.

When an entire LUN is added to ZFS, we make all space available immediately, and allow ZFS to find unallocated space and zero it out. This works with concurrent writes to arbitrary offsets, ensuring that we don't zero out something that has been (or is in the middle of being) written. This scheme can also be applied to existing pools (affecting only free regions on the vdev). Detailed design:

  • new subcommand: zpool initialize [-cs] <pool> [<vdev> ...]

    • start, suspend, or cancel initialization
  • Creates new open-context thread for each vdev

  • Thread iterates through all metaslabs in this vdev

  • Each metaslab (see the sketch after this list):

    • select a metaslab
    • load the metaslab
    • mark the metaslab as being zeroed
    • walk all free ranges within that metaslab and translate them to ranges on the leaf vdev
    • issue a "zeroing" I/O on the leaf vdev that corresponds to a free range on the metaslab we're working on
    • continue until all free ranges for this metaslab have been "zeroed"
    • reset/unmark the metaslab being zeroed
    • if more metaslabs exist, then repeat above tasks.
    • if no more metaslabs, then we're done.
  • progress for the initialization is stored on-disk in the vdev’s leaf zap object. The following information is stored:

    • the last offset that has been initialized
    • the state of the initialization process (i.e. active, suspended, or canceled)
    • the start time for the initialization
  • progress is reported via the zpool status command and shows information for each of the vdevs that are initializing
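The per-metaslab loop above might look roughly like the following sketch. Every type and helper here (metaslab_t, free_range_t, zero_range(), save_progress()) is a hypothetical stand-in for illustration and not the actual OpenZFS implementation; metaslab loading and the leaf-vdev translation are elided.

```c
/* Hedged sketch of the per-metaslab zeroing loop (hypothetical types). */
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

typedef struct free_range {
	uint64_t		fr_offset;	/* offset on the leaf vdev */
	uint64_t		fr_length;	/* length of the free range */
	struct free_range	*fr_next;
} free_range_t;

typedef struct metaslab {
	free_range_t	*ms_free;		/* free ranges once loaded */
	int		ms_initializing;	/* marked while being zeroed */
} metaslab_t;

/* Stand-in for issuing a "zeroing" write to the leaf vdev. */
static void
zero_range(uint64_t offset, uint64_t length)
{
	printf("zeroing %llu bytes at offset %llu\n",
	    (unsigned long long)length, (unsigned long long)offset);
}

/* Stand-in for recording progress in the vdev's leaf ZAP object. */
static void
save_progress(uint64_t last_offset)
{
	printf("last initialized offset: %llu\n",
	    (unsigned long long)last_offset);
}

static void
initialize_vdev(metaslab_t *metaslabs, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		metaslab_t *ms = &metaslabs[i];

		/* (loading the metaslab is elided in this sketch) */
		ms->ms_initializing = 1;	/* mark as being zeroed */

		/* Issue a zeroing write for every free range. */
		for (free_range_t *fr = ms->ms_free; fr != NULL;
		    fr = fr->fr_next) {
			zero_range(fr->fr_offset, fr->fr_length);
			save_progress(fr->fr_offset + fr->fr_length);
		}

		ms->ms_initializing = 0;	/* done with this metaslab */
	}
}
```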

How Has This Been Tested?

  • Locally ran the ZTS, all tests pass.
  • Locally ran ztest without failures
  • Manually tested on a large-ish pool of 40 HDDs.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@dweeezil
Contributor

@behlendorf I'll get this reviewed over the pre-new-year weekend.

@ikozhukhov
Contributor

How is this different from the illumos version? I have tested the illumos version on DilOS as well, with some help from @grwilson :)

@behlendorf
Contributor Author

@ikozhukhov the change is basically the same as applied to illumos. @dweeezil made minimal changes to the test cases and a few other tweaks to make it Linux friendly.

There are two followup commits, 73526ba and 41158e8, which improve CLI performance when initializing more than a single vdev. This was an important use case for us, since we'd like to use this functionality to shake down new hardware for pools with tens or hundreds of vdevs.

Commit 2849493 was also added to address a cosmetic issue for those who manage their pools using the CLI.


@behlendorf force-pushed the vdev_initialize branch 2 times, most recently from 026837e to 9e4a98f on January 3, 2019 00:02
@behlendorf
Contributor Author

@ahrens refreshed with your requested change for zpool status.

@kpande thanks for catching that, fixed.

@grwilson in the interests of making it easy to upstream and review the performance improvement for multiple vdevs, I've squashed your changes and mine into a single commit and added both of our signed-off-bys. If you'd rather keep them separate, just let me know and I can split them up again.

@codecov

codecov bot commented Jan 3, 2019

Codecov Report

Merging #8230 into master will increase coverage by 0.13%.
The diff coverage is 95.67%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8230      +/-   ##
==========================================
+ Coverage   78.46%    78.6%   +0.13%     
==========================================
  Files         379      380       +1     
  Lines      114924   115651     +727     
==========================================
+ Hits        90176    90907     +731     
+ Misses      24748    24744       -4
Flag Coverage Δ
#kernel 79.03% <96.12%> (+0.25%) ⬆️
#user 67.8% <92.48%> (+0.31%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c87db59...0c4c341.

Resolved review threads (outdated): cmd/ztest/ztest.c, module/zfs/spa.c
Contributor

@loli10K left a comment


Not counting some minor nits and the fact that the "zpool_initialize" test cases seem to be prone to race conditions with devices smaller than 4G (at least on my home builder), this LGTM, thanks!

Resolved review threads: module/zfs/zfs_ioctl.c (outdated), module/zfs/vdev_queue.c, cmd/zpool/zpool_main.c
Contributor

@dweeezil left a comment


LGTM

grwilson and others added 4 commits January 6, 2019 11:03
PROBLEM
========

The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.

SOLUTION
=========

This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.

When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
        - new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
                - start, suspend, or cancel initialization
        - Creates new open-context thread for each vdev
        - Thread iterates through all metaslabs in this vdev
        - Each metaslab:
                - select a metaslab
                - load the metaslab
                - mark the metaslab as being zeroed
                - walk all free ranges within that metaslab and translate
                  them to ranges on the leaf vdev
                - issue a "zeroing" I/O on the leaf vdev that corresponds to
                  a free range on the metaslab we're working on
                - continue until all free ranges for this metaslab have been
                  "zeroed"
                - reset/unmark the metaslab being zeroed
                - if more metaslabs exist, then repeat above tasks.
                - if no more metaslabs, then we're done.

        - progress for the initialization is stored on-disk in the vdev’s
          leaf zap object. The following information is stored:
                - the last offset that has been initialized
                - the state of the initialization process (i.e. active,
                  suspended, or canceled)
                - the start time for the initialization

        - progress is reported via the zpool status command and shows
          information for each of the vdevs that are initializing

Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
  written by "zpool initialize".

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: openzfs/openzfs@c3963210eb
PROBLEM
========

When invoking "zpool initialize" on a pool the command will
create a thread to initialize each disk. Unfortunately, it does
this serially across many transaction groups which can result
in commands taking a long time to return to the user and may
appear hung. The same thing is true when trying to suspend/cancel
the operation.

SOLUTION
=========

This change refactors the way we invoke the initialize interface
to ensure we can start or stop the initialization in just a few
transaction groups.

When stopping or cancelling a vdev initialization perform it
in two phases.  First signal each vdev initialization thread
that it should exit, then after all threads have been signaled
wait for them to exit.

On a pool with 40 leaf vdevs this reduces the vdev initialize
stop/cancel time from ~10 minutes to under a second.  The reason
for this is spa_vdev_initialize() no longer needs to wait on
multiple full TXGs per leaf vdev being stopped.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Only display the full details of the vdev initialization state
in 'zpool status' output when requested with the -i option.
By default display '(initializing)' after vdevs when they are
being actively initialized.  This is consistent with the
established precedent of appending '(resilvering)', etc., and
fits within the default 80 column terminal width making it
easy to read.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The contents of the user space provided input "initialize_vdevs"
nvlist must be validated to ensure all values are uint64s.  This
is done in zfs_ioc_pool_initialize() in order to keep all of these
checks in a single location.

Additionally, update the innvl and outnvl comments to match the
formatting used for all other new-style ioctls.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
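For context, checking that every entry in a user-supplied nvlist is a uint64, as described in the commit above, can be done with the libnvpair iteration API. This is only an illustrative sketch; the function name and error handling are hypothetical and not the actual zfs_ioc_pool_initialize() code.

```c
/* Hedged sketch: reject any nvlist entry that is not a uint64. */
#include <libnvpair.h>
#include <errno.h>

static int
validate_uint64_nvlist(nvlist_t *nvl)
{
	nvpair_t *pair = NULL;

	/* Iterate over every pair in the nvlist. */
	while ((pair = nvlist_next_nvpair(nvl, pair)) != NULL) {
		if (nvpair_type(pair) != DATA_TYPE_UINT64)
			return (EINVAL);
	}
	return (0);
}
```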
@behlendorf
Copy link
Contributor Author

Rebased and updated to address the remaining nits. I can squash those additional fixes in the final merge, but for the moment I've kept them separate for easier review.

* Add the following module options which were missing:
  zfs_vdev_initializing_min_active, zfs_vdev_initializing_max_active,
  zfs_vdev_removal_min_active, and zfs_vdev_removal_max_active.

* Move spa_open() down in zfs_ioc_pool_initialize() to simplify cleanup.

* Fix duplicate option handling

* Update man page to make it clear the options are mutually exclusive.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Jan 7, 2019
@behlendorf behlendorf removed the Status: Code Review Needed Ready for review and testing label Jan 7, 2019
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jan 7, 2019
PROBLEM
========

When invoking "zpool initialize" on a pool the command will
create a thread to initialize each disk. Unfortunately, it does
this serially across many transaction groups which can result
in commands taking a long time to return to the user and may
appear hung. The same thing is true when trying to suspend/cancel
the operation.

SOLUTION
=========

This change refactors the way we invoke the initialize interface
to ensure we can start or stop the initialization in just a few
transaction groups.

When stopping or cancelling a vdev initialization perform it
in two phases.  First signal each vdev initialization thread
that it should exit, then after all threads have been signaled
wait for them to exit.

On a pool with 40 leaf vdevs this reduces the vdev initialize
stop/cancel time from ~10 minutes to under a second.  The reason
for this is spa_vdev_initialize() no longer needs to wait on
multiple full TXGs per leaf vdev being stopped.

This commit additionally adds some missing checks for the passed
"initialize_vdevs" input nvlist.  The contents of the user provided
input "initialize_vdevs" nvlist must be validated to ensure all
values are uint64s.  This is done in zfs_ioc_pool_initialize() in
order to keep all of these checks in a single location.

Updated the innvl and outnvl comments to match the formatting used
for all other new-style ioctls.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes openzfs#8230
behlendorf added a commit to behlendorf/zfs that referenced this pull request Jan 7, 2019
Only display the full details of the vdev initialization state
in 'zpool status' output when requested with the -i option.
By default display '(initializing)' after vdevs when they are
being actively initialized.  This is consistent with the
established precedent of appending '(resilvering)', etc., and
fits within the default 80 column terminal width making it
easy to read.

Additionally, updated the 'zpool initialize' documentation to
make it clear the options are mutually exclusive, but allow
duplicate options like all other zfs/zpool commands.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#8230
@behlendorf behlendorf closed this in 619f097 Jan 7, 2019
@behlendorf behlendorf deleted the vdev_initialize branch April 19, 2021 20:11
Labels
Status: Accepted Ready to integrate (reviewed, tested)
7 participants