
OpenZFS 7614 - zfs device evacuation/removal #6900

Closed
wants to merge 7 commits

Conversation

dweeezil
Contributor

Description

This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing
operations on the indirect vdev.
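
For illustration, a removal might be driven as follows (pool and device names are placeholders, and output is omitted):

    # Create a pool from three single-disk top-level vdevs, then evacuate one.
    zpool create tank sda sdb sdc
    zpool remove tank sdc        # copies sdc's allocated space onto the remaining vdevs
    zpool status tank            # the evacuation progress can be monitored here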

The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use it
are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be
"remapped" to their new (concrete) locations if possible. This process
can be accelerated by using the "zfs remap" command to proactively
rewrite all indirect blocks that reference indirect (removed) vdevs.
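
As a hypothetical example (dataset names are placeholders), the proactive rewrite would be requested per dataset:

    zfs remap tank/fs     # rewrite indirect blocks that still reference removed vdevs
    zfs remap tank/vol    # likewise for a zvol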

Note that when a device is removed, we do not verify the checksum of the
data that is copied. This makes the process much faster, but if it were
used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data when we have the correct data on, e.g.,
the other side of the mirror. Therefore, mirror and raidz devices
cannot be removed.
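
As a sketch of this restriction (names are placeholders; exact error text may differ), an attempt to remove a raidz top-level vdev is simply refused:

    zpool create tank raidz sda sdb sdc
    zpool remove tank raidz1-0    # expected to fail: raidz vdevs cannot be removed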

Motivation and Context

See above.

How Has This Been Tested?

Additions to the test suite in functional/removal.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@dweeezil
Contributor Author

dweeezil commented Nov 27, 2017

I'm posting this to give the bots a chance to confirm the few test suite failures I'm seeing, as well as to do a general health check.

The few test suite failures seem to be related to the inability of zdb to work properly when a removal operation is in progress, and/or related issues. I've not looked into this too deeply yet, but it may very well be some sort of latent Linux-ism.

There are also a number of areas I came across during the porting which need looking into.

That all said, this does seem to work in most of the tests and during my manual testing. Also, I'll note that it hasn't been merged upstream yet, but I'm hoping this port will give some wider coverage and also allow me and, hopefully, others to review the upstream patch.

[EDIT] There may still be some large kmem allocations.

[EDIT 2] And at least one lingering style issue.

@dweeezil
Contributor Author

Ugh, that first round of tests was rather uninspiring! I'll get some of the basic stuff fixed up soon and re-push.

@dweeezil
Contributor Author

Since it's likely some of the issues uncovered here will have applicability in OpenZFS in general, anyone interested in following the progress of this should also follow the upstream PR in openzfs/openzfs#482.

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Nov 29, 2017
@dweeezil
Contributor Author

dweeezil commented Dec 6, 2017

Fixed a bunch of things, rebased to current master, and synced with a couple of other new changes in the upstream PR. The removal_remap_deadlists test is still failing, for reasons I've not yet determined, due to a strange failure with zdb.

@dweeezil
Contributor Author

dweeezil commented Jan 3, 2018

Re-submitting shortly to give the ZoL buildbots another crack at this. My local tests are still failing on removal_remap_deadlists for reasons I've still not determined, but it appears to be some sort of timing issue when running zdb quickly following a pool configuration change. Manual testing of various removal scenarios all works properly, as do all the other test suite tests.

@dweeezil dweeezil force-pushed the illumos-7614 branch 3 times, most recently from 9f45b17 to 74d652c on January 7, 2018 21:41
@dweeezil
Contributor Author

dweeezil commented Jan 8, 2018

Status update: there are three main issues the buildbots are running into. Addressing these in order:

The two types of ztest problems seen by the buildbots seem to be hangs and segfaults in spa_remap_blkptr(). The only hangs I've been able to reproduce, particularly in KVM guest environments, are due to the issue outlined in #7017 so I suspect the same may be happening with the buildbots.

The segfault is, as best as I can tell, in this line:

vd->vdev_ops->vdev_op_remap(vd, offset, size, remap_blkptr_cb, &rbca);

which would likely indicate a bogus vdev, presumably one without .vdev_op_remap set. I've not been able to reproduce the problem, but the buildbots seem to run into it pretty regularly. I'll ask about this in the upstream PR. In the meantime, I plan on doing some more code inspection to see whether there's an obvious path by which this can happen.

Finally, a bunch of the new tests for device removal assume that file systems can be cleanly unmounted while files are open and actively being written to. I added an -l (lazy unmount) option to a handful of commands (see e865f67) to try to work around the issues, and it seems to help for my local tests but not always on the buildbots. I'd be interested in advice from anyone else with experience porting test suite scripts who may have also run into this issue.

@dweeezil
Contributor Author

Another update: I've been able to locally reproduce both of the ztest issues seen by the buildbots.

@ahrens
Member

ahrens commented Jan 10, 2018

This just landed in illumos. Looking forward to seeing it in ZoL too! illumos/illumos-gate@5cabbc6

@behlendorf
Contributor

a bunch of the new tests for device removal assume that file systems can be cleanly unmounted while files are open and actively being written to

We've typically addressed this issue by slightly reworking the test cases to kill any active processes in the mount point prior to issuing the unmount.

@megari
Contributor

megari commented Jan 11, 2018

We've typically addressed this issue by slightly reworking the test cases to kill any active processes in the mount point prior to issuing the unmount.

fuser -km /mountpoint should do the trick, I think.
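
A minimal sketch of that cleanup sequence (mountpoint and pool names are placeholders; the actual test-suite helpers may differ):

    fuser -km /testpool/fs       # kill any processes still using the mountpoint
    zfs unmount testpool/fs
    zpool destroy testpool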

@dweeezil
Contributor Author

@behlendorf and @megari Thanks, I'll look into killing the processes that are preventing the unmount (actually zpool destroy, etc.).

Also, another update: I've fixed the segfault in ztest; it was caused by a new dnode-iterating loop which was not large-dnode aware. The other ztest issue is a deadlock which, unfortunately, is a pain to diagnose with gdb. The deadlock seems to always involve a call to ztest_ddt_repair() when it tries to do rw_wrlock(&ztest_name_lock);.

Once the unmounting issue is handled, I think the whole test suite should be good.

@behlendorf
Contributor

Sounds good!

The deadlock seems to always involve a call to ztest_ddt_repair() when it tries to do rw_wrlock(&ztest_name_lock);.

This is a known, preexisting issue in master, unrelated to this change, so it shouldn't hold up this PR. Let me go ahead and open a new issue with the debugging I have, and we can resolve it in a separate PR.

@dweeezil dweeezil force-pushed the illumos-7614 branch 2 times, most recently from de0eb14 to bb19db2 on January 14, 2018 20:31
@dweeezil
Contributor Author

dweeezil commented Jan 22, 2018

I'm rather surprised at this type of failure on the "Amazon 2" buildbot:

Test: /usr/share/zfs/zfs-tests/tests/functional/removal/remove_raidz (run as root) [00:10] [FAIL]
22:48:20.41 SUCCESS: mkfile 268435456 /tmp/dsk1
22:48:20.75 SUCCESS: mkfile 268435456 /tmp/dsk2
22:48:21.11 SUCCESS: mkfile 268435456 /tmp/dsk3
22:48:21.11 NOTE: begin default_setup_noexit
22:48:30.01 SUCCESS: zpool create -f testpool /tmp/dsk1 raidz /tmp/dsk2 /tmp/dsk3
...
22:48:30.57 rm: cannot remove 'raidz': Is a directory
22:48:30.57 ERROR: rm -f /tmp/dsk1 raidz /tmp/dsk2 /tmp/dsk3 exited 1

I realize that it's probably a good idea to do "rm -f $DISKS"; however, I've never seen a version of the "rm" program which fails on a nonexistent argument when run as "rm -f".

[EDIT] This seems to happen on other buildbot instance types.

@ivucica

ivucica commented Jan 27, 2018

openzfs/openzfs#251 matches this PR's description in that removal of mirror devices is not supported.

This PR, however, refers to openzfs/openzfs#482 which omits mention of mirror devices in the PR description.

Is it intended that this PR #6900 supports removal of mirror devices?

Thanks :-)

@ivucica

ivucica commented Jan 27, 2018

Also, @dweeezil: the failure seems to be:

22:48:30.57 rm: cannot remove 'raidz': Is a directory

It's not that an argument is missing, but that it's present as a directory.
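
A hypothetical reconstruction of the cleanup step (the variable contents are assumed) that would explain the error above:

    # If the removal list is built from the zpool create arguments, the
    # vdev-type keyword travels along as a literal path.
    DISKS="/tmp/dsk1 raidz /tmp/dsk2 /tmp/dsk3"
    rm -f $DISKS    # fails with "Is a directory" when a directory named 'raidz' exists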

@dweeezil
Contributor Author

dweeezil commented Mar 4, 2018

I just pushed a refresh of this patch stack. The previous deadlocks were caused by the problem fixed in #7234. The issue of copying corrupted data from one leg of a mirror remains, so the patch stack now contains a hack to ztest and zloop to disallow fault injection when device evacuation is enabled.

@ahrens, the problem fixed in openzfs/openzfs#561 has been ported in this patch stack. Could you please take a peek at a6de073, particularly vdev_mirror.c, to see whether it looks like I handled the ZoL differences correctly? Also, the openzfs/openzfs#561 patch does not fix the same problem I alluded to above. I made the extremely hack-ish script at https://gist.github.com/dweeezil/b17fd5c4c85100acaa1b12f77a11ca03 (which is likely filled with Linux-isms) to demonstrate that corruption on one half of a mirror can be copied to both halves of another mirror. It's not clear to me how this can work properly given that the DVAs created in spa_vdev_copy_segment() don't allow for checksums; however, it's also possible that I'm not fully understanding the whole evacuation process.

In any case, the last commit in this patch stack is intended to work around the issue in the context of ztest, to see how everything checks out on the buildbots.
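
For reference, a rough outline of what the gist does (assumed from the description above; the actual script has more steps):

    # Two mirrored top-level vdevs; silently damage one leg of mirror-0, then evacuate it.
    zpool create testpool mirror /tmp/dsk1 /tmp/dsk2 mirror /tmp/dsk3 /tmp/dsk4
    # ... write some test data to testpool ...
    dd if=/dev/urandom of=/tmp/dsk1 bs=1M seek=8 count=16 conv=notrunc
    zpool remove testpool mirror-0    # evacuation may read and copy the damaged leg
    zpool scrub testpool              # the damage can now exist on both legs of mirror-1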

@dweeezil
Contributor Author

dweeezil commented Mar 5, 2018

Given all the newly-appearing failures here (https://github.com/dweeezil/zfs/blob/illumos-7614/module/zfs/vdev_mirror.c#L628), I suspect the merge of openzfs/openzfs#561 may have been bad. At least I've been able to reproduce it locally and can investigate further.

Also, I suspect that most of the removal-related test suite failures are device busy issues when trying to export pools.

All in all, a rather uninspiring round of buildbot tests :|

Prakash Surya and others added 2 commits April 13, 2018 18:17
It's possible for the following assertion to be tripped when
running ztest:

    assertion failed for thread 0xf09fca40, thread-id 549:
    spa->spa_max_ashift == spa->spa_min_ashift (0xc == 0x9),
    file ../../../uts/common/fs/zfs/vdev_removal.c, line 965

    > $c
    libc.so.1`_lwp_kill+7(ebdde6c0, ebdde6c0, a9, fee7865e)
    libc.so.1`_assfail+0x214(ebddea28, fed7ac3c, 3c5, fef62000)
    libc.so.1`assfail3+0xde(fed7b130, c, 0, fed812cb, 9, 0)
    libzpool.so.1`spa_vdev_copy_impl+0x26b(89a4b40, ebddef74,
        ebddef68, 8992dc0, ebe10a00, fef073c0)
    libzpool.so.1`spa_vdev_remove_thread+0x6cd(87450c0, 0, 0, fee8f43a)
    libc.so.1`_thrp_setup+0x8c(f09fca40)
    libc.so.1`_lwp_start(f09fca40, 0, 0, 0, 0, 0)

    > ::spa -v
    ADDR         STATE NAME
    08723000    ACTIVE ztest

        ADDR     STATE     AUX          DESCRIPTION
        087466c0 HEALTHY   -            root
        087450c0 HEALTHY   -              /rpool/tmp/ztest.0a
        08745640 HEALTHY   -              indirect
        08745bc0 HEALTHY   -              /rpool/tmp/ztest.2a
        08746140 HEALTHY   -              /rpool/tmp/ztest.3a
        -        -         -            spares
        08744b40 HEALTHY   -              /rpool/tmp/ztest.spares.0

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-Issue: https://www.illumos.org/issues/9084
OpenZFS-Commit: openzfs/openzfs@18acba7
Issue openzfs#6900
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-Issue: https://illumos.org/issues/9036
OpenZFS-Commit: openzfs/openzfs@f02c28e434
Issue openzfs#6900
@codecov

codecov bot commented Apr 14, 2018

Codecov Report

Merging #6900 into master will increase coverage by 0.34%.
The diff coverage is 88.43%.

@@            Coverage Diff             @@
##           master    #6900      +/-   ##
==========================================
+ Coverage   76.17%   76.51%   +0.34%     
==========================================
  Files         330      335       +5     
  Lines      104316   107018    +2702     
==========================================
+ Hits        79458    81886    +2428     
- Misses      24858    25132     +274
Flag      Coverage Δ
#kernel   76.95% <84.97%> (+1.1%) ⬆️
#user     66.07% <80.2%> (+0.59%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ahrens ahrens added Type: Feature Feature request or new feature and removed Status: Work in Progress Not yet ready for general review labels Apr 14, 2018
behlendorf pushed a commit that referenced this pull request Apr 14, 2018
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.

There are two underlying problems:

1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.

The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).

2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.

Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.

The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).

The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.

This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).

Porting notes:

* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
  to dsl_scan_needs_resilver(), which was added to ZoL as part of the
  sequential scrub work.

* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
  parameter.  The extra parameter is unique to ZoL.

* When posting indirect checksum errors the ABD can be passed directly,
  zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: openzfs/openzfs#591
Closes #6900
behlendorf added a commit that referenced this pull request Apr 14, 2018
Remove duplicate segment copies to minimize the possible search
space for reconstruction.  Once reduced an accurate assessment can
be made regarding the difficulty in reconstructing the block.

Also, ztest will now run zdb with
zfs_reconstruct_indirect_combinations_max set to 1000000 in an attempt
to avoid checksum errors.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6900
behlendorf pushed a commit that referenced this pull request Apr 14, 2018
…d for indirect vdevs

The timeline of the race condition is the following:

[1] Thread A is about to finish condensing the first vdev in
    spa_condense_indirect_thread(), so it calls the
    spa_condense_indirect_complete_sync() sync task which sets
    the spa_condensing_indirect field to NULL. Waiting for the
    sync task to finish, thread A sleeps until the txg is done.
    When this happens, thread A will acquire spa_async_lock and
    set spa_condense_thread to NULL.

[2] While thread A waits for the txg to finish, thread B which is
    running spa_sync() checks whether it should condense the
    second vdev in vdev_indirect_should_condense() by checking the
    spa_condensing_indirect field which was set to NULL by
    spa_condense_indirect_thread() from thread A. So it goes on
    and tries to spawn a new condensing thread in
    spa_condense_indirect_start_sync() and the aforementioned
    assertion fails because thread A has not set spa_condense_thread
    to NULL (which is basically the last thing it does before returning).

The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock, which basically ties condensing to scrubbing when
it comes to pausing and resuming those actions during spa export.

This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() and exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.

In the context of the current bug, a zthr does the condensing of
indirect mappings, replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awakened by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr checks whether there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.

The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
  currently condensing or not
- condensing is more decoupled from the spa_async_thread related
  functionality.

As a final note, this commit also sets up the path for upstreaming
other features that use the ZTHR code, like zpool checkpoint and
fast clone deletion.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: openzfs/openzfs@3dc606ee
Closes #6900
behlendorf pushed a commit that referenced this pull request Apr 14, 2018
…rect_remap()

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9080
OpenZFS-commit: openzfs/openzfs@bdfded42e6
Closes #6900
behlendorf pushed a commit that referenced this pull request Apr 14, 2018
It's possible for the following assertion to be tripped when
running ztest:

    assertion failed for thread 0xf09fca40, thread-id 549:
    spa->spa_max_ashift == spa->spa_min_ashift (0xc == 0x9),
    file ../../../uts/common/fs/zfs/vdev_removal.c, line 965

    > $c
    libc.so.1`_lwp_kill+7(ebdde6c0, ebdde6c0, a9, fee7865e)
    libc.so.1`_assfail+0x214(ebddea28, fed7ac3c, 3c5, fef62000)
    libc.so.1`assfail3+0xde(fed7b130, c, 0, fed812cb, 9, 0)
    libzpool.so.1`spa_vdev_copy_impl+0x26b(89a4b40, ebddef74,
        ebddef68, 8992dc0, ebe10a00, fef073c0)
    libzpool.so.1`spa_vdev_remove_thread+0x6cd(87450c0, 0, 0, fee8f43a)
    libc.so.1`_thrp_setup+0x8c(f09fca40)
    libc.so.1`_lwp_start(f09fca40, 0, 0, 0, 0, 0)

    > ::spa -v
    ADDR         STATE NAME
    08723000    ACTIVE ztest

        ADDR     STATE     AUX          DESCRIPTION
        087466c0 HEALTHY   -            root
        087450c0 HEALTHY   -              /rpool/tmp/ztest.0a
        08745640 HEALTHY   -              indirect
        08745bc0 HEALTHY   -              /rpool/tmp/ztest.2a
        08746140 HEALTHY   -              /rpool/tmp/ztest.3a
        -        -         -            spares
        08744b40 HEALTHY   -              /rpool/tmp/ztest.spares.0

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/9084
OpenZFS-commit: openzfs/openzfs@18acba7
Closes #6900
behlendorf pushed a commit that referenced this pull request Apr 14, 2018
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9036
OpenZFS-commit: openzfs/openzfs@f02c28e434
Closes #6900
@elpamyelhsa

I'm a little lost; did this get accepted into the main release, or is it waiting for something?

@Skaronator

This has been merged into the master branch and will be available in the next major release, 0.8.0.

behlendorf added a commit that referenced this pull request Oct 1, 2018
Due to a flaw in 4589f3a the number of unique combinations
could be calculated incorrectly.  This could result in the
random combinations reconstruction being used when it would
have been possible to check all combinations.

This change fixes the unique combinations calculation and
simplifies the reconstruction logic by maintaining a per-
segment list of unique copies.

The vdev_indirect_splits_damage() function was introduced
to validate both the enumeration and random reconstruction
logic with ztest.  It is implemented such that it will never
make a known recoverable block unrecoverable.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6900 
Closes #7934
behlendorf pushed a commit that referenced this pull request Nov 7, 2018
This patch fixes a race condition where the end of
vdev_remove_replace_with_indirect(), which holds
svr_lock, would race against spa_vdev_removal_destroy(),
which destroys the same lock and is called asynchronously
via dsl_sync_task_nowait().

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #6900 
Closes #8083
behlendorf added a commit that referenced this pull request Dec 4, 2018
* Detect IO errors during device removal

While device removal cannot verify the checksums of individual
blocks during device removal, it can reasonably detect hard IO
errors from the leaf vdevs.  Failure to perform this error
checking can result in device removal completing successfully,
but moving no data which will permanently corrupt the pool.

Situation 1: faulted/degraded vdevs

In the configuration shown below, the removal of mirror-0 will
permanently corrupt the pool.  Device removal will preferentially
copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4'.  Which
in this case will result in nothing being copied since one vdev
in each of those groups in unavailable.  However, device removal
will complete successfully since all IO errors are ignored.

  tank                DEGRADED     0     0     0
    mirror-0          DEGRADED     0     0     0
      /var/tmp/vdev1  FAULTED      0     0     0  external fault
      /var/tmp/vdev2  ONLINE       0     0     0
    mirror-1          DEGRADED     0     0     0
      /var/tmp/vdev3  ONLINE       0     0     0
      /var/tmp/vdev4  FAULTED      0     0     0  external fault

This issue is resolved by updating the source child selection
logic to exclude unreadable leaf vdevs.  Additionally, unwritable
destination child vdevs which can never succeed are skipped to
prevent generating a large number of write IO errors.

Situation 2: individual hard IO errors

During removal, if an unexpected hard IO error is encountered when
either reading or writing the child vdev, the entire removal
operation is cancelled.  While it may be possible to reconstruct
the data after removal, that cannot be guaranteed.  The only
strictly safe thing to do is to cancel the removal.

As a future improvement we may want to instead suspend the removal
process and allow the damaged region to be retried.  But that work
is left for another time, hard IO errors during the removal process
are expected to be exceptionally rare.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6900
Closes #8161
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
Due to a flaw in 4589f3a the number of unique combinations
could be calculated incorrectly.  This could result in the
random combinations reconstruction being used when it would
have been possible to check all combinations.

This change fixes the unique combinations calculation and
simplifies the reconstruction logic by maintaining a per-
segment list of unique copies.

The vdev_indirect_splits_damage() function was introduced
to validate both the enumeration and random reconstruction
logic with ztest.  It is implemented such that it will never
make a known recoverable block unrecoverable.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6900 
Closes openzfs#7934
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
This patch fixes a race condition where the end of
vdev_remove_replace_with_indirect(), which holds
svr_lock, would race against spa_vdev_removal_destroy(),
which destroys the same lock and is called asynchronously
via dsl_sync_task_nowait().

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue openzfs#6900 
Closes openzfs#8083
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
* Detect IO errors during device removal

While device removal cannot verify the checksums of individual
blocks during device removal, it can reasonably detect hard IO
errors from the leaf vdevs.  Failure to perform this error
checking can result in device removal completing successfully,
but moving no data which will permanently corrupt the pool.

Situation 1: faulted/degraded vdevs

In the configuration shown below, the removal of mirror-0 will
permanently corrupt the pool.  Device removal will preferentially
copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4', which
in this case will result in nothing being copied since one vdev
in each of those groups is unavailable.  However, device removal
will complete successfully since all IO errors are ignored.

  tank                DEGRADED     0     0     0
    mirror-0          DEGRADED     0     0     0
      /var/tmp/vdev1  FAULTED      0     0     0  external fault
      /var/tmp/vdev2  ONLINE       0     0     0
    mirror-1          DEGRADED     0     0     0
      /var/tmp/vdev3  ONLINE       0     0     0
      /var/tmp/vdev4  FAULTED      0     0     0  external fault

This issue is resolved by updating the source child selection
logic to exclude unreadable leaf vdevs.  Additionally, unwritable
destination child vdevs which can never succeed are skipped to
prevent generating a large number of write IO errors.

Situation 2: individual hard IO errors

During removal, if an unexpected hard IO error is encountered when
either reading or writing the child vdev, the entire removal
operation is cancelled.  While it may be possible to reconstruct
the data after removal, that cannot be guaranteed.  The only
strictly safe thing to do is to cancel the removal.

As a future improvement we may want to instead suspend the removal
process and allow the damaged region to be retried.  But that work
is left for another time, hard IO errors during the removal process
are expected to be exceptionally rare.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6900
Closes openzfs#8161
@jdrch

jdrch commented Jun 11, 2019

I'm looking into using ZFS and am a bit confused by the wording of the pull request. Specifically, this part:

mirror and raidz devices cannot be removed.

This seems to contradict the first sentence:

allows top-level vdevs to be removed from the storage pool with "zpool remove", reducing the total amount of storage in the pool.

Unless the differentiating factor between the two statements is the use of "device" vs. "vdev." Assuming this is the case, let's say I have a pool, Pool A:

Pool A:
vdev1 = array of diska and diskb
vdev2 = array of diskc and diskd
vdev3 = array of diske and diskf
vdev4 = diskg

Per how I read the above, I'd be able to remove any of the vdevs and have Pool A's size be reduced accordingly, but I would be unable to remove any of the individual disks from the pool except vdev4. Is that correct?

Following from the above, if I wanted to remove an individual disk from vdev1, vdev2, or vdev3, I'd have to remove the vdev containing it and then remove it from that vdev, but doing so would "destroy" that vdev as far as ZFS is concerned. Is that correct?

Just want to ensure I'm understanding this.

@behlendorf
Contributor

@jdrch the original text for this PR included some awkward wording and was incorrect in a few places. Let me refer you to the finalized wording in the man page, which should help clarify the restrictions surrounding device removal.

     zpool remove [-np] pool device...
             Removes the specified device from the pool.  This command
             supports removing hot spare, cache, log, and both mirrored and
             non-redundant primary top-level vdevs, including dedup and
             special vdevs.  When the primary pool storage includes a
             top-level raidz vdev only hot spare, cache, and log devices
             can be removed.

             Removing a top-level vdev reduces the total amount of space in
             the storage pool.  The specified device will be evacuated by
             copying all allocated space from it to the other devices in the
             pool.  In this case, the zpool remove command initiates the
             removal and returns, while the evacuation continues in the
             background.  The removal progress can be monitored with zpool
             status.  If an IO error is encountered during the removal
             process it will be cancelled.  The device_removal feature flag
             must be enabled to remove a top-level vdev, see
             zpool-features(5).

             A mirrored top-level device (log or data) can be removed by
             specifying the top-level mirror for the same.  Non-log devices
             or data devices that are part of a mirrored configuration can
             be removed using the zpool detach command.

The key restriction worth emphasizing for device removal is that no top-level data device may be removed if there exists a top-level raidz vdev in the pool. Only mirror and non-redundant vdevs can be added to the pool if you intend to use device removal.

If you have a specific scenario in mind, I'd encourage you to create a pool using sparse file vdevs and use them to run through the process.
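
For example, a throwaway experiment along these lines (paths and sizes are arbitrary; bash brace expansion assumed):

    truncate -s 1G /var/tmp/d{1..5}
    zpool create testpool mirror /var/tmp/d1 /var/tmp/d2 \
        mirror /var/tmp/d3 /var/tmp/d4 /var/tmp/d5
    zpool remove testpool /var/tmp/d5    # evacuate the non-redundant top-level vdev
    # once that evacuation completes, a mirrored top-level data vdev can also be removed
    zpool remove testpool mirror-1
    zpool status testpool                # shows removal progress and status
    zpool destroy testpool && rm /var/tmp/d{1..5}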

@jdrch

jdrch commented Jun 11, 2019

@behlendorf Super! Thanks!

@candlerb

@behlendorf:

The key restriction worth emphasizing for device removal is that no top-level data device may be removed if there exists a top-level raidz vdev in the pool. Only mirror and non-redundant vdevs can be added to the pool if you intend to use device removal.

Solaris ZFS got the ability to remove top-level vdevs in 2018:
https://blogs.oracle.com/solaris/oracle-solaris-zfs-device-removal
... and one of the examples explicitly shows a pool consisting of a raidz, then a top-level vdev being added to the pool, and removed again (see under the heading "Misconfigured Pool Device")

I was recently bitten by ZoL's inability to do this. Someone else was trying to add a hot spare to a zpool, but they wrongly added it as a top-level device (using -f). Then it became impossible to remove; worse, it got used for real data, and was unprotected.
