
OpenZFS 7614 - zfs device evacuation/removal #6900

Closed
wants to merge 7 commits

Conversation

dweeezil
Contributor

Description

This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing
operations on the indirect vdev.
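
For illustration, a removal might be driven as follows (pool and device names are placeholders, and output is omitted):

    # Create a pool from three single-disk top-level vdevs, then evacuate one.
    zpool create tank sda sdb sdc
    zpool remove tank sdc        # copies sdc's allocated space onto the remaining vdevs
    zpool status tank            # the evacuation progress can be monitored here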

The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use it
are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be
"remapped" to their new (concrete) locations if possible. This process
can be accelerated by using the "zfs remap" command to proactively
rewrite all indirect blocks that reference indirect (removed) vdevs.
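
As a hypothetical example (dataset names are placeholders), the proactive rewrite would be requested per dataset:

    zfs remap tank/fs     # rewrite indirect blocks that still reference removed vdevs
    zfs remap tank/vol    # likewise for a zvol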

Note that when a device is removed, we do not verify the checksum of the
data that is copied. This makes the process much faster, but if it were
used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data when we have the correct data on, e.g.,
the other side of the mirror. Therefore, mirror and raidz devices
cannot be removed.
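
As a sketch of this restriction (names are placeholders; exact error text may differ), an attempt to remove a raidz top-level vdev is simply refused:

    zpool create tank raidz sda sdb sdc
    zpool remove tank raidz1-0    # expected to fail: raidz vdevs cannot be removed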

Motivation and Context

See above.

How Has This Been Tested?

Additions to the test suite in functional/removal.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@dweeezil
Contributor Author

dweeezil commented Nov 27, 2017

I'm posting this to give the bots a chance to confirm the few test suite failures I'm seeing, as well as to do a general health check.

The few test suite failures seem to be related to the inability of zdb to work properly when a removal operation is in progress, and/or related issues. I've not looked into this too deeply yet, but it may very well be some sort of latent Linux-ism.

There are also a number of areas I came across during the porting which need looking into.

That all said, this does seem to work in most of the tests and during my manual testing. Also, I'll note that it hasn't been merged upstream yet, but I'm hoping this port will give some wider coverage and also allow me and, hopefully, others to review the upstream patch.

[EDIT] There may still be some large kmem allocations.

[EDIT 2] And at least one lingering style issue.

@dweeezil
Contributor Author

Ugh, that first round of tests was rather uninspiring! I'll get some of the basic stuff fixed up soon and re-push.

@dweeezil
Contributor Author

Since it's likely some of the issues uncovered here will have applicability in OpenZFS in general, anyone interested in following the progress of this should also follow the upstream PR in openzfs/openzfs#482.

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Nov 29, 2017
@dweeezil
Contributor Author

dweeezil commented Dec 6, 2017

Fixed a bunch of things, rebased to current master, and synced with a couple of other new changes in the upstream PR. The removal_remap_deadlists test is still failing, for reasons I've not yet determined, due to a strange failure with zdb.

@dweeezil
Contributor Author

dweeezil commented Jan 3, 2018

Re-submitting shortly to give the ZoL buildbots another crack at this. My local tests are still failing on removal_remap_deadlists for reasons I've still not determined, but it appears to be some sort of timing issue when running zdb quickly following a pool configuration change. Manual testing of various removal scenarios all works properly, as do all the other test suite tests.

@dweeezil dweeezil force-pushed the illumos-7614 branch 3 times, most recently from 9f45b17 to 74d652c on January 7, 2018 21:41
@dweeezil
Contributor Author

dweeezil commented Jan 8, 2018

Status update: there are three main issues the buildbots are running into. Addressing these in order:

The two types of ztest problems seen by the buildbots seem to be hangs and segfaults in spa_remap_blkptr(). The only hangs I've been able to reproduce, particularly in KVM guest environments, are due to the issue outlined in #7017 so I suspect the same may be happening with the buildbots.

The segfault is, as best as I can tell, in this line:

vd->vdev_ops->vdev_op_remap(vd, offset, size, remap_blkptr_cb, &rbca);

which would likely indicate a bogus vdev, presumably one without .vdev_op_remap set. I've not been able to reproduce the problem, but the buildbots seem to run into it pretty regularly. I'll ask about this in the upstream PR. In the meantime, I plan on doing some more code inspection to see whether there's an obvious path by which this can happen.

Finally, a bunch of the new tests for device removal assume that file systems can be cleanly unmounted while files are open and actively being written to. I added an -l (lazy unmount) option to a handful of commands (see e865f67) to try to work around the issues, and it seems to help for my local tests but not always on the buildbots. I'd be interested in advice from anyone else with experience porting test suite scripts who may have also run into this issue.

@dweeezil
Contributor Author

Another update: I've been able to locally reproduce both of the ztest issues seen by the buildbots.

@ahrens
Member

ahrens commented Jan 10, 2018

This just landed in illumos. Looking forward to seeing it in ZoL too! illumos/illumos-gate@5cabbc6

@behlendorf
Contributor

a bunch of the new tests for device removal assume that file systems can be cleanly unmounted while files are open and actively being written to

We've typically addressed this issue by slightly reworking the test cases to kill any active processes in the mount point prior to issuing the unmount.

@megari
Contributor

megari commented Jan 11, 2018

We've typically addressed this issue by slightly reworking the test cases to kill any active processes in the mount point prior to issuing the unmount.

fuser -km /mountpoint should do the trick, I think.
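
A minimal sketch of that cleanup sequence (mountpoint and pool names are placeholders; the actual test-suite helpers may differ):

    fuser -km /testpool/fs       # kill any processes still using the mountpoint
    zfs unmount testpool/fs
    zpool destroy testpool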

@dweeezil
Contributor Author

@behlendorf and @megari Thanks, I'll look into killing the processes that are preventing the unmount (actually zpool destroy, etc.).

Also, another update: I've fixed the segfault in ztest; it was caused by a new dnode-iterating loop which was not large-dnode aware. The other ztest issue is a deadlock which, unfortunately, is a pain to diagnose with gdb. The deadlock seems to always involve a call to ztest_ddt_repair() when it tries to do rw_wrlock(&ztest_name_lock);.

Once the unmounting issue is handled, I think the whole test suite should be good.

@behlendorf
Contributor

Sounds good!

The deadlock seems to always involve a call to ztest_ddt_repair() when it tries to do rw_wrlock(&ztest_name_lock);.

This is a known, preexisting issue in master, unrelated to this change, so it shouldn't hold up this PR. Let me go ahead and open a new issue with the debugging I have, and we can resolve it in a separate PR.

@dweeezil dweeezil force-pushed the illumos-7614 branch 2 times, most recently from de0eb14 to bb19db2 on January 14, 2018 20:31
@dweeezil
Contributor Author

dweeezil commented Jan 22, 2018

I'm rather surprised at this type of failure on the "Amazon 2" buildbot:

Test: /usr/share/zfs/zfs-tests/tests/functional/removal/remove_raidz (run as root) [00:10] [FAIL]
22:48:20.41 SUCCESS: mkfile 268435456 /tmp/dsk1
22:48:20.75 SUCCESS: mkfile 268435456 /tmp/dsk2
22:48:21.11 SUCCESS: mkfile 268435456 /tmp/dsk3
22:48:21.11 NOTE: begin default_setup_noexit
22:48:30.01 SUCCESS: zpool create -f testpool /tmp/dsk1 raidz /tmp/dsk2 /tmp/dsk3
...
22:48:30.57 rm: cannot remove 'raidz': Is a directory
22:48:30.57 ERROR: rm -f /tmp/dsk1 raidz /tmp/dsk2 /tmp/dsk3 exited 1

I realize that it's probably a good idea to do "rm -f $DISKS"; however, I've never seen a version of the "rm" program which fails on a nonexistent argument when run as "rm -f".

[EDIT] This seems to happen on other buildbot instance types.

@ivucica

ivucica commented Jan 27, 2018

openzfs/openzfs#251 matches this PR's description in that removal of mirror devices is not supported.

This PR, however, refers to openzfs/openzfs#482 which omits mention of mirror devices in the PR description.

Is it intended that this PR #6900 supports removal of mirror devices?

Thanks :-)

@ivucica

ivucica commented Jan 27, 2018

Also, @dweeezil: the failure seems to be:

22:48:30.57 rm: cannot remove 'raidz': Is a directory

It's not that an argument is missing, but that it's present as a directory.
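
A hypothetical reconstruction of the cleanup step (the variable contents are assumed) that would explain the error above:

    # If the removal list is built from the zpool create arguments, the
    # vdev-type keyword travels along as a literal path.
    DISKS="/tmp/dsk1 raidz /tmp/dsk2 /tmp/dsk3"
    rm -f $DISKS    # fails with "Is a directory" when a directory named 'raidz' exists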

@dweeezil
Contributor Author

dweeezil commented Mar 4, 2018

I just pushed a refresh of this patch stack. The previous deadlocks were caused by the problem fixed in #7234. The issue of copying corrupted data from one leg of a mirror remains, so the patch stack now contains a hack to ztest and zloop to disallow fault injection when device evacuation is enabled.

@ahrens, the problem fixed in openzfs/openzfs#561 has been ported in this patch stack. Could you please take a peek at a6de073, particularly vdev_mirror.c, to see whether it looks like I handled the ZoL differences correctly? Also, the openzfs/openzfs#561 patch does not fix the same problem I alluded to above. I made the extremely hack-ish script at https://gist.github.com/dweeezil/b17fd5c4c85100acaa1b12f77a11ca03 (which is likely filled with Linux-isms) to demonstrate that corruption on one half of a mirror can be copied to both halves of another mirror. It's not clear to me how this can work properly given that the DVAs created in spa_vdev_copy_segment() don't allow for checksums; however, it's also possible that I'm not fully understanding the whole evacuation process.

In any case, the last commit in this patch stack is intended to work around the issue in the context of ztest, to see how everything checks out on the buildbots.
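
For reference, a rough outline of what the gist does (assumed from the description above; the actual script has more steps):

    # Two mirrored top-level vdevs; silently damage one leg of mirror-0, then evacuate it.
    zpool create testpool mirror /tmp/dsk1 /tmp/dsk2 mirror /tmp/dsk3 /tmp/dsk4
    # ... write some test data to testpool ...
    dd if=/dev/urandom of=/tmp/dsk1 bs=1M seek=8 count=16 conv=notrunc
    zpool remove testpool mirror-0    # evacuation may read and copy the damaged leg
    zpool scrub testpool              # the damage can now exist on both legs of mirror-1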

@dweeezil
Contributor Author

dweeezil commented Mar 5, 2018

Given all the newly-appearing failures here (https://github.com/dweeezil/zfs/blob/illumos-7614/module/zfs/vdev_mirror.c#L628), I suspect the merge of openzfs/openzfs#561 may have been bad. At least I've been able to reproduce it locally and can investigate further.

Also, I suspect that most of the removal-related test suite failures are device busy issues when trying to export pools.

All in all, a rather uninspiring round of buildbot tests :|

Prakash Surya and others added 2 commits April 13, 2018 18:17
It's possible for the following assertion to be tripped when
running ztest:

    assertion failed for thread 0xf09fca40, thread-id 549:
    spa->spa_max_ashift == spa->spa_min_ashift (0xc == 0x9),
    file ../../../uts/common/fs/zfs/vdev_removal.c, line 965

    > $c
    libc.so.1`_lwp_kill+7(ebdde6c0, ebdde6c0, a9, fee7865e)
    libc.so.1`_assfail+0x214(ebddea28, fed7ac3c, 3c5, fef62000)
    libc.so.1`assfail3+0xde(fed7b130, c, 0, fed812cb, 9, 0)
    libzpool.so.1`spa_vdev_copy_impl+0x26b(89a4b40, ebddef74,
        ebddef68, 8992dc0, ebe10a00, fef073c0)
    libzpool.so.1`spa_vdev_remove_thread+0x6cd(87450c0, 0, 0, fee8f43a)
    libc.so.1`_thrp_setup+0x8c(f09fca40)
    libc.so.1`_lwp_start(f09fca40, 0, 0, 0, 0, 0)

    > ::spa -v
    ADDR         STATE NAME
    08723000    ACTIVE ztest

        ADDR     STATE     AUX          DESCRIPTION
        087466c0 HEALTHY   -            root
        087450c0 HEALTHY   -              /rpool/tmp/ztest.0a
        08745640 HEALTHY   -              indirect
        08745bc0 HEALTHY   -              /rpool/tmp/ztest.2a
        08746140 HEALTHY   -              /rpool/tmp/ztest.3a
        -        -         -            spares
        08744b40 HEALTHY   -              /rpool/tmp/ztest.spares.0

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-Issue: https://www.illumos.org/issues/9084
OpenZFS-Commit: openzfs/openzfs@18acba7
Issue openzfs#6900
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-Issue: https://illumos.org/issues/9036
OpenZFS-Commit: openzfs/openzfs@f02c28e434
Issue openzfs#6900
@codecov

codecov bot commented Apr 14, 2018

Codecov Report

Merging #6900 into master will increase coverage by 0.34%.
The diff coverage is 88.43%.

@@            Coverage Diff             @@
##           master    #6900      +/-   ##
==========================================
+ Coverage   76.17%   76.51%   +0.34%     
==========================================
  Files         330      335       +5     
  Lines      104316   107018    +2702     
==========================================
+ Hits        79458    81886    +2428     
- Misses      24858    25132     +274
Flag      Coverage Δ
#kernel   76.95% <84.97%> (+1.1%) ⬆️
#user     66.07% <80.2%> (+0.59%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ahrens ahrens added Type: Feature Feature request or new feature and removed Status: Work in Progress Not yet ready for general review labels Apr 14, 2018
behlendorf pushed a commit that referenced this pull request Apr 14, 2018
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.

There are two underlying problems:

1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.

The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).

2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.

Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.

The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).

The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.

This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).

Porting notes:

* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
  to dsl_scan_needs_resilver(), which was added to ZoL as part of the
  sequential scrub work.

* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
  parameter.  The extra parameter is unique to ZoL.

* When posting indirect checksum errors the ABD can be passed directly,
  zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: openzfs/openzfs#591
Closes #6900
behlendorf added a commit that referenced this pull request Apr 14, 2018
Remove duplicate segment copies to minimize the possible search
space for reconstruction.  Once reduced an accurate assessment can
be made regarding the difficulty in reconstructing the block.

Also, ztest will now run zdb with
zfs_reconstruct_indirect_combinations_max set to 1000000 in an attempt
to avoid checksum errors.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6900
behlendorf pushed a commit that referenced this pull request Apr 14, 2018
…d for indirect vdevs

The timeline of the race condition is the following:

[1] Thread A is about to finish condensing the first vdev in
    spa_condense_indirect_thread(), so it calls the
    spa_condense_indirect_complete_sync() sync task which sets
    the spa_condensing_indirect field to NULL. Waiting for the
    sync task to finish, thread A sleeps until the txg is done.
    When this happens, thread A will acquire spa_async_lock and
    set spa_condense_thread to NULL.

[2] While thread A waits for the txg to finish, thread B which is
    running spa_sync() checks whether it should condense the
    second vdev in vdev_indirect_should_condense() by checking the
    spa_condensing_indirect field which was set to NULL by
    spa_condense_indirect_thread() from thread A. So it goes on
    and tries to spawn a new condensing thread in
    spa_condense_indirect_start_sync() and the aforementioned
    assertion fails because thread A has not set spa_condense_thread
    to NULL (which is basically the last thing it does before returning).

The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock, which basically ties condensing to scrubbing when
it comes to pausing and resuming those actions during spa export.

This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() and exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.

In the context of the current bug, a zthr does the condensing of
indirect mappings, replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awakened by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr checks whether there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.

The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
  currently condensing or not
- condensing is more decoupled from the spa_async_thread related
  functionality.

As a final note, this commit also sets up the path for upstreaming
other features that use the ZTHR code, like zpool checkpoint and
fast clone deletion.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: openzfs/openzfs@3dc606ee
Closes #6900
behlendorf pushed a commit that referenced this pull request Apr 14, 2018
…rect_remap()

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9080
OpenZFS-commit: openzfs/openzfs@bdfded42e6
Closes #6900
behlendorf pushed a commit that referenced this pull request Apr 14, 2018
It's possible for the following assertion to be tripped when
running ztest:

    assertion failed for thread 0xf09fca40, thread-id 549:
    spa->spa_max_ashift == spa->spa_min_ashift (0xc == 0x9),
    file ../../../uts/common/fs/zfs/vdev_removal.c, line 965

    > $c
    libc.so.1`_lwp_kill+7(ebdde6c0, ebdde6c0, a9, fee7865e)
    libc.so.1`_assfail+0x214(ebddea28, fed7ac3c, 3c5, fef62000)
    libc.so.1`assfail3+0xde(fed7b130, c, 0, fed812cb, 9, 0)
    libzpool.so.1`spa_vdev_copy_impl+0x26b(89a4b40, ebddef74,
        ebddef68, 8992dc0, ebe10a00, fef073c0)
    libzpool.so.1`spa_vdev_remove_thread+0x6cd(87450c0, 0, 0, fee8f43a)
    libc.so.1`_thrp_setup+0x8c(f09fca40)
    libc.so.1`_lwp_start(f09fca40, 0, 0, 0, 0, 0)

    > ::spa -v
    ADDR         STATE NAME
    08723000    ACTIVE ztest

        ADDR     STATE     AUX          DESCRIPTION
        087466c0 HEALTHY   -            root
        087450c0 HEALTHY   -              /rpool/tmp/ztest.0a
        08745640 HEALTHY   -              indirect
        08745bc0 HEALTHY   -              /rpool/tmp/ztest.2a
        08746140 HEALTHY   -              /rpool/tmp/ztest.3a
        -        -         -            spares
        08744b40 HEALTHY   -              /rpool/tmp/ztest.spares.0

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/9084
OpenZFS-commit: openzfs/openzfs@18acba7
Closes #6900
behlendorf pushed a commit that referenced this pull request Apr 14, 2018
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9036
OpenZFS-commit: openzfs/openzfs@f02c28e434
Closes #6900
@elpamyelhsa

I'm a little lost; did this get accepted into the main release, or is it waiting for something?

@Skaronator

This has been merged into the master branch and will be available in the next major release, 0.8.0.

behlendorf added a commit that referenced this pull request Oct 1, 2018
Due to a flaw in 4589f3a the number of unique combinations
could be calculated incorrectly.  This could result in the
random combinations reconstruction being used when it would
have been possible to check all combinations.

This change fixes the unique combinations calculation and
simplifies the reconstruction logic by maintaining a per-
segment list of unique copies.

The vdev_indirect_splits_damage() function was introduced
to validate both the enumeration and random reconstruction
logic with ztest.  It is implemented such that it will never
make a known recoverable block unrecoverable.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6900 
Closes #7934
behlendorf pushed a commit that referenced this pull request Nov 7, 2018
This patch fixes a race condition where the end of
vdev_remove_replace_with_indirect(), which holds
svr_lock, would race against spa_vdev_removal_destroy(),
which destroys the same lock and is called asynchronously
via dsl_sync_task_nowait().

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #6900 
Closes #8083
behlendorf added a commit that referenced this pull request Dec 4, 2018
* Detect IO errors during device removal

While device removal cannot verify the checksums of individual
blocks during device removal, it can reasonably detect hard IO
errors from the leaf vdevs.  Failure to perform this error
checking can result in device removal completing successfully,
but moving no data which will permanently corrupt the pool.

Situation 1: faulted/degraded vdevs

In the configuration shown below, the removal of mirror-0 will
permanently corrupt the pool.  Device removal will preferentially
copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4'.  Which
in this case will result in nothing being copied since one vdev
in each of those groups in unavailable.  However, device removal
will complete successfully since all IO errors are ignored.

  tank                DEGRADED     0     0     0
    mirror-0          DEGRADED     0     0     0
      /var/tmp/vdev1  FAULTED      0     0     0  external fault
      /var/tmp/vdev2  ONLINE       0     0     0
    mirror-1          DEGRADED     0     0     0
      /var/tmp/vdev3  ONLINE       0     0     0
      /var/tmp/vdev4  FAULTED      0     0     0  external fault

This issue is resolved by updating the source child selection
logic to exclude unreadable leaf vdevs.  Additionally, unwritable
destination child vdevs which can never succeed are skipped to
prevent generating a large number of write IO errors.

Situation 2: individual hard IO errors

During removal, if an unexpected hard IO error is encountered when
either reading or writing the child vdev, the entire removal
operation is cancelled.  While it may be possible to reconstruct
the data after removal, that cannot be guaranteed.  The only
strictly safe thing to do is to cancel the removal.

As a future improvement we may want to instead suspend the removal
process and allow the damaged region to be retried.  But that work
is left for another time, hard IO errors during the removal process
are expected to be exceptionally rare.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6900
Closes #8161
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
Due to a flaw in 4589f3a the number of unique combinations
could be calculated incorrectly.  This could result in the
random combinations reconstruction being used when it would
have been possible to check all combinations.

This change fixes the unique combinations calculation and
simplifies the reconstruction logic by maintaining a per-
segment list of unique copies.

The vdev_indirect_splits_damage() function was introduced
to validate both the enumeration and random reconstruction
logic with ztest.  It is implemented such that it will never
make a known recoverable block unrecoverable.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6900 
Closes openzfs#7934
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
This patch fixes a race condition where the end of
vdev_remove_replace_with_indirect(), which holds
svr_lock, would race against spa_vdev_removal_destroy(),
which destroys the same lock and is called asynchronously
via dsl_sync_task_nowait().

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue openzfs#6900 
Closes openzfs#8083
GregorKopka pushed a commit to GregorKopka/zfs that referenced this pull request Jan 7, 2019
* Detect IO errors during device removal

While device removal cannot verify the checksums of individual
blocks during device removal, it can reasonably detect hard IO
errors from the leaf vdevs.  Failure to perform this error
checking can result in device removal completing successfully,
but moving no data which will permanently corrupt the pool.

Situation 1: faulted/degraded vdevs

In the configuration shown below, the removal of mirror-0 will
permanently corrupt the pool.  Device removal will preferentially
copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4', which
in this case will result in nothing being copied since one vdev
in each of those groups is unavailable.  However, device removal
will complete successfully since all IO errors are ignored.

  tank                DEGRADED     0     0     0
    mirror-0          DEGRADED     0     0     0
      /var/tmp/vdev1  FAULTED      0     0     0  external fault
      /var/tmp/vdev2  ONLINE       0     0     0
    mirror-1          DEGRADED     0     0     0
      /var/tmp/vdev3  ONLINE       0     0     0
      /var/tmp/vdev4  FAULTED      0     0     0  external fault

This issue is resolved by updating the source child selection
logic to exclude unreadable leaf vdevs.  Additionally, unwritable
destination child vdevs which can never succeed are skipped to
prevent generating a large number of write IO errors.

Situation 2: individual hard IO errors

During removal, if an unexpected hard IO error is encountered when
either reading or writing the child vdev, the entire removal
operation is cancelled.  While it may be possible to reconstruct
the data after removal, that cannot be guaranteed.  The only
strictly safe thing to do is to cancel the removal.

As a future improvement we may want to instead suspend the removal
process and allow the damaged region to be retried.  But that work
is left for another time, hard IO errors during the removal process
are expected to be exceptionally rare.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6900
Closes openzfs#8161
@jdrch

jdrch commented Jun 11, 2019

I'm looking into using ZFS and am a bit confused by the wording of the pull request. Specifically, this part:

mirror and raidz devices cannot be removed.

This seems to contradict the first sentence:

allows top-level vdevs to be removed from the storage pool with "zpool remove", reducing the total amount of storage in the pool.

Unless the differentiating factor between the two statements is the use of "device" vs. "vdev." Assuming this is the case, let's say I have a pool, Pool A:

Pool A:
vdev1 = array of diska and diskb
vdev2 = array of diskc and diskd
vdev3 = array of diske and diskf
vdev4 = diskg

Per how I read the above, I'd be able to remove any of the vdevs and have Pool A's size be reduced accordingly, but I would be unable to remove any of the individual disks from the pool except vdev4. Is that correct?

Following from the above, if I wanted to remove an individual disk from vdev1, vdev2, or vdev3, I'd have to remove the vdev containing it and then remove it from that vdev, but doing so would "destroy" that vdev as far as ZFS is concerned. Is that correct?

Just want to ensure I'm understanding this.

@behlendorf
Contributor

@jdrch the original text for this PR included some awkward wording and was incorrect in a few places. Let me refer you to the finalized wording in the man page, which should help clarify the restrictions surrounding device removal.

     zpool remove [-np] pool device...
             Removes the specified device from the pool.  This command
             supports removing hot spare, cache, log, and both mirrored and
             non-redundant primary top-level vdevs, including dedup and
             special vdevs.  When the primary pool storage includes a
             top-level raidz vdev only hot spare, cache, and log devices
             can be removed.

             Removing a top-level vdev reduces the total amount of space in
             the storage pool.  The specified device will be evacuated by
             copying all allocated space from it to the other devices in the
             pool.  In this case, the zpool remove command initiates the
             removal and returns, while the evacuation continues in the
             background.  The removal progress can be monitored with zpool
             status.  If an IO error is encountered during the removal
             process it will be cancelled.  The device_removal feature flag
             must be enabled to remove a top-level vdev, see
             zpool-features(5).

             A mirrored top-level device (log or data) can be removed by
             specifying the top-level mirror for the same.  Non-log devices
             or data devices that are part of a mirrored configuration can
             be removed using the zpool detach command.

The key restriction worth emphasizing for device removal is that no top-level data device may be removed if there exists a top-level raidz vdev in the pool. Only mirror and non-redundant vdevs can be added to the pool if you intend to use device removal.

If you have a specific scenario in mind, I'd encourage you to create a pool using sparse file vdevs and use them to run through the process.
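
For example, a throwaway experiment along these lines (paths and sizes are arbitrary; bash brace expansion assumed):

    truncate -s 1G /var/tmp/d{1..5}
    zpool create testpool mirror /var/tmp/d1 /var/tmp/d2 \
        mirror /var/tmp/d3 /var/tmp/d4 /var/tmp/d5
    zpool remove testpool /var/tmp/d5    # evacuate the non-redundant top-level vdev
    # once that evacuation completes, a mirrored top-level data vdev can also be removed
    zpool remove testpool mirror-1
    zpool status testpool                # shows removal progress and status
    zpool destroy testpool && rm /var/tmp/d{1..5}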

@jdrch

jdrch commented Jun 11, 2019

@behlendorf Super! Thanks!

@candlerb

@behlendorf:

The key restriction worth emphasizing for device removal is that no top-level data device may be removed if there exists a top-level raidz vdev in the pool. Only mirror and non-redundant vdevs can be added to the pool if you intend to use device removal.

Solaris ZFS got the ability to remove top-level vdevs in 2018:
https://blogs.oracle.com/solaris/oracle-solaris-zfs-device-removal
... and one of the examples explicitly shows a pool consisting of a raidz, then a top-level vdev being added to the pool, and removed again (see under the heading "Misconfigured Pool Device")

I was recently bitten by ZoL's inability to do this. Someone else was trying to add a hot spare to a zpool, but they wrongly added it as a top-level device (using -f). Then it became impossible to remove; worse, it got used for real data, and was unprotected.
