7614 zfs device evacuation/removal #482
Conversation
mailinglists35 commented Oct 25, 2017
What is needed to have this running on ZoL? And is it feature-complete, as in, can we users safely start testing it?
@mailinglists35 Yes, this is feature-complete enough to start testing as users. The removal/evacuation for striped vdevs has been in production for years now at Delphix, and we have been testing mirrored top-level vdev removal for months now.
The relevant test failures are:
dweeezil commented Nov 20, 2017
FYI, I'm working on porting this to ZoL. The basic porting is complete, but there's quite a bit of interference due to the removal of the range tree functions' management of callers' locks, and also due to ZoL's extra range tree functions. Another point of conflict is openzfs/zfs@0ea05c6 [EDIT: which was added to illumos after this], plus a few gratuitous changes that crept into ZoL with the encryption support. I'll mention @mailinglists35 since you asked about it. This bit of porting ought to give me enough familiarity with the feature to offer a reasonable review.
@dweeezil That's great, thank you Tim! Let me know if there's anything I can offer insight or advice on for the port.
ikozhukhov commented Nov 27, 2017
Hey, what is the status of this PR?
ikozhukhov left a comment
I have tested it by integrating it into DilOS, with some manual tests in different scenarios.
Thanks for sharing your test results, @ikozhukhov.
dweeezil commented Nov 27, 2017
I've gotten the port to ZoL to the point that it works, but am now going to have to plow through a bunch of testing issues which may have applicability to OpenZFS/illumos as well. @ahrens The first issue I've run across is the addition of the
```c
static int
dva_mapping_overlap_compare(const void *v_key, const void *v_array_elem)
{
	const uint64_t const *key = v_key;
```
```c
{
	const uint64_t const *key = v_key;
	const vdev_indirect_mapping_entry_phys_t const *array_elem =
	    v_array_elem;
```
@dweeezil I think that was a mistake. It doesn't make sense to "clear" an indirect (non-concrete) vdev. We should probably just
dweeezil commented Nov 28, 2017
@ahrens Like this?
Also, it looks like we need something similar in
@dweeezil That diff makes sense to me. But I don't understand what you're saying about vdev_free(). Why is vdev_child not NULL at that point? I think that ASSERT is worthwhile, since otherwise you'll leak the memory of vdev_child.
dweeezil commented Nov 28, 2017
@ahrens I looked into the
dweeezil commented Nov 29, 2017
@ahrens, I figured it out: In illumos,
Note: ZoL has always used
@dweeezil Great find. I think that on illumos we're also trying to avoid kmem_alloc(0) and kmem_free(0), so this will be a good change to make on illumos too.
mailinglists35 commented Dec 1, 2017
If I understand correctly, this feature requires enabling a feature flag, thus irreversibly changing the on-disk format. This means that if, on a non-patched version, you make the mistake of adding the unwanted vdev, then in order to use the feature you need to upgrade the pool. Can the feature also be shipped separately as a
Yes, the on-disk format changes (and the feature property changes to
No, software that doesn't understand device removal has no hope of ever reading a pool with removed devices. This applies regardless of whether you were to use zdb, zhack, or the kernel to change the on-disk format.
```c
/* Fail if its raidz */
if (vd->vdev_ops == &vdev_raidz_ops) {
	return (spa_vdev_exit(spa, vd, txg, EINVAL));
}
```
dweeezil, Dec 3, 2017
This (the vdev_ops test) doesn't work and causes the removal_with_add test to fail its "add a raidz" test, because when adding a raidz, vd->vdev_ops is the "root" vdev type and its vdev_child[0] is the raidz vdev type.
prashks (Author), Dec 3, 2017
Yes, I have fixed this already and it passes the ZFS test. Below is the change set that addresses this:
```diff
diff --git a/usr/src/uts/common/fs/zfs/spa.c b/usr/src/uts/common/fs/zfs/spa.c
index 3c0ef4c..6b35858 100644
--- a/usr/src/uts/common/fs/zfs/spa.c
+++ b/usr/src/uts/common/fs/zfs/spa.c
@@ -5527,29 +5527,31 @@ spa_vdev_add(spa_t *spa, nvlist_t *nvroot)
 	 * If we are in the middle of a device removal, we can only add
 	 * devices which match the existing devices in the pool.
 	 * If we are in the middle of a removal, or have some indirect
-	 * vdevs, we can not add redundant toplevels. This ensures that
+	 * vdevs, we can not add raidz toplevels. This ensures that
 	 * we do not rely on resilver, which does not properly handle
 	 * indirect vdevs.
 	 */
 	if (spa->spa_vdev_removal != NULL ||
 	    spa->spa_removing_phys.sr_prev_indirect_vdev != -1) {
 		for (int c = 0; c < vd->vdev_children; c++) {
+			tvd = vd->vdev_child[c];
 			if (spa->spa_vdev_removal != NULL &&
-			    vd->vdev_child[c]->vdev_ashift !=
+			    tvd->vdev_ashift !=
 			    spa->spa_vdev_removal->svr_vdev->vdev_ashift) {
 				return (spa_vdev_exit(spa, vd, txg, EINVAL));
 			}
-			/* Fail if its raidz */
-			if (vd->vdev_ops == &vdev_raidz_ops) {
+			/* Fail if top level vdev is raidz */
+			if (tvd->vdev_ops == &vdev_raidz_ops) {
 				return (spa_vdev_exit(spa, vd, txg, EINVAL));
 			}
 			/*
-			 * Need the mirror to be mirror of leaf vdevs only
+			 * Need the top level mirror to be
+			 * a mirror of leaf vdevs only
 			 */
-			if (vd->vdev_ops == &vdev_mirror_ops) {
+			if (tvd->vdev_ops == &vdev_mirror_ops) {
 				for (uint64_t cid = 0;
-				    cid < vd->vdev_children; cid++) {
-					vdev_t *cvd = vd->vdev_child[cid];
+				    cid < tvd->vdev_children; cid++) {
+					vdev_t *cvd = tvd->vdev_child[cid];
 					if (!cvd->vdev_ops->vdev_op_leaf) {
 						return (spa_vdev_exit(spa, vd,
 						    txg, EINVAL));
```
```
The following command removes the mirrored log device
.Sy mirror-2 .
.It Sy Example 14 No Removing a Mirrored top-level (Log or Data) Device
The following commands removes the mirrored log device
```
Patch from openzfs/openzfs#482.
Thanks for the reviews and testing, @dweeezil, @ikozhukhov, @rlaager. See the updated changes at: Added changes for: /sbin/zpool should be able to estimate memory used by device removal
mailinglists35 commented Dec 5, 2017
What I meant in the question is: is it trivial to modify the zdb logic, based on this PR, to perform offline device evacuation without changing the on-disk format?
No.
Patch from openzfs/openzfs#482.
ikozhukhov left a comment
I have tested the update; no issues found.
@ofcaah Typically, after a ZFS change is integrated into OpenZFS/illumos, someone from the ZFSonLinux community will port it to Linux, and someone from the FreeBSD community will port it to FreeBSD. @dweeezil has started this work for Linux: openzfs/zfs#6900
Modified remove_mirror zfs-test and other code review comments
rlaager commented Dec 21, 2017
@ahrens If we're using the Linux kernel contributor sign-off definitions, I'm okay with Acked-by, but I think Reviewed-by would overstate the amount of review I have given it (or even could give it, given my lack of kernel and ZFS-internals knowledge). However, if it is normal practice in OpenZFS to convert "looks good to me" to "Reviewed-by", then I'm absolutely okay with that.
dweeezil commented Dec 27, 2017
@ahrens Yes, I'm on my way home from a vacation now and will give it a final review within the next day, once I catch up the ZoL PR.
@dweeezil Did you have a chance to look at the final version?
dweeezil commented Jan 3, 2018
@ahrens I was out of town for a week and am getting caught up now. I'm working on reconciling the changes with my ZoL PR and should be able to give this a final review today.
dweeezil commented Jan 3, 2018
This is looking pretty good. I'm going to submit it to the ZoL buildbot now for additional verification.
dweeezil left a comment
I'm still working through this on ZoL.
```c
	}

	if (!locked)
		return (spa_vdev_exit(spa, NULL, txg, error));
```
dweeezil, Jan 4, 2018
This early return can cause the events created above to not be posted. It should probably be changed back to "error = spa_vdev_exit(...".
dweeezil commented Jan 4, 2018
I've also added the patch to
dweeezil left a comment
Changed event when removing a log device.
```c
	*txg = spa_vdev_config_enter(spa);

	sysevent_t *ev = spa_event_create(spa, vd, NULL,
	    ESC_ZFS_VDEV_REMOVE_AUX);
```
dweeezil, Jan 4, 2018
Previously, removing a log device would post a VDEV_REMOVE_DEV event but now it posts a VDEV_REMOVE_AUX. Is this intentional? If so, event consumers may need to be updated appropriately.
dweeezil left a comment
Renamed tunable possibly unrelated to device evacuation?
```c
int zfs_resilver_min_time_ms = 3000; /* min millisecs to resilver per txg */
boolean_t zfs_no_scrub_io = B_FALSE; /* set to disable scrub i/o */
boolean_t zfs_no_scrub_prefetch = B_FALSE; /* set to disable scrub prefetch */
enum ddt_class zfs_scrub_ddt_class_max = DDT_CLASS_DUPLICATE;
int dsl_scan_delay_completion = B_FALSE; /* set to delay scan completion */
/* max number of blocks to free in a single TXG */
uint64_t zfs_free_max_blocks = UINT64_MAX;
uint64_t zfs_async_block_max_blocks = UINT64_MAX;
```
ahrens (Member), Jan 5, 2018
It is relevant because we've changed the meaning to cover not only async frees but also obsoleting of the indirect mapping - see dsl_scan_obsolete_block_cb(). The process is similar to background deletion of snapshots. Now, the name of the new variable might leave something to be desired...
Updated my branch to address code review comments and some cstyle/copyright fixes.
Skaronator commented on usr/src/cmd/zpool/zpool_main.c in 8b37b1d, Jan 8, 2018
Shouldn't this be
@Skaronator The policy from the Delphix legal team is that this should be
Each copyright holder is free to choose the format of their copyright messages, so others can do it as you've described, if they wish.
Skaronator commented Jan 9, 2018
Ah, gotcha!
ivucica commented Jan 27, 2018
@prakashsurya To explicitly confirm: is f539f1e closing #482 or #251? The descriptions differ slightly with regard to support for removal of mirror devices... Commit:
#482:
#251:
Thanks :-)
ivucica commented Jan 30, 2018
Thank you, much appreciated!
jdrch commented Jun 11, 2019
I'm looking into using ZFS and am a bit confused by the wording of the pull request. Specifically, this part:
This seems to contradict the 1st sentence:
Unless the differentiating factor between the 2 statements is the use of "device" vs. "vdev." Assuming this is the case, let's say I have a pool, Pool A.

Per how I read the above, I'd be able to remove any of the vdevs and have Pool A's size be reduced accordingly, but I would be unable to remove any of the individual disks from the pool, except vdev4. Is that correct?

Following from the above, if I wanted to remove an individual disk from vdev1, vdev2, or vdev3, I'd have to remove the vdev containing it and then remove it from that vdev, but doing so would "destroy" that vdev as far as ZFS is concerned. Is that correct?

Just want to ensure I'm understanding this.
@jdrch That sentence should read, "allows some top-level vdevs to be removed". In your example:
jdrch commented Jun 11, 2019
I meant mirror or RAID, since I thought both were subsets/special cases of arrays. Perhaps I was wrong; if I was, apologies.

Ideally I'd like to maximize pool flexibility in terms of being able to move drives in and out of it, which is why my pool example had different vdev types. EDIT: After realizing that:
I finally realize this isn't a good idea.

"Any" as in RAIDZ vdevs, mirrored vdevs, and standalone disk vdevs, or "any" as in any of the vdevs in the example I gave? EDIT: This Reddit reply thread, combined with the top-level vdev removal OpenZFS roadmap, clears it up. Condition 2 above needs to be satisfied for

Thanks for the information! EDIT 2: Most comprehensive explanation so far.
prashks commented Oct 24, 2017
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing
operations on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use it
are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be
"remapped" to their new (concrete) locations if possible. This process
can be accelerated by using the "zfs remap" command to proactively
rewrite all indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of the data that is copied. This makes the process much faster, but if it were used on raidz vdevs, it would be possible to copy the wrong data even when we have the correct data on, e.g., the other side of the raidz. Therefore, raidz devices cannot be removed.