This repository has been archived by the owner on Nov 7, 2019. It is now read-only.

7614 zfs device evacuation/removal #251

Closed
wants to merge 1 commit into from

Conversation

ahrens
Member

@ahrens ahrens commented Nov 26, 2016

Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>

This project allows top-level vdevs to be removed from the storage pool
with “zpool remove”, reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now “indirect”) vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing
operations on the indirect vdev.
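
As a rough illustration of the remapping described above, here is a minimal sketch in C. It assumes the mapping is a table sorted by source offset and searched with a binary search. The type `map_entry_t` and the functions `mapping_lookup()` and `remap_offset()` are hypothetical names made up for this example; they are not the structures or functions used in the actual change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical mapping entry: one contiguous run that used to live on
 * the removed (indirect) vdev, and the place it was copied to.
 */
typedef struct map_entry {
	uint64_t src_offset;	/* start of the run on the removed vdev */
	uint64_t size;		/* length of the run in bytes */
	uint64_t dst_vdev;	/* id of the vdev that now holds the data */
	uint64_t dst_offset;	/* start of the run on that vdev */
} map_entry_t;

/*
 * Binary-search a table sorted by src_offset (entries do not overlap)
 * for the entry that covers `offset`.  Returns NULL if no entry covers
 * it, i.e. the offset was never copied.
 */
static const map_entry_t *
mapping_lookup(const map_entry_t *tbl, size_t n, uint64_t offset)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const map_entry_t *e = &tbl[mid];

		if (offset < e->src_offset)
			hi = mid;		/* look in the left half */
		else if (offset - e->src_offset >= e->size)
			lo = mid + 1;		/* look in the right half */
		else
			return (e);		/* offset is inside this run */
	}
	return (NULL);
}

/*
 * Translate an offset on the removed vdev to its new location.
 * Returns 0 on success, -1 if the offset is not mapped.
 */
static int
remap_offset(const map_entry_t *tbl, size_t n, uint64_t offset,
    uint64_t *vdev_out, uint64_t *offset_out)
{
	const map_entry_t *e = mapping_lookup(tbl, n, offset);

	if (e == NULL)
		return (-1);
	*vdev_out = e->dst_vdev;
	*offset_out = e->dst_offset + (offset - e->src_offset);
	return (0);
}
```

The sketch only translates a single offset. A read or free that spans several mapping entries would have to be split per entry, which is left out here.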

The size of the in-memory mapping table will be reduced when its entries
become “obsolete” because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use it
are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been “remapped” in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be
“remapped” to their new (concrete) locations if possible. This process
can be accelerated by using the “zfs remap” command to proactively
rewrite all indirect blocks that reference indirect (removed) vdevs.
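
One way to picture how entries become obsolete and the table shrinks is the sketch below. It assumes a per-entry counter of bytes that are no longer referenced and a compaction pass that drops entries whose counter has reached the entry size. The description above does not say how the real bookkeeping works, so `tracked_entry_t`, `entry_mark_obsolete()`, `entry_is_obsolete()` and `table_condense()` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry with obsolescence accounting. */
typedef struct tracked_entry {
	uint64_t src_offset;	 /* start of the run on the removed vdev */
	uint64_t size;		 /* length of the run in bytes */
	uint64_t obsolete_bytes; /* bytes no block pointer references now */
} tracked_entry_t;

/*
 * Record that `len` more bytes of the entry are unreferenced, e.g.
 * because the block was freed or its block pointer was remapped.
 * The counter is clamped so it never exceeds the entry size.
 */
static void
entry_mark_obsolete(tracked_entry_t *e, uint64_t len)
{
	uint64_t remaining = e->size - e->obsolete_bytes;

	e->obsolete_bytes += (len < remaining) ? len : remaining;
}

/* An entry is obsolete once every byte of it is unreferenced. */
static int
entry_is_obsolete(const tracked_entry_t *e)
{
	return (e->obsolete_bytes == e->size);
}

/*
 * Compact the table in place, dropping fully obsolete entries and
 * keeping the rest in their original order.  Returns the new count.
 */
static size_t
table_condense(tracked_entry_t *tbl, size_t n)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		if (!entry_is_obsolete(&tbl[i]))
			tbl[kept++] = tbl[i];
	}
	return (kept);
}
```

In this toy model, a partially obsolete entry stays in the table in full. Whether the real implementation can split or trim such entries is not stated in the description.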

Note that when a device is removed, we do not verify the checksum of the
data that is copied. This makes the process much faster, but if it were
used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data even though the correct data exists
elsewhere, e.g. on the other side of the mirror. Therefore, mirror and
raidz devices cannot be removed.

@ahrens
Member Author

ahrens commented Nov 26, 2016

@zettabot go

@ahrens
Member Author

ahrens commented Nov 28, 2016

@zettabot go

@ahrens
Member Author

ahrens commented Dec 13, 2016

@zettabot go

@ahrens
Member Author

ahrens commented Dec 19, 2016

@zettabot go

@jimklimov

Nice news. 👍

In our pocket of the IRC-verse, we decided that the last sentence limits the use of this wonderful new feature to containing (or amending) those "Oh, crap!" moments when people mistake "zpool add disk" for "zpool attach disk", which indeed is a major source of worry whenever one does either of these operations.

Are there plans to enable the feature for other VDEV types, perhaps in a later PR and learning eventually some lessons from this one?

@ahrens
Member Author

ahrens commented Dec 21, 2016

@jimklimov

Are there plans to enable the feature for other VDEV types, perhaps in a later PR and learning eventually some lessons from this one?

I don't have any plans to do so, but I would welcome effort along those lines.

ahrens referenced this pull request in openzfs/zfs Dec 21, 2016
range_tree_verify() was the only range tree support function which
locked rt_lock whereas all the other functions required the lock to
be taken by the caller.  If the lock is taken in range_tree_verify(),
it's not possible to atomically verify a set of related range trees
(those which are likely protected by the same lock).

In the previous implementation, checking "related" trees would be done
as follows:

    range_tree_verify(tree1, offset, size);
    /* tree1's rt_lock is not taken here */
    range_tree_verify(tree2, offset, size);

The new implementation requires:

    mutex_enter(tree1->rt_lock);
    range_tree_verify(tree1, offset, size);
    range_tree_verify(tree2, offset, size);
    mutex_exit(tree1->rt_lock);

Currently, the only consumer of range_tree_verify() is
metaslab_check_free() which verifies a set of related range trees in
a metaslab.  The TRIM/DISCARD code adds an additional set of checks of
the current and previous trimsets, both of which are represented as
range trees.

metaslab_check_free() has been updated to lock ms_lock once for each
vdev's metaslab and, in debugging builds, to verify that each tree's
rt_lock matches the metaslab's ms_lock to prove they're related.
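
The caller-holds-the-lock contract described in this commit message can be sketched in a self-contained way as follows. This is only an illustration under stated assumptions: `toy_lock_t` is a stand-in for a kernel mutex that can report whether it is held, `toy_tree_t` is a one-segment stand-in for a range tree, and the check is assumed to be "the range is absent from every related tree". None of these names come from the ZFS sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a mutex that knows whether it is currently held. */
typedef struct toy_lock {
	bool held;
} toy_lock_t;

static void
toy_lock_enter(toy_lock_t *l)
{
	assert(!l->held);
	l->held = true;
}

static void
toy_lock_exit(toy_lock_t *l)
{
	assert(l->held);
	l->held = false;
}

/* Toy range tree: at most one segment, plus the lock that protects it. */
typedef struct toy_tree {
	toy_lock_t *lock;
	uint64_t seg_start;
	uint64_t seg_size;	/* 0 means the tree is empty */
} toy_tree_t;

/*
 * Returns true if [off, off + size) overlaps the tree's segment.
 * Like the updated range_tree_verify(), it does not take the lock
 * itself; the caller must already hold it.
 */
static bool
toy_tree_overlaps(const toy_tree_t *t, uint64_t off, uint64_t size)
{
	assert(t->lock->held);
	return (t->seg_size != 0 &&
	    off < t->seg_start + t->seg_size &&
	    t->seg_start < off + size);
}

/*
 * Check a range against a set of related trees atomically: the shared
 * lock is taken once and held across all of the per-tree checks.
 */
static bool
range_is_free(toy_lock_t *l, toy_tree_t **trees, int ntrees,
    uint64_t off, uint64_t size)
{
	bool free_everywhere = true;

	toy_lock_enter(l);
	for (int i = 0; i < ntrees; i++) {
		/* Related trees must be protected by the same lock. */
		assert(trees[i]->lock == l);
		if (toy_tree_overlaps(trees[i], off, size))
			free_everywhere = false;
	}
	toy_lock_exit(l);
	return (free_everywhere);
}
```

Because the lock is held across the whole loop, no tree can change between the individual checks, which is the atomicity that was not possible while the verify function took the lock internally.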
@ahrens
Member Author

ahrens commented Jan 12, 2017

@zettabot go

@ahrens
Member Author

ahrens commented Feb 1, 2017

@zettabot go

@ahrens
Member Author

ahrens commented Apr 9, 2017

@zettabot go

@ahrens
Member Author

ahrens commented Apr 13, 2017

@zettabot go

@ahrens ahrens force-pushed the devrm6 branch 2 times, most recently from 1188b7d to b103dc8 Compare June 6, 2017 18:49
@galindro

galindro commented Jun 16, 2017

@ahrens when do you expect to merge this PR in master and release a new version with this great feature?

@ahrens
Member Author

ahrens commented Jun 28, 2017

@galindro I've been working on getting this rebased onto master, which is now completed. I'm hoping to get it merged sometime in July.

@galindro

Nice.
tks @ahrens

@ahrens
Member Author

ahrens commented Jun 28, 2017

@gmelikov I was wondering if you'd be interested in porting this to Linux before we integrate it to illumos. It would be great to get additional feedback on this change, and we might attract more attention on ZoL.

@gmelikov
Member

@ahrens I'm afraid I won't have much time to do it quickly now, but I've already begun; sooner or later we'll port it.

@gmelikov
Member

@ahrens unfortunately I didn't have enough time this summer to port it due to unexpected load at work. I'm very sorry that I misled you.

@ahrens
Member Author

ahrens commented Sep 25, 2017

@gmelikov no worries. We're also behind on our work upstreaming it to illumos, but we hope to have the final version out by OpenZFS DevSummit (Oct 24).

@ikozhukhov

Hey, when can we see it integrated? :)

@ahrens
Member Author

ahrens commented Oct 10, 2017

@ikozhukhov: @prashks is working on adding support for removing mirror devices. He's almost done and we hope to have the review updated by the OpenZFS DevSummit later this month.

@ahrens
Member Author

ahrens commented Oct 24, 2017

superseded by #482
