Commit a1d477c

ahrens authored and behlendorf committed
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete

This project allows top-level vdevs to be removed from the storage pool with "zpool remove", reducing the total amount of storage in the pool. This operation copies all allocated regions of the device to be removed onto other devices, recording the mapping from old to new location. After the removal is complete, read and free operations to the removed (now "indirect") vdev must be remapped and performed at the new location on disk. The indirect mapping table is kept in memory whenever the pool is loaded, so there is minimal performance overhead when doing operations on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries become "obsolete" because they are no longer used by any block pointers in the pool. An entry becomes obsolete when all the blocks that use it are freed. An entry can also become obsolete when all the snapshots that reference it are deleted, and the block pointers that reference it have been "remapped" in all filesystems/zvols (and clones). Whenever an indirect block is written, all the block pointers in it will be "remapped" to their new (concrete) locations if possible. This process can be accelerated by using the "zfs remap" command to proactively rewrite all indirect blocks that reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of the data that is copied. This makes the process much faster, but if it were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy the wrong data, when we have the correct data on e.g. the other side of the mirror.

At the moment, only mirrors and simple top-level vdevs can be removed and no removal is allowed if any of the top-level vdevs are raidz.

Porting Notes:

* Avoid zero-sized kmem_alloc() in vdev_compact_children().

    The device evacuation code adds a dependency that vdev_compact_children() be able to properly empty the vdev_child array by setting it to NULL and zeroing vdev_children. Under Linux, kmem_alloc() and related functions return a sentinel pointer rather than NULL for zero-sized allocations.

* Remove comment regarding "mpt" driver where zfs_remove_max_segment is initialized to SPA_MAXBLOCKSIZE.

  Change zfs_condense_indirect_commit_entry_delay_ticks to zfs_condense_indirect_commit_entry_delay_ms for consistency with most other tunables in which delays are specified in ms.

* ZTS changes:

    Use set_tunable rather than mdb
    Use zpool sync as appropriate
    Use sync_pool instead of sync
    Kill jobs during test_removal_with_operation to allow unmount/export
    Don't add non-disk names such as "mirror" or "raidz" to $DISKS
    Use $TEST_BASE_DIR instead of /tmp
    Increase HZ from 100 to 1000 which is more common on Linux

    removal_multiple_indirection.ksh
        Reduce iterations in order to not time out on the code coverage builders.

    removal_resume_export:
        Functionally, the test case is correct but there exists a race where the kernel thread hasn't been fully started yet and is not visible. Wait for up to 1 second for the removal thread to be started before giving up on it. Also, increase the amount of data copied in order that the removal not finish before the export has a chance to fail.

* MMP compatibility: the concept of concrete versus non-concrete devices has slightly changed the semantics of vdev_writeable(). Update mmp_random_leaf_impl() accordingly.

* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool feature which is not supported by OpenZFS.

* Added support for new vdev removal tracepoints.

* Test cases removal_with_zdb and removal_condense_export have been intentionally disabled. When run manually they pass as intended, but when running in the automated test environment they produce unreliable results on the latest Fedora release.

  They may work better once the upstream pool import refactoring is merged into ZoL at which point they will be re-enabled.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: openzfs/openzfs@f539f1eb
Closes #6900
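
The remapping step described in the commit message (translating an offset on the removed vdev to its new location through an in-memory table) can be illustrated with a minimal sketch. This is not the code from this commit: the `map_entry_t` type, its fields, and the `map_lookup()` / `map_remap()` helpers are hypothetical names, and the sketch assumes a table sorted by source offset with non-overlapping segments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry: one copied segment of the removed vdev. */
typedef struct map_entry {
	uint64_t me_src_offset;	/* offset on the removed vdev */
	uint64_t me_size;	/* length of the segment in bytes */
	uint64_t me_dst_vdev;	/* id of the vdev now holding the data */
	uint64_t me_dst_offset;	/* offset on that vdev */
} map_entry_t;

/*
 * Binary-search a table sorted by me_src_offset for the entry that
 * covers `offset`.  Returns NULL if no segment covers the offset.
 */
static const map_entry_t *
map_lookup(const map_entry_t *tbl, size_t n, uint64_t offset)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const map_entry_t *e = &tbl[mid];

		if (offset < e->me_src_offset)
			hi = mid;
		else if (offset >= e->me_src_offset + e->me_size)
			lo = mid + 1;
		else
			return (e);
	}
	return (NULL);
}

/*
 * Translate an offset on the removed vdev.  Returns 0 and fills in the
 * new location, or -1 if no entry covers the offset.
 */
static int
map_remap(const map_entry_t *tbl, size_t n, uint64_t offset,
    uint64_t *dst_vdev, uint64_t *dst_offset)
{
	const map_entry_t *e = map_lookup(tbl, n, offset);

	assert(dst_vdev != NULL && dst_offset != NULL);
	if (e == NULL)
		return (-1);
	*dst_vdev = e->me_dst_vdev;
	*dst_offset = e->me_dst_offset + (offset - e->me_src_offset);
	return (0);
}
```

Because the lookup is a binary search over an in-memory array, each remapped read or free costs only a handful of comparisons, which is consistent with the "minimal performance overhead" claim above. Dropping obsolete entries shrinks the array and therefore the memory footprint.
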
1 parent 4b0f5b2 commit a1d477c

127 files changed: +9862 -912 lines changed

cmd/zdb/zdb.c

Lines changed: 632 additions & 62 deletions
Large diffs are not rendered by default.

cmd/zfs/zfs_main.c

Lines changed: 24 additions & 1 deletion
@@ -111,6 +111,7 @@ static int zfs_do_release(int argc, char **argv);
 static int zfs_do_diff(int argc, char **argv);
 static int zfs_do_bookmark(int argc, char **argv);
 static int zfs_do_channel_program(int argc, char **argv);
+static int zfs_do_remap(int argc, char **argv);
 static int zfs_do_load_key(int argc, char **argv);
 static int zfs_do_unload_key(int argc, char **argv);
 static int zfs_do_change_key(int argc, char **argv);
@@ -163,6 +164,7 @@ typedef enum {
 	HELP_HOLDS,
 	HELP_RELEASE,
 	HELP_DIFF,
+	HELP_REMAP,
 	HELP_BOOKMARK,
 	HELP_CHANNEL_PROGRAM,
 	HELP_LOAD_KEY,
@@ -226,6 +228,7 @@ static zfs_command_t command_table[] = {
 	{ "holds", zfs_do_holds, HELP_HOLDS },
 	{ "release", zfs_do_release, HELP_RELEASE },
 	{ "diff", zfs_do_diff, HELP_DIFF },
+	{ "remap", zfs_do_remap, HELP_REMAP },
 	{ "load-key", zfs_do_load_key, HELP_LOAD_KEY },
 	{ "unload-key", zfs_do_unload_key, HELP_UNLOAD_KEY },
 	{ "change-key", zfs_do_change_key, HELP_CHANGE_KEY },
@@ -356,6 +359,8 @@ get_usage(zfs_help_t idx)
 	case HELP_DIFF:
 		return (gettext("\tdiff [-FHt] <snapshot> "
 		    "[snapshot|filesystem]\n"));
+	case HELP_REMAP:
+		return (gettext("\tremap <filesystem | volume>\n"));
 	case HELP_BOOKMARK:
 		return (gettext("\tbookmark <snapshot> <bookmark>\n"));
 	case HELP_CHANNEL_PROGRAM:
@@ -4363,6 +4368,7 @@ zfs_do_receive(int argc, char **argv)
 #define ZFS_DELEG_PERM_RELEASE "release"
 #define ZFS_DELEG_PERM_DIFF "diff"
 #define ZFS_DELEG_PERM_BOOKMARK "bookmark"
+#define ZFS_DELEG_PERM_REMAP "remap"
 #define ZFS_DELEG_PERM_LOAD_KEY "load-key"
 #define ZFS_DELEG_PERM_CHANGE_KEY "change-key"
 
@@ -4390,6 +4396,7 @@ static zfs_deleg_perm_tab_t zfs_deleg_perm_tbl[] = {
 	{ ZFS_DELEG_PERM_SHARE, ZFS_DELEG_NOTE_SHARE },
 	{ ZFS_DELEG_PERM_SNAPSHOT, ZFS_DELEG_NOTE_SNAPSHOT },
 	{ ZFS_DELEG_PERM_BOOKMARK, ZFS_DELEG_NOTE_BOOKMARK },
+	{ ZFS_DELEG_PERM_REMAP, ZFS_DELEG_NOTE_REMAP },
 	{ ZFS_DELEG_PERM_LOAD_KEY, ZFS_DELEG_NOTE_LOAD_KEY },
 	{ ZFS_DELEG_PERM_CHANGE_KEY, ZFS_DELEG_NOTE_CHANGE_KEY },
 
@@ -7059,7 +7066,7 @@ zfs_do_diff(int argc, char **argv)
 
 	if (argc < 1) {
 		(void) fprintf(stderr,
-		gettext("must provide at least one snapshot name\n"));
+		    gettext("must provide at least one snapshot name\n"));
 		usage(B_FALSE);
 	}
 
@@ -7101,6 +7108,22 @@ zfs_do_diff(int argc, char **argv)
 	return (err != 0);
 }
 
+static int
+zfs_do_remap(int argc, char **argv)
+{
+	const char *fsname;
+	int err = 0;
+	if (argc != 2) {
+		(void) fprintf(stderr, gettext("wrong number of arguments\n"));
+		usage(B_FALSE);
+	}
+
+	fsname = argv[1];
+	err = zfs_remap_indirects(g_zfs, fsname);
+
+	return (err);
+}
+
 /*
  * zfs bookmark <fs@snap> <fs#bmark>
  *
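
The zfs_main.c hunks above wire a new subcommand into a table of name/handler pairs. As a rough illustration of that table-driven pattern, here is a minimal, self-contained sketch. It does not use libzfs: `command_t`, `do_remap()`, and `dispatch()` are hypothetical stand-ins, and it assumes (as the `argc != 2` check and the use of `argv[1]` in `zfs_do_remap()` suggest) that a handler receives the subcommand name in `argv[0]`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef int (*cmd_fn_t)(int argc, char **argv);

/* Hypothetical, simplified analogue of a command table entry. */
typedef struct command {
	const char *name;
	cmd_fn_t func;
} command_t;

/* Stand-in handler: expects exactly "remap <dataset>". */
static int
do_remap(int argc, char **argv)
{
	if (argc != 2) {
		(void) fprintf(stderr, "wrong number of arguments\n");
		return (2);
	}
	(void) printf("would remap %s\n", argv[1]);
	return (0);
}

static const command_t command_table[] = {
	{ "remap", do_remap },
};

/* Look up argv[0] in the table and call the matching handler. */
static int
dispatch(int argc, char **argv)
{
	size_t i;
	size_t n = sizeof (command_table) / sizeof (command_table[0]);

	assert(argv != NULL);
	if (argc < 1)
		return (2);
	for (i = 0; i < n; i++) {
		if (strcmp(argv[0], command_table[i].name) == 0)
			return (command_table[i].func(argc, argv));
	}
	(void) fprintf(stderr, "unrecognized command '%s'\n", argv[0]);
	return (2);
}
```

With this layout, adding a subcommand only requires a new handler and one new table row, which mirrors the shape of the change in the diff (a prototype, a help enum value, a table entry, and a usage string).
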

cmd/zpool/zpool_main.c

Lines changed: 193 additions & 14 deletions
@@ -344,7 +344,7 @@ get_usage(zpool_help_t idx)
 		return (gettext("\treplace [-f] [-o property=value] "
 		    "<pool> <device> [new-device]\n"));
 	case HELP_REMOVE:
-		return (gettext("\tremove <pool> <device> ...\n"));
+		return (gettext("\tremove [-nps] <pool> <device> ...\n"));
 	case HELP_REOPEN:
 		return (gettext("\treopen [-n] <pool>\n"));
 	case HELP_SCRUB:
@@ -782,37 +782,95 @@ zpool_do_add(int argc, char **argv)
 /*
  * zpool remove <pool> <vdev> ...
  *
- * Removes the given vdev from the pool. Currently, this supports removing
- * spares, cache, and log devices from the pool.
+ * Removes the given vdev from the pool.
  */
 int
 zpool_do_remove(int argc, char **argv)
 {
 	char *poolname;
 	int i, ret = 0;
 	zpool_handle_t *zhp = NULL;
+	boolean_t stop = B_FALSE;
+	char c;
+	boolean_t noop = B_FALSE;
+	boolean_t parsable = B_FALSE;
 
-	argc--;
-	argv++;
+	/* check options */
+	while ((c = getopt(argc, argv, "nps")) != -1) {
+		switch (c) {
+		case 'n':
+			noop = B_TRUE;
+			break;
+		case 'p':
+			parsable = B_TRUE;
+			break;
+		case 's':
+			stop = B_TRUE;
+			break;
+		case '?':
+			(void) fprintf(stderr, gettext("invalid option '%c'\n"),
+			    optopt);
+			usage(B_FALSE);
+		}
+	}
+
+	argc -= optind;
+	argv += optind;
 
 	/* get pool name and check number of arguments */
 	if (argc < 1) {
 		(void) fprintf(stderr, gettext("missing pool name argument\n"));
 		usage(B_FALSE);
 	}
-	if (argc < 2) {
-		(void) fprintf(stderr, gettext("missing device\n"));
-		usage(B_FALSE);
-	}
 
 	poolname = argv[0];
 
 	if ((zhp = zpool_open(g_zfs, poolname)) == NULL)
 		return (1);
 
-	for (i = 1; i < argc; i++) {
-		if (zpool_vdev_remove(zhp, argv[i]) != 0)
+	if (stop && noop) {
+		(void) fprintf(stderr, gettext("stop request ignored\n"));
+		return (0);
+	}
+
+	if (stop) {
+		if (argc > 1) {
+			(void) fprintf(stderr, gettext("too many arguments\n"));
+			usage(B_FALSE);
+		}
+		if (zpool_vdev_remove_cancel(zhp) != 0)
 			ret = 1;
+	} else {
+		if (argc < 2) {
+			(void) fprintf(stderr, gettext("missing device\n"));
+			usage(B_FALSE);
+		}
+
+		for (i = 1; i < argc; i++) {
+			if (noop) {
+				uint64_t size;
+
+				if (zpool_vdev_indirect_size(zhp, argv[i],
+				    &size) != 0) {
+					ret = 1;
+					break;
+				}
+				if (parsable) {
+					(void) printf("%s %llu\n",
+					    argv[i], (unsigned long long)size);
+				} else {
+					char valstr[32];
+					zfs_nicenum(size, valstr,
+					    sizeof (valstr));
+					(void) printf("Memory that will be "
+					    "used after removing %s: %s\n",
+					    argv[i], valstr);
+				}
+			} else {
+				if (zpool_vdev_remove(zhp, argv[i]) != 0)
+					ret = 1;
+			}
+		}
 	}
 	zpool_close(zhp);
 
@@ -1655,6 +1713,7 @@ print_status_config(zpool_handle_t *zhp, status_cbdata_t *cb, const char *name,
 	uint64_t notpresent;
 	spare_cbdata_t spare_cb;
 	const char *state;
+	char *type;
 	char *path = NULL;
 
 	if (nvlist_lookup_nvlist_array(nv, ZPOOL_CONFIG_CHILDREN,
@@ -1664,6 +1723,11 @@ print_status_config(zpool_handle_t *zhp, status_cbdata_t *cb, const char *name,
 	verify(nvlist_lookup_uint64_array(nv, ZPOOL_CONFIG_VDEV_STATS,
 	    (uint64_t **)&vs, &c) == 0);
 
+	verify(nvlist_lookup_string(nv, ZPOOL_CONFIG_TYPE, &type) == 0);
+
+	if (strcmp(type, VDEV_TYPE_INDIRECT) == 0)
+		return;
+
 	state = zpool_state_to_name(vs->vs_state, vs->vs_aux);
 	if (isspare) {
 		/*
@@ -3668,6 +3732,9 @@ print_vdev_stats(zpool_handle_t *zhp, const char *name, nvlist_t *oldnv,
 
 	calcvs = safe_malloc(sizeof (*calcvs));
 
+	if (strcmp(name, VDEV_TYPE_INDIRECT) == 0)
+		return (ret);
+
 	if (oldnv != NULL) {
 		verify(nvlist_lookup_uint64_array(oldnv,
 		    ZPOOL_CONFIG_VDEV_STATS, (uint64_t **)&oldvs, &c) == 0);
@@ -4964,6 +5031,9 @@ print_list_stats(zpool_handle_t *zhp, const char *name, nvlist_t *nv,
 		else
 			format = ZFS_NICENUM_1024;
 
+		if (strcmp(name, VDEV_TYPE_INDIRECT) == 0)
+			return;
+
 		if (scripted)
 			(void) printf("\t%s", name);
 		else if (strlen(name) + depth > cb->cb_namewidth)
@@ -5982,7 +6052,7 @@ zpool_do_scrub(int argc, char **argv)
 /*
  * Print out detailed scrub status.
  */
-void
+static void
 print_scan_status(pool_scan_stat_t *ps)
 {
 	time_t start, end, pause;
@@ -6129,6 +6199,111 @@ print_scan_status(pool_scan_stat_t *ps)
 	}
 }
 
+/*
+ * Print out detailed removal status.
+ */
+static void
+print_removal_status(zpool_handle_t *zhp, pool_removal_stat_t *prs)
+{
+	char copied_buf[7], examined_buf[7], total_buf[7], rate_buf[7];
+	time_t start, end;
+	nvlist_t *config, *nvroot;
+	nvlist_t **child;
+	uint_t children;
+	char *vdev_name;
+
+	if (prs == NULL || prs->prs_state == DSS_NONE)
+		return;
+
+	/*
+	 * Determine name of vdev.
+	 */
+	config = zpool_get_config(zhp, NULL);
+	nvroot = fnvlist_lookup_nvlist(config,
+	    ZPOOL_CONFIG_VDEV_TREE);
+	verify(nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_CHILDREN,
+	    &child, &children) == 0);
+	assert(prs->prs_removing_vdev < children);
+	vdev_name = zpool_vdev_name(g_zfs, zhp,
+	    child[prs->prs_removing_vdev], B_TRUE);
+
+	(void) printf(gettext("remove: "));
+
+	start = prs->prs_start_time;
+	end = prs->prs_end_time;
+	zfs_nicenum(prs->prs_copied, copied_buf, sizeof (copied_buf));
+
+	/*
+	 * Removal is finished or canceled.
+	 */
+	if (prs->prs_state == DSS_FINISHED) {
+		uint64_t minutes_taken = (end - start) / 60;
+
+		(void) printf(gettext("Removal of vdev %llu copied %s "
+		    "in %lluh%um, completed on %s"),
+		    (longlong_t)prs->prs_removing_vdev,
+		    copied_buf,
+		    (u_longlong_t)(minutes_taken / 60),
+		    (uint_t)(minutes_taken % 60),
+		    ctime((time_t *)&end));
+	} else if (prs->prs_state == DSS_CANCELED) {
+		(void) printf(gettext("Removal of %s canceled on %s"),
+		    vdev_name, ctime(&end));
+	} else {
+		uint64_t copied, total, elapsed, mins_left, hours_left;
+		double fraction_done;
+		uint_t rate;
+
+		assert(prs->prs_state == DSS_SCANNING);
+
+		/*
+		 * Removal is in progress.
+		 */
+		(void) printf(gettext(
+		    "Evacuation of %s in progress since %s"),
+		    vdev_name, ctime(&start));
+
+		copied = prs->prs_copied > 0 ? prs->prs_copied : 1;
+		total = prs->prs_to_copy;
+		fraction_done = (double)copied / total;
+
+		/* elapsed time for this pass */
+		elapsed = time(NULL) - prs->prs_start_time;
+		elapsed = elapsed > 0 ? elapsed : 1;
+		rate = copied / elapsed;
+		rate = rate > 0 ? rate : 1;
+		mins_left = ((total - copied) / rate) / 60;
+		hours_left = mins_left / 60;
+
+		zfs_nicenum(copied, examined_buf, sizeof (examined_buf));
+		zfs_nicenum(total, total_buf, sizeof (total_buf));
+		zfs_nicenum(rate, rate_buf, sizeof (rate_buf));
+
+		/*
+		 * do not print estimated time if hours_left is more than
+		 * 30 days
+		 */
+		(void) printf(gettext(" %s copied out of %s at %s/s, "
+		    "%.2f%% done"),
+		    examined_buf, total_buf, rate_buf, 100 * fraction_done);
+		if (hours_left < (30 * 24)) {
+			(void) printf(gettext(", %lluh%um to go\n"),
+			    (u_longlong_t)hours_left, (uint_t)(mins_left % 60));
+		} else {
+			(void) printf(gettext(
+			    ", (copy is slow, no estimated time)\n"));
+		}
+	}
+
+	if (prs->prs_mapping_memory > 0) {
+		char mem_buf[7];
+		zfs_nicenum(prs->prs_mapping_memory, mem_buf, sizeof (mem_buf));
+		(void) printf(gettext(" %s memory used for "
+		    "removed device mappings\n"),
+		    mem_buf);
+	}
+}
+
 static void
 print_error_log(zpool_handle_t *zhp)
 {
@@ -6294,8 +6469,7 @@ status_callback(zpool_handle_t *zhp, void *data)
 	else
 		(void) printf("\n");
 
-	verify(nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
-	    &nvroot) == 0);
+	nvroot = fnvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE);
 	verify(nvlist_lookup_uint64_array(nvroot, ZPOOL_CONFIG_VDEV_STATS,
 	    (uint64_t **)&vs, &c) == 0);
 	health = zpool_state_to_name(vs->vs_state, vs->vs_aux);
@@ -6555,11 +6729,16 @@ status_callback(zpool_handle_t *zhp, void *data)
 		nvlist_t **spares, **l2cache;
 		uint_t nspares, nl2cache;
 		pool_scan_stat_t *ps = NULL;
+		pool_removal_stat_t *prs = NULL;
 
 		(void) nvlist_lookup_uint64_array(nvroot,
 		    ZPOOL_CONFIG_SCAN_STATS, (uint64_t **)&ps, &c);
 		print_scan_status(ps);
 
+		(void) nvlist_lookup_uint64_array(nvroot,
+		    ZPOOL_CONFIG_REMOVAL_STATS, (uint64_t **)&prs, &c);
+		print_removal_status(zhp, prs);
+
 		cbp->cb_namewidth = max_width(zhp, nvroot, 0, 0,
 		    cbp->cb_name_flags | VDEV_NAME_TYPE_ID);
 		if (cbp->cb_namewidth < 10)
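
The progress estimate in `print_removal_status()` is plain integer arithmetic with a few clamps to avoid dividing by zero. The sketch below isolates that arithmetic so it can be checked by hand. `removal_eta_t` and `removal_eta()` are hypothetical names, and the sketch departs from the diff in two labeled ways: the rate is kept in a `uint64_t` instead of a `uint_t`, and the remaining byte count is clamped at zero if `copied` ever exceeds `total`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type; field names are illustrative only. */
typedef struct removal_eta {
	double re_fraction_done;	/* 0.0 to 1.0 */
	uint64_t re_rate;		/* bytes per second, at least 1 */
	uint64_t re_hours_left;
	uint64_t re_mins_left;		/* minutes past the hour, 0 to 59 */
	int re_show_eta;		/* 0 if the estimate is 30 days or more */
} removal_eta_t;

static removal_eta_t
removal_eta(uint64_t copied, uint64_t total, uint64_t elapsed)
{
	removal_eta_t eta;
	uint64_t remaining, mins;

	assert(total > 0);

	/* Clamp to 1 so the divisions below are always defined. */
	if (copied == 0)
		copied = 1;
	if (elapsed == 0)
		elapsed = 1;

	eta.re_rate = copied / elapsed;
	if (eta.re_rate == 0)
		eta.re_rate = 1;

	eta.re_fraction_done = (double)copied / (double)total;

	/* Assumption beyond the diff: never let the subtraction wrap. */
	remaining = total > copied ? total - copied : 0;
	mins = (remaining / eta.re_rate) / 60;

	eta.re_hours_left = mins / 60;
	eta.re_mins_left = mins % 60;
	eta.re_show_eta = eta.re_hours_left < (30 * 24);
	return (eta);
}
```

For example, with 1 GiB of 4 GiB copied in 1024 seconds, the rate is 1 MiB/s, 3072 seconds of copying remain, and the estimate is 0h51m at 25% done. When almost nothing has been copied, the rate clamps to 1 byte per second and the estimate can exceed 30 days, which is the case where the status line prints "copy is slow, no estimated time" instead.
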
