-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OpenZFS 9166 - zfs storage pool checkpoint #7570
Conversation
I've only done light manual testing of this feature and it seems to work properly. Linux-specific changes to the ZTS have not yet been made. It's being submitted now to get an initial run through the buildbots to find any gross problems. |
1399feb
to
68ff1ff
Compare
1615849
to
7db8ffc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello there!
Thank you for working on this.
Overall the changes look good.
I also went through the split in the dsl_scan() code that you brought up and they seem fine too.
I found a couple of small omissions that I pointed out below and I also raised some questions (most of them are probably due to my lack of knowledge of Linux dev).
cmd/ztest/ztest.c
Outdated
@@ -397,6 +398,8 @@ ztest_info_t ztest_info[] = { | |||
ZTI_INIT(ztest_vdev_aux_add_remove, 1, &ztest_opts.zo_vdevtime), | |||
ZTI_INIT(ztest_device_removal, 1, &zopt_sometimes), | |||
ZTI_INIT(ztest_remap_blocks, 1, &zopt_sometimes), | |||
ZTI_INIT(ztest_remap_blocks, 1, &zopt_sometimes), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to duplicate ZTI_INIT(ztest_remap_blocks....),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
cmd/ztest/ztest.c
Outdated
@@ -6977,6 +7074,7 @@ ztest_import(ztest_shared_t *zs) | |||
|
|||
(void) rwlock_destroy(&ztest_name_lock); | |||
mutex_destroy(&ztest_vdev_lock); | |||
mutex_destroy(&ztest_checkpoint_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please correct me if I'm missing something here but shouldn't we init the lock in this functions before we destroy it? (similarly to ztest_vdev_lock which is destroyed here but it is also initialized in this functions?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
cmd/ztest/ztest.c
Outdated
@@ -6991,6 +7089,7 @@ ztest_init(ztest_shared_t *zs) | |||
int i; | |||
|
|||
mutex_init(&ztest_vdev_lock, NULL, MUTEX_DEFAULT, NULL); | |||
mutex_init(&ztest_checkpoint_lock, NULL, USYNC_THREAD, NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also destroy destroy this lock in this function the same way we destroy ztest_vdev_lock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. These init/destroy mis-merges were due to ztest_import()
which was added by the MMP work.
@@ -497,8 +497,6 @@ vn_open(char *path, int x1, int flags, int mode, vnode_t **vpp, int x2, int x3) | |||
#ifdef __linux__ | |||
flags |= O_DIRECT; | |||
#endif | |||
/* We shouldn't be writing to block devices in userspace */ | |||
VERIFY(!(flags & FWRITE)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, what is this for? Is this related somehow to the rest of the checkpoint changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was an additional sanity check which was added a while back when we only expected zdb
to access the block devices read-only from user space. When zhack
was added that changed and this check wasn't dropped at the time. One of the new test cases uncovered this since it uses zhack
on a pool with disks instead of files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Thanks!
@@ -574,6 +579,29 @@ dsl_pool_dirty_delta(dsl_pool_t *dp, int64_t delta) | |||
cv_signal(&dp->dp_spaceavail_cv); | |||
} | |||
|
|||
#ifdef ZFS_DEBUG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that #ifdef really necessary? This function is only called through an ASSERT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It prevents an unused function warning in non-debug builds.
*/ | ||
#define __EXTENSIONS__ | ||
|
||
#include <libzfs_core.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be missing something but why is libzfs_core.h included here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, it was to give access to the typedefs such as uint64_t
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't something like stdint.h
work?
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/stdint.h.html
It just seems unnecessary to me to include a whole new dependency (libzfs_core) for a small utility like randwritecomp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes indeed. Using libzfs_core.h
was expedient at the time. Using stdint.h
works perfectly well and also allows to remove -I$(top_srcdir)/lib/libspl/include
from Makefile.am
. I'll be updating it accordingly (and also adding an include of string.h
so grab the declaration of strcmp()
(which we apparently need under Linux).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's great! Thank you Tim
log_must truncate -s $DISKSIZE $FILEDISK1 | ||
log_must truncate -s $DISKSIZE $FILEDISK2 | ||
|
||
log_must zpool create -O sync=disabled $NESTEDPOOL $FILEDISKS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's probably some illumos/Linux difference that I'm missing but why is sync disabled for these pools?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was an attempt to speed up the "setup" process, which all by itself was taking almost 10 minutes per test in a VM guest. I've not yet determined whether the sync=disabled
makes any difference. I added it on both the outer and the nested pools. Speaking of which, are the nested pools really necessary? It was my understanding (and @behlendorf's too, I think) that nested pools weren't guaranteed to be deadlock-free (but maybe they are in illumos).
-I$(top_srcdir)/lib/libspl/include | ||
|
||
pkgexec_PROGRAMS = randwritecomp | ||
mktree_SOURCES = randwritecomp.c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/mktree_SOURCES/randwritecomp_SOURCES
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, copy/paste bug. Fixed.
module/zfs/spa_checkpoint.c
Outdated
|
||
#ifdef DEBUG | ||
spa_checkpoint_accounting_verify(vd->vdev_spa); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Results in a 'spa_checkpoint_accounting_verify' defined but not used [-Wunused-function]
warning.
I'd suggest wrapping the function itself with ZFS_DEBUG
. You could also
#define spa_checkpoint_accounting_verify ((void) 0)
for the non-debug case to get rid on the unsightly DEBUG
wrapper where it's called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I find this way least ugly:
/* ARGSUSED */
void
foo_verify(...)
{
#ifdef ZFS_DEBUG
...
#endif
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to ZFS_DEBUG
. Also wrapped the definition of spa_checkpoint_accounting_verify()
in an #ifdef ZFS_DEBUG
.
@@ -497,8 +497,6 @@ vn_open(char *path, int x1, int flags, int mode, vnode_t **vpp, int x2, int x3) | |||
#ifdef __linux__ | |||
flags |= O_DIRECT; | |||
#endif | |||
/* We shouldn't be writing to block devices in userspace */ | |||
VERIFY(!(flags & FWRITE)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was an additional sanity check which was added a while back when we only expected zdb
to access the block devices read-only from user space. When zhack
was added that changed and this check wasn't dropped at the time. One of the new test cases uncovered this since it uses zhack
on a pool with disks instead of files.
@@ -0,0 +1,25 @@ | |||
#!/usr/bin/ksh -p |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be #!/bin/ksh
for all the new test scripts. It was causing the CentOS 6 failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! I saw that in one of the tests introduced by vdev removal and totally forgot to look for it.
df76923
to
bbbd614
Compare
ef59091
to
3b69b11
Compare
Codecov Report
@@ Coverage Diff @@
## master #7570 +/- ##
==========================================
+ Coverage 78.04% 78.12% +0.07%
==========================================
Files 366 368 +2
Lines 110618 111627 +1009
==========================================
+ Hits 86329 87205 +876
- Misses 24289 24422 +133
Continue to review full report at Codecov.
|
5f0c492
to
cddfe80
Compare
Otherwise the output is consumed by the output redirection. Signed-off-by: Tim Chase <tim@chase2k.com> Requires-builders: none
97c4672
to
3397340
Compare
Details about the motivation of this feature and its usage can be found in this blogpost: https://sdimitro.github.io/post/zpool-checkpoint/ A lightning talk of this feature can be found here: https://www.youtube.com/watch?v=fPQA8K40jAM Implementation details can be found in big block comment of spa_checkpoint.c Side-changes that are relevant to this commit but not explained elsewhere: * renames members of "struct metaslab trees to be shorter without losing meaning * space_map_{alloc,truncate}() accept a block size as a parameter. The reason is that in the current state all space maps that we allocate through the DMU use a global tunable (space_map_blksz) which defauls to 4KB. This is ok for metaslab space maps in terms of bandwirdth since they are scattered all over the disk. But for other space maps this default is probably not what we want. Examples are device removal's vdev_obsolete_sm or vdev_chedkpoint_sm from this review. Both of these have a 1:1 relationship with each vdev and could benefit from a bigger block size. Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> OpenZFS-issue: https://illumos.org/issues/9166 OpenZFS-commit: openzfs/openzfs@7159fdb Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> Porting notes: The part of dsl_scan_sync() which handles async destroys has been moved into the new dsl_process_async_destroys() function. Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write to block device backed pools. ZTS: Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg". Don't use large dd block sizes on /dev/urandom under Linux in checkpoint_capacity. Adopt Delphix-OS's setting of 4 (spa_asize_inflation = SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed its attempts to fill the pool Create the base and nested pools with sync=disabled to speed up the "setup" phase. Clear labels in test pool between checkpoint tests to avoid duplicate pool issues. The import_rewind_device_replaced test has been marked as "known to fail" for the reasons listed in its DISCLAIMER. New module parameters: zfs_spa_discard_memory_limit, zfs_remove_max_bytes_pause (not documented - debugging only) vdev_max_ms_count (formerly metaslabs_per_vdev) vdev_min_ms_count
Details about the motivation of this feature and its usage can be found in this blogpost: https://sdimitro.github.io/post/zpool-checkpoint/ A lightning talk of this feature can be found here: https://www.youtube.com/watch?v=fPQA8K40jAM Implementation details can be found in big block comment of spa_checkpoint.c Side-changes that are relevant to this commit but not explained elsewhere: * renames members of "struct metaslab trees to be shorter without losing meaning * space_map_{alloc,truncate}() accept a block size as a parameter. The reason is that in the current state all space maps that we allocate through the DMU use a global tunable (space_map_blksz) which defauls to 4KB. This is ok for metaslab space maps in terms of bandwirdth since they are scattered all over the disk. But for other space maps this default is probably not what we want. Examples are device removal's vdev_obsolete_sm or vdev_chedkpoint_sm from this review. Both of these have a 1:1 relationship with each vdev and could benefit from a bigger block size. Porting notes: * The part of dsl_scan_sync() which handles async destroys has been moved into the new dsl_process_async_destroys() function. * Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write to block device backed pools. * ZTS: * Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg". * Don't use large dd block sizes on /dev/urandom under Linux in checkpoint_capacity. * Adopt Delphix-OS's setting of 4 (spa_asize_inflation = SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed its attempts to fill the pool * Create the base and nested pools with sync=disabled to speed up the "setup" phase. * Clear labels in test pool between checkpoint tests to avoid duplicate pool issues. * The import_rewind_device_replaced test has been marked as "known to fail" for the reasons listed in its DISCLAIMER. * New module parameters: zfs_spa_discard_memory_limit, zfs_remove_max_bytes_pause (not documented - debugging only) vdev_max_ms_count (formerly metaslabs_per_vdev) vdev_min_ms_count Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/9166 OpenZFS-commit: openzfs/openzfs@7159fdb8 Closes #7570
Description
See references below.
Motivation and Context
Details about the motivation of this feature and its usage can
be found in this blogpost:
A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM
Implementation details can be found in big block comment of
spa_checkpoint.c
Side-changes that are relevant to this commit but not explained
elsewhere:
parameter. The reason is that in the current state all space
maps that we allocate through the DMU use a global tunable
(space_map_blksz) which defauls to 4KB. This is ok for
metaslab space maps in terms of bandwirdth since they are
scattered all over the disk. But for other space maps this
default is probably not what we want. Examples are device
removal's vdev_obsolete_sm or vdev_chedkpoint_sm from this
review. Both of these have a 1:1 relationship with each vdev
and could benefit from a bigger block size.
Authored by: Serapheim Dimitropoulos serapheim.dimitro@delphix.com
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: John Kennedy john.kennedy@delphix.com
Reviewed by: Dan Kimmel dan.kimmel@delphix.com
Approved by: Richard Lowe richlowe@richlowe.net
OpenZFS-issue: https://illumos.org/issues/9166
OpenZFS-commit: openzfs/openzfs@7159fdb
Ported-by: Tim Chase tim@chase2k.com
Signed-off-by: Tim Chase tim@chase2k.com
Porting notes:
Desirable administrative feature, particularly to allow reverting pool configuration changes.
How Has This Been Tested?
Light manual testing so far.
Types of changes
Checklist:
Signed-off-by
.