Defer new resilvers until the current one ends #7732
Conversation
IMHO, this feature should check the progress of the current scan: if you're almost finished with a scan, deferring makes good sense. OTOH, if you're just starting a scan, it will be faster to restart.
include/sys/spa_impl.h
Outdated
```c
@@ -270,6 +270,7 @@ struct spa {
	uint64_t spa_scan_pass_scrub_spent_paused;	/* total paused */
	uint64_t spa_scan_pass_exam;	/* examined bytes per pass */
	uint64_t spa_scan_pass_issued;	/* issued bytes per pass */
	boolean_t spa_resilver_defered;	/* resilver has been defered */
```
typo: "defered" should be "deferred"
This idea has merit.
But the other side is: if you are resilvering a 12 TB disk that is 9 TB full, you may not want to restart at any point, except perhaps very early on. This only gets worse as larger drives are released.
Two ways to handle this:
- Time based. If the existing is going to finish soonish, (a debatable time...), then continue.
- Size / percent based.
Another way to handle this is for the feature flag to have a companion variable that specifies the percent complete below which a restart is allowed. Perhaps defaulting to 0 (zero) for deferring all resilvers, and 100 for allowing a second resilver to cause a restart at any point.
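As a rough sketch of that companion tunable (hypothetical names and a mock policy helper, not code from this PR), the decision reduces to a single comparison:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical policy helper: restart the running resilver to absorb a
 * new one only when the current pass is less than restart_pct percent
 * complete.  A threshold of 0 defers every new resilver; 100 restarts
 * at any point, matching the two extremes described above.
 */
static bool
resilver_should_restart(unsigned int pct_complete, unsigned int restart_pct)
{
	return (pct_complete < restart_pct);
}
```

With a default of 0, all new resilvers would be deferred, preserving this PR's behavior unless the admin opts in to restarts.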
How do we ensure that we don't issue unnecessary resilver zio's, on vdevs that have DTL's but that are not part of the current resilver (i.e. they've been deferred)? I was expecting to see some change to dsl_scan_need_resilver(), its callees, or the DTL code.
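A minimal sketch of the missing check being asked about (mock types; the real dsl_scan_need_resilver() works on the actual vdev tree and DTLs):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the two vdev properties that matter here. */
typedef struct {
	bool resilver_deferred;	/* vdev is not part of the current resilver */
	bool dtl_nonempty;	/* vdev has outstanding DTL entries */
} vdev_mock_t;

/*
 * A deferred vdev still has a DTL, but it should not receive resilver
 * zio's until the next resilver actually starts.
 */
static bool
need_resilver_io(const vdev_mock_t *vd)
{
	if (vd->resilver_deferred)
		return (false);
	return (vd->dtl_nonempty);
}
```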
include/sys/spa_impl.h
Outdated
```c
@@ -270,6 +270,7 @@ struct spa {
	uint64_t spa_scan_pass_scrub_spent_paused;	/* total paused */
	uint64_t spa_scan_pass_exam;	/* examined bytes per pass */
	uint64_t spa_scan_pass_issued;	/* issued bytes per pass */
	boolean_t spa_resilver_defered;	/* resilver has been defered */
```
Could you elaborate on the comment, e.g. we are in the middle of a resilver, and another resilver is needed once this one completes. This is set iff any vdev_resilver_deferred is set.
That said, I'm not entirely sure that spa_resilver_deferred is needed. Could we check if the feature is active instead?
I actually spent a good bit of time thinking about this. The problem is that we don't always have a tx handy when we want to defer a resilver. This is the reason we increment the feature count in dsl_scan_sync().
```c
	    vdev_resilver_needed(spa->spa_root_vdev, NULL, NULL)) {
		zfs_dbgmsg("starting defered resilver txg=%llu",
		    (longlong_t)tx->tx_txg);
		spa_async_request(spa, SPA_ASYNC_RESILVER);
```
We mark these vdevs no longer "resilver deferred" in this txg (in the config), but the resilver is started asynchronously. So I think we could "lose" the resilver if we crash before it starts. Is that OK because we'll re-check for resilver when we open the pool again? Assuming that's the case, let's leave a comment here explaining it.
Maybe another way to make this more clear would be to have dsl_scan_reset_defered() not return anything, and only check vdev_resilver_needed() here. I think that would provide the same behavior. But that way it doesn't look like we're depending on resilver_deferred to tell us to do a resilver. I think that the resilver_deferred flag might be more accurately described as "vdev is not part of current resilver". Because all it needs to do is tell us to ignore this vdev's DTL for resilver purposes (not issue resilver i/os to it, and not clear the DTL when the resilver completes). (Incidentally, this is why a "participates in resilver" flag would be more natural, but I understand that we don't want to activate a new on-disk feature during all resilvers.)
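The shape Matt describes could be sketched like this (mock vdev tree; the real functions take vdev_t and dmu_tx_t arguments and dirty the config):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	MOCK_MAX_CHILDREN 4

typedef struct vdev_mock {
	bool resilver_deferred;
	struct vdev_mock *child[MOCK_MAX_CHILDREN];
	size_t children;
} vdev_mock_t;

/*
 * Clear the deferred flag over the whole tree without returning
 * anything; the caller separately checks vdev_resilver_needed() and,
 * if true, issues spa_async_request(spa, SPA_ASYNC_RESILVER).
 */
static void
clear_deferred(vdev_mock_t *vd)
{
	vd->resilver_deferred = false;
	for (size_t i = 0; i < vd->children; i++)
		clear_deferred(vd->child[i]);
}
```

Keeping the "do we need a resilver" decision out of the clear function is the point: the deferred flag then only means "ignore this vdev's DTL for the current resilver", never "start one".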
I agree with Matt's comments. How about changing the name of dsl_scan_reset_deferred() to dsl_scan_clear_deferred() and change the variables resilver_available and should_resilver to resilver_needed.
I would also change your comment slightly:
```c
/*
 * Clear any deferred_resilver flags in the config. If there are drives that
 * need resilvering, kick off an asynchronous request to start resilver.
 * ...
```
Looks good.
module/zfs/dsl_scan.c
Outdated
```diff
 	/*
 	 * Check for scn_restart_txg before checking spa_load_state, so
 	 * that we can restart an old-style scan while the pool is being
 	 * imported (see dsl_scan_init).
 	 */
-	if (dsl_scan_restarting(scn, tx)) {
+	if (!spa->spa_resilver_defered && dsl_scan_restarting(scn, tx)) {
```
I wonder if it could be better to never set scn_restart_txg, rather than setting it and then ignoring it here?
We already don't set it. I originally added it to deal with the old-style resilver code that also wanted to set the restart txg. However, this code should never be run when features are active anyway, so it should be safe to put this line back to the way it was.
With RAIDZ2 or triple mirrors, what if 1 drive is resilvering but 2 drives fail in another vdev simultaneously? Do you still finish the first resilver without redundancy?
Sorry, everybody. I've been on vacation for the last week. I'll be able to address all comments on Monday. I am definitely aware that we need some kind of parameter that will determine at what point we should still allow existing resilvers to restart (so that you can resilver 2 drives at once, for instance). @ahrens the biggest issue I have with how I wrote this PR (as you mentioned in one of your comments) is that I don't have a good way to atomically dirty the vdev config AND activate the feature refcount. Right now I'm getting around this by having the sync thread check if it needs to activate the feature every time, but that doesn't seem like the best way to do this. Do you know of something better I can do here? You are also correct that I omitted a check that would prevent unnecessary IOs from being issued to a deferred-resilver vdev, and I will correct that on Monday. I am adding a WIP tag to this PR (which definitely should have been there when I posted it).
@scineram Yes, in fact that is a primary use case for this. You really want that first failed drive to resilver ASAP to restore some of the redundancy. With the current logic and a RAIDZ3 vdev, imagine the following scenario: one disk dies [...] With the new logic, the same pattern of failures would cause [...] So with RAIDZ3, this change makes us 1/3 as vulnerable to data loss.
With the current logic, you can resilver 2 drives at once, if they both fail while a resilver is in progress. (The "deferred" resilver will resilver all drives that died during the previous resilver.) I'm not sure we would want a parameter to allow resilvers to restart, but if we do, it should be set very low (e.g. only if we are <10% complete and another drive fails then it's OK to lose the progress and restart the resilver).
@ahrens Yeah. My concern was for the use case where 2 drives fail, the sysadmin replaces one and it starts resilvering, and then immediately replaces the other and it gets deferred because the initial resilver is still going.
@tcaputi That makes sense.
It gets more complicated when there is more than one top-level vdev, in which case the policy might change depending on the redundancy risk... somewhere around here I have models for this stuff...
@ahrens, my question was more about the multiple vdev scenario. I have |
(force-pushed from 63b5e31 to a9b3ba9)
The patch has been updated to address comments. Still don't have a mechanism for preventing deferrals when they are not desired yet.
(force-pushed from a9b3ba9 to d719196)
(force-pushed from c350841 to e6dd10a)
(force-pushed from 63bc054 to 4f96adf)
@scineram That is an interesting scenario. Let's compare the time until we are "safe" and can tolerate another (third) failure on vdev 2. Current code: [...] Proposed change, timing 1: [...] Proposed change, timing 2: [...] So you're right to be concerned that in some cases this change could extend the risk window, but in other cases it could reduce it. I think it's worth considering restarting the resilver if the current resilver is not restoring the last redundancy of any vdev (e.g. we're fixing 1 drive in a RAIDZ2), and restarting the resilver would restore the last redundancy of some vdev (e.g. fix 2 drives in a RAIDZ2). But I think we can leave that work for a future PR.
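To make the risk-window trade-off concrete with invented round numbers (not figures from the thread: a full resilver pass takes T hours, and the second drive fails a fraction f of the way through the first pass):

```c
#include <assert.h>

/*
 * Hours (after the second failure) until the FIRST drive regains full
 * redundancy.  Restarting throws away f*T of progress and runs a fresh
 * full pass; deferring just finishes the current pass.
 */
static double
first_drive_safe_restart(double T) { return (T); }

static double
first_drive_safe_defer(double T, double f) { return ((1.0 - f) * T); }

/*
 * Hours until the SECOND drive regains full redundancy.  A restart
 * covers it in the same fresh pass; deferring makes it wait for the
 * remainder of the current pass plus a full pass of its own.
 */
static double
second_drive_safe_restart(double T) { return (T); }

static double
second_drive_safe_defer(double T, double f) { return ((1.0 - f) * T + T); }
```

With T = 10 and f = 0.5, deferring protects the first drive in 5 hours instead of 10, at the cost of the second drive waiting 15 hours instead of 10 — exactly the kind of trade-off a restart threshold would arbitrate.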
This looks good to me.
I think you'll also need to update the manpage sections for "zpool attach" and "zpool scrub".
module/zfs/dsl_scan.c
Outdated
```c
@@ -2966,6 +3026,26 @@ dsl_scan_active(dsl_scan_t *scn)
	return (used != 0);
}

static boolean_t
dsl_scan_check_deferred_vdevs(vdev_t *vd)
```
This seems similar to dsl_scan_reset_deferred(), so a similar name would be dsl_scan_check_deferred() (or add _vdevs to both?)
You might also put the two functions next to each other in the file.
module/zfs/vdev.c
Outdated
```c
@@ -749,6 +749,11 @@ vdev_alloc(spa_t *spa, vdev_t **vdp, nvlist_t *nv, vdev_t *parent, uint_t id,
	(void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_RESILVER_TXG,
	    &vd->vdev_resilver_txg);

	vd->vdev_resilver_deferred = (nvlist_lookup_boolean(nv,
	    ZPOOL_CONFIG_RESILVER_DEFER) == 0);
```
FYI, another way to do this would be `... = nvlist_exists(...);`. Not sure if that's any more or less clean.
module/zfs/spa.c
Outdated
```c
	if (dsl_scan_resilvering(spa_get_dsl(spa)) &&
	    spa_feature_is_enabled(spa, SPA_FEATURE_RESILVER_DEFER)) {
		newvd->vdev_resilver_deferred = B_TRUE;
		spa->spa_resilver_deferred = B_TRUE;
```
Can we keep this code next to its logical else, the dsl_resilver_restart() call?
(force-pushed from 4f96adf to dcab844)
Apologies for letting this sit for a week. PR is rebased on master and @ahrens comments have been addressed.
(force-pushed from dcab844 to 9b52eeb)
include/sys/fs/zfs.h
Outdated
```c
@@ -988,6 +989,7 @@ typedef struct vdev_stat {
	uint64_t vs_scan_processed;	/* scan processed bytes */
	uint64_t vs_fragmentation;	/* device fragmentation */
	uint64_t vs_checkpoint_space;	/* checkpoint-consumed space */
	uint64_t vs_resilver_deferred;	/* resilver deferred? */
```
nit: why the `?`? This shouldn't be ambiguous.
I was following the convention of vs_scan_removing
I see. It's a little thing, but would you mind removing it? My concern is that the question mark implies, to me at least, that vs_resilver_deferred can't be entirely trusted: being set may or may not imply a resilver is currently deferred.
module/zcommon/zfeature_common.c
Outdated
```c
	zfeature_register(SPA_FEATURE_RESILVER_DEFER,
	    "com.datto:resilver_defer", "resilver_defer",
	    "Support for defering new resilvers when one is already running.",
	    ZFEATURE_FLAG_READONLY_COMPAT, NULL);
```
How do you disable this? All new pools enable all features by default, and once enabled this feature cannot be disabled. I think this functionality needs an additional pool property to set the policy, how about resilver=immediate|defer|never.
immediate - the current behavior
defer - the new deferred behavior
never - do not automatically start resilvers (a reasonable option now that there's a sub-command)
What do you mean that it can't be disabled once it is enabled? Can't you just disable it like any other feature?
Keep in mind the distinction between disabled (not in use, not allowed to use), enabled (but inactive - not in use, but allowed to use), and active (in use, downgrade to older software not supported). There is no feature disabling in ZFS. However features can go back and forth between enabled and active.
All that said, I disagree with @behlendorf's suggestion of making this a pool property, because I think that the new behavior is appropriate in nearly all cases.
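The feature lifecycle just described can be sketched as a small state machine (a hypothetical validity check for illustration; the real state lives in the pool's MOS feature arrays):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
	FEATURE_DISABLED,	/* not in use, not allowed to be used */
	FEATURE_ENABLED,	/* not in use, but allowed to be used */
	FEATURE_ACTIVE		/* in use; unsupported by older software */
} feature_state_t;

/*
 * Enabled and active may alternate freely, but nothing ever returns
 * to disabled, and a disabled feature must be enabled before use.
 */
static bool
feature_transition_ok(feature_state_t from, feature_state_t to)
{
	if (to == FEATURE_DISABLED)
		return (from == FEATURE_DISABLED);	/* no disabling */
	if (from == FEATURE_DISABLED)
		return (to == FEATURE_ENABLED);		/* must enable first */
	return (true);
}
```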
> There is no feature disabling in ZFS
This is specifically my concern. I agree the new behavior may be appropriate for the majority of cases, but there are operational reasons for preferring the existing behavior. For example, any existing infrastructure built on ZFS may depend on it.
This change provides no mechanism to opt-out except at pool creation time. By the time you discover this is an issue, it could be too late and it's non-reversible. That's not a good user experience.
If it were possible to run `zfs set feature@resilver_defer=disabled <pool>` when the feature is enabled (but not active), that would be fine. Adding a module option would also be reasonable. As long as there's a mechanism if you absolutely need it.
@behlendorf I think you can accomplish what you want by using `zpool create -o feature@resilver_defer=disabled`.
I think @behlendorf was saying there is no way to disable a feature after the feature is enabled.
(force-pushed from 9b52eeb to cf716af)
Looks like the new test case might need a little longer to run:
http://build.zfsonlinux.org/builders/Ubuntu%2017.04%20x86_64%20Coverage%20%28TEST%29/builds/3710
(force-pushed from cf716af to 9e402dc)
@behlendorf Timeout has been doubled and I also added a line to reset
module/zfs/dsl_scan.c
Outdated
```c
		dsl_scan_reset_deferred(spa->spa_root_vdev, tx);
		if (should_resilver) {
			zfs_dbgmsg("starting deferred resilver txg=%llu",
			    (longlong_t)tx->tx_txg);
```
This shouldn't be necessary - spa_history_log_internal() should also generate a dbgmsg (but feel free to double check this - code is in spa_history_log_sync()).
(force-pushed from 9e402dc to b1033ba)
Could you add logic to the positive test that verifies the state of the feature as it moves from enabled to active and back?
Generally looks good. Let me know if my suggestions make sense.
cmd/zpool/zpool_main.c
Outdated
```c
			    (ps->pss_func == POOL_SCAN_RESILVER) ?
			    "resilvering" : "repairing");
		} else if (vs->vs_resilver_deferred) {
			(void) printf(gettext("  (resilver deferred)"));
```
I wonder if a better message here might be (awaiting resilver) or perhaps (needs resilvering)?
```c
	    spa_feature_is_enabled(spa, SPA_FEATURE_RESILVER_DEFER))
		vdev_set_deferred_resilver(spa, vd);
	else
		spa_async_request(spa, SPA_ASYNC_RESILVER);
```
How about a macro/helper-function like: boolean_t spa_should_defer_resilver(spa)? This would check if the feature was set and then look to see if we are currently resilvering. This could be used in at least 4 places in the code.
Matt and I actually talked a bit about this before offline. The checks are actually slightly different. 2 of them could use a common helper function, but I personally try to find 3 use-cases before abstracting something to a separate function.
OK, fair enough, although I think my suggestion was a little more narrow than the original suggestion from Matt. I was suggesting just combining the two checks "dsl_scan_resilvering()" and "spa_feature_is_enabled()" into a single function call. I think you use that in more than two locations. This was more about clarity than code reduction.
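The narrower helper being suggested would simply combine the two predicates (mock fields standing in for dsl_scan_resilvering() and spa_feature_is_enabled(); names as proposed in the thread):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the two conditions being combined. */
typedef struct {
	bool resilvering;	/* dsl_scan_resilvering(spa_get_dsl(spa)) */
	bool defer_enabled;	/* spa_feature_is_enabled(spa,
				   SPA_FEATURE_RESILVER_DEFER) */
} spa_mock_t;

/* Defer a new resilver iff one is running and the feature is enabled. */
static bool
spa_should_defer_resilver(const spa_mock_t *spa)
{
	return (spa->resilvering && spa->defer_enabled);
}
```

Each call site then reads as a single intent-revealing check rather than a pair of conditions.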
(force-pushed from b1033ba to f8edef7)
Currently, if a resilver is triggered for any reason while an existing one is running, zfs will immediately restart the existing resilver from the beginning to include the new drive. This causes problems for system administrators when a drive fails while another is already resilvering. In this case, the optimal thing to do to reduce risk of data loss is to wait for the current resilver to end before immediately replacing the second failed drive, which allows the system to operate with two incomplete drives for the minimum amount of time.

This patch introduces the resilver_defer feature that essentially does this for the admin without forcing them to wait and monitor the resilver manually. The change requires an on-disk feature since we must mark drives that are part of a deferred resilver in the vdev config to ensure that we do not assume they are done resilvering when an existing resilver completes.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
(force-pushed from f8edef7 to 09156cf)
@mmaybee your comments have been addressed and the patch has been rebased on master (it actually conflicted with the new [...])
Codecov Report
```
@@            Coverage Diff             @@
##           master    #7732      +/-  ##
==========================================
+ Coverage   78.46%   78.58%   +0.12%
==========================================
  Files         377      377
  Lines      114404   114488      +84
==========================================
+ Hits        89765    89974     +209
+ Misses      24639    24514     -125
```
Continue to review full report at Codecov.
Currently, if a resilver is triggered for any reason while an existing one is running, zfs will immediately restart the existing resilver from the beginning to include the new drive. This causes problems for system administrators when a drive fails while another is already resilvering. In this case, the optimal thing to do to reduce risk of data loss is to wait for the current resilver to end before immediately replacing the second failed drive, which allows the system to operate with two incomplete drives for the minimum amount of time.

This patch introduces the resilver_defer feature that essentially does this for the admin without forcing them to wait and monitor the resilver manually. The change requires an on-disk feature since we must mark drives that are part of a deferred resilver in the vdev config to ensure that we do not assume they are done resilvering when an existing resilver completes.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: @mmaybee
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes openzfs#7732
vdev_clear() can call vdev_set_deferred_resilver() with a non-leaf vdev to set up a deferred resilver. However, this function is currently written to only handle leaf vdevs. This bug was introduced with deferred resilvers in 80a91e7. This patch makes this function recursive so that it can find appropriate vdevs to resilver and set vdev_resilver_deferred on them.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #7732
Closes #8082
Currently, if a resilver is triggered for any reason while an
existing one is running, zfs will immediately restart the existing
resilver from the beginning to include the new drive. This causes
problems for system administrators when a drive fails while another
is already resilvering. In this case, the optimal thing to do to
reduce risk of data loss is to wait for the current resilver to end
before immediately replacing the second failed drive, which allows
the system to operate with two incomplete drives for the minimum
amount of time.
This patch introduces the resilver_defer feature that essentially
does this for the admin without forcing them to wait and monitor
the resilver manually. The change requires an on-disk feature
since we must mark drives that are part of a deferred resilver in
the vdev config to ensure that we do not assume they are done
resilvering when an existing resilver completes.
Signed-off-by: Tom Caputi <tcaputi@datto.com>