Device Rebuild Feature #10349
Conversation
Codecov Report
@@ Coverage Diff @@
## master #10349 +/- ##
==========================================
- Coverage 79.52% 79.51% -0.01%
==========================================
Files 391 392 +1
Lines 123887 124609 +722
==========================================
+ Hits 98519 99086 +567
- Misses 25368 25523 +155
Continue to review full report at Codecov.
Wow, for such a large patch I don't see any major issues (although I can't claim to understand everything going on).
One thing that I did want to bring up is that we currently have a resilver_finish-start-scrub.sh zedlet that you can enable to automatically kick off a scrub after a resilver. We should replicate this for rebuilds as well (along with updating the scrub_after_resilver.ksh test case). I don't think it's worth holding up this big PR to include it though.
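For anyone who hasn't used it, enabling that zedlet is just a matter of linking it into ZED's enabled-scripts directory and restarting the daemon. A rough sketch (the install paths are an assumption and vary by distro/packaging):

    # Enable the existing zedlet that kicks off a scrub when a resilver finishes.
    # The source path is an assumption; check where your packages install the
    # zedlets (commonly /usr/libexec/zfs/zed.d or /usr/lib/zfs-linux/zed.d).
    ln -s /usr/libexec/zfs/zed.d/resilver_finish-start-scrub.sh \
          /etc/zfs/zed.d/resilver_finish-start-scrub.sh

    # Restart ZED so it picks up the newly enabled zedlet.
    systemctl restart zfs-zed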
Rebased on master and conflicts resolved.
I have a couple of general comments.

First, I think it is a bit confusing to introduce this functionality as "rebuild" and contrast it with the existing "resilver" functionality in ZFS. These algorithms are actually trying to achieve the same thing: the reconstruction of missing data. It is actually a bit unfortunate that we ended up using "resilver" as the term for all reconstruction (it's a bit confusing when the layout is raidz or draid). I think we should characterize these two algorithms as "verified reconstruction" (aka resilver) and "unverified reconstruction" (aka rebuild).

Second, given that rebuild is unverified (via checksums), I am a bit concerned about how viable (or at least recommendable) it is as an option. Multiple drive failures (with rebuilds but without an intervening scrub) can lead to data loss. We have seen examples of this during draid stress testing. I think the scrub really is necessary. That being the case, is there really any realistic example where rebuild+scrub is faster than resilver? Maybe an option could be: trigger "rebuild" if the pool is "scrub clean", but trigger resilver if the scrub (following the last rebuild) is incomplete.
@mmaybee Thanks for raising these points. Here are my thoughts (including some background info that I'm sure you're already aware of):
I like the idea of coming up with more descriptive terms for these - "rebuild" and "resilver" don't actually tell me anything about the difference, they sound like the same thing. "sequential reconstruction" would be a more positive term than "unverified reconstruction". And I wonder if there's a term that encapsulates the additional silent-data-corruption-repair that happens with scrub/resilver, in addition to just verification. Maybe "healing reconstruction" and "sequential reconstruction"?
I think that for the primary use case of reconstructing 2-way mirrors, doing the checksum-based healing doesn't get you much, because only ditto blocks have any chance of actually being corrected. That wouldn't be the case for double-parity (and higher) DRAID (or triple-wide mirrors), but see below...
The downside of "sequential reconstruction" is that (without a subsequent scrub), it doesn't heal silently-corrupted data. The most common case where this will be important is with DRAID2. If one drive has suffered silent data corruption (e.g. the drive silently dropped a write), and a different drive fails, we could reconstruct the failed drive using the bad data from the silently-corrupted drive. In this case a given stripe would have incorrect data on 2 drives (the incorrectly reconstructed drive and the silently-corrupted drive). If we read this block (or scrub), we can detect this problem and reconstruct the correct data from the remaining drives. But if we lose another drive before that happens, then the silently-corrupted block will be lost. I agree that the scrub is recommended, although personally I wouldn't go so far as to say that it's required for 2-way mirrors. I would advocate for having ZFS automatically start a scrub after a "sequential reconstruction" completes (even on a 2-way mirror). If the admin wants to cancel the scrub, I would let them take that risk (even on DRAID2+).
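(For reference, opting out would just be the existing one-liner for stopping a scrub; the pool name below is a placeholder:)

    # Stop an in-progress scrub if the admin decides to accept the risk.
    zpool scrub -s tank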
I don't see how it could be faster, since scrub does at least as much work as resilver. But I think there's value to "sequential reconstruction" because the reconstruction gets you back to having full parity (except for blocks that have been silently corrupted) much faster than resilver. I'm OK with (slightly) separating the process of whole-disk failure and reconstruction from healing of silently-corrupted blocks.

RAIDZ resilver times especially are so slow that you run a real risk of additional drives failing before the resilver completes. If the 2nd drive fails after the "sequential reconstruction" but before the scrub completes, you won't lose data (except for silent disk errors). But if a 2nd drive fails before a "healing reconstruction" completes, you lose everything. Sun's customers had this problem a decade+ ago, and this problem has only gotten worse as HDD sizes have increased (while IOPS has remained roughly constant). "Sequential reconstruction" is a way for DRAID to fix this problem.

That said, my argument is predicated on "sequential reconstruction" being much faster than "healing reconstruction". The data point in this PR of being 2x faster is good, but I actually expected much better. Resilvering at ~70MB/s might be good enough. I'm not sure if 2x is worth the complexity, given the trade-off with checksum verification. Maybe the "sequential-ish resilver" work has gotten us to the point of resilver being "good enough"? Changing the indirect block size from 16K to 128K is also an improvement since the "old days".

However, this test case may not be representative of the "worst common case" of resilvering. For example, since the test case has only 1TB of data, "sequential-ish resilver" may be able to cache and sort a good fraction of all the block pointers, which would not be the case with a more realistically-sized pool (e.g. 100TB+ of data). For reference, Delphix has customers with >100TB of data with ashift=9, recordsize=8k, compression=on, at ~80% capacity, and ~85% fragmented. I suspect that this use case would have much worse resilver throughput, even with "sequential-ish resilver". Showing an improvement closer to 10x (timing "sequential reconstruction" without scrub versus "healing reconstruction") in a realistic workload would be a strong argument for the value of this work.
I like it. Calling this "rebuild" always felt awkward to me as well. Let me see about updating the documentation, code comments and man page. As for the command line option, I'm torn but I think there is value in keeping it. Though I would like to change the new option: users would need to opt in, and the option can be clearly explained in the man page. In addition, after an unverified reconstruction (aka rebuild) completes, a scrub can be automatically started.
That depends on how you measure. It can absolutely reduce your exposure time compared to a normal resilver, allowing the pool to suffer a new failure without any data loss (most likely). In the common case we expect the scrub not to discover any blocks which need to be repaired.
I'd like to find a way to both automatically kick off a scrub and display the final rebuild results in `zpool status`.
I think that would significantly reduce the usefulness of rebuild. The case where rebuild helps is where a 2nd or subsequent drive dies after a rebuild completes, but before a resilver could have completed. If you have a sequence of 3 drives failing, this new restriction could unnecessarily delay the reconstruction of the 2nd failure, causing the 3rd failure to take out the whole pool. e.g.:
How about introducing a "targeted" scrub? Where we provide a vdev id to the command such that it trims its traversal to read/verify only block pointers involving the specified drive? This could significantly reduce the time to verify the rebuild. Although this would be a bit complicated for dRAID, as all block pointers reference the same top-level (but we could still figure it out using the offset).
The "rebuild" information is still available to `zpool status`. The original thought was to use a different name for this kind of reconstruction in order to distinguish it from resilvering. However, if we automatically start a scrub when the rebuild completes perhaps there isn't a meaningful distinction. This could still be reported as a resilver in `zpool status`.
In order to keep the testing realistic I avoided some of the worst case scenarios. It's definitely possible to see more than a 2x improvement when the pool is highly fragmented. But rather than show how bad resilver can be (and without the sequential scan code it can be really bad), I wanted to show the new rebuild code could saturate the vdev despite the fragmentation. Things get even better with dRAID when there are multiple redundancy groups since when rebuilding to a distributed hot spare it's possible to determine which blocks are degraded and only rebuild those. For example, if you have a dRAID with 3 redundancy groups you'll only need to rebuild 1/3 of the blocks, and all the devices can participate. But digressing into dRAID specifics felt a little out of scope for this PR.
Done. I've updated the PR to kick off a scrub when the last active rebuild completes. While scrubbing, both the in-progress scrub and the final rebuild results are reported by `zpool status`.
This is a great idea, but I don't think it's necessary for this PR. If the scrub takes a little longer to complete because it checks the entire pool that seems fine. We can always make this improvement at a later date.
I've been giving this some thought but haven't come up with better names. Unless I hear a better suggestion I'll update the documentation to use this language in the next day or so.
@mmaybee @ahrens @richardelling @tonyhutter I've updated this PR with the following noteworthy changes based on the conversation above. I believe this should address all of the concerns thus far. Please give it another look.
Nice, this is an improvement.
Is this true of both "healing" and "sequential reconstruction", or only of "sequential"? It would be nice if the statistic for the resilver was somehow annotated (with "verified" perhaps). A "healing reconstruction" would always have this annotation, while a "sequential reconstruction" would get the annotation once a scrub had completed.
Only for sequential resilvers, since healing resilvers and scrubs are both reported with the dsl scan stats. Sadly we don't have the old healing resilver data when the scrub is running. We could add something but that causes some backward compatibility issues with the user/kernel interface, so I opted not to change this.
That's an interesting idea. There's no verified annotation currently, but until the scrub completes after a sequential resilver the `zpool status` output does indicate that a scrub is still needed.
Since we're now using the RESILVER_START/FINISH event for both types of resilvers, we should add a field in the event to say what type of resilver it is. Specifically, I think you could add it as a "resilver_type=<sequential,healing>" single nvlist to the 3rd arg of spa_event_notify():

    spa_event_notify(spa, NULL, NULL,
        scn->scn_phys.scn_min_txg ?
        ESC_ZFS_RESILVER_FINISH : ESC_ZFS_SCRUB_FINISH);

I believe that nvlist then gets appended to the normal event nvlist. We could then key off that value in the zedlet.
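Something along these lines in the zedlet should work (a rough sketch; the ZEVENT_RESILVER_TYPE name assumes ZED's usual behavior of exporting event nvpairs as upper-cased ZEVENT_* environment variables, and the real zedlet would also check its zed.rc configuration):

    #!/bin/sh
    # Hypothetical excerpt of resilver_finish-start-scrub.sh: only start a
    # scrub for healing resilvers, since a sequential resilver already kicks
    # one off automatically when it completes.
    [ "${ZEVENT_RESILVER_TYPE}" = "healing" ] || exit 0
    exec zpool scrub "${ZEVENT_POOL}"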
The device_rebuild feature enables sequential reconstruction when
resilvering. Mirror vdevs can be rebuilt in LBA order which may
more quickly restore redundancy depending on the pool's average block
size, overall fragmentation and the performance characteristics
of the devices (SMR). However, block checksums cannot be verified
as part of the rebuild thus a scrub is automatically started after
the sequential resilver completes.
The new '-s' option has been added to the `zpool attach` and
`zpool replace` command to request a rebuild instead of a resilver.
zpool attach -s <pool> <existing vdev> <new vdev>
zpool replace -s <pool> <old vdev> <new vdev>
The `zpool status` output has been updated to report the progress
of sequential resilvering in the same way as healing resilvering.
The one notable difference is that multiple sequential resilvers may
be in progress as long as they're operating on different
top-level vdevs.
The `zpool wait -t resilver` command was extended to wait on
sequential resilvers. From this perspective they are no different
than healing resilvers.
Sequential resilvers cannot be supported for RAIDZ, but are
compatible with the dRAID feature being developed.
As part of this change the resilver_restart_* tests were moved
into the functional/replacement directory. Additionally, the
replacement tests were renamed and extended to verify both
resilvering and rebuilding.
Original-patch-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When reporting the scan status, print both the scrub and rebuild statistics when they're available. In order to avoid reporting stale rebuild information, it is now cleared when starting a resilver. Additionally, automatically try to start a scrub when the last active rebuild completes. If for some reason this fails, the user will be advised via 'zpool status' that a scrub should be run. The documentation was updated accordingly. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
* Update the user documentation to describe "healing reconstruction" and "sequential reconstruction" (a.k.a. rebuild) as two different algorithms which can be used to resilver. The code continues to use the rebuild terminology to distinguish it from the traditional scrub/resilver algorithm.
* Updated zpool wait to consolidate the resilver and rebuild cases. As far as zpool wait is concerned a rebuild is the same as normal healing resilvering.
* Changed `zpool attach|replace -r` to `zpool attach|replace -s`. This is no longer referred to as rebuild so the -r doesn't make sense, plus -r was easy to confuse with "(r)esilver" instead of "(r)ebuild". -s is used for (s)equential.
* 'zpool status' was updated to report healing and sequential reconstruction both as resilvering. However, different status values are still reported to user space to ensure they aren't conflated.
* Removed ESC_ZFS_REBUILD_START and ESC_ZFS_REBUILD_FINISH events. Instead post ESC_ZFS_RESILVER_START and ESC_ZFS_RESILVER_FINISH since we're calling this resilver. This has the additional benefit that the existing ZED scripts understand these events.
* Updated error messages.
* Updated test cases to use -s instead of -r.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add the new key 'resilver_type="healing|sequential"' to resilver events in order to determine what type of resilver was performed. Update the `resilver_finish-start-scrub.sh` zedlet to only start a scrub when enabled for healing resilvers. This is not needed for sequential resilvers since a scrub is already automatically started when the last active resilver completes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@tonyhutter good idea. I've updated the PR to add the "resilver_type=<sequential,healing>" to the posted event.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
I manually tried both a sequential and healing resilver, and confirmed that resilver_type was set correctly in the events.
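(For anyone repeating the check, watching the event stream while kicking off each type is enough; pool/device names below are made up:)

    # Follow events verbosely in one terminal...
    zpool events -f -v
    # ...then trigger both resilver types in another and look for the
    # resilver_start/resilver_finish events carrying resilver_type=...
    zpool replace -s tank sdb sdc   # sequential
    zpool replace    tank sdc sdb   # healing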
This looks like a really nice feature. FWIW, I've built and tested it and it seems to work well.
* Updated the language in two comments.
* Switched to an atomic in vdev_stat_update() for consistency with the surrounding code. No real lock contention was observed but this ensures that will continue to be the case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Interesting feature, thanks for adding. One question though:
I get it for the case where a drive is missing, as then it has to be rebuilt from the raidz stripes' parity, which requires a metadata walk to locate. But what happens when the replace is targeting a still-online drive?
That's not supported by the current version. However, I believe that functionality could be added with some additional development work.
The device_rebuild feature enables sequential reconstruction when
resilvering. Mirror vdevs can be rebuilt in LBA order which may
more quickly restore redundancy depending on the pool's average block
size, overall fragmentation and the performance characteristics
of the devices. However, block checksums cannot be verified
as part of the rebuild thus a scrub is automatically started after
the sequential resilver completes.
The new '-s' option has been added to the `zpool attach` and
`zpool replace` command to request sequential reconstruction
instead of healing reconstruction when resilvering.
zpool attach -s <pool> <existing vdev> <new vdev>
zpool replace -s <pool> <old vdev> <new vdev>
The `zpool status` output has been updated to report the progress
of sequential resilvering in the same way as healing resilvering.
The one notable difference is that multiple sequential resilvers
may be in progress as long as they're operating on different
top-level vdevs.
The `zpool wait -t resilver` command was extended to wait on
sequential resilvers. From this perspective they are no different
than healing resilvers.
Sequential resilvers cannot be supported for RAIDZ, but are
compatible with the dRAID feature being developed.
As part of this change the resilver_restart_* tests were moved
into the functional/replacement directory. Additionally, the
replacement tests were renamed and extended to verify both
resilvering and rebuilding.
Original-patch-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: John Poduska <jpoduska@datto.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#10349
Motivation and Context
The sequential reconstruction feature adds a more traditional RAID rebuild mechanism to ZFS. Specifically, it allows mirror vdevs to be rebuilt in LBA order. Depending on the pool's average block size, overall fragmentation, and the performance characteristics of the devices (SMR), sequential reconstruction can restore redundancy in less time than a traditional healing resilver. However, it cannot verify block checksums as part of the rebuild. Therefore a scrub is automatically started when the last active sequential resilver completes.
Sequential reconstruction cannot be supported for RAIDZ configurations but is compatible with the dRAID feature being developed. This functionality was part of the dRAID PR #10102 but is generically useful and can be merged independently.
Description
The new `-s` option has been added to the `zpool attach` and `zpool replace` commands to request sequential reconstruction be performed instead of healing reconstruction when resilvering.

Regardless of whether a sequential or healing resilver was requested, the `zpool wait -t resilver` command can be used to block until the resilver completes.

The `zpool status` output was updated to report the progress of a sequential resilver in the same way as a normal healing resilver. The only difference is that the top-level vdev is included in the status output. This is important because sequential resilvers are handled at the top-level vdev, which means multiple sequential resilvers may be in progress as long as they're operating on different top-level vdevs. Sequential/healing resilvers and scrubs are all still mutually exclusive operations and only one at a time is permitted.
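A typical sequence looks roughly like this (pool and device names are just placeholders):

    # Request a sequential reconstruction when replacing a device.
    zpool replace -s tank sdb sdc

    # Block until the resilver (sequential or healing) completes.
    zpool wait -t resilver tank

    # Progress of the rebuild, and of the follow-up scrub that is started
    # automatically, is reported per top-level vdev.
    zpool status tank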
How Has This Been Tested?

As part of this change the `resilver_restart_*` tests were moved into the `functional/replacement` directory, and the `replacement` tests were renamed and extended to verify both resilvering and rebuilding.

Performance was tested using a single vdev pool comprised of an 8 TB Seagate Constellation HDD (ST8000NM0075). The attached and replaced drives were identical HDD devices. The pools were created with default settings (128K record size, no compression) and filled with 1TiB of data.
The sustained performance of resilvering and rebuilding is shown in MiB/s for four reference pools with a range of block sizes and fragmentation caused by simulated pool aging. Additionally, scrub performance has been included for reference. All testing was performed while the pool was idle.
The key takeaway from the chart below is that pools containing predominantly small blocks (4k-16k) show approximately a 2x increase in performance when rebuilding vs resilvering, thereby effectively halving the replacement time. Pools with large average block sizes (>=128k) perform the same when rebuilding vs resilvering.
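For anyone who wants to reproduce the comparison, the measurement boils down to timing the two replacement paths on the same aged pool (a rough sketch, not the exact test rig; names are placeholders):

    # Healing resilver: traditional checksum-verified reconstruction.
    time sh -c 'zpool replace tank sdc sdb && zpool wait -t resilver tank'

    # Sequential rebuild: LBA-order reconstruction. Note the follow-up scrub
    # that starts automatically afterwards is not included in this number.
    time sh -c 'zpool replace -s tank sdb sdc && zpool wait -t resilver tank'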
Performance tunings to explore:
* Up to 16MiB blocks allowed; capping at 1MiB may improve performance.
* Survey zfs_rebuild_queue_limit.
* Survey zfs_vdev_rebuild_min_active and zfs_vdev_rebuild_max_active (a quick way to inspect these tunables is sketched below).

[edit] After limiting the maximum rebuild segment size to 1M, rebuild and resilver performance is now the same for pools with large blocks. The bar graph was updated to show the updated performance test results.
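A quick way to survey and adjust these at runtime (a sketch; the parameter names are the ones mentioned above, and defaults/availability depend on the build):

    # List the current values of the rebuild-related module parameters.
    grep . /sys/module/zfs/parameters/zfs_rebuild_* \
           /sys/module/zfs/parameters/zfs_vdev_rebuild_*

    # Example: raise the floor on concurrent rebuild I/Os while experimenting.
    echo 3 > /sys/module/zfs/parameters/zfs_vdev_rebuild_min_active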