
zed: Take no action on scrub/resilver checksum errors #13499

Merged
1 commit merged into openzfs:master on May 24, 2022

Conversation

behlendorf
Contributor

Motivation and Context

When performing a preventative scrub, pre-existing damage to a
vdev resulted in the ZED marking multiple vdevs DEGRADED and
unnecessarily committing all of the hot spares.

Description

When scrubbing/resilvering a pool it can be counterproductive to
cancel the scan and kick off a replace operation to a hot spare
when encountering checksum errors. In this case, the best course
of action is to allow the scrub/resilver to complete as quickly
as possible and to keep the vdevs fully online if possible.

Realistically, this is less of an issue for RAIDZ, since a
traditional resilver must be used and checksums will be verified.
However, this is not the case for a mirror or dRAID pool, which is
sequentially resilvered, so checksum verification is deferred
until after the replace operation completes.

Regardless, we apply this policy to all pool types since it's
a good idea for all vdev types. Degrading additional vdevs has
the potential to make a bad situation worse. Note that the checksum
errors will still be reported both as an event and by `zpool status`;
this change only prevents the ZED from proactively taking any action.
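
For example, after this change a checksum error hit during a scrub should still be visible through the normal reporting paths while the spare stays idle. A minimal sketch of how one might confirm that (the pool name `tank` is an assumption here):

```sh
# Checksum ereports are still posted even though no spare is activated.
zpool events | grep 'ereport.fs.zfs.checksum'

# Per-vdev CKSUM counters remain visible, and the spare stays AVAIL.
zpool status -v tank
```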

How Has This Been Tested?

Manually verified by creating a dRAID pool with a distributed
spare, adding some data to the pool, then zeroing out one of
the vdevs. Two scenarios were then tested.

  1. Started a zpool scrub and verified that the ZED was
    notified of the damage but took no action since the errors
    were encountered as part of scrub I/O. Cleared the pool
    errors, then ran the scrub again to verify everything was
    repaired.

  2. Imported the pool and read all of the files to /dev/null.
    Verified that in this case the ZED was notified of the errors
    and took action by kicking in the distributed spare.

While it'd be nice to add a ZTS test case for this, orchestrating
the test would be fairly involved and likely somewhat unreliable.
Therefore, I've opted to rely on manual testing instead.
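
A rough sketch of that manual procedure follows; the file-backed vdev layout, sizes, and dd offsets are illustrative assumptions rather than the exact commands used in the original test:

```sh
# Build a small dRAID pool with one distributed spare from file-backed vdevs.
truncate -s 1G /var/tmp/vdev{0..4}
zpool create tank draid1:1s /var/tmp/vdev{0..4}

# Add some data to scrub against.
cp -r /usr/share/doc /tank/

# Corrupt one vdev. Skipping the start of the file leaves the vdev labels
# intact, so the damage shows up as checksum errors rather than a faulted
# device.
dd if=/dev/zero of=/var/tmp/vdev3 bs=1M seek=16 count=256 conv=notrunc

# Scenario 1: scrub the pool. The ZED should see the checksum events but
# leave the distributed spare unused.
zpool scrub tank
zpool status -v tank
zpool clear tank
zpool scrub tank            # scrub again to confirm everything was repaired

# Scenario 2: re-damage the vdev, then export/import and read the data back.
# Here the ZED is expected to kick in the distributed spare.
dd if=/dev/zero of=/var/tmp/vdev3 bs=1M seek=16 count=256 conv=notrunc
zpool export tank
zpool import -d /var/tmp tank
find /tank -type f -exec cat {} + > /dev/null
zpool status tank
```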

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

When scrubbing/resilvering a pool it can be counterproductive to
cancel the scan and kick off a replace operation to a hot spare
when encountering checksum errors.  In this case, the best course
of action is to allow the scrub/resilver to complete as quickly
as possible and to keep the vdevs fully online if possible.

Realistically, this is less of an issue for a RAIDZ since a
traditional resilver must be used and checksums will be verified.
However, this is not the case for a mirror or dRAID pool which is
sequentially resilvered and checksum verification is deferred
until after the replace operation completes.

Regardless, we apply this policy to all pool types since it's
a good idea for all vdevs.  Degrading additional vdevs has the
potential to make a bad situation worse.  Note the checksum
errors will still be reported as both an event and by `zpool
status`.  This change only prevents the ZED from proactively
taking any action.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf behlendorf added the Component: ZED ZFS Event Daemon label May 23, 2022
@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label May 24, 2022
@behlendorf behlendorf merged commit cf70c0f into openzfs:master May 24, 2022
behlendorf added a commit to behlendorf/zfs that referenced this pull request May 25, 2022
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#13499
behlendorf added a commit that referenced this pull request May 27, 2022
nicman23 pushed a commit to nicman23/zfs that referenced this pull request Aug 22, 2022
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 12, 2022
beren12 pushed a commit to beren12/zfs that referenced this pull request Sep 19, 2022
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Sep 23, 2022
bsdjhb pushed a commit to CTSRD-CHERI/zfs that referenced this pull request Jul 13, 2023
This is not associated with a specific upstream commit but apparently
a local diff applied as part of:

commit e3aa18ad71782a73d3dd9dd3d526bbd2b607ca16
Merge: 645886d028c8 b9d9845
Author: Martin Matuska <mm@FreeBSD.org>
Date:   Fri Jun 3 17:58:39 2022 +0200

    zfs: merge openzfs/zfs@b9d98453f

    Notable upstream pull request merges:
      openzfs#12321 Fix inflated quiesce time caused by lwb_tx during zil_commit()
      openzfs#13244 zstd early abort
      openzfs#13360 Verify BPs as part of spa_load_verify_cb()
      openzfs#13452 More speculative prefetcher improvements
      openzfs#13466 Expose zpool guids through kstats
      openzfs#13476 Refactor Log Size Limit
      openzfs#13484 FreeBSD: libspl: Add locking around statfs globals
      openzfs#13498 Cancel in-progress rebuilds when we finish removal
      openzfs#13499 zed: Take no action on scrub/resilver checksum errors
      openzfs#13513 Remove wrong assertion in log spacemap

    Obtained from:  OpenZFS
    OpenZFS commit: b9d9845