Strategy to gracefully handle failing disks #18511

trossoma · 2026-05-08T03:29:28Z

I’m trying to better understand how ZFS handles disk errors and how to mitigate cases where bad disks are causing performance issues.

We run servers in production environments where availability is the top priority, on an older OpenZFS v0.8.6 on AlmaLinux 8.x. There are a couple of zpools with many drives, with vdevs configured for RAIDZ3. Disks cannot be replaced in production.

One of the vdevs has only 6 disks and can suffer varying levels of performance degradation when disks have errors and ZFS is performing self-healing operations.

It seems like the disks have to be in a pretty bad state before ZFS will give up on them. I'm looking for input on strategies for reducing the negative impact of bad disks by adjusting ZFS knobs and/or adding a mechanism to monitor disk health and offline bad disks, either via the kernel or ZFS.

Any recommendations for monitoring and mitigating performance impact of failing disks?

tonyhutter (Maintainer) · 2026-05-08T16:09:12Z

Any recommendations for monitoring and mitigating performance impact of failing disks?

Assuming you'd be open to upgrading to a newer version of ZFS:

zpool status -c will list extended drive stats that may be helpful for monitoring (like SMART stats). Newer versions of ZFS have JSON output for easier parsing too.
The 2.4.x branch contains df55ba7. That auto-detects slow drives and temporarily "sits them out" so that ZFS favors reading from all other drives and reconstructing the data from parity rather than reading from the slow drive. It also allows you to manually sit out drives via a vdev property.
You can set allocating=off on a top-level vdev so that new writes don't land on any of the vdevs under it. That's a pretty big hammer though.
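
For monitoring on a version without the automatic sit-out, one rough way to spot a slow drive is to compare each disk's latency against its peers in the same vdev. The snippet below is only a sketch of that general idea and is not how df55ba7 works internally. The function name, the factor of 3 and the 20 ms floor are made-up values for illustration, and the latency numbers are assumed to come from whatever you already collect (for example zpool iostat -l or the kernel's per-device stats).

```python
from statistics import median


def slow_outliers(latency_ms, factor=3.0, floor_ms=20.0):
    """Return names of disks whose latency is far above their vdev peers.

    latency_ms: mapping of disk name -> average latency in milliseconds,
                for the disks of a single vdev.
    factor, floor_ms: hypothetical tuning values, not ZFS defaults.
    """
    if len(latency_ms) < 3:
        return []  # too few peers for the median to mean anything
    med = median(latency_ms.values())
    # A disk is flagged only if it is both well above the median and
    # above an absolute floor, so an idle vdev does not trigger alerts.
    threshold = max(med * factor, floor_ms)
    return sorted(name for name, ms in latency_ms.items() if ms > threshold)


# Made-up sample for a 6-disk vdev with one struggling drive.
sample = {"sda": 8.1, "sdb": 7.9, "sdc": 9.4, "sdd": 8.6, "sde": 95.0, "sdf": 8.8}
print(slow_outliers(sample))
```

With the sample above only sde is flagged. Comparing against the median of the peers rather than a fixed number means a vdev that is uniformly busy does not flag every disk at once.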

1 reply

trossoma May 8, 2026
Author

Yeah, we previously tried to upgrade to a 2.x version, ran into problems with pool suspensions, and had to revert.

We're working on an automated monitoring mechanism that will offline disks when an error threshold is met. The trick is coming up with the failure criteria.
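
As a starting point for the failure criteria, here is a minimal sketch of the decision step only. It assumes the per-disk READ/WRITE/CKSUM counters from zpool status have already been collected into records. The DiskErrors type, the threshold of 50 and the one-disk-per-vdev cap are all hypothetical and would need tuning against real failure data. The cap is there so the automation can never eat through the RAIDZ3 redundancy on its own.

```python
from dataclasses import dataclass


@dataclass
class DiskErrors:
    name: str    # device name as shown by zpool status
    vdev: str    # parent vdev, e.g. "raidz3-0"
    read: int    # READ error counter
    write: int   # WRITE error counter
    cksum: int   # CKSUM error counter


def pick_disks_to_offline(disks, max_errors=50, max_offline_per_vdev=1,
                          offline_now=None):
    """Choose disks to offline, worst first, without exceeding a per-vdev cap.

    max_errors and max_offline_per_vdev are hypothetical policy values.
    offline_now maps vdev name -> number of disks already offline there.
    """
    offline_now = offline_now or {}
    by_vdev = {}
    for d in disks:
        by_vdev.setdefault(d.vdev, []).append(d)

    chosen = []
    for vdev, members in by_vdev.items():
        # How many more disks this vdev is allowed to lose to automation.
        budget = max(max_offline_per_vdev - offline_now.get(vdev, 0), 0)
        bad = [d for d in members if d.read + d.write + d.cksum >= max_errors]
        bad.sort(key=lambda d: d.read + d.write + d.cksum, reverse=True)
        chosen.extend(d.name for d in bad[:budget])
    return chosen


def offline_commands(pool, names):
    """Build (but do not run) the zpool offline commands, for a dry run."""
    return [["zpool", "offline", pool, name] for name in names]


# Made-up counters: two bad disks in one vdev, but only one may go offline.
disks = [
    DiskErrors("sda", "raidz3-0", 0, 0, 0),
    DiskErrors("sdb", "raidz3-0", 120, 0, 4),
    DiskErrors("sdc", "raidz3-0", 60, 3, 0),
    DiskErrors("sdd", "raidz3-1", 2, 0, 0),
]
print(offline_commands("tank", pick_disks_to_offline(disks)))
```

In this sample both sdb and sdc are over the threshold, but only sdb (the worse of the two) is selected because of the per-vdev cap. Keeping the decision separate from the code that actually runs zpool offline makes it easy to log what the policy would have done before letting it act.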
