status: report pool suspension state under failmode=continue #15297

robn · 2023-09-19T03:37:52Z

Motivation and Context

When failmode=continue is set and the pool suspends, both zpool status and the zfs/pool/state kstat ignore it and report the normal vdev tree state. There's no clear indicator that the pool is suspended. This is unlike suspend in failmode=wait, or suspend due to MMP check failure, which both report SUSPENDED explicitly.

This commit changes it so SUSPENDED is reported for failmode=continue the same as for other modes.

Rationale:

The historical behaviour of failmode=continue is roughly, "press on as though all is well". To this end, the fact that the pool had suspended was not shown, to maintain the façade that all is well.

Its unclear why hiding this information was considered appropriate. One possibility is that it was expected that a true pool fault would always be reported as DEGRADED or FAULTED, and that the pool could not suspend without these happening.

That is not necessarily true, as vdev health and suspend state are only loosely connected, such that a pool in (apparent) good health can be suspended for good reasons, and of course a degraded pool does not lead to suspension. Even if that expectation were true, there's still a difference in urgency - a degraded pool may not need to be attended to for hours, while a suspended pool is most often unusable until an operator intervenes.

An operator that has set failmode=continue has presumably done so because their workload is one that can continue to operate in a useful way when the pool suspends. In this case the operator still needs a clear indicator that there is a problem that needs attending to.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

How Has This Been Tested?

Tested by hand, forcing a suspend and then checking zpool status output.

I used a script like the following:

zpool create -o failmode=continue tank raidz loop0 loop1 loop2 loop3

zinject -d loop0 -L uber -e io -f 5 tank
zinject -d loop1 -L uber -e io -f 5 tank
zinject -d loop2 -L uber -e io -f 5 tank
zinject -d loop3 -L uber -e io -f 5 tank

for n in $(seq 1 256) ; do
  dd if=/dev/random of=/tank/file$n bs=32K count=1 &
done

and then:

[    9.158386] WARNING: Pool 'tank' has encountered an uncorrectable I/O failure and has been suspended.
[    9.158386]

root@quiz:/# zpool status
  pool: tank
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	 raidz1-0   ONLINE       0     0     0
	   loop0    ONLINE       0     4     0
	   loop1    ONLINE       0     4     0
	   loop2    ONLINE       0     4     0
	   loop3    ONLINE       0     4     0

(there are not really tests for suspend or failmode=continue and I was hoping to not have to write any just for this change).

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

When failmode=continue is set and the pool suspends, both 'zpool status' and the 'zfs/pool/state' kstat ignore it and report the normal vdev tree state. There's no clear indicator that the pool is suspended. This is unlike suspend in failmode=wait, or suspend due to MMP check failure, which both report "SUSPENDED" explicitly. This commit changes it so SUSPENDED is reported for failmode=continue the same as for other modes. Rationale: The historical behaviour of failmode=continue is roughly, "press on as though all is well". To this end, the fact that the pool had suspended was not shown, to maintain the façade that all is well. Its unclear why hiding this information was considered appropriate. One possibility is that it was expected that a true pool fault would always be reported as DEGRADED or FAULTED, and that the pool could not suspend without these happening. That is not necessarily true, as vdev health and suspend state are only loosely connected, such that a pool in (apparent) good health can be suspended for good reasons, and of course a degraded pool does not lead to suspension. Even if that expectation were true, there's still a difference in urgency - a degraded pool may not need to be attended to for hours, while a suspended pool is most often unusable until an operator intervenes. An operator that has set failmode=continue has presumably done so because their workload is one that can continue to operate in a useful way when the pool suspends. In this case the operator still needs a clear indicator that there is a problem that needs attending to. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.

behlendorf

Agreed, there's no good reason to hide that the pool has been suspended. As for using failmode=continue these days its strongly discouraged unless you have known misbehaving hardware/drivers.

This support was added originally because it observed that with some now very, very, very old kernels it was possible for a disk bio to neither 1) complete, 2) error-out, or 3) timeout. This would effectively hang the entire pool with the only option being to continue, retry the I/O, and hope for the best.

robn · 2023-09-21T10:09:37Z

Thanks for the extra info; very hard to discover that early history!

When failmode=continue is set and the pool suspends, both 'zpool status' and the 'zfs/pool/state' kstat ignore it and report the normal vdev tree state. There's no clear indicator that the pool is suspended. This is unlike suspend in failmode=wait, or suspend due to MMP check failure, which both report "SUSPENDED" explicitly. This commit changes it so SUSPENDED is reported for failmode=continue the same as for other modes. Rationale: The historical behaviour of failmode=continue is roughly, "press on as though all is well". To this end, the fact that the pool had suspended was not shown, to maintain the façade that all is well. Its unclear why hiding this information was considered appropriate. One possibility is that it was expected that a true pool fault would always be reported as DEGRADED or FAULTED, and that the pool could not suspend without these happening. That is not necessarily true, as vdev health and suspend state are only loosely connected, such that a pool in (apparent) good health can be suspended for good reasons, and of course a degraded pool does not lead to suspension. Even if that expectation were true, there's still a difference in urgency - a degraded pool may not need to be attended to for hours, while a suspended pool is most often unusable until an operator intervenes. An operator that has set failmode=continue has presumably done so because their workload is one that can continue to operate in a useful way when the pool suspends. In this case the operator still needs a clear indicator that there is a problem that needs attending to. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#15297

When failmode=continue is set and the pool suspends, both 'zpool status' and the 'zfs/pool/state' kstat ignore it and report the normal vdev tree state. There's no clear indicator that the pool is suspended. This is unlike suspend in failmode=wait, or suspend due to MMP check failure, which both report "SUSPENDED" explicitly. This commit changes it so SUSPENDED is reported for failmode=continue the same as for other modes. Rationale: The historical behaviour of failmode=continue is roughly, "press on as though all is well". To this end, the fact that the pool had suspended was not shown, to maintain the façade that all is well. Its unclear why hiding this information was considered appropriate. One possibility is that it was expected that a true pool fault would always be reported as DEGRADED or FAULTED, and that the pool could not suspend without these happening. That is not necessarily true, as vdev health and suspend state are only loosely connected, such that a pool in (apparent) good health can be suspended for good reasons, and of course a degraded pool does not lead to suspension. Even if that expectation were true, there's still a difference in urgency - a degraded pool may not need to be attended to for hours, while a suspended pool is most often unusable until an operator intervenes. An operator that has set failmode=continue has presumably done so because their workload is one that can continue to operate in a useful way when the pool suspends. In this case the operator still needs a clear indicator that there is a problem that needs attending to. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #15297

When failmode=continue is set and the pool suspends, both 'zpool status' and the 'zfs/pool/state' kstat ignore it and report the normal vdev tree state. There's no clear indicator that the pool is suspended. This is unlike suspend in failmode=wait, or suspend due to MMP check failure, which both report "SUSPENDED" explicitly. This commit changes it so SUSPENDED is reported for failmode=continue the same as for other modes. Rationale: The historical behaviour of failmode=continue is roughly, "press on as though all is well". To this end, the fact that the pool had suspended was not shown, to maintain the façade that all is well. Its unclear why hiding this information was considered appropriate. One possibility is that it was expected that a true pool fault would always be reported as DEGRADED or FAULTED, and that the pool could not suspend without these happening. That is not necessarily true, as vdev health and suspend state are only loosely connected, such that a pool in (apparent) good health can be suspended for good reasons, and of course a degraded pool does not lead to suspension. Even if that expectation were true, there's still a difference in urgency - a degraded pool may not need to be attended to for hours, while a suspended pool is most often unusable until an operator intervenes. An operator that has set failmode=continue has presumably done so because their workload is one that can continue to operate in a useful way when the pool suspends. In this case the operator still needs a clear indicator that there is a problem that needs attending to. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#15297

behlendorf approved these changes Sep 20, 2023

View reviewed changes

behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Sep 20, 2023

behlendorf merged commit 4647353 into openzfs:master Sep 20, 2023
19 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

status: report pool suspension state under failmode=continue #15297

status: report pool suspension state under failmode=continue #15297

robn commented Sep 19, 2023 •

edited

Loading

behlendorf left a comment

robn commented Sep 21, 2023

status: report pool suspension state under failmode=continue #15297

status: report pool suspension state under failmode=continue #15297

Conversation

robn commented Sep 19, 2023 • edited Loading

Motivation and Context

How Has This Been Tested?

Types of changes

Checklist:

behlendorf left a comment

Choose a reason for hiding this comment

robn commented Sep 21, 2023

robn commented Sep 19, 2023 •

edited

Loading