System information

Linux version 5.15.0-50-lowlatency (buildd@lcy02-amd64-093) (gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #56-Ubuntu SMP PREEMPT Wed Sep 21 13:57:05 UTC 2022
Describe the problem you're observing

Approximately daily, z_wr_int_0 and z_wr_int_1 go into an apparently infinite, tight loop, chewing up CPU time until a reboot is required.

I believe this is triggered by one of the system's disks (forming a one-disk scratch pool) in a marginal USB enclosure intermittently falling off the bus for a couple of seconds.

ZFS is right to be unhappy about this, but I would hope for a more graceful and recoverable response than spinning these threads (are they retrying writes?) while ramping up the write-error count indefinitely; perhaps the disk should be taken straight to the FAULTED or REMOVED state.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
83374 root 20 0 0 0 0 S 33.3 0.0 38:56.03 z_wr_int_1
2 root 20 0 0 0 0 S 32.6 0.0 30:58.80 kthreadd
83369 root 39 19 0 0 0 S 32.6 0.0 37:52.49 z_wr_iss
83372 root 20 0 0 0 0 S 32.6 0.0 38:56.87 z_wr_int_0
2668 root 0 -20 0 0 0 D 15.5 0.0 16:08.31 spl_dynamic_tas
83380 root 0 -20 0 0 0 S 3.1 0.0 4:09.61 z_cl_iss
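In case it's useful to anyone hitting this, the kernel stacks of the spinning threads can be sampled like so (a sketch; the PIDs are taken from the top output above, and /proc/<pid>/stack needs root plus a kernel built with CONFIG_STACKTRACE):

    # sample each spinning z_wr_int thread's kernel stack
    for pid in 83372 83374; do
        echo "== thread $pid =="
        sudo cat /proc/$pid/stack
    done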
pool: scratch-on-6tbusb
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub canceled on Sat Oct 15 11:20:54 2022
config:
  NAME                                                 STATE   READ WRITE CKSUM
  scratch-on-6tbusb                                    ONLINE     0     0     0
    usb-WD_Elements_25A3_57583531443638444C303041-0:0  ONLINE    70  891M     0
  cache
    nvme0n1p2                                          ONLINE     0     0     0
errors: Permanent errors have been detected in the following files:
<metadata>:<0x0>
<metadata>:<0x1>
<metadata>:<0x21d>
<metadata>:<0x3bc>
scratch-on-6tbusb/scratch:<0x0>
/scratch/scratch
I/O to the affected pool will fail in userspace (as I expect), but a number of pool operations, such as zpool sync, will also hang indefinitely, which is undesirable.
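A possibly related knob (a sketch, untested here; I don't know whether it affects the spinning threads): the pool's failmode property controls how I/O to a faulted pool behaves. The default, wait, blocks callers until the device recovers, while continue returns EIO to new writes instead:

    # default is 'wait', which blocks I/O until the device recovers
    zpool get failmode scratch-on-6tbusb

    # 'continue' returns EIO to new write requests instead of blocking
    # (already-queued writes may still be retried)
    sudo zpool set failmode=continue scratch-on-6tbusb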
FWIW despite the massive number of read/write errors and the 'permanent' errors detected, the pool/data/metadata is intact after a reboot & scrub.
FWIW^2 I know ZFS-on-USB is considered a poor idea but I'd like to imagine it's a poor idea for the pool's health, not the system's health.
Describe how to reproduce the problem
I'm not sure; it can happen an hour after a reboot or a day after a reboot, but I'm assuming it's triggered by the above pool's disk falling off the bus, even temporarily.

If that's the case, I'd expect a failure that is more graceful and recoverable than a few of ZFS's kernel threads going into an infinite loop. Perhaps the fundamental issue is that ZFS does not detect (USB) device removal, and the kernel just substitutes a block device that fails every operation forever (even after the physical device reappears), if zpoolconcepts(7) is any hint:
REMOVED The device was physically removed while the system was running. Device removal
detection is hardware-dependent and may not be supported on all platforms.
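Whether ZFS (via zed) ever received a removal event is at least checkable; a sketch, assuming zed is running as the usual zfs-zed systemd unit:

    # is the ZFS event daemon (which reacts to udev remove events) running?
    systemctl status zfs-zed

    # what events did the pool generate around the dropout?
    sudo zpool events -v scratch-on-6tbusb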
A manual zpool reopen takes the pool to the UNAVAIL state and stops those spinning threads, which is progress! Unfortunately zpool reopen doesn't actually return, and outstanding zpool/dataset operations (zpool sync, zfs snapshot, etc.) are also still left hanging.
I've found that careful use of zpool clear <pool> <device> and zpool reopen <pool> can sometimes be of use, but I suspect something like the forced-export work will be needed to truly rid us of this sort of problem.
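Concretely, something along these lines once the enclosure has re-enumerated (a sketch; the device name is taken from the zpool status output above, and your paths may differ):

    # re-probe the vdevs now that the device is back on the bus
    sudo zpool reopen scratch-on-6tbusb

    # clear the accumulated error counters on the affected device
    sudo zpool clear scratch-on-6tbusb usb-WD_Elements_25A3_57583531443638444C303041-0:0

    # then verify the pool actually survived
    sudo zpool scrub scratch-on-6tbusb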
That said, do you have a flamegraph or stacktrace of where it's spinning?
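(Something like the following would capture one; a sketch, assuming perf from linux-tools is available, using a PID from the top output above:)

    # sample the spinning kernel thread for 30 seconds, with call graphs
    sudo perf record -F 99 -g -p 83372 -- sleep 30
    sudo perf report --stdio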