Need to fail more gracefully after surprise (USB) drive removal (z_wr_* spinning forever on CPU) #14032

Open
adamdmoss opened this issue Oct 15, 2022 · 3 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@adamdmoss (Contributor) commented Oct 15, 2022

System information

Type                   Version/Name
Distribution Name      Ubuntu
Distribution Version   22.04.1
Kernel Version         Linux version 5.15.0-50-lowlatency (buildd@lcy02-amd64-093) (gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #56-Ubuntu SMP PREEMPT Wed Sep 21 13:57:05 UTC 2022
Architecture           x64
OpenZFS Version        4d5aef3 git master

Describe the problem you're observing

Approximately daily, z_wr_int_0 and z_wr_int_1 will go into an apparently infinite and tight loop chewing up CPU time until a reboot is required.

I believe this is triggered by one of the system's disks (a single-disk scratch pool) in a marginal USB enclosure intermittently falling off the bus for a couple of seconds.

ZFS is right to be unhappy about this, but I would hope for a more graceful and recoverable response than spinning these threads (are they retrying writes?) while the write-error count climbs indefinitely; perhaps the disk should be taken straight to the FAULTED or REMOVED state.
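
(For reference, the closest existing knob I'm aware of is the failmode pool property, which controls behavior after catastrophic pool I/O failure; the default, wait, blocks I/O until the device returns, while continue returns EIO to new writes. Whether it influences the spinning z_wr_* threads seen here I don't know - the lines below are only an illustration of checking and changing it.)

    # Illustrative only: inspect and (optionally) change the failmode property.
    # failmode=wait (default) blocks I/O until the device returns;
    # failmode=continue returns EIO to new write requests instead.
    zpool get failmode scratch-on-6tbusb
    sudo zpool set failmode=continue scratch-on-6tbusb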

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                     
  83374 root      20   0       0      0      0 S  33.3   0.0  38:56.03 z_wr_int_1                  
      2 root      20   0       0      0      0 S  32.6   0.0  30:58.80 kthreadd                    
  83369 root      39  19       0      0      0 S  32.6   0.0  37:52.49 z_wr_iss                    
  83372 root      20   0       0      0      0 S  32.6   0.0  38:56.87 z_wr_int_0                  
   2668 root       0 -20       0      0      0 D  15.5   0.0  16:08.31 spl_dynamic_tas             
  83380 root       0 -20       0      0      0 S   3.1   0.0   4:09.61 z_cl_iss              
  pool: scratch-on-6tbusb
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub canceled on Sat Oct 15 11:20:54 2022
config:

	NAME                                                 STATE     READ WRITE CKSUM
	scratch-on-6tbusb                                    ONLINE       0     0     0
	  usb-WD_Elements_25A3_57583531443638444C303041-0:0  ONLINE      70  891M     0
	cache
	  nvme0n1p2                                          ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x1>
        <metadata>:<0x21d>
        <metadata>:<0x3bc>
        scratch-on-6tbusb/scratch:<0x0>
        /scratch/scratch

I/O to the affected pool will fail in userspace (as I expect), but a number of pool operations such as zpool sync will also hang indefinitely, which is undesirable (one way to inspect the stuck commands is sketched below).
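
(A quick way to confirm that these commands are genuinely stuck rather than just slow - illustrative only, nothing here is taken from the original report:)

    # Illustrative: list zpool/zfs commands with their process state ('D' = uninterruptible
    # sleep) and the kernel function they are currently waiting in.
    ps -eo pid,stat,wchan:32,args | grep -E '[z]pool |[z]fs '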

FWIW, despite the massive number of read/write errors and the 'permanent' errors reported, the pool, data, and metadata are intact after a reboot and scrub.

FWIW^2, I know ZFS-on-USB is considered a poor idea, but I'd like to imagine it's a poor idea for the pool's health, not the system's health.

Describe how to reproduce the problem

I'm not sure - it can happen an hour or a day after a reboot, but I'm assuming it's triggered by the above pool's disk falling off the bus, even temporarily.

If that's the case, I'd expect a failure mode that is more graceful and recoverable than a few of ZFS's kernel threads going into an infinite loop. Perhaps the fundamental issue is that ZFS does not detect (USB) device removal, and the kernel simply substitutes a block device that fails every operation forever (even after the physical device reappears) - if zpoolconcepts(7) is any hint:

     REMOVED   The device was physically removed while the system was running.  Device removal
               detection is hardware-dependent and may not be supported on all platforms.
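
(If that theory is right, then after a drop-and-reattach the by-id symlink should point at a freshly enumerated sdX node while ZFS keeps writing to the dead one. Something like the following would show that - illustrative only, reusing the device name from the status output above and assuming the pool was created with the /dev/disk/by-id path:)

    # Illustrative: did the kernel log a USB disconnect/re-enumeration for the enclosure?
    dmesg | grep -iE 'usb.*(disconnect|new.*device)' | tail -n 20
    # Which block device node does the pool's by-id name currently resolve to?
    ls -l /dev/disk/by-id/usb-WD_Elements_25A3_57583531443638444C303041-0:0*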
adamdmoss added the Type: Defect (Incorrect behavior, e.g. crash, hang) label on Oct 15, 2022
@adamdmoss (Contributor, Author) commented:

A manual zpool reopen takes the pool to the UNAVAIL state and stops those spinning threads, which is progress! Unfortunately zpool reopen doesn't actually return, and outstanding zpool/dataset operations (zpool sync, zfs snapshot, etc) are also still left hanging.

@rincebrain (Contributor) commented:

I've found that careful use of zpool clear pool device and zpool reopen pool can sometimes be of use (roughly the sequence sketched below), but I suspect something like the forced-export work will be needed to truly rid us of this sort of problem.
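
(For reference, the suggested sequence looks something like this - a sketch only, using the pool and device names from the report; whether it helps presumably depends on whether the device node came back under the same name:)

    # Illustrative sequence: clear the error/suspension state on the vdev,
    # then ask ZFS to re-open its devices.
    sudo zpool clear scratch-on-6tbusb usb-WD_Elements_25A3_57583531443638444C303041-0:0
    sudo zpool reopen scratch-on-6tbusb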

That said, do you have a flamegraph or stacktrace of where it's spinning?
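
(In case it helps gather that data, the usual options for kernel threads are /proc/<pid>/stack for a point-in-time stack and perf for a profile/flamegraph - illustrative commands below, nothing ZFS-specific:)

    # Illustrative: dump the current kernel stack of each spinning z_wr_* thread (needs root).
    for pid in $(pgrep '^z_wr_'); do echo "== $pid"; sudo cat /proc/$pid/stack; done
    # Illustrative: sample one of the threads for 30s; read 'perf report' (or feed
    # 'perf script' output to the FlameGraph scripts) to see where it spins.
    sudo perf record -F 99 -g -p <pid-of-z_wr_int_0> -- sleep 30
    sudo perf report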

@adamdmoss (Contributor, Author) commented:

Finally got a flamegraph of the spinning - blah
