ZFS infinite retry after vdev IO error #13362

Open
RinCat opened this issue Apr 22, 2022 · 5 comments
Labels
Bot: Not Stale (Override for the stale bot), Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments


RinCat commented Apr 22, 2022

System information

Type                  Version/Name
Distribution Name     Gentoo
Distribution Version  default/linux/amd64/17.1
Kernel Version        5.15.32
Architecture          x86_64
OpenZFS Version       2.1.4

Describe the problem you're observing

ZFS retries infinitely after a vdev IO error, which leaves every operation/process touching the damaged pool stuck in the D state, including zpool itself.
As a result, it is impossible to stop the affected processes, or to unmount, export, or stop the pool.
Affected systems have no option but a hard reboot.

Describe how to reproduce the problem

Not sure how to simulate IO errors, but the pool is running ZFS on LUKS.
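(For reference, one way to provoke IO errors on a scratch/test pool may be the zinject fault-injection utility that ships with OpenZFS; the pool and vdev names below are placeholders:)

# inject EIO on all reads/writes hitting one vdev of a test pool
zinject -d /dev/disk/by-id/XXXXXXXXX -e io -T all testpool
# generate some IO against the pool, then clear the injected faults
zinject -c all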

Include any warning/errors/backtraces from the system logs

Millions of lines like the following:

kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312882950144 size=262144 flags=184880
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312888324096 size=262144 flags=184880
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312888324096 size=262144 flags=188881
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312882950144 size=262144 flags=188881
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312888324096 size=262144 flags=184880
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312889663488 size=262144 flags=184880
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312888324096 size=262144 flags=188881
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312889663488 size=262144 flags=188881
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312889663488 size=262144 flags=184880
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312895102976 size=262144 flags=184880
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312889663488 size=262144 flags=188881
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312895102976 size=262144 flags=188881

I expected ZFS to give up after some number of attempts and return an IO error to the caller, which would resolve the IO deadlock. Instead it retries indefinitely, freezing all operations.

RinCat added the Type: Defect label Apr 22, 2022
rincebrain (Contributor) commented:

You may find the zpool property failmode useful, as it specifies the behavior on errors due to catastrophic unrecoverable failure.

Not to say it can't break in other ways, and it might not help you in particular, but there is a setting explicitly for not waiting forever hoping it gets better.
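For example, checking and changing the property looks roughly like this (the pool name tank is a placeholder):

# show the current setting; the default is failmode=wait
zpool get failmode tank
# return EIO on new writes instead of blocking forever
zpool set failmode=continue tank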


RinCat commented Apr 22, 2022

@rincebrain Thanks for the advice!

failmode=continue Returns EIO to any new write I/O requests but allows reads to any of the remaining healthy devices. Any write requests that have yet to be committed to disk would be blocked.

This one seems to still leave existing write requests blocked? Is there any particular reason not to discard all pending write requests on an unrecoverable failure, to avoid the hang? In this case the kernel cannot even reboot properly, because it is waiting for the blocked IO to finish.


stale bot commented Apr 26, 2023

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label Apr 26, 2023

RinCat commented May 7, 2023

not stale

stale bot removed the Status: Stale label May 7, 2023
rincebrain added the Bot: Not Stale label May 7, 2023

thoro commented Sep 15, 2023

I'm actually getting these errors quite often - pretty sure there's a timing issue in the software I'm using - but basically what happens is:

zfs on multipathd on iscsi

Somehow multipathd can pull the device out from under ZFS before the pool is successfully exported (multipathd has the queue_if_no_path option set). Once the multipath device is gone, all zfs / zpool commands hang. This has happened in every version I have tested since (2.0.4+); the only solution so far has been a forced reset, which is quite annoying.

Interestingly, other pools continue to work fine, but commands hang as soon as you try to disconnect them or add new ones.

/proc/spl/kstat/zfs/ gives me a suspended state for the affected pool.

txgs shows the following:

599037   3066415402495469 C     0            0            0            0        0        5119892653   7654         168627       81092
599038   3066420522388122 S     0            0            0            0        0        5119910236   8085         25177        0
599039   3066425642298358 O     0            0            0            0        0        0            0            0            0

Edit: I do not know where the kernel module hangs, but all user applications hang in an ioctl on the /dev/zfs device.
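A rough way to confirm this state without running the (hanging) zpool commands, assuming a pool name of tank as a placeholder:

# per-pool state kstat; reports SUSPENDED while IO on the pool is blocked
cat /proc/spl/kstat/zfs/tank/state
# recent transaction groups and how far each one progressed
cat /proc/spl/kstat/zfs/tank/txgs
# list processes stuck in uninterruptible sleep (D state)
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'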
