ZFS infinite retry after vdev IO error #13362

Open
RinCat opened this issue Apr 22, 2022 · 5 comments
Labels
Bot: Not Stale (Override for the stale bot), Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments


RinCat commented Apr 22, 2022

System information

Type                  Version/Name
Distribution Name     Gentoo
Distribution Version  default/linux/amd64/17.1
Kernel Version        5.15.32
Architecture          x86_64
OpenZFS Version       2.1.4

Describe the problem you're observing

ZFS retries infinitely after a vdev IO error, which leaves every operation/process touching the damaged pool stuck in the D state, including zpool itself.
As a result, it is impossible to stop the affected processes, or to unmount, export, or stop the pool.
Affected systems have no option but a hard reboot.

Describe how to reproduce the problem

Not sure how to simulate IO errors, but the pool is running ZFS on LUKS.
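(For reference, one way to provoke IO errors on a scratch/test pool may be the zinject fault-injection utility that ships with OpenZFS; the pool and vdev names below are placeholders:)

# inject EIO on all reads/writes hitting one vdev of a test pool
zinject -d /dev/disk/by-id/XXXXXXXXX -e io -T all testpool
# generate some IO against the pool, then clear the injected faults
zinject -c all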

Include any warning/errors/backtraces from the system logs

Millions of lines like the following:

kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312882950144 size=262144 flags=184880
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312888324096 size=262144 flags=184880
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312888324096 size=262144 flags=188881
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312882950144 size=262144 flags=188881
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312888324096 size=262144 flags=184880
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312889663488 size=262144 flags=184880
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312888324096 size=262144 flags=188881
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312889663488 size=262144 flags=188881
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312889663488 size=262144 flags=184880
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312895102976 size=262144 flags=184880
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312889663488 size=262144 flags=188881
kernel: zio pool=[XXX] vdev=/dev/disk/by-id/XXXXXXXXX error=5 type=2 offset=312895102976 size=262144 flags=188881

I expected ZFS to give up after some number of attempts and return an IO error to the caller, which would resolve the IO deadlock. Instead it retries indefinitely, freezing all operations.

RinCat added the Type: Defect label Apr 22, 2022
rincebrain (Contributor) commented:

You may find the zpool property failmode useful, as it specifies the behavior on errors due to catastrophic unrecoverable failure.

Not to say it can't break in other ways, and it might not help you in particular, but there is a setting explicitly for not waiting forever hoping it gets better.
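For example, checking and changing the property looks roughly like this (the pool name tank is a placeholder):

# show the current setting; the default is failmode=wait
zpool get failmode tank
# return EIO on new writes instead of blocking forever
zpool set failmode=continue tank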


RinCat commented Apr 22, 2022

@rincebrain Thanks for the advice!

failmode=continue Returns EIO to any new write I/O requests but allows reads to any of the remaining healthy devices. Any write requests that have yet to be committed to disk would be blocked.

This one seems to still leave existing write requests blocked? Is there any particular reason not to discard all pending write requests on an unrecoverable failure, to avoid the hang? In this case the kernel cannot even reboot properly, because it is waiting for the blocked IO to finish.


stale bot commented Apr 26, 2023

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label Apr 26, 2023

RinCat commented May 7, 2023

not stale

stale bot removed the Status: Stale label May 7, 2023
rincebrain added the Bot: Not Stale label May 7, 2023

thoro commented Sep 15, 2023

I'm actually getting these errors quite often - pretty sure there's a timing issue in the software I'm using - but basically what happens is:

zfs on multipathd on iscsi

Somehow multipathd can pull the device out from under ZFS before the pool is successfully exported (multipathd has the queue_if_no_path option set). Once the multipath device is gone, all zfs / zpool commands hang. This has happened in every version I have tested since (2.0.4+); the only solution so far has been a forced reset, which is quite annoying.

Interestingly, other pools continue to work fine, but commands hang as soon as you try to disconnect them or add new ones.

/proc/spl/kstat/zfs/ gives me a suspended state for the affected pool.

txgs shows the following:

599037   3066415402495469 C     0            0            0            0        0        5119892653   7654         168627       81092
599038   3066420522388122 S     0            0            0            0        0        5119910236   8085         25177        0
599039   3066425642298358 O     0            0            0            0        0        0            0            0            0

Edit: I do not know where the kernel module hangs, but all user applications hang in an ioctl on the /dev/zfs device.
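A rough way to confirm this state without running the (hanging) zpool commands, assuming a pool name of tank as a placeholder:

# per-pool state kstat; reports SUSPENDED while IO on the pool is blocked
cat /proc/spl/kstat/zfs/tank/state
# recent transaction groups and how far each one progressed
cat /proc/spl/kstat/zfs/tank/txgs
# list processes stuck in uninterruptible sleep (D state)
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'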
