ZED will not engage hot-spare and resilver upon drive state change to faulted #8967
I've looked at zfs/cmd/zed/agents/zfs_retire.c and I cannot identify why the resilver to hot-spare doesn't execute. Detailed zpool events: https://pastebin.com/13r4UDL6
I tried to reproduce this in [...]. However, I did see the spare kick in on the [...].
I was hoping there was a way, perhaps with zdb or by echoing a string to a module parameter handle in sysfs, to inject a message as a ZFS event, to diagnose what zed is doing and how it engages.
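For generating events without touching sysfs, `zinject` can inject synthetic I/O faults so that zed's diagnosis pipeline produces zevents on demand. A rough sketch (the pool name `tank` and device `sdX` are placeholders, not from this report):

```shell
# Inject I/O errors on one leaf vdev of a throwaway test pool so that
# zed sees ereports and (eventually) a fault/statechange, without any
# real hardware failure.
zinject -d sdX -e io -T all -f 100 tank     # fail 100% of I/O to sdX
dd if=/dev/urandom of=/tank/junk bs=1M count=64 oflag=sync
zpool events -v | tail -n 40                # inspect the generated ereports
zinject -c all                              # clear all injection handlers
```

This exercises the error path zed's retire agent watches, though note it produces checksum/IO-style faults rather than a device disappearing outright.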
I suspect what we really need is a "test spare activation after a drive has disappeared" test case. The current test cases just test faults via injecting IO/checksum errors. @AeonJJohnson mentioned that this is from a multipathed SAS array, which, if it's like the arrays I've seen, often have bad drives disappear and reappear. That may help us reproduce what @AeonJJohnson is seeing. @AeonJJohnson did you happen to see any SCSI/multipath errors in dmesg at the time of the disk failure?
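One way to simulate "drive disappeared" rather than "drive returns errors" is to delete the SCSI device from the kernel. A hypothetical reproduction sketch (pool name `tank`, disk `sdX`, and an AVAIL spare are assumptions, not from this report):

```shell
# Make the kernel drop the disk entirely, the way a dying multipathed
# drive often vanishes, instead of injecting IO/checksum errors.
echo 1 > /sys/block/sdX/device/delete   # device disappears from the system
zpool status tank                       # vdev should go UNAVAIL/REMOVED
zpool events -f                         # watch for statechange/removal events
# If zed behaves correctly, the spare should attach and a resilver begin.
```

Rescanning the SCSI host (or rebooting) brings the disk back, which also lets you test the disappear-then-reappear behavior these arrays exhibit.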
There were SCSI errors (hardware/media on writes)
And some multipath TUR checker errors as the drive started to take a dirt nap. Failures are rarely graceful.
Then correlate the above time stamps with the [...].
Here is a pastebin of the [...]. I would think that ZED would have seen what it needed and engaged the S1 hot-spare after the statechange at 00:02:19.107546574. ZED saw something it liked, as it turned on the drive slot fault LED at 00:02:19.
I've also recently experienced a failure to activate a spare on 0.7.13 under similar conditions (multipathd handling the drives). I can't offer zevents output because a recent zfs recv spilled over the events buffer.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Poke. This still appears to be a defect; others are reporting mixed results. We are seeing it at customer sites as well currently. Any thoughts on what/where to dig on this?
@AeonJJohnson we have made a couple of fixes in this area (4b5c9d9 0aacde2 3236664) over the last year or so; if you can reproduce this on demand, I think it'd be worth trying the 2.1-rc6 release. In particular, I think 0aacde2 could sort this out, since it updates the kernel modules to explicitly issue a removal event when they determine the vdev is gone. The ZED should already have been getting a udev notification for this, but if for some reason it got lost, the spare might not get kicked in (and there's no harm in sending it twice).
Another potentially related change is #11355. |
System information
ZFS config environment
Describe the problem you're observing
ZED will not engage hot spare upon drive failure. Failing drive gets faulted, pool goes degraded, zpool history shows state change, ZED sees failure and sets enclosure fault LED but hot spare is not engaged and resilver to hot spare does not start.
Describe how to reproduce the problem
Build a pool. Wait for drive to fail blocks. Profit.
Include any warning/errors/backtraces from the system logs
ZED sees the fault state change on the failing drive and turns on its chassis LED
Pool events show state change:
The pool has a dedicated hot spare that is available
Running zed -F -v indicates that zed can properly see the current degraded state of the pool
I am able to manually resilver to the hot-spare by running:
zpool replace mypool A0 S1
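For completeness, the full manual workaround around a non-engaging zed looks roughly like this, using the pool and vdev names from the report above (the detach step is the standard way to return a spare to service; exact steps depend on whether the faulted disk is later replaced):

```shell
# Manually attach hot-spare S1 in place of faulted drive A0.
zpool replace mypool A0 S1
zpool status mypool            # S1 shows INUSE while the resilver runs
# Later, after A0 has been physically replaced and resilvered:
zpool detach mypool S1         # S1 returns to the AVAIL spare list
```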
I have seen this multiple times and am unable to track down why ZED won't engage the hot-spare S1 and start the resilver. There are no logs indicating any errors or failures encountered in an attempt to engage the hot-spare drive.
Autoreplace is set to off, which from everything I know and have read is not a requirement for zed to engage and resilver to a hot-spare upon drive fault.
I have seen references to needing to specifically set a hot-spare action in zed.rc, but the 2014 references to this method do not appear in the 0.7.9 code tree, so that appears to be a red herring.