statechange event not always generated when VDEV state changes #9437
Comments
@tonyhutter can you take a look at this? Thanks

I have a fix for this. I'll generate a PR for it as soon as I am finished running the ZFS test suite on it.
When the "zpool online" command is used to bring a faulted or
offline drive back online, a resource.fs.zfs.statechange event
is generated. When the "zpool replace" command is used to bring
a faulted or offline drive back online, a statechange event is
not generated. Add the missing statechange event after
resilvering has finished. The new sequence of events looks like
this:
sysevent.fs.zfs.vdev_attach
sysevent.fs.zfs.resilver_start
sysevent.fs.zfs.history_event (scan setup)
sysevent.fs.zfs.history_event (scan done)
sysevent.fs.zfs.resilver_finish
sysevent.fs.zfs.config_sync
+ resource.fs.zfs.statechange
sysevent.fs.zfs.vdev_remove
sysevent.fs.zfs.history_event (vdev attach)
sysevent.fs.zfs.config_sync
sysevent.fs.zfs.history_event (detach)
Signed-off-by: Christopher Voltz <christopher.voltz@hpe.com>
External-issue: LU-12836
Closes openzfs#9437
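With the fix, a `resource.fs.zfs.statechange` event appears in the stream after `resilver_finish`. A minimal sketch of filtering an event-class log for it; the one-class-per-line log format here is illustrative (mirroring the sequence above), not actual `zpool events` output, and `statechange_events` is a hypothetical helper:

```shell
#!/bin/sh
# Filter an event log (one event class per line, illustrative format)
# down to statechange events.
statechange_events() {
    grep '^resource\.fs\.zfs\.statechange' "$@"
}

# Example: feed the fixed event sequence from the commit message.
statechange_events <<'EOF'
sysevent.fs.zfs.resilver_finish
sysevent.fs.zfs.config_sync
resource.fs.zfs.statechange
sysevent.fs.zfs.vdev_remove
EOF
```

Running this prints the single `resource.fs.zfs.statechange` line, confirming the transition was reported.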
When the "zpool online" command is used bring a faulted or
offline drive back online, a resource.fs.zfs.statechange event
is generated. When the "zpool replace" command is used to bring
a faulted or offline drive back online, a statechange event is
not generated. Add the missing statechange event after
resilvering has finished. The new sequence of events looks like
this:
sysevent.fs.zfs.vdev_attach
sysevent.fs.zfs.resilver_start
sysevent.fs.zfs.history_event (scan setup)
sysevent.fs.zfs.history_event (scan done)
sysevent.fs.zfs.resilver_finish
sysevent.fs.zfs.config_sync
+ resource.fs.zfs.statechange
sysevent.fs.zfs.vdev_remove
sysevent.fs.zfs.history_event (vdev attach)
sysevent.fs.zfs.config_sync
sysevent.fs.zfs.history_event (detach)
Signed-off-by: Christopher Voltz <christopher.voltz@hpe.com>
External-issue: LU-12836
Closes openzfs#9437
When the "zpool online" command is used bring a faulted or
offline drive back online, a resource.fs.zfs.statechange event
is generated. When the "zpool replace" command is used to bring
a faulted or offline drive back online, a statechange event is
not generated. Add the missing statechange event after
resilvering has finished. The new sequence of events looks like
this:
sysevent.fs.zfs.vdev_attach
sysevent.fs.zfs.resilver_start
sysevent.fs.zfs.history_event (scan setup)
sysevent.fs.zfs.history_event (scan done)
sysevent.fs.zfs.resilver_finish
sysevent.fs.zfs.config_sync
+ resource.fs.zfs.statechange
sysevent.fs.zfs.vdev_remove
sysevent.fs.zfs.history_event (vdev attach)
sysevent.fs.zfs.config_sync
sysevent.fs.zfs.history_event (detach)
Signed-off-by: Christopher Voltz <christopher.voltz@hpe.com>
External-issue: LU-12836
Closes openzfs#9437
When the "zpool online" command is used bring a faulted or
offline drive back online, a resource.fs.zfs.statechange event
is generated. When the "zpool replace" command is used to bring
a faulted or offline drive back online, a statechange event is
not generated. Add the missing statechange event after
resilvering has finished. The new sequence of events looks like
this:
sysevent.fs.zfs.vdev_attach
sysevent.fs.zfs.resilver_start
sysevent.fs.zfs.history_event (scan setup)
sysevent.fs.zfs.history_event (scan done)
sysevent.fs.zfs.resilver_finish
sysevent.fs.zfs.config_sync
+ resource.fs.zfs.statechange
sysevent.fs.zfs.vdev_remove
sysevent.fs.zfs.history_event (vdev attach)
sysevent.fs.zfs.config_sync
sysevent.fs.zfs.history_event (detach)
Signed-off-by: Christopher Voltz <christopher.voltz@hpe.com>
External-issue: LU-12836
Closes openzfs#9437
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

I'm reopening this since this hasn't yet been addressed to my knowledge.

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Problem
When a drive in a pool is FAULTED (e.g., due to I/O errors) or the drive goes OFFLINE (e.g., the `zpool offline` command was run), the `resource.fs.zfs.statechange` event is generated with the `vdev_state` set appropriately. If the drive is brought online (e.g., the `zpool online` command was run), the `resource.fs.zfs.statechange` event is generated with the `vdev_state` set to ONLINE. However, if the drive is replaced using the `zpool replace` command, the `resource.fs.zfs.statechange` event is not generated.

Lustre 2.11 added the ZEDLET `statechange-lustre.sh`, which changes the `obdfilter.*.degraded` property for a target when the pool's state changes. It sets the `degraded` property if the pool is DEGRADED and resets the property if the pool is ONLINE. Since ZFS is not always generating the state change event, sometimes the target's `degraded` property is left set even when the pool is ONLINE, which reduces performance of the Lustre filesystem.

See https://jira.whamcloud.com/browse/LU-12836 for more information (including output from `zpool events -v`).

Steps to reproduce
1. Create the pool:

   ```
   pool=ost04
   zpool create $pool \
       -o ashift=12 \
       -o cachefile=none \
       -O canmount=off \
       -O recordsize=1024K \
       -f \
       raidz2 /dev/mapper/d8000_sep500C0FF03C1AC73E_bay0{41..50}-0
   ```

2. Verify the pool is ONLINE:

   ```
   zpool list -H -o name,health $pool
   ```

3. Wipe the spare drive:

   ```
   wipefs --all --force /dev/mapper/$spare_drive
   ```

4. Fault a drive so the pool goes DEGRADED, then replace it with the spare and wait for resilvering to finish.

5. Verify the pool is back ONLINE:

   ```
   zpool list -H -o name,health $pool
   ```

6. Check `zpool events -v`: the last statechange event shows the drive going OFFLINE, instead of also having a state change event for the pool going ONLINE. The output should have included a `resource.fs.zfs.statechange` event for the ONLINE transition, but it did not.
zpool replace $pool $bad_drive $spare_drivecommand to
zpool online $pool $bad_drivewill result in the
resource.fs.zfs.statechangeevent being generated when the pool goesONLINE.The Lustre issue includes the test-degraded-drive script which can be used for testing.
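The ZEDLET behavior described above (set `obdfilter.*.degraded` on DEGRADED, clear it on ONLINE) can be sketched as a small decision function. This is an illustrative sketch, not the actual `statechange-lustre.sh`; the function name and the `ZEVENT_POOL_HEALTH` variable are assumptions for illustration:

```shell
#!/bin/sh
# Sketch of statechange-lustre.sh-style logic: map a pool health state
# to the value the obdfilter.*.degraded property should be given.
# Illustrative only; the real ZEDLET reads ZED event environment vars.
degraded_value_for_state() {
    case "$1" in
        DEGRADED) echo 1 ;;   # pool degraded -> set the property
        ONLINE)   echo 0 ;;   # pool healthy  -> clear the property
        *)        echo 1 ;;   # be conservative for FAULTED/OFFLINE/etc.
    esac
}

# A real ZEDLET would then run something along the lines of:
#   lctl set_param obdfilter.*.degraded=$(degraded_value_for_state "$ZEVENT_POOL_HEALTH")
# (the variable name here is hypothetical).
```

The bug in this issue means the ZEDLET is simply never invoked for the ONLINE transition after `zpool replace`, so the `0` branch never runs and the property stays set.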
While we are looking at this specific scenario, we should investigate whether there are any other scenarios where the pool could change to ONLINE but not generate a corresponding state change event.
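One way to audit such scenarios is to scan a captured event log for an ONLINE-side statechange after resilvering completes. A sketch under stated assumptions: the one-class-per-line log format is illustrative (not actual `zpool events` output) and `statechange_after_resilver` is a hypothetical helper:

```shell
#!/bin/sh
# Sketch: given a chronological event-class log (one class per line),
# succeed if a resource.fs.zfs.statechange event appears after the
# last resilver_finish -- the event this issue reports as missing
# when `zpool replace` is used.
statechange_after_resilver() {
    awk '
        /resilver_finish/ { seen_finish = 1; seen_change = 0 }
        /resource\.fs\.zfs\.statechange/ { if (seen_finish) seen_change = 1 }
        END { exit !(seen_finish && seen_change) }
    ' "$@"
}
```

Feeding it the pre-fix sequence (no statechange after `resilver_finish`) returns nonzero; the fixed sequence returns zero.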