Increase default zfs_multihost_fail_intervals and import_intervals #8495
By default, when multihost is enabled for a pool, the pool is suspended if (zfs_multihost_fail_intervals*zfs_multihost_interval) ms pass without a successful MMP write. This is the recommended configuration.
The default value for zfs_multihost_fail_intervals has been 5, and the default value for zfs_multihost_interval has been 1000, so pool suspension occurred at 5 seconds.
There have been multiple cases where a single misbehaving device in a pool triggered a SCSI reset, and all I/O paused for 5-6 seconds. This in turn caused MMP to suspend the pool.
In the cases observed, the rest of the devices were healthy and the pool was otherwise correctly performing I/O. The reset was handled correctly by ZFS, and by suspending the pool MMP made replacing the device more difficult as well as forcing the host to be rebooted.
Increase the default value of zfs_multihost_fail_intervals to 10, so that MMP tolerates up to 10 seconds of failed MMP writes before suspending the pool.
Increase the default value of zfs_multihost_import_intervals to 20, to maintain the 2:1 safety factor. This results in a force import taking approximately 20 seconds when MMP is enabled, with default values.
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Codecov Report
@@            Coverage Diff             @@
##           master    #8495      +/-   ##
==========================================
- Coverage   78.57%    78.5%   -0.08%
==========================================
  Files         380      380
  Lines      116057   116057
==========================================
- Hits        91194    91108      -86
- Misses      24863    24949      +86
Strangely, I thought I reviewed this patch yesterday afternoon, but I don't see any record of that happening.
I assume this patch is a "safe" approach to resolving issue #7709, suitable for inclusion in 0.8.0 and later 0.7.x releases, and that the recent larger patch in #7709 will mostly be targeted at 0.8.x?
@adilger
Yes.
You're correct that the recent larger patch is not intended for backport to 0.7.x. The patch referencing #7709 is #7842 and mostly addresses the issue of unnecessarily long import times after a failover when I/O leading up to the failure was delayed and ub_mmp_delay grew large.
Pull request was openzfs/zfs#8495, commit openzfs/zfs@db2af93.
Motivation and Context
By default, when multihost is enabled for a pool, the pool is
suspended when (zfs_multihost_fail_intervals*zfs_multihost_interval) ms
pass without a successful MMP write. This is the recommended
configuration.
The default value for zfs_multihost_fail_intervals has been 5, and the
default value for zfs_multihost_interval has been 1000, so pool
suspension occurred at 5 seconds.
There have been multiple cases where a single misbehaving device in a
pool triggered a SCSI reset, and all I/O paused for 5-6 seconds. This
in turn caused MMP to suspend the pool.
In the cases observed, the rest of the devices were healthy and the
pool was otherwise correctly performing I/O. The reset was handled
correctly by ZFS, and by suspending the pool MMP made replacing the
device more difficult as well as forcing the host to be rebooted.
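To make the arithmetic concrete, here is a minimal Python sketch of the suspension threshold described above. It only illustrates the product of the two tunables and is not the ZFS implementation. The function names and the 6000 ms stall (the upper end of the 5-6 second pause reported above) are assumptions chosen for the example.

```python
def suspend_timeout_ms(fail_intervals: int, interval_ms: int) -> int:
    """Time without a successful MMP write before the pool is suspended.

    Mirrors the product (zfs_multihost_fail_intervals * zfs_multihost_interval)
    from the description; the function itself is hypothetical.
    """
    return fail_intervals * interval_ms


def survives_stall(stall_ms: int, fail_intervals: int, interval_ms: int = 1000) -> bool:
    """True if an I/O stall of stall_ms ends before the suspension threshold."""
    return stall_ms < suspend_timeout_ms(fail_intervals, interval_ms)


if __name__ == "__main__":
    stall_ms = 6000  # assumed stall length: upper end of the reported 5-6 s pause
    for fail_intervals in (5, 10):
        timeout = suspend_timeout_ms(fail_intervals, 1000)
        outcome = "keeps running" if survives_stall(stall_ms, fail_intervals) else "is suspended"
        print(f"fail_intervals={fail_intervals}: threshold {timeout} ms, pool {outcome}")
```

Under these assumptions, a 6 second stall exceeds the 5000 ms threshold given by zfs_multihost_fail_intervals=5, but stays within a 10000 ms threshold.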
Description
Increase the default value of zfs_multihost_fail_intervals to 10, so
that MMP tolerates up to 10 seconds of failed MMP writes before
suspending the pool.
Increase the default value of zfs_multihost_import_intervals to 20, to
maintain the 2:1 safety factor. This results in a force import taking
approximately 20 seconds when MMP is enabled, with default values.
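As a rough check on these numbers, the sketch below computes the nominal import wait and the ratio between the two interval counts. The helper names are hypothetical, and the calculation is only the plain product of the tunables. The real import wait in ZFS may include other terms, which is consistent with the "approximately" above.

```python
def nominal_import_wait_ms(import_intervals: int, interval_ms: int) -> int:
    """Nominal wait before a force import proceeds (hypothetical helper).

    Plain product of zfs_multihost_import_intervals and zfs_multihost_interval;
    any additional terms in the real implementation are not modeled here.
    """
    return import_intervals * interval_ms


def safety_factor(import_intervals: int, fail_intervals: int) -> float:
    """Ratio of the import wait to the suspension threshold, in intervals."""
    return import_intervals / fail_intervals


if __name__ == "__main__":
    # New defaults proposed in this change.
    fail_intervals, import_intervals, interval_ms = 10, 20, 1000
    print("import wait (ms):", nominal_import_wait_ms(import_intervals, interval_ms))
    print("safety factor:", safety_factor(import_intervals, fail_intervals))
```

Keeping the ratio at 2 means the importing host waits about twice as long as the active host would take to suspend the pool after losing its MMP writes.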