Increase default zfs_multihost_fail_intervals and import_intervals #8495

Merged
merged 1 commit into master on Mar 13, 2019

Conversation

@ofaaland ofaaland (Contributor) commented Mar 12, 2019

Motivation and Context

By default, when multihost is enabled for a pool, the pool is
suspended when (zfs_multihost_fail_intervals*zfs_multihost_interval) ms
pass without a successful MMP write. This is the recommended
configuration.

The default value for zfs_multihost_fail_intervals has been 5, and the
default value for zfs_multihost_interval has been 1000, so pool
suspension occurred at 5 seconds.
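
For concreteness, the suspension window is simply the product of the two
tunables. Below is a minimal standalone sketch of that arithmetic (not the
in-tree mmp.c code; the locals only mirror the names of the module
parameters):

```c
#include <stdio.h>

int
main(void)
{
	unsigned int zfs_multihost_interval = 1000;	/* ms between MMP writes */
	unsigned int zfs_multihost_fail_intervals = 5;	/* old default */

	/*
	 * The pool is suspended once this many milliseconds pass without
	 * a successful MMP write.
	 */
	unsigned int suspend_after_ms =
	    zfs_multihost_fail_intervals * zfs_multihost_interval;

	printf("suspend after %u ms\n", suspend_after_ms);	/* prints 5000 */
	return (0);
}
```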

There have been multiple cases where a single misbehaving device in a
pool triggered a SCSI reset, and all I/O paused for 5-6 seconds. This
in turn caused MMP to suspend the pool.

In the cases observed, the rest of the devices were healthy and the
pool was otherwise performing I/O correctly. The reset was handled
correctly by ZFS, but by suspending the pool, MMP made replacing the
device more difficult and forced the host to be rebooted.

Description

Increase the default value of zfs_multihost_fail_intervals to 10, so
that MMP tolerates up to 10 seconds of failed MMP writes before
suspending the pool.

Increase the default value of zfs_multihost_import_intervals to 20, to
maintain the 2:1 safety factor. This results in a force import taking
approximately 20 seconds when MMP is enabled, with default values.
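
As a rough sketch with the same caveats as above (illustrative locals, not
the actual module code), the proposed defaults give a 10-second suspension
window while keeping the forced-import wait at roughly twice that:

```c
#include <stdio.h>

int
main(void)
{
	unsigned int zfs_multihost_interval = 1000;		/* ms, unchanged */
	unsigned int zfs_multihost_fail_intervals = 10;		/* raised from 5 */
	unsigned int zfs_multihost_import_intervals = 20;	/* raised to keep 2:1 */

	unsigned int suspend_after_ms =
	    zfs_multihost_fail_intervals * zfs_multihost_interval;
	unsigned int force_import_ms =
	    zfs_multihost_import_intervals * zfs_multihost_interval;

	/* 10000 ms to suspend; a forced import waits ~20000 ms: a 2:1 margin. */
	printf("suspend after %u ms, force import ~%u ms\n",
	    suspend_after_ms, force_import_ms);
	return (0);
}
```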

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
@ofaaland ofaaland added the Status: Code Review Needed (Ready for review and testing) label Mar 12, 2019
@codecov codecov bot commented Mar 12, 2019

Codecov Report

Merging #8495 into master will decrease coverage by 0.07%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #8495      +/-   ##
==========================================
- Coverage   78.57%    78.5%   -0.08%     
==========================================
  Files         380      380              
  Lines      116057   116057              
==========================================
- Hits        91194    91108      -86     
- Misses      24863    24949      +86
Flag      Coverage  Δ
#kernel   78.96%    <ø> (-0.07%) ⬇️
#user     67.1%     <ø> (-0.27%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b1b94e9...fe1da58.

@behlendorf behlendorf added the Status: Accepted (Ready to integrate: reviewed, tested) label and removed the Status: Code Review Needed (Ready for review and testing) label Mar 12, 2019
@behlendorf behlendorf added this to the 0.8.0 milestone Mar 12, 2019
@adilger adilger (Contributor) left a comment

Strangely, I thought I reviewed this patch yesterday afternoon, but I don't see any record of that happening.

I assume that this patch is a "safe" approach to resolving issue #7709 that is suitable for inclusion into 0.8.0 and later 0.7.x, and the recent larger patch in #7709 will mostly be targeted on 0.8.x?

@ofaaland ofaaland (Contributor, Author) commented Mar 13, 2019

@adilger
Thanks for the review. You wrote:

I assume that this patch is a "safe" approach to resolving issue #7709 that is suitable for inclusion into 0.8.0 and later 0.7.x

Yes.

and the recent larger patch in #7709 will mostly be targeted on 0.8.x?

You're correct that the recent larger patch is not intended for backport to 0.7.x. The patch referencing #7709 is #7842 and mostly addresses the issue of unnecessarily long import times after a failover when I/O leading up to the failure was delayed and ub_mmp_delay grew large.

Labels
Status: Accepted (Ready to integrate: reviewed, tested)

Projects
0.7.14 (To do)

Development
Successfully merging this pull request may close these issues: None yet

4 participants