This repository has been archived by the owner on Oct 12, 2020. It is now read-only.

Commit

Add udev-md-raid-safe-timeouts.rules
These udev rules attempt to set a safe kernel controller
timeout for disks containing RAID level 1 or higher
partitions for commodity disks which do not have SCTERC
capability, or do have it but it is disabled.

No attempt is made to change the SCTERC settings on devices
which support it.

This attempts to mitigate the problem described here:

    https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
    http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-timeouts/

where the kernel controller may time out on a read from a
disk after the default timeout of 30 seconds and consequently
cause mdraid to regard the disk as dead and eject it from the
RAID array.

The mitigation is to set the timeout to 180 seconds for disks
which contain a RAID level 1 or higher partition.

Signed-off-by: Jonathan G. Underwood <jonathan.underwood@gmail.com>
Signed-off-by: Jes Sorensen <jsorensen@fb.com>
jonathanunderwood authored and Jes Sorensen committed Feb 1, 2018
1 parent 1db0376 commit b96c193
Showing 1 changed file with 61 additions and 0 deletions.
61 changes: 61 additions & 0 deletions udev-md-raid-safe-timeouts.rules
@@ -0,0 +1,61 @@
# Copyright (C) 2017 by Jonathan G. Underwood
# This file is part of mdraid-safe-timeouts.
#
# mdraid-safe-timeouts is free software: you can redistribute it
# and/or modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation, either version 3 of
# the License, or (at your option) any later version.
#
# mdraid-safe-timeouts is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with mdraid-safe-timeouts. If not, see
# <http://www.gnu.org/licenses/>.

# For block devices carrying Linux RAID (mdadm) signatures, these rules
# attempt to set safe timeouts on the underlying drives.
# See udev(8) for syntax.

# Don't process any events if anaconda is running as anaconda brings up
# raid devices manually
ENV{ANACONDA}=="?*", GOTO="md_timeouts_end"

SUBSYSTEM!="block|machinecheck", GOTO="md_timeouts_end"

# "noiswmd" on kernel command line stops mdadm from handling
# "isw" (aka IMSM - Intel RAID).
# "nodmraid" on kernel command line stops mdadm from handling
# "isw" or "ddf".
IMPORT{cmdline}="nodmraid"
ENV{nodmraid}=="?*", GOTO="md_timeouts_end"
IMPORT{cmdline}="noiswmd"
ENV{noiswmd}=="?*", GOTO="md_timeouts_end"

# Set controller timeout for parent disk of each partition if the
# partition is a mdraid partition of higher than raid 0, and the disk
# doesn't have scterc turned on (i.e. if it's disabled or the disk
# doesn't support it). We determine if the disk has SCTERC turned on
# by examining the output of smartctl and seeing if it contains the
# word "seconds". If the word "seconds" is found we take this to imply
# SCTERC is turned on, and take no action. Otherwise we set the drive
# controller timeout to 180 seconds. It would be better to check the
# exit status code of smartctl rather than grepping for "seconds", but
# it's not clear what that will be in the three cases (supported and
# turned on, supported but disabled, not supported).

ENV{DEVTYPE}!="partition", GOTO="md_timeouts_end"

IMPORT{program}="/sbin/mdadm --examine --export $devnode"

ACTION=="add|change", \
ENV{ID_FS_TYPE}=="linux_raid_member", \
ENV{MD_LEVEL}=="raid[1-9]*", \
TEST=="/sys/block/$parent/device/timeout", \
TEST=="/usr/sbin/smartctl", \
PROGRAM!="/usr/bin/sh -c '/usr/sbin/smartctl -l scterc /dev/$parent | grep -q seconds && exit 0 || exit 1'", \
RUN+="/usr/bin/sh -c '/usr/bin/echo 180 > /sys/block/$parent/device/timeout && /usr/bin/logger timeout for /dev/$parent set to 180 secs'"

LABEL="md_timeouts_end"
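
The decision made by the final rule can be sketched outside udev as a small shell script. This is an illustration only and not part of the commit: the function names `scterc_enabled` and `choose_timeout` are hypothetical, and the sketch simply mirrors the rule's heuristic of treating the word "seconds" in the output of `smartctl -l scterc` as meaning SCT ERC is turned on.

```shell
#!/bin/sh
# Sketch only: mirrors the heuristic used by the udev rule above.
# scterc_enabled and choose_timeout are hypothetical helper names.

# Succeeds if the given `smartctl -l scterc` output contains the word
# "seconds", which the rule takes to mean SCT ERC is turned on.
scterc_enabled() {
    printf '%s\n' "$1" | grep -q seconds
}

# Prints the timeout (in seconds) that would be in effect afterwards:
# the current value if SCT ERC looks enabled, otherwise 180.
choose_timeout() {
    smart_output=$1
    current=$2
    if scterc_enabled "$smart_output"; then
        printf '%s\n' "$current"
    else
        printf '180\n'
    fi
}
```

On a live system the two arguments would come from `smartctl -l scterc /dev/sdX` and `/sys/block/sdX/device/timeout`, where `sdX` is a placeholder for the parent disk. Keeping the check as a pure function of the smartctl output makes the heuristic easy to exercise without touching real hardware.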

2 comments on commit b96c193

@arekm commented on b96c193 Feb 6, 2019


This is not wired up in the Makefile's install-udev target. Any reason for that?

Also, wouldn't it be better to have rules which enable SCT ERC, via smartctl -l scterc,70,70, too?

@jonathanunderwood (Contributor, Author) commented on b96c193 Feb 6, 2019


Note that this repo is not where development happens these days - the mdadm list is the best place for discussion.

As to changing scterc settings: I took the approach that mdadm wasn't in the business of changing hardware firmware settings, so didn't add that into the patch I pushed upstream. However, I have experimented with that here:

https://github.com/jonathanunderwood/mdraid-safe-timeouts

I'd consider pushing more of that to mdadm, if the devs felt that messing with scterc settings was something mdadm-shipped code should be doing.
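
As an illustration of the alternative raised in this thread (enabling SCT ERC instead of lengthening the kernel timeout), the following sketch builds the smartctl invocation mentioned above. It is not code from either commenter or from mdadm; the helper name `scterc_command` is hypothetical, and the sketch only prints the command so that nothing is changed on a drive. The two numbers after `scterc,` are the read and write limits in tenths of a second, so 70 corresponds to 7.0 seconds.

```shell
#!/bin/sh
# Sketch only: scterc_command is a hypothetical helper name.
# Prints (does not run) the smartctl command that would request an
# SCT ERC read/write limit, given in tenths of a second (default 70).
scterc_command() {
    dev=$1
    deciseconds=${2:-70}
    printf 'smartctl -l scterc,%s,%s %s\n' "$deciseconds" "$deciseconds" "$dev"
}
```

Printing the command first, instead of executing it, leaves the decision to change drive settings with the administrator, which is in keeping with the reservation expressed above about mdadm altering such settings.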
