Set fault beacon on drive failure #2375

Closed
behlendorf opened this issue Jun 9, 2014 · 11 comments
Labels
Component: ZED ZFS Event Daemon Type: Feature Feature request or new feature

@behlendorf
Contributor

For JBOD-style configurations it's desirable to set the device's fault beacon on drive failures. This can be done either through the sysfs enclosure interface or through the sg_ses utility. The relevant zed scripts (cmd/zed/zed.d/io-spare.sh) should be updated to take advantage of this functionality. The tricky bit is mapping the device name to an enclosure and slot location.

# Set the fault beacon through sysfs
$ echo 1 >/sys/class/enclosure/6:0:9:0/000/fault

# Set, then clear, the ident beacon with sg_ses (use fault in place of ident for the fault LED)
$ sg_ses --dev-slot-num=0 --set=ident /dev/sg3
$ sg_ses --dev-slot-num=0 --clear=ident /dev/sg3
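
One possible approach to the device-to-slot mapping is sketched below. It assumes the ses kernel module is loaded and that each slot appears as /sys/class/enclosure/<enclosure>/<slot>/ with a device/block/<name> entry for the attached disk. The function names and the SYSFS_ROOT override are illustrative only and are not part of zed.

```shell
#!/bin/bash
# Sketch only: map a block device name (e.g. "sda") to its enclosure slot
# directory in sysfs. SYSFS_ROOT is a hypothetical override used for testing.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Print the slot directory holding the given disk, or return 1 if none does.
find_enclosure_slot() {
        local name=$1 path
        for path in "${SYSFS_ROOT}"/class/enclosure/*/*/device/block/"${name}"; do
                if [ -e "${path}" ]; then
                        # Strip the trailing /device/block/<name> component.
                        echo "${path%/device/block/*}"
                        return 0
                fi
        done
        return 1
}

# Turn the fault beacon for a disk on (1) or off (0).
set_fault_beacon() {
        local name=$1 value=$2 slotdir
        slotdir=$(find_enclosure_slot "${name}") || {
                echo "${name}: enclosure slot not found" 1>&2
                return 1
        }
        echo "${value}" > "${slotdir}/fault"
}
```

With these defined, `set_fault_beacon sda 1` would light the beacon and `set_fault_beacon sda 0` would clear it. Multipath devices are not handled here.
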
@FransUrbo
Contributor

udevadm info -q all -p /sys/block/sda can be used to get information about the device, including /sys/device/.... paths. I don't know whether it will report the enclosure/slot, though.

Have a look at https://raw.githubusercontent.com/FransUrbo/scripts/master/GetDiskInfo.sh (it's huge and kludgy and I'm trying to rewrite it in Perl, which would make it a little cleaner, but ... :). I do a lot of stuff like that...

I'd love to help (should be able to whip something up quite quickly - I'm bored :), but I don't have anything like an enclosure.

@behlendorf
Contributor Author

@FransUrbo That looks like a handy script!

I believe @nedbass figured out how to map the device to an enclosure and slot. Ned, could you add the procedure to this issue so we don't lose track of it? Then we can work it into the scripts where appropriate.

@chrisrd
Contributor

chrisrd commented Jun 11, 2014

Here's something we use to blink our lights, including the mapping of device to enclosure and slot:

#!/bin/bash
#
# Usage: disk-blink [--off] /dev/sd???
#
# ACHTUNG!
# ALLES TURISTEN UND NONTEKNISCHEN LOOKENPEEPERS!
# DAS KOMPUTERMASCHINE IST NICHT FÜR DER GEFINGERPOKEN UND MITTENGRABEN!
# ODERWISE IST EASY TO SCHNAPPEN DER SPRINGENWERK, BLOWENFUSEN UND POPPENCORKEN
# MIT SPITZENSPARKEN. IST NICHT FÜR GEWERKEN BEI DUMMKOPFEN. DER RUBBERNECKEN
# SIGHTSEEREN KEEPEN DAS COTTONPICKEN HÄNDER IN DAS POCKETS MUSS. ZO RELAXEN
# UND WATSCHEN DER BLINKENLICHTEN.
#
function usage
{
        cat <<END

Usage: $0 [--off] /dev/sd??

END
        exit 1
}

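set -e -u

action=--set=locate

# Show usage when called without arguments (otherwise "set -u" aborts on $1)
[ $# -ge 1 ] || usage

dev=$1
if [ "${dev}" = --off ]; then
        [ $# -ge 2 ] || usage
        action=--clear=locate
        dev=$2
fi
[ -b "${dev}" ] || { echo 1>&2 "${dev}: not a block device"; exit 1; }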

sasaddr=$(
        lsscsi -tg |
        sed -rn 's/.*sas:(0x[[:xdigit:]]+).*'"${dev//\//\\/}"'[[:space:]].*/\1/ p'
)
[ "${sasaddr}" ] || { echo 1>&2 "${dev}: SAS address not found"; exit 1; }

#
# Scan all the enclosures for our SAS address
#
# Initialise slot so "set -u" does not abort below when no enclosure is listed
slot=
for encldev in $(lsscsi -tg | awk '$2 == "enclosu" { print $5 }')
do
        #
        # Note: we discard errors from sg_ses as, at version 1.64 20120118,
        # it prints an error on some enclosures like:
        #
        #  $ sg_ses -j /dev/sg45 > /dev/null
        #  join_work: oi=6, ei=255 (broken_ei=0) not in join_arr
        #
        # See Also: http://thread.gmane.org/gmane.linux.scsi/81514
        #
        slot=$(
                sg_ses -j "${encldev}" 2> /dev/null |
                grep -E "^Slot |^[[:space:]]+SAS address:" |
                grep -B1 "${sasaddr}" |
                awk '/^Slot/ { print $2 }'
        )
        [ "${slot}" ] && break
done
[ "${slot}" ] || { echo 1>&2 "${dev}: enclosure/slot not found"; exit 1; }

#
# Light 'em up
#
sg_ses -D "Slot ${slot}" "${action}" "${encldev}"

exit 0
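
As a possible variation, the three-stage filter in the enclosure loop could be folded into a single awk pass. The sketch below assumes the `sg_ses -j` output has the shape the script's own patterns imply: a line starting with "Slot NN", later followed by an indented "SAS address:" line whose last field is the address. The function name slot_for_sas_address is hypothetical.

```shell
#!/bin/bash
# Sketch only: read "sg_ses -j" style text on stdin and print the slot number
# whose "SAS address:" line ends with the address given as $1.
slot_for_sas_address() {
        awk -v addr="$1" '
                # Remember the most recent slot header.
                /^Slot / { slot = $2 }
                # Compare as strings so hex addresses are never treated as numbers.
                /^[[:space:]]+SAS address:/ && ($NF "") == (addr "") {
                        print slot
                        exit
                }
        '
}
```

Inside the loop this would be called as `slot=$(sg_ses -j "${encldev}" 2> /dev/null | slot_for_sas_address "${sasaddr}")`. Unlike the `grep -B1` version, it remembers the last slot header rather than only the line directly above the match.
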

@behlendorf
Contributor Author

@chrisrd Nice! Thanks for posting this.

@dasjoe
Contributor

dasjoe commented Jul 31, 2014

I've got some enclosures with (LSI) SAS expanders, which are visible in /sys/class/enclosure/.
This makes Slot 01's fault LED light up:
echo 1 > /sys/class/enclosure/1\:0\:21\:0/Slot\ 01/fault
"locate" makes it blink:
echo 1 > /sys/class/enclosure/1\:0\:21\:0/Slot\ 01/locate
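
Since slot names such as "Slot 01" contain spaces, quoting the path is an alternative to backslash escapes. A minimal sketch of a helper along those lines follows. The function name set_slot_led and the ENCLOSURE_ROOT override are hypothetical, and it only assumes the fault and locate attributes shown above.

```shell
#!/bin/bash
# Sketch only: set the "fault" or "locate" LED of an enclosure slot via sysfs.
# ENCLOSURE_ROOT is a hypothetical override so the function can be dry-run.
ENCLOSURE_ROOT=${ENCLOSURE_ROOT:-/sys/class/enclosure}

set_slot_led() {
        local encl=$1 slot=$2 led=$3 value=$4
        # Only the two attributes mentioned in this thread are accepted.
        case "${led}" in
                fault|locate) ;;
                *) echo "${led}: expected fault or locate" 1>&2; return 1 ;;
        esac
        local attr="${ENCLOSURE_ROOT}/${encl}/${slot}/${led}"
        [ -w "${attr}" ] || { echo "${attr}: not writable" 1>&2; return 1; }
        echo "${value}" > "${attr}"
}
```

For example, `set_slot_led 1:0:21:0 "Slot 01" fault 1` would turn the fault LED on, and passing 0 would turn it off again.
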

behlendorf modified the milestones: 0.6.5, 0.6.4 Feb 6, 2015
behlendorf modified the milestones: 0.7.0, 0.6.5 Jul 16, 2015
@cvoltz
Contributor

cvoltz commented Jul 25, 2016

I'm working on implementing this feature. identify_failed_drive.feature.txt provides a detailed feature description.

@rlaager
Member

rlaager commented Jul 25, 2016

@cvoltz Your description looks solid, except that I disagree that the UID light should be on. I would think only the fault light should be controlled. What is the advantage of adjusting both lights in lock-step? I think the UID should be left for administrator use.

@joehandzik

@rlaager The UID could certainly be dropped, but there is potential value in a large configuration. With the UID + disk FAULT LED, customers know which chassis AND which disk a bit more easily.

@tonyhutter
Contributor

@cvoltz I'm working on pretty much the same thing at LLNL. Have you had any luck with using libstoragemgmt to blink the LEDs? Have you tried it for multipath devices as well?

@cvoltz
Contributor

cvoltz commented Jul 28, 2016

I updated the feature description to remove the references to the UID lights on the drive.

@behlendorf
Contributor Author

For anyone following this issue, you may want to check out the latest master source, which now has improved infrastructure for generically managing a device's SES LEDs, 1bbd877. The zedlet's environment will now contain a ZEVENT_VDEV_ENC_SYSFS_PATH variable when the SES sysfs path can be determined. This can be used to easily control the LEDs without any additional dependencies beyond the ses.ko kernel module. See the statechange-led.sh zedlet as an example.

This infrastructure is still being worked on but any feedback or testing on a wider range of configurations and hardware would be welcome.
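
For illustration, a zedlet using that variable might look roughly like the sketch below. This is not the shipped statechange-led.sh. The function names are hypothetical, the set of states treated as failed is an assumption, and so is the ZEVENT_VDEV_STATE_STR variable named in the usage note.

```shell
#!/bin/bash
# Sketch only, not the shipped statechange-led.sh zedlet.
# Decide the fault LED value for a vdev state: 1 = on, 0 = off, empty = leave.
# Which states count as failed is an assumption made for this example.
led_value_for_state() {
        case "$1" in
                FAULTED|DEGRADED|UNAVAIL) echo 1 ;;
                ONLINE)                   echo 0 ;;
        esac
}

# Apply the value to <enclosure sysfs path>/fault, writing only on change.
update_fault_led() {
        local state=$1 enc_path=$2 want
        want=$(led_value_for_state "${state}")
        [ -n "${want}" ] || return 0
        [ -e "${enc_path}/fault" ] || return 1
        [ "$(cat "${enc_path}/fault")" = "${want}" ] || echo "${want}" > "${enc_path}/fault"
}
```

A zedlet body could then be a single call such as `update_fault_led "${ZEVENT_VDEV_STATE_STR}" "${ZEVENT_VDEV_ENC_SYSFS_PATH}"`, skipped when the path variable is empty.
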

behlendorf pushed a commit that referenced this issue Oct 25, 2016
- Fix autoreplace behaviour on statechange-led.sh script.

ZED sends the following events on an auto-replace:

1. statechange: Disk goes UNAVAIL->ONLINE
2. statechange: Disk goes ONLINE->UNAVAIL
3. vdev_attach: Disk goes ONLINE

Events 1-2 happen when ZED first attempts to do an auto-online.  When that
fails, ZED then tries an auto-replace, generating the vdev_attach event in #3.

In the previous code, statechange-led was only looking at the UNAVAIL->ONLINE
transition to turn off the LED.  It ignored the #2 ONLINE->UNAVAIL transition,
assuming it was just the "old" VDEV going offline.  This is problematic, as
a drive can go from ONLINE->UNAVAIL when it's malfunctioning, and we don't want
to ignore that.

This new patch correctly turns on the fault LED every time a drive becomes
UNAVAIL.  It also monitors vdev_attach events to trigger turning off the LED
when an auto-replaced disk comes online.

- Remove unnecessary libdevmapper warning with --with-config=kernel

This fixes an unnecessary libdevmapper warning when building
--with-config=kernel.  Kernel code does not use libdevmapper, so the warning
is not needed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #2375 
Closes #5312 
Closes #5331
8 participants