smart: Gather S.M.A.R.T. information from storage devices #2449

Merged 18 commits on Oct 4, 2017

rickard-von-essen
Contributor

@rickard-von-essen rickard-von-essen commented Feb 20, 2017

This adds a new input plugin which uses the smartctl utility from the
smartmontools package to gather metrics from S.M.A.R.T. storage devices.

Signed-off-by: Rickard von Essen <rickard.von.essen@gmail.com>

Supersedes #2402
Closes #1880

Required for all PRs:

  • CHANGELOG.md updated (we recommend not updating this until the PR has been approved by a maintainer)
  • Sign CLA (if not already signed)
  • README.md updated (if adding a new plugin)

@rickard-von-essen
Contributor Author

TODO:

  • Properly handle exit codes
  • Run gathering of metrics concurrently
  • Split metrics into smart_device and smart_attribute.
  • Handle versions 5.41, 5.42, 5.43, 6.[0-6]
  • Update README.md: version limitations (5.41, 5.42 + nocheck), new format, sudo

@rickard-von-essen
Contributor Author

I have updated this with more documentation, concurrent metrics gathering, and a different metrics structure, and verified it against versions 5.41, 5.42, 5.43, 6.0, 6.1, 6.2, 6.3, 6.4, and 6.5.

@sebito91 It would be awesome if you could help test this by checking the performance on your 96-disk system and verifying that it also covers your use case.

@rickard-von-essen
Contributor Author

Alternative to #2319

@sebito91
Contributor

sebito91 commented Mar 1, 2017

Have not had a chance to look at this yet, but happy to merge the two threads into one. Will test out on the mega machine tomorrow morning (EST).

@evanrich

this is the one thing missing from telegraf that would make my life complete.

@rickard-von-essen
Contributor Author

@evanrich It would be very valuable if you could test this PR and provide feedback.

@evanrich

@rickard-von-essen I won't be able to pull and test till sometime next week, on vacation this weekend, but I'll give it a go as soon as I can.

@stemwinder

Looking forward to this addition. Thanks for the work, @rickard-von-essen. I'll do some testing of my own on this PR.

Contributor

@sebito91 sebito91 left a comment

I'm happy to drop my commit in favor of this one given this is more flexible and handles a variety of OS setups. That being said, still not seeing any of the attribute information on these hosts...

Running with the sudo wrapper you mention in the help text (which should be an option in the config IMHO), we get the output listed below with exit_status = 1. When running as root, we get exit_status = 0 but no smart_attribute information whatsoever.

[telegraf@carf-metrics-influx02 ~]$ telegraf --test --config /etc/telegraf/telegraf.conf --input-filter smart
* Plugin: inputs.smart, Collection 1
> smart_device,dc=carf,host=carf-metrics-influx02,bu=linux,device=/dev/sdh,env=production,cls=server,trd=false,sr=metrics exit_status=1i 1491517478000000000
> smart_device,device=/dev/sdbu,bu=linux,env=production,cls=server,trd=false,sr=metrics,dc=carf,host=carf-metrics-influx02 exit_status=1i 1491517478000000000
> smart_device,dc=carf,device=/dev/sdl,host=carf-metrics-influx02,bu=linux,env=production,cls=server,trd=false,sr=metrics exit_status=1i 1491517478000000000
... (repeats ~90 more times)

# devices = [ "/dev/ada0 -d atacam" ]
```

To run `smartctl` with `sudo` create a wrapper script and use `path` in
Contributor

Can't we make this into a config file bool? Here's my wrapper so far, yields exit_status = 1...

[telegraf@carf-metrics-influx02 ~]$ cat tester
#!/bin/bash

sudo /usr/sbin/smartctl $1

Contributor Author

Your wrapper has to pass all arguments, so:

#!/usr/bin/env bash

sudo /usr/sbin/smartctl "$@"

Exit code 1 means command-line parsing failed for smartctl.

> Can't we make this into a config file bool?

If the maintainers would like that, I can add it, but IMHO it's unnecessary when you have path.

Metrics will be reported from the following `smartctl` command:

```
smartctl --info --attributes --health -n <nocheck> --format=brief <device>
Contributor

This is actually not generating any output for the attribute checker on RHEL7 using smartctl 6.2. See below...

[telegraf@carf-metrics-influx02 ~]$ ./tester --info --attributes --health --format=brief /dev/sdcp
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.6.1.el7.jump1.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              PX02SMF040
Revision:             A3B3
User Capacity:        400,088,457,216 bytes [400 GB]
Logical block size:   512 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x500003965c89f4a0
Serial number:        65J0A025T0QB
Device type:          disk
Transport protocol:   SAS
Local Time is:        Thu Apr  6 17:36:40 2017 CDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

SS Media used endurance indicator: 0%
Current Drive Temperature:     38 C
Drive Trip Temperature:        60 C

Manufactured in week 25 of year 2015
Elements in grown defect list: 0

[telegraf@carf-metrics-influx02 ~]$ ./tester --info -x --attributes --health --format=brief /dev/sdcp
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.6.1.el7.jump1.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              PX02SMF040
Revision:             A3B3
User Capacity:        400,088,457,216 bytes [400 GB]
Logical block size:   512 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x500003965c89f4a0
Serial number:        65J0A025T0QB
Device type:          disk
Transport protocol:   SAS
Local Time is:        Thu Apr  6 17:37:21 2017 CDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

SS Media used endurance indicator: 0%
Current Drive Temperature:     38 C
Drive Trip Temperature:        60 C

Manufactured in week 25 of year 2015
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      33119.546           0
write:         0        0         0         0          0      74858.031           0
verify:        0        0         0         0          0          2.622           0

Non-medium error count:       32

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  64       3                 - [-   -    -]
# 2  Background long   Completed                  64       2                 - [-   -    -]
# 3  Background short  Completed                  64       2                 - [-   -    -]
Long (extended) Self Test duration: 1800 seconds [30.0 minutes]

Device does not support Background scan results logging
Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 4
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: SMP phy control function
    reason: unknown
    negotiated logical link rate: reserved [11]
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x500003965c89f4a2
    attached SAS address = 0x5f8db882fbf5737f
    attached phy identifier = 6
    Invalid DWORD count = 4
    Running disparity error count = 3
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
relative target port id = 2
  generation code = 4
  number of phys = 1
  phy identifier = 1
    attached device type: expander device
    attached reason: SMP phy control function
    reason: loss of dword synchronization
    negotiated logical link rate: reserved [11]
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x500003965c89f4a3
    attached SAS address = 0x5f8db882fbf573ff
    attached phy identifier = 6
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
Only support protocol specific log page on SAS devices

Contributor

This is probably the most important thing to figure out. Did the format change or does this drive just not have any attributes?

@sebito91
Contributor

sebito91 commented Apr 6, 2017

I can't quite comment on the timing to run this engine as the attributes are not parsing. For now, the simple exit_status = {0,1} as listed above is returned quickly, but that's not really a good indicator of performance.

@rickard-von-essen
Contributor Author

@sebito91 Running <wrapper> --info --attributes --health --format=brief <device> should output something including the attributes (example here).

@sebito91
Contributor

sebito91 commented Apr 7, 2017 via email

@rickard-von-essen
Contributor Author

@sebito91 Yes I tested all of 5.41, 5.42, 5.43, 6.[0-6].

@rickard-von-essen
Contributor Author

@sebito91 What does smartctl -a /dev/sdcp give you? And what about smartctl --scan | grep /dev/sdcp?

@sebito91
Contributor

sebito91 commented Apr 7, 2017

[telegraf@carf-metrics-influx02 ~]$ ./tester -a /dev/sdcp
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.6.1.el7.jump1.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              PX02SMF040
Revision:             A3B3
User Capacity:        400,088,457,216 bytes [400 GB]
Logical block size:   512 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x500003965c89f4a0
Serial number:        65J0A025T0QB
Device type:          disk
Transport protocol:   SAS
Local Time is:        Fri Apr  7 10:36:58 2017 CDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

SS Media used endurance indicator: 0%
Current Drive Temperature:     39 C
Drive Trip Temperature:        60 C

Manufactured in week 25 of year 2015
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      33119.568           0
write:         0        0         0         0          0      74931.603           0
verify:        0        0         0         0          0          2.622           0

Non-medium error count:       32

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  64       3                 - [-   -    -]
# 2  Background long   Completed                  64       2                 - [-   -    -]
# 3  Background short  Completed                  64       2                 - [-   -    -]
Long (extended) Self Test duration: 1800 seconds [30.0 minutes]

[telegraf@carf-metrics-influx02 ~]$ ./tester --scan | grep /dev/sdcp
/dev/sdcp -d scsi # /dev/sdcp, SCSI device

@danielnelson danielnelson added this to the 1.4.0 milestone May 8, 2017
@stemwinder

After merging the branch @rickard-von-essen created into v1.3.1 and building from source, this is what I receive from a test run on one disk. Are these the expected results?

* Plugin: inputs.smart, Collection 1
> smart_attribute,fail=-,host=grabhammer,device=/dev/sdb,id=1,name=Raw_Read_Error_Rate,flags=POSR-- exit_status=0i,value=119i,worst=99i,threshold=6i,raw_value=221947792i 1496299210000000000
> smart_attribute,fail=-,host=grabhammer,device=/dev/sdb,id=3,name=Spin_Up_Time,flags=PO---- exit_status=0i,value=97i,worst=96i,threshold=0i,raw_value=0i 1496299210000000000
> smart_attribute,device=/dev/sdb,id=4,name=Start_Stop_Count,flags=-O--CK,fail=-,host=grabhammer exit_status=0i,value=99i,worst=99i,threshold=20i,raw_value=1727i 1496299210000000000
> smart_attribute,host=grabhammer,device=/dev/sdb,id=5,name=Reallocated_Sector_Ct,flags=PO--CK,fail=- worst=100i,threshold=10i,raw_value=0i,exit_status=0i,value=100i 1496299210000000000
> smart_attribute,name=Seek_Error_Rate,flags=POSR--,fail=-,host=grabhammer,device=/dev/sdb,id=7 exit_status=0i,value=72i,worst=60i,threshold=30i,raw_value=17678941i 1496299210000000000
> smart_attribute,id=9,name=Power_On_Hours,flags=-O--CK,fail=-,host=grabhammer,device=/dev/sdb exit_status=0i,value=86i,worst=86i,threshold=0i,raw_value=13006i 1496299210000000000
> smart_attribute,device=/dev/sdb,id=10,name=Spin_Retry_Count,flags=PO--C-,fail=-,host=grabhammer exit_status=0i,value=100i,worst=100i,threshold=97i,raw_value=0i 1496299210000000000
> smart_attribute,fail=-,host=grabhammer,device=/dev/sdb,id=12,name=Power_Cycle_Count,flags=-O--CK value=100i,worst=100i,threshold=20i,raw_value=58i,exit_status=0i 1496299210000000000
> smart_attribute,flags=-O--CK,fail=-,host=grabhammer,device=/dev/sdb,id=183,name=Runtime_Bad_Block threshold=0i,raw_value=0i,exit_status=0i,value=100i,worst=100i 1496299210000000000
> smart_attribute,host=grabhammer,device=/dev/sdb,id=184,name=End-to-End_Error,flags=-O--CK,fail=- raw_value=0i,exit_status=0i,value=100i,worst=100i,threshold=99i 1496299210000000000
> smart_attribute,id=187,name=Reported_Uncorrect,flags=-O--CK,fail=-,host=grabhammer,device=/dev/sdb threshold=0i,raw_value=0i,exit_status=0i,value=100i,worst=100i 1496299210000000000
> smart_attribute,device=/dev/sdb,id=188,name=Command_Timeout,flags=-O--CK,fail=-,host=grabhammer worst=100i,threshold=0i,raw_value=0i,exit_status=0i,value=100i 1496299210000000000
> smart_attribute,host=grabhammer,device=/dev/sdb,id=189,name=High_Fly_Writes,flags=-O-RCK,fail=- exit_status=0i,value=99i,worst=99i,threshold=0i,raw_value=1i 1496299210000000000
> smart_attribute,host=grabhammer,device=/dev/sdb,id=190,name=Airflow_Temperature_Cel,flags=-O---K,fail=- threshold=45i,raw_value=37i,exit_status=0i,value=63i,worst=50i 1496299210000000000
> smart_attribute,device=/dev/sdb,id=191,name=G-Sense_Error_Rate,flags=-O--CK,fail=-,host=grabhammer exit_status=0i,value=100i,worst=100i,threshold=0i,raw_value=0i 1496299210000000000
> smart_attribute,flags=-O--CK,fail=-,host=grabhammer,device=/dev/sdb,id=192,name=Power-Off_Retract_Count worst=100i,threshold=0i,raw_value=31i,exit_status=0i,value=100i 1496299210000000000
> smart_attribute,device=/dev/sdb,id=193,name=Load_Cycle_Count,flags=-O--CK,fail=-,host=grabhammer exit_status=0i,value=98i,worst=98i,threshold=0i,raw_value=5229i 1496299210000000000
> smart_attribute,host=grabhammer,device=/dev/sdb,id=194,name=Temperature_Celsius,flags=-O---K,fail=- raw_value=37i,exit_status=0i,value=37i,worst=50i,threshold=0i 1496299210000000000
> smart_attribute,device=/dev/sdb,id=197,name=Current_Pending_Sector,flags=-O--C-,fail=-,host=grabhammer exit_status=0i,value=100i,worst=100i,threshold=0i,raw_value=0i 1496299210000000000
> smart_attribute,device=/dev/sdb,id=198,name=Offline_Uncorrectable,flags=----C-,fail=-,host=grabhammer threshold=0i,raw_value=0i,exit_status=0i,value=100i,worst=100i 1496299210000000000
> smart_attribute,fail=-,host=grabhammer,device=/dev/sdb,id=199,name=UDMA_CRC_Error_Count,flags=-OSRCK exit_status=0i,value=200i,worst=200i,threshold=0i,raw_value=0i 1496299210000000000
> smart_attribute,fail=-,host=grabhammer,device=/dev/sdb,id=240,name=Head_Flying_Hours,flags=------ exit_status=0i,value=100i,worst=253i,threshold=0i,raw_value=42238190i 1496299210000000000
> smart_attribute,flags=------,fail=-,host=grabhammer,device=/dev/sdb,id=241,name=Total_LBAs_Written exit_status=0i,value=100i,worst=253i,threshold=0i,raw_value=13957047056i 1496299210000000000
> smart_attribute,device=/dev/sdb,id=242,name=Total_LBAs_Read,flags=------,fail=-,host=grabhammer worst=253i,threshold=0i,raw_value=91348334421i,exit_status=0i,value=100i 1496299210000000000
> smart_device,host=grabhammer,device=/dev/sdb,device_model=ST2000DM001-1ER164,serial_no=Z4Z3Z3H5,capacity=2000398934016,enabled=Enabled,health=PASSED exit_status=0i 1496299210000000000

@rickard-von-essen
Contributor Author

@stemwinder Looks correct to me.
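
For readers following along, here is a rough Go sketch of how one attribute row from `smartctl --attributes --format=brief` could be turned into the fields seen in the `smart_attribute` lines above. It is not the plugin's actual parser. The sample row is reconstructed from the first metric in the test run and assumes the usual brief column order (ID#, ATTRIBUTE_NAME, FLAGS, VALUE, WORST, THRESH, FAIL, RAW_VALUE); the `attribute` type and `parseAttributeLine` function are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// attribute mirrors the tags and fields of a smart_attribute metric.
// The type and its field names are illustrative only.
type attribute struct {
	ID        int64
	Name      string
	Flags     string
	Value     int64
	Worst     int64
	Threshold int64
	Fail      string
	RawValue  int64
}

// parseAttributeLine splits one whitespace-separated attribute row.
// Assumed column order: ID# NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE.
// Only the first token of RAW_VALUE is used; trailing text is ignored.
func parseAttributeLine(line string) (attribute, error) {
	f := strings.Fields(line)
	if len(f) < 8 {
		return attribute{}, fmt.Errorf("expected at least 8 columns, got %d", len(f))
	}
	nums := make([]int64, 0, 5)
	for _, i := range []int{0, 3, 4, 5, 7} {
		// Base 10 is explicit, so zero-padded values such as "099" parse as 99.
		n, err := strconv.ParseInt(f[i], 10, 64)
		if err != nil {
			return attribute{}, fmt.Errorf("column %d: %v", i, err)
		}
		nums = append(nums, n)
	}
	return attribute{
		ID: nums[0], Name: f[1], Flags: f[2],
		Value: nums[1], Worst: nums[2], Threshold: nums[3],
		Fail: f[6], RawValue: nums[4],
	}, nil
}

func main() {
	// Sample row reconstructed from the metric above, not captured output.
	a, err := parseAttributeLine("  1 Raw_Read_Error_Rate     POSR--   119   099   006    -    221947792")
	fmt.Println(a, err)
}
```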

@andsens

andsens commented Jun 27, 2017

Any idea on when this will get merged? I'd love me some HDD temp stats on my NAS :-)

Contributor

@danielnelson danielnelson left a comment

Also see #2319 (review)

Makefile Outdated
@@ -24,7 +25,7 @@ build-windows:
./cmd/telegraf/telegraf.go

build-for-docker:
- CGO_ENABLED=0 GOOS=linux go build -installsuffix cgo -o telegraf -ldflags \
+ CGO_ENABLED=0 GOOS=$(GOOS) go build -installsuffix cgo -o telegraf -ldflags \
Contributor

Remove this code from the Makefile for this pull request

* Tags:
- `capacity`
- `device`
- `device_model`
Contributor

I think just model, since the measurement name is smart_device.

- `id`
- `name`
* Fields:
- `exit_status`
Contributor

Would rather leave exit status to the internal plugin and the logging output.

Contributor Author

I kept this and added some info in the README. It consists of a bit pattern that can be useful to find drives that are in some way failing or starting to fail.
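
To illustrate the bit pattern mentioned here, a small Go sketch that decodes a `smartctl` exit status into the conditions listed in the EXIT STATUS section of the smartctl(8) man page. The descriptions are paraphrased from that man page; the `decodeExitStatus` helper is a hypothetical name and not part of the plugin.

```go
package main

import "fmt"

// exitStatusBits paraphrases the smartctl(8) EXIT STATUS section.
// Index i describes bit i of the exit status.
var exitStatusBits = []string{
	"command line did not parse",
	"device open failed, or device is in a low-power mode",
	"a SMART or other ATA command failed, or checksum error",
	"SMART status check returned DISK FAILING",
	"prefail attributes found at or below threshold",
	"attributes have been at or below threshold in the past",
	"device error log contains records of errors",
	"device self-test log contains records of errors",
}

// decodeExitStatus returns one description per bit set in status.
func decodeExitStatus(status int) []string {
	var out []string
	for bit, desc := range exitStatusBits {
		if status&(1<<uint(bit)) != 0 {
			out = append(out, desc)
		}
	}
	return out
}

func main() {
	for _, s := range []int{0, 1, 72} {
		fmt.Printf("exit_status=%d -> %q\n", s, decodeExitStatus(s))
	}
}
```

With this reading, the `exit_status=1i` values reported earlier in the thread correspond to bit 0, which matches the explanation that the wrapper's command line did not parse.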

# devices = [ "/dev/ada0 -d atacam" ]
```

To run `smartctl` with `sudo` create a wrapper script and use `path` in
Contributor

I think it would be a nice touch to have sudo support, it's just a little more convenient.

You can add a use_sudo field like we did in [fail2ban](https://github.com/influxdata/telegraf/blob/ca9cec2c84e7c8796c2e8a747d17d1ad86ce1ae6/plugins/inputs/fail2ban/README.md#configuration), or it might be more readable and extensible to have something like ansible: become_method = "sudo"

Contributor Author

Added use_sudo

#
## Skip checking disks in this power mode. Defaults to
## "standby" to not wake up disks that have stoped rotating.
## See --nockeck in the man pages for smartctl.
Contributor

nockeck (sic)

// Get info and attributes for each S.M.A.R.T. device
func (m *Smart) getAttributes(acc telegraf.Accumulator, devices []string) []error {

errchan := make(chan error)
Contributor

Now that we have Accumulator.AddError you should use this to report errors.

Contributor Author

@rickard-von-essen rickard-von-essen Aug 7, 2017

I assume then I don't have to return those errors from Input.Gather() (this isn't obvious from the documentation)?

Contributor

That's right, only use AddError or return. This way Telegraf counts the correct number of errors in the internal plugin, and the logging looks right.


func gatherDisk(acc telegraf.Accumulator, path, nockeck, device string, err chan error) {

// smartctl 5.41 & 5.42 have are broken regarding handling of --nocheck/-n
Contributor

Do you know which distros contain these versions?

Contributor Author

Debian oldstable (7) is the only one that I'm aware of that is still supported.

Metrics will be reported from the following `smartctl` command:

```
smartctl --info --attributes --health -n <nocheck> --format=brief <device>
Contributor

This is probably the most important thing to figure out. Did the format change or does this drive just not have any attributes?

This adds a new input plugin which uses the `smartctl` utility from the
smartmontools package to gather metrics from S.M.A.R.T. storage devices.

Signed-off-by: Rickard von Essen <rickard.von.essen@gmail.com>
5.41 and 5.42 have problems determining the current power mode and don't
recognise the --nocheck argument even though it's in the docs.
@rickard-von-essen
Contributor Author

I'm working on addressing the review comments and some improvements. Stay tuned.

@stemwinder

stemwinder commented Aug 7, 2017

FYI, I've been running this plugin for two months now with zero issues. Looking forward to it finding its way into the official release.

I would recommend updating the documentation to suggest the user make use of /dev/disk/by-id/ as disk device names are apt to change, and the history is pretty useless when an update is made, like expanding an array.

@rickard-von-essen
Contributor Author

@danielnelson This is ready for re-review.

@rickard-von-essen
Contributor Author

> I would recommend updating the documentation to suggest the user make use of /dev/disk/by-id/ as disk device names are apt to change, and the history is pretty useless when an update is made, like expanding an array.

@stemwinder Good suggestion, can you suggest something for that?

@danielnelson
Contributor

I thought you were going to add support for the error counters log, but I don't see metrics for these:

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      33119.546           0
write:         0        0         0         0          0      74858.031           0
verify:        0        0         0         0          0          2.622           0
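
As a sketch only, one way the `read`, `write` and `verify` rows of this table could be parsed in Go. This is not code from the PR, which at this point does not report these counters. The `errorCounterRow` type, its field names and `parseErrorCounterRow` are hypothetical, and the column layout is assumed to be exactly the one in the output quoted above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// errorCounterRow holds one row of the "Error counter log" table.
// All names here are illustrative.
type errorCounterRow struct {
	Operation          string // "read", "write" or "verify"
	CorrectedFast      int64
	CorrectedDelayed   int64
	RereadsRewrites    int64
	TotalCorrected     int64
	AlgoInvocations    int64
	GigabytesProcessed float64
	TotalUncorrected   int64
}

// parseErrorCounterRow parses a line such as
// "read:   0   0   0   0   0   33119.546   0".
// Header lines and other output are rejected with an error.
func parseErrorCounterRow(line string) (errorCounterRow, error) {
	f := strings.Fields(line)
	if len(f) != 8 || !strings.HasSuffix(f[0], ":") {
		return errorCounterRow{}, fmt.Errorf("not an error counter row: %q", line)
	}
	row := errorCounterRow{Operation: strings.TrimSuffix(f[0], ":")}
	ints := []*int64{
		&row.CorrectedFast, &row.CorrectedDelayed, &row.RereadsRewrites,
		&row.TotalCorrected, &row.AlgoInvocations,
	}
	for i, dst := range ints {
		n, err := strconv.ParseInt(f[i+1], 10, 64)
		if err != nil {
			return errorCounterRow{}, err
		}
		*dst = n
	}
	gb, err := strconv.ParseFloat(f[6], 64)
	if err != nil {
		return errorCounterRow{}, err
	}
	row.GigabytesProcessed = gb
	n, err := strconv.ParseInt(f[7], 10, 64)
	if err != nil {
		return errorCounterRow{}, err
	}
	row.TotalUncorrected = n
	return row, nil
}

func main() {
	row, err := parseErrorCounterRow("read:          0        0         0         0          0      33119.546           0")
	fmt.Println(row, err)
}
```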

@danielnelson
Contributor

Using /dev/disk is a really smart idea, we should probably start to suggest using these in the examples for all plugins that take block devices.

@stemwinder

stemwinder commented Aug 10, 2017

@rickard-von-essen I would recommend tucking in something like the following on line 6 of the README:

## Device Names
Path-based device names, e.g., `/dev/sda`, are *not persistent*, and may be subject to change across reboots or system changes. The use of these device names is not recommended. Instead, use persistent block device naming. Block devices' persistent names can be located by their respective *World Wide Identifier* (WWID) at the following location: `/dev/disk/by-id`.

And then follow that up by using something like /dev/disk/by-id/[DISK-WWID] in the examples where appropriate.

@danielnelson Hopefully this is a decent starting point for other plugins or general documentation. The use of non-persistent device paths really does muck things up unless you're dealing with a completely closed system.

@rickard-von-essen
Contributor Author

@stemwinder I agree in general, but have two comments: 1) it's Linux-specific; I'll add something about that to the text. 2) If you have the problem of disks moving around, you probably have lots of disks and/or hosts, and then you would most likely want to use the autodetect feature (--scan).

@rickard-von-essen
Contributor Author

rickard-von-essen commented Aug 11, 2017

> I thought you were going to add support for the error counters log

@danielnelson Yes, but I need some help from @sebito91. (Depending on how quickly we can sort it out, I'll get it into this or a new PR.) None of my drives have the Error counter log. Can you somehow verify whether this is a SAS disk feature or a Toshiba-specific feature? What is the minimal set of arguments you need to pass to smartctl to get this printed?

@stemwinder

@rickard-von-essen I completely agree that users with large amounts of systems and disks would probably just rather use the scan option and only pay attention to current state, disregarding state history. But this caveat does need to be explained I think, so that when people see turnovers from hundreds of thousands of start/stops to < 20 on the same device name, for example, they know why.

If it were me in their position, I would fork the plugin and change it to derive the WWID from the scan output.

@sebito91
Contributor

@rickard-von-essen will get you some more HDD/SSD data later this weekend.

Added WWN to both smart_device and smart_attribute measurements. And
added serial_no also to smart_attribute.
@rickard-von-essen
Contributor Author

@stemwinder Added WWN and some info in da606ac, please review.

@danielnelson danielnelson added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Aug 14, 2017
@danielnelson danielnelson modified the milestones: 1.4.0, 1.5.0 Aug 16, 2017
@danielnelson
Contributor

@rickard-von-essen Can you swap out strconv.Atoi(x) in favor of strconv.ParseInt(x, 10, 64), so that we won't have any problems on 32-bit systems, and then I think we are ready to merge.
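
A short Go sketch of why this matters, using the `raw_value=91348334421i` reported for `Total_LBAs_Read` earlier in the thread. `strconv.Atoi` returns a platform-sized `int`, so on a 32-bit build a value this large fails with a range error, while `strconv.ParseInt(x, 10, 64)` always yields an `int64`. The helpers `parseRaw` and `fitsIn32` are hypothetical names; the 32-bit case is simulated by asking `ParseInt` for a bit size of 32.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseRaw parses a raw attribute value as a 64-bit integer regardless
// of the platform's int size.
func parseRaw(s string) (int64, error) {
	return strconv.ParseInt(s, 10, 64)
}

// fitsIn32 reports whether s parses as a signed 32-bit integer, which
// approximates what strconv.Atoi can hold on a 32-bit platform.
func fitsIn32(s string) bool {
	_, err := strconv.ParseInt(s, 10, 32)
	return err == nil
}

func main() {
	raw := "91348334421" // Total_LBAs_Read raw value from the test run above

	v, err := parseRaw(raw)
	fmt.Println("64-bit parse:", v, err)
	fmt.Println("fits in 32 bits:", fitsIn32(raw))
}
```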

@rickard-von-essen
Contributor Author

> Can you swap out strconv.Atoi(x) in favor of strconv.ParseInt(x, 10, 64)

Done

@danielnelson danielnelson merged commit e69c3f9 into influxdata:master Oct 4, 2017
@vlambaard

Is there any windows support for this plugin? I have a fairly large system I would love to test this on

@rickard-von-essen
Contributor Author

@vlambaard Yes, I guess it should work as long as you have a working smartctl binary; according to their page it should work, see About Smartmontools.

Labels
feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin new plugin

7 participants