smartd_log: wrong chart type for several attributes #7388

ilyam8 · 2019-11-29T10:18:23Z

Sorry, I was otherwise engaged. There are a few other entries which need fixing too.

Most of the "INCREMENTAL" entries should be looked at closely; however, the important ones are:

read_total_err_corrected
read_total_unc_errors
write_total_err_corrected
write_total_unc_errors
verify_total_err_corrected
verify_total_unc_errors

spin_up_retries
calibration_retries
reallocated_sectors_count
program_fail_count
erase_fail_count
reallocation_event_count

When monitoring drives, it's easy to miss blips, and what really matters is the total health numbers over time. (It's worth reading the SMART attribute tables for explanations of each item.)

If these traces sit at zero, it's far too easy to get a false sense of security.

Originally posted by @Stoatwblr in #7383 (comment)

ilyam8 · 2019-11-29T10:23:31Z

Problem reported by @Stoatwblr.
All attributes from OP should be absolute, not incremental. We need to check it and fix it if it is true.

I tend to agree. The incremental type is not good for raw values, because zero values (rates) on a chart are misleading: a user interprets them as "all is OK", which is false.
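To illustrate the point, here is a minimal sketch of how the same raw counter looks under each chart type. It assumes one sample per collection interval and ignores Netdata's per-second normalization and interpolation; the function name and the sample values are made up for illustration.

```python
def incremental(samples):
    """Difference between consecutive samples, roughly what an
    incremental dimension plots (simplified, hypothetical helper)."""
    return [cur - prev for prev, cur in zip(samples, samples[1:])]


# Hypothetical raw values of a cumulative SMART counter
# (e.g. reallocated_sectors_count) over six collection intervals.
raw = [0, 0, 8, 8, 8, 8]

# An absolute dimension plots the raw values as they are.
print("absolute:   ", raw[1:])           # [0, 8, 8, 8, 8]

# An incremental dimension plots only the change per interval.
print("incremental:", incremental(raw))  # [0, 8, 0, 0, 0]
```

With the incremental type, the 8 reallocated sectors show up as a single blip and the chart then returns to zero, even though the drive still reports 8.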

cakrit · 2020-10-26T16:12:34Z

This should be a very quick fix. Can we do it?

Ferroin · 2020-10-26T23:12:12Z

In general, no SMART attributes should be treated as incremental unless we want a running total. They’re all inherently absolute counters or absolute gauges, because that’s quite simply how SMART works.

All of the stuff listed in the OP should indeed be absolute, and they’re also all definitely counters (they all count discrete events). Most of the other attributes with things like "retries" or "count" in their name should also be counters, and in general all of these should also be raw attributes. Notably, these can also go down in certain cases, and we should actually probably have alarms for that case for some of them (though that should be its own issue).
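As a rough sketch of what detecting the "counter went down" case could look like, the snippet below flags every sample that is lower than the one before it. This is not the collector's code or a Netdata alarm definition; `find_decreases` and the sample values are hypothetical.

```python
def find_decreases(samples):
    """Return (index, previous, current) for every sample that is
    lower than the sample before it (hypothetical helper)."""
    return [
        (i, prev, cur)
        for i, (prev, cur) in enumerate(zip(samples, samples[1:]), start=1)
        if cur < prev
    ]


# Hypothetical raw counter values; the drop at index 4 is the
# kind of event an alarm might want to catch.
raw = [12, 12, 13, 13, 0, 0]
print(find_decreases(raw))  # [(4, 13, 0)]
```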

Any of the rate attributes should probably be absolute gauges, but we need to look at normalized values there because the raw values are almost invariably vendor-specific and kind of meaningless unless you have a lot of potentially proprietary knowledge about the drive firmware. Those are potentially strange though because the normalized values count down towards zero from 100 (except for temperature attributes, but those are weird for numerous other reasons).

Any of the temperature attributes should be absolute gauges with special (and rather complicated) handling interpreting the raw values directly. I’m 99% certain, though, that we handle all of these correctly.

Possibly also see some of the discussion in #4285, as there was a lot of research that happened there relating to the smartd_log collector.

ilyam8 added area/external/python bug help wanted labels Nov 29, 2019

cakrit assigned vlvkobal, ilyam8 and thiagoftsm Oct 26, 2020

thiagoftsm mentioned this issue Oct 27, 2020

smartd_log: Change from absolute to incremental #10144

Closed

ilyam8 removed the bug label Oct 27, 2020

ilyam8 added the group/d-c label Jan 27, 2021

ilyam8 unassigned thiagoftsm Aug 23, 2021

ilyam8 added collectors/python.d area/collectors and removed area/external/python labels Apr 14, 2022

vlvkobal removed their assignment May 17, 2022

ilyam8 mentioned this issue Apr 30, 2024

go.d smartctl #17536

Merged

ilyam8 closed this as completed in #17536 Apr 30, 2024
