smartd_log: wrong chart type for several attributes #7388

Closed
ilyam8 opened this issue Nov 29, 2019 · 3 comments · Fixed by #17536
Labels
area/collectors Everything related to data collection collectors/python.d help wanted

Comments

@ilyam8
Member

ilyam8 commented Nov 29, 2019

Sorry, I was otherwise engaged. There are a few other entries which need fixing too.

Most of the "INCREMENTAL" entries should be looked at closely, but the important ones are:

read_total_err_corrected
read_total_unc_errors
write_total_err_corrected
write_total_unc_errors
verify_total_err_corrected
verify_total_unc_errors

spin_up_retries
calibration_retries
reallocated_sectors_count
program_fail_count
erase_fail_count
reallocation_event_count

When monitoring drives, it's easy to miss blips, and what really matters is the total health numbers over time. (It's worth reading the SMART attribute tables for explanations of each item.)

If these traces sit at zero, it's far too easy to get a false sense of security.

Originally posted by @Stoatwblr in #7383 (comment)

@ilyam8
Member Author

ilyam8 commented Nov 29, 2019

Problem reported by @Stoatwblr.
All attributes from OP should be absolute, not incremental. We need to check it and fix it if it is true.

I tend to agree. The incremental type is not good for raw values, because zero values (the rate) on a chart are misleading: a user interprets them as "all is OK", which is false.
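To illustrate the point, here is a minimal sketch of how the same raw samples look under the two chart types. The function names and the sample values are made up for illustration, and this is not the collector's actual code; it only assumes that an incremental dimension plots the per-interval difference while an absolute one plots the value as collected.

```python
def incremental(samples, interval=1.0):
    """Per-interval rate of change between consecutive samples."""
    return [(cur - prev) / interval for prev, cur in zip(samples, samples[1:])]


def absolute(samples):
    """The collected values themselves, aligned with the incremental output."""
    return list(samples[1:])


# Hypothetical raw readings of reallocated_sectors_count: the drive already
# has 5 reallocated sectors, and nothing changes while we are watching.
raw = [5, 5, 5, 5]

as_rate = incremental(raw)   # flat line at zero, which looks healthy
as_value = absolute(raw)     # flat line at 5, which shows the real state
```

With the incremental type the chart only moves at the moment a new event happens, so a drive that accumulated its errors before monitoring started is indistinguishable from a clean one.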

@cakrit
Contributor

cakrit commented Oct 26, 2020

This should be a very quick fix. Can we do it?

@Ferroin
Member

Ferroin commented Oct 26, 2020

In general, no SMART attributes should be treated as incremental unless we specifically want a rate of change. They’re all inherently absolute counters or absolute gauges, because that’s quite simply how SMART works.

All of the stuff listed in the OP should indeed be absolute, and they’re also all definitely counters (they all count discrete events). Most of the other attributes with things like retries or count in their name should also be counters, and in general all of these should also be raw attributes. Notably, these can also go down in certain cases, and we should actually probably have alarms for that case for some of them (though that should be its own issue).
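As a rough sketch of what the fix could look like, the snippet below builds a python.d-style chart definition in which every dimension line uses the `absolute` algorithm. It assumes the usual python.d layout of `options` plus `lines`, where each line is `[dimension_id, name, algorithm]`. The helper `build_chart`, the disk names, and the units, family, and context strings are all hypothetical and are not taken from the smartd_log module.

```python
def build_chart(attr, disks):
    """Build one chart for `attr` with an 'absolute' dimension per disk."""
    lines = [
        # [dimension id, display name, algorithm]
        ["{0}_{1}".format(disk, attr), disk, "absolute"]
        for disk in disks
    ]
    return {
        # [name, title, units, family, context, chart type]; values are placeholders
        "options": [None, attr.replace("_", " ").title(), "events",
                    "errors", "smartd_log." + attr, "line"],
        "lines": lines,
    }


# Hypothetical disks; the attribute id comes from the list in the original post.
chart = build_chart("reallocated_sectors_count", ["sda", "sdb"])
```

The only part that matters for this issue is the third element of each line: with `absolute` the chart shows the raw counter itself rather than its change between collections.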

Any of the rate attributes should probably be absolute gauges, but we need to look at normalized values there because the raw values are almost invariably vendor-specific and kind of meaningless unless you have a lot of potentially proprietary knowledge about the drive firmware. Those are potentially strange though because the normalized values count down towards zero from 100 (except for temperature attributes, but those are weird for numerous other reasons).

Any of the temperature attributes should be absolute gauges with special (and rather complicated) handling that interprets the raw values directly. I’m 99% certain, though, that we handle all of these correctly.
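The rules of thumb above can be summarized in a small sketch. The function `classify` and its name-matching heuristics are assumptions made purely for illustration; the real collector may decide this per attribute rather than by substring, and temperature handling is more involved than a single label suggests.

```python
def classify(attr_name):
    """Return (value source, kind) for a SMART attribute name, by rough heuristic."""
    name = attr_name.lower()
    if "temperature" in name:
        # Temperatures: gauge, raw value interpreted directly (special handling).
        return ("raw", "gauge")
    if "rate" in name:
        # Rates: gauge, normalized value, since raw values are vendor-specific.
        return ("normalized", "gauge")
    if any(key in name for key in ("count", "retries", "err")):
        # Discrete events: counter, raw value.
        return ("raw", "counter")
    return ("unknown", "unknown")


# In every case the chart algorithm stays 'absolute'; only the value source
# (raw vs. normalized) and the kind (counter vs. gauge) differ.
examples = {name: classify(name) for name in (
    "reallocated_sectors_count",
    "spin_up_retries",
    "read_total_unc_errors",
    "seek_error_rate",
    "temperature_celsius",
)}
```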

Possibly also see some of the discussion in #4285, as a lot of research relating to the smartd_log collector happened there.

5 participants