Fix counter reset detection #7220
Manage this branch in Squash. Test this branch here: https://mfundulfix-counter-reset-detec-fqlqb.squash.io
Actually, netdata only supports up to 64-bit signed values, so the existing code was wrong. See this:
in I will remove support for 64-bit unsigned values.
Or figure out a way to fix unsigned 64-bit counters, but it's not obvious.
I think we'd need a new counter type, because there are some things we track that actually do need to be able to have negative values (though usually they're within a signed 16-bit, or even signed 8-bit range).
As soon as a signed counter wraps around we have a problem because it will begin to produce valid (as far as netdata is concerned) negative values. It might be simpler at this point to fall back to the old code that only handles unsigned counters if we are to improve the existing code-base without making a significant effort to fix things the proper way.
Just realized I was misinterpreting things. There should be no counters we track that need to be negative; it's gauges that may need to be negative (assuming all plugins are reporting things sanely). Assuming we can confirm that, it becomes trivial to reliably identify overflow of signed 32-bit counters, because it's a distinctly different state from reset (IOW, if the counter goes negative when it wasn't close to zero to begin with, it overflowed). No matter what we do though, we can't properly handle overflow of whatever the largest type we have is, which I didn't think of until now either (because we can't store the overflow-corrected value).
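A minimal sketch of that idea, assuming counters are never legitimately negative. The `counter_event` enum and `classify_int32_counter` helper are hypothetical names used for illustration only, not code from netdata:

```c
#include <stdint.h>

/* Hypothetical classification of one sample-to-sample transition. */
typedef enum { COUNTER_OK, COUNTER_RESET, COUNTER_OVERFLOW } counter_event;

/* Classify a transition of a signed 32-bit counter that should only grow.
 * Assumption: a correct counter never reports a negative value on purpose,
 * so a non-negative -> negative step can only come from wrapping past INT32_MAX. */
static counter_event classify_int32_counter(int32_t previous, int32_t current) {
    if (current >= previous)
        return COUNTER_OK;          /* normal monotonic growth */

    if (previous >= 0 && current < 0)
        return COUNTER_OVERFLOW;    /* wrapped from the positive into the negative range */

    return COUNTER_RESET;           /* dropped without crossing the sign boundary */
}
```

For example, a step from `INT32_MAX - 5` to `INT32_MIN + 10` is classified as an overflow, while a step from 5000 to 12 is classified as a reset.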
I added some unit tests and made sure that two's complement for overflowing counters handles signed and unsigned cases. We should be covered even if a physical device overflows into negative values; the delta should always be positive. In the end only the bit width will matter for the incremental algorithm.
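The two's complement argument can be illustrated with a small sketch. This is not the PR's unit test; `wrapped_delta64` and `wrapped_delta32` are hypothetical helpers that assume the collected value is stored in a signed integer of the counter's width:

```c
#include <stdint.h>

/* Delta of a 64-bit counter across a wrap. Casting to unsigned makes the
 * subtraction modular (mod 2^64), so the result is the true increment
 * whether the device counter is signed or unsigned. */
static uint64_t wrapped_delta64(int64_t previous, int64_t current) {
    return (uint64_t)current - (uint64_t)previous;
}

/* Same idea for a 32-bit counter (mod 2^32). The outer cast keeps the
 * result modular even on platforms where int is wider than 32 bits. */
static uint32_t wrapped_delta32(int32_t previous, int32_t current) {
    return (uint32_t)((uint32_t)current - (uint32_t)previous);
}
```

With these helpers, a signed 64-bit counter that moves from `INT64_MAX - 2` to `INT64_MIN + 7` yields a delta of 10, and so does an unsigned counter whose last reading `UINT64_MAX - 2` was stored as `-3` and whose next reading is 7. Only the width of the modular subtraction differs between the two helpers.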
An example, from the unit test:
Here we can see that even though the collector gathered a negative value from an overflowed signed 64-bit counter, the database will eventually store a positive delta that is correct, as shown here:
After the tests, I could not detect any problem with Netdata with these changes, so I am approving it.
Please read the comment about using constants or macros to minimize the probability of error in future use of these thresholds.
else if(max > 0x00000000000000FFULL) cap = 0x000000000000FFFFULL;
else cap = 0x00000000000000FFULL;
// Signed values are handled by exploiting two's complement which will produce positive deltas
if (max > 0x00000000FFFFFFFFULL)
Why not use a const (or, dog forbid a #define) for these to avoid messing up your Fs?
The lazy answer is that I modified the existing code which was this way, but I agree that constants are better, especially if the values are reused.
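One way the reviewer's suggestion could look, as a sketch only. The macro names and the `counter_cap_for_max` helper are hypothetical, and the two-width selection is an assumption based on the fragments quoted in the review together with the commit message's removal of 8-bit and 16-bit support:

```c
#include <stdint.h>

/* Hypothetical named thresholds, so the number of Fs is written exactly once. */
#define COUNTER_CAP_32BIT UINT64_C(0x00000000FFFFFFFF)
#define COUNTER_CAP_64BIT UINT64_C(0xFFFFFFFFFFFFFFFF)

/* Pick the wrap-around cap from the largest value seen so far.
 * Assumption: only 32-bit and 64-bit counters are distinguished. */
static uint64_t counter_cap_for_max(uint64_t max) {
    if (max > COUNTER_CAP_32BIT)
        return COUNTER_CAP_64BIT;   /* value seen exceeds 32 bits: 64-bit counter */
    return COUNTER_CAP_32BIT;       /* otherwise treat it as a 32-bit counter */
}
```

Because each threshold is defined once and reused by name, a miscounted hex digit would be a single, easily reviewed mistake rather than one hidden in a comparison chain.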
* Removed support for 16-bit and 8-bit counter overflow
* Improve behaviour of counter overflow detection versus counter resets.
* Added support for signed 32-bit and 64-bit limits for counter overflows.
* Fixed signed incremental counter issues and added unit tests.
Summary
Fixes #4962
Fixes #5297
Fixes #7217
Component Name
daemon
database
Additional Information
This is the first step to fix some of the problems discussed in #7049.
It is theoretically impossible to always be 100% sure about things such as hardware counters that overflow. As a result, it is theoretically impossible to solve both the counter reset and the counter overflow problem.
This implementation attempts to solve both issues by mostly erring on the safe side as if it was a counter reset.
8-bit and 16-bit or any arbitrary precision counters are not supported as it would be too much effort.
We attempt to identify most of the wrap-arounds without creating too many false negatives for counter resets.
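The trade-off described above can be sketched as a rate heuristic: a drop in the collected value is accepted as a wrap-around only when the implied increment is small relative to the counter's range, and is otherwise treated as a reset. The function names and the `MAX_WRAP_PERCENT` threshold below are hypothetical and chosen for illustration; they are not taken from the PR:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: the largest share of the counter range (in percent)
 * that a single increment may cover and still be accepted as a wrap-around. */
#define MAX_WRAP_PERCENT 10

/* Decide whether a drop from previous to current is a plausible wrap of a
 * counter whose largest representable value is cap. */
static bool looks_like_wraparound(uint64_t previous, uint64_t current, uint64_t cap) {
    if (current >= previous || previous > cap)
        return false;                               /* no drop, or inconsistent input */

    uint64_t wrapped_delta = (cap - previous) + current + 1;
    uint64_t max_acceptable = (cap / 100) * MAX_WRAP_PERCENT;
    return wrapped_delta < max_acceptable;
}

/* Increment to store for this interval: the plain difference, the wrapped
 * difference, or 0 when the drop is treated as a reset (the safe side). */
static uint64_t counter_increment(uint64_t previous, uint64_t current, uint64_t cap) {
    if (current >= previous)
        return current - previous;
    if (looks_like_wraparound(previous, current, cap))
        return (cap - previous) + current + 1;
    return 0;
}
```

With a 32-bit cap, a step from `UINT32_MAX - 5` to 10 is accepted as a wrap with an increment of 16, while a step from 3000000000 to 12 is treated as a reset and contributes 0. A genuine wrap whose increment exceeds the threshold would also be recorded as a reset, which is the cost of erring on the safe side.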