Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix counter reset detection #7220

Merged
merged 4 commits into from Oct 30, 2019

Conversation

@mfundul
Copy link
Contributor

mfundul commented Oct 29, 2019

Summary

Fixes #4962
Fixes #5297
Fixes #7217

Component Name

daemon
database

Additional Information

This is the first step to fix some of the problems discussed in #7049 .

It is theoretically impossible to always be 100% sure about things such as hardware counters that overflow. As a result it is theoretically impossible to both solve the counter reset and the counter overflow prroblem.

This implementation attempts to solve both issues by mostly erring on the safe side as if it was a counter reset.

8-bit and 16-bit or any arbitrary precision counters are not supported as it would be too much effort.

We attempt to identify most of the wrap-arounds without creating too many false negatives for counter resets.

@mfundul mfundul added this to the v1.19-Sprint2 milestone Oct 29, 2019
@mfundul mfundul requested review from cakrit, cosmix and thiagoftsm as code owners Oct 29, 2019
@mfundul mfundul added this to In progress in Core via automation Oct 29, 2019
@squash-labs

This comment has been minimized.

Copy link

squash-labs bot commented Oct 29, 2019

Manage this branch in Squash

Test this branch here: https://mfundulfix-counter-reset-detec-fqlqb.squash.io
database/rrdset.c Outdated Show resolved Hide resolved
@mfundul mfundul changed the title [WIP] Fix counter reset detection Fix counter reset detection Oct 29, 2019
@mfundul mfundul changed the title Fix counter reset detection [WIP] Fix counter reset detection Oct 29, 2019
@mfundul

This comment has been minimized.

Copy link
Contributor Author

mfundul commented Oct 29, 2019

Actually, netdata only supports up to 64-bit signed values, so the existing code was wrong. See this:

typedef long long collected_number;

in libnetdata/storage_number/storage_number.h.

I will remove support for 64-bit unsigned values.

@mfundul

This comment has been minimized.

Copy link
Contributor Author

mfundul commented Oct 29, 2019

Or figure out a way to fix unsigned 64-bit counters, but it's not obvious.

@Ferroin

This comment has been minimized.

Copy link
Collaborator

Ferroin commented Oct 29, 2019

Or figure out a way to fix unsigned 64-bit counters, but it's not obvious.

I think we'd need a new counter type, because there are some things we track that actually do need to be able to have negative values (though usually they're within a signed 16-bit, or even signed 8-bit range).

@mfundul

This comment has been minimized.

Copy link
Contributor Author

mfundul commented Oct 29, 2019

As soon as a signed counter wraps around we have a problem because it will begin to produce valid (as far as netdata is concerned) negative values. It might be simpler at this point to fall back to the old code that only handles unsigned counters if we are to improve the existing code-base without making a significant effort to fix things the proper way.

@Ferroin

This comment has been minimized.

Copy link
Collaborator

Ferroin commented Oct 29, 2019

As soon as a signed counter wraps around we have a problem because it will begin to produce valid (as far as netdata is concerned) negative values. It might be simpler at this point to fall back to the old code that only handles unsigned counters if we are to improve the existing code-base without making a significant effort to fix things the proper way.

Just realized I was misinterpreting things, there should be no counters we track that need to be negative, it's gauges that may need to be negative (assuming all plugins are reporting things sanely).

Assuming we can confirm that, it becomes trivial to reliably identify overflow of signed 32-bit counters, because it's a distinctly different state from reset (IOW, if the counter goes negative when it wasn't close to zero to begin with, it overflowed). No matter what we do though, we can't properly handle overflow of whatever the largest type we have is, which I didn't think of until now either (because we can't store the overflow-corrected value).

@mfundul

This comment has been minimized.

Copy link
Contributor Author

mfundul commented Oct 29, 2019

I added some unit tests and made sure that two's complement for overflowing counters handles signed and unsigned cases. We should be covered even if a physical device overflows into negative values, the delta should always be positive. In the end only the bit width will matter for the incremental algorithm.

@mfundul

This comment has been minimized.

Copy link
Contributor Author

mfundul commented Oct 29, 2019

An example, from the unit test:

delta -9838263505978427529.0000000, rate -9838263505978427529.0000000
       >> dim1 with value -1229782938247303442

here we can see that even though the collector gathered a negative value of an overflown signed 64-bit counter, the database will eventually store a positive delta that is correct as shown here:

expecting value 8608481000000000000.0000000, found 8608481000000000000.0000000, OK
@mfundul mfundul changed the title [WIP] Fix counter reset detection Fix counter reset detection Oct 29, 2019
Copy link
Contributor

thiagoftsm left a comment

After the tests, I could not detect any problem with Netdata with these changes , so I am approving it.

@cosmix
cosmix approved these changes Oct 30, 2019
Copy link
Contributor

cosmix left a comment

Please read comment about using constants or macros to minimize the probability of error in future use of these thresholds.

else if(max > 0x00000000000000FFULL) cap = 0x000000000000FFFFULL;
else cap = 0x00000000000000FFULL;
// Signed values are handled by exploiting two's complement which will produce positive deltas
if (max > 0x00000000FFFFFFFFULL)

This comment has been minimized.

Copy link
@cosmix

cosmix Oct 30, 2019

Contributor

Why not use a const (or, dog forbid a #define) for these to avoid messing up your Fs?

This comment has been minimized.

Copy link
@mfundul

mfundul Oct 30, 2019

Author Contributor

The lazy answer is that I modified the existing code which was this way, but I agree that constants are better, especially if they values are reused.

@mfundul mfundul merged commit 69e5e12 into netdata:master Oct 30, 2019
13 checks passed
13 checks passed
Header rules - netdata No header rules processed
Details
LGTM analysis: JavaScript No code changes detected
Details
LGTM analysis: Python No code changes detected
Details
Pages changed - netdata 3 new files uploaded
Details
Codacy/PR Quality Review Up to standards. A positive pull request.
Details
LGTM analysis: C/C++ No new or fixed alerts
Details
Mixed content - netdata No mixed content detected
Details
Redirect rules - netdata 1 redirect rule processed
Details
Squash environment: netdata Successful in 4.23 minutes - Received a success response
Details
Travis CI - Pull Request Build Passed
Details
WIP Ready for review
Details
license/cla Contributor License Agreement is signed.
Details
netlify/netdata/deploy-preview Deploy preview ready!
Details
Core automation moved this from In progress to Done Oct 30, 2019
@mfundul mfundul deleted the mfundul:fix-counter-reset-detection branch Oct 30, 2019
@cakrit cakrit removed this from the v1.19-Sprint2 milestone Nov 7, 2019
@cakrit cakrit added this to the v1.19-Sprint2 milestone Nov 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Core
  
Done
5 participants
You can’t perform that action at this time.