Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix counter reset detection #7220

Merged
merged 4 commits into from
Oct 30, 2019
Merged

Fix counter reset detection #7220

merged 4 commits into from
Oct 30, 2019

Conversation

mfundul
Copy link
Contributor

@mfundul mfundul commented Oct 29, 2019

Summary

Fixes #4962
Fixes #5297
Fixes #7217

Component Name

daemon
database

Additional Information

This is the first step to fix some of the problems discussed in #7049 .

It is theoretically impossible to always be 100% sure about things such as hardware counters that overflow. As a result it is theoretically impossible to both solve the counter reset and the counter overflow prroblem.

This implementation attempts to solve both issues by mostly erring on the safe side as if it was a counter reset.

8-bit and 16-bit or any arbitrary precision counters are not supported as it would be too much effort.

We attempt to identify most of the wrap-arounds without creating too many false negatives for counter resets.

@mfundul mfundul added this to the v1.19-Sprint2 milestone Oct 29, 2019
@mfundul mfundul added this to In progress in Core via automation Oct 29, 2019
@squash-labs
Copy link

squash-labs bot commented Oct 29, 2019

Manage this branch in Squash

Test this branch here: https://mfundulfix-counter-reset-detec-fqlqb.squash.io

database/rrdset.c Outdated Show resolved Hide resolved
@mfundul mfundul changed the title [WIP] Fix counter reset detection Fix counter reset detection Oct 29, 2019
@mfundul mfundul changed the title Fix counter reset detection [WIP] Fix counter reset detection Oct 29, 2019
@mfundul
Copy link
Contributor Author

mfundul commented Oct 29, 2019

Actually, netdata only supports up to 64-bit signed values, so the existing code was wrong. See this:

typedef long long collected_number;

in libnetdata/storage_number/storage_number.h.

I will remove support for 64-bit unsigned values.

@mfundul
Copy link
Contributor Author

mfundul commented Oct 29, 2019

Or figure out a way to fix unsigned 64-bit counters, but it's not obvious.

@Ferroin
Copy link
Member

Ferroin commented Oct 29, 2019

Or figure out a way to fix unsigned 64-bit counters, but it's not obvious.

I think we'd need a new counter type, because there are some things we track that actually do need to be able to have negative values (though usually they're within a signed 16-bit, or even signed 8-bit range).

@mfundul
Copy link
Contributor Author

mfundul commented Oct 29, 2019

As soon as a signed counter wraps around we have a problem because it will begin to produce valid (as far as netdata is concerned) negative values. It might be simpler at this point to fall back to the old code that only handles unsigned counters if we are to improve the existing code-base without making a significant effort to fix things the proper way.

@Ferroin
Copy link
Member

Ferroin commented Oct 29, 2019

As soon as a signed counter wraps around we have a problem because it will begin to produce valid (as far as netdata is concerned) negative values. It might be simpler at this point to fall back to the old code that only handles unsigned counters if we are to improve the existing code-base without making a significant effort to fix things the proper way.

Just realized I was misinterpreting things, there should be no counters we track that need to be negative, it's gauges that may need to be negative (assuming all plugins are reporting things sanely).

Assuming we can confirm that, it becomes trivial to reliably identify overflow of signed 32-bit counters, because it's a distinctly different state from reset (IOW, if the counter goes negative when it wasn't close to zero to begin with, it overflowed). No matter what we do though, we can't properly handle overflow of whatever the largest type we have is, which I didn't think of until now either (because we can't store the overflow-corrected value).

@mfundul
Copy link
Contributor Author

mfundul commented Oct 29, 2019

I added some unit tests and made sure that two's complement for overflowing counters handles signed and unsigned cases. We should be covered even if a physical device overflows into negative values, the delta should always be positive. In the end only the bit width will matter for the incremental algorithm.

@mfundul
Copy link
Contributor Author

mfundul commented Oct 29, 2019

An example, from the unit test:

delta -9838263505978427529.0000000, rate -9838263505978427529.0000000
       >> dim1 with value -1229782938247303442

here we can see that even though the collector gathered a negative value of an overflown signed 64-bit counter, the database will eventually store a positive delta that is correct as shown here:

expecting value 8608481000000000000.0000000, found 8608481000000000000.0000000, OK

@mfundul mfundul changed the title [WIP] Fix counter reset detection Fix counter reset detection Oct 29, 2019
Copy link
Contributor

@thiagoftsm thiagoftsm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the tests, I could not detect any problem with Netdata with these changes , so I am approving it.

Copy link
Contributor

@cosmix cosmix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please read comment about using constants or macros to minimize the probability of error in future use of these thresholds.

else if(max > 0x00000000000000FFULL) cap = 0x000000000000FFFFULL;
else cap = 0x00000000000000FFULL;
// Signed values are handled by exploiting two's complement which will produce positive deltas
if (max > 0x00000000FFFFFFFFULL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use a const (or, dog forbid a #define) for these to avoid messing up your Fs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lazy answer is that I modified the existing code which was this way, but I agree that constants are better, especially if they values are reused.

@mfundul mfundul merged commit 69e5e12 into netdata:master Oct 30, 2019
Core automation moved this from In progress to Done Oct 30, 2019
@mfundul mfundul deleted the fix-counter-reset-detection branch October 30, 2019 15:15
@cakrit cakrit modified the milestone: v1.19-Sprint2 Nov 7, 2019
jackyhuang85 pushed a commit to jackyhuang85/netdata that referenced this pull request Jan 1, 2020
* Removed support for 16-bit and 8-bit counter overflow

* Improve behaviour of counter overflow detection versus counter resets.

* Added support for signed 32-bit and 64-bit limits for counter overflows.

* Fixed signed incremental counter issues and added unit tests.
Saruspete pushed a commit to Saruspete/netdata that referenced this pull request May 21, 2020
* Removed support for 16-bit and 8-bit counter overflow

* Improve behaviour of counter overflow detection versus counter resets.

* Added support for signed 32-bit and 64-bit limits for counter overflows.

* Fixed signed incremental counter issues and added unit tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Core
  
Done
5 participants