Fix counter reset detection #7220

mfundul · 2019-10-29T13:33:23Z

Summary

Component Name

daemon
database

Additional Information

This is the first step to fix some of the problems discussed in #7049.

It is impossible to always be 100% sure whether a hardware counter has overflowed. As a result, it is theoretically impossible to solve both the counter reset and the counter overflow problems.

This implementation attempts to handle both cases by mostly erring on the safe side, that is, by treating an ambiguous decrease as a counter reset.

8-bit, 16-bit, and arbitrary-precision counters are not supported, as that would take too much effort.

We attempt to identify most of the wrap-arounds without creating too many false negatives for counter resets.
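To make the trade-off concrete, here is a minimal sketch of one way to classify a decreasing counter as either a wrap-around or a reset. It is not the code from this PR: the names `classify_decrease`, `counter_step`, and `MAX_WRAP_PERCENT`, as well as the 10% threshold, are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names and threshold, for illustration only. */
#define CAP_32BIT        0x00000000FFFFFFFFULL
#define CAP_64BIT        0xFFFFFFFFFFFFFFFFULL
#define MAX_WRAP_PERCENT 10 /* assumed: largest jump accepted as a wrap-around */

typedef struct {
    bool     is_reset; /* true when the decrease is treated as a reset */
    uint64_t delta;    /* increment to record; 0 on reset */
} counter_step;

/* Called only when the new value is lower than the previous one. */
static counter_step classify_decrease(uint64_t last, uint64_t now, uint64_t max_seen)
{
    /* Guess the counter width from the largest value observed so far. */
    uint64_t cap = (max_seen > CAP_32BIT) ? CAP_64BIT : CAP_32BIT;

    /* Distance travelled if the counter wrapped, modulo the counter width. */
    uint64_t wrapped = (now - last) & cap;

    /* A wrap-around should produce a small increment; anything larger is
     * more likely a reset, so err on the side of a reset. */
    uint64_t limit = (cap / 100) * MAX_WRAP_PERCENT;

    counter_step step;
    if (wrapped < limit) {
        step.is_reset = false;
        step.delta = wrapped;
    } else {
        step.is_reset = true;
        step.delta = 0;
    }
    return step;
}
```

With these assumptions, a 32-bit counter going from 0xFFFFFFF0 to 0x10 is classified as a wrap-around with a delta of 32, while a counter dropping from 1000000000 to 5 is classified as a reset. A reset that happens shortly before the cap would still be misread as a wrap-around, which is the ambiguity described above.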

squash-labs · 2019-10-29T13:33:29Z

Manage this branch in Squash

Test this branch here: https://mfundulfix-counter-reset-detec-fqlqb.squash.io

database/rrdset.c

mfundul · 2019-10-29T16:45:08Z

Actually, netdata only supports up to 64-bit signed values, so the existing code was wrong. See this:

typedef long long collected_number;

in libnetdata/storage_number/storage_number.h.

I will remove support for 64-bit unsigned values.

mfundul · 2019-10-29T16:47:52Z

Or figure out a way to fix unsigned 64-bit counters, but it's not obvious.

Ferroin · 2019-10-29T16:51:45Z

> Or figure out a way to fix unsigned 64-bit counters, but it's not obvious.

I think we'd need a new counter type, because there are some things we track that actually do need to be able to have negative values (though usually they're within a signed 16-bit, or even signed 8-bit range).

mfundul · 2019-10-29T17:02:53Z

As soon as a signed counter wraps around we have a problem because it will begin to produce valid (as far as netdata is concerned) negative values. It might be simpler at this point to fall back to the old code that only handles unsigned counters if we are to improve the existing code-base without making a significant effort to fix things the proper way.

Ferroin · 2019-10-29T17:36:44Z

> As soon as a signed counter wraps around we have a problem because it will begin to produce valid (as far as netdata is concerned) negative values. It might be simpler at this point to fall back to the old code that only handles unsigned counters if we are to improve the existing code-base without making a significant effort to fix things the proper way.

Just realized I was misinterpreting things, there should be no counters we track that need to be negative, it's gauges that may need to be negative (assuming all plugins are reporting things sanely).

Assuming we can confirm that, it becomes trivial to reliably identify overflow of signed 32-bit counters, because it's a distinctly different state from reset (IOW, if the counter goes negative when it wasn't close to zero to begin with, it overflowed). No matter what we do though, we can't properly handle overflow of whatever the largest type we have is, which I didn't think of until now either (because we can't store the overflow-corrected value).
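The heuristic described here can be sketched as follows, under the stated assumption that no counter is ever legitimately negative. The function names and the `NEAR_ZERO_LIMIT` value are hypothetical and not taken from netdata.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: how small the previous value must be to count as "close to zero". */
#define NEAR_ZERO_LIMIT 1000

/* A signed 32-bit counter that turns negative while its previous value was
 * far from zero is assumed to have overflowed rather than reset. */
static bool looks_like_signed32_overflow(int32_t last, int32_t now)
{
    return now < 0 && last > NEAR_ZERO_LIMIT;
}

/* Increment across the overflow, computed modulo 2^32 in unsigned arithmetic
 * so that the result is always non-negative. */
static uint32_t signed32_wrap_delta(int32_t last, int32_t now)
{
    return (uint32_t)now - (uint32_t)last;
}
```

For example, a counter moving from `INT32_MAX - 5` to `INT32_MIN + 4` is flagged as an overflow with a delta of 10, whereas a drop from 500000 to 3 is not flagged and would be handled as a reset.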

mfundul · 2019-10-29T17:47:35Z

I added some unit tests and made sure that two's complement arithmetic for overflowing counters handles both the signed and unsigned cases. We should be covered even if a physical device overflows into negative values: the delta should always be positive. In the end, only the bit width will matter for the incremental algorithm.

mfundul · 2019-10-29T17:52:00Z

An example, from the unit test:

delta -9838263505978427529.0000000, rate -9838263505978427529.0000000
       >> dim1 with value -1229782938247303442

Here we can see that even though the collector gathered a negative value from an overflowed signed 64-bit counter, the database will eventually store a correct positive delta, as shown here:

expecting value 8608481000000000000.0000000, found 8608481000000000000.0000000, OK
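The arithmetic behind this output can be reproduced with a short sketch. The previous sample is not printed in the output; the value 8608480567731124087 (0x7777777777777777) used below is inferred from the printed value and delta. The function name `incremental_delta_64` is hypothetical.

```c
#include <stdint.h>

typedef long long collected_number; /* as in libnetdata/storage_number/storage_number.h */

/* Reinterpret both samples as unsigned 64-bit values and subtract.
 * Unsigned subtraction is modulo 2^64, so an overflow into negative
 * territory still yields the positive distance travelled. */
static uint64_t incremental_delta_64(collected_number last, collected_number now)
{
    return (uint64_t)now - (uint64_t)last;
}
```

Calling `incremental_delta_64(8608480567731124087LL, -1229782938247303442LL)` gives 8608480567731124087, which agrees with the expected value in the test output to seven significant digits.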

thiagoftsm

After testing, I could not detect any problems in Netdata with these changes, so I am approving it.

cosmix

Please read my comment about using constants or macros to minimize the probability of error in future use of these thresholds.

cosmix · 2019-10-30T13:31:53Z

database/rrdset.c

-                    else if(max > 0x00000000000000FFULL) cap = 0x000000000000FFFFULL;
-                    else                                 cap = 0x00000000000000FFULL;
+                    // Signed values are handled by exploiting two's complement which will produce positive deltas
+                    if (max > 0x00000000FFFFFFFFULL)


Why not use a const (or, dog forbid a #define) for these to avoid messing up your Fs?

The lazy answer is that I modified the existing code, which was already written this way, but I agree that constants are better, especially if the values are reused.
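As a sketch of what the suggestion could look like, the thresholds can be named once and reused. The identifiers `COUNTER_CAP_32BIT`, `COUNTER_CAP_64BIT`, and `counter_cap_for` are hypothetical and do not come from the merged code.

```c
#include <stdint.h>

/* Hypothetical names; defining each cap once avoids miscounting the Fs. */
static const uint64_t COUNTER_CAP_32BIT = UINT32_MAX;
static const uint64_t COUNTER_CAP_64BIT = UINT64_MAX;

/* Pick the cap from the largest value the counter has been seen to reach. */
static uint64_t counter_cap_for(uint64_t max)
{
    return (max > COUNTER_CAP_32BIT) ? COUNTER_CAP_64BIT : COUNTER_CAP_32BIT;
}
```

Using `UINT32_MAX` and `UINT64_MAX` from `<stdint.h>` also removes the need to spell out the hexadecimal literals at all.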

* Removed support for 16-bit and 8-bit counter overflow
* Improve behaviour of counter overflow detection versus counter resets.
* Added support for signed 32-bit and 64-bit limits for counter overflows.
* Fixed signed incremental counter issues and added unit tests.

mfundul added 2 commits October 29, 2019 12:48

Removed support for 16-bit and 8-bit counter overflow

f7be06e

Improve behaviour of counter overflow detection versus counter resets.

be738bb

mfundul added area/daemon area/database labels Oct 29, 2019

mfundul added this to the v1.19-Sprint2 milestone Oct 29, 2019

mfundul requested review from cakrit, cosmix and thiagoftsm as code owners October 29, 2019 13:33

mfundul added this to In progress in Core via automation Oct 29, 2019

Ferroin reviewed Oct 29, 2019

Added support for signed 32-bit and 64-bit limits for counter overflows.

b72fa59

mfundul changed the title ~~[WIP] Fix counter reset detection~~ Fix counter reset detection Oct 29, 2019

mfundul changed the title ~~Fix counter reset detection~~ [WIP] Fix counter reset detection Oct 29, 2019

Fixed signed incremental counter issues and added unit tests.

7b54ef2

mfundul changed the title ~~[WIP] Fix counter reset detection~~ Fix counter reset detection Oct 29, 2019

thiagoftsm approved these changes Oct 29, 2019

cosmix approved these changes Oct 30, 2019

mfundul merged commit 69e5e12 into netdata:master Oct 30, 2019

Core automation moved this from In progress to Done Oct 30, 2019

mfundul deleted the fix-counter-reset-detection branch October 30, 2019 15:15

cakrit modified the milestone: v1.19-Sprint2 Nov 7, 2019

ilyam8 mentioned this pull request Dec 10, 2019

flapping 1m_received_traffic_overflow and 1m_sent_traffic_overflow #6077

Closed
