Fix spanmetrics latency underflow, prevent partial histogram scrape #3412
Conversation
Thinking about it some more - the lack of mutex lock might also explain situations where

```go
_, err = appender.Append(0, lb.Labels(), timeMs, s.count.Load())
...
for i, bucketLabel := range h.bucketLabels {
	ref, err := appender.Append(0, lb.Labels(), timeMs, s.buckets[i].Load())
```
Nice find!
The performance impact from the lock shouldn't be too bad. We collect samples only once every 15s to 60s. While we are collecting a metric it would lock out ingest for that metric. Since we have a small number of metrics (each processor has a handful) this can lead to contention, but collecting the data itself is pretty fast so...
Some thoughts (maybe for a next refactor):
- I think we can drop the atomic types in `histogramSeries` now; using regular data types might be better for memory.
- Instead of locking the entire `series` map, we can keep the `RWMutex` on `series` and keep a lock on each `histogramSeries`. Whenever you modify or read a series you would lock that series only. This would be a more granular approach to avoiding the partial reads/writes but requires more lock management (see the sketch after this list).
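For illustration, here is a rough Go sketch of that more granular layout. The type and field names are hypothetical, not the actual Tempo code: an `RWMutex` guards the `series` map, and each `histogramSeries` carries its own mutex so updates and reads only contend on the series they touch.

```go
package spanmetrics

import "sync"

type histogramSeries struct {
	mtx     sync.Mutex // guards count, sum and buckets together
	count   uint64
	sum     float64
	buckets []uint64 // cumulative; the last entry is the +Inf bucket
}

type histogram struct {
	seriesMtx sync.RWMutex // guards the map itself
	series    map[uint64]*histogramSeries
}

func (h *histogram) observe(hash uint64, latencySeconds float64, bucketIdx int) {
	h.seriesMtx.RLock()
	s, ok := h.series[hash]
	h.seriesMtx.RUnlock()
	if !ok {
		return // series creation elided for brevity
	}

	// Only this one series is locked while it is updated or scraped.
	s.mtx.Lock()
	s.count++
	s.sum += latencySeconds
	for i := bucketIdx; i < len(s.buckets); i++ {
		s.buckets[i]++
	}
	s.mtx.Unlock()
}
```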
I've run this change in our largest cluster and didn't see significant changes in resource usage or latency. I'm not 100% sure yet that this fixes our correctness issues; that will require some further investigation. I'm good with merging this though 👍
What this PR does:
Fixes 2 small bugs in the span metrics processor.
As mentioned in #3338, the latency calculation will be incorrect if a span's End is < Start. However, it doesn't lead to a negative float64: because start and end are uint64, the calculation underflows to a very large uint64 before the float64 conversion. For example, `float64(uint64(-1))` is `1.8446744073709552e+19`. I'm not sure of the side effects from this all the way to the metrics backend, but in addition to the bad value, the counter also stops increasing: at that magnitude it no longer has enough precision to capture increments of +1.0. Rather than ignore these spans, they are treated as zero latency. This ensures that `_count` and the `le` buckets are still incremented.
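To illustrate the underflow with a standalone snippet (not taken from the processor): subtracting two uint64 values wraps around before the float64 conversion, and clamping to zero avoids the bogus sample.

```go
package main

import "fmt"

func main() {
	var start, end uint64 = 100, 90 // end < start, e.g. due to clock skew

	// Unsigned subtraction wraps around, so the float64 conversion yields a
	// huge positive value rather than a negative one.
	bad := float64(end - start)
	fmt.Println(bad) // 1.8446744073709552e+19

	// Treating these spans as zero latency keeps _count and the le buckets consistent.
	var latency float64
	if end > start {
		latency = float64(end - start)
	}
	fmt.Println(latency) // 0
}
```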
The histogram counters are updated outside of the lock. Therefore it's possible for a histogram to be scraped with inconsistent values, for example if the histogram is scraped after incrementing `_count` but before incrementing `le="+Inf"`. In all cases these metrics should be identical. I addressed this by doing `updateSeries` under lock and reverting to `sync.Mutex` instead of `RWMutex`. This has performance implications, but I'm not sure of the degree in practice. If we really need to avoid the lock, we could make it more "failsafe" by changing the order of increments in `updateSeries`, i.e. update the buckets in reverse order so `+Inf` is always highest even in a partial scrape. Happy to discuss more. A minimal sketch of the locked update is shown below.
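This sketch uses hypothetical names rather than the real registry code: both the update and the collect/scrape path take the same `sync.Mutex`, so a scrape can never observe `_count` ahead of the `le="+Inf"` bucket.

```go
package spanmetrics

import "sync"

type histogramSeries struct {
	count   uint64
	sum     float64
	buckets []uint64 // cumulative; the last entry is the +Inf bucket
}

type histogram struct {
	mtx    sync.Mutex // guards series and every histogramSeries in it
	series map[uint64]*histogramSeries
}

// updateSeries increments count, sum and the cumulative buckets as one
// atomic step with respect to collect.
func (h *histogram) updateSeries(hash uint64, latencySeconds float64, bucketIdx int) {
	h.mtx.Lock()
	defer h.mtx.Unlock()

	s, ok := h.series[hash]
	if !ok {
		return // series creation elided for brevity
	}
	s.count++
	s.sum += latencySeconds
	for i := bucketIdx; i < len(s.buckets); i++ {
		s.buckets[i]++
	}
}

// collect copies each series under the same lock, so the exported sample is
// internally consistent.
func (h *histogram) collect(emit func(count uint64, sum float64, buckets []uint64)) {
	h.mtx.Lock()
	defer h.mtx.Unlock()

	for _, s := range h.series {
		buckets := append([]uint64(nil), s.buckets...)
		emit(s.count, s.sum, buckets)
	}
}
```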
Which issue(s) this PR fixes:
Relates to #3338, but I don't think it fixes it.
Checklist
- CHANGELOG.md updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`