
Fix spanmetrics latency underflow, prevent partial histogram scrape #3412

Merged
3 commits merged into grafana:main on Feb 22, 2024

Conversation

mdisibio (Contributor)

What this PR does:
Fixes 2 small bugs in the span metrics processor.

  1. Latency underflow

As mentioned in #3338, the latency calculation is incorrect when a span's End is less than its Start. It doesn't produce a negative float64, though: because start and end are uint64, the subtraction underflows to a very large uint64 before the float64 conversion; for example, float64(^uint64(0)) is 1.8446744073709552e+19. I'm not sure of all the side effects this has downstream in the metrics backend, but in addition to the bad value, the counter also stops increasing, because at that magnitude a float64 no longer has enough precision to capture increments of +1.0.

Rather than ignoring these spans, they are now treated as zero latency. This ensures that _count and the le buckets are still incremented.
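
The snippet below is a minimal, self-contained sketch of the issue and the fix, not the actual Tempo code; the function name and the division by 1e9 are illustrative assumptions.

package main

import "fmt"

// spanLatencySeconds converts start/end nanosecond timestamps to seconds.
func spanLatencySeconds(startNanos, endNanos uint64) float64 {
	// Buggy form: float64(endNanos-startNanos) / 1e9
	// When endNanos < startNanos the uint64 subtraction wraps around, e.g.
	// float64(^uint64(0)) == 1.8446744073709552e+19, and at that magnitude a
	// float64 can no longer represent +1.0 increments.

	// Fix: treat a negative duration as zero latency so _count and the
	// le buckets are still incremented for the span.
	if endNanos < startNanos {
		return 0
	}
	return float64(endNanos-startNanos) / 1e9
}

func main() {
	fmt.Println(spanLatencySeconds(200, 100)) // prints 0 instead of ~1.84e+10
}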

  2. Partial histogram scrape

The histogram counters are updated outside of the lock, so it's possible for a histogram to be scraped with inconsistent values, for example if it's scraped after _count has been incremented but before le="+Inf" has; in a consistent scrape those two values are always identical. I addressed this by doing updateSeries under the lock and reverting to sync.Mutex instead of RWMutex. This has performance implications, but I'm not sure of the degree in practice. If we really need to avoid the lock, we could make updateSeries more failsafe by changing the order of its increments, i.e. updating the buckets in reverse order so +Inf is always the highest even in a partial scrape. Happy to discuss more.
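
To make the intended invariant concrete, here is a minimal sketch of the locking approach, assuming a series type loosely shaped like the registry's histogramSeries (names and fields are illustrative, not the actual Tempo types): the observe path and the scrape path take the same mutex, so a scrape can never see _count without the matching bucket increments.

package sketch

import "sync"

// histogramSeries is an illustrative stand-in for one labeled series.
type histogramSeries struct {
	mtx     sync.Mutex
	count   uint64
	sum     float64
	buckets []uint64 // cumulative counts, one per bound; the last is le="+Inf"
}

// observe records a value entirely under the lock.
func (s *histogramSeries) observe(value float64, bounds []float64) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.count++
	s.sum += value
	for i, bound := range bounds {
		if value <= bound {
			s.buckets[i]++
		}
	}
	s.buckets[len(s.buckets)-1]++ // le="+Inf" counts every observation
}

// snapshot copies the counters under the same lock, so a scrape always sees
// le="+Inf" equal to _count.
func (s *histogramSeries) snapshot() (count uint64, sum float64, buckets []uint64) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	return s.count, s.sum, append([]uint64(nil), s.buckets...)
}

Taking the same mutex on both paths trades some ingest throughput for a scrape that is always internally consistent, which is the trade-off this PR makes.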

Which issue(s) this PR fixes:
Relates to #3338, but I don't think it fully fixes it.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@mdisibio (Contributor, Author)

Thinking about it some more - the lack of a mutex lock might also explain situations where _count is less than le="+Inf". If the histogram is incremented in between these two lines https://github.com/grafana/tempo/blob/main/modules/generator/registry/histogram.go#L192-L202 then it's possible for the +Inf bucket to be greater than _count. If appender.Append is performing I/O and pausing the collection goroutine, the likelihood increases.

// _count is read and appended first ...
_, err = appender.Append(0, lb.Labels(), timeMs, s.count.Load())

...

// ... and the le buckets are read and appended afterwards. If another
// goroutine observes a value in between, the buckets (including +Inf) can
// end up greater than the _count that was already written.
for i, bucketLabel := range h.bucketLabels {
	ref, err := appender.Append(0, lb.Labels(), timeMs, s.buckets[i].Load())

@yvrhdn yvrhdn (Member) left a comment:

Nice find!

The performance impact from the lock shouldn't be too bad. We collect samples only once every 15s to 60s. While we are collecting a metric it would lock out ingest for that metric. Since we have a small number of metrics (each processor has a handful) this can lead to contention, but collecting the data itself is pretty fast, so it should be acceptable.

Some thoughts (maybe for a next refactor):

  • I think we can drop the atomic types in histogramSeries now; using regular data types might be better for memory
  • Instead of locking the entire series map, we could keep the RWMutex on series and add a lock to each histogramSeries. Whenever you modify or read a series you would lock that series only (roughly as sketched below). This would be a more granular way to avoid the partial reads/writes, but it requires more lock management.
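
A rough sketch of that more granular scheme, with illustrative type and field names rather than the actual registry types: an RWMutex guards the series map, and each series carries its own mutex guarding its counters.

package sketch

import "sync"

type series struct {
	mtx     sync.Mutex
	count   uint64
	buckets []uint64 // cumulative counts; the last entry is le="+Inf"
}

type histogram struct {
	seriesMtx sync.RWMutex
	series    map[uint64]*series
}

// observe locks only the one series being updated, so ingest and collection
// of other series can proceed concurrently.
func (h *histogram) observe(hash uint64, bucketIdx int) {
	h.seriesMtx.RLock()
	s, ok := h.series[hash]
	h.seriesMtx.RUnlock()
	if !ok {
		return // creating a new series (under the write lock) is omitted here
	}

	s.mtx.Lock()
	s.count++
	for i := bucketIdx; i < len(s.buckets); i++ {
		s.buckets[i]++ // increment bucketIdx through +Inf
	}
	s.mtx.Unlock()
}

This avoids serializing unrelated series behind a single mutex, at the cost of the extra lock management mentioned above.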

@yvrhdn yvrhdn (Member) left a comment:

I've run this change in our largest cluster and didn't see significant changes in resource usage or latency. I'm not 100% sure yet that this fixes our correctness issues; that will require some further investigation. I'm good with merging this though 👍

@yvrhdn yvrhdn merged commit 1c125aa into grafana:main Feb 22, 2024
14 checks passed
@yvrhdn yvrhdn deleted the iss-3338 branch February 22, 2024 18:41
yvrhdn pushed a commit to yvrhdn/tempo that referenced this pull request on Feb 26, 2024:

Fix spanmetrics latency underflow, prevent partial histogram scrape (grafana#3412)

* Fix spanmetrics latency underflow, prevent partial histogram scrape
* changelog