
Fix spanmetrics latency underflow, prevent partial histogram scrape #3412

Merged
3 commits merged into grafana:main on Feb 22, 2024

Conversation

mdisibio (Contributor)

What this PR does:
Fixes 2 small bugs in the span metrics processor.

  1. Latency underflow

As mentioned in #3338, the latency calculation is incorrect when a span's End is less than its Start. It doesn't produce a negative float64, though: because start and end are uint64, the subtraction underflows to a very large uint64 before the float64 conversion; for example, float64(^uint64(0)) is 1.8446744073709552e+19. I'm not sure of all the side effects this has downstream in the metrics backend, but in addition to the bad value, the counter also stops increasing, because at that magnitude a float64 no longer has enough precision to capture increments of +1.0.

Rather than ignoring these spans, they are now treated as zero latency. This ensures that _count and the le buckets are still incremented.
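
The snippet below is a minimal, self-contained sketch of the issue and the fix, not the actual Tempo code; the function name and the division by 1e9 are illustrative assumptions.

package main

import "fmt"

// spanLatencySeconds converts start/end nanosecond timestamps to seconds.
func spanLatencySeconds(startNanos, endNanos uint64) float64 {
	// Buggy form: float64(endNanos-startNanos) / 1e9
	// When endNanos < startNanos the uint64 subtraction wraps around, e.g.
	// float64(^uint64(0)) == 1.8446744073709552e+19, and at that magnitude a
	// float64 can no longer represent +1.0 increments.

	// Fix: treat a negative duration as zero latency so _count and the
	// le buckets are still incremented for the span.
	if endNanos < startNanos {
		return 0
	}
	return float64(endNanos-startNanos) / 1e9
}

func main() {
	fmt.Println(spanLatencySeconds(200, 100)) // prints 0 instead of ~1.84e+10
}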

  2. Partial histogram scrape

The histogram counters are updated outside of the lock, so it's possible for a histogram to be scraped with inconsistent values, for example if it's scraped after _count has been incremented but before le="+Inf" has; in a consistent scrape those two values are always identical. I addressed this by doing updateSeries under the lock and reverting to sync.Mutex instead of RWMutex. This has performance implications, but I'm not sure of the degree in practice. If we really need to avoid the lock, we could make updateSeries more failsafe by changing the order of its increments, i.e. updating the buckets in reverse order so +Inf is always the highest even in a partial scrape. Happy to discuss more.
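
To make the intended invariant concrete, here is a minimal sketch of the locking approach, assuming a series type loosely shaped like the registry's histogramSeries (names and fields are illustrative, not the actual Tempo types): the observe path and the scrape path take the same mutex, so a scrape can never see _count without the matching bucket increments.

package sketch

import "sync"

// histogramSeries is an illustrative stand-in for one labeled series.
type histogramSeries struct {
	mtx     sync.Mutex
	count   uint64
	sum     float64
	buckets []uint64 // cumulative counts, one per bound; the last is le="+Inf"
}

// observe records a value entirely under the lock.
func (s *histogramSeries) observe(value float64, bounds []float64) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.count++
	s.sum += value
	for i, bound := range bounds {
		if value <= bound {
			s.buckets[i]++
		}
	}
	s.buckets[len(s.buckets)-1]++ // le="+Inf" counts every observation
}

// snapshot copies the counters under the same lock, so a scrape always sees
// le="+Inf" equal to _count.
func (s *histogramSeries) snapshot() (count uint64, sum float64, buckets []uint64) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	return s.count, s.sum, append([]uint64(nil), s.buckets...)
}

Taking the same mutex on both paths trades some ingest throughput for a scrape that is always internally consistent, which is the trade-off this PR makes.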

Which issue(s) this PR fixes:
Relates to #3338, but I don't think it fully fixes it.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@mdisibio (Contributor, Author)

Thinking about it some more - the lack of a mutex lock might also explain situations where _count is less than le="+Inf". If the histogram is incremented in between these two lines https://github.com/grafana/tempo/blob/main/modules/generator/registry/histogram.go#L192-L202 then it's possible for the +Inf bucket to be greater than _count. If appender.Append is performing I/O and pausing the collection goroutine, the likelihood increases.

// _count is read and appended first ...
_, err = appender.Append(0, lb.Labels(), timeMs, s.count.Load())

...

// ... and the le buckets are read and appended afterwards. If another
// goroutine observes a value in between, the buckets (including +Inf) can
// end up greater than the _count that was already written.
for i, bucketLabel := range h.bucketLabels {
	ref, err := appender.Append(0, lb.Labels(), timeMs, s.buckets[i].Load())

@yvrhdn yvrhdn (Member) left a comment:

Nice find!

The performance impact from the lock shouldn't be too bad. We collect samples only once every 15s to 60s. While we are collecting a metric it would lock out ingest for that metric. Since we have a small number of metrics (each processor has a handful) this can lead to contention, but collecting the data itself is pretty fast, so it should be acceptable.

Some thoughts (maybe for a next refactor):

  • I think we can drop the atomic types in histogramSeries now; using regular data types might be better for memory
  • Instead of locking the entire series map, we could keep the RWMutex on series and add a lock to each histogramSeries. Whenever you modify or read a series you would lock that series only (roughly as sketched below). This would be a more granular way to avoid the partial reads/writes, but it requires more lock management.
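
A rough sketch of that more granular scheme, with illustrative type and field names rather than the actual registry types: an RWMutex guards the series map, and each series carries its own mutex guarding its counters.

package sketch

import "sync"

type series struct {
	mtx     sync.Mutex
	count   uint64
	buckets []uint64 // cumulative counts; the last entry is le="+Inf"
}

type histogram struct {
	seriesMtx sync.RWMutex
	series    map[uint64]*series
}

// observe locks only the one series being updated, so ingest and collection
// of other series can proceed concurrently.
func (h *histogram) observe(hash uint64, bucketIdx int) {
	h.seriesMtx.RLock()
	s, ok := h.series[hash]
	h.seriesMtx.RUnlock()
	if !ok {
		return // creating a new series (under the write lock) is omitted here
	}

	s.mtx.Lock()
	s.count++
	for i := bucketIdx; i < len(s.buckets); i++ {
		s.buckets[i]++ // increment bucketIdx through +Inf
	}
	s.mtx.Unlock()
}

This avoids serializing unrelated series behind a single mutex, at the cost of the extra lock management mentioned above.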

@yvrhdn yvrhdn (Member) left a comment:

I've run this change in our largest cluster and didn't see significant changes in resource usage or latency. I'm not 100% sure yet that this fixes our correctness issues; that will require some further investigation. I'm good with merging this though 👍

@yvrhdn yvrhdn merged commit 1c125aa into grafana:main Feb 22, 2024
14 checks passed
@yvrhdn yvrhdn deleted the iss-3338 branch February 22, 2024 18:41
yvrhdn pushed a commit to yvrhdn/tempo that referenced this pull request on Feb 26, 2024:

Fix spanmetrics latency underflow, prevent partial histogram scrape (grafana#3412)

* Fix spanmetrics latency underflow, prevent partial histogram scrape
* changelog