
Use GaugeHistogram for Prometheus LongTaskTimer #4988

Closed
mhalbritter opened this issue Apr 29, 2024 · 3 comments
Labels
bug (A general bug), registry: prometheus (A Prometheus Registry related issue)
Milestone
1.13.0

mhalbritter commented Apr 29, 2024

Original title:

Enabling distribution percentiles for Prometheus 1.x leads to "Counts in ClassicHistogramBuckets cannot be negative"

See the bug reported against Spring Boot: spring-projects/spring-boot#40552

When debugging this, the counts list in io.micrometer.prometheusmetrics.PrometheusMeterRegistry#addDistributionStatisticSamples, which is passed to ClassicHistogramBuckets.of(buckets, counts), contains -1 as its last element.

The code enters the condition if (Double.isFinite(histogramCounts[histogramCounts.length - 1].bucket())) {, which executes

counts.add(count - histogramCounts[histogramCounts.length - 1].count());

count is 1, while histogramCounts[histogramCounts.length - 1].count() increases with every request, which leads to a negative number.
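To make the failing arithmetic concrete, here is a minimal sketch with hypothetical values (the names mirror the variables above; this is not the actual Micrometer source):

// count is the LongTaskTimer's active task count; it stays at 1 for a single in-flight request
long count = 1;
// the snapshotted count of the last finite bucket keeps growing with every scrape/request
long lastFiniteBucketCount = 2; // histogramCounts[histogramCounts.length - 1].count()
// the +Inf bucket is derived as the difference, which goes negative as soon as the
// finite bucket count exceeds the active task count:
long infBucketCount = count - lastFiniteBucketCount; // -1 -> "Counts in ClassicHistogramBuckets cannot be negative"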

Version: micrometer-registry-prometheus 1.13.0-RC1.

@jonatan-ivanov (Member)

Thanks @snicoll for catching this and for the reproducer. Thanks @mhalbritter for the issue and for further investigating this.

Workaround

It seems the issue happens when a LongTaskTimer (with a histogram enabled) is in progress while scrape is called. If you disable the histogram for the LongTaskTimer, @snicoll's app should work:

management.metrics.distribution.percentiles-histogram.http.server.requests=true
management.metrics.distribution.percentiles-histogram.http.server.requests.active=false
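Outside of Boot, a similar effect can be achieved with a MeterFilter. This is only a sketch, and it assumes the LongTaskTimer is registered under the name http.server.requests.active (the exact meter name may differ in your setup); imports from io.micrometer.core.instrument are omitted, as in the reproducers below:

registry.config().meterFilter(new MeterFilter() {
    @Override
    public DistributionStatisticConfig configure(Meter.Id id, DistributionStatisticConfig config) {
        if ("http.server.requests.active".equals(id.getName())) {
            // disable the percentile histogram only for the LongTaskTimer
            return DistributionStatisticConfig.builder()
                .percentilesHistogram(false)
                .build()
                .merge(config);
        }
        return config;
    }
});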

Micrometer Reproducer

I was able to come up with a reproducer without Boot:

@Test
void lttWithHistogram() {
    LongTaskTimer longTaskTimer = LongTaskTimer.builder("test")
        .publishPercentileHistogram()
        .register(registry);

    Sample sample = longTaskTimer.start();
    assertThat(registry.scrape()).isNotBlank();
    sample.stop();
    Sample sample2 = longTaskTimer.start();
    assertThat(registry.scrape()).isNotBlank();
    sample2.stop();
}

or (slightly simpler):

@Test
void lttWithHistogram() {
    LongTaskTimer longTaskTimer = LongTaskTimer.builder("test")
        .publishPercentileHistogram()
        .register(registry);

    Sample sample = longTaskTimer.start();
    assertThat(registry.scrape()).isNotBlank();
    assertThat(registry.scrape()).isNotBlank();
    sample.stop();
}

(scrape needs to be called at least twice to repro the issue.)

Troubleshooting

It seems we have a similar issue with Prometheus 0.x, but because we don't need to calculate the value of the +Inf bucket there, the issue is not surfaced (the value is still wrong, though). In Prometheus 1.x we do need to calculate that value, so the validation surfaces the issue. If I do this with Prometheus 0.x:

LongTaskTimer longTaskTimer = LongTaskTimer.builder("test")
        .publishPercentileHistogram()
        .register(registry);
Sample sample = longTaskTimer.start();
registry.scrape();
System.out.println(registry.scrape());
sample.stop();

I get this:

[...]
test_seconds_bucket{le="7200.0",} 2.0
test_seconds_bucket{le="+Inf",} 1.0
test_seconds_active_count 1.0
[...]

The logic is the following: +Inf bucket = count, which is also incorrect: since this is a cumulative histogram, +Inf cannot be less than the previous bucket (7200.0: 2 and +Inf: 1 is invalid). This happens because count here is the active task count, which can go up and down, while the histogram cannot.
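To spell out the invariant (a sketch with made-up numbers, not registry output): in a classic cumulative histogram every bucket counts all observations up to its boundary, so the +Inf bucket must be at least as large as every earlier bucket, and the counts can only grow between scrapes:

# valid classic (cumulative) histogram: +Inf is at least as large as every preceding bucket
test_seconds_bucket{le="7200.0",} 2.0
test_seconds_bucket{le="+Inf",} 2.0
# invalid: +Inf smaller than a preceding bucket (what the LongTaskTimer currently produces)
test_seconds_bucket{le="7200.0",} 2.0
test_seconds_bucket{le="+Inf",} 1.0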

Also, there seems to be another issue: calling scrape (snapshotting the histogram) multiple times somehow increases the bucket counts of the histogram, see:

test_seconds_bucket{le="7200.0",} 2.0

There was only one recording; this should not be 2.

I think "fixing" the exception might not be too hard but I also think that the real fix is using a "gauge histogram" (histogram counters can go up or down) instead of a "classic" cumulative histogram (monotonic counters) which might take a bit more effort.

jonatan-ivanov added the bug (A general bug) label and removed the waiting-for-triage label on Apr 30, 2024
jonatan-ivanov added this to the 1.13.0 milestone on Apr 30, 2024
jonatan-ivanov changed the title from "Enabling distribution percentiles for Prometheus 1.x leads to "Counts in ClassicHistogramBuckets cannot be negative"" to "Use GaugeHistogram for Prometheus LongTaskTimer" on Apr 30, 2024

jonatan-ivanov commented Apr 30, 2024

I fixed this by migrating to a Prometheus GaugeHistogram; please see the commit message of d7b9d24 for details.

Please note that since

  • the values will be different (and now correct)
  • _count, _sum, and _bucket are Prometheus Gauges now
  • the _count suffix was changed to _gcount and the _sum suffix was changed to _gsum (Prometheus convention)

this is a breaking change.
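For anyone with dashboards or alerts on these series, the rename roughly looks like this in the scrape output (illustrative values based on the test meter from the reproducer above, not verbatim registry output):

# before (classic histogram, monotonic counters)
test_seconds_count 1.0
test_seconds_sum 0.05
# after (gauge histogram, _gcount/_gsum convention)
test_seconds_gcount 1.0
test_seconds_gsum 0.05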

@jonatan-ivanov (Member)

@snicoll I tried this with Boot locally; could you please check your demo with the latest snapshot?
