
Use GaugeHistogram for Prometheus LongTaskTimer #4988

Closed
mhalbritter opened this issue Apr 29, 2024 · 3 comments
Labels
bug (A general bug), registry: prometheus (A Prometheus Registry related issue)
Milestone
1.13.0

mhalbritter commented Apr 29, 2024

Original title:

Enabling distribution percentiles for Prometheus 1.x leads to "Counts in ClassicHistogramBuckets cannot be negative"

See the bug reported against Spring Boot: spring-projects/spring-boot#40552

When debugging this, the counts list in io.micrometer.prometheusmetrics.PrometheusMeterRegistry#addDistributionStatisticSamples, which is passed to ClassicHistogramBuckets.of(buckets, counts), contains -1 as its last element.

The code enters the condition if (Double.isFinite(histogramCounts[histogramCounts.length - 1].bucket())) {, which executes

counts.add(count - histogramCounts[histogramCounts.length - 1].count());

count is 1, while histogramCounts[histogramCounts.length - 1].count() increases with every request, which leads to a negative number.
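To make the failing arithmetic concrete, here is a minimal sketch with hypothetical values (the names mirror the variables above; this is not the actual Micrometer source):

// count is the LongTaskTimer's active task count; it stays at 1 for a single in-flight request
long count = 1;
// the snapshotted count of the last finite bucket keeps growing with every scrape/request
long lastFiniteBucketCount = 2; // histogramCounts[histogramCounts.length - 1].count()
// the +Inf bucket is derived as the difference, which goes negative as soon as the
// finite bucket count exceeds the active task count:
long infBucketCount = count - lastFiniteBucketCount; // -1 -> "Counts in ClassicHistogramBuckets cannot be negative"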

Version: micrometer-registry-prometheus 1.13.0-RC1.

@jonatan-ivanov (Member)

Thanks @snicoll for catching this and for the reproducer. Thanks @mhalbritter for the issue and for further investigating this.

Workaround

It seems the issue happens when a LongTaskTimer (with a histogram enabled) is in progress while scrape is called. If you disable the histogram for the LongTaskTimer, @snicoll's app should work:

management.metrics.distribution.percentiles-histogram.http.server.requests=true
management.metrics.distribution.percentiles-histogram.http.server.requests.active=false
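Outside of Boot, a similar effect can be achieved with a MeterFilter. This is only a sketch, and it assumes the LongTaskTimer is registered under the name http.server.requests.active (the exact meter name may differ in your setup); imports from io.micrometer.core.instrument are omitted, as in the reproducers below:

registry.config().meterFilter(new MeterFilter() {
    @Override
    public DistributionStatisticConfig configure(Meter.Id id, DistributionStatisticConfig config) {
        if ("http.server.requests.active".equals(id.getName())) {
            // disable the percentile histogram only for the LongTaskTimer
            return DistributionStatisticConfig.builder()
                .percentilesHistogram(false)
                .build()
                .merge(config);
        }
        return config;
    }
});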

Micrometer Reproducer

I was able to come up with a reproducer without Boot:

@Test
void lttWithHistogram() {
    LongTaskTimer longTaskTimer = LongTaskTimer.builder("test")
        .publishPercentileHistogram()
        .register(registry);

    Sample sample = longTaskTimer.start();
    assertThat(registry.scrape()).isNotBlank();
    sample.stop();
    Sample sample2 = longTaskTimer.start();
    assertThat(registry.scrape()).isNotBlank();
    sample2.stop();
}

or (slightly simpler):

@Test
void lttWithHistogram() {
    LongTaskTimer longTaskTimer = LongTaskTimer.builder("test")
        .publishPercentileHistogram()
        .register(registry);

    Sample sample = longTaskTimer.start();
    assertThat(registry.scrape()).isNotBlank();
    assertThat(registry.scrape()).isNotBlank();
    sample.stop();
}

(scrape needs to be called at least twice to repro the issue.)

Troubleshooting

It seems we have a similar issue with Prometheus 0.x, but because we don't need to calculate the value of the +Inf bucket there, the issue is not surfaced (the value is still wrong, though). In Prometheus 1.x we do need to calculate that value, so the validation surfaces the issue. If I do this with Prometheus 0.x:

LongTaskTimer longTaskTimer = LongTaskTimer.builder("test")
        .publishPercentileHistogram()
        .register(registry);
Sample sample = longTaskTimer.start();
registry.scrape();
System.out.println(registry.scrape());
sample.stop();

I get this:

[...]
test_seconds_bucket{le="7200.0",} 2.0
test_seconds_bucket{le="+Inf",} 1.0
test_seconds_active_count 1.0
[...]

The logic is the following: +Inf bucket = count, which is also incorrect: since this is a cumulative histogram, +Inf cannot be less than the previous bucket (7200.0: 2 and +Inf: 1 is invalid). This happens because count here is the active task count, which can go up and down, while the histogram cannot.
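To spell out the invariant (a sketch with made-up numbers, not registry output): in a classic cumulative histogram every bucket counts all observations up to its boundary, so the +Inf bucket must be at least as large as every earlier bucket, and the counts can only grow between scrapes:

# valid classic (cumulative) histogram: +Inf is at least as large as every preceding bucket
test_seconds_bucket{le="7200.0",} 2.0
test_seconds_bucket{le="+Inf",} 2.0
# invalid: +Inf smaller than a preceding bucket (what the LongTaskTimer currently produces)
test_seconds_bucket{le="7200.0",} 2.0
test_seconds_bucket{le="+Inf",} 1.0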

Also, there seems to be another issue: calling scrape (snapshotting the histogram) multiple times somehow increases the bucket counts of the histogram, see:

test_seconds_bucket{le="7200.0",} 2.0

There was only one recording; this should not be 2.

I think "fixing" the exception might not be too hard but I also think that the real fix is using a "gauge histogram" (histogram counters can go up or down) instead of a "classic" cumulative histogram (monotonic counters) which might take a bit more effort.

jonatan-ivanov added the bug (A general bug) label and removed the waiting-for-triage label on Apr 30, 2024
jonatan-ivanov added this to the 1.13.0 milestone on Apr 30, 2024
jonatan-ivanov changed the title from "Enabling distribution percentiles for Prometheus 1.x leads to "Counts in ClassicHistogramBuckets cannot be negative"" to "Use GaugeHistogram for Prometheus LongTaskTimer" on Apr 30, 2024

jonatan-ivanov commented Apr 30, 2024

I fixed this by migrating to a Prometheus GaugeHistogram; please see the commit message of d7b9d24 for details.

Please note that since

  • the values will be different (and now correct)
  • _count, _sum, and _bucket are Prometheus Gauges now
  • the _count suffix was changed to _gcount and the _sum suffix was changed to _gsum (Prometheus convention)

this is a breaking change.
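For anyone with dashboards or alerts on these series, the rename roughly looks like this in the scrape output (illustrative values based on the test meter from the reproducer above, not verbatim registry output):

# before (classic histogram, monotonic counters)
test_seconds_count 1.0
test_seconds_sum 0.05
# after (gauge histogram, _gcount/_gsum convention)
test_seconds_gcount 1.0
test_seconds_gsum 0.05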

@jonatan-ivanov (Member)

@snicoll I tried this with Boot locally; could you please check your demo with the latest snapshot?
