Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timer max expiry inconsistent with percentiles expiry #3298

Open
gdabski opened this issue Jul 16, 2022 · 1 comment
Open

Timer max expiry inconsistent with percentiles expiry #3298

gdabski opened this issue Jul 16, 2022 · 1 comment
Labels
module: micrometer-core An issue that is related to our core module waiting-for-triage

Comments

@gdabski
Copy link

gdabski commented Jul 16, 2022

Describe the bug
A single recorded time (sample) affects a Timer's max() for a longer time than percentiles produced by the Timer. With default DistributionStatisticConfig the effective expiry is one minute for the percentiles and three minutes for max. The issue is related to how TimeWindowMax and AbstractTimeWindowHistogram interpret the value of DistributionStatisticConfig.expiry. Both use ring buffers of same size to implement the decay, but while the former only moves by one buffer position in intervals equal to expiry, the latter is implemented to do a full rotation of the buffer in the same time.

Environment

  • Micrometer version: 1.9.x
  • OS: Linux
  • Java version: 17.0.3

To Reproduce

    @Test
    void demonstratesDifferentExpiryForMaxAndPercentiles() throws InterruptedException {
        MeterRegistry meterRegistry = new SimpleMeterRegistry();
        Timer timer = Timer.builder("test").publishPercentiles(new double[]{0.5}).register(meterRegistry);
        Duration checkInterval = Duration.ofSeconds(30);

        timer.record(Duration.ofMillis(200));

        Duration testDuration = Duration.ZERO;
        var max = timer.max(MILLISECONDS);
        var snapshot = timer.takeSnapshot();

        while (max != 0) {
            System.out.printf("%ss: %s max %.0f%n", testDuration.toSeconds(), snapshot.percentileValues()[0], max);

            Thread.sleep(checkInterval.toMillis());

            testDuration = testDuration.plus(checkInterval);
            max = timer.max(MILLISECONDS);
            snapshot = timer.takeSnapshot();
        }

        System.out.printf("%ss: %s max %.0f%n", testDuration.toSeconds(), snapshot.percentileValues()[0], max);
    }

Prints:

0s: (1.92937984E8 at 50.0%) max 200
30s: (1.92937984E8 at 50.0%) max 200
60s: (0.0 at 50.0%) max 200
90s: (0.0 at 50.0%) max 200
120s: (0.0 at 50.0%) max 200
150s: (0.0 at 50.0%) max 200
180s: (0.0 at 50.0%) max 0

Expected behaviour
A single sample ceases to affect percentiles and timer max at the same point in time.

Related issues
In #2751 there is a complaint about max not expiring in expected time, but response was that the TimeWindowMax uses expiry (and bufferLength) from DistributionStatisticConfig right.

@gdabski gdabski changed the title Timer max expiry inconsistent with histogram expiry Timer max expiry inconsistent with percentiles expiry Jul 16, 2022
@shakuzen shakuzen added waiting-for-triage module: micrometer-core An issue that is related to our core module labels Jul 19, 2022
@rafalh
Copy link

rafalh commented Jun 27, 2024

I just stepped onto this issue.
I think I found the root cause - it is the way how durationBetweenRotatesMillis is initialized. For some reasons expiry time from configuration is divided there by bufferLength (ageBuckets variable) unlike in TimeWindowMax where expiry time is taken directly. This means that effective expiry time (when metric is reset to 0 after single request) for max metric is expiry * bufferLength but for percentiles it is only expiry.
AFAIK documentation does not reflect this difference and for me it is confusing so I'm assuming it is a bug.
For the default configuration where expiry/step is set to 1 minute and bufferLength is set to 3, assuming metrics are scrapped every minute, in unfortunate case when scrapping occurs immediately after metrics are rotated, requests from 20 seconds window out of 1 minute (33% of data) are not taken into account in percentile metrics at all...
I'm planning to create a PR changing durationBetweenRotatesMillis initialization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: micrometer-core An issue that is related to our core module waiting-for-triage
Projects
None yet
Development

No branches or pull requests

3 participants