
If throttled, don't `rate`, etc. with 0 values #2793

Closed · Dominik-K opened this issue Jun 1, 2017 · 9 comments · 2 participants

Dominik-K (Contributor) commented Jun 1, 2017

We ran into the "throttling mode" ° today. It seems that Prometheus (1.6.3) uses zero values for the throttled time span and calculates wrong rates, resulting in false alerts afterwards.

A threshold-underrun alert fired 60 minutes later because we use a rate(...[60m]) expression:

[Screenshot (2017-06-01 18:03): the two sum(rate(METRIC)) graphs with different rate intervals]

  • Throttling started at 16:27 and Prometheus left "rushed mode" shortly before 16:42.
  • Both graphs are based on the same sum(rate(METRIC)) with different rate intervals (see the sketch below).
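
For concreteness, a comparison like the one in the screenshot could be reproduced roughly as follows. This is only a sketch: the metric name METRIC, the server address, and the time window are placeholders, not taken from our setup.

```python
# Sketch only: METRIC, the server address, and the time window are placeholders.
import requests

PROM = "http://localhost:9090"  # assumed Prometheus address

for interval in ("5m", "60m"):
    resp = requests.get(f"{PROM}/api/v1/query_range", params={
        "query": f"sum(rate(METRIC[{interval}]))",
        "start": "2017-06-01T13:00:00Z",  # example window around the incident
        "end": "2017-06-01T17:00:00Z",
        "step": "60s",
    })
    result = resp.json()["data"]["result"]
    values = result[0]["values"] if result else []
    print(interval, len(values), "points")
```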

What did you expect to see?

  • A gap between 16:27 and 16:42, as Prometheus didn't scrape any values during this time.
  • A near-flat, slowly changing rate(...[60m]) graph (as seen before 16:27 and after 17:45).
    => No false alerts such as a threshold underrun.

I propose that Prometheus skip the throttled timeframe, i.e. use the last value before throttling for rates calculated after it has left "rushed mode" and has real data values again.

° "Storage needs throttling. Scrapes and rule evaluations will be skipped." (urgencyScore=1)

brian-brazil (Member) commented Jun 1, 2017

Nothing is ingested when throttling.

You're not showing us everything here. Is there a single time series which shows the supposed problem?

Dominik-K (Contributor, Author) commented Jun 6, 2017

The graphs shown above (and below) are based on the same counter time series; only the rate interval is different. Here is also the raw counter value shown in the same graph (times are UTC+1 with daylight saving time, i.e. a +2h offset to UTC):

[Screenshot (2017-06-06 17:00): the rate graphs together with the raw counter value, local time]

The raw counter value in Prometheus (times are UTC+0):
[Screenshot (2017-06-06 17:01): the raw counter value, UTC]

During the first 5 minutes of throttling, the counter is recorded with the same value as the last scrape. After that there are NaNs until throttling is over. That's a second problem, though.

brian-brazil (Member) commented Jun 6, 2017

Can you share the raw data for that counter?

Dominik-K (Contributor, Author) commented Jun 8, 2017

Sure. Here are the raw values, retrieved with the query_range API call.
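
(For illustration, a query_range call of this kind might look as follows — a sketch with placeholder metric name, server address, and timestamps. Note that query_range evaluates the expression at each step rather than returning the samples as they were stored.)

```python
# Sketch of a query_range call; metric name, address, and timestamps are placeholders.
import requests

resp = requests.get("http://localhost:9090/api/v1/query_range", params={
    "query": "METRIC",    # the bare counter series
    "start": 1496318400,  # example Unix timestamps around the incident
    "end": 1496332800,
    "step": 15,
})
for series in resp.json()["data"]["result"]:
    # query_range returns the expression evaluated at each step,
    # not the raw samples as they were stored
    print(series["metric"], len(series["values"]), "step-aligned points")
```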

brian-brazil (Member) commented Jun 9, 2017

Can you get the values using the query API, please?

Dominik-K (Contributor, Author) commented Jun 13, 2017

What's the difference? I just checked the first and last value with the query call and they are the same, e.g. the last value in the timeframe:

{
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "REDACTED",
                    "instance": "REDACTED",
                    "job": "REDACTED"
                },
                "value": [
                    1496332800,
                    "266768234"
                ]
            }
        ]
    }
}

brian-brazil (Member) commented Jun 13, 2017

query_range does not return the raw data.
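
(The raw stored samples can instead be fetched from the query endpoint by using a range vector selector — a sketch, with placeholder metric name, server address, and evaluation time:)

```python
# Fetch raw samples via the instant-query endpoint with a range vector selector.
# Metric name, server address, and evaluation time are placeholders.
import requests

resp = requests.get("http://localhost:9090/api/v1/query", params={
    "query": "METRIC[2h]",  # range selector -> the stored samples of the last 2h
    "time": 1496332800,     # evaluate "as of" this timestamp
})
for series in resp.json()["data"]["result"]:
    for ts, value in series["values"]:  # [timestamp, value] pairs as stored
        print(ts, value)
```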

brian-brazil (Member) commented Jul 14, 2017

I'm presuming this is a misunderstanding of how rate() works. If you have evidence that rate() is broken, please share the raw data.
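
(For context, a much simplified illustration of what rate() does over its window — ignoring counter resets and Prometheus' extrapolation rules, and using made-up samples that only demonstrate the arithmetic, not the data in this issue:)

```python
# Simplified rate() over a window: the increase between the first and last sample
# inside the window, divided by the time between them. Counter resets and
# Prometheus' extrapolation toward the window boundaries are ignored here.
def simple_rate(samples, window_start, window_end):
    inside = [(t, v) for t, v in samples if window_start <= t <= window_end]
    if len(inside) < 2:
        return None  # rate() needs at least two samples in the window
    (t0, v0), (t1, v1) = inside[0], inside[-1]
    return (v1 - v0) / (t1 - t0)

# Made-up counter samples: ~10/s while scraped, then a stretch where the series
# stays flat across a gap, then increasing again.
samples = [(0, 0), (60, 600), (120, 1200), (180, 1200), (1020, 1200), (1080, 1800)]
print(simple_rate(samples, 0, 1080))  # ~1.67/s, far below 10/s -> a depressed average
```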

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.