Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make alerts stateless #1117

Closed
fabxc opened this Issue Sep 26, 2015 · 15 comments

Comments

Projects
None yet
6 participants
@fabxc
Copy link
Member

fabxc commented Sep 26, 2015

Alerting rules should always be evaluated as a range query over the entire hold duration so pending alerts are not lost on restart.

@matthiasr

This comment has been minimized.

Copy link
Contributor

matthiasr commented Sep 26, 2015

👍

@brian-brazil

This comment has been minimized.

Copy link
Member

brian-brazil commented Sep 26, 2015

This is a dupe of #422 .

I think we'll need something more nuanced to cover both the time when the prometheus server is down (we don't want an alert before 2 scrapes have been done so we have our bearings), and that doing the query historically may not have exactly the same results as doing it at each point in time.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 28, 2015

Also, doing the query always over the entire FOR range gets prohibitively expensive for some alert expressions, so the state needs to be kept separately for alerting rules.

@fabxc

This comment has been minimized.

Copy link
Member Author

fabxc commented Sep 28, 2015

That's something I strongly doubt without measurements indicating otherwise.
Our evaluation interval typically is 60s, alerts are on average FOR 10m.
This increases my computation by 10x. Chunk preloading is not affected in
most cases as a single sample of samples over 10m are in the same chunk.

As alerting rule evaluation has never been anything but white noise so far,
the increase is most likely completely negligible.

If on the other hand, it would be an issue. This would be a clear sign to
me that we have to work on our completely naiive query evaluation, which
can most likely be an order of magnitude more efficient.

On Mon, Sep 28, 2015 at 12:54 PM Julius Volz notifications@github.com
wrote:

Also, doing the query always over the entire FOR range gets prohibitively
expensive for some alert expressions, so the state needs to be kept
separately for alerting rules.


Reply to this email directly or view it on GitHub
#1117 (comment)
.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 28, 2015

As FOR durations are not technically limited, some people will have FOR durations of multiple hours, or in pathological cases even days (although the latter would usually be a bad idea in any case). This will definitely have a big impact on rule evaluations. Of course optimizing general expression evaluation is always a good idea, but I don't see that legitimizing making alert evaluations potentially 100x or more expensive than they are now.

@brian-brazil

This comment has been minimized.

Copy link
Member

brian-brazil commented Sep 28, 2015

I'm expecting to see for expressions up to a month in length in common usage, so I agree with Julius that I don't see this approach working out.

@fabxc

This comment has been minimized.

Copy link
Member Author

fabxc commented Sep 28, 2015

At which evaluating at 1m intervals would be a highly questionable approach.

On Mon, Sep 28, 2015 at 1:14 PM Brian Brazil notifications@github.com
wrote:

I'm expecting to see for expressions up to a month in length in common
usage, so I agree with Julius that I don't see this approach working out.


Reply to this email directly or view it on GitHub
#1117 (comment)
.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 28, 2015

Even if you increase the interval by a bit, that'll be very confusing to a user, why suddenly FOR doesn't work at their defined rule evaluation interval anymore, and also it would still kill the storage because it'd load a huge amount of history. We should keep the FOR state separately.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 28, 2015

Also, most rules should be able to be answered from memory alone. Once we need to go to disk, we are already dead for many of them (you can already see that after some beefy servers start up - they take a while to function properly after restart because the rules load so much from disk).

@matthiasr

This comment has been minimized.

Copy link
Contributor

matthiasr commented Sep 28, 2015

So, the original reason we thought about doing it this way was to get sensible evaluation of FOR over restarts / reloads, without having to introduce yet another bit of storage that needs to be updated and cared for.

We don't need to make the evaluation "really" stateless – what about using the timeseries data if the rule hasn't been evaluated since a restart / reload, but cache the result of that (the time when it started firing) in memory?

@brian-brazil

This comment has been minimized.

Copy link
Member

brian-brazil commented Sep 28, 2015

We don't need to make the evaluation "really" stateless – what about using the timeseries data if the rule hasn't been evaluated since a restart / reload, but cache the result of that (the time when it started firing) in memory?

Such as via ALERTS? It needs a little bit more than that (we need the time it started firing and to add a bit more time for 'for' expressions on the lower end), but the general approach sounds good.

@barkerd427

This comment has been minimized.

Copy link

barkerd427 commented Sep 28, 2015

We run batch jobs weekly or monthly, so my for time goes back that far.
Without some sort of method for determining the date or day of the week, I
have to go back a week to ensure we get results each week. This may just be
a poor architecture, but it is working well currently.
On Sep 28, 2015 6:33 AM, "Brian Brazil" notifications@github.com wrote:

We don't need to make the evaluation "really" stateless – what about using
the timeseries data if the rule hasn't been evaluated since a restart /
reload, but cache the result of that (the time when it started firing) in
memory?

Such as via ALERTS? It needs a little bit more than that (we need the
time it started firing and to add a bit more time for 'for' expressions on
the lower end), but the general approach sounds good.


Reply to this email directly or view it on GitHub
#1117 (comment)
.

@jinhang

This comment has been minimized.

Copy link

jinhang commented Nov 7, 2016

@brian-brazil why I write for 1s but the pending to firing use time more than 1s ?

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Nov 8, 2016

@jinhang Because alerts need at least one rule evaluation cycle to go from pending to firing. I would guess that your rule evaluation cycle is much longer. You can just omit FOR to go directly into firing mode. This question is not related to this issue though.

@lock

This comment has been minimized.

Copy link

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.