
Persist alert 'for' state across restarts #422

Closed
brian-brazil opened this Issue Dec 10, 2014 · 12 comments

brian-brazil (Member) commented Dec 10, 2014

Currently, if Prometheus restarts, we lose the 'for' state of firing alerts. While this isn't an issue for short 'for' clauses, it presents a problem for clauses in the hours-to-days range.

It'd be good to persist this state in some way, so that alerts don't have to start again from scratch. We probably don't want to count the time the Prometheus server is down against the 'for' clause.
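A minimal sketch of the intended arithmetic, using a hypothetical helper rather than Prometheus code: the 'for' time already served before the shutdown still counts, but the downtime itself does not.

```go
package main

import (
	"fmt"
	"time"
)

// firesAt returns when a pending alert should start firing after a restart,
// assuming downtime is not counted against the 'for' duration.
// Hypothetical helper, not Prometheus code.
func firesAt(activeSince, shutdown, restart time.Time, forDuration time.Duration) time.Time {
	served := shutdown.Sub(activeSince) // 'for' time already satisfied before shutdown
	remaining := forDuration - served   // what is still owed
	if remaining < 0 {
		remaining = 0
	}
	// The clock only resumes once the server is back up.
	return restart.Add(remaining)
}

func main() {
	activeSince := time.Date(2014, 12, 10, 0, 0, 0, 0, time.UTC)
	shutdown := activeSince.Add(6 * time.Hour) // server stopped after 6h of pending
	restart := shutdown.Add(9 * time.Hour)     // 9h of downtime does not count
	// With a 12h 'for' clause, the alert fires 6h after the restart.
	fmt.Println(firesAt(activeSince, shutdown, restart, 12*time.Hour))
}
```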

gouthamve (Member) commented Mar 28, 2017

How would this work? On startup, do we re-evaluate from the point where we terminated, as a way of catching up?

For example, take an alert A1 with FOR 12h that became active at 0000. It is now pending; we shut down the server at 0600 and persist that A1 has been pending for 6h. Suppose we restart at 1500: do we evaluate the range 0600–1500 to catch up, and then continue normal evaluation? Isn't this going to be costly?

Are we going to have a threshold, something like losing alert state if Prometheus is down for more than 1h? We could make this a flag to support use cases where the server was down for a long time.
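A hedged sketch of what such a threshold could look like as a flag; the flag name and default below are made up for illustration and are not an actual Prometheus flag.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	// Hypothetical flag: how long an outage may last before persisted
	// alert 'for' state is discarded rather than restored.
	outageTolerance := flag.Duration("alert.state-outage-tolerance", time.Hour,
		"Maximum Prometheus downtime for which persisted alert 'for' state is still restored.")
	flag.Parse()

	downtime := 2 * time.Hour // e.g. measured from the last persisted sample
	if downtime > *outageTolerance {
		fmt.Println("outage too long: dropping persisted 'for' state, alerts start from scratch")
	} else {
		fmt.Println("restoring persisted 'for' state")
	}
}
```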

brian-brazil (Member, Author) commented Mar 29, 2017

Suppose we restart at 1500: do we evaluate the range 0600–1500 to catch up, and then continue normal evaluation?

No, and the alert probably wouldn't be firing during that period anyway, as there'll be no data from the time Prometheus was down.

What we want is to remember that 6h of the FOR has already been satisfied, and to alert once it has been satisfied for another 6h. We need to be a little careful here: for safety, if the remaining time is <10m we should treat it as min(10m, for).

To implement this, I was thinking of an analogue to ALERTS that'd track when an alert started; when Prometheus restarts, it'd check that metric (adjusting if needed for the 10m safety margin) and initialise internal state accordingly.
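A rough sketch of that restore step, assuming a persisted ALERTS-like series whose sample value holds the activation timestamp and whose own sample timestamp approximates when the server went down; the function and parameter names are illustrative, not the actual implementation.

```go
package rules

import "time"

// restoredFireTime computes when a restored pending alert should fire.
// activeAt and downAt would come from the persisted ALERTS-like series
// described above; restart is when the server came back up.
func restoredFireTime(activeAt, downAt, restart time.Time, forDuration time.Duration) time.Time {
	served := downAt.Sub(activeAt)    // how much of 'for' was already satisfied
	remaining := forDuration - served // what is still owed once we are back up

	// Safety margin: never fire sooner than min(10m, for) after the restart.
	grace := 10 * time.Minute
	if forDuration < grace {
		grace = forDuration
	}
	if remaining < grace {
		remaining = grace
	}
	return restart.Add(remaining)
}
```

The clamp means a restored alert whose 'for' clause was already nearly satisfied still waits a short while after the restart, rather than firing off freshly scraped (and possibly sparse) data.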

One other thing is that it can take a while for a Prometheus to warm up and complete enough scrapes for alerts to start firing properly. Thus we should delay rule & alert evaluation for 2 scrape_intervals at startup to mitigate this.
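And a minimal sketch of that warm-up hold-off, assuming the global scrape_interval is known to the rule manager at startup (illustrative wiring only, not the actual rule manager).

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	scrapeInterval := 15 * time.Second // global scrape_interval from the config

	// Hold off rule & alert evaluation until two scrapes have had a chance to
	// complete, so 'for' alerts aren't evaluated against an almost-empty TSDB.
	holdOff := 2 * scrapeInterval
	fmt.Printf("delaying first rule evaluation by %s\n", holdOff)
	time.Sleep(holdOff)

	// ...start the normal evaluation loop here...
}
```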

gouthamve (Member) commented Mar 29, 2017

Okay, biggest facepalm moment ever! How could I not see that the server is not running! Sorry about that!

Coming to when to start evaluation: will 2 global scrape_intervals be okay? We might be alerting on series that are scraped much less frequently. And what about alerts on range queries?

brian-brazil (Member, Author) commented Mar 29, 2017

The global scrape interval is all that makes sense.

Range queries will usually be rate(), and two scrapes plus some slack should be enough for that to work.

jacksontj (Contributor) commented Jun 20, 2018

Do we want to persist state? If state is persisted, then alerts require a local time-series store to enforce the alerting. This is fine for Prometheus itself (although it is storing data we might not care about), but it does pose a problem for things that use the alerting libraries to enforce PromQL alerts against remote storage systems.

I had hacked up an approach for backfilling on startup, but that doesn't cover the case you mention here, where we don't want to count "prom downtime" against the alert's 'for' interval. Maybe it'd be helpful to define some "load alert state" interface that can be passed in (Prometheus itself could store/load from a time series, others could re-query, etc.)?
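A rough sketch of the kind of pluggable hook being suggested here; the interface and type names are hypothetical, not the library's actual API.

```go
package rules

import (
	"context"
	"time"
)

// RestoredAlert carries the minimum needed to rebuild 'for' state after a restart.
type RestoredAlert struct {
	Labels   map[string]string
	ActiveAt time.Time // when the alert originally became active
}

// AlertStateLoader is a hypothetical hook: Prometheus could back it with its
// local TSDB, while library users could re-query a remote storage system or
// skip restoration entirely.
type AlertStateLoader interface {
	LoadAlertState(ctx context.Context, ruleName string) ([]RestoredAlert, error)
}

// NoopLoader restores nothing, preserving today's behaviour for callers that
// don't have (or don't want) persisted state.
type NoopLoader struct{}

func (NoopLoader) LoadAlertState(ctx context.Context, ruleName string) ([]RestoredAlert, error) {
	return nil, nil
}
```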

brian-brazil (Member, Author) commented Jun 20, 2018

Do we want to persist state?

That's kind of the entire point of this PR. If you're running some non-standard system then you won't get the benefit of this feature, but alerting will continue to function as it does today.

jacksontj (Contributor) commented Jun 20, 2018

I assume then that this functionality would be optional (through some config)?

brian-brazil (Member, Author) commented Jun 20, 2018

I see no reason why this functionality would not be always on in the Prometheus server. The Prometheus server always has a local TSDB.

jacksontj (Contributor) commented Jun 20, 2018

Quoting from myself:

If state is persisted, then alerts require a local time-series store to enforce the alerting. This is fine for Prometheus itself (although it is storing data we might not care about), but it does pose a problem for things that use the alerting libraries to enforce PromQL alerts against remote storage systems.

TL;DR: what about others using the library?

brian-brazil (Member, Author) commented Jun 21, 2018

It's an internal API so we offer no guarantees there, but you can pass in whatever storage objects you like.

gouthamve (Member) commented Aug 3, 2018

Closed by #4061

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
