When I write an alert rule with "for 1s", why does it take more than 1s to go from pending to firing? #2171

Closed
jinhang opened this issue on Nov 7, 2016 · 16 comments

jinhang commented Nov 7, 2016

brancz commented Nov 7, 2016

How long a delay are you seeing? I believe it also depends on the evaluation interval.

brancz commented Nov 7, 2016

Just looked at the code, and it looks like it needs two evaluation intervals to start firing: the first evaluation puts the alert into the pending state, and the second evaluation moves it to firing.

You generally want the evaluation and scrape intervals to be roughly in sync, and to require around 3 samples before an alert starts firing. Otherwise you will likely get alerts caused by a single failed scrape when everything is actually fine, which subsequent scrapes would have shown.
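To make the two-evaluation behavior concrete, here is a small Python sketch of the inactive/pending/firing state machine. It is a simplified model written for illustration, not Prometheus's actual rule-evaluation code: it assumes evaluations happen exactly every `eval_interval` seconds, that the condition is only sampled at those instants, and that the alert resets to inactive as soon as the condition is false. The `simulate` function and its parameters are hypothetical names.

```python
def simulate(condition_at, eval_interval, for_duration, horizon):
    """Simplified model of one Prometheus alerting rule (not Prometheus's code).

    condition_at(t) -> bool tells whether the rule expression is true at time t.
    Returns a list of (evaluation_time, state) pairs; times are in seconds.
    """
    states = []
    active_since = None  # evaluation time at which the condition was first seen
    t = 0
    while t <= horizon:
        if condition_at(t):
            if active_since is None:
                active_since = t
            # Fire only once the condition has been seen for >= for_duration,
            # measured between evaluations; until then the alert is pending.
            held = t - active_since
            states.append((t, "firing" if held >= for_duration else "pending"))
        else:
            active_since = None  # condition cleared, so the alert resets
            states.append((t, "inactive"))
        t += eval_interval
    return states


if __name__ == "__main__":
    # Condition true from t=30s onwards, 60s evaluation interval, FOR 1s.
    print(simulate(lambda t: t >= 30, eval_interval=60, for_duration=1, horizon=120))
    # [(0, 'inactive'), (60, 'pending'), (120, 'firing')]

    # A one-off blip (e.g. caused by a single bad scrape) seen by one evaluation.
    def blip(t):
        return 60 <= t < 61

    with_for = simulate(blip, eval_interval=15, for_duration=30, horizon=90)
    without_for = simulate(blip, eval_interval=15, for_duration=0, horizon=90)
    print(any(s == "firing" for _, s in with_for))     # False
    print(any(s == "firing" for _, s in without_for))  # True
```

The first call shows that even with FOR 1s the alert only fires on the second evaluation that sees the condition. The blip example shows why requiring a few samples helps: a condition that holds for a single evaluation only reaches pending when a FOR duration is set, but fires immediately without one.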

jinhang commented Nov 8, 2016

How long are two evaluation intervals? How should I configure it?

jinhang commented Nov 8, 2016

@brancz How long are two evaluation intervals? How should I configure it?

jinhang commented Nov 8, 2016

@brancz How do I control how often the Prometheus server sends alert state to the Alertmanager?
I configured this alert rule:

    if es_os_cpu_percent{node="Orikal"} > 0
    for 1s

and this Prometheus configuration:

    scrape_interval: 1s

When es_os_cpu_percent{node="Orikal"} > 0 is true, I want to receive an email immediately, but I find that the alert sometimes stays in the pending state.
What should I do?

brancz commented Nov 8, 2016

One evaluation interval is 1 minute by default (the evaluation_interval setting in the global section of the configuration). The FOR condition is optional: if you leave it out, the alert fires on the first evaluation at which the rule's condition is true. Prometheus tries to send the alert to the Alertmanager as soon as possible after the rule has been evaluated as firing.
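Assuming evaluation_interval was left at that 1-minute default while scrape_interval was set to 1s (as in the configuration quoted earlier in the thread), the arithmetic below sketches how long the alert can take to reach firing. It is a back-of-the-envelope model under stated assumptions (evaluations exactly every interval, the condition stays true, scrape and notification latency ignored); `firing_delay_bounds` is a hypothetical helper, not a Prometheus API.

```python
import math


def firing_delay_bounds(eval_interval, for_duration=0.0):
    """Return (best_case, upper_bound) seconds from the condition becoming true
    until the evaluation that marks the alert as firing.

    Assumes evaluations run exactly every eval_interval seconds and the
    condition stays true; scrape and notification latency are ignored.
    """
    # The first evaluation to see the condition comes between 0 and (just
    # under) one full interval after it becomes true.
    first_seen_best, first_seen_worst = 0, eval_interval
    # The alert then stays pending for the smallest whole number of further
    # evaluations spanning at least for_duration (zero when FOR is omitted).
    pending = math.ceil(for_duration / eval_interval) * eval_interval
    return first_seen_best + pending, first_seen_worst + pending


if __name__ == "__main__":
    print(firing_delay_bounds(60, 1))  # (60, 120): 1m evaluation interval, FOR 1s
    print(firing_delay_bounds(60))     # (0, 60): FOR omitted
    print(firing_delay_bounds(1, 1))   # (1, 2): evaluation_interval set to 1s
```

Lowering evaluation_interval shrinks the delay, at the cost of evaluating every rule that often; as noted earlier in the thread, firing on a single sample is usually not what you want.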

jinhang commented Nov 8, 2016

Thanks a lot. When I stop the alert emails not by configuring a silence but by modifying the alert rules file (removing the alert rule) and reloading the Prometheus server, it seems to take about 2 minutes before the emails stop. I see the Alertmanager listening on /api/v1/alerts. What should I do if I don't want to use a silence?

brancz commented Nov 8, 2016

I'm sorry, but I don't think I understand what you are asking.

jinhang commented Nov 8, 2016

Oh, if you delete the alert rule from prometheus.rules and reload Prometheus, the Alertmanager still sends emails for about 2 minutes. (I want to silence the alert without using the Alertmanager's silences.)

brancz commented Nov 8, 2016

Aha, so if I understand correctly, the problem is that even though you changed and reloaded the rule file (by either sending SIGHUP or a POST request to the reload endpoint), the Alertmanager still sends emails for 2 minutes, and you don't want to use a silence from the Alertmanager? What kind of email are you receiving? If configured, the Alertmanager sends a resolved notification; otherwise, you will want to look into group_wait and group_interval (see here).
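The timing effect of group_wait and group_interval can be sketched with a simplified model of a single Alertmanager notification group. This is an illustration under stated assumptions, not Alertmanager's implementation: the first notification goes out group_wait after the group's first alert arrives, later changes (such as an alert resolving) are only sent on the next group_interval tick, and repeat_interval and resolve timeouts are ignored. The 30s and 5m defaults used here are intended to match the Alertmanager documentation's defaults for group_wait and group_interval; check the docs for your version. `notification_times` is a hypothetical helper.

```python
import math


def notification_times(first_alert_at, change_times, group_wait=30, group_interval=300):
    """Simplified model of when one Alertmanager group sends notifications.

    Assumptions (illustrative, not Alertmanager's code): the first
    notification is sent group_wait seconds after the first alert arrives;
    afterwards the group is flushed every group_interval seconds, and a
    change (a new or resolved alert) is only sent at the next flush.
    repeat_interval is ignored. All times are in seconds.
    """
    first_flush = first_alert_at + group_wait
    sent = [first_flush]
    for change in sorted(change_times):
        if change <= first_flush:
            continue  # already covered by the first notification
        # Next flush tick at or after the change.
        ticks = math.ceil((change - first_flush) / group_interval)
        flush = first_flush + ticks * group_interval
        if flush != sent[-1]:
            sent.append(flush)
    return sent


if __name__ == "__main__":
    # Alert arrives at t=0; assume the Alertmanager treats it as resolved at t=100.
    print(notification_times(0, [100]))                     # [30, 330]
    # Same, with group_interval lowered to 60s.
    print(notification_times(0, [100], group_interval=60))  # [30, 150]
```

In this model, a change is never reported faster than the next group_interval tick, which is the grouping delay referred to above.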

jinhang commented Nov 8, 2016

Yes, I don't want to use a silence from the Alertmanager. I want removing the alert rule (after which the Prometheus server displays the state shown in [image]) to be equivalent to silencing the alert.

brancz commented Nov 8, 2016

If I understand correctly, you're facing a trade-off: you either choose a FOR condition (a typical duration is 10 minutes; firing immediately is rather unusual) or accept that you might get one more notification than you expect, because the Alertmanager waits to group notifications.

jinhang commented Nov 10, 2016

@brancz Hi, can I ask you a question?
[image]
How do I merge every node's data into one curve using PromQL?

brancz commented Nov 10, 2016

I believe what you are looking for is, for example, sum by(job) (<metric_name>), which aggregates all series that share the same job label value into a single series.
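To illustrate what sum by(job) does at a single evaluation instant, here is a small Python sketch that groups samples by their job label and sums them. It models the PromQL semantics for illustration only and is not how Prometheus evaluates queries; the `sum_by` helper and the label values (job "es", node "node-2") are hypothetical.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Sum instant-vector samples grouped by one label, keeping only that label.

    samples: list of (labels, value) pairs, one per series at a single
    timestamp. A series without the label is grouped under "", mirroring
    PromQL's treatment of a missing label as an empty one.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


if __name__ == "__main__":
    samples = [
        ({"__name__": "es_os_cpu_percent", "job": "es", "node": "Orikal"}, 12.0),
        ({"__name__": "es_os_cpu_percent", "job": "es", "node": "node-2"}, 30.0),
    ]
    print(sum_by(samples, "job"))  # {'es': 42.0}
```

A graph of sum by(job) (...) repeats this at every step, which is why the per-node curves collapse into one curve per job. Leaving out the by clause, as in sum(<metric_name>), collapses all series into a single curve.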

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
