cmd option "-rules.alert.resend-delay" is ignored #5048

Open
zuzzas opened this Issue Dec 28, 2018 · 9 comments

zuzzas commented Dec 28, 2018

Bug Report

What did you do?

I've set the --rules.alert.resend-delay=30s option.

What did you expect to see?

I expected to see alert POSTs going out at 30-second intervals.

What did you see instead? Under which circumstances?

Alerts were sent at 1-minute intervals instead of 30s. When I set it to 29s, they were sent at 30-second intervals. I suspect it's related to evaluation_interval, but I haven't been able to confirm that in the source code.

Here are the relevant Wireshark captures with --rules.alert.resend-delay set to 30 and 29 seconds respectively:

Environment

  • System information:

    Linux kube-a-1 4.10.0-42-generic #46~16.04.1-Ubuntu SMP Mon Dec 4 15:57:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

  • Prometheus version:

    Prometheus v2.6.0

  • Prometheus configuration file:

global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    prometheus: kube-prometheus/main
    prometheus_replica: prometheus-main-0
alerting:
  alert_relabel_configs:
  - separator: ;
    regex: prometheus_replica
    replacement: $1
    action: labeldrop
  alertmanagers:
  - kubernetes_sd_configs:
    - api_server: null
      role: endpoints
      namespaces:
        names:
        - kube-prometheus
    scheme: http
    path_prefix: /
    timeout: 10s
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: proxy
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: http
      replacement: $1
      action: keep

Everything else is a bit confidential.

juliusv (Member) commented Dec 28, 2018

That sounds like a (relatively predictable) race condition between the 30s rule evaluation interval and the resend interval. When the two are exactly the same, whether an alert is sent twice in a row during rule evaluations depends on the exact details of when the resend delay is checked vs. when the last resend is recorded. Prometheus might think that a resend happened 29.999s earlier (in the last rule evaluation iteration) and thus not send the alert again. Indeed, if you depend on Prometheus sending alerts on every rule evaluation iteration, you'll probably want to set the resend interval at least a couple of seconds lower than the evaluation interval.
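
For example, with the 30s evaluation_interval from the configuration above, something along these lines avoids the boundary case (25s is only an illustrative value, a few seconds below the evaluation interval):

prometheus --config.file=prometheus.yml --rules.alert.resend-delay=25s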

zuzzas (Author) commented Dec 28, 2018

I thought that the alert resend interval was decoupled from rule evaluation after Prometheus 2.4.0. Why is there a race condition, and can it be considered a bug?

juliusv (Member) commented Dec 28, 2018

@zuzzas Alerting rules are evaluated periodically, according to the configured rule evaluation interval. Any resulting alerts are then sent to the Alertmanager right after evaluation. At the time that sending happens, Prometheus checks whether a previous send for the same alert was too recent (according to the configured resend delay) and records the new "last send" time if there is a re-send.

So alert re-sends only happen right after the alerting rule is evaluated, and they simply get dropped if the resend delay hasn't been reached yet. I think the intention of the resend delay flag is mainly for situations where you have a really short evaluation interval, like 5s (so you can detect broken things fast), but you don't need to re-notify Alertmanager about already-broken things every 5s (rather, every minute or couple of minutes is sufficient).
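
To make the boundary case concrete, here is a minimal Go sketch of that check. This is not the actual Prometheus code; the function and variable names are made up for illustration, but the shape of the decision is the one described above:

package main

import (
	"fmt"
	"time"
)

// needsResend mimics the decision described above: re-send only if at least
// resendDelay has passed since the last send, measured at the moment the
// rule group is evaluated.
func needsResend(now, lastSentAt time.Time, resendDelay time.Duration) bool {
	return now.Sub(lastSentAt) >= resendDelay
}

func main() {
	evalInterval := 30 * time.Second
	resendDelay := 30 * time.Second // equal to the evaluation interval

	lastSent := time.Now()

	// The next evaluation runs roughly 30s later, but the measured gap can
	// come out a hair short of 30s. With resendDelay == evalInterval the
	// check fails, the send is skipped, and the alert only goes out on the
	// evaluation after that, i.e. about 60s after the previous send.
	nextEval := lastSent.Add(evalInterval - time.Millisecond)
	fmt.Println(needsResend(nextEval, lastSent, resendDelay)) // false
}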

zuzzas (Author) commented Dec 28, 2018

@juliusv
Thanks for explaining the inner workings and interactions of this option and evaluation_interval. The only question I have left is whether the actual race condition should be fixed, since it creates a rather peculiar and hard-to-debug problem.

brian-brazil (Member) commented Dec 29, 2018

What's the actual problem this is causing for you? We should be resilient to a failed notification.

zuzzas (Author) commented Dec 29, 2018

@brian-brazil
I am using a Dead Man's Switch alerting pattern. When v2.4.0 came out, it changed the alert resending logic. I tried to adjust by setting --rules.alert.resend-delay to the same value as evaluation_interval. Alas, I found that the resend interval may sometimes double because of the aforementioned race condition. Since this behavior came as a complete surprise to me, I am afraid that other people might be puzzled by it as well.

Right now I'd like to get your opinion on the following:

  1. Should the documentation be clarified for this particular case?
  2. Or should the race condition just be fixed, e.g. by adding a lock somewhere?
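
For context, a dead man's switch is typically an always-firing rule along these lines (the group name and labels here are only illustrative):

groups:
- name: meta
  rules:
  - alert: DeadMansSwitch
    expr: vector(1)
    labels:
      severity: none
    annotations:
      description: Always-firing alert used to verify that the whole alerting pipeline is working.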

brian-brazil (Member) commented Dec 29, 2018

I don't see anything that needs clarifying here. There's only a problem if the EndsAt doesn't allow enough slack to handle a failed resend.

brian-brazil (Member) commented Jan 26, 2019

sbueringer commented Jan 26, 2019

@brian-brazil Sorry, I deleted the comment for a moment :). I had to check whether that's really the problem in our case, because I found other errors in our Prometheus log. But I have now set the resend delay to 45 seconds, and I get an alert in Alertmanager every minute instead of every 2 minutes.

I don't know if it's a good idea, but it might make sense to set EndsAt a little bit higher than 3x the interval (maybe a second or two more), so the alert is only resolved if you really lose 3 messages instead of 2. This seems reasonable because we're effectively waiting 3x the interval before resolving the alert right now anyway.

Example:

ts=2019-01-26T09:39:20.763700367Z msg="Received alert" alert=AlertName[80cd85c][active]  end=2019-01-26T09:42:20.338042927Z > received
ts=2019-01-26T09:40:20.364404802Z msg="Received alert" alert=AlertName[80cd85c][active]  end=2019-01-26T09:43:20.338042927Z > lost 
ts=2019-01-26T09:41:20.364032739Z msg="Received alert" alert=AlertName[80cd85c][active]  end=2019-01-26T09:44:20.338042927Z > lost
ts=2019-01-26T09:42:20.341995319Z msg="Received alert" alert=AlertName[80cd85c][active]  end=2019-01-26T09:45:20.338042927Z > received

This would resolve the alert if you're unlucky, I guess.
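
Concretely, from the timestamps above: the last alert that got through carries end=09:42:20.338, and the next successful delivery only arrives at 09:42:20.342, so there is only a few-millisecond margin before the alert could flip to resolved. A tiny sketch of that arithmetic (not Alertmanager code, just the two log timestamps):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps copied from the log lines above.
	endsAt, _ := time.Parse(time.RFC3339Nano, "2019-01-26T09:42:20.338042927Z")
	nextReceived, _ := time.Parse(time.RFC3339Nano, "2019-01-26T09:42:20.341995319Z")

	// With the two intermediate sends lost, the next delivery lands just
	// after the previous EndsAt, which is the window described above.
	fmt.Println(nextReceived.Sub(endsAt)) // ~3.95ms
}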
