
Bad alert RPCs when alerting rule with recently-cleared alerts is reloaded with longer FOR clause #1871

Closed
aecolley opened this Issue Aug 4, 2016 · 8 comments

aecolley commented Aug 4, 2016

What did you do?
I extended the FOR threshold of an alert rule from 10m to 20m and triggered a reload (successfully). Several alerts were active for the rule at the time (some pending, some firing, some recently cleared).

What did you expect to see?
Pending alerts and cleared alerts should be unaffected. Firing alerts may change to any of the three states.

What did you see instead? Under which circumstances?
Alertmanager started rejecting alert RPCs with 400 Bad Request. Using tcpdump, I captured one of the RPCs. The response was:

{"error": "start time must be before end time"},
 "errorType": "bad_data",
 "status": "error"}

The corresponding part of the request (slightly redacted) was:

{"annotations": {...},
 "endsAt": "2016-08-04T07:10:37.803-04:00",
 "generatorURL": ...,
 "labels": {...},
 "startsAt": "2016-08-04T07:16:37.903-04:00"}

On inspection of the logs and the time series history for ALERTS, the sequence of events for this alert instance was:

time_t description
1470308224 alertstate="pending" for first time (active rule says FOR 10m)
1470308824 alertstate="firing" for first time
1470309003 config files and rules reloaded (rule now says FOR 20m)
1470309004 firing alert still 1, pending alert still 0
1470309037 the endsAt time of the tcpdump-captured alert RPC
1470309038 first "400 Bad Request" response from alertmanager (not captured)
1470309064 firing alert changes value from 1 to 0 (pending alert remains 0)
1470309397 the startsAt time of the tcpdump-captured alert RPC
1470309818 the time of the tcpdump-captured alert RPC

Environment

  • System information:

    uname -srm: Linux 2.6.32-573.22.1.el6.x86_64 x86_64

  • Prometheus version:

    1.0.0 (built from source with go1.6.2)

  • Alertmanager version:

    0.2.1 (built from source with go1.6.2)

  • Prometheus configuration file:
    Single config with multiple entries under rule_files, one of which defines a simple "ALERT FooDown IF up{job="foo"} == 0 FOR 10m ..." rule, which was amended from 10m to 20m while alerts generated from it were active.

  • Alertmanager configuration file:
    Not relevant.

  • Logs:
    Prometheus log snippet:

time="2016-08-04T07:07:51-04:00" level=info msg="Done checkpointing in-memory metrics and chunks in 5.113380938s." source="persistence.go:563"
time="2016-08-04T07:10:03-04:00" level=info msg="Loading configuration file (redacted).yml" source="main.go:206"
time="2016-08-04T07:10:38-04:00" level=error msg="Error sending alerts: bad response status 400 Bad Request" alertmanager="http://(redacted):(redacted)" count=29 source="notifier.go:292"

Alertmanager log snippet:

time="2016-08-04T07:09:01-04:00" level=info msg="Loading configuration file" file=(redacted).yml source="main.go:150"
time="2016-08-04T07:10:38-04:00" level=error msg="api error: {bad_data 0xc82099ae00}" source="api.go:426"

Supplementary gripes
Neither Prometheus nor Alertmanager logged the alert RPC that was bad. The Prometheus counters prometheus_notifications_dropped_total and prometheus_notifications_errors_total were incremented, and they caused metamonitoring alerts to be raised (delivered by a different Prometheus to the same Alertmanager) as expected. However, I had to resort to tcpdump to see the alert RPCs and verify what Alertmanager's error message was telling me about them.

fabxc commented Aug 4, 2016

Thanks for the elaborate report, really helpful!
I've a rough hunch what might be going on but need to set aside some time for debugging.

brian-brazil commented Feb 13, 2017

Was this fixed?

atombender commented Mar 23, 2017

I'm getting this. Is there a workaround? I'm still getting the log message even after commenting out the one rule that I changed.

shredder12 commented Sep 27, 2017

This is still an issue.

zemek commented Nov 16, 2017

I'm also experiencing this issue.

serhatcetinkaya commented Dec 26, 2017

We are also experiencing this issue: when there is an alert in the FIRING state and we change the FOR clause for that alert and reload Prometheus, we see several errors like the one below in alertmanager.log:
time="2017-12-26T08:09:08Z" level=error msg="api error: bad_data: start time must be before end time" source="api.go:716"

Conorbro commented Jan 10, 2018

I've been debugging this today and believe the original issue has been solved by Fabian in this commit.

I think the problematic line was adding the new duration to the time the alert was activated at, which could result in the alert's startTime being greater than its endTime, generating the given error.
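
A hedged sketch of that arithmetic, with made-up timestamps that loosely follow the timeline in the original report (this illustrates the described mechanism; it is not the actual Prometheus code):

package main

import (
	"fmt"
	"time"
)

func main() {
	edt := time.FixedZone("EDT", -4*3600)

	// Illustrative values: the alert went pending at activeAt and was
	// resolved about 13 minutes later, fixing its endsAt.
	activeAt := time.Date(2016, 8, 4, 6, 57, 4, 0, edt)
	endsAt := activeAt.Add(13 * time.Minute)

	oldFor := 10 * time.Minute
	newFor := 20 * time.Minute // FOR extended on reload

	// Deriving startsAt as "activation time + FOR duration" is harmless
	// with the old value...
	fmt.Println(activeAt.Add(oldFor).Before(endsAt)) // true: startsAt < endsAt

	// ...but after the reload the recomputed startsAt overtakes the
	// already-fixed endsAt, yielding the rejected payload.
	fmt.Println(activeAt.Add(newFor).Before(endsAt)) // false: startsAt > endsAt
}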

The issue was consistently reproducible in release 2.0; however, I have also managed to run into it on HEAD (plus Krasi's PR to fix Alertmanager SD).

I will update this comment when I know more.

EDIT: Using tcpdump, it seems that sometimes when the startTime of the alert is reset to zero, Prometheus still sends it, which also results in the same "start time must be before end time" error.

What seems to be happening:

  1. The endTime is set to the current time and the startTime set to zero in Prometheus.

  2. The Alertmanager then sets the received alert's startTime to the current time, which will be ahead of the endTime and result in the "start time must be before end time" error.
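
A minimal sketch of that two-step sequence (an illustration of the behaviour described above, not Alertmanager's actual implementation):

package main

import (
	"fmt"
	"time"
)

// alert carries only the fields relevant to the sequence above.
type alert struct {
	StartsAt, EndsAt time.Time
}

func main() {
	now := time.Now()

	// Step 1: Prometheus sends the resolved alert with endsAt set to the
	// current time but startsAt reset to the zero value.
	sent := alert{EndsAt: now}

	// Step 2 (receiving side, sketched): a zero startsAt is defaulted to
	// the arrival time, which is necessarily later than the endsAt that
	// the payload carried.
	received := sent
	arrival := now.Add(50 * time.Millisecond)
	if received.StartsAt.IsZero() {
		received.StartsAt = arrival
	}

	if received.EndsAt.Before(received.StartsAt) {
		fmt.Println("start time must be before end time")
	}
}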

It looks as though a similar issue persists on the Alertmanager side of things so I will close this issue and open the relevant one there.

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
