Bad alert RPCs when alerting rule with recently-cleared alerts is reloaded with longer FOR clause #1871
Comments
Thanks for the elaborate report, really helpful!
fabxc added the kind/bug, priority/P1 and component/notify labels Aug 4, 2016
Was this fixed?
I'm getting this. Is there a workaround? I'm still getting the log message even after commenting out the one rule that I changed.
brian-brazil added the help wanted label Jul 14, 2017
shredder12 commented Sep 27, 2017
This is still an issue.
gouthamve added the hacktoberfest label Sep 28, 2017
I'm also experiencing this issue.
serhatcetinkaya commented Dec 26, 2017
We are also experiencing this issue. When there is an alert in FIRING state, if we change the FOR clause for that alert and reload Prometheus, we see several errors like the one below in alertmanager.log:
I've been debugging this today and believe the original issue has been solved by Fabian in this commit. I think the problematic line was adding the new duration to the time the alert was activated at, which could result in the alert's startsAt ending up after its endsAt. The issue was consistently reproducible in release 2.0; however, I have also managed to run into it on HEAD (+ Krasi's PR to fix Alertmanager SD). I will update this comment when I know more.
EDIT: Using tcpdump, it seems that sometimes when the startTime of the alert is reset to zero, Prometheus still sends it, which also results in the same "start time must be before end time" error. What seems to be happening:
It looks as though a similar issue persists on the Alertmanager side of things, so I will close this issue and open the relevant one there.
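To make the suspected arithmetic concrete, here is a minimal Go sketch (my own illustration, not Prometheus source). The activation time is an assumption, inferred by subtracting the new 20m FOR duration from the startsAt seen in the tcpdump capture in the original report; with that assumption, adding the longer FOR duration to the activation time lands after the endsAt of the already-resolved alert, which is exactly the payload shape Alertmanager rejects.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	loc := time.FixedZone("UTC-4", -4*60*60)

	// Assumption: the alert condition first held here. Inferred from the
	// captured startsAt (07:16:37.903) minus the new 20m FOR duration.
	activeAt := time.Date(2016, 8, 4, 6, 56, 37, 903000000, loc)

	// From the captured payload in the original report: the alert had
	// already resolved at 07:10:37.803.
	endsAt := time.Date(2016, 8, 4, 7, 10, 37, 803000000, loc)

	newFor := 20 * time.Minute // FOR raised from 10m to 20m on reload

	// Hypothesised buggy recomputation on rule reload: activation time
	// plus the *new* FOR duration.
	startsAt := activeAt.Add(newFor) // 07:16:37.903, i.e. after endsAt

	// This is the condition behind Alertmanager's
	// "start time must be before end time" rejection.
	if endsAt.Before(startsAt) {
		fmt.Printf("rejected: startsAt %s is after endsAt %s\n",
			startsAt.Format(time.RFC3339Nano), endsAt.Format(time.RFC3339Nano))
	}
}
```

Under this assumption the computed startsAt matches the captured one to the millisecond, which is what makes the recomputation-on-reload explanation plausible.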
Conorbro referenced this issue Jan 10, 2018: Alertmanager incorrectly handling newly resolved alert default startTime when zero #1191 (Closed)
Conorbro closed this Jan 11, 2018
lock bot commented Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
aecolley commented Aug 4, 2016
What did you do?
I extended the FOR threshold of an alert rule from 10m to 20m and triggered a reload (successfully). Several alerts were active for the rule at the time (some pending, some firing, some recently cleared).
What did you expect to see?
Pending alerts and cleared alerts should be unaffected. Firing alerts may change to any of the three states.
What did you see instead? Under which circumstances?
Alertmanager started rejecting alert RPCs with 400 Bad Request. Using tcpdump, the RPC was captured. The response was:
{"error": "start time must be before end time"}, "errorType": "bad_data", "status": "error"}The (slightly redacted) part of the request was:
{"annotations": {...}, "endsAt": "2016-08-04T07:10:37.803-04:00", "generatorURL": ..., "labels": {...}, "startsAt": "2016-08-04T07:16:37.903-04:00"}On inspection of logs and the timeseries history for
ALERTS, the sequence of events for this alert instance was:endsAttime of the tcpdump-captured alert RPCstartsAttime of the tcpdump-captured alert RPCEnvironment
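For reference, here is a minimal Go sketch that applies the same startsAt/endsAt comparison to the captured request above; the struct and check are my own simplification, not Alertmanager's actual code, but they show why this payload can never be accepted no matter how often it is retried.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"
)

// postedAlert keeps only the two timestamp fields relevant to the error.
type postedAlert struct {
	StartsAt time.Time `json:"startsAt"`
	EndsAt   time.Time `json:"endsAt"`
}

func main() {
	// Timestamps copied from the tcpdump-captured request body above.
	captured := `{"endsAt": "2016-08-04T07:10:37.803-04:00", "startsAt": "2016-08-04T07:16:37.903-04:00"}`

	var a postedAlert
	if err := json.Unmarshal([]byte(captured), &a); err != nil {
		log.Fatal(err)
	}

	// A start time after the end time is rejected as bad_data.
	if a.EndsAt.Before(a.StartsAt) {
		fmt.Println("start time must be before end time")
	}
}
```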
Environment
System information:
uname -srm: Linux 2.6.32-573.22.1.el6.x86_64 x86_64
Prometheus version:
1.0.0 (built from source with go1.6.2)
Alertmanager version:
0.2.1 (built from source with go1.6.2)
Prometheus configuration file:
Single config with multiple entries under rule_files, one of which defines a simple "ALERT FooDown IF up{job="foo"} == 0 FOR 10m..." rule, which was amended from 10m to 20m while alerts generated from it were active.
Alertmanager configuration file:
Not relevant.
Logs:
Prometheus log snippet:
Alertmanager log snippet:
Supplementary gripes
Neither Prometheus nor Alertmanager logged the alert RPC that was bad. The Prometheus counters prometheus_notifications_dropped_total and prometheus_notifications_errors_total were incremented, and they caused metamonitoring alerts to be raised (delivered by a different Prometheus to the same Alertmanager) as expected. However, I had to resort to tcpdump to see the alert RPCs and verify what Alertmanager's error message was telling me about them.
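One way to see the alert RPC bodies without tcpdump is a small logging reverse proxy placed between Prometheus and Alertmanager, with Prometheus' Alertmanager URL pointed at the proxy. The sketch below is only an illustration (nothing either project ships); the listen port :9094 and the Alertmanager address are assumptions to adjust for your setup.

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Assumed Alertmanager address; adjust as needed.
	target, err := url.Parse("http://localhost:9093")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Read and log the alert RPC body, then reattach it so it can be forwarded.
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		r.Body.Close()
		log.Printf("%s %s %s", r.Method, r.URL.Path, body)

		r.Body = io.NopCloser(bytes.NewReader(body))
		r.ContentLength = int64(len(body))
		proxy.ServeHTTP(w, r)
	})

	// Point Prometheus at this address instead of Alertmanager directly.
	log.Fatal(http.ListenAndServe(":9094", nil))
}
```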