
Bad data error posting alerts to alertmanager #3543

Closed
alxmk opened this Issue Dec 4, 2017 · 19 comments

alxmk commented Dec 4, 2017

What did you do?
Ran prom/prometheus:master with prom/alertmanager:v0.11.0. I'm using master instead of v2.0.0 because I need to pick up the fixes in prometheus/tsdb#213 and #3508.

What did you expect to see?
Alerts posted to alertmanager from prometheus work as expected

What did you see instead? Under which circumstances?
In prometheus logs:

level=error ts=2017-12-04T15:57:01.4290217Z caller=notifier.go:444 component=notifier alertmanager=http://172.30.18.142:9093/api/v1/alerts count=64 msg="Error sending alert" err="bad response status 400 Bad Request"

In alertmanager logs:

level=error ts=2017-12-04T15:57:01.42907689Z caller=api.go:780 msg="API error" err="bad_data: start time must be before end time; start time must be before end time; start time must be before end time; start time must be before end time; start time must be before end time"

Environment
Kubernetes 1.8.2

  • System information:
/prometheus $ uname -srm
Linux 4.4.0-101-generic x86_64
  • Prometheus version:
/prometheus $ prometheus --version
prometheus, version 2.0.0 (branch: master, revision: 30b4439bbd795fee4703b8f5015da6bf13fee0ac)
  build user:       root@69018d7b60eb
  build date:       20171204-11:46:26
  go version:       go1.9.2
  • Alertmanager version:
alertmanager, version 0.11.0 (branch: HEAD, revision: 30dd0426c08b6479d9a26259ea5efd63bc1ee273)
  build user:       root@3e103e3fc918
  build date:       20171116-17:43:56
  go version:       go1.9.2
  • Prometheus configuration file:
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    core_env: kube
alerting:
  alertmanagers:
  - kubernetes_sd_configs:
    - api_server: null
      role: pod
      namespaces:
        names: []
    scheme: http
    timeout: 10s
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app]
      separator: ;
      regex: alertmanager
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: default
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      separator: ;
      regex: null
      replacement: $1
      action: drop
rule_files:
- /config/prometheus-2/rules
scrape_configs:
- job_name: prometheus
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - localhost:9091
- job_name: job-N
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: go_.+
    replacement: $1
    action: drop
  - source_labels: [__name__]
    separator: ;
    regex: http_.+
    replacement: $1
    action: drop
  static_configs:
  - targets:
    -  <snip>
    labels:
      <snip>
  - targets:
    - <snip>
    labels:
      <snip>
.
.
.
  • Alertmanager configuration file:
    global:
      resolve_timeout: 1h
      slack_api_url: "{{ inquisition_slack_url }}"

    route:
      group_by: ['alertname']
      group_wait: 2s
      group_interval: 10s
      receiver: slack
      # If an alert has been sent, then wait "repeat_interval" time before resending it
      repeat_interval: 1h
      routes:
      - match:
          destination: inq-enforcer
        receiver: inq-enforcer
        repeat_interval: 10m
        group_by: ['instance','alertname','hostname','privateip','publicip']
        group_wait: 2s
        group_interval: 10s
      - match:
          destination: inq-town-crier
        receiver: inq-town-crier
        repeat_interval: 10m
        group_by: ['instance','alertname','hostname','privateip','publicip']
        group_wait: 2s
        group_interval: 10s
      - match:
          destination: slack
        receiver: slack
        group_by: ['instance','alertname','hostname','privateip','publicip']
        group_wait: 2s
        group_interval: 10s
      - match:
          destination: slack-summarise
        receiver: slack-summarise
        group_by: ['alertname']
        group_wait: 1m
        group_interval: 10m
        repeat_interval: 30m
      - match:
          destination: inq-overwatch
        receiver: inq-overwatch
      - match:
          destination: pagerduty
        receiver: pagerduty
        group_by: ['instance','alertname','hostname','privateip','publicip']

    receivers:
    - name: inq-enforcer
      webhook_configs:
        - url: '<URL>'
    - name: inq-town-crier
      webhook_configs:
        - url: '<URL>'
    - name: slack
      slack_configs:
      - channel: "#{{ notification_channel }}"
        title: "[AlertManager] {{ '{{' }} .CommonAnnotations.summary {{ '}}' }}"
        text: "{{ '{{' }} .CommonLabels.alertname {{ '}}' }} - {{ '{{' }} .CommonAnnotations.description {{ '}}' }}"
    - name: slack-summarise
      slack_configs:
      - channel: "#{{ notification_channel }}"
        title: "[AlertManager] {{ '{{' }} len .Alerts {{ '}}' }} {{ '{{' }} .CommonAnnotations.grouped_summary {{ '}}' }}"
        text: "{{ '{{' }} .CommonLabels.alertname {{ '}}' }} - {{ '{{' }} len .Alerts {{ '}}' }} {{ '{{' }} .CommonAnnotations.grouped_description {{ '}}' }}"
    - name: inq-overwatch
      slack_configs:
      - channel: "#{{ inquisition_notification_channel }}"
        title: "{{ '{{' }} .CommonAnnotations.summary {{ '}}' }}"
        text: "{{ '{{' }} .CommonAnnotations.description {{ '}}' }}"
      webhook_configs:
        - url: '<URL>'
    - name: pagerduty
      pagerduty_configs:
      - service_key: "{{ inquisition_monitoring_pagerduty_key }}"
  • Logs:
    Prom:
level=info ts=2017-12-04T15:53:31.333931883Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.0.0, branch=master, revision=36190239f6d965f5e009da2fae8b502c09399df5)"
level=info ts=2017-12-04T15:53:31.334009371Z caller=main.go:221 build_context="(go=go1.9.2, user=root@9d041e93ac17, date=20171130-13:36:53)"
level=info ts=2017-12-04T15:53:31.334034568Z caller=main.go:222 host_details="(Linux 4.4.0-72-generic #93-Ubuntu SMP Fri Mar 31 14:07:41 UTC 2017 x86_64 prometheus-2-996bbbd66-4h47w (none))"
level=info ts=2017-12-04T15:53:31.338234558Z caller=main.go:398 msg="Starting TSDB ..."
level=info ts=2017-12-04T15:53:31.338073315Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=info ts=2017-12-04T15:53:31.338340621Z caller=web.go:380 component=web msg="Start listening for connections" address=:9091
level=info ts=2017-12-04T15:54:11.407839167Z caller=main.go:408 msg="TSDB started"
level=info ts=2017-12-04T15:54:11.407932743Z caller=main.go:490 msg="Loading configuration file" filename=/config/prometheus-2/prometheus
level=info ts=2017-12-04T15:54:12.74356221Z caller=kubernetes.go:100 component=notifier discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2017-12-04T15:54:12.751527333Z caller=main.go:385 msg="Server is ready to receive requests."
level=error ts=2017-12-04T15:57:01.4290217Z caller=notifier.go:444 component=notifier alertmanager=http://172.30.18.142:9093/api/v1/alerts count=64 msg="Error sending alert" err="bad response status 400 Bad Request"
level=error ts=2017-12-04T15:58:01.456900885Z caller=notifier.go:444 component=notifier alertmanager=http://172.30.18.142:9093/api/v1/alerts count=64 msg="Error sending alert" err="bad response status 400 Bad Request"
level=error ts=2017-12-04T15:58:01.464241747Z caller=notifier.go:444 component=notifier alertmanager=http://172.30.18.142:9093/api/v1/alerts count=64 msg="Error sending alert" err="bad response status 400 Bad Request"

Alertmanager:

.
.
.
level=info ts=2017-12-04T15:40:50.614454519Z caller=silence.go:279 component=silences msg="Maintenance done" duration=75.744µs
level=info ts=2017-12-04T15:40:50.62667297Z caller=nflog.go:304 component=nflog msg="Maintenance done" duration=99.391µs
level=info ts=2017-12-04T15:55:50.609303021Z caller=nflog.go:287 component=nflog msg="Running maintenance"
level=info ts=2017-12-04T15:55:50.609515273Z caller=silence.go:262 component=silences msg="Running maintenance"
level=info ts=2017-12-04T15:55:50.61830075Z caller=silence.go:279 component=silences msg="Maintenance done" duration=63.853µs
level=info ts=2017-12-04T15:55:50.631665188Z caller=nflog.go:304 component=nflog msg="Maintenance done" duration=85.098µs
level=error ts=2017-12-04T15:57:01.42907689Z caller=api.go:780 msg="API error" err="bad_data: start time must be before end time; start time must be before end time; start time must be before end time; start time must be before end time; start time must be before end time"
level=error ts=2017-12-04T15:58:01.456887499Z caller=api.go:780 msg="API error" err="bad_data: start time must be before end time; start time must be before end time; start time must be before end time; start time must be before end time"
level=error ts=2017-12-04T15:58:01.46423974Z caller=api.go:780 msg="API error" err="bad_data: start time must be before end time; start time must be before end time; start time must be before end time"

Alertmanager had been up for 3 days prior to this issue (which occurred when I updated the Prometheus container), so I cut the rest of the logs out for clarity. I can provide them if needed.

zemek (Contributor) commented Dec 4, 2017

probably the same thing as #1871

alxmk (Author) commented Dec 5, 2017

I didn't change any config during the migration; maybe it's a change in how the alerts are bookended which gives similar behaviour, I guess? It's something that changed between 2.0.0 and now, because the errors go away when I revert.

brian-brazil (Member) commented Dec 8, 2017

Can you check if the time is correct on both machines?

alxmk (Author) commented Dec 8, 2017

Yes it is. Both are synced to the same NTP source, and as a manual sanity check both came up with the same time when I ran the command simultaneously from within the containers.

yylt commented Dec 13, 2017

startsAt = alert.ActiveAt.Add(rule.holdDuration)
When holdDuration is big enough, this will cause the problem.

simonpasquier (Member) commented Jan 22, 2018

I can reproduce the bug on master (and 2.1) while it doesn't occur on 2.0.

Configuration and steps to reproduce

Prometheus configuration

global:
  scrape_interval:     15s
  evaluation_interval: 30s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

rule_files:
  - "*rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
        - targets: ['localhost:9090', "localhost:8081"]

Rules

groups:
- name: test
  interval: 15s
  rules:
  - alert: TargetDown
    expr: up == 0
    for: 1m
    labels:
      severity: test

AlertManager configuration

global:
  resolve_timeout: 5m
route:
  group_by: [alertname]
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 3h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.2:8080/'

Steps

  • Stop the localhost:8081 target.
  • Wait for the alert to fire and the notification to be received.
  • Start localhost:8081 again.

The receiver never gets the resolved notification because Prometheus sends an invalid payload to AlertManager. From the AlertManager logs:

level=error ts=2018-01-22T14:19:48.749291362Z caller=api.go:825 msg="API error" err="bad_data: start time must be before end time"                                                                                 

IIUC the problem is located here

if a.State != StateInactive {
a.State = StateInactive
a.ResolvedAt = ts
a.FiredAt = time.Time{}
}

Basically, whenever an alert transitions from firing to inactive, FiredAt is set to the zero time, and that zero value is then used as the alert's StartsAt:

a := &notifier.Alert{
    StartsAt:     alert.FiredAt,
    Labels:       alert.Labels,
    Annotations:  alert.Annotations,
    GeneratorURL: externalURL + strutil.TableLinkForExpression(expr),
}

The rest is explained in prometheus/alertmanager#1191 (cc @Conorbro).
Commenting out L269 seems to solve the issue.
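
For illustration, a minimal sketch of what dropping that line amounts to (field names taken from the snippets above, not the actual patch in #3724):

if a.State != StateInactive {
    a.State = StateInactive
    a.ResolvedAt = ts
    // Leave a.FiredAt untouched so the notifier keeps sending the original
    // firing time as StartsAt rather than the zero time.Time value.
}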

solsson commented Jan 23, 2018

This is a serious regression in 2.1, requiring pod restarts. Please prioritize, as downgrading is unsupported due to persistence.

bittopaz commented Jan 24, 2018

We are experiencing the same problem after upgrading to v2.1.0. Please prioritize.

jbiel commented Jan 29, 2018

I'm just hopping into the Prometheus ecosystem with the latest stable versions and experienced this bug during the first alerting test that I performed. I've checked that time is NTP synced on all nodes.

  • Created an alert for up == 0 after 30s
  • Killed node_exporter on one instance
  • The alert was received
  • Waited ~5 minutes, then restarted node_exporter on the "failed" instance
  • Expected the alert to clear; instead got this never-ending error in the Alertmanager log. The alert did manage to clear itself eventually somehow, but the log errors continued.

Prometheus: 2.1.0
Alertmanager: 0.13.0

hectorag commented Jan 30, 2018

I'm also seeing the same behaviour after updating to version 2.1.0.
My system:
  • Prometheus v2.1.0
  • Alertmanager v0.9.1

time="2018-01-30T16:50:59Z" level=error msg="api error: bad_data: start time must be before end time; start time must be before end time" source="api.go:716"

jsuchenia commented Jan 30, 2018

@hectorag in this case I suggest you stay on 2.0.0 and upgrade Alertmanager to a recent version.

bastischubert commented Jan 31, 2018

We (@bastischubert + @RichiH) are experiencing the same problem after upgrading to v2.1.0, running AM 0.13.0.

am side:
Jan 31 21:30:52 prometheus11.xxx.xx alertmanager[1666]: level=error ts=2018-01-31T20:30:52.210439868Z caller=api.go:803 msg="API error" err="bad_data: start time must be before end time"

prom side:
Jan 31 21:30:52 prometheus11.xxx.xx prometheus[32630]: level=error ts=2018-01-31T20:30:52.210676778Z caller=notifier.go:454 component=notifier alertmanager=http://prometheus11.xxx.xx:9093/api/v1/alerts count=1 msg="Error sending alert" err="bad response status 400 Bad Request"

Later I managed to capture some API calls (like this one):

POST /api/v1/alerts HTTP/1.1
Host: prometheus11.xxx.xx:9093
User-Agent: Go-http-client/1.1
Content-Length: 651
Content-Type: application/json

[{"labels":{"alertname":"watchdog_1_4","customer":"xx","instance":"xx.xx.xx:80","job":"watchdog_exporter","monitor":"prometheus10.xxx.xx","severity":"warning","team":"cloudhosting","ucm_area":"xxxx","ucm_request":"xxxx","watchdog_id":"xxx-management-win"},"annotations":{"description":"Watchdog: xxx-management-win, UCM: xxxxxx","summary":"Watchdog: xxx-management-win, UCM: xxxx"},"startsAt":"0001-01-01T00:00:00Z","endsAt":"2018-01-31T21:49:06.661888851+01:00","generatorURL":"http://prometheus11.xxx.xxx:9090/graph?g0.expr=sn_watchdog_alert%7Bucm_area%21%3D%22ops%22%7D+%3E+0+%3C+5\u0026g0.tab=1"}]

HTTP/1.1 400 Bad Request
Access-Control-Allow-Headers: Accept, Authorization, Content-Type, Origin
Access-Control-Allow-Methods: GET, DELETE, OPTIONS
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Date
Content-Type: application/json
Date: Wed, 31 Jan 2018 20:49:21 GMT
Content-Length: 86

{"status":"error","errorType":"bad_data","error":"start time must be before end time"}

The JSON "startsAt":"0001-01-01T00:00:00Z" looks fouled up to me; the other instance, still on prom 1.8.2, sends proper startsAt times.

Both are running on the same machine, so time sync should be fine ;)
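
For what it's worth, "0001-01-01T00:00:00Z" is just how Go serialises an unset (zero-value) time.Time, which matches the FiredAt reset discussed above. A minimal sketch to confirm (standard library only, not Prometheus code):

package main

import (
    "encoding/json"
    "fmt"
    "time"
)

func main() {
    // An unset time.Time marshals to Go's zero time, which is exactly the
    // startsAt value seen in the captured request above.
    alert := struct {
        StartsAt time.Time `json:"startsAt"`
        EndsAt   time.Time `json:"endsAt"`
    }{EndsAt: time.Now()}

    b, _ := json.Marshal(alert)
    fmt.Println(string(b)) // {"startsAt":"0001-01-01T00:00:00Z","endsAt":"..."}
}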

simonpasquier (Member) commented Feb 1, 2018

@bastischubert the bug was introduced by commit 6246137#diff-b4ad1f52631fd3bd0de49dd2ed5f0d01R269. There's #3724, which removes the offending line, but it isn't merged yet.

mst-ableton commented Feb 1, 2018

We hit this on 2.1.0 with a heartbeat alert ((hour() == 7) and (minute() > 0) and (minute() < 10)), and can confirm that rolling back to 2.0.0 resolved the issue. Be advised that rolling back to 2.0.0 is a destructive operation and requires deleting the TSDB.

jsuchenia commented Feb 4, 2018

@Conorbro What are the plans/schedule for releasing the fix (#3724) for this bug?

RRAlex commented Feb 9, 2018

I don't see the log line anymore when running the :master images from Docker Hub (Prometheus + Alertmanager).

kamusin commented Mar 13, 2018

Cheers! I just came across this issue on our 2.1 cluster.

simonpasquier (Member) commented Apr 23, 2018

The fix is available in v2.2.0, but since you're running 1.8.1 you shouldn't be affected. Please ask on the Prometheus mailing list if you need more help.

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019