
Alertmanager cluster sends duplicate notifications #2222

Open
cycwll opened this issue Apr 1, 2020 · 2 comments

Comments


cycwll commented Apr 1, 2020

What did you do?
3 Prometheus nodes for HA
3 Alertmanager nodes for HA

alert01 startup command:
/bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager --log.level=debug --cluster.listen-address=prom01:9094

alert02 startup command:
/bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager --log.level=debug --cluster.listen-address=prom02:9094 --cluster.peer=prom01:9094

alert03 startup command:
/bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager --log.level=debug --cluster.listen-address=prom03:9094 --cluster.peer=prom01:9094

And "Cluster Status" status is ready on all alertmanager node.

What did you expect to see?
When an instance-down alert fires, I expect to receive only one notification.

What did you see instead? Under which circumstances?
When an instance-down alert fires, I sometimes receive two notifications (and sometimes only one).

Environment

  • System information:

    Linux 4.12.14-94.41-default x86_64

  • Alertmanager version:

alertmanager, version 0.20.0 (branch: HEAD, revision: f74be04)
build user: root@00c3106655f8
build date: 20191211-14:13:14
go version: go1.13.5

  • Prometheus version:

prometheus, version 2.16.0 (branch: HEAD, revision: b90be6f32a33c03163d700e1452b54454ddce0ec)
build user: root@7ea0ae865f12
build date: 20200213-23:50:02
go version: go1.13.8

  • Alertmanager configuration file:
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 60s
  repeat_interval: 1h
  receiver: 'wechat'
  • Prometheus configuration file:
global:
  scrape_interval:     30s
  scrape_timeout:      30s
  evaluation_interval: 30s
alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - prom01:9093
       - prom02:9093
       - prom03:9093
  • Logs:
**alert01 logs**:
As shown at line **1**, at 23:54:24 node-01 received the firing_alerts entry from node-03, but at line **2** (23:54:55) node-01 still sent the same alert notification.

**1** level=debug ts=2020-03-31T23:54:24.565Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\\"InstanceDown\\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp:<seconds:1585698864 nanos:495618381 > firing_alerts:4666247465654023712 > expires_at:<seconds:1586130864 nanos:495618381 > "
level=debug ts=2020-03-31T23:54:43.838Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:43 [DEBUG] memberlist: Initiating push/pull sync with: 10.188.53.37:9094\n"
level=debug ts=2020-03-31T23:54:46.184Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:46 [DEBUG] memberlist: Stream connection from=10.188.53.150:42816\n"
level=debug ts=2020-03-31T23:54:53.580Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.580Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.583Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:53.589Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]

**2** level=debug ts=2020-03-31T23:54:55.781Z caller=wechat.go:182 integration=wechat response="{\"errcode\":0,\"errmsg\":\"ok\",\"invaliduser\":\"\"}" incident="{}:{alertname=\"InstanceDown\"}"


**alert03 logs**: at 23:54:24 (line **1**), node-03 sent a notification, and then at 23:54:55 (line **2**) it received the firing_alerts entry from node-01.
**1**  level=debug ts=2020-03-31T23:54:24.495Z caller=wechat.go:182 integration=wechat response="{\"errcode\":0,\"errmsg\":\"ok\",\"invaliduser\":\"\"}" incident="{}:{alertname=\"InstanceDown\"}"
level=debug ts=2020-03-31T23:54:43.839Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:43 [DEBUG] memberlist: Stream connection from=10.188.53.29:40128\n"
level=debug ts=2020-03-31T23:54:53.579Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.579Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.583Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:53.589Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]

**2** level=debug ts=2020-03-31T23:54:55.813Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\\"InstanceDown\\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp:<seconds:1585698895 nanos:781364302 > firing_alerts:4666247465654023712 > expires_at:<seconds:1586130895 nanos:781364302 > "
  

**alert02 logs**: received the msg="gossiping new entry" messages from node-03 and node-01 at 23:54:24 and 23:54:55 respectively.
level=debug ts=2020-03-31T23:54:24.564Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\\"InstanceDown\\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp:<seconds:1585698864 nanos:495618381 > firing_alerts:4666247465654023712 > expires_at:<seconds:1586130864 nanos:495618381 > "
level=debug ts=2020-03-31T23:54:46.184Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:46 [DEBUG] memberlist: Initiating push/pull sync with: 10.188.53.29:9094\n"
level=debug ts=2020-03-31T23:54:53.578Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.578Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.582Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:53.589Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:55.811Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\\"InstanceDown\\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp:<seconds:1585698895 nanos:781364302 > firing_alerts:4666247465654023712 > expires_at:<seconds:1586130895 nanos:781364302 > "
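
The `timestamp` fields in the nflog entries above are Unix seconds, so they can be decoded directly to confirm that these are two separate sends for the same group key, 31 seconds apart (a sketch using GNU `date` on a Linux shell like the one in the report):

```
# Decode the two gossip-entry timestamps seen in the logs above (GNU date).
date -u -d @1585698864 "+%FT%TZ"   # 2020-03-31T23:54:24Z -- entry gossiped after node-03's send
date -u -d @1585698895 "+%FT%TZ"   # 2020-03-31T23:54:55Z -- entry gossiped after node-01's send
echo $((1585698895 - 1585698864))  # 31 seconds between the two notifications
```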

devinodaniel commented Apr 1, 2020

I had some luck with minimizing duplicate notifications by tweaking the --cluster.pushpull-interval and --cluster.gossip-interval flags in the alertmanager startup command to values other than the default. I started from the defaults of 1m0s and 200ms respectively and changed them drastically until I got either more or fewer notifications, then slowly narrowed it down. It was quite painstaking.
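
A sketch of what such a startup command might look like (the interval values below are purely illustrative, not a recommendation; the defaults are 200ms and 1m0s as mentioned above):

```
# Sketch: set the gossip and push/pull intervals explicitly instead of relying on the defaults.
/bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager \
  --cluster.listen-address=prom01:9094 \
  --cluster.peer=prom02:9094 --cluster.peer=prom03:9094 \
  --cluster.gossip-interval=400ms \
  --cluster.pushpull-interval=30s
```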

To me, it seems to be related to the latency between the alertmanagers over the wire. For instance, I have 4 alertmanagers communicating over a tunnel between NYC and CA, and sometimes it's fast, but sometimes, because of high ISP latency, their communication is slow. It would be nice to know if you have the same luck. I still get duplication of 2 to 3 notifications occasionally, but I'd rather get multiple alerts than none.


cycwll commented Apr 2, 2020

@devinodaniel
Thanks for your help. My nodes are on the same LAN, so latency between the Alertmanagers over the wire is low. At present I only receive repeated notifications occasionally (about 5% of the time). From your description it sounds like repeated notifications are unavoidable; I will try your suggestions.
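
For anyone debugging a similar setup, each node's view of the cluster can also be inspected via the v2 status API (a sketch, assuming the hostnames from the report and that `jq` is available):

```
# Sketch: print each Alertmanager's view of the cluster (peer list and ready/settling status).
for host in prom01 prom02 prom03; do
  curl -s "http://${host}:9093/api/v2/status" | jq '.cluster'
done
```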
