
Alertmanager cluster sends duplicate notifications #2222

Open
cycwll opened this issue Apr 1, 2020 · 2 comments

Comments


cycwll commented Apr 1, 2020

What did you do?
3 Prometheus nodes for HA
3 Alertmanager nodes for HA

alert01 startup command:
/bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager --log.level=debug --cluster.listen-address=prom01:9094

alert02 startup command:
/bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager --log.level=debug --cluster.listen-address=prom02:9094 --cluster.peer=prom01:9094

alert03 startup command:
/bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager --log.level=debug --cluster.listen-address=prom03:9094 --cluster.peer=prom01:9094

And "Cluster Status" status is ready on all alertmanager node.

What did you expect to see?
When an instance-down alert fires, I expect to receive only one notification.

What did you see instead? Under which circumstances?
When an instance-down alert fires, I sometimes receive two notifications (and sometimes only one).

Environment

  • System information:

    Linux 4.12.14-94.41-default x86_64

  • Alertmanager version:

alertmanager, version 0.20.0 (branch: HEAD, revision: f74be04)
build user: root@00c3106655f8
build date: 20191211-14:13:14
go version: go1.13.5

  • Prometheus version:

prometheus, version 2.16.0 (branch: HEAD, revision: b90be6f32a33c03163d700e1452b54454ddce0ec)
build user: root@7ea0ae865f12
build date: 20200213-23:50:02
go version: go1.13.8

  • Alertmanager configuration file:
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 60s
  repeat_interval: 1h
  receiver: 'wechat'
  • Prometheus configuration file:
global:
  scrape_interval:     30s
  scrape_timeout:      30s
  evaluation_interval: 30s
alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - prom01:9093
       - prom02:9093
       - prom03:9093
  • Logs:
**alert01 logs**:
As shown at line **1**, at 23:54:24 node-01 received the firing_alerts entry from node-03, but at line **2** (23:54:55) node-01 still sent the same alert notification.

**1** level=debug ts=2020-03-31T23:54:24.565Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\\"InstanceDown\\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp:<seconds:1585698864 nanos:495618381 > firing_alerts:4666247465654023712 > expires_at:<seconds:1586130864 nanos:495618381 > "
level=debug ts=2020-03-31T23:54:43.838Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:43 [DEBUG] memberlist: Initiating push/pull sync with: 10.188.53.37:9094\n"
level=debug ts=2020-03-31T23:54:46.184Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:46 [DEBUG] memberlist: Stream connection from=10.188.53.150:42816\n"
level=debug ts=2020-03-31T23:54:53.580Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.580Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.583Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:53.589Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]

**2** level=debug ts=2020-03-31T23:54:55.781Z caller=wechat.go:182 integration=wechat response="{\"errcode\":0,\"errmsg\":\"ok\",\"invaliduser\":\"\"}" incident="{}:{alertname=\"InstanceDown\"}"


**alert03 logs**: at 23:54:24 (line **1**), node-03 sent a notification, and then at 23:54:55 (line **2**) it received the firing_alerts entry from node-01.
**1**  level=debug ts=2020-03-31T23:54:24.495Z caller=wechat.go:182 integration=wechat response="{\"errcode\":0,\"errmsg\":\"ok\",\"invaliduser\":\"\"}" incident="{}:{alertname=\"InstanceDown\"}"
level=debug ts=2020-03-31T23:54:43.839Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:43 [DEBUG] memberlist: Stream connection from=10.188.53.29:40128\n"
level=debug ts=2020-03-31T23:54:53.579Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.579Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.583Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:53.589Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]

**2** level=debug ts=2020-03-31T23:54:55.813Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\\"InstanceDown\\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp:<seconds:1585698895 nanos:781364302 > firing_alerts:4666247465654023712 > expires_at:<seconds:1586130895 nanos:781364302 > "
  

**alert02 logs**: received the msg="gossiping new entry" messages from node-03 and node-01 at 23:54:24 and 23:54:55 respectively.
level=debug ts=2020-03-31T23:54:24.564Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\\"InstanceDown\\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp:<seconds:1585698864 nanos:495618381 > firing_alerts:4666247465654023712 > expires_at:<seconds:1586130864 nanos:495618381 > "
level=debug ts=2020-03-31T23:54:46.184Z caller=cluster.go:306 component=cluster memberlist="2020/04/01 07:54:46 [DEBUG] memberlist: Initiating push/pull sync with: 10.188.53.29:9094\n"
level=debug ts=2020-03-31T23:54:53.578Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.578Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=InstanceDown[e4a505b][active]
level=debug ts=2020-03-31T23:54:53.582Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:53.589Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=WinNodeFilesystemUsage[f5be04f][active]
level=debug ts=2020-03-31T23:54:55.811Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertname=\\\"InstanceDown\\\"}\" receiver:<group_name:\"Atlassian\" integration:\"wechat\" > timestamp:<seconds:1585698895 nanos:781364302 > firing_alerts:4666247465654023712 > expires_at:<seconds:1586130895 nanos:781364302 > "
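
The `timestamp` fields in the nflog entries above are Unix seconds, so they can be decoded directly to confirm that these are two separate sends for the same group key, 31 seconds apart (a sketch using GNU `date` on a Linux shell like the one in the report):

```
# Decode the two gossip-entry timestamps seen in the logs above (GNU date).
date -u -d @1585698864 "+%FT%TZ"   # 2020-03-31T23:54:24Z -- entry gossiped after node-03's send
date -u -d @1585698895 "+%FT%TZ"   # 2020-03-31T23:54:55Z -- entry gossiped after node-01's send
echo $((1585698895 - 1585698864))  # 31 seconds between the two notifications
```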

devinodaniel commented Apr 1, 2020

I had some luck with minimizing duplicate notifications by tweaking the --cluster.pushpull-interval and --cluster.gossip-interval flags in the alertmanager startup command to values other than the default. I started from the defaults of 1m0s and 200ms respectively and changed them drastically until I got either more or fewer notifications, then slowly narrowed it down. It was quite painstaking.
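
A sketch of what such a startup command might look like (the interval values below are purely illustrative, not a recommendation; the defaults are 200ms and 1m0s as mentioned above):

```
# Sketch: set the gossip and push/pull intervals explicitly instead of relying on the defaults.
/bin/alertmanager --config.file=/etc/alertmanager/config.yml --storage.path=/alertmanager \
  --cluster.listen-address=prom01:9094 \
  --cluster.peer=prom02:9094 --cluster.peer=prom03:9094 \
  --cluster.gossip-interval=400ms \
  --cluster.pushpull-interval=30s
```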

To me, it seems to be related to the latency between the alertmanagers over the wire. For instance, I have 4 alertmanagers communicating over a tunnel between NYC and CA, and sometimes it's fast, but sometimes, because of high ISP latency, their communication is slow. It would be nice to know if you have the same luck. I still get duplication of 2 to 3 notifications occasionally, but I'd rather get multiple alerts than none.


cycwll commented Apr 2, 2020

@devinodaniel
Thanks for your help. My nodes are on the same LAN, so latency between the Alertmanagers over the wire is low. At present I only receive repeated notifications occasionally (about 5% of the time). From your description it sounds like repeated notifications are unavoidable; I will try your suggestions.
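
For anyone debugging a similar setup, each node's view of the cluster can also be inspected via the v2 status API (a sketch, assuming the hostnames from the report and that `jq` is available):

```
# Sketch: print each Alertmanager's view of the cluster (peer list and ready/settling status).
for host in prom01 prom02 prom03; do
  curl -s "http://${host}:9093/api/v2/status" | jq '.cluster'
done
```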
