DNS discovery fails to sync if AlertManager has connection timeout #8768

Closed
pballok-logmein opened this issue Apr 28, 2021 · 25 comments · Fixed by #10948

pballok-logmein commented Apr 28, 2021

What did you do?
Started Prometheus with two AlertManagers alive and receiving alerts. I have a DNS service that lists both AlertManager IPs for the URL I'm using to connect to AlertManager.
I then stopped one of the two AlertManagers, so only one was left running. I confirmed that my DNS service now lists only one IP, the one that was still active. So if Prometheus queries the DNS during service discovery, it will now see only one IP.
The AlertManager timeout is set to 10s, and the evaluation interval is 30s.

What did you expect to see?
I expected to see some error messages related to failing to connect to the now-dead AlertManager.
After 30s (I use the default DNS discovery refresh interval), the DNS discovery should update the list of AlertManagers to contain only a single IP; the errors should then stop and the obsolete IP should no longer be used.

What did you see instead? Under which circumstances?

  1. The Notifier was waiting inside its Run function (func (n *Manager) Run(tsets <-chan map[string][]*targetgroup.Group)) for either alerts that needed to be sent to AlertManager, or a sync message from the DNS discovery.
  2. When sending alerts started to run into timeout errors (the old IP is no longer reachable), sending took a lot longer, so by the time the select started to wait again, there were already new alerts waiting to be sent.
  3. So it took those new alerts, tried to send them, ran into the timeout again, and so on. It never received the sync messages on the sync channel from the DNS discovery, and fell into an endless loop of failing to reach the long-dead AlertManager.
  4. Hours later it finally recovered, when the stream of new alerts arriving every few seconds paused, so it had time to wait on the sync channel, finally noticed the DNS discovery update, refreshed the IPs, and everything worked fine after that.

Note that the channel used for the DNS discovery sync is not buffered, so the Notifier will only see those sync messages on the channel if it happens to be waiting inside the select at that moment. When it is busy trying to send the alerts, it will ignore the sync channel.

I tried a small fix that turns the sync channel into a buffered channel with a queue size of 1. It solved this issue: recovery was instant when one of the AlertManagers went offline.
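
For illustration, here is a minimal, self-contained sketch of the dynamics described above (not the actual Prometheus code; syncCh, moreAlerts and the timings are all made up): a discovery goroutine publishes updates with a non-blocking send while the receiver is kept busy sending alerts.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	syncCh := make(chan string) // unbuffered, as described above
	// syncCh := make(chan string, 1) // the proposed fix: a buffer of one

	moreAlerts := make(chan struct{}, 1)

	// Rule evaluation side: there is (almost) always a fresh alert batch waiting.
	go func() {
		for {
			select {
			case moreAlerts <- struct{}{}:
			default:
			}
			time.Sleep(100 * time.Millisecond)
		}
	}()

	// Discovery side: publish updated targets with a non-blocking send,
	// mirroring the "default:" branch described for the discovery manager.
	go func() {
		for i := 0; ; i++ {
			select {
			case syncCh <- fmt.Sprintf("targets-v%d", i):
				fmt.Println("discovery: delivered update", i)
			default:
				fmt.Println("discovery: receiver busy, dropped update", i)
			}
			time.Sleep(500 * time.Millisecond)
		}
	}()

	// Notifier side: either pick up new targets or send the pending alerts.
	deadline := time.After(5 * time.Second) // stop the demo after a while
	for {
		select {
		case <-deadline:
			return
		case ts := <-syncCh:
			fmt.Println("notifier: reloaded", ts)
		case <-moreAlerts:
			// Sending to a dead AlertManager runs into the timeout, so the
			// select is almost never parked waiting on syncCh.
			time.Sleep(1 * time.Second)
		}
	}
}
```

With the unbuffered channel, almost every update is dropped; switching to a buffer of one lets an update sit in the channel until the notifier's next loop iteration picks it up.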

Environment

  • System information:

    Darwin 20.3.0 x86_64

  • Prometheus version:

prometheus, version 2.26.0 (branch: main, revision: f3b2d2a)
build user: pballok@pballok-MBP
build date: 20210428-22:20:08
go version: go1.16.2
platform: linux/amd64

  • Alertmanager version:

    /bin/sh: alertmanager: not found

  • Prometheus configuration file:

# my global config
global:
  scrape_interval:     30s
  evaluation_interval: 30s
  # scrape_timeout is set to the global default (10s).

  external_labels:
    store_name: prometheus
    store_id: aaaaa
    cluster: local

scrape_configs:
  # metrics_path defaults to '/metrics'
  # scheme defaults to 'http'.

  - job_name: 'prometheus-exporter'
    # Use DNS service discovery to get all local instances of the exporter
    dns_sd_configs:
      - names: [***]

  - job_name: 'device-prometheus-exporter'
    static_configs: 
      - targets: [***]


alerting:
  alert_relabel_configs:
    - regex: 'store_id'
      action: labeldrop
  alertmanagers:
    - dns_sd_configs:
      - names: [***]

rule_files:
  - /etc/prometheus/alert.rules.yml

  • Alertmanager configuration file:

    insert configuration here (if relevant to the issue)

  • Logs:

    insert Prometheus and Alertmanager logs relevant to the issue here

roidelapluie (Member) commented:

Thank you! It seems you have a patch; would you mind opening a pull request for further discussion?

pballok-logmein (Author) commented:

> Thank you! It seems you have a patch; would you mind opening a pull request for further discussion?

Sure, I'm preparing a PR

pballok-logmein (Author) commented:

The patch I created for this issue fails one of the CI tests (ci/circleci: test_windows), but I'm unable to check the CircleCI results, since it wants access to all my repos and this is a company account.
I'm also unable to run the tests locally; I'd need to put more time into setting up a proper environment for this. Also, I only have access to a macOS environment, and the test name "test_windows" suggests I'd need a Windows environment for these specific tests.
Long story short: I need help confirming whether the failing test is indeed related to my patch, and if so, what exactly the log of the failing test says.

Nick-Triller (Contributor) commented:

Hi @pballok-logmein, if you press the arrow next to the "Log In with GitHub" button, there is a "Public Repos Only" option. Maybe logging in is acceptable to you if you don't have to grant access to private repos :)

roidelapluie (Member) commented:

> Note that the channel used for the DNS discovery sync is not buffered, so the Notifier will only see those sync messages on the channel if it happens to be waiting inside the select at that moment. When it is busy trying to send the alerts, it will ignore the sync channel.

I do not think that the fix is correct.

The fact that it is a sync channel means that it will block the goroutine that sets the targets, but it will see the new targets eventually.

Whether it is buffered or not should have very little impact here; it might even make the situation worse, as the service discoveries would queue multiple updates at the same time.

pballok-logmein commented Apr 30, 2021

Hi @roidelapluie,
The way I see it in the code is that the Discovery Manager sends the targets here:

case m.syncCh <- m.allGroups():

It will not block if nobody is waiting on the channel, because this is currently not a buffered channel and there is a "default" branch; it just falls through to the default and tries again 5 seconds later.

The receiver side is here:

case ts := <-tsets:

This will indeed block if there are no new targets from the discovery manager AND there are no alerts to send either. If there are always alerts to send, because the sending takes too long, it never gets a chance to wait on the channel and read the new discovery targets.
I agree that under normal conditions, when there are pauses between sending alerts, the discovery targets will eventually be read.

pballok-logmein commented Apr 30, 2021

> Whether it is buffered or not should have very little impact here; it might even make the situation worse, as the service discoveries would queue multiple updates at the same time.

I think with a buffer size of 1, there won't be multiple updates waiting at the same time. If a set of targets is placed in the queue, the next discovery run 30 seconds later will find the queue already full. But in practice, since the discovery sync channel is read between alert send operations, it takes much less than 30 seconds to drain the sync queue anyway.
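
As a side note, a "latest wins" variant of the single-slot buffer is also possible. This is only a sketch with hypothetical names (targetGroups, sendLatest), not what the proposed PR does: a single sender evicts a stale, unread update before writing the newest one, so the reader never acts on outdated targets.

```go
package notifierbuf

// targetGroups is a simplified, hypothetical stand-in for the real
// target-group map type used on the discovery sync channel.
type targetGroups map[string][]string

// sendLatest delivers the newest target set on a channel that has a buffer
// of one. If an older, unread update is still parked in the buffer, it is
// discarded and replaced. It assumes a single sender goroutine.
func sendLatest(syncCh chan targetGroups, latest targetGroups) {
	for {
		select {
		case syncCh <- latest:
			return // delivered, or parked in the one-slot buffer
		default:
			// The buffer already holds an unread update: drop it and retry.
			select {
			case <-syncCh:
			default:
			}
		}
	}
}
```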

roidelapluie (Member) commented:

I missed a piece of the puzzle. Thanks for pointing it out.

I think that this is still not the correct fix. The issue I see is that a service discovery can have a lot of target groups, so buffering could still introduce a long delay if multiple groups are updated.

Maybe we could have a select in a loop and do the sending of the alerts in another goroutine.
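
A rough, self-contained sketch of that alternative (all names here are hypothetical and only illustrate the shape, not the real notifier internals): the slow, timeout-prone alert sending runs in a dedicated goroutine, so a discovery update on the target channel is applied promptly even while sends are timing out.

```go
package notifieralt

// targetGroups is a simplified, hypothetical stand-in for the real
// target-group map type delivered by service discovery.
type targetGroups map[string][]string

type manager struct {
	more   chan struct{}      // signalled whenever new alerts are queued
	done   chan struct{}      // closed on shutdown
	send   func()             // performs one (possibly slow) alert send cycle
	reload func(targetGroups) // applies a discovery update
}

// run keeps the select loop free of slow work: alert sending runs in its own
// goroutine, so target updates are never starved by sends that hit the
// AlertManager timeout.
func (m *manager) run(tsets <-chan targetGroups) {
	go func() {
		for {
			select {
			case <-m.done:
				return
			case <-m.more:
				m.send() // may block for up to the AlertManager timeout
			}
		}
	}()

	for {
		select {
		case <-m.done:
			return
		case ts := <-tsets:
			m.reload(ts)
		}
	}
}
```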

pballok-logmein (Author) commented:

If multiple groups are updated, are they sent one by one as separate messages on the sync channel, or as one message? The sending code m.syncCh <- m.allGroups() looks like it sends all the groups as a single map, which would be received by the Notifier as a single item.

pballok-logmein (Author) commented:

Hi @roidelapluie,
I'm a bit confused about what the next step is. Do you still prefer a different solution that involves a separate loop/goroutine to listen for DNS discovery targets? Or is the currently proposed solution OK?

The currently proposed solution makes sure that the two channels (1. the one used to signal that there are alerts to be sent, 2. the one used to send DNS discovery targets) have an equal chance of being read, since both will be buffered (with a queue size of 1). I agree this could also be achieved with separate loops reading the two channels, but that change might have other implications (I'm not sure which, I didn't look into it), since this is a generic part used by other discovery notifiers too. Let me know what you guys think.

roidelapluie (Member) commented:

I need to dig into the code. I think that it might break other things in subtle ways, which I need to check. Did you look at the impact this would have on service discovery?

pballok-logmein commented May 3, 2021

One impact my proposal definitely has: the current implementation only acts on DNS discovery results when it is idle (i.e. has no alerts to send). With my proposed solution, the DNS discovery results (generated every 30s by default) will be taken into account even if there are also alerts to be sent, and this might (slightly) change the current timing. But I think the same impact would be there if I went with a separate loop.

pballok-logmein (Author) commented:

I will look into possible impacts related to the other service discovery mechanisms.

roidelapluie (Member) commented:

Actually, the service discovery uses a different, more elaborate mechanism, where we create a buffered chan.

pballok-logmein (Author) commented:

Agreed. Does this mean that having a similar buffered chan for the DNS discovery targets could be OK? (This is what the proposed PR does.)

roidelapluie (Member) commented:

The approach seems OK. However, the proposed PR introduces a double buffer in the scrape discovery, which I'd like to avoid.

roidelapluie (Member) commented:

@pballok-logmein are you willing to implement the same logic we have in the scrape manager in the notifier? Thanks.
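
For reference, a simplified sketch of the general pattern being referred to (hypothetical names, not the actual scrape manager code): the latest target sets are stored under a mutex, and a trigger channel with a buffer of one coalesces any number of discovery updates into at most one pending reload.

```go
package scrapestyle

import "sync"

// targetGroups is a simplified, hypothetical stand-in for the real
// target-group map type delivered by service discovery.
type targetGroups map[string][]string

type manager struct {
	mtx           sync.Mutex
	latest        targetGroups
	triggerReload chan struct{} // buffered with capacity 1
	apply         func(targetGroups)
	done          chan struct{}
}

func newManager(apply func(targetGroups)) *manager {
	return &manager{
		triggerReload: make(chan struct{}, 1),
		apply:         apply,
		done:          make(chan struct{}),
	}
}

// updateTargets is called by service discovery. It never blocks: it records
// the newest target sets and, if no reload is pending yet, queues one.
func (m *manager) updateTargets(ts targetGroups) {
	m.mtx.Lock()
	m.latest = ts
	m.mtx.Unlock()

	select {
	case m.triggerReload <- struct{}{}:
	default: // a reload is already pending; it will pick up m.latest anyway
	}
}

// reloader applies the most recent target sets whenever a reload was queued.
func (m *manager) reloader() {
	for {
		select {
		case <-m.done:
			return
		case <-m.triggerReload:
			m.mtx.Lock()
			ts := m.latest
			m.mtx.Unlock()
			m.apply(ts)
		}
	}
}
```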

dschmo commented Jun 28, 2022

We see the same issue in our clusters quite frequently. We have two alertmanager replicas and a lot of firing alerts most of the time. As soon as one alertmanager gets scheduled to a new node we see this issue. Is there a workaround to avoid it? Currently we restart the config-reloader container of prometheus to fix it manually.

SuperQ commented Jun 30, 2022

@dschmo We use a SRV record with stable external DNS hostnames like alertmanager-0, alertmanager-1. Each Alertmanager instance has a separate Ingress endpoint. These point to an internal cloud provider LB with static IPs. This way the path from Prometheus to Alertmanager is mostly static from Prometheus's point of view.

We did this mostly to handle cross-cluster alertmanager traffic.

roidelapluie (Member) commented:

@pballok-logmein I have another proposal in #10948, can you please tell me what you think?

roidelapluie (Member) commented:

I'd like to explicitly thank you for your debugging efforts and your clear explanation of the issue.

multani commented Aug 31, 2022

We are still seeing this issue with a setup similar to the configuration from #7063, while running v2.38.0, which should contain the fix from #10948:

  • Prometheus & Alertmanager running both on Kubernetes
  • Prometheus being configured to find Alertmanager using kubernetes_sd_configs + role: endpoints

dschmo commented Sep 2, 2022

We're still seeing this issue as well. It's reproducible with > 1500 firing alerts and two alertmanager replicas. Just restart the alertmanager pods.

roidelapluie (Member) commented:

Can you file another issue with all your details?

multani commented Oct 11, 2022

> Can you file another issue with all your details?

I filed a new issue: #11444 👍

This issue was locked as resolved and the conversation was limited to collaborators on Apr 9, 2023.