DNS discovery fails to sync if AlertManager has connection timeout #8768

Closed
pballok-logmein opened this issue Apr 28, 2021 · 25 comments · Fixed by #10948

pballok-logmein commented Apr 28, 2021

What did you do?
Started Prometheus with two AlertManagers alive and receiving alerts. I have a DNS service that lists both AlertManager IPs for the URL I'm using to connect to AlertManager.
I then stopped one of the two AlertManagers, so only one was left running. I confirmed that my DNS service now lists only one IP, the one that was still active. So if Prometheus queries the DNS during service discovery, it will now see only one IP.
The AlertManager timeout is set to 10s, and the evaluation interval is 30s.

What did you expect to see?
I expected to see some error messages related to failing to connect to the now-dead AlertManager.
After 30s (I use the default DNS discovery refresh interval), the DNS discovery should update the list of AlertManagers to contain only a single IP; the errors should then stop and the obsolete IP should no longer be used.

What did you see instead? Under which circumstances?

  1. The Notifier was waiting inside its Run function (func (n *Manager) Run(tsets <-chan map[string][]*targetgroup.Group)) for either alerts that needed to be sent to AlertManager, or a sync message from the DNS discovery.
  2. When sending alerts started to run into timeout errors (the old IP is no longer reachable), sending took a lot longer, so by the time the select started to wait again, there were already new alerts waiting to be sent.
  3. So it took those new alerts, tried to send them, ran into the timeout again, and so on. It never received the sync messages on the sync channel from the DNS discovery, and fell into an endless loop of failing to reach the long-dead AlertManager.
  4. Hours later it finally recovered, when the stream of new alerts arriving every few seconds paused, so it had time to wait on the sync channel, finally noticed the DNS discovery update, refreshed the IPs, and everything worked fine after that.

Note that the channel used for the DNS discovery sync is not buffered, so the Notifier will only see those sync messages on the channel if it happens to be waiting inside the select at that moment. When it is busy trying to send the alerts, it will ignore the sync channel.

I tried a small fix that turns the sync channel into a buffered channel with a queue size of 1. It solved this issue: recovery was instant when one of the AlertManagers went offline.
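
For illustration, here is a minimal, self-contained sketch of the dynamics described above (not the actual Prometheus code; syncCh, moreAlerts and the timings are all made up): a discovery goroutine publishes updates with a non-blocking send while the receiver is kept busy sending alerts.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	syncCh := make(chan string) // unbuffered, as described above
	// syncCh := make(chan string, 1) // the proposed fix: a buffer of one

	moreAlerts := make(chan struct{}, 1)

	// Rule evaluation side: there is (almost) always a fresh alert batch waiting.
	go func() {
		for {
			select {
			case moreAlerts <- struct{}{}:
			default:
			}
			time.Sleep(100 * time.Millisecond)
		}
	}()

	// Discovery side: publish updated targets with a non-blocking send,
	// mirroring the "default:" branch described for the discovery manager.
	go func() {
		for i := 0; ; i++ {
			select {
			case syncCh <- fmt.Sprintf("targets-v%d", i):
				fmt.Println("discovery: delivered update", i)
			default:
				fmt.Println("discovery: receiver busy, dropped update", i)
			}
			time.Sleep(500 * time.Millisecond)
		}
	}()

	// Notifier side: either pick up new targets or send the pending alerts.
	deadline := time.After(5 * time.Second) // stop the demo after a while
	for {
		select {
		case <-deadline:
			return
		case ts := <-syncCh:
			fmt.Println("notifier: reloaded", ts)
		case <-moreAlerts:
			// Sending to a dead AlertManager runs into the timeout, so the
			// select is almost never parked waiting on syncCh.
			time.Sleep(1 * time.Second)
		}
	}
}
```

With the unbuffered channel, almost every update is dropped; switching to a buffer of one lets an update sit in the channel until the notifier's next loop iteration picks it up.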

Environment

  • System information:

    Darwin 20.3.0 x86_64

  • Prometheus version:

prometheus, version 2.26.0 (branch: main, revision: f3b2d2a)
build user: pballok@pballok-MBP
build date: 20210428-22:20:08
go version: go1.16.2
platform: linux/amd64

  • Alertmanager version:

    /bin/sh: alertmanager: not found

  • Prometheus configuration file:

# my global config
global:
  scrape_interval:     30s
  evaluation_interval: 30s
  # scrape_timeout is set to the global default (10s).

  external_labels:
    store_name: prometheus
    store_id: aaaaa
    cluster: local

scrape_configs:
  # metrics_path defaults to '/metrics'
  # scheme defaults to 'http'.

  - job_name: 'prometheus-exporter'
    # Use DNS service discovery to get all local instances of the exporter
    dns_sd_configs:
      - names: [***]

  - job_name: 'device-prometheus-exporter'
    static_configs: 
      - targets: [***]


alerting:
  alert_relabel_configs:
    - regex: 'store_id'
      action: labeldrop
  alertmanagers:
    - dns_sd_configs:
      - names: [***]

rule_files:
  - /etc/prometheus/alert.rules.yml

  • Alertmanager configuration file:

    insert configuration here (if relevant to the issue)

  • Logs:

    insert Prometheus and Alertmanager logs relevant to the issue here

roidelapluie (Member) commented:

Thank you! It seems you have a patch; would you mind opening a pull request for further discussion?

pballok-logmein (Author) commented:

> Thank you! It seems you have a patch; would you mind opening a pull request for further discussion?

Sure, I'm preparing a PR

pballok-logmein (Author) commented:

The patch I created for this issue fails one of the CI tests (ci/circleci: test_windows), but I'm unable to check the CircleCI results, since it wants access to all my repos and this is a company account.
I'm also unable to run the tests locally; I'd need to put more time into setting up a proper environment for this. Also, I only have access to a macOS environment, and the test name "test_windows" suggests I'd need a Windows environment for these specific tests.
Long story short: I need help confirming whether the failing test is indeed related to my patch, and if so, what exactly the log of the failing test says.

Nick-Triller (Contributor) commented:

Hi @pballok-logmein, if you press the arrow next to the "Log In with GitHub" button, there is a "Public Repos Only" option. Maybe logging in is acceptable to you if you don't have to grant access to private repos :)

roidelapluie (Member) commented:

> Note that the channel used for the DNS discovery sync is not buffered, so the Notifier will only see those sync messages on the channel if it happens to be waiting inside the select at that moment. When it is busy trying to send the alerts, it will ignore the sync channel.

I do not think that the fix is correct.

The fact that it is a sync channel means that it will block the goroutine that sets the targets, but it will see the new targets eventually.

Whether it is buffered or not should have very little impact here; it might even make the situation worse, as the service discoveries would queue multiple updates at the same time.

pballok-logmein commented Apr 30, 2021

Hi @roidelapluie,
The way I see it in the code is that the Discovery Manager sends the targets here:

case m.syncCh <- m.allGroups():

It will not block if nobody is waiting on the channel, because this is currently not a buffered channel and there is a "default" branch; it just falls through to the default and tries again 5 seconds later.

The receiver side is here:

case ts := <-tsets:

This will indeed block if there are no new targets from the discovery manager AND there are no alerts to send either. If there are always alerts to send, because the sending takes too long, it never gets a chance to wait on the channel and read the new discovery targets.
I agree that under normal conditions, when there are pauses between sending alerts, the discovery targets will eventually be read.

pballok-logmein commented Apr 30, 2021

> Whether it is buffered or not should have very little impact here; it might even make the situation worse, as the service discoveries would queue multiple updates at the same time.

I think with a buffer size of 1, there won't be multiple updates waiting at the same time. If a set of targets is placed in the queue, the next discovery run 30 seconds later will find the queue already full. But in practice, since the discovery sync channel is read between alert send operations, it takes much less than 30 seconds to drain the sync queue anyway.
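
As a side note, a "latest wins" variant of the single-slot buffer is also possible. This is only a sketch with hypothetical names (targetGroups, sendLatest), not what the proposed PR does: a single sender evicts a stale, unread update before writing the newest one, so the reader never acts on outdated targets.

```go
package notifierbuf

// targetGroups is a simplified, hypothetical stand-in for the real
// target-group map type used on the discovery sync channel.
type targetGroups map[string][]string

// sendLatest delivers the newest target set on a channel that has a buffer
// of one. If an older, unread update is still parked in the buffer, it is
// discarded and replaced. It assumes a single sender goroutine.
func sendLatest(syncCh chan targetGroups, latest targetGroups) {
	for {
		select {
		case syncCh <- latest:
			return // delivered, or parked in the one-slot buffer
		default:
			// The buffer already holds an unread update: drop it and retry.
			select {
			case <-syncCh:
			default:
			}
		}
	}
}
```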

roidelapluie (Member) commented:

I missed a piece of the puzzle. Thanks for pointing it out.

I think that this is still not the correct fix. The issue I see is that a service discovery can have a lot of target groups, so buffering could still introduce a long delay if multiple groups are updated.

Maybe we could have a select in a loop and do the sending of the alerts in another goroutine.
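
A rough, self-contained sketch of that alternative (all names here are hypothetical and only illustrate the shape, not the real notifier internals): the slow, timeout-prone alert sending runs in a dedicated goroutine, so a discovery update on the target channel is applied promptly even while sends are timing out.

```go
package notifieralt

// targetGroups is a simplified, hypothetical stand-in for the real
// target-group map type delivered by service discovery.
type targetGroups map[string][]string

type manager struct {
	more   chan struct{}      // signalled whenever new alerts are queued
	done   chan struct{}      // closed on shutdown
	send   func()             // performs one (possibly slow) alert send cycle
	reload func(targetGroups) // applies a discovery update
}

// run keeps the select loop free of slow work: alert sending runs in its own
// goroutine, so target updates are never starved by sends that hit the
// AlertManager timeout.
func (m *manager) run(tsets <-chan targetGroups) {
	go func() {
		for {
			select {
			case <-m.done:
				return
			case <-m.more:
				m.send() // may block for up to the AlertManager timeout
			}
		}
	}()

	for {
		select {
		case <-m.done:
			return
		case ts := <-tsets:
			m.reload(ts)
		}
	}
}
```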

pballok-logmein (Author) commented:

If multiple groups are updated, are they sent one by one as separate messages on the sync channel, or as one message? The sending code m.syncCh <- m.allGroups() looks like it sends all the groups as a single map, which would be received by the Notifier as a single item.

pballok-logmein (Author) commented:

Hi @roidelapluie,
I'm a bit confused about what the next step is. Do you still prefer a different solution that involves a separate loop/goroutine to listen for DNS discovery targets? Or is the currently proposed solution OK?

The currently proposed solution makes sure that the two channels (1. the one used to signal that there are alerts to be sent, 2. the one used to send DNS discovery targets) have an equal chance of being read, since both will be buffered (with a queue size of 1). I agree this could also be achieved with separate loops reading the two channels, but that change might have other implications (I'm not sure which, I didn't look into it), since this is a generic part used by other discovery notifiers too. Let me know what you guys think.

roidelapluie (Member) commented:

I need to dig into the code. I think that it might break other things in subtle ways, which I need to check. Did you look at the impact this would have on service discovery?

pballok-logmein commented May 3, 2021

One impact my proposal definitely has: the current implementation only acts on DNS discovery results when it is idle (i.e. has no alerts to send). With my proposed solution, the DNS discovery results (generated every 30s by default) will be taken into account even if there are also alerts to be sent, and this might (slightly) change the current timing. But I think the same impact would be there if I went with a separate loop.

pballok-logmein (Author) commented:

I will look into possible impacts related to the other service discovery mechanisms.

roidelapluie (Member) commented:

Actually, the service discovery uses a different, more elaborate mechanism, where we create a buffered chan.

pballok-logmein (Author) commented:

Agreed. Does this mean that having a similar buffered chan for the DNS discovery targets could be OK? (This is what the proposed PR does.)

roidelapluie (Member) commented:

The approach seems OK. However, the proposed PR introduces a double buffer in the scrape discovery, which I'd like to avoid.

roidelapluie (Member) commented:

@pballok-logmein are you willing to implement the same logic we have in the scrape manager in the notifier? Thanks.
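
For reference, a simplified sketch of the general pattern being referred to (hypothetical names, not the actual scrape manager code): the latest target sets are stored under a mutex, and a trigger channel with a buffer of one coalesces any number of discovery updates into at most one pending reload.

```go
package scrapestyle

import "sync"

// targetGroups is a simplified, hypothetical stand-in for the real
// target-group map type delivered by service discovery.
type targetGroups map[string][]string

type manager struct {
	mtx           sync.Mutex
	latest        targetGroups
	triggerReload chan struct{} // buffered with capacity 1
	apply         func(targetGroups)
	done          chan struct{}
}

func newManager(apply func(targetGroups)) *manager {
	return &manager{
		triggerReload: make(chan struct{}, 1),
		apply:         apply,
		done:          make(chan struct{}),
	}
}

// updateTargets is called by service discovery. It never blocks: it records
// the newest target sets and, if no reload is pending yet, queues one.
func (m *manager) updateTargets(ts targetGroups) {
	m.mtx.Lock()
	m.latest = ts
	m.mtx.Unlock()

	select {
	case m.triggerReload <- struct{}{}:
	default: // a reload is already pending; it will pick up m.latest anyway
	}
}

// reloader applies the most recent target sets whenever a reload was queued.
func (m *manager) reloader() {
	for {
		select {
		case <-m.done:
			return
		case <-m.triggerReload:
			m.mtx.Lock()
			ts := m.latest
			m.mtx.Unlock()
			m.apply(ts)
		}
	}
}
```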

dschmo commented Jun 28, 2022

We see the same issue in our clusters quite frequently. We have two alertmanager replicas and a lot of firing alerts most of the time. As soon as one alertmanager gets scheduled to a new node we see this issue. Is there a workaround to avoid it? Currently we restart the config-reloader container of prometheus to fix it manually.

SuperQ commented Jun 30, 2022

@dschmo We use a SRV record with stable external DNS hostnames like alertmanager-0, alertmanager-1. Each Alertmanager instance has a separate Ingress endpoint. These point to an internal cloud provider LB with static IPs. This way the path from Prometheus to Alertmanager is mostly static from Prometheus's point of view.

We did this mostly to handle cross-cluster alertmanager traffic.

roidelapluie (Member) commented:

@pballok-logmein I have another proposal in #10948, can you please tell me what you think?

roidelapluie (Member) commented:

I'd like to explicitly thank you for your debugging efforts and your clear explanation of the issue.

multani commented Aug 31, 2022

We are still seeing this issue with a setup similar to the configuration from #7063, while running v2.38.0, which should contain the fix from #10948:

  • Prometheus & Alertmanager running both on Kubernetes
  • Prometheus being configured to find Alertmanager using kubernetes_sd_configs + role: endpoints

dschmo commented Sep 2, 2022

We're still seeing this issue as well. It's reproducible with > 1500 firing alerts and two alertmanager replicas. Just restart the alertmanager pods.

roidelapluie (Member) commented:

Can you file another issue with all your details?

multani commented Oct 11, 2022

> Can you file another issue with all your details?

I filed a new issue: #11444 👍

This issue was locked as resolved and the conversation was limited to collaborators on Apr 9, 2023.