DNS discovery fails to sync if AlertManager has connection timeout #8768
Comments
Thank you! It seems you have a patch; would you mind opening a pull request for further discussion? |
Sure, I'm preparing a PR |
The patch I created for this issue fails one of the CI tests (ci/circleci: test_windows), but I'm unable to check the CircleCI results, since it wants access to all my repos, and this is a company account. |
Hi @pballok-logmein, if you press the arrow next to the "Log In with GitHub" button, there is a "Public Repos Only" option. Maybe logging in is acceptable to you if you don't have to grant access to private repos :) |
I do not think that the fix is correct. The fact that it is a sync channel means that it will block the goroutine that sets the targets, but it will see the new targets eventually. Whether it is buffered or not should have very little impact here; it might even make the situation worse, as the service discoveries would queue multiple updates at the same time. |
Hi @roidelapluie, the sender side is prometheus/discovery/manager.go, line 242 at commit 2a4b8e1.
It will not get blocked if nothing is waiting on the channel, because currently this is not a buffered channel and there is a "default" branch. It will just jump to the default and try again in 5 seconds. The receiver side is prometheus/notifier/notifier.go, line 311 at commit 2a4b8e1.
This will indeed block if there are no new targets from the discovery manager AND there are no alerts to send either. But if there are always alerts to send, because sending takes too long, the loop never waits and never gets a chance to read the new discovery targets. I agree that under normal conditions, when there are pauses between alert sends, the discovery targets will eventually be read. |
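A minimal, self-contained sketch of the interaction described above (the names syncCh and more, and all timings, are illustrative, not the actual Prometheus code): the discovery side's non-blocking send on an unbuffered channel can only succeed if the notifier side is parked on that channel, and a notifier that always has alerts pending never parks there.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Unbuffered, like the real sync channel discussed here.
	syncCh := make(chan []string)

	// Discovery side: publishes target updates with a non-blocking send,
	// so an update is delivered only if the notifier happens to be
	// parked on syncCh at that exact moment.
	go func() {
		for i := 0; ; i++ {
			targets := []string{fmt.Sprintf("am-%d.example.com", i)}
			select {
			case syncCh <- targets:
				fmt.Println("discovery: delivered", targets)
			default:
				fmt.Println("discovery: dropped", targets)
			}
			time.Sleep(100 * time.Millisecond)
		}
	}()

	// Notifier side: "more" is always ready because alerts keep arriving,
	// so the select never has to wait, and never blocks on syncCh.
	more := make(chan struct{}, 1)
	more <- struct{}{}
	for i := 0; i < 8; i++ {
		select {
		case targets := <-syncCh:
			fmt.Println("notifier: new targets", targets)
		case <-more:
			time.Sleep(150 * time.Millisecond) // stands in for a slow sendAll()
			more <- struct{}{}                 // alerts have queued up again
		}
	}
}
```

Running this, every update prints as dropped: a non-blocking send on an unbuffered channel succeeds only when a receiver is already blocked waiting, and since the more case is always ready, the notifier loop never blocks on syncCh.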
I think with a buffer size of 1, there won't be multiple updates waiting at the same time. If a set of targets is placed in the queue, the next discovery run 30 seconds later will find the queue already full. But in practice, since the discovery sync channel is read between alert send operations, the sync queue will be emptied in much less than 30 seconds anyway. |
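A hedged sketch of the fix being proposed here, applied to the toy example above (again illustrative names, not the actual patch): with a one-slot buffer, the non-blocking send can park one update, and the busy receiver finds it on its next trip through the select.

```go
package main

import (
	"fmt"
	"time"
)

// Same toy setup as the sketch above, with the one-line fix applied:
// the sync channel gets a one-slot buffer, so the non-blocking send can
// park an update even while the receiver is busy sending alerts.
func main() {
	syncCh := make(chan []string, 1) // the fix: buffer size 1

	go func() {
		for i := 0; i < 5; i++ {
			targets := []string{fmt.Sprintf("am-%d.example.com", i)}
			select {
			case syncCh <- targets:
				fmt.Println("discovery: delivered", targets)
			default:
				fmt.Println("discovery: buffer full, dropped", targets)
			}
			time.Sleep(100 * time.Millisecond)
		}
	}()

	// The receiver still spends most of its time sending alerts, but now
	// it sees the buffered update on a subsequent pass through the select.
	more := make(chan struct{}, 1)
	more <- struct{}{}
	for i := 0; i < 8; i++ {
		select {
		case targets := <-syncCh:
			fmt.Println("notifier: new targets", targets)
		case <-more:
			time.Sleep(150 * time.Millisecond) // stands in for a slow sendAll()
			more <- struct{}{}
		}
	}
}
```

At most one update ever waits: a second update arriving before the buffered one is consumed hits the full buffer and falls through to default, which matches the coalescing behaviour described in the comment above.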
I missed a piece of the puzzle, thanks for pointing it out. I still think this is not the correct fix, though. The issue I see is that a service discovery can have a lot of target groups, so buffering could still introduce a long delay if multiple groups are updated. Maybe we could have a select in a loop and send the alerts in another goroutine. |
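A rough sketch of that alternative (hypothetical shape and names, not a proposed patch): the main loop only dispatches work, so it is always back at its select quickly, while a worker goroutine does the slow sending.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical restructuring: the main loop only reacts to events, and
// the slow alert sending happens in a separate goroutine, so the loop is
// always parked at its select in time to see new targets.
func main() {
	syncCh := make(chan []string, 1)      // discovery target updates
	more := make(chan struct{}, 1)        // "alerts are queued" signal
	sendTrigger := make(chan struct{}, 1) // wakes the sending worker

	// Worker: stands in for sendAll(), i.e. the slow network I/O.
	go func() {
		for range sendTrigger {
			time.Sleep(150 * time.Millisecond)
			fmt.Println("worker: alert batch sent")
		}
	}()

	// Alerts keep arriving every 10ms; a target update arrives at 50ms.
	go func() {
		for {
			time.Sleep(10 * time.Millisecond)
			select {
			case more <- struct{}{}:
			default:
			}
		}
	}()
	go func() {
		time.Sleep(50 * time.Millisecond)
		syncCh <- []string{"am-0.example.com"}
	}()

	deadline := time.After(300 * time.Millisecond)
	for {
		select {
		case <-deadline:
			return
		case ts := <-syncCh:
			fmt.Println("loop: applied new targets", ts) // seen promptly
		case <-more:
			select {
			case sendTrigger <- struct{}{}: // wake the worker
			default: // a send is already pending or in flight
			}
		}
	}
}
```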
If multiple groups are updated, are they sent one by one as separate messages on the sync channel, or as a single message? The sending code |
Hi @roidelapluie, the current proposed solution makes sure that the two channels (1. the one used to signal that there are alerts to be sent, 2. the one used to send DNS discovery targets) have an equal chance of being read, since both will be buffered (with a queue size of 1). I agree this could also be achieved by reading the two channels in separate loops, but that change might have other implications (I'm not sure which, I didn't look into it), since this is a generic part used by other discovery notifiers too. Let me know what you guys think. |
I need to dig into the code. I think that it might break other stuff in subtle ways that I need to check. Did you look at the impact this would have on service discovery? |
One impact my proposal definitely has: the current implementation only acts on DNS discovery results when it is idle (i.e. has no alerts to send). With my proposed solution, the DNS discovery results (generated every 30s by default) will be taken into account and applied even while there are alerts to be sent, which might (slightly) change the current timing. But I think this same impact would be there if I went with a separate loop as well. |
I will look into possible impacts related to other service discovery mechanisms. |
Actually, the service discovery uses a different, more elaborate mechanism, where we create a buffered chan. |
Agreed. Does this mean that having a similar buffered chan for the DNS discovery targets could be OK? (This is what the proposed PR does.) |
The approach seems OK. However, the proposed PR causes a double buffer in the scrape discovery, which I'd like to avoid. |
@pballok-logmein are you willing to implement the same logic we have in the scrape manager in the notifier? Thanks. |
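For reference, a hedged paraphrase of the coalescing logic in question (names and details approximate, not the actual scrape manager code): the newest target set is kept under a mutex, and a one-slot trigger channel collapses any burst of updates into a single pending reload that is applied at a rate-limited interval.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Approximate shape of the scrape-manager-style coalescing pattern:
// receiving an update is always cheap, and the reload itself is
// decoupled and rate-limited.
type manager struct {
	mtx           sync.Mutex
	targets       map[string][]string
	triggerReload chan struct{}
}

func (m *manager) run(tsets <-chan map[string][]string) {
	go m.reloader()
	for ts := range tsets {
		m.mtx.Lock()
		m.targets = ts // only the newest update is ever kept
		m.mtx.Unlock()
		select {
		case m.triggerReload <- struct{}{}:
		default: // a reload is already pending: coalesce
		}
	}
}

func (m *manager) reloader() {
	ticker := time.NewTicker(100 * time.Millisecond) // rate-limit reloads
	defer ticker.Stop()
	for range ticker.C {
		select {
		case <-m.triggerReload:
			m.mtx.Lock()
			fmt.Println("reloaded targets:", m.targets)
			m.mtx.Unlock()
		default:
		}
	}
}

func main() {
	m := &manager{triggerReload: make(chan struct{}, 1)}
	tsets := make(chan map[string][]string)
	go m.run(tsets)

	// A burst of three updates collapses into a single reload.
	for i := 0; i < 3; i++ {
		tsets <- map[string][]string{"alertmanager": {fmt.Sprintf("am-%d", i)}}
	}
	time.Sleep(300 * time.Millisecond)
}
```

The design choice worth noting: because only the newest target set is stored, a burst of updates costs one reload, which avoids both the dropped-update problem and the queue-of-stale-updates concern discussed above.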
We see the same issue in our clusters quite frequently. We have two alertmanager replicas and a lot of firing alerts most of the time. As soon as one alertmanager gets scheduled to a new node we see this issue. Is there a workaround to avoid it? Currently we restart the config-reloader container of prometheus to fix it manually. |
@dschmo We use an SRV record with stable external DNS hostnames. We did this mostly to handle cross-cluster alertmanager traffic. |
@pballok-logmein I have another proposal in #10948 , can you please tell me what you think? |
I'd like to explicitly thank you for your debugging efforts and your clear explanation of the issue. |
We're still seeing this issue as well. It's reproducible with > 1500 firing alerts and two alertmanager replicas. Just restart the alertmanager pods. |
Can you file another issue with all your details? |
I filed a new issue: #11444 👍 |
What did you do?
Started Prometheus with two AlertManagers alive and receiving alerts. I have a DNS service that lists both AlertManager IPs for the URL I'm using to connect to the AlertManagers.
I then stopped one of the two AlertManagers, so only one was left working. I confirmed that my DNS service now lists only one IP, the one that was still active. So if Prometheus queries the DNS during service discovery, it will now only see one IP.
I set the AlertManager timeout to 10s; the evaluation period is 30s.
What did you expect to see?
I expected to see some error messages related to failing to connect to the now-dead AlertManager.
After 30s (I use the default DNS discovery frequency), the DNS discovery should update the list of AlertManagers to contain only a single IP, the errors should stop, and the obsolete IP should not be used any more.
What did you see instead? Under which circumstances?
The errors never stopped. The Run function (prometheus/notifier/notifier.go, line 305 at commit fa184a5) loops around a select, and by the time the select started to wait again, there were already new alerts waiting to be sent. Note that the channel used for the DNS discovery sync is not buffered, so the Notifier will only see those sync messages if it happens to be waiting inside the select at that moment. While it is busy trying to send the alerts, it ignores the sync channel.
I tried a small fix where I made the sync channel a buffered channel with a queue size of 1; it solved this issue, and recovery was instant when one of the AlertManagers went offline.
Environment
System information:
Darwin 20.3.0 x86_64
Prometheus version:
prometheus, version 2.26.0 (branch: main, revision: f3b2d2a)
build user: pballok@pballok-MBP
build date: 20210428-22:20:08
go version: go1.16.2
platform: linux/amd64
Alertmanager version:
/bin/sh: alertmanager: not found
Prometheus configuration file: