Alertmanager service discovery does not update under heavy load #13676
Good find! I see you're referring to the legacy discovery manager. Do you have the same issue on the latest Prometheus version?

I don't think that a buffered channel to avoid losing an update is the best way to go. Others may argue that the buffer size should be more than 1 because they have more Alertmanager clusters...

The discovery manager sends updates regularly. The notifier should end up getting them, even if one channel is too busy. So maybe there is a bug somewhere else; I thought the discovery manager wasn't sending good updates, but your metrics suggest otherwise.

Do you trigger any config reloads? Could you share your configuration?
Ah, we do not use the legacy discovery manager. We haven't tested with a newer version of Prometheus, but the code for propagating discovery to Alertmanager hasn't changed.
That's a good point, I didn't think of that. Luckily, it seems like the discovery manager combines all the underlying providers (which I think are created from the sd_configs) into one target set, which is shared over the single syncCh channel. There's also only one notifier as far as I can tell: https://github.com/prometheus/prometheus/blob/main/cmd/prometheus/main.go#L1259
I'm pretty sure what I identified above is the underlying bug. For reference, and to make sure my argument is very clear, here's the relevant snippet from the discovery manager's sender loop:

```go
// https://github.com/prometheus/prometheus/blob/main/discovery/legacymanager/manager.go#L204
func (m *Manager) sender() {
	ticker := time.NewTicker(m.updatert)
	defer ticker.Stop()
	for {
		select {
		case <-m.ctx.Done():
			return
		case <-ticker.C: // Some discoverers send updates too often so we throttle these with the ticker.
			select {
			case <-m.triggerSend:
				m.metrics.SentUpdates.Inc()
				select {
				case m.syncCh <- m.allGroups():
				default:
					m.metrics.DelayedUpdates.Inc()
					level.Debug(m.logger).Log("msg", "Discovery receiver's channel was full so will retry the next cycle")
					select {
					case m.triggerSend <- struct{}{}:
					default:
					}
				}
			default:
			}
		}
	}
}
```

When we're ready to send a target group update, we're in the second case of the outer select:

```go
select {
case m.syncCh <- m.allGroups():
default:
	m.metrics.DelayedUpdates.Inc()
	level.Debug(m.logger).Log("msg", "Discovery receiver's channel was full so will retry the next cycle")
	select {
	case m.triggerSend <- struct{}{}:
	default:
	}
}
```

The write to m.syncCh is non-blocking: if the notifier isn't receiving at that exact moment, we hit the default case, and the inner select re-arms m.triggerSend so the update is retried on the next tick. On the other side of syncCh, the notifier reads target group updates in its Run loop:

```go
// https://github.com/prometheus/prometheus/blob/main/notifier/notifier.go#L302
func (n *Manager) Run(tsets <-chan map[string][]*targetgroup.Group) {
	for {
		// The select is split in two parts, such as we will first try to read
		// new alertmanager targets if they are available, before sending new
		// alerts.
		select {
		case <-n.ctx.Done():
			return
		case ts := <-tsets:
			n.reload(ts)
		default:
			select {
			case <-n.ctx.Done():
				return
			case ts := <-tsets:
				n.reload(ts)
			case <-n.more:
			}
		}
		alerts := n.nextBatch()
		if !n.sendAll(alerts...) {
			n.metrics.dropped.Add(float64(len(alerts)))
		}
		// If the queue still has items left, kick off the next iteration.
		if n.queueLen() > 0 {
			n.setMore()
		}
	}
}
```

If I understand correctly, under heavy load the outer non-blocking read of tsets will usually fail (there is rarely a write pending at that exact instant), so we fall into the inner select:

```go
select {
case <-n.ctx.Done():
	return
case ts := <-tsets:
	n.reload(ts)
case <-n.more:
}
alerts := n.nextBatch()
if !n.sendAll(alerts...) {
	n.metrics.dropped.Add(float64(len(alerts)))
}
// If the queue still has items left, kick off the next iteration.
if n.queueLen() > 0 {
	n.setMore()
}
```

This select is blocking, so it does give us a chance of entering the case ts := <-tsets branch. Whichever case fires, though, we continue into the send path:

```go
alerts := n.nextBatch()
if !n.sendAll(alerts...) {
	n.metrics.dropped.Add(float64(len(alerts)))
}
// If the queue still has items left, kick off the next iteration.
if n.queueLen() > 0 {
	n.setMore()
}
```

I won't inline sendAll here; the important detail is that this is where the loop spends its time while alerts are being posted. So, there are two conditions where the notifier actually receives from tsets: either the non-blocking read at the top of the loop coincides with the discovery manager's non-blocking write, or the loop happens to be parked in the blocking inner select at the exact moment that write occurs.
In our case, we found that new alerts were being generated faster than the notifier could process them. This caused n.more to be readable essentially all of the time. Even with somewhat shorter timeouts, we can expect sendAll to dominate each iteration of the loop, since every post to the failed Alertmanager has to wait for its context to expire. The end result is that we should never expect the notifier to see updated target groups, because it will not have an opportunity to read from tsets at the moment the discovery manager writes to it. This explains the divergence in the two SD metrics: the discovery manager immediately saw the change but never got a message through to the notifier.

Hopefully that's not too much information and my explanation is clear! I just want to make sure we're all on the same page. That's all my reasoning, so if I've made a mistake, hopefully this explanation makes it obvious.
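To make that failure mode concrete, here's a minimal, self-contained simulation (my own sketch, not Prometheus code; the names syncCh and more just mirror the roles those channels play in the real loops):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	syncCh := make(chan string)       // unbuffered, like the discovery manager's syncCh
	more := make(chan struct{}, 1024) // stand-in for n.more: nearly always readable under load

	// Keep the alert "queue" full so the reader's blocking select always has work.
	go func() {
		for {
			more <- struct{}{}
		}
	}()

	// Sender: one non-blocking write per tick, like Manager.sender().
	go func() {
		for range time.Tick(10 * time.Millisecond) {
			select {
			case syncCh <- "updated target groups":
			default: // receiver busy; retry next tick
			}
		}
	}()

	// Reader: mirrors notifier.Run - a non-blocking read of syncCh first,
	// then a blocking select where more competes with syncCh.
	delivered := 0
	deadline := time.After(2 * time.Second)
	for {
		select {
		case <-syncCh:
			delivered++
		default:
			select {
			case <-syncCh:
				delivered++
			case <-more:
				time.Sleep(50 * time.Millisecond) // stand-in for a sendAll stuck on timeouts
			case <-deadline:
				fmt.Println("target updates delivered:", delivered)
				return
			}
		}
	}
}
```

With the unbuffered channel this prints zero deliveries, because a transfer requires the reader to be parked on a receive at the exact instant of the tick. Changing make(chan string) to make(chan string, 1) lets the ticker's write land in the buffer, and the reader drains it on its next pass.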
We did not trigger any reloads, unfortunately. I don't know how a reload affects the notification queue; if it clears the queue, I'd expect it would've solved the problem. When we saw that the queue was not shrinking, we ended up restarting the Prometheus process. When the server came back up, the discovered Alertmanagers reflected the correct topology again.

Unfortunately I can't share our config, but this is what the relevant section looks like:

```yaml
- dns_sd_configs:
    - names:
        - our-alertmanager-srv-record-name
      type: SRV
```
You're correct. I overlooked the fact that the SD manager retries if it fails to write into the channel, even though that's exactly why I initially thought the update couldn't be lost for good.

Initially, I thought we could simulate what happens using this code: https://go.dev/play/p/DCNFft9X33j. When I run that locally, I observe reads from the channel, so it doesn't reproduce the problem. So, yes, we'll need to assist the reads from syncCh (line 114 in 0a42621).
By the way, the Scrape manager uses a buffered channel already: https://github.com/prometheus/prometheus/blob/0a426215b52e461e3fa4c162c79279bcd5e3b795/scrape/manager.go#L59C3-L59C16. I share @roidelapluie's opinion (see below); I'm concerned that making syncCh buffered will make the code harder to reason about and debug. I believe the first step would be to assemble a test that reproduces the issue and understand why TestHangingNotifier doesn't cover your case. A similar problem has already been discussed in #8768, and a patch was merged via #10948. Perhaps we should also rename syncCh to avoid any confusion, as the channel would no longer be used to synchronize goroutines.
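For context, the change being debated boils down to the difference between these two allocations (an illustrative sketch using the channel's real element type, not the actual patch):

```go
package discovery // illustrative sketch only, not the actual patch

import "github.com/prometheus/prometheus/discovery/targetgroup"

// Today: unbuffered. A send succeeds only if the notifier happens to be
// parked on a receive at that exact instant; otherwise the manager skips
// the update and retries on the next tick.
var syncChUnbuffered = make(chan map[string][]*targetgroup.Group)

// Proposed: capacity 1. The manager can always deposit the most recent
// target-group set, and the notifier collects it whenever it next looks.
var syncChBuffered = make(chan map[string][]*targetgroup.Group, 1)
```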
I read through this during bug-scrub, and I came to the same conclusions as above.

Thanks to all who contributed so far.
Ah, great find! Unfortunately the 2-level select in #10948 cannot work without some kind of blocking write to the channel.
Could you elaborate on why you think making syncCh buffered will make reasoning/debugging more difficult? During "normal" operation, the buffered channel will just hold the latest target group set until the notifier/scrape manager reads the channel. This is effectively the same as what happens now. During periods where the notifier/scrape manager cannot read the channel frequently, the buffered channel will just contain the target group set which was written after the last read of the channel. Since the reader of the channel should always eventually get a chance to read it, we wouldn't expect that target group set to be significantly out of date. The discovery manager's behavior wouldn't change at all - if the buffer is full, it'll just skip writing to the channel and log that the channel is full (which is exactly what it already does). To me, this behavior seems significantly less confusing than the existing behavior because it guarantees that syncCh is actually read.

I also suspect that any solution other than buffering the channel will end up implementing buffering. Anything I can think of doing basically boils down to adding a new goroutine on the reading side which does a blocking read of syncCh and hands the latest value to the main loop - which is just a buffer by another name.
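To illustrate that last point, here's roughly what such a reader-side workaround would look like (a hypothetical helper of my own, assuming Go 1.18+ generics; nothing like this exists in the tree):

```go
package notifierutil // hypothetical package, for illustration only

// coalesce drains `in` with blocking reads and keeps only the most recent
// value for a slow consumer.
func coalesce[T any](in <-chan T) <-chan T {
	out := make(chan T, 1)
	go func() {
		defer close(out)
		for v := range in {
			// Throw away the stale value if the consumer hasn't taken it...
			select {
			case <-out:
			default:
			}
			// ...and park the newest one. This never blocks: out is empty
			// here and this goroutine is its only writer.
			out <- v
		}
	}()
	return out
}
```

Note the one-element buffered channel hiding inside: the helper is the buffering, just with an extra goroutine attached.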
If I understand TestHangingNotifier correctly, it doesn't exercise this situation. I'll try to take some time this week to think of a better way to simulate the read behavior in a test and add it to #13677.
As a proof of concept/demonstration of the issue, I've updated #13677 so that the tests exercise the channel behavior directly. If you run those tests you'll see that the unbuffered channel starves while the buffered one keeps delivering updates.

I don't believe that either of these is a good test, however. They both just attempt to replicate the behavior of the discovery package and don't actually test the use of the real syncCh logic. They do replicate the real bug, though, and demonstrate that a buffered syncCh would solve the problem.
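For anyone who doesn't want to open the PR, the shape of the replication is roughly this (my own stripped-down sketch against bare channels, with the same caveat that it models the discovery package rather than testing it):

```go
package repro

import (
	"testing"
	"time"
)

// busyReader models notifier.Run under load: it polls tsets without blocking
// and otherwise spends its time in a simulated sendAll.
func busyReader(tsets <-chan struct{}, stop <-chan struct{}) (delivered int) {
	for {
		select {
		case <-stop:
			return delivered
		case <-tsets:
			delivered++
		default:
			time.Sleep(20 * time.Millisecond) // simulated sendAll waiting on timeouts
		}
	}
}

// run pairs busyReader with a sender doing non-blocking writes on a ticker,
// like Manager.sender(), and reports how many updates got through.
func run(tsets chan struct{}) int {
	stop := make(chan struct{})
	done := make(chan int, 1)
	go func() { done <- busyReader(tsets, stop) }()
	for i := 0; i < 50; i++ {
		select {
		case tsets <- struct{}{}:
		default: // receiver busy; drop and retry next tick
		}
		time.Sleep(5 * time.Millisecond)
	}
	close(stop)
	return <-done
}

func TestUnbufferedStarves(t *testing.T) {
	if got := run(make(chan struct{})); got != 0 {
		t.Fatalf("expected starvation on the unbuffered channel, got %d updates", got)
	}
}

func TestBufferedDelivers(t *testing.T) {
	if got := run(make(chan struct{}, 1)); got == 0 {
		t.Fatal("expected the buffered channel to deliver at least one update")
	}
}
```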
Thanks, I'll take a look at this.
What did you do?
We have a cluster of Alertmanager instances running in HA mode. We've configured Prometheus to discover the Alertmanager instances using DNS SRV service discovery. In our deployment, the DNS records are served by Consul. When an Alertmanager instance fails or is removed, the SRV record is quickly updated to reflect the new topology.

We had a host with an Alertmanager instance fail and observed the Prometheus prometheus_notifications_queue_length metric grow uncontrollably, and we started getting "Alert notification queue full, dropping alerts" messages in the log.

Reproduction steps:
What did you expect to see?
We expected Prometheus to stop sending alerts to the failed Alertmanager quickly after the DNS record changed. We expected Prometheus to continue sending alerts to the Alertmanagers that remained in the DNS record.
What did you see instead? Under which circumstances?
We experienced a failure on a host running one of the Alertmanager instances. The failed instance became unreachable and was removed from the DNS record within a few seconds. Prometheus continued to attempt to post alerts to the failed Alertmanager until we restarted the process. Because every request hit the context timeout, the notification queue started to grow uncontrollably and we started to see notifications dropped. This is the same behavior described in #7676.
This was unexpected for our configuration because the DNS records changed to reflect the missing Alertmanager very quickly. Upon investigation, we found that prometheus_sd_discovered_targets{name="notify"} reflected the new topology almost immediately (which is exactly what we expected). However, prometheus_notifications_alertmanagers_discovered never reflected the change (until we manually restarted Prometheus). This was unexpected.

We did some investigation and believe we have found the root cause: https://github.com/prometheus/prometheus/blob/main/discovery/manager.go#L346 performs a non-blocking write to https://github.com/prometheus/prometheus/blob/main/discovery/manager.go#L90, which is an unbuffered channel. For Alertmanager notifications, this is read by https://github.com/prometheus/prometheus/blob/main/notifier/notifier.go#L310, which is effectively a non-blocking read when the notifier is under heavy load (because n.more is likely to be readable). This becomes a death spiral because the long timeout gives more time for more alerts to enqueue, which increases the chance that n.more will be readable. The spiral only stops once the maximum notification queue depth is reached. In our case this reached a steady state of n.more always being readable and tsets never getting read.

This means that service discovery targets for Alertmanager don't actually propagate to the notifier unless the loops in the manager and the notifier happen to perfectly line up. In the case where an Alertmanager instance has failed, the expected state of the notifier is to be in sendAll almost all the time, because every attempt to post an alert requires waiting for the context to expire, which is much longer than the time it takes to execute the select statement in Run. In practice, there's no reason to expect that the notifier will ever get new service discovery information in this situation.

This is a serious problem because it means any Alertmanager instance becomes a single point of failure for the entire Prometheus+Alertmanager deployment. We observed that the rate of notifications to the other Alertmanager instances steeply declined and notifications began getting dropped.
We think this would behave better if syncCh were buffered. I'll open a PR with that change shortly.

System information
Linux
Prometheus version
Prometheus configuration file
No response
Alertmanager version
No response
Alertmanager configuration file
No response
Logs