Alertmanager service discovery does not update under heavy load #13676

Spaceman1701 · 2024-02-29T17:47:14Z

What did you do?

We have a cluster of Alertmanager instances running in HA mode. We've configured Prometheus to discover the Alertmanager instances using DNS SRV SD. In our deployment, the DNS records are served by consul. When an Alertmanager instance fails or is removed, the SRV record is quickly updated to reflect the new topology. We had a host with an Alertmanager instance fail and observed the Prometheus prometheus_notifications_queue_length grow uncontrollably and we started getting Alert notification queue full, dropping alerts messages in the log.

Reproduction steps:

Deploy an Alertmanager cluster in HA mode
Configure Prometheus to discover Alertmanager using https://prometheus.io/docs/prometheus/latest/configuration/configuration/#dns_sd_config with SRV records
Kill one Alertmanager instance and remove that instance from the SRV record
Wait until the Prometheus notification queue fills

What did you expect to see?

We expected Prometheus to stop sending alerts to the failed Alertmanager quickly after the DNS record changed. We expected Prometheus to continue sending alerts to the Alertmanagers that remained in the DNS record.

What did you see instead? Under which circumstances?

We experienced a failure on a host running one of the Alertmanager instances. The failed instance become unreachable and it was removed from the DNS record within a few seconds. Prometheus continued to attempt to post alerts to the failed Alertmanager until we restarted the process. Because every request hit the context timeout, the notification queue started to grow uncontrollably and we started to see notifications dropped. This is the same behavior described in #7676.

This was unexpected for our configuration because the DNS records changed to reflect the missing Alertmanger very quickly. Upon investigation, we found that the prometheus_sd_discovered_targets{name="notify"} reflected the new topology almost immediately (which is exactly what we expected). However, prometheus_notifications_alertmanagers_discovered never reflected the change (until we manually restarted prometheus). This was unexpected.

We did some investigation and believe we have found the root cause: https://github.com/prometheus/prometheus/blob/main/discovery/manager.go#L346 performs a non-blocking write to https://github.com/prometheus/prometheus/blob/main/discovery/manager.go#L90 which is an unbuffered channel. For Alertmanager notifications, this is read by https://github.com/prometheus/prometheus/blob/main/notifier/notifier.go#L310 which is effectively a non-blocking read when the notifier is under heavy load (because n.more is likely to be readable). This becomes a death-spiral because the long timeout gives more time for more alerts to enqueue which increases the chance that n.more will be readable. The spiral only stops once the maximum notification queue depth is reached. In our case this reached a steady-state of n.more always being readable and tsets never getting read. This means that service discovery targets for Alertmanager don't actually propagate to the notifier unless the loops in the manager and the notifier happen to perfectly line up. In the case where an Alertmanager instance has failed, the expected state of the notifier is to be in sendAll almost all the time because every attempt to post an alert will require waiting for the context to expire which is much longer than the time it takes to execute the select statement in Run. In practice, there's no reason to expect that the notifier will ever get new service discovery information in this situation.

This is a serious problem because it means any Alertmanager instance becomes a signal point of failure for the entire Prometheus+Alertmanager deployment. We observed that the rate of notifications to the other Alertmanager instances steeply declined and notifications began getting dropped.

We think this would behave better if the syncCh was buffered. I'll open a PR with that change shorty.

System information

Linux

Prometheus version

2.45.1

Prometheus configuration file

No response

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

We saw both these messages repeatedly in the Prometheus logs: 

caller=manager.go:246 level=debug component="discovery manager notify" msg="Discovery receiver's channel was full so will retry the next cycle"

caller=notifier.go:371 level=warn component=notifier msg="Alert notification queue full, dropping alerts" num_dropped=3

The text was updated successfully, but these errors were encountered:

machine424 · 2024-03-01T08:39:12Z

Good find! I see you're referring to discovery/manager.go. I assume you're running Prometheus with --enable-feature=new-service-discovery-manager.

Do you have the same issue on the latest Prometheus version?

I don't think that a buffered channel to avoid losing an update is the best way to go. Others may argue that the buffer size should be more than 1 because they have more Alertmanager clusters...

The discovery manager sends updates regularly. The notifier should end up getting them, even if one channel is too busy. select should ensure fairness.

So maybe, there is a bug somewhere else. I thought the discovery manager wasn't sending good updates, but prometheus_sd_discovered_targets being updated intrigues me.

Do you trigger any config reloads? Could you share your alertmanager_config and see if you can simulate the issue using file_sd_config instead of dns_sd_config?

Spaceman1701 · 2024-03-01T17:21:35Z

Good find! I see you're referring to discovery/manager.go. I assume you're running Prometheus with --enable-feature=new-service-discovery-manager.

Ah, we do not use --enable-feature=new-service-discovery-manager and I didn't realize discovery/manager.go was gated behind it. Thanks for pointing that out! It looks like the legacymanager.Manager has the same behavior: https://github.com/prometheus/prometheus/blob/main/discovery/legacymanager/manager.go#L214-L225

Do you have the same issue on the latest Prometheus version?

We haven't tested with a newer version of Promethues, but the code for propagating discovery to Alertmanager hasn't changed.

I don't think that a buffered channel to avoid losing an update is the best way to go. Others may argue that the buffer size should be more than 1 because they have more Alertmanager clusters...

That's a good point, I didn't think of that. Luckily, it seems like the discovery manager combings all the underlying providers (which I think are created from the sd_configs) into one target set which is shared over the syncCh (at least, if I'm reading the code correctly).
See https://github.com/prometheus/prometheus/blob/main/discovery/legacymanager/manager.go#L169, https://github.com/prometheus/prometheus/blob/main/discovery/legacymanager/manager.go#L193, https://github.com/prometheus/prometheus/blob/main/discovery/legacymanager/manager.go#L238, and https://github.com/prometheus/prometheus/blob/main/discovery/legacymanager/manager.go#L217.

There's also only one notifier as far as I can tell: https://github.com/prometheus/prometheus/blob/main/cmd/prometheus/main.go#L1259

The discovery manager sends updates regularly. The notifier should end up getting them, even if one channel is too busy. select should ensure fairness.
So maybe, there is a bug somewhere else. I thought the discovery manager wasn't sending good updates, but prometheus_sd_discovered_targets being updated intrigues me.

I'm pretty sure what I identified above is the underlying bug. The select should ensure fairness if the channel was buffered, but the current implementation drops messages if the manager sends a target group to the notifier while the notifier is doing different work.

For reference and to make sure my argument is very clear: here's the relevant snippet from legacymanager/manager.go:

// https://github.com/prometheus/prometheus/blob/main/discovery/legacymanager/manager.go#L204
func (m *Manager) sender() {
	ticker := time.NewTicker(m.updatert)
	defer ticker.Stop()

	for {
		select {
		case <-m.ctx.Done():
			return
		case <-ticker.C: // Some discoverers send updates too often so we throttle these with the ticker.
			select {
			case <-m.triggerSend:
				m.metrics.SentUpdates.Inc()
				select {
				case m.syncCh <- m.allGroups():
				default:
					m.metrics.DelayedUpdates.Inc()
					level.Debug(m.logger).Log("msg", "Discovery receiver's channel was full so will retry the next cycle")
					select {
					case m.triggerSend <- struct{}{}:
					default:
					}
				}
			default:
			}
		}
	}
}

When we're ready to send a target group update, we're in the second case of the outer select (case <-m.triggerSend:). This executes the inner select:

select {
case m.syncCh <- m.allGroups():
default:
    m.metrics.DelayedUpdates.Inc()
    level.Debug(m.logger).Log("msg", "Discovery receiver's channel was full so will retry the next cycle")
    select {
    case m.triggerSend <- struct{}{}:
    default:
}

The inner select performs a non-blocking write to the m.syncCh channel. This means that if nothing is currently reading m.syncCh, the default branch will be executed. In our case, we know this was happening because we saw the Discovery receiver's channel was full so will retry the next cycle message in the log.

On the other side of m.syncCh, we have Notifier.Run in notifier.go:

// https://github.com/prometheus/prometheus/blob/main/notifier/notifier.go#L302
func (n *Manager) Run(tsets <-chan map[string][]*targetgroup.Group) {
	for {
		// The select is split in two parts, such as we will first try to read
		// new alertmanager targets if they are available, before sending new
		// alerts.
		select {
		case <-n.ctx.Done():
			return
		case ts := <-tsets:
			n.reload(ts)
		default:
			select {
			case <-n.ctx.Done():
				return
			case ts := <-tsets:
				n.reload(ts)
			case <-n.more:
			}
		}
		alerts := n.nextBatch()

		if !n.sendAll(alerts...) {
			n.metrics.dropped.Add(float64(len(alerts)))
		}
		// If the queue still has items left, kick off the next iteration.
		if n.queueLen() > 0 {
			n.setMore()
		}
	}
}

If I understand correctly, tsets is m.syncCh. I believe the comment here is a bit a misleading - the outer select does prioritize reading from tsets, but because this select has a default case, tsets is an unbuffered channel, and the manager does a non-blocking write, case ts := <-tsets: will only be selected if the manager is writing to m.syncCh at the exact moment this select is executed. It's safe to assume this never happens. This means that we'll essentially always enter the inner select (assuming the context is still valid).

select {
case <-n.ctx.Done():
    return
case ts := <-tsets:
    n.reload(ts)
case <-n.more:
}
alerts := n.nextBatch()

if !n.sendAll(alerts...) {
    n.metrics.dropped.Add(float64(len(alerts)))
}
// If the queue still has items left, kick off the next iteration.
if n.queueLen() > 0 {
    n.setMore()
}

This select is blocking, so it does give us a chance of entering the case ts := <-tsets branch if the manager writes the m.syncCh while the notifier is blocking in on this select. That condition is key because the notifier will only block on this select as long as n.more is not readable. The select statement only guarantees fair selection between cases which are readable when the select statement is ready to unblock. n.more is a buffered channel (https://github.com/prometheus/prometheus/blob/main/notifier/notifier.go#L233) so if anything writes to it, it will be readable and the select statement will be able to execute the final case (case <-n.more:) and enter n.sendAll.

alerts := n.nextBatch()

if !n.sendAll(alerts...) {
    n.metrics.dropped.Add(float64(len(alerts)))
}
// If the queue still has items left, kick off the next iteration.
if n.queueLen() > 0 {
    n.setMore()
}

n.nextBatch pops up to maxBatchSize from n.queue. For reference, maxBatchSize is 64: https://github.com/prometheus/prometheus/blob/main/notifier/notifier.go#L275. Ifn.queryLen() > 64 when we exit the select, we will call n.setMore() which internally writes to n.more. This means that there are more than 64 alerts in the queue, n.more will always be readable when we enter the select.

I won't inline sendAll here because it's very long, but it sequentially and synchronously iterates every alertmanager and attempts to post the batch: https://github.com/prometheus/prometheus/blob/main/notifier/notifier.go#L522. In the case where some Alertmanagers are unreachable, we expect this to take at least ams.cfg.Timeout. During this time, new alerts can be added to the queue via Manager.Send (https://github.com/prometheus/prometheus/blob/main/notifier/notifier.go#L349). This will also call n.setMore(). Also, remember that during this time, nothing is reading tsets so the discovery manager will drop any target group messages when it tries to write to m.syncCh.

So, there are two conditions where n.more becomes readable:

if n.queueLen() > 64 at the start of the iteration
any alert is added to the queue

In our case, we found that new alerts were being generated faster than the notifier could process them. This caused n.queueLen() to grow uncontrollably until we hit the maximum length (which defaults to 10,000: https://github.com/prometheus/prometheus/blob/main/cmd/prometheus/main.go#L428-L429). At that point, even if all alerting stopped, it would take a long time to clear the queue with the default timeout of 10s (I think about 20m, unless I'm missing something). Of course, alerts do continue to get generated so our expectation is that the queue would never clear.

Even with somewhat shorter timeouts, we can expect n.more usually be readable when an Alertmanager is unreachable. As long as n.more is readable, tsets will only get read if the manager writes to m.syncCh at the exact moment that the notifier enters the select statement in Run. There's no reason to expect that this will ever happen.

The end result is that we should never expect the notifier to see updated target groups because it will not have an opportunity to read from the tsets channel. The select statement only guarantees fair selection between cases which are readable when the select statement is ready to unblock. We expect the notifier to be in sendAll the vast majority of the time under high load conditions or when an Alertmanager is unreachable. Since tsetsis very unlikely to be readable at any given time, it's very unlikely that the branch will be selected.

This explains the divergence in the two sd metrics - the discovery manager immediately saw the change but never got a message to the notifier.

Hopefully that's not too much information and my explanation is clear! I just want to make sure we're all on the same page. That's all my reasoning, so if I've made a mistake, hopefully this explanation makes it obvious.

Do you trigger any config reloads? Could you share your alertmanager_config and see if you can simulate the issue using file_sd_config instead of dns_sd_config?

We did not trigger any reloads unfortunately. I don't know how the effects the notification queue. If it clears the queue, I'd expect it would've solved the problem. When we saw that the queue was not shrinking, we ended up restarting the Prometheus process. When the server came back up, the prometheus_notifications_alertmanagers_discovered metric showed the correct number of targets and everything started working fine again.

Unfortunately I can't share our config, but this is what the relevant alertmanagers block looks like:

-  dns_sd_configs:
     - names:
        - our-alertmanager-srv-record-name
     type: SRV

machine424 · 2024-03-06T16:58:23Z

That's a good point, I didn't think of that. Luckily, it seems like the discovery manager combings all the underlying providers (which I think are created from the sd_configs) into one target set which is shared over the syncCh (at least, if I'm reading the code correctly)...

You're correct. I overlooked the fact that the SD manager retries if it fails to write into the channel, even though that's why I initially thought that syncCh±, even when unbuffered, should eventually be read. Perhaps "fairness" isn't the most appropriate term, but I believed syncCh would have a chance.

Initially, I thought we could simulate what happens using this code: https://go.dev/play/p/DCNFft9X33j. When I run that locally, I observe reads from chan2. Of course, not every 5 seconds as expected/desired, but the channel is read occasionally. The issue is that in the Notifier, there isn't a third goroutine. The same goroutine that reads from the two channels also writes into one of them. So, in the worst cases, it looks more like https://go.dev/play/p/b_IPVrs6zJm. In that case, the other channel is indeed never read (at least on my machine), and I'm unsure if this is expected/normal.

So, yes, we'll need to assist the reads from syncCh. Another solution we could consider is moving those reads from that hot path. The Scrape manager, which is another consumer of the SD manager, does this separately:

prometheus/scrape/manager.go

Line 114 in 0a42621

func (m *Manager) Run(tsets <-chan map[string][]*targetgroup.Group) error {

. Perhaps we could do something similar.

By the way, the Scrape manager uses a buffered channel already: https://github.com/prometheus/prometheus/blob/0a426215b52e461e3fa4c162c79279bcd5e3b795/scrape/manager.go#L59C3-L59C16. I share @roidelapluie's opinion (see below); I'm concerned that making syncCh buffered as well might complicate reasoning and debugging.

I believe the first step would be to assemble a test that reproduces the issue and understand why TestHangingNotifier doesn't cover your case.

A similar problem has already been discussed in #8768, and a patch was merged via #10948.

± Perhaps we should rename it to avoid any confusion, as the channel is no longer used to synchronize goroutines.

bboreham · 2024-03-12T11:51:21Z

I read through this during bug-scrub, and I came to these same conclusions:

name should not be syncCh if it is no longer a synchronization point.
we should have a test that reproduces the issue.

Thanks to all who contributed so far.

Spaceman1701 · 2024-03-19T17:46:11Z

A similar problem has already been discussed in #8768, and a patch was merged via #10948

Ah, great find! Unfortunately the 2-level select in #10948 cannot work without some kind of blocking write to the syncCh (or a buffered syncCh).

I share @roidelapluie's opinion (see below); I'm concerned that making syncCh buffered as well might complicate reasoning and debugging.

Could you elaborate on why you think making syncCh buffered will make reasoning/debugging more difficult? During "normal" operation, the buffered channel will just hold the latest target group set until the notifier/scrape manager read the channel. This is effectively the same as what happens now. During periods periods where the notifier/scrape manager cannot read the channel frequently, the buffered channel will just contain the target group set which was written after the last read of the channel. Since the reader of the channel should always eventually get a chance to read it, we wouldn't expect that target group set to be significantly out of date. The discovery manager's behavior wouldn't change at all - if the buffer is full, it'll just skip writing to the channel and log that the channel is full (which is exactly what it already does). To me, this behavior seems significantly less confusing than the existing behavior because it guarantees that syncCh is actually read.

I also suspect that any solution other than buffering the channel will end up implementing buffering. Anything I can think of doing basically boils down to adding a new goroutine on the reading side which does a blocking read of syncCh and then passes the value to the notifier on other channel with a blocking write. This is exactly the same behavior as a buffered channel.

I believe the first step would be to assemble a test that reproduces the issue and understand why TestHangingNotifier doesn't cover your case.
we should have a test that reproduces the issue.

If I understand TestHandingNotifier correctly, it doesn't recreate the problem because it does a blocking write the syncCh: https://github.com/roidelapluie/prometheus/blob/main/notifier/notifier_test.go#L662-L677. That's different from what the real discovery manager does and would ensure that the notifier does eventually see the target groups.

I'll try to take some time this week to think of a better way to write simulate the read behavior in a test and add it to #13677.

Spaceman1701 · 2024-03-21T19:17:07Z

As a proof of concept/demonstration of the issue, I've updated #13677 so that TestHangingNotifier sends values on the syncCh consistent with the "real" way the discovery package does it. I've also added a TestHangingNotifierWithBufferedChannel that constructs a buffered syncCh.

If you run those tests you'll see that TestHangingNotifier now fails while TestHangingNotifierWithBufferedChannel succeeds.

Here's the output from TestHangingNotifier:

--- FAIL: TestHangingNotifier (10.01s)
    --- FAIL: TestHangingNotifier/0 (1.00s)
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:687: Timeout after 1 second, targets not synced in time.
    --- FAIL: TestHangingNotifier/1 (1.00s)
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:687: Timeout after 1 second, targets not synced in time.
    --- FAIL: TestHangingNotifier/2 (1.00s)
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:687: Timeout after 1 second, targets not synced in time.
    --- FAIL: TestHangingNotifier/3 (1.00s)
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:687: Timeout after 1 second, targets not synced in time.
    --- FAIL: TestHangingNotifier/4 (1.00s)
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:687: Timeout after 1 second, targets not synced in time.
    --- FAIL: TestHangingNotifier/5 (1.00s)
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:687: Timeout after 1 second, targets not synced in time.
    --- FAIL: TestHangingNotifier/6 (1.00s)
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:687: Timeout after 1 second, targets not synced in time.
    --- FAIL: TestHangingNotifier/7 (1.00s)
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:687: Timeout after 1 second, targets not synced in time.
    --- FAIL: TestHangingNotifier/8 (1.00s)
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:687: Timeout after 1 second, targets not synced in time.
    --- FAIL: TestHangingNotifier/9 (1.00s)
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:597: hanging server called multiple times
        notifier_test.go:687: Timeout after 1 second, targets not synced in time.
FAIL
FAIL	github.com/prometheus/prometheus/notifier	12.059s
FAIL

and here's the output from TestHangingNotifierWithBufferedChannel:

ok  	github.com/prometheus/prometheus/notifier	2.049s

I don't believe that either of these are good tests, however. They both just attempt to replicate the behavior of the discovery package and don't actually test the use of the real syncCh logic. This does replicate the real bug and demonstrates that a buffered syncCh would solve the problem.

machine424 · 2024-04-19T13:09:33Z

Thanks, I'll take a look at this.

Spaceman1701 changed the title ~~Alertmanager service discovery does not update load~~ Alertmanager service discovery does not update under heavy load Feb 29, 2024

Spaceman1701 linked a pull request Feb 29, 2024 that will close this issue

Change discovery manager syncCh to buffered channel #13677

Open

nielsole mentioned this issue May 15, 2024

notifier: stop queue filling due to single failed AM #14099

Draft

Spaceman1701 mentioned this issue May 16, 2024

Notification queue fills with single down AM instance #7676

Open

pintohutch mentioned this issue May 24, 2024

rule-evaluator doesn't get updated alertmanager pod ipv4s GoogleCloudPlatform/prometheus-engine#866

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alertmanager service discovery does not update under heavy load #13676

Alertmanager service discovery does not update under heavy load #13676

Spaceman1701 commented Feb 29, 2024

machine424 commented Mar 1, 2024

Spaceman1701 commented Mar 1, 2024

machine424 commented Mar 6, 2024

bboreham commented Mar 12, 2024

Spaceman1701 commented Mar 19, 2024

Spaceman1701 commented Mar 21, 2024

machine424 commented Apr 19, 2024

Alertmanager service discovery does not update under heavy load #13676

Alertmanager service discovery does not update under heavy load #13676

Comments

Spaceman1701 commented Feb 29, 2024

What did you do?

What did you expect to see?

What did you see instead? Under which circumstances?

System information

Prometheus version

Prometheus configuration file

Alertmanager version

Alertmanager configuration file

Logs

machine424 commented Mar 1, 2024

Spaceman1701 commented Mar 1, 2024

machine424 commented Mar 6, 2024

bboreham commented Mar 12, 2024

Spaceman1701 commented Mar 19, 2024

Spaceman1701 commented Mar 21, 2024

machine424 commented Apr 19, 2024