What did you do?
Hi! I have Alertmanager running in Kubernetes, deployed via the community Helm chart. After upgrading from chart version 1.33.1 to 1.35.0 (Alertmanager v0.31.1 to v0.32.2 respectively) I started getting the "excessive retries creating aggregation group" error shown in the logs below. This doesn't seem to be a Helm chart issue: the diff shows the only change in the chart is the image version bump, and no other config changes have been made. Alertmanager runs as a cluster of 3 instances. I have no alert grouping set up (group_by: ['...']), and a lot of alerts go through this cluster with relatively complex routing/matchers, but I never saw this issue on v0.31.1.
The error seems to hit various alerts at random, and the debug logs show that some of them are actually in the resolved state when it happens. I have looked at the metrics and nothing stood out apart from alertmanager_dispatcher_aggregation_groups, which usually hovers around 200 but spikes to 300+, together with a slight spike in alertmanager_dispatcher_alert_processing_duration_seconds_count, and that's when the errors show up. This is worrying because the comment in dispatch.go specifically says this is caused by either a "bug or extreme contention".
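For reference, these are roughly the queries I've been graphing to correlate the spikes with the errors (the pod label is just how my scrape config labels the replicas; adjust as needed):

# aggregation groups per replica: normally ~200 here, spikes to 300+ when the errors appear
max by (pod) (alertmanager_dispatcher_aggregation_groups)

# alert processing rate over the same window, which shows a slight bump at the same time
sum by (pod) (rate(alertmanager_dispatcher_alert_processing_duration_seconds_count[5m]))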
Please let me know if further troubleshooting is needed to pinpoint the cause of the issue.
What did you expect to see?
No bug/regression when upgrading to a newer version.
What did you see instead? Under which circumstances?
"excessive retries creating aggregation group" errors (see logs below) after upgrading from v0.31.1 to v0.32.2.
System information
Kubernetes (Talos Linux v1.11.1, kernel 6.12.45-talos)
Alertmanager version
alertmanager, version 0.32.0 (branch: HEAD, revision: 685a2a1c6bb01b2c17bc1bfae995cb3416c1115e)
build user: root@e5ae55633a39
build date: 20260408-18:08:22
go version: go1.26.2
platform: linux/amd64
tags: netgo
Alertmanager configuration file
global: {}
inhibit_rules:
  - equal:
      - cluster
      - namespace
      - pod
    source_matchers:
      - alertname = KubePodCrashLooping
    target_matchers:
      - alertname = KubeContainerWaiting
receivers:
  <lots-of-receivers>
route:
  group_by:
    - '...'
  group_interval: 5m
  group_wait: 10s
  receiver: "null"
  repeat_interval: 3h
  routes:
    <lots-of-routes>
templates:
  - /etc/alertmanager/*.tmpl
Prometheus version
Prometheus configuration file
Logs
...
time=2026-04-16T14:47:11.144Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=4198ce13258e8234 route={}/{} alert=KubeletRestarted retries=101
time=2026-04-16T14:47:11.163Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=4198ce13258e8234 route={}/{} alert=KubeletRestarted retries=101
time=2026-04-16T14:58:11.161Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=4198ce13258e8234 route={}/{} alert=KubeletRestarted retries=101
time=2026-04-16T15:12:12.750Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=db1705f493471c37 route={}/{} alert=HostNetworkInterfaceSaturationSpike retries=101
time=2026-04-16T15:21:52.746Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=db1705f493471c37 route={}/{} alert=HostNetworkInterfaceSaturationSpike retries=101
time=2026-04-16T15:22:12.736Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=db1705f493471c37 route={}/{} alert=HostNetworkInterfaceSaturationSpike retries=101
time=2026-04-16T15:22:12.736Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=db1705f493471c37 route={}/{} alert=HostNetworkInterfaceSaturationSpike retries=101
time=2026-04-16T15:22:32.746Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=db1705f493471c37 route={}/{} alert=HostNetworkInterfaceSaturationSpike retries=101
...