"excessive retries creating aggregation group" error after upgrading to v0.32.0 #5176

@abcdegorov

What did you do?

Hi! I have Alertmanager running in K8s, deployed via the community Helm chart. After upgrading from chart version 1.33.1 to 1.35.0 (Alertmanager v0.31.1 to v0.32.0 respectively), I started getting the aforementioned error. This doesn't seem to be a Helm chart issue: the diff shows that the only change in the chart is the image version bump, and no other config changes have been made.

AM is set up to run as a cluster of 3 instances. I have no alert grouping set up, and a lot of alerts go through this AM cluster with relatively complex routing/matchers, but I never had this issue on v0.31.1. The error seems to hit various alerts at random, and the debug logs show that some of them are actually in the resolved state when it occurs.

I have had a look at the metrics and nothing stood out apart from alertmanager_dispatcher_aggregation_groups, which usually hovers around 200 but spikes to 300+, along with a slight spike in alertmanager_dispatcher_alert_processing_duration_seconds_count. That's when the errors show up. This is worrying because dispatch.go says specifically that this is caused by either a 'bug or extreme contention'.

Please let me know if further troubleshooting is needed to pinpoint the cause of the issue.

What did you expect to see?

no bug/regression when upgrading to a newer version

What did you see instead? Under which circumstances?

The errors shown in the logs below, which started after upgrading from v0.31.1 to v0.32.0.

System information

Kubernetes (Talos Linux v1.11.1, kernel 6.12.45-talos)

Alertmanager version

alertmanager, version 0.32.0 (branch: HEAD, revision: 685a2a1c6bb01b2c17bc1bfae995cb3416c1115e)
  build user:       root@e5ae55633a39
  build date:       20260408-18:08:22
  go version:       go1.26.2
  platform:         linux/amd64
  tags:             netgo

Alertmanager configuration file

global: {}
inhibit_rules:
- equal:
  - cluster
  - namespace
  - pod
  source_matchers:
  - alertname = KubePodCrashLooping
  target_matchers:
  - alertname = KubeContainerWaiting
receivers:
<lots-of-receivers>
route:
  group_by:
  - '...'
  group_interval: 5m
  group_wait: 10s
  receiver: "null"
  repeat_interval: 3h
  routes:
<lots-of-routes>
templates:
- /etc/alertmanager/*.tmpl
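
For context on the `group_by: ['...']` setting in the config above: the special value `...` groups by all labels, so every distinct label set ends up in its own aggregation group. Below is a minimal Python sketch of that idea only. The `group_key` helper and the `cluster`/`node` label values are hypothetical illustrations, not Alertmanager's actual implementation.

```python
def group_key(labels, group_by):
    """Return a hashable key identifying the aggregation group of an alert.

    Hypothetical helper: ['...'] is treated as "group by every label".
    """
    if group_by == ["..."]:
        names = sorted(labels)
    else:
        names = sorted(name for name in group_by if name in labels)
    return tuple((name, labels[name]) for name in names)


# Made-up alerts that differ only in some label values.
alerts = [
    {"alertname": "KubeletRestarted", "cluster": "a", "node": "n1"},
    {"alertname": "KubeletRestarted", "cluster": "a", "node": "n2"},
    {"alertname": "KubeletRestarted", "cluster": "b", "node": "n1"},
]

groups_all = {group_key(a, ["..."]) for a in alerts}
groups_by_name = {group_key(a, ["alertname"]) for a in alerts}

# One group per distinct label set with '...', a single group with 'alertname'.
print(len(groups_all), len(groups_by_name))
```

With this kind of keying, the number of aggregation groups tracks the number of distinct alerts, which may be relevant when reading alertmanager_dispatcher_aggregation_groups.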

Prometheus version


Prometheus configuration file

Logs

...
time=2026-04-16T14:47:11.144Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=4198ce13258e8234 route={}/{} alert=KubeletRestarted retries=101
time=2026-04-16T14:47:11.163Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=4198ce13258e8234 route={}/{} alert=KubeletRestarted retries=101
time=2026-04-16T14:58:11.161Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=4198ce13258e8234 route={}/{} alert=KubeletRestarted retries=101
time=2026-04-16T15:12:12.750Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=db1705f493471c37 route={}/{} alert=HostNetworkInterfaceSaturationSpike retries=101
time=2026-04-16T15:21:52.746Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=db1705f493471c37 route={}/{} alert=HostNetworkInterfaceSaturationSpike retries=101
time=2026-04-16T15:22:12.736Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=db1705f493471c37 route={}/{} alert=HostNetworkInterfaceSaturationSpike retries=101
time=2026-04-16T15:22:12.736Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=db1705f493471c37 route={}/{} alert=HostNetworkInterfaceSaturationSpike retries=101
time=2026-04-16T15:22:32.746Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=db1705f493471c37 route={}/{} alert=HostNetworkInterfaceSaturationSpike retries=101
...
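
To see which alerts and fingerprints are affected most often, the log lines above can be tallied with a small script. This is a rough sketch that assumes the logfmt layout shown in this issue. `parse_logfmt` and `count_errors` are hypothetical helper names, and the sample lines are copied from the excerpt above.

```python
import shlex
from collections import Counter

ERROR_MSG = "excessive retries creating aggregation group"


def parse_logfmt(line):
    """Split one logfmt line into a dict; shlex handles the quoted msg value."""
    fields = {}
    try:
        tokens = shlex.split(line)
    except ValueError:
        # Unbalanced quotes: treat the line as unparseable and skip it.
        return fields
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def count_errors(lines):
    """Count 'excessive retries' errors per (alert, fingerprint) pair."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("msg") == ERROR_MSG:
            counts[(fields.get("alert"), fields.get("fingerprint"))] += 1
    return counts


sample = [
    'time=2026-04-16T14:47:11.144Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=4198ce13258e8234 route={}/{} alert=KubeletRestarted retries=101',
    'time=2026-04-16T14:47:11.163Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=4198ce13258e8234 route={}/{} alert=KubeletRestarted retries=101',
    'time=2026-04-16T15:12:12.750Z level=ERROR source=dispatch.go:531 msg="excessive retries creating aggregation group" component=dispatcher fingerprint=db1705f493471c37 route={}/{} alert=HostNetworkInterfaceSaturationSpike retries=101',
]

counts = count_errors(sample)
for (alert, fingerprint), n in counts.most_common():
    print(alert, fingerprint, n)
```

Feeding the full log into `count_errors` instead of `sample` would show whether the errors cluster on a few fingerprints or are spread across many alerts.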
