pkg/alertmanager: Use lower value for --cluster.reconnect-timeout #3436

hwoarang · 2020-08-24T11:51:55Z

Alertmanager in cluster mode resolves the DNS name of each peer and
caches its IP address, which it uses at regular intervals to 'refresh'
the connection.

In highly dynamic environments like Kubernetes, Alertmanager pods can
come and go at frequent intervals. The default timeout
value of 6h is not suitable in that case, as Alertmanager will keep
trying to reconnect to a non-existent pod over and over until it gives
up and removes that peer from the member list. During this period,
the cluster is reported to be in a degraded state due to the
missing member.

As such, it's best to use a lower value which will allow the
alertmanager to remove the pod from the list of peers soon
after it disappears.

Related: prometheus/alertmanager#2250

pkg/alertmanager/statefulset.go

simonpasquier · 2020-08-25T14:24:07Z

The default timeout value of 6h is not suitable in that case as alertmanager will keep
trying to reconnect to a non-existing pod over and over until it gives
up and goes through another DNS resolution process.

To be clear, Alertmanager will continuously resolve the peer addresses irrespective of the --cluster.reconnect-timeout value. The flag defines how long Alertmanager will keep trying to reconnect to an IP address it has been connected to before. I do agree that in dynamic environments such as Kubernetes, it makes sense to lower the default value.

hwoarang · 2020-08-25T14:26:31Z

@simonpasquier Thank you for the clarification. I can update the commit message if you want.

simonpasquier · 2020-08-25T15:05:27Z

Yes please, update the comment too. Thanks!

hwoarang · 2020-08-25T15:15:12Z

Yes please, update the comment too. Thanks!

@simonpasquier I have updated both. I hope it's better now.

simonpasquier · 2020-08-26T09:06:42Z

The CI fails because of Go formatting, please run make go-fmt and commit the changes.

simonpasquier

LGTM

s-urbaniak · 2020-08-31T10:46:02Z

LGTM, but potentially a follow-up question: Does it make sense to configure this setting? My impression is that it shouldn't as it is an implementation detail of how we reconcile alertmanager instances. We could add that rationale though at least as a comment, wdyt?

hwoarang · 2020-08-31T12:16:22Z

LGTM, but potentially a follow-up question: Does it make sense to configure this setting? My impression is that it shouldn't as it is an implementation detail of how we reconcile alertmanager instances. We could add that rationale though at least as a comment, wdyt?

I agree that configuring this option shouldn't be necessary. I think the existing comment explains the reason to change the default value. Do you think it needs more details?

s-urbaniak · 2020-08-31T13:39:06Z

I think we're good, merging, thank you!

hwoarang requested a review from a team as a code owner August 24, 2020 11:51

hwoarang requested review from squat and removed request for a team August 24, 2020 11:51

hwoarang force-pushed the add-cluster-reconnect-timeout branch from d3660b4 to eca0442 Compare August 24, 2020 11:54

simonpasquier reviewed Aug 24, 2020

hwoarang force-pushed the add-cluster-reconnect-timeout branch from eca0442 to 00dbe38 Compare August 24, 2020 13:55

hwoarang requested a review from simonpasquier August 24, 2020 13:55

hwoarang force-pushed the add-cluster-reconnect-timeout branch from 00dbe38 to 4f2fd95 Compare August 25, 2020 15:15

simonpasquier reviewed Aug 26, 2020

hwoarang force-pushed the add-cluster-reconnect-timeout branch from 4f2fd95 to 86102e7 Compare August 26, 2020 10:02

hwoarang requested a review from simonpasquier August 26, 2020 15:01

simonpasquier approved these changes Aug 27, 2020

hwoarang mentioned this pull request Aug 31, 2020

[stable/prometheus-operator] Add Alertmanager cluster.reconnect-timeout option support helm/charts#23575

Closed

s-urbaniak merged commit 608be1b into prometheus-operator:master Aug 31, 2020

jquick mentioned this pull request Jun 24, 2023

Alertmanager peerReconnectTimeout failing grafana/grafana#70657

Open
