Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/alertmanager: Use lower value for --cluster.reconnect-timeout #3436

Merged

Conversation

hwoarang
Copy link
Contributor

@hwoarang hwoarang commented Aug 24, 2020

Alertmanager in cluster mode resolves the DNS name of each peer and
caches its IP address which uses on regular intervals to 'refresh'
the connection.

In high-dynamic environment like kubernetes, it's possible that
alertmanager pods come and go on frequent intervals. The default timeout
value of 6h is not suitable in that case as alertmanager will keep
trying to reconnect to a non-existing pod over and over until it gives
up and remove that peer from the member list. During this period of
time, the cluster is reported to be in a degraded state due to the
missing member.

As such, it's best to use a lower value which will allow the
alertmanager to remove the pod from the list of peers soon
after it disappears.

Related: prometheus/alertmanager#2250

@hwoarang hwoarang requested a review from a team as a code owner August 24, 2020 11:51
@hwoarang hwoarang requested review from squat and removed request for a team August 24, 2020 11:51
@hwoarang hwoarang force-pushed the add-cluster-reconnect-timeout branch from d3660b4 to eca0442 Compare August 24, 2020 11:54
@simonpasquier
Copy link
Contributor

The default timeout value of 6h is not suitable in that case as alertmanager will keep
trying to reconnect to a non-existing pod over and over until it gives
up and goes through another DNS resolution process.

To be clear Alertmanager will continuously resolve the peer addresses irrespective of the --cluster.reconnect-timeout value. The flag defines how long Alertmanager will try to reconnect to an IP address it has been connected before. I do agree that in dynamic environments such as Kubernetes, it makes sense to lower the default value.

@hwoarang
Copy link
Contributor Author

@simonpasquier Thank you for the clarification. I can update the commit message if you want

@simonpasquier
Copy link
Contributor

Yes please, update the comment too. Thanks!

@hwoarang hwoarang force-pushed the add-cluster-reconnect-timeout branch from 00dbe38 to 4f2fd95 Compare August 25, 2020 15:15
@hwoarang
Copy link
Contributor Author

Yes please, update the comment too. Thanks!

@simonpasquier I have updated both. I hope it's better now

@simonpasquier
Copy link
Contributor

The CI fails because of Go formatting, please run make go-fmt and commit the changes.

Alertmanager in cluster mode resolves the DNS name of each peer and
caches its IP address which uses on regular intervals to 'refresh'
the connection.

In high-dynamic environment like kubernetes, it's possible that
alertmanager pods come and go on frequent intervals. The default timeout
value of 6h is not suitable in that case as alertmanager will keep
trying to reconnect to a non-existing pod over and over until it gives
up and remove that peer from the member list. During this period of
time, the cluster is reported to be in a degraded state due to the
missing member.

As such, it's best to use a lower value which will allow the
alertmanager to remove the pod from the list of peers soon
after it disappears.

Related: prometheus/alertmanager#2250
Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@s-urbaniak
Copy link
Contributor

LGTM, but potentially a follow-up question: Does it make sense to configure this setting? My impression is that it shouldn't as it is an implementation detail of how we reconcile alertmanager instances. We could add that rationale though at least as a comment, wdyt?

@hwoarang
Copy link
Contributor Author

LGTM, but potentially a follow-up question: Does it make sense to configure this setting? My impression is that it shouldn't as it is an implementation detail of how we reconcile alertmanager instances. We could add that rationale though at least as a comment, wdyt?

I agree that configuring this option shouldn't be necessary. I think the existing comment explains the reason to change the default value. Do you think it needs more details?

@s-urbaniak
Copy link
Contributor

I think we're good, merging, thank you!

@s-urbaniak s-urbaniak merged commit 608be1b into prometheus-operator:master Aug 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants