pkg/alertmanager: Use lower value for --cluster.reconnect-timeout #3436
To be clear, Alertmanager will continuously resolve the peer addresses irrespective of the ...
@simonpasquier Thank you for the clarification. I can update the commit message if you want.
Yes please, update the comment too. Thanks!
@simonpasquier I have updated both. I hope it's better now.
The CI fails because of Go formatting, please run ...
Alertmanager in cluster mode resolves the DNS name of each peer and caches its IP address, which it uses at regular intervals to 'refresh' the connection. In a highly dynamic environment like Kubernetes, Alertmanager pods can come and go frequently. The default --cluster.reconnect-timeout value of 6h is not suitable in that case, as Alertmanager will keep trying to reconnect to a non-existent pod over and over until it gives up and removes that peer from the member list. During this period, the cluster is reported to be in a degraded state due to the missing member. As such, it's best to use a lower value, which allows Alertmanager to remove the pod from the list of peers soon after it disappears. Related: prometheus/alertmanager#2250
LGTM
LGTM, but potentially a follow-up question: does it make sense to make this setting configurable? My impression is that it shouldn't be, as it is an implementation detail of how we reconcile Alertmanager instances. We could at least add that rationale as a comment though, wdyt?
I agree that configuring this option shouldn't be necessary. I think the existing comment explains the reason to change the default value. Do you think it needs more details?
I think we're good, merging, thank you!