cluster: Do not exit when failing to join cluster (#1465)
Alertmanager exits with a non-zero exit code if the initial cluster
join fails. This behavior may be undesirable because:

- Alertmanager is a critical component with an at-least-once
guarantee, so exiting when the cluster join fails is unnecessary:
Alertmanager still functions by itself.

- In an environment like Kubernetes, where peers are discovered via DNS,
peers might roll out one by one, leaving the DNS entries unpopulated for
the first peer of a set. Failing on the initial join then blocks the
roll-out.

Instead of failing on the initial join, this patch only logs the failure.
The cluster can be joined later via `handleReconnect`.

This is a regression introduced in PR #1456 [1].

[1] #1456

Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
mxinden committed Jul 11, 2018
1 parent f3bc41d commit 3735df3
Showing 2 changed files with 5 additions and 3 deletions.
3 changes: 3 additions & 0 deletions cluster/cluster.go
@@ -217,6 +217,9 @@ func (p *Peer) Join(
 	n, err := p.mlist.Join(p.resolvedPeers)
 	if err != nil {
 		level.Warn(p.logger).Log("msg", "failed to join cluster", "err", err)
+		if reconnectInterval != 0 {
+			level.Info(p.logger).Log("msg", fmt.Sprintf("will retry joining cluster every %v", reconnectInterval.String()))
+		}
 	} else {
 		level.Debug(p.logger).Log("msg", "joined cluster", "peers", n)
 	}
5 changes: 2 additions & 3 deletions cmd/alertmanager/main.go
@@ -196,7 +196,7 @@ func main() {
 			*probeInterval,
 		)
 		if err != nil {
-			level.Error(logger).Log("msg", "Unable to initialize gossip mesh", "err", err)
+			level.Error(logger).Log("msg", "unable to initialize gossip mesh", "err", err)
 			os.Exit(1)
 		}
 	}
@@ -262,8 +262,7 @@ func main() {
 			*peerReconnectTimeout,
 		)
 		if err != nil {
-			level.Error(logger).Log("msg", "Unable to join gossip mesh", "err", err)
-			os.Exit(1)
+			level.Warn(logger).Log("msg", "unable to join gossip mesh", "err", err)
 		}
 		ctx, cancel := context.WithTimeout(context.Background(), *settleTimeout)
 		defer func() {
