This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

[stable/rabbitmq-ha] Rejoin cluster on restarted pod #4474

Closed
nawaitesidah opened this issue Mar 27, 2018 · 11 comments

Comments

@nawaitesidah

nawaitesidah commented Mar 27, 2018

Is this a request for help?:
No


Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

Version of Helm and Kubernetes:
Helm
Client: &version.Version{SemVer:"v2.7.2", GitCommit:"8478fb4fc723885b155c924d1c8c410b7a9444e6", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.7.2", GitCommit:"8478fb4fc723885b155c924d1c8c410b7a9444e6", GitTreeState:"clean"}

Kubernetes:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:21:50Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"", Minor:"", GitVersion:"v1.9.0", GitCommit:"925c127ec6b946659ad0fd596fa959be43f0cc05", GitTreeState:"clean", BuildDate:"2018-01-26T19:04:38Z", GoVersion:"go1.9.1", Compiler:"gc", Platform:"linux/amd64"}

Which chart:
stable/rabbitmq-ha

What happened:
On pod restart, the restarted pod is unable to rejoin the cluster.
The error message is:

    init terminating in do_boot ({error,{inconsistent_cluster,Node 'rabbit@172.17.0.26' thinks it's clustered with node 'rabbit@172.17.0.19', but 'rabbit@172.17.0.19' disagrees}})

What you expected to happen:
The restarted pod should be able to rejoin the cluster.

The clustering guide (https://www.rabbitmq.com/clustering.html) says that the node should run rabbitmqctl reset before rejoining.

How to reproduce it (as minimally and precisely as possible):

$ helm install stable/rabbitmq-ha --name rmq --set persistentVolume.enabled=true --set persistentVolume.storageClass=standard
$ kubectl exec rmq-rabbitmq-ha-0 -- kill 1

After that, the restarted pod cannot rejoin the cluster.

Anything else we need to know:
Image used: rabbitmq:3.7-alpine
Chart version: 1.0.2
minikube version: v0.25.0

@nawaitesidah
Author

See also #4519.

@michaelklishin
Contributor

michaelklishin commented Apr 6, 2018

rabbitmqctl reset will wipe out a node's state (clear its database) and is only necessary when adding a brand new node.

A rejoining node will contact its last known peer upon boot. In most cases you do not want to reset a restarted cluster member. I'd recommend getting a sense of RabbitMQ clustering basics.

RabbitMQ logs from all nodes are critically important when investigating such issues.

@martianoff

@michaelklishin I agree that reset and file removal look like the wrong approach, but it is not clear why the node cannot reconnect.

@etiennetremel
Contributor

Should be fixed via #4610

@martianoff

Unfortunately, this is not fixed.

@michaelklishin
Contributor

This rabbitmq-users response provides a very plausible hypothesis: automatic forced cleanup of unknown nodes was enabled in the Kubernetes example. Unintended removal of nodes temporarily leaving the cluster is one of the consequences of that decision, and apparently a pod restart can trigger it.
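To make the failure mode concrete, here is a simplified model of the cleanup behavior described above (illustrative only; the function and variable names are hypothetical and this is not RabbitMQ's actual implementation): the peer-discovery backend periodically lists nodes, and any cluster member absent from that list is either forcibly removed or merely logged, depending on configuration. A pod that is restarting is briefly absent from discovery, so with forced cleanup enabled it gets removed from the cluster and can no longer simply rejoin.

```python
# Simplified model of peer-discovery node cleanup (illustrative only;
# cleanup() is a hypothetical name, not a RabbitMQ internal).

def cleanup(cluster_members, discovered_nodes, only_log_warning):
    """Return (remaining_members, warnings) after one cleanup pass."""
    unknown = [n for n in cluster_members if n not in discovered_nodes]
    if only_log_warning:
        # Safe mode: keep all members, just warn about the missing ones.
        return list(cluster_members), [f"node {n} not discovered" for n in unknown]
    # Forced cleanup: unknown nodes are removed from the cluster.
    remaining = [n for n in cluster_members if n in discovered_nodes]
    return remaining, []

# A pod restart: rabbit@pod-0 is temporarily missing from discovery.
members = ["rabbit@pod-0", "rabbit@pod-1", "rabbit@pod-2"]
discovered = {"rabbit@pod-1", "rabbit@pod-2"}

remaining, _ = cleanup(members, discovered, only_log_warning=False)
print(remaining)  # rabbit@pod-0 has been forcibly removed
```

With forced cleanup, the briefly absent node is dropped from cluster membership; when it finishes restarting, its last known peers no longer recognize it, which matches the inconsistent_cluster error above.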

@michaelklishin
Contributor

@maksimru I'd inspect (or at least post) full server logs instead of waiting for a magical fix to land from a stranger on the Internet.

@michaelklishin
Contributor

michaelklishin commented Apr 18, 2018

@hadigoh please close this issue. #4823 demonstrates how to work around it (and includes a default change to the config map). The effect is documented as well.

I don't know when the review is going to happen, but no further changes to the chart are necessary to prevent nodes that temporarily leave the cluster from being cleaned up.
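As a sketch of the workaround (the cluster_formation keys are from RabbitMQ 3.7's cluster formation configuration; the specific values here are illustrative, not necessarily what the chart ships), the rabbitmq.conf in the config map can tell the peer-discovery cleanup to only log a warning instead of forcibly removing nodes it cannot discover:

```
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
cluster_formation.node_cleanup.interval = 10
cluster_formation.node_cleanup.only_log_warning = true
```

With only_log_warning = true, a pod that briefly disappears during a restart is not removed from the cluster, so it can rejoin by contacting its last known peers.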

@nawaitesidah
Author

Thanks a lot @michaelklishin

@michaelklishin
Contributor

#4823 is in.

@anoopswsib

> @hadigoh please close this issue. #4823 demonstrates how to work around it (and includes a default change to the config map). The effect is documented as well.
>
> I don't know when the review is going to happen, but no further changes to the chart are necessary to prevent nodes that temporarily leave the cluster from being cleaned up.

Hi Michael, I have installed the rabbitmq-ha Helm chart on an Azure Kubernetes cluster. After a few days, one pod restarts automatically. Nothing unusual appears in the logs except one line mentioning a Mnesia table exit. Any ideas what might be wrong?
