Add a post-start script to check node and cluster health #89

Closed

Conversation

MatthiasWinzeler

fix for #87

@cf-gitbot
Member

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

@gerhard
Contributor

gerhard commented Jun 20, 2018

As the PR currently stands, post-start would exit with a non-zero status in the following scenarios:

  • the node that starts hasn't finished booting
  • the node boots, but it can't connect to its peers
  • mnesia is waiting to synchronise with other nodes, but those nodes are either down or simply not ready to synchronise
  • pause_minority is used and running nodes are in minority, meaning that rabbit will not start until a majority of nodes are running

I want to review the initial scenario that led to this issue before I make specific proposals on how to improve this.

cc @michaelklishin for visibility, no action required
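
For illustration, a minimal sketch of the behaviour described above; the wrapper below is hypothetical rather than the PR's actual post-start script, and the cluster-check path is an assumption:

#!/usr/bin/env bash
# Hypothetical post-start wrapper. BOSH runs post-start after monit reports the
# job as running, and a non-zero exit fails the update of this instance.
set -e

# A cluster-wide check exits non-zero in all four scenarios listed above
# (node still booting, peers unreachable, mnesia waiting to sync,
# pause_minority with only a minority of nodes running).
exec /var/vcap/jobs/rabbitmq-server/bin/cluster-check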

@MatthiasWinzeler
Author

Hi @gerhard
Thanks for getting back to me, much appreciated!

As I understand you, there are some valid cases where a node can come up while the cluster is not yet fully healthy (which means we cannot run the cluster check in post-start), and BOSH still needs to move on to the next node.

What do you think about running only the node check (and not the cluster check) in post-start? Could this already prevent some error cases from propagating, while still allowing your (valid) use cases to pass? Or could there be some other health check we could do in post-start that is more than just process checking?
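
To make the proposal concrete, here is a minimal sketch of a post-start that runs only the node-level check; the job path and the node-check script name are assumptions, not the actual implementation:

#!/usr/bin/env bash
# Hypothetical node-only post-start: succeed as soon as the local node answers,
# even if its peers are still down or syncing.
set -e

JOB_DIR=/var/vcap/jobs/rabbitmq-server   # assumed path, matching the log below

# Local node check only; the cluster-wide check stays out of post-start so
# BOSH can move on to the next instance.
"${JOB_DIR}/bin/node-check"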

I also agree that we should figure out why the nodes failed in the first place. Do you have a starting point for investigating this?
The log looks like a successful start to me:

=INFO REPORT==== 1-Jun-2018::22:32:52 ===
Starting RabbitMQ 3.6.15 on Erlang 19.3.6.4
Copyright (C) 2007-2018 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

=INFO REPORT==== 1-Jun-2018::22:32:52 ===
node           : rabbit@05938b8fd4a523bd529c136ca66c20c2
home dir       : /var/vcap/store/rabbitmq
config file(s) : /var/vcap/jobs/rabbitmq-server/bin/../etc/rabbitmq.config
cookie hash    : <redacted>
log            : /var/vcap/sys/log/rabbitmq-server/rabbit@05938b8fd4a523bd529c136ca66c20c2.log
sasl log       : /var/vcap/sys/log/rabbitmq-server/rabbit@05938b8fd4a523bd529c136ca66c20c2-sasl.log
database dir   : /var/vcap/store/rabbitmq/mnesia/db

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Memory high watermark set to 3193 MiB (3348537344 bytes) of 7983 MiB (8371343360 bytes) total

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Enabling free disk space monitoring

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Disk free limit set to 12557MB

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Limiting to approx 299900 file handles (269908 sockets)

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
FHC read buffering:  OFF
FHC write buffering: ON

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Priority queues enabled, real BQ is rabbit_variable_queue

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Starting rabbit_node_monitor

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
Management plugin: using rates mode 'basic'

=ERROR REPORT==== 1-Jun-2018::22:32:54 ===
Discarding message {'$gen_call',{<0.1142.0>,#Ref<0.0.1.2663>},{basic_consume,false,<7190.3369.3092>,<7190.10353.2914>,false,1,<<"amq.<redacted>">>,false,[],undefined}} from <0.1142.0> to <0.16824.3255> in an old incarnation (1) of this node (2)


=INFO REPORT==== 1-Jun-2018::22:32:54 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 1-Jun-2018::22:32:54 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=ERROR REPORT==== 1-Jun-2018::22:32:54 ===
Discarding message {'$gen_call',{<0.1147.0>,#Ref<0.0.2.651>},{basic_consume,false,<7367.4494.76>,<7367.4485.76>,false,1,<<"amq.<discarded>">>,false,[],undefined}} from <0.1147.0> to <0.31357.75> in an old incarnation (1) of this node (2)


=WARNING REPORT==== 1-Jun-2018::22:32:54 ===
msg_store_persistent: recovery terms differ from present
rebuilding indices from scratch

... some more discarding and resyncing queue messages

=INFO REPORT==== 1-Jun-2018::22:33:16 ===
Mirrored queue 'amq.gen-<>discarded' in vhost '96621d9c-4ec4-4540-82c2-3ca8920b6409': Adding mirror on node rabbit@05938b8fd4a523bd529c136ca66c20c2: <0.21299.0>

=INFO REPORT==== 1-Jun-2018::22:33:20 ===
started TCP Listener on 0.0.0.0:5672

=INFO REPORT==== 1-Jun-2018::22:33:20 ===
rabbit on node rabbit@07632a494f7a8a8fb6b7f6af08f72f2c up

@gerhard
Contributor

gerhard commented Jun 22, 2018

I will have more time for this on Monday. Some relevant context for that conversation:

@michaelklishin
Member

rabbitmqctl ping requires Erlang distribution to be running.

@gerhard
Contributor

gerhard commented Jun 22, 2018

Thank you Michael.

If I recall correctly, the Erlang distribution has to be running for Mnesia to be able to synchronise, so it's a safe post-start check, even if the rabbit app is not running.

@michaelklishin
Member

I agree that it is safe, but rabbitmqctl eval would test pretty much the same set of things. There's little harm in keeping both.
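
As a rough sketch of how both checks could sit side by side in post-start (the rabbitmqctl path is an assumption):

#!/usr/bin/env bash
# Sketch only: two lightweight node-level checks.
# rabbitmqctl ping needs the Erlang distribution to be up, and rabbitmqctl eval
# exercises much the same machinery, so keeping both is cheap.
set -e

RABBITMQCTL=/var/vcap/packages/rabbitmq-server/bin/rabbitmqctl   # assumed path

"${RABBITMQCTL}" ping
"${RABBITMQCTL}" eval 'ok.'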

@nodo

nodo commented Jul 12, 2018

Hey @MatthiasWinzeler,

Thanks for the PR, apologies for the late reply.

At the moment, node-check simply checks that the Erlang VM is running (see here). This is a safe check to do, but it does not guarantee that the rabbitmq app is up and healthy.

For that guarantee, we need to wait until mnesia has successfully synchronised and then execute the cluster-check script (see here). That script, in turn, runs rabbitmqctl node_health_check. There is a bit of a naming clash here, but rabbitmqctl node_health_check verifies that the rabbit application is running, which is not possible until all the nodes are up.
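
A short sketch of the distinction, using plain rabbitmqctl commands; the release's actual node-check and cluster-check scripts may do more than this:

# node-check is essentially "does the local Erlang VM answer?", for example:
rabbitmqctl eval 'ok.'

# cluster-check runs node_health_check, which verifies that the rabbit
# application itself is up; it cannot succeed until mnesia has synchronised,
# i.e. until the other nodes are reachable.
rabbitmqctl node_health_check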

Coming back to your PR: as far as I understand, adding a cluster-check to post-start will fail because it needs all the nodes to be up and mnesia to be synced in order to succeed. This would prevent BOSH from marking the new node as healthy, and therefore I think it is not an option.

The interesting part is to understand why, in your case, the node is failing; from the logs you posted, nothing seems to be wrong. Do you think you could provide other logs about this issue?

cc @gerhard

@MatthiasWinzeler
Author

@nodo Thanks, sounds reasonable to me.

Which additional logs would be of interest for investigation?

@nodo

nodo commented Jul 12, 2018

@MatthiasWinzeler - it would be interesting to check the post-deploy script logs, and the other nodes' logs as well. Could you provide them?

Also, are you OK with closing the PR for now and continuing the discussion on the issue you opened here?

@gerhard
Contributor

gerhard commented Jul 12, 2018

rabbit@*.log, including rabbit@*-sasl.log, are most helpful.

@MatthiasWinzeler
Author

@nodo Let's close this for now and continue in #87
