check_mk-mysql_galera_cluster

A Check_MK plugin for checking the status of a galera cluster (based on https://galeracluster.com/library/documentation/monitoring-cluster.html)

The check parases the mysql output of the standard check_mk_agent for the presence of specific wsrep variables and status lines.

It will provide three single checks with WATO integration:

MySQL Galera Cluster State
MySQL Galera Node State
MySQL Galera Replication Health

License: GPLv2

MySQL Galera Cluster State

The check shows information about the general Cluster state:

Size - The current size of the cluster and the expected size configured in WATO
Status - The current status of the cluster
ConfID - Show the number of Cluster Changes and should be the same for all Cluster Nodes.
Cluster UUID - shows the current UUID and should be the same for all Members of the Cluster

WATO integration

In the WATO you have the possibility to define:

Number of Members of the Cluster

expected - the number of nodes the cluster normaly has
warning threshold - a warning alert will be thrown when the number of members is below this number
critical threshold - - a cirtical alert will be thrown when the number of members is below this number

Expected Cluster Status

Defines the status of the node. This defaults to "Primary" and it normaly doesn't make sense to change this. Any value other than primary will indicate problems with the cluster.

MySQL Galera Node State

The check shows information about the node state:

Ready - shows if the node is ready for accept write-sets from the cluster
Connected - shows if the node has network connectivity with any other nodes.
State - shows if the node is a operational part of the cluster

WATO integration

In the WATO you have the possibility to define:

wsrep_local_state_comment

Shows the node state in a human readable format. When the node is part of the Primary Component, the typical return values are Joining, Waiting on SST, Joined, Synced or Donor. In the event that the node is part of a nonoperational component, the return value is Initialized. Normaly you want to keep 'Initialized' (the relevant value to alert on).

wsrep_ready

Shows whether the node can accept queries. When the node returns a value of ON it can accept write-sets from the cluster. When it returns the value OFF, almost all queries fail with the error: 'ERROR 1047 (08501) Unknown Command'Normally you want to keep ON, the default.

wsrep_connected

Shows whether the node has network connectivity with any other nodes. When the value is ON, the node has a network connection to one or more other nodes forming a cluster component. When the value is OFF, the node does not have a connection to any cluster components. Normally you want to keep ON, the default.

MySQL Galera Replication Health

The check shows information about the performance of the cluster node member and can give hints to identify performance problems.

The check shows information about the replication performance of a node:

Receive Queue

Recv Queue AVG - Shows the average size of the local received queue since the last status query.
Recv Queue CUR - Current (instantaneous) length of the recv queue.
Recv Queue MAX - The maximum length of the recv queue since the last FLUSH STATUS command.
Recv Queue MIN - The minimum length of the recv queue since the last FLUSH STATUS command.

Send Queue

Send Queue AVG - Show an average for the send queue length since the last FLUSH STATUS query.
Send Queue CUR - Current (instantaneous) length of the send queue.
Send Queue MAX - The maximum length of the send queue since the last FLUSH STATUS command.
Send Queue MIN - The minimum length of the send queue since the last FLUSH STATUS command.

Flow Control

Flow Control Recv - Returns the number of FC_PAUSE events the node has received, including those the node has sent.
Flow Control Sent - Returns the number of FC_PAUSE events the node has sent
Flow Control paused - Shows the fraction of the time, since FLUSH STATUS was last called, that the node paused due to Flow Control.

Cert deps distance

Shows the average distance between the lowest and highest sequence number, or seqno, values that the node can possibly apply in parallel. This represents the node’s potential degree for parallelization. In other words, the optimal value you can use with the wsrep_slave_threads parameter, given that there is no reason to assign more slave threads than transactions you can apply in parallel.

WATO integration

In the WATO you have the possibility to define:

Recv queue length average

shows the average size of the local received queue since the last status query.
When the node returns a value higher than 0.0 it means that the node cannot apply write-sets as fast as it receives them, which can lead to replication throttling.
you can define warning and critical thresholds to alert a node causing replication throttling

Send queue length average

show an average for the send queue length since the last FLUSH STATUS query.
Values much greater than 0.0 indicate replication throttling or network throughput issues, such as a bottleneck on the network link. The problem can occur at any layer from the physical components of your server to the configuration of the operating system.
you can define warning and critical thresholds to alert a node with network throughput problems

Flow control paused

shows the fraction of the time, since FLUSH STATUS was last called, that the node paused due to Flow Control.
When the node returns a value of 0.0, it indicates that the node did not pause due to Flow Control during this period. When the node returns a value of 1.0, it indicates that the node spent the entire period paused. If the time between FLUSH STATUS and SHOW STATUS was one minute and the node returned 0.25, it indicates that the node was paused for a total 15 seconds over that time period.
Ideally, the return value should stay as close to 0.0 as possible, since this means the node is not falling behind the cluster. In the event that you find that the node is pausing frequently, you can adjust the wsrep_slave_threads parameter or you can exclude the node from the cluster.

Contributing

Create a personal fork of the project on Github.
Clone the fork on your local machine. Your remote repo on Github is called origin.
Add the original repository as a remote called upstream.
If you created your fork a while ago be sure to pull upstream changes into your local repository.
Create a new branch to work on!
Implement/fix your feature, comment your code.
Add or change the documentation as needed.
Push your branch to your fork on Github, the remote origin.
From your fork create a pull request to the master branch.

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
documentation		documentation
local/share/check_mk		local/share/check_mk
mkp		mkp
var/check_mk/packages		var/check_mk/packages
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

documentation

documentation

local/share/check_mk

local/share/check_mk

mkp

mkp

var/check_mk/packages

var/check_mk/packages

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

Repository files navigation

check_mk-mysql_galera_cluster

MySQL Galera Cluster State

MySQL Galera Node State

MySQL Galera Replication Health

Contributing

About

Releases 2

Packages

Contributors 2

Languages

License

kriegerse/check_mk-mysql_galera_cluster

Folders and files

Latest commit

History

Repository files navigation

check_mk-mysql_galera_cluster

MySQL Galera Cluster State

MySQL Galera Node State

MySQL Galera Replication Health

Contributing

About

Topics

Resources

License

Stars

Watchers

Forks

Languages