Ipfailover check and notify script support #3355
@knobunc PTAL
[[check-notify]]
=== Check and Notify scripts

The *keepalived* port monitoring feature uses a simple port check ( </dev/tcp/<ip>/<watch-port> ) to verify that the application is running. The admin can provide a script that does whatever additional verification is needed, for example, the script can test a web server by issuing a request and verifying the response. The script must exit with 0 for PASS and 1 for FAIL.
how about "uses a simple tcp connect to verify"
And should we say that if watch-port is 0 the test is skipped?
OK
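For readers following this thread, the port check under discussion can be sketched roughly as below. This is a minimal illustration, not the code in the ipfailover image: the function name `port_check` is hypothetical, the use of `timeout` is an added assumption, and the "watch port 0 skips the test" behavior is taken from the suggestion above rather than from verified keepalived behavior.

```shell
#!/bin/bash
# Sketch of a TCP connect check. port_check is a hypothetical name.
port_check() {
    local ip="$1" port="$2"

    # Assumption from the review thread: a watch port of 0 means
    # the port test is skipped, so the check passes.
    if [ "$port" -eq 0 ]; then
        return 0
    fi

    # Redirecting from /dev/tcp/<ip>/<port> makes bash attempt a TCP
    # connect. timeout (an addition here) bounds the wait when the
    # host silently drops packets.
    if timeout 2 bash -c "</dev/tcp/${ip}/${port}" 2>/dev/null; then
        return 0
    fi
    return 1
}
```

Calling `port_check 10.0.0.5 80` would return 0 only if something accepts connections on that port; the address and port are placeholders.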

The admin provides the additional script, via the `--check-script=<script>` option. By default the check is done every 2 seconds, this can be changed using the `--check-interval=<seconds>` option.

For each VIP, *keepalived* keeps the state of the node. The VIP on the node may be in *MASTER*, *BACKUP*, or *FAULT* state. All nodes that are not in the *FAULT* state negotiate to decide who will be *MASTER* for the VIP. All of the losers enter the *BACKUP* state. When the check script fails *keepalived* enters the *FAULT* state. When the check script passes again it exits *FAULT* and negotiates for *MASTER*. The resulting state is either *MASTER* or *BACKUP*.
When the check script fails *keepalived* enters the *FAULT* state. When the check script passes again it exits *FAULT* and negotiates for *MASTER*.
When the first one fails, do all then start testing? Or do they always test and move from BACKUP to FAULT?
Surely after the first fails and moves to FAULT, one of the others in BACKUP will try to take control?
Every keepalived (on all nodes) runs the check script on a set period. When MASTER fails, trigger a renegotiation for MASTER. When BACKUP fails, just enter FAULT. When FAULT fails, do nothing. When FAULT passes, trigger a renegotiation. When MASTER or BACKUP passes, do nothing.
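The rules in this reply can be written down as a small lookup, shown here only to make the cases explicit. The function name `check_action` and the action labels are hypothetical; keepalived implements this internally, and this sketch does not model the negotiation itself.

```shell
#!/bin/bash
# Map (current state, check script exit status) to the action
# described in the reply above. Names and labels are illustrative.
check_action() {
    local state="$1" result="$2"

    if [ "$result" -eq 0 ]; then
        # Check script passed.
        case "$state" in
            FAULT)         echo "renegotiate" ;;  # leave FAULT, negotiate for MASTER
            MASTER|BACKUP) echo "none" ;;
            *)             echo "unknown"; return 2 ;;
        esac
    else
        # Check script failed.
        case "$state" in
            MASTER) echo "renegotiate" ;;  # another node takes over as MASTER
            BACKUP) echo "enter-fault" ;;
            FAULT)  echo "none" ;;
            *)      echo "unknown"; return 2 ;;
        esac
    fi
}
```

For example, `check_action BACKUP 1` prints `enter-fault`, and `check_action FAULT 0` prints `renegotiate`.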
which is loaded every time *keepalived* starts. The scripts can be added to the pod with a ConfigMap as follows.

First, create the desired script and create a ConfigMap to hold it.
The script has no input arguments and must return 0 for OK and 1 for fail.
1 or non-zero?
keepalived looks for 1 for FAIL.
I did more research and I can't find where it says FAIL==1. There is a reference in a 3rd party discussion that says non-zero. 1 doesn't hurt in any event. I am looking into another matter and will keep an eye out for this.
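Since it is unsettled here whether keepalived treats any non-zero status as FAIL or specifically 1, one way to sidestep the question is for the check script to collapse every failure to exactly 1. The skeleton below is a sketch under stated assumptions: `run_check` and `probe_app` are hypothetical names, and the probe assumes `curl` is available in the container and that the application serves HTTP on `localhost:8080`.

```shell
#!/bin/bash
# Hypothetical check-script skeleton; not part of keepalived or oadm.

# Run a probe command and collapse its exit status to exactly 0 or 1,
# so the script satisfies both the "1" and the "non-zero" reading.
run_check() {
    if "$@" >/dev/null 2>&1; then
        return 0
    fi
    return 1
}

# Example probe. Assumptions: curl exists in the container and the
# application answers HTTP on localhost:8080. --fail makes curl exit
# non-zero on HTTP errors; --max-time keeps the probe shorter than
# the default 2 second check interval.
probe_app() {
    curl --silent --fail --max-time 1 http://localhost:8080/
}

# A real check script would end with:
#   run_check probe_app
#   exit $?
```

The script takes no arguments, matching the documented contract; any probe that exits with 7, 22, or another non-zero value is reported as 1.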
LGTM @openshift/team-documentation over to you
Thank you! I will apply minor style edits in a follow-up PR.
[rev_history]
Openshift 3.5 feature.
Add options to 'oadm ipfailover' to configure the check script and
notify scripts and also control the period the check script runs.
Keepalived periodically checks whether the application is running
properly. In the default case the test is a simple verification that
something is listening on the watch port. This PR permits the user to
supply an additional check script that is run in the ipfailover container
context to verify that the application is operating properly. For
example, a web server can be tested by accessing the watch port and
verifying the response.
Whenever a node changes state to MASTER, BACKUP, or FAULT a notify
script can be called. This script has 3 parameters filled in by
keepalived:
$1 - "GROUP"|"INSTANCE"
$2 - name of group or instance
$3 - target state of transition ("MASTER"|"BACKUP"|"FAULT")
--check-script="check_script"
The check script is a script in the keepalived container that verifies
the service is running properly. The script must return 0 for OK and 1
for FAIL.
These checks are in addition to verifying that the watch port is
listening.
--notify-script="notify_script"
The notify script is a script in the keepalived container that is
called whenever the keepalived state transitions to
(MASTER|BACKUP|FAULT)
--check-interval=<seconds>
The check script is run every <seconds> seconds. Default is 2.
Note: the scripts name is the full path to the script.
Fixes bug 1362163
https://trello.com/c/228zu7Br/267-5-improve-the-configurability-of-the-ipfailover-container-operations
Signed-off-by: Phil Cameron <pcameron@redhat.com>
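
As an illustration of the three notify parameters listed in the commit message, a notify script might look roughly like the sketch below. The function name `notify`, the `NOTIFY_LOG` variable, and the log path are assumptions made for the example; only the meaning of `$1`, `$2`, and `$3` comes from the description above.

```shell
#!/bin/bash
# Hypothetical notify script; the log location is an assumption.
NOTIFY_LOG="${NOTIFY_LOG:-/tmp/keepalived-notify.log}"

notify() {
    local kind="$1"    # "GROUP" or "INSTANCE"
    local name="$2"    # name of the group or instance
    local state="$3"   # target state: MASTER, BACKUP, or FAULT

    case "$state" in
        MASTER|BACKUP|FAULT)
            # Record the transition with a UTC timestamp.
            echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) ${kind} ${name} -> ${state}" >> "$NOTIFY_LOG"
            ;;
        *)
            echo "unexpected state: ${state}" >&2
            return 1
            ;;
    esac
}

# A real notify script would end with:
#   notify "$@"
```

The full path of such a script would be passed via `--notify-script`, as noted in the commit message.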