Ipfailover check and notify script support #3355

pecameron · 2016-12-09T18:16:45Z

OpenShift 3.5 feature.

Add options to 'oadm ipfailover' to configure the check and notify
scripts and to control how often the check script runs.

Keepalived periodically checks whether the application is running
properly. In the default case the test is a simple verification that
something is listening on the watch port. This PR permits the user to
supply an additional check script that is run in the ipfailover container
context to verify that the application is operating properly. For
example, a web server can be tested by accessing the watch port and
verifying the response.

Whenever a node changes state to MASTER, BACKUP, or FAULT a notify
script can be called. This script has 3 parameters filled in by
keepalived:
$1 - "GROUP"|"INSTANCE"
$2 - name of group or instance
$3 - target state of transition ("MASTER"|"BACKUP"|"FAULT")
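
As an illustration of the three parameters above, a minimal notify script might look like the sketch below. The function name `log_transition` and the message format are hypothetical and not part of this PR; only the meaning of `$1`, `$2`, and `$3` comes from the description.

```shell
#!/bin/bash
# Hypothetical notify script (a sketch, not the script shipped with this PR).
# keepalived fills in the three positional arguments.
log_transition() {
    local kind="$1"    # "GROUP" or "INSTANCE"
    local name="$2"    # name of the group or instance
    local state="$3"   # target state: "MASTER", "BACKUP", or "FAULT"
    case "$state" in
        MASTER|BACKUP|FAULT)
            echo "$kind $name entered $state"
            ;;
        *)
            echo "$kind $name: unexpected state '$state'" >&2
            return 1
            ;;
    esac
}

# Act only when all three arguments are present, as keepalived supplies them.
if [ "$#" -eq 3 ]; then
    log_transition "$@"
fi
```

A real script would likely replace the `echo` with whatever action the admin needs on each transition.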

--check-script="check_script"
The check script is a script in the keepalived container that verifies
the service is running properly. The script must return 0 for OK and 1
for FAIL.
These checks are in addition to verifying that the watch port is
listening.
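
A check script for the web server example could be sketched as follows. This is an assumption-laden illustration: the `CHECK_URL` variable, the default URL, the function names, the choice to treat only 2xx responses as OK, and the presence of `curl` in the container are all hypothetical and not stated in this PR.

```shell
#!/bin/bash
# Hypothetical check script sketch: 0 means OK, 1 means FAIL.
# CHECK_URL and its default are assumptions made for illustration.
URL="${CHECK_URL:-http://127.0.0.1:80/}"

# Map an HTTP status code onto the 0/1 convention described above.
# Treating only 2xx as OK is an assumption of this sketch.
status_to_exit() {
    case "$1" in
        2??) return 0 ;;
        *)   return 1 ;;
    esac
}

# Request the URL and judge the response code.
# Assumes curl is available in the container.
check_service() {
    local code
    code=$(curl --silent --output /dev/null --max-time 1 \
                --write-out '%{http_code}' "$URL") || code=000
    status_to_exit "$code"
}
```

A complete script would end by calling `check_service`, so that the script's exit status is the result of the check.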

--notify-script="notify_script"
The notify script is a script in the keepalived container that is
called whenever the keepalived state transitions to MASTER, BACKUP,
or FAULT.

--check-interval=<seconds>
The check script is run every <seconds> seconds. Default is 2.

Note: each script name is the full path to the script.

Fixes bug 1362163
https://trello.com/c/228zu7Br/267-5-improve-the-configurability-of-the-ipfailover-container-operations

Signed-off-by: Phil Cameron <pcameron@redhat.com>

pecameron · 2016-12-14T15:26:09Z

@knobunc PTAL

knobunc · 2016-12-16T19:08:15Z

admin_guide/high_availability.adoc

+[[check-notify]]
+=== Check and Notify scripts
+
+The *keepalived* port monitoring feature uses a simple port check ( </dev/tcp/<ip>/<watch-port> ) to verify that the application is running. The admin can provide a script that does whatever additional verification is needed, for example, the script can test a web server by issuing a request and verifying the response. The script must exit with 0 for PASS and 1 for FAIL.


How about "uses a simple TCP connect to verify"?

And should we say that if watch-port is 0 the test is skipped?

knobunc · 2016-12-16T19:11:10Z

admin_guide/high_availability.adoc

+
+The admin provides the additional script, via the `--check-script=<script>` option.  By default the check is done every 2 seconds, this can be changed using the `--check-interval=<seconds>` option.
+
+For each VIP, *keepalived* keeps the state of the node. The VIP on the node may be in *MASTER*, *BACKUP*, or *FAULT* state.  All nodes that are not in the *FAULT* state negotiate to decide who will be *MASTER* for the VIP.  All of the losers enter the *BACKUP* state.  When the check script fails *keepalived* enters the *FAULT* state. When the check script passes again it exits *FAULT* and negotiates for *MASTER*. The resulting state is either *MASTER* or *BACKUP*.


When the check script fails *keepalived* enters the *FAULT* state. When the check script passes again it exits *FAULT* and negotiates for *MASTER*.

When the first one fails, do all of them then start testing? Or do they always test and move from BACKUP to FAULT?

Surely after the first fails and moves to FAULT, one of the others in BACKUP will try to take control?

Every keepalived (on all nodes) runs the check script at a set interval. When the check fails on the MASTER, it triggers a renegotiation for MASTER. When it fails on a BACKUP, that node just enters FAULT. When it fails on a node already in FAULT, nothing happens. When it passes on a node in FAULT, it triggers a renegotiation. When it passes on the MASTER or a BACKUP, nothing happens.
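
The rules in this reply can be written down as a small lookup. The sketch below only restates them; the function name `next_action` and the output strings are hypothetical, and this is not how keepalived itself implements the logic.

```shell
#!/bin/bash
# Sketch of the rules described in this comment.
# Usage: next_action <MASTER|BACKUP|FAULT> <pass|fail>
next_action() {
    local state="$1" result="$2"
    case "$state:$result" in
        MASTER:fail) echo "renegotiate MASTER" ;;
        BACKUP:fail) echo "enter FAULT" ;;
        FAULT:pass)  echo "renegotiate MASTER" ;;
        MASTER:pass|BACKUP:pass|FAULT:fail) echo "no change" ;;
        *) echo "unknown input: $state $result" >&2; return 1 ;;
    esac
}
```

For example, `next_action BACKUP fail` prints `enter FAULT`, and `next_action FAULT pass` prints `renegotiate MASTER`.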

knobunc · 2016-12-16T19:11:51Z

admin_guide/high_availability.adoc

+which is loaded every time *keepalived* starts. The scripts can be added to the pod with a ConfigMap as follows.
+
+First, create the desired script and create a ConfigMap to hold it.
+The script has no input arguments and must return 0 for OK and 1 for fail.


1 or non-zero?

keepalived looks for 1 for FAIL.

I did more research and I can't find where it says FAIL==1. There is a reference in a 3rd party discussion that says non-zero. 1 doesn't hurt in any event. I am looking into another matter and will keep an eye out for this.

pecameron · 2016-12-16T20:36:22Z

@knobunc PTAL

knobunc

LGTM @openshift/team-documentation over to you

ahardin-rh · 2016-12-19T19:12:50Z

Thank you! I will apply minor style edits in a follow-up PR.

ahardin-rh · 2016-12-19T21:57:17Z

[rev_history]
|xref:../admin_guide/high_availability.adoc#admin-guide-high-availability[High Availability]
|Added new options to 'oadm ipfailover' to configure the check and notify scripts and to control the period of time the check script runs.
%

pecameron force-pushed the bz1362163 branch from e636333 to e99a2d0 Compare December 14, 2016 15:25

ahardin-rh added this to the Future Release milestone Dec 15, 2016

ahardin-rh added the branch/enterprise-3.5 label Dec 15, 2016

knobunc requested changes Dec 16, 2016

pecameron force-pushed the bz1362163 branch from e99a2d0 to 9ee13b5 Compare December 16, 2016 20:24

knobunc approved these changes Dec 16, 2016

vikram-redhat assigned ahardin-rh Dec 17, 2016

ahardin-rh added the to_followup label Dec 19, 2016

ahardin-rh merged commit 6f2c374 into openshift:master Dec 19, 2016

ahardin-rh mentioned this pull request Dec 19, 2016

Follow-up edits for PR#3355 #3417

Merged

ahardin-rh removed the to_followup label Dec 19, 2016

vikram-redhat modified the milestones: Future Release, Staging, OCP 3.5 GA Apr 12, 2017
