Ipfailover check and notify script support #3355

Merged
merged 1 commit into from Dec 19, 2016

@pecameron pecameron commented Dec 9, 2016

OpenShift 3.5 feature.

Add options to 'oadm ipfailover' to configure the check script and
notify scripts and also control the period the check script runs.

Keepalived periodically checks whether the application is running
properly. In the default case the test is a simple verification that
something is listening on the watch port. This PR permits the user to
supply an additional check script that is run in the ipfailover container
context to verify that the application is operating properly. For
example, a web server can be tested by accessing the watch port and
verifying the response.

Whenever a node changes state to MASTER, BACKUP, or FAULT a notify
script can be called. This script has 3 parameters filled in by
keepalived:
$1 - "GROUP"|"INSTANCE"
$2 - name of group or instance
$3 - target state of transition ("MASTER"|"BACKUP"|"FAULT")

--check-script="check_script"
The check script is a script in the keepalived container that verifies
the service is running properly. The script must return 0 for OK and 1
for FAIL.
These checks are in addition to verifying that the watch port is
listening.
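
As a minimal sketch of such a check script, assuming the application is a web server reachable from the ipfailover container: the URL, the expected marker text, and the use of curl are illustrative assumptions, not part of this PR. The functions map an HTTP response onto the 0 (OK) / 1 (FAIL) exit convention described above.

```shell
#!/bin/bash
# Hypothetical check script: probes a web server on the watch port.
# CHECK_URL and CHECK_EXPECTED are illustrative assumptions, not PR settings.
URL="${CHECK_URL:-http://127.0.0.1:80/}"
EXPECTED="${CHECK_EXPECTED:-OK}"

# Return 0 (OK) when the status is 200 and the body contains the marker,
# 1 (FAIL) otherwise.
response_ok() {
    local status="$1" body="$2"
    [[ "$status" == "200" && "$body" == *"$EXPECTED"* ]] && return 0
    return 1
}

# Fetch the page and map the result onto the 0/1 exit convention.
run_check() {
    local out status body
    # -s: quiet; -m 1: give up after 1 second so the check fits the interval;
    # -w: append the HTTP status code on its own line after the body.
    out=$(curl -s -m 1 -w '\n%{http_code}' "$URL") || return 1
    status="${out##*$'\n'}"   # text after the last newline
    body="${out%$'\n'*}"      # everything before the last newline
    response_ok "$status" "$body"
}

# When installed as the real check script, the last line would be:
#   run_check; exit $?
```

The script takes no arguments, so the target is read from the environment or hard-coded.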

--notify-script="notify_script"
The notify script is a script in the keepalived container that is
called whenever the keepalived state transitions to
(MASTER|BACKUP|FAULT)
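
A notify script could look roughly like the sketch below. It only records each transition using the three parameters listed earlier; the log file location (NOTIFY_LOG) and the message wording are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical notify script. keepalived passes three arguments:
#   $1 = "GROUP" or "INSTANCE", $2 = group/instance name, $3 = target state.
# NOTIFY_LOG is an illustrative assumption, not something defined by this PR.
NOTIFY_LOG="${NOTIFY_LOG:-/tmp/ipfailover-notify.log}"

# Build a one-line description of the transition; reject unknown states.
describe_transition() {
    local kind="$1" name="$2" state="$3"
    case "$state" in
        MASTER|BACKUP|FAULT) echo "$kind $name entered $state" ;;
        *) echo "$kind $name reported unexpected state: $state"; return 1 ;;
    esac
}

# Act only when called the way keepalived calls it, with three arguments.
if [[ $# -eq 3 ]]; then
    echo "$(date '+%F %T') $(describe_transition "$@")" >> "$NOTIFY_LOG"
fi
```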

--check-interval=<seconds>
The check script is run every <seconds> seconds. Default is 2.

Note: each script name must be the full path to the script.
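
To show how the three options fit together, here is a dry-run sketch that assembles an 'oadm ipfailover' command line and prints it rather than running it, since running it needs a cluster. The configuration name (ipf-ha-router), the script paths, and the chosen interval are hypothetical placeholders; a real deployment would also need the usual options such as the virtual IPs, which are omitted here.

```shell
#!/bin/bash
# Dry run: assemble the command line and print it instead of executing it.
# "ipf-ha-router" and both script paths are hypothetical placeholders.
args=(
    ipfailover ipf-ha-router
    --watch-port=80
    --check-script="/etc/keepalive/mycheckscript.sh"
    --notify-script="/etc/keepalive/mynotifyscript.sh"
    --check-interval=5
)
cmd="oadm ${args[*]}"
echo "$cmd"
```

Both script options use full paths, as the note above requires.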

Fixes bug 1362163
https://trello.com/c/228zu7Br/267-5-improve-the-configurability-of-the-ipfailover-container-operations

Signed-off-by: Phil Cameron <pcameron@redhat.com>

@pecameron
Author

@knobunc PTAL

[[check-notify]]
=== Check and Notify scripts

The *keepalived* port monitoring feature uses a simple port check ( `</dev/tcp/<ip>/<watch-port>` ) to verify that the application is running. The admin can provide a script that does whatever additional verification is needed. For example, the script can test a web server by issuing a request and verifying the response. The script must exit with 0 for PASS and 1 for FAIL.
Contributor

how about "uses a simple tcp connect to verify"

And should we say that if watch-port is 0 the test is skipped?

Author

OK


The admin provides the additional script via the `--check-script=<script>` option. By default the check is done every 2 seconds; this can be changed using the `--check-interval=<seconds>` option.

For each VIP, *keepalived* keeps the state of the node. The VIP on the node may be in *MASTER*, *BACKUP*, or *FAULT* state. All nodes that are not in the *FAULT* state negotiate to decide who will be *MASTER* for the VIP. All of the losers enter the *BACKUP* state. When the check script fails *keepalived* enters the *FAULT* state. When the check script passes again it exits *FAULT* and negotiates for *MASTER*. The resulting state is either *MASTER* or *BACKUP*.
Contributor

When the check script fails *keepalived* enters the *FAULT* state. When the check script passes again it exits *FAULT* and negotiates for *MASTER*.

When the first one fails, do all of the others then start testing? Or do they always test and move from BACKUP to FAULT?

Surely after the first fails and moves to FAULT, one of the others in BACKUP will try to take control?

Author

Every keepalived instance (on all nodes) runs the check script on a set period. When the check fails on MASTER, it triggers a renegotiation for MASTER. When it fails on BACKUP, the node just enters FAULT. When it fails on FAULT, nothing happens. When it passes on FAULT, it triggers a renegotiation. When it passes on MASTER or BACKUP, nothing happens.
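
The rules in this comment can be written out as a small lookup, sketched below. The function name and the action labels are made up for illustration; this is only a restatement of the comment, not keepalived's actual implementation.

```shell
#!/bin/bash
# next_action STATE RESULT
# Prints what happens after a check, following the rules in the comment above.
# STATE is MASTER, BACKUP, or FAULT; RESULT is "pass" or "fail".
next_action() {
    local state="$1" result="$2"
    case "$state:$result" in
        MASTER:fail)             echo "renegotiate" ;;
        BACKUP:fail)             echo "enter FAULT" ;;
        FAULT:fail)              echo "nothing" ;;
        FAULT:pass)              echo "renegotiate" ;;
        MASTER:pass|BACKUP:pass) echo "nothing" ;;
        *)                       echo "unknown"; return 1 ;;
    esac
}
```

For example, `next_action BACKUP fail` prints `enter FAULT`.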

which is loaded every time *keepalived* starts. The scripts can be added to the pod with a ConfigMap as follows.

First, create the desired script and create a ConfigMap to hold it.
The script has no input arguments and must return 0 for OK and 1 for fail.
Contributor

1 or non-zero?

Author

keepalived looks for 1 for FAIL.

Author

I did more research and I can't find where it says FAIL==1. There is a reference in a 3rd party discussion that says non-zero. 1 doesn't hurt in any event. I am looking into another matter and will keep an eye out for this.

Contributor

@knobunc knobunc left a comment

LGTM @openshift/team-documentation over to you

@ahardin-rh
Contributor

Thank you! I will apply minor style edits in a follow-up PR.

@ahardin-rh ahardin-rh merged commit 6f2c374 into openshift:master Dec 19, 2016
@ahardin-rh
Contributor

[rev_history]
|xref:../admin_guide/high_availability.adoc#admin-guide-high-availability[High Availability]
|Added new options to 'oadm ipfailover' to configure the check and notify scripts and to control the period of time the check script runs.

@vikram-redhat vikram-redhat modified the milestones: Future Release, Staging, OCP 3.5 GA Apr 12, 2017