Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add liveness check for apiserver #57

Merged
merged 9 commits into from
Sep 11, 2018
Merged

Conversation

ojmhetar
Copy link
Contributor

@ojmhetar ojmhetar commented Sep 6, 2018

Addresses #47

utils/install.go Outdated
}

vrrp_script chk_apiserver {
script "/usr/bin/wget -q -O - https://127.0.0.1:6443/healthz"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bind port of API server may not be 6443. Please reference config.MasterConfiguration.API.BindPort

utils/install.go Outdated

vrrp_script chk_apiserver {
script "/usr/bin/wget -q -O - https://127.0.0.1:6443/healthz"
interval 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every second seems too frequent. How about 10 seconds, the default period for liveness probes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be a constant? Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen similar vrrp scripts with low interval values (usually 2 seconds)- do you think 10 seconds x8, will be too long on failure?

utils/install.go Outdated
vrrp_script chk_apiserver {
script "/usr/bin/wget -q -O - https://127.0.0.1:6443/healthz"
interval 1
fall 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps too small? The kube-apiserver liveness probe defines a failureThreshold of 8.

utils/install.go Outdated
}

vrrp_script chk_apiserver {
script "/usr/bin/wget -q -O - https://127.0.0.1:6443/healthz"
Copy link
Contributor

@dlipovetsky dlipovetsky Sep 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeout # script considered failed after 'timeout' seconds

What do you think about defining timeout? Not sure if keepalived kills the script... Can you check? If it doesn't, it's a good idea to add a timeout to wget, to ensure it doesn't hang.

utils/install.go Outdated
}

vrrp_script chk_apiserver {
script "/usr/bin/wget -q -O - https://127.0.0.1:6443/healthz"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the output get logged? Do we care about the output? If we only care about the exit code, I suggest redirecting stdout/err to /dev/null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to log failovers due to api-health problems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the check fails, it is logged with the exit status. The output of wget is not logged, updated to > /dev/null

utils/install.go Outdated
confFile := filepath.Join(constants.SystemdDir, "keepalived.conf")
writeTemplateIntoFile(kaConfFileTemplate, "vipConfFileTemplate", confFile, configTemplateVals)

scriptFile := filepath.Join(constants.SystemdDir, "chk_apiserver.sh")
writeTemplateIntoFile(kaVRRPScript, "kaVRRPScriptTemplate", scriptFile, configTemplateVals)
ChangePermissions(scriptFile, 775)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems too open, need input on how restrictive it should be. keepalived process will execute this script.

VRRPScriptInterval = 10
VRRPScriptRise = 2
VRRPScriptFall = 8
WgetTimeout = 10
Copy link
Contributor

@puneetguptanitj puneetguptanitj Sep 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be less than VRRPScriptInterval, since, by the time wget returns in case of timeout, another check for api_server would have been issued.

const (
VRRPScriptInterval = 10
VRRPScriptRise = 2
VRRPScriptFall = 8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why this is set to 8, seems a little high? if I understand it correctly it would mean that VIP will fail over after 80 seconds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After talking with @sarun87, we dropped it down to 6. I think we could drop it down further in the future as I see other vrrp scripts perform a check at much lower intervals.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utils/install.go Outdated
}
kaServiceData := KaServiceData{confFile, constants.KeepalivedImage}
kaServiceData := KaServiceData{confFile, scriptFile, constants.KeepalivedImage}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: abbreviation ka might be a bit too short, other places we have kept the full name like in KeepAlivedImg might be ok to use full name ?

utils/install.go Outdated

kaVRRPScript := `#!/bin/sh

/usr/bin/wget -T {{.WgetTimeout}} -qO - https://127.0.0.1:{{.InitConfig.MasterConfiguration.API.BindPort}}/healthz > /dev/null 2>&1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to understand, since we are using https here wouldn't we have to ignore certs verification or specify ca.crt as trusted for wget?

@ojmhetar ojmhetar force-pushed the private/ojas/keepalived-failover branch from de9fa87 to de9f2c0 Compare September 11, 2018 20:23
@ojmhetar ojmhetar merged commit 6c24275 into master Sep 11, 2018
@ojmhetar ojmhetar deleted the private/ojas/keepalived-failover branch September 11, 2018 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants