New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add ability to disable health checks on kube-apiserver for healthz using query-params #70676

Merged
merged 2 commits into from Nov 14, 2018

Conversation

@logicalhan
Copy link
Contributor

logicalhan commented Nov 6, 2018

What type of PR is this?

/kind feature

What this PR does / why we need it:

Currently, the healthz endpoint is inflexible in that any and all health checks are currently executed when the endpoint is hit. Since the kube-apiserver only currently supports a liveness check, this means all persistent health check failures will eventually result in a kubelet driven restart. However, this may not be desirable behavior if you have additional monitoring services layered on top of your control plane.

For instance, take the etcd failure from the perspective of the kube-apiserver. Currently, persistent failures during the etcd health check from the api-server healthz endpoint will cause the kube-apiserver to be restarted. This sometimes helps (in the case of a faulty etcd-client connection) but also can be detrimental (the kube-apiserver restarts until the etcd cluster is functioning, then the kube-apiserver bombards the etcd cluster with requests to update local caches, saturating the etcd cluster with requests, causing it to timeout on its own liveness probes and trigger a kubelet-driven restart).

Let's say that we could now disable etcd from our liveness health check. Then if we had an external monitoring service, we could ask independently ask etcd if it was healthy and we could also ask the kube-apiserver if it thought etcd was unhealthy. Based on the combination of those answers, we could enable more intelligent responses to certain situation, like restarting the kube-apiserver when it thought etcd was unhealthy but etcd was actually reporting as healthy but not doing so when etcd was actually unhealthy.

This feature is backwards-compatible and opt-in. There would be no impact to existing behavior unless these query params are actually used. This feature no-opts for strings which do not match the name of an existing health check.

Which issue(s) this PR fixes:
Fixes #70591

Does this PR introduce a user-facing change?:

The kube-apiserver's healthz now takes in an optional query parameter which allows you to disable health checks from causing healthz failures. 

/sig api-machinery

@logicalhan

This comment has been minimized.

Copy link
Contributor

logicalhan commented Nov 6, 2018

@k8s-ci-robot k8s-ci-robot requested review from cheftako , lavalamp and liggitt Nov 6, 2018

@justinsb

This comment has been minimized.

Copy link
Member

justinsb commented Nov 7, 2018

/ok-to-test

Generally LGTM but I'd like to see the param used in https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/manifests/kube-apiserver.manifest#L37 - and we can add a comment at the same time!

@deads2k

This comment has been minimized.

Copy link
Contributor

deads2k commented Nov 7, 2018

I don't object per se , but are you sure that it's worth enough to avoid just crashlooping it? At least in the etcd example, it's not like this process is doing much for you.

@logicalhan

This comment has been minimized.

Copy link
Contributor

logicalhan commented Nov 7, 2018

I don't object per se , but are you sure that it's worth enough to avoid just crashlooping it? At least in the etcd example, it's not like this process is doing much for you.

Crashlooping is also noisy. Yes, in the etcd example, not crashlooping is not going to fix the issue but if the etcd is unhealthy-ish due to a longish boot, then it will get to a correct state faster without crashlooping. There are trade-offs, for sure. Then again, this feature is opt-in and if you are in situation where it occurs to you to use this feature (i.e. happening a lot), then probably to you it seems worthwhile.

@roycaihw

This comment has been minimized.

Copy link
Member

roycaihw commented Nov 8, 2018

/assign @cheftako

@logicalhan logicalhan force-pushed the logicalhan:exclude-checks branch 4 times, most recently from 0923187 to dc6922b Nov 10, 2018

@logicalhan logicalhan force-pushed the logicalhan:exclude-checks branch from dc6922b to 33d4194 Nov 13, 2018

@lavalamp

This comment has been minimized.

Copy link
Member

lavalamp commented Nov 13, 2018

@lavalamp

This comment has been minimized.

Copy link
Member

lavalamp commented Nov 13, 2018

@logicalhan logicalhan force-pushed the logicalhan:exclude-checks branch from 5666d19 to 0f798a0 Nov 13, 2018

@logicalhan

This comment has been minimized.

Copy link
Contributor

logicalhan commented Nov 13, 2018

/test pull-kubernetes-e2e-gke

1 similar comment
@logicalhan

This comment has been minimized.

Copy link
Contributor

logicalhan commented Nov 13, 2018

/test pull-kubernetes-e2e-gke

@@ -157,6 +174,11 @@ func handleRootHealthz(checks ...HealthzChecker) http.HandlerFunc {
fmt.Fprintf(&verboseOut, "[+]%v ok\n", check.Name())
}
}
if excluded.Len() > 0 {
fmt.Fprintf(&verboseOut, "err: some health checks cannot be excluded: no matches for %v\n", formatQuoted(excluded.List()...))

This comment has been minimized.

@lavalamp

lavalamp Nov 13, 2018

Member

Sorry to be nitpicky, but I'd call this a warning and not an error, since it doesn't make the check fail and the word "error" makes admins nervous :)

This comment has been minimized.

@logicalhan

logicalhan Nov 13, 2018

Contributor

Sure, I can remove the prefix. I added it after the fact because it made it slightly easier to parse the string out from the verbose output blob.

@logicalhan logicalhan force-pushed the logicalhan:exclude-checks branch from 0f798a0 to 895dd41 Nov 13, 2018

@lavalamp

This comment has been minimized.

Copy link
Member

lavalamp commented Nov 13, 2018

/lgtm
/approve

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Nov 13, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, logicalhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@logicalhan

This comment has been minimized.

Copy link
Contributor

logicalhan commented Nov 14, 2018

/test pull-kubernetes-verify

@lavalamp lavalamp added this to the v1.13 milestone Nov 14, 2018

@k8s-ci-robot k8s-ci-robot merged commit ca338b9 into kubernetes:master Nov 14, 2018

18 checks passed

cla/linuxfoundation logicalhan authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gke Job succeeded.
Details
pull-kubernetes-e2e-kops-aws Job succeeded.
Details
pull-kubernetes-e2e-kubeadm-gce Skipped
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped
pull-kubernetes-local-e2e-containerized Skipped
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
tide In merge pool.
Details
@@ -34,11 +34,19 @@
"httpGet": {
"host": "127.0.0.1",
"port": 8080,
"path": "/healthz"
"path": "/healthz?exclude=etcd"

This comment has been minimized.

@mikedanese

mikedanese Nov 15, 2018

Member

There are a couple issues with this:

  • The API server main blocks (after listening, but before serving) on the etcd client opening a connection to the etcd server. Not all, but some etcd failures will cause the kube-apiserver to never start serving /healthz.
  • The etcd healthcheck is not the only healthcheck that depends on etcd. Spot checking [poststarthook/bootstrap-controller, poststarthook/rbac/bootstrap-roles, poststarthook/scheduling/bootstrap-system-priority-classes, poststarthook/ca-registration, autoregister-completion], all depend on etcd being up before they return an initial healthy status.

Is this change intended to help reduce restarts on apiserver startup or only in steady state?

This comment has been minimized.

@liggitt

liggitt Nov 15, 2018

Member

Is this change intended to help reduce restarts on apiserver startup or only in steady state?

The etcd check use case was to avoid useless restarts in steady state

This comment has been minimized.

@logicalhan

logicalhan Nov 15, 2018

Contributor

Yes, it was intended for steady state. Also, the observation that boot sequences are more problematic was a contributing factor for my thought process in #71054.

k8s-ci-robot added a commit that referenced this pull request Dec 5, 2018

Merge pull request #71285 from cheftako/automated-cherry-pick-of-#70753-
#70676-#70971-upstream-release-1.12

Automated cherry pick of #70753 #70676 #70971 upstream release 1.12

@logicalhan logicalhan referenced this pull request Dec 8, 2018

Closed

REQUEST: New membership for @logicalhan #292

6 of 6 tasks complete

k8s-ci-robot added a commit that referenced this pull request Dec 13, 2018

Merge pull request #71334 from cheftako/automated-cherry-pick-of-#713…
…25-upstream-release-1.10

Automated cherry pick of #70753, #70676 and #70971 upstream release 1.10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment