When API server /healthz fails, put info for diagnosis into its log. #56856

Closed
aknrdureegaesr opened this Issue Dec 5, 2017 · 8 comments

aknrdureegaesr commented Dec 5, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

What happened:

I had a problem in my cluster which caused the /healthz check of the API server to fail. This in turn caused the kubelet to keep restarting it.

I logged in to the master node (which runs both the API server and the kubelet) and had a look at the log that captures the API server's stdout and stderr.

What I found there was (after unquoting):

[+]ping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[-]poststarthook/bootstrap-controller failed: reason withheld
[+]poststarthook/extensions/third-party-resources ok
[-]poststarthook/ca-registration failed: reason withheld
[+]poststarthook/start-kube-apiserver-informers ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
healthz check failed

What you expected to happen:

I hoped to find diagnostic information in the log. (The "reason withheld" felt almost like a slap in the face.)

How to reproduce it (as minimally and precisely as possible):

Whether this is reproducible, I don't know, but here is what we did: we messed up our cluster by running an application pod with a bad health check of its own. It stayed "deployed" like this for a long time. On that (non-master) node, docker ps --all ended up showing a few thousand shut-down Docker containers waiting to be cleaned up. Eventually, the problem spread to the API server through a mechanism we do not understand.

But any way of breaking the cluster badly enough that the API server's /healthz check fails reliably should be sufficient.

Anything else we need to know?:

I checked the source code and found, in kubernetes/staging/src/k8s.io/apiserver/pkg/server/healthz/healthz.go:

if check.Check(r) != nil {
    // don't include the error since this endpoint is public.  If someone wants more detail
    // they should have explicit permission to the detailed checks.
    fmt.Fprintf(&verboseOut, "[-]%v failed: reason withheld\n", check.Name())
    failed = true
} else {

I speculate that "detailed checks" means another REST resource, so I checked the API reference, but I could not find anything pertinent there.
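
If the "detailed checks" in that comment are the per-check endpoints that healthz.go also appears to register (one path per check, e.g. /healthz/poststarthook/bootstrap-controller), then the underlying reason may already be retrievable with a direct GET against that path. A minimal sketch, assuming the 1.7-era insecure localhost port 8080 and assuming the per-check handler includes the error in its response; both are assumptions, not verified against this cluster:

// Hedged sketch: probe an individual healthz check endpoint on the local,
// insecure apiserver port. The port (8080) and the per-check path are
// assumptions; adjust to your cluster's actual listener and failing check.
package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
)

func main() {
    resp, err := http.Get("http://127.0.0.1:8080/healthz/poststarthook/bootstrap-controller")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    // On failure, the per-check handler is expected to answer with
    // something like "internal server error: <reason>".
    fmt.Printf("%s: %s\n", resp.Status, body)
}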

From a user-experience point of view: a failing /healthz check of the API server will (often) result in termination of that server. So the suggestion is to fire an HTTP request against a pod that is in the process of being shut down? To me, the log really does seem like the more appropriate place for such diagnostic information.
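
For illustration, here is a minimal sketch of what this issue asks for. It is not the actual Kubernetes code; HealthzChecker, namedCheck and handleRootHealthz are just names modeled on healthz.go. The idea: keep the terse "reason withheld" in the public HTTP response, but also write the underlying error to the server log so an operator can diagnose a failing API server after the fact.

// Hedged sketch (not the real healthz.go): withhold the reason from the
// public response, but log it server-side for diagnosis.
package main

import (
    "bytes"
    "errors"
    "fmt"
    "log"
    "net/http"
)

// HealthzChecker mirrors the shape of the checker interface in healthz.go.
type HealthzChecker interface {
    Name() string
    Check(req *http.Request) error
}

// namedCheck is a trivial checker used for the demo below.
type namedCheck struct {
    name  string
    check func(*http.Request) error
}

func (c namedCheck) Name() string                { return c.name }
func (c namedCheck) Check(r *http.Request) error { return c.check(r) }

// handleRootHealthz reports each check, withholds reasons from the HTTP
// response, but records the underlying errors in the server log.
func handleRootHealthz(checks ...HealthzChecker) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        var verboseOut bytes.Buffer
        failed := false
        for _, check := range checks {
            if err := check.Check(r); err != nil {
                // Still withhold the reason from the public endpoint...
                fmt.Fprintf(&verboseOut, "[-]%v failed: reason withheld\n", check.Name())
                // ...but record it in the server log for diagnosis.
                log.Printf("healthz check %q failed: %v", check.Name(), err)
                failed = true
            } else {
                fmt.Fprintf(&verboseOut, "[+]%v ok\n", check.Name())
            }
        }
        if failed {
            http.Error(w, verboseOut.String()+"healthz check failed", http.StatusInternalServerError)
            return
        }
        verboseOut.WriteTo(w)
        fmt.Fprint(w, "healthz check passed\n")
    }
}

func main() {
    http.HandleFunc("/healthz", handleRootHealthz(
        namedCheck{"ping", func(*http.Request) error { return nil }},
        namedCheck{"poststarthook/bootstrap-controller", func(*http.Request) error {
            return errors.New("not finished")
        }},
    ))
    log.Fatal(http.ListenAndServe(":8080", nil))
}

One open design question is volume: the kubelet probes /healthz frequently, so a real implementation would probably want to throttle or deduplicate such log lines.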

Environment:

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.10", GitCommit:"bebdeb749f1fa3da9e1312c4b08e439c404b3136", GitTreeState:"clean", BuildDate:"2017-11-03T16:31:49Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    Running a kops cluster on AWS.
  • OS (e.g. from /etc/os-release):
$ cat /etc/os-version
cat: /etc/os-version: No such file or directory

AMI k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-07-28 (ami-0072de6f)

  • Kernel (e.g. uname -a):
    4.4.78-k8s #1 SMP Fri Jul 28 01:28:39 UTC 2017 x86_64 GNU/Linux
  • Install tools:
    Kops
  • Others:
aknrdureegaesr commented Dec 5, 2017

Judging from other items I've seen, this is probably the right label:

@kubernetes/sig-api-machinery-feature-requests

Contributor

k8s-ci-robot commented Dec 5, 2017

@aknrdureegaesr: Reiterating the mentions to trigger a notification:
@kubernetes/sig-api-machinery-feature-requests

In response to this:

Judging from other items I've seen, this is probably the right label:

@kubernetes/sig-api-machinery-feature-requests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member

hzxuzhonghu commented Dec 6, 2017

Contributor

yliaog commented Dec 7, 2017

/cc @yliaog

fejta-bot commented Mar 7, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Member

hzxuzhonghu commented Mar 8, 2018

This is intended. Would you mind closing this issue?

Member

hzxuzhonghu commented Mar 8, 2018

/close

aknrdureegaesr commented Mar 8, 2018

/remove-lifecycle stale
