Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add readyz endpoint to apiserver, modify poststarthooks health checking behavior #78458

Merged
merged 1 commit into from Jun 17, 2019

Conversation

@logicalhan
Copy link
Contributor

commented May 29, 2019

This PR adds a readyz endpoint to apiserver so that we can export readiness information. It also modifies poststarthooks to do two things:

  1. liveness health checking returns true until an explicit max startup sequence period has elapsed (the time one expects apiserver boot-sequencing to have completed) .
  2. readiness health checking should immediately return whether that poststarthook has completed (this is the current behavior of our post start hook health checks).

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Adds a proper readiness endpoint to the kube-apiserver.

Which issue(s) this PR fixes:

Fixes #55453

The kube-apiserver has improved behavior for both startup and shutdown sequences and also now exposes `\readyz` for readiness checking. Readyz includes all existing healthz checks but also adds a shutdown check. When a cluster admin initiates a shutdown, the kube-apiserver will try to process existing requests (for the duration of request timeout) before killing the apiserver process.  

The apiserver also now takes an optional flag "--maximum-startup-sequence-duration". This allows you to explicitly define an upper bound on the apiserver startup sequences before healthz begins to fail. By keeping the kubelet liveness initial delay short, this can enable quick kubelet recovery as soon as we have a boot sequence which has not completed in our expected time frame, despite lack of completion from longer boot sequences (like RBAC). Kube-apiserver behavior when the value of this flag is zero is backwards compatible (this is as the defaulted value of the flag).

/sig api-machinery

@logicalhan

This comment has been minimized.

Copy link
Contributor Author

commented May 29, 2019

/assign @sttts

@logicalhan logicalhan force-pushed the logicalhan:readiness branch 2 times, most recently from 4f73e7f to 51dd241 May 29, 2019

@logicalhan logicalhan force-pushed the logicalhan:readiness branch from 51dd241 to 8f67ff5 May 29, 2019

@sttts

This comment has been minimized.

Copy link
Contributor

commented May 29, 2019

First commit message needs a little rework https://chris.beams.io/posts/git-commit/

@logicalhan logicalhan force-pushed the logicalhan:readiness branch from 69869e7 to 5203451 Jun 4, 2019

@logicalhan logicalhan force-pushed the logicalhan:readiness branch 2 times, most recently from 2b589df to b517233 Jun 4, 2019

@logicalhan

This comment has been minimized.

Copy link
Contributor Author

commented Jun 5, 2019

/test pull-kubernetes-e2e-gce-100-performance

@sttts

This comment has been minimized.

Copy link
Contributor

commented Jun 5, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Jun 5, 2019

@logicalhan

This comment has been minimized.

Copy link
Contributor Author

commented Jun 5, 2019

/assign @deads2k

@logicalhan

This comment has been minimized.

Copy link
Contributor Author

commented Jun 5, 2019

I like getting readiness, but I'm inclined to wait until 1.16 to merge this.

I think some explanation of how the flag for ignoring incomplete post-starthooks for a while is better than setting a longer timeout on a health check may be worthwhile too. I'm not strictly against the idea, but I'm not completely clear on why I'd choose specifying a flag over specifying a health check timeout.

@deads2k I've updated the release notes as discussed, amended with some of @sttts' suggestions (thanks @sttts!)

healthzLock sync.Mutex
healthzChecks []healthz.HealthzChecker
healthzCreated bool
healthzLock sync.Mutex

This comment has been minimized.

Copy link
@deads2k

deads2k Jun 5, 2019

Contributor

optional nit: we're up to five fields for health checks at this point, three are the same as readiness. I would consider a struct.

This comment has been minimized.

Copy link
@logicalhan

logicalhan Jun 5, 2019

Author Contributor

That sounds like a good idea. I'll refactor this in a follow-up PR.

@deads2k

This comment has been minimized.

Copy link
Contributor

commented Jun 5, 2019

Spoke on slack. We'll remove the TODO so that remote debugging with proper permissions is still possible.

/approve

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Jun 5, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, logicalhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

healthzChecks: c.HealthzChecks,
healthzChecks: c.HealthzChecks,
readyzChecks: c.ReadyzChecks,
readinessStopCh: make(chan struct{}),

This comment has been minimized.

Copy link
@sttts

sttts Jun 5, 2019

Contributor

does the wiring really need these 3 in the config or would GenericAPIServer be enough?

This comment has been minimized.

Copy link
@logicalhan

logicalhan Jun 5, 2019

Author Contributor

I'm not sure I follow what you are saying.. the portion of the code you are commenting on is in fact the GenericAPIServer. Specifically this is where we initialize the GenericAPIServer object.


DiscoveryGroupManager: discovery.NewRootAPIsHandler(c.DiscoveryAddresses, c.Serializer),

enableAPIResponseCompression: c.EnableAPIResponseCompression,
maxRequestBodyBytes: c.MaxRequestBodyBytes,
healthzClock: clock.RealClock{},

This comment has been minimized.

Copy link
@sttts

sttts Jun 5, 2019

Contributor

does the wiring really need this in the config?

This comment has been minimized.

Copy link
@logicalhan

logicalhan Jun 5, 2019

Author Contributor

My thought was that it would be wise to initialize the GenericAPIServer with the clock since we can call AddPostStartHook arbitrarily on the GenericAPIServer object (prior to the installHealthz and installReadyz calls, in fact).

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jun 5, 2019

@soggiest

This comment has been minimized.

Copy link
Contributor

commented Jun 6, 2019

/milestone v1.16

add readyz endpoint for kube-apiserver readiness checks
add startup sequence duration and readyz endpoint

add rbac bootstrapping policy for readyz

add integration test around grace period and readyz

rename startup sequence duration flag

copy health checks to fields

rename health-check installed boolean, refactor clock injection logic

cleanup clock injection code

remove todo about poststarthook url registration from healthz

@logicalhan logicalhan force-pushed the logicalhan:readiness branch from bb8d641 to 54dcf5c Jun 17, 2019

@sttts

This comment has been minimized.

Copy link
Contributor

commented Jun 17, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Jun 17, 2019

@k8s-ci-robot k8s-ci-robot merged commit ae3c44d into kubernetes:master Jun 17, 2019

23 checks passed

cla/linuxfoundation logicalhan authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-node-e2e-containerd Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details

func (h *delayedCheck) Check(req *http.Request) error {
h.healthLock.Lock()
defer h.healthLock.Unlock()

This comment has been minimized.

Copy link
@tedyu

tedyu Jun 17, 2019

Contributor

If healthLock is RWMutex, we can take RLock in this func, making the check more efficient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.