
Use actual etcd client for /healthz/etcd checks #65027

Merged
merged 1 commit into kubernetes:master on Jun 21, 2018

Conversation

liggitt (Member) commented Jun 12, 2018

  • avoids redialing etcd on every health check (which makes slow DNS a false-positive healthz failure)
  • ensures etcd TLS setup is correct (errors verifying the etcd API or sending client credentials manifest as healthz failures)
  • ensures the etcd cluster is actually responsive

fixes #64909

Etcd health checks by the apiserver now ensure the apiserver can connect to and exercise the etcd API

k8s-ci-robot (Contributor) commented Jun 12, 2018

@liggitt: GitHub didn't allow me to request PR reviews from the following users: gyuho.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@kubernetes/sig-api-machinery-pr-reviews

/cc @xiang90 @gyuho

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

c.HealthzChecks = append(c.HealthzChecks, healthz.NamedCheck("etcd", func(r *http.Request) error {
    done, err := preflight.EtcdConnection{ServerList: s.StorageConfig.ServerList}.CheckEtcdServers()

xiang90 (Contributor) Jun 12, 2018

Is the CheckEtcdServers func used by anything else after this PR?

liggitt (Author, Member) Jun 12, 2018

It looks like it's used as a gate when starting cmd/kube-apiserver... the etcd endpoint must be able to establish a TCP connection, but that's it. If we want to remove that pre-check and delete the package entirely, I'd like to do it in a follow-up.

xiang90 (Contributor) Jun 12, 2018

sure. thanks.

liggitt (Member, Author) commented Jun 13, 2018

Tested this with both etcd2 and etcd3. When the etcd server was unavailable, /healthz/etcd returned failure messages without hanging. When the etcd server became available again, the /healthz/etcd endpoint recovered and started returning success again.


clientValue := &atomic.Value{}

clientErrMsg := &atomic.Value{}

hzxuzhonghu (Member) Jun 13, 2018

What's this for? I think if wait.PollUntil is not blocking, this is useful.

liggitt (Author, Member) Jun 13, 2018

I think If wait.PollUntil is not blocking, this is useful.

Good catch. In my tests, etcd was always available as the apiserver was coming up; backgrounded the client init loop as I originally intended.

// constructing the etcd v3 client blocks and times out if etcd is not available.
// retry in a loop until we successfully create the client, storing the client or error encountered

clientValue := &atomic.Value{}


@liggitt liggitt force-pushed the liggitt:etcd-health-check branch from 825742c to b39cd00 Jun 13, 2018

hzxuzhonghu (Member) commented Jun 13, 2018

/lgtm

smarterclayton (Contributor) commented Jun 13, 2018

k8s-ci-robot (Contributor) commented Jun 13, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu, liggitt, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fejta-bot commented Jun 17, 2018

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@liggitt liggitt added this to the v1.12 milestone Jun 20, 2018

jpbetz (Contributor) commented Jun 20, 2018

/cc @cheftako

k8s-github-robot (Contributor) commented Jun 21, 2018

[MILESTONENOTIFIER] Milestone Pull Request Labels Incomplete

@hzxuzhonghu @liggitt

Action required: This pull request requires label changes. If the required changes are not made within 2 days, the pull request will be moved out of the v1.12 milestone.

kind: Must specify exactly one of kind/bug, kind/cleanup or kind/feature.
priority: Must specify exactly one of priority/critical-urgent, priority/important-longterm or priority/important-soon.

hzxuzhonghu (Member) commented Jun 21, 2018

/test pull-kubernetes-e2e-gce

k8s-github-robot (Contributor) commented Jun 21, 2018

Automatic merge from submit-queue (batch tested with PRs 64140, 64898, 65022, 65037, 65027). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 9d97913 into kubernetes:master Jun 21, 2018

18 checks passed

  • Submit Queue: Queued to run github e2e tests a second time.
  • cla/linuxfoundation: liggitt authorized
  • pull-kubernetes-bazel-build: Job succeeded.
  • pull-kubernetes-bazel-test: Job succeeded.
  • pull-kubernetes-cross: Skipped
  • pull-kubernetes-e2e-gce: Job succeeded.
  • pull-kubernetes-e2e-gce-100-performance: Job succeeded.
  • pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
  • pull-kubernetes-e2e-gke: Skipped
  • pull-kubernetes-e2e-kops-aws: Job succeeded.
  • pull-kubernetes-integration: Job succeeded.
  • pull-kubernetes-kubemark-e2e-gce: Job succeeded.
  • pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
  • pull-kubernetes-local-e2e: Skipped
  • pull-kubernetes-local-e2e-containerized: Skipped
  • pull-kubernetes-node-e2e: Job succeeded.
  • pull-kubernetes-typecheck: Job succeeded.
  • pull-kubernetes-verify: Job succeeded.
lavalamp (Member) commented Jun 21, 2018

@jpbetz @liggitt If healthz fails, kubelet will restart apiserver. Is restarting apiserver a good response to etcd health check failing? I suspect it will only make things worse? We need a readiness endpoint and not just liveness...

liggitt (Member, Author) commented Jun 21, 2018

Is restarting apiserver a good response to etcd health check failing? I suspect it will only make things worse?

Not necessarily... it fixes stuck TCP connections, which we have observed happening in failover cases. Open to further improvement here, but this is better than what we had.

@liggitt liggitt deleted the liggitt:etcd-health-check branch Jun 28, 2018

lavalamp (Member) commented Aug 3, 2018

time passes

It does seem like a good idea to drop all watches if our etcd goes away, and a restart is an especially (excessively?) thorough way to accomplish that...

However! It doesn't seem great for apiserver to continuously fail liveness checks and crash-loop while etcd is down / split; it should be sufficient to just not be ready. So, maybe liveness shouldn't care about etcd until we observe a healthy etcd and become ready, at which point if we lose readiness because etcd goes away, we should also lose liveness and be restarted.

We can probably come up with something better / more principled than that: AFAIK, no one has paid any holistic attention to apiserver's liveness / readiness checks.
