Use actual etcd client for /healthz/etcd checks #65027

Merged
merged 1 commit into from
Jun 21, 2018

Conversation

liggitt
Member

@liggitt liggitt commented Jun 12, 2018

  • avoids redialing etcd on every health check (which makes slow DNS a false-positive healthz failure)
  • ensures etcd TLS setup is correct (errors verifying the etcd API or sending client credentials manifest as healthz failures)
  • ensures the etcd cluster is actually responsive

fixes #64909

Etcd health checks by the apiserver now ensure the apiserver can connect to and exercise the etcd API

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 12, 2018
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 12, 2018
@liggitt
Member Author

liggitt commented Jun 12, 2018

@kubernetes/sig-api-machinery-pr-reviews

/cc @xiang90 @gyuho

@k8s-ci-robot
Contributor

@liggitt: GitHub didn't allow me to request PR reviews from the following users: gyuho.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@kubernetes/sig-api-machinery-pr-reviews

/cc @xiang90 @gyuho

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jun 12, 2018
c.HealthzChecks = append(c.HealthzChecks, healthz.NamedCheck("etcd", func(r *http.Request) error {
done, err := preflight.EtcdConnection{ServerList: s.StorageConfig.ServerList}.CheckEtcdServers()
Contributor

Is the checkEtcdServers func used by anything else after this PR?

Member Author

it looks like it's used as a gate when starting cmd/kube-apiserver... the etcd endpoint must be able to establish a tcp connection, but that's it. if we want to remove that pre-check, and delete the package entirely, I'd like to do it in a follow up.

Contributor

sure. thanks.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 12, 2018
@liggitt
Member Author

liggitt commented Jun 13, 2018

tested this with both etcd2 and etcd3. when the etcd server was unavailable, /healthz/etcd returned failure messages without hanging. when the etcd server became available again, the /healthz/etcd endpoint recovered and started returning success again


clientValue := &atomic.Value{}

clientErrMsg := &atomic.Value{}
Member

what's this for? I think if wait.PollUntil is not blocking, this is useful.

Member Author

@liggitt liggitt Jun 13, 2018

> I think if wait.PollUntil is not blocking, this is useful.

good catch, in my tests, etcd was always available as the apiserver was coming up. backgrounded the client init loop as I intended to originally

// constructing the etcd v3 client blocks and times out if etcd is not available.
// retry in a loop until we successfully create the client, storing the client or error encountered

clientValue := &atomic.Value{}
Member

ditto

@hzxuzhonghu
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 13, 2018
@smarterclayton
Contributor

smarterclayton commented Jun 13, 2018 via email

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu, liggitt, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2018
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@liggitt liggitt added this to the v1.12 milestone Jun 20, 2018
@jpbetz
Contributor

jpbetz commented Jun 20, 2018

/cc @cheftako

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request Labels Incomplete

@hzxuzhonghu @liggitt

Action required: This pull request requires label changes. If the required changes are not made within 2 days, the pull request will be moved out of the v1.12 milestone.

kind: Must specify exactly one of kind/bug, kind/cleanup or kind/feature.
priority: Must specify exactly one of priority/critical-urgent, priority/important-longterm or priority/important-soon.

Help

@hzxuzhonghu
Member

/test pull-kubernetes-e2e-gce

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 64140, 64898, 65022, 65037, 65027). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 9d97913 into kubernetes:master Jun 21, 2018
@lavalamp
Member

@jpbetz @liggitt If healthz fails, kubelet will restart apiserver. Is restarting apiserver a good response to etcd health check failing? I suspect it will only make things worse? We need a readiness endpoint and not just liveness...

@liggitt
Member Author

liggitt commented Jun 21, 2018

> Is restarting apiserver a good response to etcd health check failing? I suspect it will only make things worse?

not necessarily... it fixes stuck tcp connections, which we have observed happening in failover cases. open to further improvement here, but this is better than what we had

@lavalamp
Member

lavalamp commented Aug 3, 2018

time passes

It does seem like a good idea to drop all watches if our etcd goes away, and a restart is an especially (excessively?) thorough way to accomplish that...

However! It doesn't seem great for apiserver to continuously fail liveness checks and crash-loop while etcd is down / split; it should be sufficient to just not be ready. So, maybe liveness shouldn't care about etcd until we observe a healthy etcd and become ready, at which point if we lose readiness because etcd goes away, we should also lose liveness and be restarted.

We can probably come up with something better / more principled than that: AFAIK, no one has paid any holistic attention to apiserver's liveness / readiness checks.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. milestone/incomplete-labels release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

etcd healthcheck is too simple
9 participants