
Use actual etcd client for /healthz/etcd checks #65027

Merged
merged 1 commit into kubernetes:master on Jun 21, 2018

Conversation

liggitt (Member) commented Jun 12, 2018

  • avoids redialing etcd on every health check (which makes slow DNS a false-positive healthz failure)
  • ensures etcd TLS setup is correct (errors verifying the etcd API or sending client credentials manifest as healthz failures)
  • ensures the etcd cluster is actually responsive

fixes #64909

Etcd health checks by the apiserver now ensure the apiserver can connect to and exercise the etcd API

k8s-ci-robot (Contributor) commented Jun 12, 2018

@liggitt: GitHub didn't allow me to request PR reviews from the following users: gyuho.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@kubernetes/sig-api-machinery-pr-reviews

/cc @xiang90 @gyuho

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

c.HealthzChecks = append(c.HealthzChecks, healthz.NamedCheck("etcd", func(r *http.Request) error {
    done, err := preflight.EtcdConnection{ServerList: s.StorageConfig.ServerList}.CheckEtcdServers()

xiang90 (Contributor) Jun 12, 2018

Is the CheckEtcdServers func used by anything else after this PR?

liggitt (Author, Member) Jun 12, 2018

It looks like it's used as a gate when starting cmd/kube-apiserver... the etcd endpoint must be able to establish a TCP connection, but that's it. If we want to remove that pre-check and delete the package entirely, I'd like to do it in a follow-up.

xiang90 (Contributor) Jun 12, 2018

sure. thanks.

liggitt (Member, Author) commented Jun 13, 2018

Tested this with both etcd2 and etcd3. When the etcd server was unavailable, /healthz/etcd returned failure messages without hanging. When the etcd server became available again, the /healthz/etcd endpoint recovered and started returning success again.


clientValue := &atomic.Value{}

clientErrMsg := &atomic.Value{}

hzxuzhonghu (Member) Jun 13, 2018

What's this for? I think if wait.PollUntil is not blocking, this is useful.

liggitt (Author, Member) Jun 13, 2018

I think If wait.PollUntil is not blocking, this is useful.

Good catch. In my tests, etcd was always available as the apiserver was coming up; backgrounded the client init loop as I originally intended.

// constructing the etcd v3 client blocks and times out if etcd is not available.
// retry in a loop until we successfully create the client, storing the client or error encountered

clientValue := &atomic.Value{}


@liggitt liggitt force-pushed the liggitt:etcd-health-check branch from 825742c to b39cd00 Jun 13, 2018

hzxuzhonghu (Member) commented Jun 13, 2018

/lgtm

smarterclayton (Contributor) commented Jun 13, 2018

k8s-ci-robot (Contributor) commented Jun 13, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu, liggitt, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fejta-bot commented Jun 17, 2018

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@liggitt liggitt added this to the v1.12 milestone Jun 20, 2018

jpbetz (Contributor) commented Jun 20, 2018

/cc @cheftako

k8s-github-robot (Contributor) commented Jun 21, 2018

[MILESTONENOTIFIER] Milestone Pull Request Labels Incomplete

@hzxuzhonghu @liggitt

Action required: This pull request requires label changes. If the required changes are not made within 2 days, the pull request will be moved out of the v1.12 milestone.

kind: Must specify exactly one of kind/bug, kind/cleanup or kind/feature.
priority: Must specify exactly one of priority/critical-urgent, priority/important-longterm or priority/important-soon.

hzxuzhonghu (Member) commented Jun 21, 2018

/test pull-kubernetes-e2e-gce

k8s-github-robot (Contributor) commented Jun 21, 2018

Automatic merge from submit-queue (batch tested with PRs 64140, 64898, 65022, 65037, 65027). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 9d97913 into kubernetes:master Jun 21, 2018

18 checks passed

  • Submit Queue: Queued to run github e2e tests a second time.
  • cla/linuxfoundation: liggitt authorized
  • pull-kubernetes-bazel-build: Job succeeded.
  • pull-kubernetes-bazel-test: Job succeeded.
  • pull-kubernetes-cross: Skipped
  • pull-kubernetes-e2e-gce: Job succeeded.
  • pull-kubernetes-e2e-gce-100-performance: Job succeeded.
  • pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
  • pull-kubernetes-e2e-gke: Skipped
  • pull-kubernetes-e2e-kops-aws: Job succeeded.
  • pull-kubernetes-integration: Job succeeded.
  • pull-kubernetes-kubemark-e2e-gce: Job succeeded.
  • pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
  • pull-kubernetes-local-e2e: Skipped
  • pull-kubernetes-local-e2e-containerized: Skipped
  • pull-kubernetes-node-e2e: Job succeeded.
  • pull-kubernetes-typecheck: Job succeeded.
  • pull-kubernetes-verify: Job succeeded.
lavalamp (Member) commented Jun 21, 2018

@jpbetz @liggitt If healthz fails, kubelet will restart apiserver. Is restarting apiserver a good response to etcd health check failing? I suspect it will only make things worse? We need a readiness endpoint and not just liveness...

liggitt (Member, Author) commented Jun 21, 2018

Is restarting apiserver a good response to etcd health check failing? I suspect it will only make things worse?

Not necessarily... it fixes stuck TCP connections, which we have observed happening in failover cases. Open to further improvement here, but this is better than what we had.

@liggitt liggitt deleted the liggitt:etcd-health-check branch Jun 28, 2018

lavalamp (Member) commented Aug 3, 2018

time passes

It does seem like a good idea to drop all watches if our etcd goes away, and a restart is an especially (excessively?) thorough way to accomplish that...

However! It doesn't seem great for apiserver to continuously fail liveness checks and crash-loop while etcd is down / split; it should be sufficient to just not be ready. So, maybe liveness shouldn't care about etcd until we observe a healthy etcd and become ready, at which point if we lose readiness because etcd goes away, we should also lose liveness and be restarted.

We can probably come up with something better / more principled than that: AFAIK, no one has paid any holistic attention to apiserver's liveness / readiness checks.
