Modify timeout for etcd healthcheck #111399

Argh4k · 2022-07-25T13:42:22Z

Increase default timeout for etcd healthcheck to 15 seconds.
Add additional etcd check to readyz with 2 seconds timeout.

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR increases the default timeout for the etcd healthcheck to 15 seconds and adds a new etcd check to the readiness check with a timeout of 2 seconds. Currently, when the control plane is overloaded, healthchecks to etcd can take more than 2 seconds, which marks kube-apiserver as unhealthy even if the only problem is etcd performance degradation. Adding a 2-second check to readyz should help with load distribution between apiservers when etcd performance degrades.

Which issue(s) this PR fixes:

Fixes #111290

Special notes for your reviewer:

Does this PR introduce a user-facing change?

A new flag `etcd-ready-timeout` has been added. It configures the timeout of an additional etcd check performed as part of the readyz check.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2022-07-25T13:42:30Z

Hi @Argh4k. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Argh4k · 2022-07-25T14:01:55Z

/assign @mborsz @wojtek-t

MadhavJivrajani · 2022-07-26T06:50:52Z

/ok-to-test

wojtek-t · 2022-07-26T07:05:12Z

I'm fine with this change modulo the comments that I added.

But I would like to give a bit of time to sig-apimachinery folks to take a quick look at it.

/sig api-machinery

sttts · 2022-07-26T07:13:31Z

staging/src/k8s.io/apiserver/pkg/server/options/etcd.go

+	if err != nil {
+		return err
+	}
+	c.ReadyzChecks = append(c.ReadyzChecks, healthz.NamedCheck("etcd-readiness", func(r *http.Request) error {


we have a method for this

I can only see AddHealthChecks, which adds a healthcheck to readyz/livez/healthz. Its description even says "We should prefer this to adding healthChecks directly to the config unless we explicitly want to add a healthcheck only to a specific health endpoint." Am I missing something?

Here is the function that @sttts was talking about:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/healthz.go#L63

But that function is at the generic server level, whereas for etcd checks we are still operating at the config level.

That said, instead of doing it manually, please add an AddReadyzCheck method to the Config struct.

MadhavJivrajani

Could we extend the existing tests to cover this change?: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/options/etcd_test.go

sttts · 2022-07-26T07:19:06Z

I can see why we want the readyz flag. I don't see why we should change the healthz timeout from 2s to 15s and potentially break users who depend on the old behaviour.

Also, this change is not mentioned in the "Does this PR introduce a user-facing change?" section even though it is user-facing.

wojtek-t · 2022-07-26T07:00:48Z

staging/src/k8s.io/apiserver/pkg/server/options/etcd.go

@@ -234,6 +236,14 @@ func (s *EtcdOptions) addEtcdHealthEndpoint(c *server.Config) error {
 		return healthCheck()
 	}))

+	readyCheck, err := storagefactory.CreateReadyCheck(s.StorageConfig, c.DrainedNotify())


I was wondering whether we should use a different channel here, although given that it is used as a stopCh, this makes sense.

wojtek-t · 2022-07-26T07:03:37Z

staging/src/k8s.io/apiserver/pkg/storage/storagebackend/config.go

@@ -35,7 +35,8 @@ const (

 	DefaultCompactInterval      = 5 * time.Minute
 	DefaultDBMetricPollInterval = 30 * time.Second
-	DefaultHealthcheckTimeout   = 2 * time.Second
+	DefaultHealthcheckTimeout   = 15 * time.Second


I understand and support the motivation as described in #111290.

But technically that's a breaking change: people who are not setting the timeout now will see the default behavior change.

To me it seems much safer to leave it set to 2s and rely on people who want to bump it using the already existing flag. @deads2k - FYI

I'm fine with leaving etcd as 2s and using existing flag to tune this.

Argh4k · 2022-07-26T19:21:18Z

> I can see why we want the readyz flag. I don't see why we should change the healthz timeout from 2s to 15s and potentially break users who depend on the old behaviour.
>
> Also, this change is not mentioned in the "Does this PR introduce a user-facing change?" section even though it is user-facing.

Changed the defaults back to their original values. You are right that changing them could break things for some users.

leilajal · 2022-07-26T20:05:57Z

/triage accepted

wojtek-t · 2022-07-27T06:15:48Z

staging/src/k8s.io/apiserver/pkg/server/options/etcd.go

+	if err != nil {
+		return err
+	}
+	c.ReadyzChecks = append(c.ReadyzChecks, healthz.NamedCheck("etcd-readiness", func(r *http.Request) error {


Here is the function that @sttts was talking about:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/healthz.go#L63

But that function is at the generic server level, whereas for etcd checks we are still operating at the config level.

That said, instead of doing it manually, please add an AddReadyzCheck method to the Config struct.

staging/src/k8s.io/apiserver/pkg/storage/storagebackend/factory/factory_test.go

wojtek-t

This looks good overall (modulo my one comment), but the test failures seem related to this PR.

wojtek-t · 2022-07-27T08:40:58Z

staging/src/k8s.io/apiserver/go.mod

@@ -110,6 +110,7 @@ require (
 	golang.org/x/term v0.0.0-20210927222741-03fcf44c2211 // indirect
 	golang.org/x/text v0.3.7 // indirect
 	golang.org/x/time v0.0.0-20220210224613-90d013bbcef8 // indirect
+	golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1 // indirect


Why is this needed? You're not changing imports...

I've run hack/update-vendor.sh because pull-kubernetes-dependencies was failing with:

Your vendored results are different:
diff -Naupr -x BUILD -x 'AUTHORS*' -x 'CONTRIBUTORS*' vendor/k8s.io/apiserver/go.mod /home/prow/go/src/k8s.io/kubernetes/_tmp/kube-vendor.wBNJ6i/kubernetes/vendor/k8s.io/apiserver/go.mod
--- vendor/k8s.io/apiserver/go.mod 2022-07-27 07:20:52.922456943 +0000
+++ /home/prow/go/src/k8s.io/kubernetes/_tmp/kube-vendor.wBNJ6i/kubernetes/vendor/k8s.io/apiserver/go.mod 2022-07-27 07:23:22.144718498 +0000
@@ -110,6 +110,7 @@ require (
 	golang.org/x/term v0.0.0-20210927222741-03fcf44c2211 // indirect
 	golang.org/x/text v0.3.7 // indirect
 	golang.org/x/time v0.0.0-20220210224613-90d013bbcef8 // indirect
+	golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1 // indirect
 	google.golang.org/appengine v1.6.7 // indirect
 	google.golang.org/genproto v0.0.0-20220502173005-c8bf987b8c21 // indirect
 	google.golang.org/protobuf v1.28.0 // indirect
Vendor Verify failed.
If you're seeing this locally, run the below command to fix your directories:
hack/update-vendor.sh

I'm not quite sure why it is needed. I changed the error formatting from %v to %w and used cmpopts.EquateErrors(), but I do not understand why verify-vendor says that this should be included as an indirect dependency.

Can you maybe revert that change then? I would like to avoid combining those two...

Removed cmpopts; it looks like it was causing this behaviour.

wojtek-t · 2022-07-27T12:45:21Z

/lgtm
/approve

Thanks!

k8s-ci-robot · 2022-07-27T12:45:53Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Argh4k, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cmd/kube-apiserver/OWNERS~~ [wojtek-t]
~~staging/src/k8s.io/apiserver/OWNERS~~ [wojtek-t]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from lavalamp and ping035627 July 25, 2022 13:45

k8s-ci-robot assigned mborsz and wojtek-t Jul 25, 2022

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 26, 2022

sttts reviewed Jul 26, 2022


MadhavJivrajani reviewed Jul 26, 2022


wojtek-t reviewed Jul 26, 2022


Argh4k force-pushed the i-111290 branch from 0ee6657 to e5d3f00 Compare July 26, 2022 19:16

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 26, 2022

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 26, 2022

wojtek-t reviewed Jul 27, 2022


Argh4k force-pushed the i-111290 branch 3 times, most recently from 969ef19 to 9776963 Compare July 27, 2022 07:44

k8s-ci-robot added the area/dependency Issues or PRs related to dependency changes label Jul 27, 2022

wojtek-t reviewed Jul 27, 2022


Argh4k force-pushed the i-111290 branch from 9776963 to 11ebf61 Compare July 27, 2022 09:09

Add additional etcd check to readyz with 2 seconds timeout.

b42045a

Argh4k force-pushed the i-111290 branch from 11ebf61 to b42045a Compare July 27, 2022 12:23

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 27, 2022

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 27, 2022

k8s-ci-robot merged commit 610b783 into kubernetes:master Jul 27, 2022

k8s-ci-robot added this to the v1.25 milestone Jul 27, 2022
