-
Notifications
You must be signed in to change notification settings - Fork 39.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modify timeout for etcd healthcheck #111399
Conversation
Hi @Argh4k. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test |
I'm fine with this change modulo the comments that I added. But I would like to give a bit of time to sig-apimachinery folks to take a quick look at it. /sig api-machinery |
if err != nil { | ||
return err | ||
} | ||
c.ReadyzChecks = append(c.ReadyzChecks, healthz.NamedCheck("etcd-readiness", func(r *http.Request) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have a method for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can only see AddHealthChecks which adds healthcheck to readyz/livez/healthz. It even has We should prefer this to adding healthChecks directly to the config unless we explicitly want to add a healthcheck only to a specific health endpoint.
in its description. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the function that @sttts was talking about:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/healthz.go#L63
But that is the function at the generic server level, whereas for etcd checks we're operating still at config level.
That said - instead of doing it manually, please create a AddReadyzCheck to the Config struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we extend the existing tests to cover this change?: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/options/etcd_test.go
I can see why we want the readyz flag. I don't see why changing the healthz timeout from 2s to 15s and potentially break users who depend on the old behaviour. Also this change is not mentioned in the "Does this PR introduce a user-facing change?" section while it is userfacing. |
@@ -234,6 +236,14 @@ func (s *EtcdOptions) addEtcdHealthEndpoint(c *server.Config) error { | |||
return healthCheck() | |||
})) | |||
|
|||
readyCheck, err := storagefactory.CreateReadyCheck(s.StorageConfig, c.DrainedNotify()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if we shouldn't use a different channel here, although given it's actually used as a stopCh it actually makes sense.
@@ -35,7 +35,8 @@ const ( | |||
|
|||
DefaultCompactInterval = 5 * time.Minute | |||
DefaultDBMetricPollInterval = 30 * time.Second | |||
DefaultHealthcheckTimeout = 2 * time.Second | |||
DefaultHealthcheckTimeout = 15 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the motivation and I'm supportive for the motivation as described in:
#111290
But technically that's a breaking change - for people who are not setting the timeout now, they will face a default behavior change.
For me it seems much safer to leave it set to 2s and just rely that people who would like to bump it will use the already existing flag. @deads2k - FYI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with leaving etcd as 2s and using existing flag to tune this.
Changed defaults back to original values. You are right that changing them, could break it for some users. |
/triage accepted |
if err != nil { | ||
return err | ||
} | ||
c.ReadyzChecks = append(c.ReadyzChecks, healthz.NamedCheck("etcd-readiness", func(r *http.Request) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the function that @sttts was talking about:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/healthz.go#L63
But that is the function at the generic server level, whereas for etcd checks we're operating still at config level.
That said - instead of doing it manually, please create a AddReadyzCheck to the Config struct.
staging/src/k8s.io/apiserver/pkg/storage/storagebackend/factory/factory_test.go
Show resolved
Hide resolved
969ef19
to
9776963
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good overall (modulo my one comment), but test failures seems related to this PR.
staging/src/k8s.io/apiserver/go.mod
Outdated
@@ -110,6 +110,7 @@ require ( | |||
golang.org/x/term v0.0.0-20210927222741-03fcf44c2211 // indirect | |||
golang.org/x/text v0.3.7 // indirect | |||
golang.org/x/time v0.0.0-20220210224613-90d013bbcef8 // indirect | |||
golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1 // indirect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? You're not changing imports...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've run /hack/update-vendor.sh because pull-kubernetes-dependecies
was failing with:
Your vendored results are different:
diff -Naupr -x BUILD -x 'AUTHORS*' -x 'CONTRIBUTORS*' vendor/k8s.io/apiserver/go.mod /home/prow/go/src/k8s.io/kubernetes/_tmp/kube-vendor.wBNJ6i/kubernetes/vendor/k8s.io/apiserver/go.mod
--- vendor/k8s.io/apiserver/go.mod 2022-07-27 07:20:52.922456943 +0000
+++ /home/prow/go/src/k8s.io/kubernetes/_tmp/kube-vendor.wBNJ6i/kubernetes/vendor/k8s.io/apiserver/go.mod 2022-07-27 07:23:22.144718498 +0000
@@ -110,6 +110,7 @@ require (
golang.org/x/term v0.0.0-20210927222741-03fcf44c2211 // indirect
golang.org/x/text v0.3.7 // indirect
golang.org/x/time v0.0.0-20220210224613-90d013bbcef8 // indirect
+ golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1 // indirect
google.golang.org/appengine v1.6.7 // indirect
google.golang.org/genproto v0.0.0-20220502173005-c8bf987b8c21 // indirect
google.golang.org/protobuf v1.28.0 // indirect
Vendor Verify failed.
If you're seeing this locally, run the below command to fix your directories:
hack/update-vendor.sh
I'm not quite sure why it is needed. I've changed formatting from %v to %w for error and used cmpopts.EquateErrors() but I do not understand why verify-vendor says that this should be included as indirect dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you maybe revert that change then? I would like to avoid combining those two...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed cmpopts, it looks like they were causing this behaviour
/lgtm Thanks! |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Argh4k, wojtek-t The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Increase default timeout for etcd healthcheck to 15 seconds.
Add additional etcd check to readyz with 2 seconds timeout.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR increases default timeout for etcd healthcheck to 15 seconds and adds new etcd check to readiness check with timeout of 2 seconds. Currently, when the control plane is overloaded, healthchecks to etcd can take more than 2 seconds marking kube apsierver unhealthy, even if it is only etcd performance degradation. Adding 2 seconds check to readyz should help with load distribution between apiservers in case of etcd performance degradation.
Which issue(s) this PR fixes:
Fixes #111290
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: