-
Notifications
You must be signed in to change notification settings - Fork 39k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Report KCM as unhealthy if leader election is wedged. #70971
Conversation
/assign @lavalamp |
staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go
Outdated
Show resolved
Hide resolved
// If we are more than 20 seconds after the lease durection that is past the timeout on the lease renew. | ||
// Time to start reporting ourselves as unhealthy. We should have died but conditions like deadlock can | ||
// prevent this. (See #70819) | ||
expiry := l.le.observedTime.Add(l.le.config.LeaseDuration).Add(time.Second * 20) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little conflicted about a hardcoded 20 second timeout. I think not every user of this library will exit like controller manager does, so maybe it should be configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I have pushed it as configurable up one level. However I'm not sure I want to push it all the way to the CLI just yet.
staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go
Outdated
Show resolved
Hide resolved
// If we are more than 20 seconds after the lease durection that is past the timeout on the lease renew. | ||
// Time to start reporting ourselves as unhealthy. We should have died but conditions like deadlock can | ||
// prevent this. (See #70819) | ||
expiry := l.le.observedTime.Add(l.le.config.LeaseDuration).Add(time.Second * 20) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you add a function that exposes observedTime (or the observedRecord's acquireTime), then maybe you can put all of this code in another package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could but I don't want to expose the inner workings of leaderelection as I feel it should remain obfuscated. The more we expose these details the less room we have to fix it later. I also think that having the optional healthz check which is responsible for/understands the inner workings of leaderelection is a good thing. I agree with not wanting to pull in server code to the client package but I think you other suggestions have essentially accomplished that.
staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go
Outdated
Show resolved
Hide resolved
I like the idea of this change. /milestone 1.13 |
@deads2k: The provided milestone is not valid for this repository. Milestones in this repository: [ Use In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/priority important-soon |
/test pull-kubernetes-integration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No structural comments this time, just a bunch of little nits.
staging/src/k8s.io/client-go/tools/leaderelection/healthzadaptor.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/healthzadaptor.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/healthzadaptor.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/healthzadaptor.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/healthzadaptor_test.go
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just nits now.
staging/src/k8s.io/client-go/tools/leaderelection/healthzadaptor.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/healthzadaptor_test.go
Outdated
Show resolved
Hide resolved
staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go
Outdated
Show resolved
Hide resolved
l.le = le | ||
} | ||
|
||
// NewHealthzAdaptor creates a basic healthz adaptor to monitor a leader election. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment missed :)
staging/src/k8s.io/client-go/tools/leaderelection/healthzadaptor_test.go
Outdated
Show resolved
Hide resolved
Feedback from lavalamp and deads2k. Changed Check() logic to be central to LeaderElector. Further changes, especially cleaning up the test code.
/approve Do we need to provide a way to disable this just in case it is causes a problem? I can't think of a specific thing that can go wrong-- actually, what about daylight savings time getting turned off? I know (hope?) we tell everyone to use UTC for servers, but... It seems remote so I won't hold this up any more. I guess folks can use Han's change to stop monitoring this if necessary anyway? |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cheftako, lavalamp The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Verified: $ curl "localhost:10252/healthz?verbose&exclude=leaderElection"
[+]leaderElection excluded: ok
healthz check passed |
What type of PR is this?
/kind bug
What this PR does / why we need it:
Mitigation on #70819.
Too late in the cycle to make a large x/net/http2 vendor change.
Adding code to track if the lease election is wedged and report as unhealthy.
That should allow the kubelet to restart the KCM.
Please note that this only works for clusters without --leader-elect turned off. (HA clusters must run leader election, it is theoretically optional in single master clusters )
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Mitigates #70819
Special notes for your reviewer:
Does this PR introduce a user-facing change?: