kubelet apiserver: be gentle closing connections on heartbeat failures #108107
Conversation
@aojea: GitHub didn't allow me to request PR reviews from the following users: johnRusk. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig api-machinery
This LGTM, but I would also like to hear from @liggitt
Might be nice to have a comment in the code to explain why the code has to close idle connections on heartbeat failure. I.e., why does failure of a heartbeat (which is presumably happening on a connection that is not idle) signal to us that we need to close the idle ones? BTW, I like the idea of relying on pings rather than heartbeats for monitoring the health of the one "live" HTTP2 connection. That looks nice. The bit that seems confusing to me, and which I suggest may need an explanatory comment, is the fact that closeAllConnections gets wired up to a method that closes idle connections. It's not obvious to readers of the code (at least, not to me) why that is necessary or correct.
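For readers puzzled by the ping mechanism mentioned here: below is a minimal sketch of how Go's golang.org/x/net/http2 package enables ping-based health checking on a transport. The timeout values and function name are illustrative, not Kubernetes' actual defaults.

```go
package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func newHealthCheckedClient() (*http.Client, error) {
	t := &http.Transport{}
	// Upgrade the HTTP/1 transport to HTTP/2 and expose its settings.
	h2, err := http2.ConfigureTransports(t)
	if err != nil {
		return nil, err
	}
	// If a connection sees no frames for ReadIdleTimeout, an HTTP/2
	// PING is sent; if the peer does not answer within PingTimeout,
	// the connection is torn down as unhealthy.
	h2.ReadIdleTimeout = 30 * time.Second
	h2.PingTimeout = 15 * time.Second
	return &http.Client{Transport: t}, nil
}

func main() {
	if _, err := newHealthCheckedClient(); err != nil {
		panic(err)
	}
}
```

This is what makes heartbeat-driven connection closing largely redundant for HTTP/2 clients: the transport itself notices dead connections.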
Yeah, let's discuss the whole problem. The thing is that the function is passed in kubernetes/cmd/kubelet/app/server.go (lines 556 to 564 in d899c39)
and then plumbed to the lease controller (#107879). So the real semantics should be "a function that we call to do things on heartbeat failures": previously that was "closeAllConnections" and now it is "closeAllIdleConnections", but it can still mean "closeAllConnections" if HTTP2 is explicitly disabled (see the sketch after this comment). The current situation is that we already have retry logic in client-go on network errors (defaulting to 10 times): kubernetes/staging/src/k8s.io/client-go/rest/request.go (lines 1000 to 1012 in d899c39)
so some of the loops are really not needed, since they multiply the number of retries; for example, the node status update: kubernetes/pkg/kubelet/kubelet_node_status.go (lines 451 to 463 in d899c39)
That means an apiserver that is not replying will trigger a maximum of 50 connection attempts per kubelet, because the node status loop retries 5 times on top of those client-go retries (see the toy sketch below).
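A toy sketch of that multiplication (the constants mirror the defaults cited above; the names are mine, not the kubelet's):

```go
package main

import "fmt"

func main() {
	const nodeStatusTries = 5  // outer retry loop in kubelet_node_status.go
	const clientGoRetries = 10 // default network-error retries in client-go

	attempts := 0
	for i := 0; i < nodeStatusTries; i++ {
		for j := 0; j < clientGoRetries; j++ {
			attempts++ // each failed attempt may dial a brand-new TCP connection
		}
	}
	fmt.Println("worst-case connection attempts:", attempts) // prints 50
}
```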
TCP sockets are expensive, but you should only notice at relatively high scale; that is why I think small or relatively idle clusters don't notice this problem. This PR is the easy solution. I really don't know how to document this better, but in order to keep compatibility with old systems that still use HTTP/1 I feel this is the least risky approach.
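To make the callback semantics concrete, here is a hedged sketch (hypothetical names, not the actual kubelet code) of the one function that gets plumbed down to the lease controller:

```go
package heartbeat

import (
	"net/http"
	"os"
)

// OnHeartbeatFailure selects what the lease controller calls after
// repeated heartbeat failures. closeAllConns stands in for the
// pre-existing "close every connection" behavior.
func OnHeartbeatFailure(t *http.Transport, closeAllConns func()) func() {
	if os.Getenv("DISABLE_HTTP2") != "" {
		// HTTP/1.1 has no ping-based health check, so keep the old,
		// aggressive behavior for compatibility.
		return closeAllConns
	}
	// With HTTP/2, pings already weed out dead connections; dropping
	// idle ones is enough to force the client onto a fresh dial
	// without killing in-flight requests.
	return t.CloseIdleConnections
}
```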
/triage accepted
Thanks for the explanation @aojea! I reckon there's no need to add any additional docs to the code, since anyone who is curious can find this PR from the commit history in the future and read what you wrote above.
/lgtm
They show that the new function is able to recover from a situation where the TCP connection is broken but the endpoint is not aware of it, without requiring us to close the whole connection: it just forces the client to try a new connection (see the sketch below).
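The recovery pattern those tests exercise looks roughly like this (a sketch with hypothetical names, not the test code itself): a half-open TCP connection only reveals itself when used, and evicting idle connections keeps the retry from reusing it.

```go
package heartbeat

import "net/http"

// GetWithRecovery retries a request once after evicting pooled
// connections. A peer that vanished without sending FIN/RST leaves a
// pooled connection that looks healthy until the next use fails.
func GetWithRecovery(c *http.Client, t *http.Transport, url string) (*http.Response, error) {
	resp, err := c.Get(url)
	if err == nil {
		return resp, nil
	}
	// Drop idle connections so the retry cannot reuse the broken one.
	t.CloseIdleConnections()
	return c.Get(url)
}
```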
I think the rule of thumb is that we need at least one test that fails before the fix and passes after. I am not sure if we can create a test case like that here.
Any status updates on this? I'm finding I have conversations with colleagues about this bug almost every day. Is the fix going ahead?
/lgtm Thanks!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: aojea, matthyx, wojtek-t. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@aojea Awesome to see this merged. Thank you! Are there any plans to create cherry-pick PRs to get this into 1.23 (and maybe 1.22)? That could be helpful for users currently suffering from the issue but not ready for a major version upgrade.
Since this isn't fixing a regression from those versions, and I'm pretty skeptical it rises to the level of a critical bug fix, I wouldn't really expect a backport of this.
@liggitt The side-effect of certain cluster behaviors at scale seems to meet the "Panic, crash, hang" criterion in the cherry-pick definition. This change will ameliorate those apiserver degradation scenarios. Are we in disagreement about that? Or is the actual cherry-pick process itself non-trivial (composing a change as multiple PRs out of sequence, stuff like that)?
While we don't anticipate issues with this fix (otherwise we wouldn't have merged it), backporting exposes release branches to unanticipated issues. Since this is a historically fragile area, I'd be extremely cautious about taking back a fix here for anything other than a regression in one of those releases.
FYI, my understanding is that this issue is a regression, but that the regression happened several releases ago (e.g. in 1.18 or something; I haven't checked exactly). If that's correct, does it change anything in your reply @liggitt? (I imagine not, but I'm just checking :-))
Understand the practical realities at play here, thanks for clarifying @liggitt.
#63492 merged in 1.11 and was picked back to 1.8.x, so this behavior has existed since then. Choosing between an edge case that can result in a crash at scale and an edge case that results in a silent and ~unrecoverable hang of nodes is not a super clear choice. I'd still lean against backporting this.
I agree with Jordan's judgement; however, in case of a backport, this can only go back as far as 1.23, since it depends on b9d865a and that is clearly not backportable.
Thanks for the clarification guys. I understand your reasoning.
So this will be in 1.25 only? We are currently on 1.21 and affected once every week or two in one of our clusters, at random.
@djsly this is in 1.24 and will be backported to 1.23.
/priority important-soon
…8107-upstream-release-1.23: Automated cherry pick of #108107: kubelet apiserver: be gentle closing connections on heartbeat failures
Follow up on #104844
Alternative to #107879
Kubelet was forcefully closing all connections (idle and active) on heartbeat failures (#63492).
However, since #95981, all clients using HTTP2 use a health check by default that detects stale connections without any additional logic.
In case users disable HTTP/2 by setting the environment variable DISABLE_HTTP2, the previous behavior is maintained.
/kind bug