
kubelet apiserver: be gentle closing connections on heartbeat failures #108107

Merged
merged 2 commits into kubernetes:master
Mar 9, 2022

Conversation

aojea
Member

@aojea aojea commented Feb 14, 2022

Follow-up to #104844
Alternative to #107879

Kubelet was forcefully closing all connections (idle and active) on heartbeat failures (#63492).
However, since #95981, all clients using HTTP2 enable a health check by default that detects stale connections without any additional logic.

If users disable HTTP2 by setting the environment variable DISABLE_HTTP2, the previous behavior is maintained.
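
For context, a minimal sketch of how the Go HTTP2 transport health check works (the timeout values below are illustrative, not necessarily what client-go configures):

package main

import (
    "net/http"
    "time"

    "golang.org/x/net/http2"
)

// newPingingTransport enables the HTTP/2 ping-based health check: if no frame
// is received on a connection for ReadIdleTimeout, the transport sends a PING;
// if the PING is not answered within PingTimeout, the connection is closed and
// the client dials a fresh one.
func newPingingTransport() (*http.Transport, error) {
    t := &http.Transport{}
    t2, err := http2.ConfigureTransports(t)
    if err != nil {
        return nil, err
    }
    t2.ReadIdleTimeout = 30 * time.Second // send a PING after 30s of silence
    t2.PingTimeout = 15 * time.Second     // close the conn if the PING goes unanswered
    return t, nil
}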

/kind bug

kubelet no longer forcefully closes active connections on heartbeat failures, relying instead on the http2 health check mechanism to detect broken connections. Users can force the previous behavior of the kubelet by setting the environment variable DISABLE_HTTP2.



@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 14, 2022
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 14, 2022
@aojea
Member Author

aojea commented Feb 14, 2022

/assign @liggitt @wojtek-t
/cc @JohnRusk

I think that this is simpler than #107879

@k8s-ci-robot
Contributor

@aojea: GitHub didn't allow me to request PR reviews from the following users: johnRusk.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/assign @liggitt @wojtek-t
/cc @JohnRusk

I think that this is simpler than #107879

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aojea
Member Author

aojea commented Feb 14, 2022

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Feb 14, 2022
@wojtek-t
Member

This LGTM, but I would also like to hear from @liggitt

@JohnRusk

JohnRusk commented Feb 14, 2022

Might be nice to have a comment in the code to explain why the code has to close idle connections on heartbeat failure. I.e., why does failure of a heartbeat (which is presumably happening on a connection that is not idle) signal to us that we need to close the idle ones?

BTW, I like the idea of relying on Pings rather than heartbeats for monitoring the health of the one "live" HTTP2 connection. That looks nice.

The bit that seems confusing to me, and which I suggest may need an explanatory comment, is the fact that closeAllConnections gets wired up to a method that closes idle connections. It's not obvious to readers of the code (at least, not to me) why that is necessary or correct.

@aojea
Member Author

aojea commented Feb 15, 2022

The bit that seems confusing to me, and which I suggest may need an explanatory comment, is the fact that closeAllConnections gets wired up to a method that closes idle connections. It's not obvious to readers of the code (at least, not to me) why that is necessary or correct.

yeah, let's discuss the whole problem:

The thing is that the function is passed as kubeDeps.OnHeartbeatFailure:

case kubeDeps.KubeClient == nil, kubeDeps.EventClient == nil, kubeDeps.HeartbeatClient == nil:
    clientConfig, closeAllConns, err := buildKubeletClientConfig(ctx, s, nodeName)
    if err != nil {
        return err
    }
    if closeAllConns == nil {
        return errors.New("closeAllConns must be a valid function other than nil")
    }
    kubeDeps.OnHeartbeatFailure = closeAllConns

and then plumbed to the lease controller #107879

So the real semantics should be "a function that we call to do things on heartbeat failures": previously that was "closeAllConnections" and now it is "closeAllIdleConnections", but it can still mean "closeAllConnections" if HTTP2 is explicitly disabled.
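
A hypothetical sketch of that wiring (the helper name and structure are illustrative, not the actual code in buildKubeletClientConfig):

package main

import "os"

// chooseOnHeartbeatFailure picks the heartbeat-failure remedy. With HTTP/2,
// the ping-based health check already detects broken connections, so closing
// idle connections is enough; with HTTP/2 disabled we fall back to the old
// forceful behavior of closing everything.
func chooseOnHeartbeatFailure(closeAllConns, closeAllIdleConns func()) func() {
    if len(os.Getenv("DISABLE_HTTP2")) > 0 {
        return closeAllConns // HTTP/1: force-close all connections
    }
    return closeAllIdleConns
}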

The current situation is that we already have retry logic in client-go for network errors (defaulting to 10 retries):

retryAfter, retry = r.retry.NextRetry(req, resp, err, func(req *http.Request, err error) bool {
    // "Connection reset by peer" or "apiserver is shutting down" are usually a transient errors.
    // Thus in case of "GET" operations, we simply retry it.
    // We are not automatically retrying "write" operations, as they are not idempotent.
    if r.verb != "GET" {
        return false
    }
    // For connection errors and apiserver shutdown errors retry.
    if net.IsConnectionReset(err) || net.IsProbableEOF(err) {
        return true
    }
    return false
})

so some of the loops are really not needed, since they multiply the number of retries; for example, the node status update:

// updateNodeStatus updates node status to master with retries if there is any
// change or enough time passed from the last sync.
func (kl *Kubelet) updateNodeStatus() error {
    klog.V(5).InfoS("Updating node status")
    for i := 0; i < nodeStatusUpdateRetry; i++ {
        if err := kl.tryUpdateNodeStatus(i); err != nil {
            if i > 0 && kl.onRepeatedHeartbeatFailure != nil {
                kl.onRepeatedHeartbeatFailure()
            }
            klog.ErrorS(err, "Error updating node status, will retry")
        } else {
            return nil
        }
    }
    return fmt.Errorf("update node status exceeds retry count")
}

that means that an apiserver that is not replying can generate a maximum of 50 connection attempts per kubelet, because the status update itself is retried 5 times and each of those goes through the client-go retries:

pkg/kubelet/kubelet.go: // nodeStatusUpdateRetry specifies how many times kubelet retries when posting node status failed.
pkg/kubelet/kubelet.go: nodeStatusUpdateRetry = 5

TCP sockets are expensive, but you should only notice at relatively high scale; that is why I think small or relatively idle clusters don't notice this problem.
With HTTP2 (#95981) the client automatically detects stale connections: it has embedded heartbeat detection (pings), so we should not really need any heartbeat logic at all; the stdlib does it for us.
With HTTP1 you can only have one connection in the pool, so it can only be idle or active. If it is active, the client always dials a new connection; if it is idle, we close it to force the client to dial a new connection (avoiding reuse of a stale connection).
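
As a sketch of what closing idle connections looks like in practice (names here are illustrative, in the spirit of the helper this PR relies on): the client's RoundTripper may be wrapped several layers deep, so the helper walks the chain until it finds a transport that can close its idle connections.

package main

import "net/http"

// closeIdleConnectionsFor walks a possibly wrapped RoundTripper and closes
// idle connections on the first layer that supports it; active connections
// are left alone, and the client simply dials fresh connections for new
// requests.
func closeIdleConnectionsFor(rt http.RoundTripper) {
    type closeIdler interface{ CloseIdleConnections() }
    type roundTripperWrapper interface{ WrappedRoundTripper() http.RoundTripper }

    switch t := rt.(type) {
    case closeIdler:
        t.CloseIdleConnections() // e.g. *http.Transport
    case roundTripperWrapper:
        closeIdleConnectionsFor(t.WrappedRoundTripper())
    }
}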

This PR is the easy solution. I really don't know how to document this better, but in order to keep compatibility with old systems that still use HTTP1, I feel this is the least risky approach.

@fedebongio
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 15, 2022
@JohnRusk

Thanks for the explanation @aojea!

I reckon there's no need to add any additional docs to the code, since if anyone is curious they can find this PR from the commit history in the future, and read what you wrote above.

@JohnRusk

JohnRusk commented Feb 15, 2022

This PR supersedes #107781, so I should close #107781 now, right?

(Note to future readers, background discussion can be found in #107781)

@matthyx
Contributor

matthyx commented Feb 16, 2022

This PR supersedes #107781, so I should close #107781 now, right?

yes please :)

Contributor

@matthyx matthyx left a comment


/lgtm

@aojea
Member Author

aojea commented Feb 20, 2022

b9d865a

they show that the new function is able to recover from a situation where the TCP connection is broken but the endpoint is not aware of it, without closing the whole connection, just forcing the client to try a new one

@ryanzhang-oss

b9d865a

they show that the new function is able to recover from a situation where the TCP connection is broken but the endpoint is not aware of it, without closing the whole connection, just forcing the client to try a new one

I think the rule of thumb is that we need at least one test that fails before the fix and passes after. I am not sure if we can create a test case like that here.

@JohnRusk

JohnRusk commented Mar 8, 2022

Any status updates on this? I'm finding I have conversations with colleagues about this bug almost every day. Is the fix going ahead?

@wojtek-t
Member

wojtek-t commented Mar 9, 2022

/lgtm
/approve

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, matthyx, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 9, 2022
@k8s-ci-robot k8s-ci-robot merged commit a41f9e9 into kubernetes:master Mar 9, 2022
SIG Node PR Triage automation moved this from Triage to Done Mar 9, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Mar 9, 2022
@JohnRusk

JohnRusk commented Mar 9, 2022

@aojea Awesome to see this merged. Thank you! Are there any plans to create cherry pick PRs to get this into 1.23 (and maybe 1.22)? That could be helpful for users currently suffering from the issue but not ready for a major version upgrade.

@liggitt
Member

liggitt commented Mar 9, 2022

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-release/cherry-picks.md#what-kind-of-prs-are-good-for-cherry-picks

since this isn't fixing a regression from those versions, and I'm pretty skeptical it rises to the level of a critical bug fix, I wouldn't really expect a backport of this

@jackfrancis
Contributor

@liggitt The side-effect of certain cluster behaviors at scale seems to meet the "Panic, crash, hang" criterion in the cherry-pick definition. This change will ameliorate those apiserver degradation scenarios. Are we in disagreement about that? Or is the actual cherry-pick process itself non-trivial (composing a change as multiple PRs out-of-sequence, stuff like that)?

@liggitt
Member

liggitt commented Mar 9, 2022

While we don't anticipate issues with this fix (otherwise we wouldn't have merged it), backporting exposes release branches to unanticipated issues. Since this is a historically fragile area, I'd be extremely cautious about taking back a fix here for anything other than a regression in one of those releases

@JohnRusk

JohnRusk commented Mar 9, 2022

anything other than a regression in one of those releases

FYI, my understanding is that this issue is a regression, but that the regression happened several releases ago (e.g. in 1.18 or something; I haven't checked exactly). If that's correct, does it change anything in your reply @liggitt? (I imagine not, but I'm just checking :-))

@jackfrancis
Contributor

Understand the practical realities at play here, thx for clarifying @liggitt

@liggitt
Member

liggitt commented Mar 9, 2022

FYI, my understanding is that this issue is a regression, but that the regression happened several releases ago.

#63492 merged in 1.11 and was picked back to 1.8.x, so this behavior has existed since then. Choosing between an edge case that can result in a crash at scale and an edge case that results in a silent and ~unrecoverable hang of nodes is not a super clear choice. I'd still lean against backporting this.

@aojea
Member Author

aojea commented Mar 9, 2022

I agree with Jordan's judgement; however, in case of a backport, this can only go back as far as 1.23, since it depends on b9d865a and that is clearly not backportable

@JohnRusk

JohnRusk commented Mar 9, 2022

Thanks for the clarification guys. I understand your reasoning.

@djsly
Contributor

djsly commented Mar 23, 2022

So this will be in 1.25 only? We are currently on 1.21 and are affected every week or two in one of our clusters, at random.

@ehashman
Member

@djsly this is in 1.24 and will be backported to 1.23.

@ehashman
Member

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 12, 2022
k8s-ci-robot added a commit that referenced this pull request May 9, 2022
…8107-upstream-release-1.23

Automated cherry pick of #108107: kubelet apiserver: be gentle closing connections on