
kubelet fails to heartbeat with API server with stuck TCP connections #48638

Closed
derekwaynecarr opened this issue Jul 7, 2017 · 33 comments

@derekwaynecarr (Member) commented Jul 7, 2017

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
The operator is running an HA master setup with a load balancer (LB) in front. The kubelet attempts to update node status, but tryUpdateNodeStatus wedges. Based on the goroutine dump, the wedge happens when it attempts to GET the latest state of the node from the master. The operator observed 15-minute intervals between attempts to update node status while the kubelet could not contact the master; presumably this is when the LB ultimately closes the connection. The impact is that the node controller then marked the node as lost, and the workload was evicted.

What you expected to happen:
Expected the kubelet to time out client-side.
Right now, no kubelet -> master communication has a timeout.
Ideally, the kubelet -> master communication would have a timeout based on the configured node-status-update-frequency, so that no single attempt to update status wedges future attempts.
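For illustration, a minimal sketch (not the kubelet's actual code) of the kind of client-side timeout being asked for, using client-go's rest.Config; the kubeconfig path and the 10-second value are assumptions:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig path, for illustration only.
	config, err := clientcmd.BuildConfigFromFlags("", "/var/lib/kubelet/kubeconfig")
	if err != nil {
		panic(err)
	}

	// Cap every request made through this client. Tying the cap to the
	// node-status-update-frequency (10s by default) would keep one wedged
	// status update from blocking all future attempts.
	config.Timeout = 10 * time.Second

	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	fmt.Printf("built client with a %s request timeout: %T\n", config.Timeout, client)
}
```

One caveat with a blanket timeout like this: it would also cut long-running watch requests short, which is the concern about watch code raised in the comments below.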

How to reproduce it (as minimally and precisely as possible):
See above.

@k8s-github-robot (Contributor) commented Jul 7, 2017

@derekwaynecarr There are no sig labels on this issue. Please add a sig label by:
(1) mentioning a sig: @kubernetes/sig-<team-name>-misc
e.g., @kubernetes/sig-api-machinery-* for API Machinery
(2) specifying the label manually: /sig <label>
e.g., /sig scalability for sig/scalability

Note: method (1) will trigger a notification to the team. You can find the team list here and label list here

@derekwaynecarr (Member, Author) commented Jul 7, 2017

@vishh (Member) commented Jul 7, 2017

@derekwaynecarr (Member, Author) commented Jul 7, 2017

We have to set up some timeout, but I am worried about the potential impact on any watch code.

@derekwaynecarr (Member, Author) commented Jul 7, 2017

Note: this is different from the Dial timeout (for which we do appear to have a default).
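For illustration, a generic sketch (not the kubelet's actual transport setup) of the difference: a dial timeout only bounds establishing a new TCP connection, while a request sent over an already-established but silently dead connection keeps waiting unless the client also has an overall timeout.

```go
package main

import (
	"net"
	"net/http"
	"time"
)

func main() {
	transport := &http.Transport{
		// Bounds only how long it may take to open a new TCP connection.
		// A GET issued over an existing connection whose peer has silently
		// disappeared is not affected by this value at all.
		DialContext: (&net.Dialer{
			Timeout:   30 * time.Second,
			KeepAlive: 30 * time.Second,
		}).DialContext,
	}

	// Without an overall timeout like this one, such a GET can hang until
	// the kernel gives up retransmitting (~15 minutes with defaults).
	client := &http.Client{
		Transport: transport,
		Timeout:   10 * time.Second,
	}
	_ = client
}
```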

@xiangpengzhao (Member) commented Jul 7, 2017

I remember @jayunit100 has a related PR, but I don't know whether it was ultimately merged.

@derekwaynecarr (Member, Author) commented Jul 7, 2017

Need to take a pass at kube-proxy too; at first glance, I see no explicit timeout there either.

@ncdc (Member) commented Jul 9, 2017

@derekwaynecarr fyi I reported this earlier as #44557

@gmarek (Member) commented Jul 14, 2017

#48926 is a blocker for this.

@derekwaynecarr (Member, Author) commented Jul 17, 2017

@obeattie had a recommendation worth evaluating here:
#41916 (comment)

It would require us to tune net.ipv4.tcp_retries2;
see: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

In practice, it would make sense for the kubelet to monitor this value to ensure the node status update interval falls within the desired range.

@vishh @gmarek - what are your thoughts? Longer term, we should still split client timeouts for long vs. short operations, but this tweak would help address many situations where LB communication to a master introduces problems.
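A rough sketch of the monitoring idea above, assuming the kubelet would just read /proc/sys/net/ipv4/tcp_retries2 and estimate the resulting retransmission window; the backoff math is an approximation (with defaults, tcp_retries2=15 works out to roughly 15 minutes, which matches the interval observed here), and the warning threshold is made up:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// approxRetransmitWindow roughly estimates how long the kernel keeps
// retransmitting on a dead connection: exponential backoff starting around
// 200ms, capped at 120s per retry. With the default tcp_retries2=15 this
// comes out to somewhere around 15 minutes.
func approxRetransmitWindow(retries int) time.Duration {
	rto := 200 * time.Millisecond
	var total time.Duration
	for i := 0; i < retries; i++ {
		total += rto
		rto *= 2
		if rto > 120*time.Second {
			rto = 120 * time.Second
		}
	}
	return total
}

func main() {
	raw, err := os.ReadFile("/proc/sys/net/ipv4/tcp_retries2")
	if err != nil {
		fmt.Println("could not read tcp_retries2:", err)
		return
	}
	retries, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		fmt.Println("unexpected tcp_retries2 value:", err)
		return
	}

	nodeStatusUpdateFrequency := 10 * time.Second // kubelet default
	if window := approxRetransmitWindow(retries); window > 4*nodeStatusUpdateFrequency {
		fmt.Printf("warning: tcp_retries2=%d (~%s of retransmission) far exceeds the node status update interval\n",
			retries, window.Round(time.Second))
	}
}
```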

@obeattie commented Jul 17, 2017

@derekwaynecarr Have you seen #48670?

@derekwaynecarr (Member, Author) commented Jul 17, 2017

@obeattie will take a look, looks promising.

@SleepyBrett commented Jul 18, 2017

We've had three major events in the last few weeks that come down to this problem. Watches set up through an ELB node that gets replaced or scaled down cause large numbers of nodes to go NotReady for 15 minutes, causing very scary cluster turbulence. (We've generally seen between a third and half of the nodes go NotReady.) We're currently evaluating other ways to load-balance the API servers for the components we currently send through the ELB (I haven't pored through everything, but I think that boils down to the kubelet and kube-proxy, and possibly flannel).

@gmarek (Member) commented Jul 19, 2017

I know it's not strictly related to this problem, but after a third of a 'zone' goes down, NodeController will drastically reduce the rate at which it evicts Pods - exactly to protect you from damaging your cluster too much in situations like this.

Given that this is an important problem, you could try pushing @kubernetes/sig-api-machinery-feature-requests to make the required changes to client-go.

@jdumars (Member) commented Jul 25, 2017

/sig azure
Added for visibility

george-angel added a commit to utilitywarehouse/tf_kube_aws that referenced this issue Feb 23, 2018

Switch master ELB to NLB
Primarily to mitigate the Kubernetes issue with kubelets losing their
connection to masters (~06:20 in the morning for us) and, after a period,
ejecting pods until that connection is reestablished. Some of the related
issues and pull requests:

 - kubernetes/kubernetes#41916
 - kubernetes/kubernetes#48638
 - kubernetes/kubernetes#52176

Using an NLB is a suggested mitigation from:
kubernetes/kubernetes#52176 (comment)

NLBs do not support security groups, so we delete the ELB SG and open
443 to the world on the master security group.
@sstarcher commented Apr 12, 2018

@derekwaynecarr @liggitt Wanted to reiterate we are seeing this issue in k8s 1.9.3
kubernetes/kops#4946

@liggitt liggitt reopened this May 7, 2018

@liggitt liggitt changed the title kubelet -> master client communication does not have a timeout kubelet fails to heartbeat with API server with stuck TCP connections May 7, 2018

@liggitt liggitt added the sig/aws label May 7, 2018

@liggitt (Member) commented May 7, 2018

Reopening: the hang is resolved, but the underlying stuck TCP connection issue is not.

@liggitt (Member) commented May 7, 2018

Since we're already tracking open API connections from the kubelet in client cert rotation cases, and have the ability to force-close those connections, the simplest change is likely to reuse that option in response to a heartbeat failure. That has the added benefit of dropping dead connections for all kubelet -> API calls, not just the heartbeat client.
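A minimal sketch of that connection-tracking idea; the type and method names are illustrative, not the kubelet's actual cert-rotation transport code:

```go
package main

import (
	"context"
	"net"
	"sync"
	"time"
)

// trackingDialer remembers every connection it creates so that all of them
// can be force-closed at once, e.g. after repeated heartbeat failures.
type trackingDialer struct {
	dialer net.Dialer
	mu     sync.Mutex
	conns  map[net.Conn]struct{}
}

func (d *trackingDialer) DialContext(ctx context.Context, network, addr string) (net.Conn, error) {
	conn, err := d.dialer.DialContext(ctx, network, addr)
	if err != nil {
		return nil, err
	}
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.conns == nil {
		d.conns = map[net.Conn]struct{}{}
	}
	d.conns[conn] = struct{}{}
	return conn, nil
}

// CloseAll tears down every tracked connection, forcing clients to redial
// (and re-resolve the load balancer) on their next request.
func (d *trackingDialer) CloseAll() {
	d.mu.Lock()
	defer d.mu.Unlock()
	for conn := range d.conns {
		conn.Close()
		delete(d.conns, conn)
	}
}

func main() {
	d := &trackingDialer{dialer: net.Dialer{Timeout: 30 * time.Second}}
	// Plug d.DialContext into the client transport, and call d.CloseAll()
	// when the heartbeat has failed several times in a row.
	_ = d
}
```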

@liggitt (Member) commented May 7, 2018

WIP in #63492

@recollir commented May 11, 2018

I am wondering if this would also apply to kube-proxy, as it also maintains a “connection” to the API server and would suffer from the same issue(?).

@redbaron (Contributor) commented May 11, 2018

Ideally, the fix should be in client-go.

@2rs2ts (Contributor) commented May 12, 2018

@recollir yes, I'm pretty sure you're right. This issue should be rescoped to include kube-proxy IMO

@liggitt (Member) commented May 12, 2018

One issue at a time :)

Persistent kubelet heartbeat failure results in all workloads being evicted; kube-proxy network issues are disruptive for some workloads, but not necessarily all.

kube-proxy (and general client-go support) would need a different mechanism, since those components do not heartbeat with the API the way the kubelet does. I'd recommend spawning a separate issue for kube-proxy handling of this condition.

@obeattie commented May 12, 2018

Since this issue has been re-opened, would there be any value in me re-opening my PR for this commit? Monzo has been running this patch in production since last July and it has eliminated this problem entirely, for all uses of client-go.

@liggitt (Member) commented May 12, 2018

Since this issue has been re-opened, would there be any value in me re-opening my PR for this commit? Monzo has been running this patch in production since last July and it has eliminated this problem entirely, for all uses of client-go.

I still have several reservations about that fix:

  • it relies on behavior that appears to be undefined (it changes properties on the result of net.TCPConn#File(), documented as "The returned os.File's file descriptor is different from the connection's. Attempting to change properties of the original using this duplicate may or may not have the desired effect.")
  • calling net.TCPConn#File() has implications I'm unsure of: "On Unix systems this will cause the SetDeadline methods to stop working."
  • it appears to only trigger closing in response to unacknowledged outgoing data... that doesn't seem like it would help clients (like kube-proxy) with long-running receive-only watch stream connections

That said, if @kubernetes/sig-api-machinery-pr-reviews and/or @kubernetes/sig-network-pr-reviews feel strongly that this is the correct direction to pursue, that would be helpful to hear.

@redbaron (Contributor) commented May 13, 2018

A few notes on these very valid concerns:

  • https://golang.org/pkg/net/#TCPConn.File returns a dup'ed file descriptor, which AFAIK shares all underlying structures in the kernel except for the entry in the file descriptor table, so either can be used with the same results. The program should be aware not to use them simultaneously, though, for exactly the same reasons.
  • Today the returned file descriptor is set to blocking mode. This can probably be mitigated by setting it back to non-blocking mode. In Go 1.11 the returned fd will be in the same blocking/non-blocking mode as it was before the .File() call: golang/go#24942
  • Maybe it will not help simple watchers. I am not familiar with the internals of Informers, but I was under the impression that they are not only watching but also periodically resyncing state; these resyncs would trigger outgoing data transfer, which would then be detected.

liggitt pushed a commit to liggitt/kubernetes that referenced this issue May 15, 2018

Kubernetes Submit Queue
Merge pull request kubernetes#63492 from liggitt/node-heartbeat-close-connections

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

track/close kubelet->API connections on heartbeat failure

xref kubernetes#48638
xref kubernetes-incubator/kube-aws#598

We're already typically tracking kubelet -> API connections and have the ability to force-close them as part of client cert rotation. If we do that tracking unconditionally, we gain the ability to also force-close connections on heartbeat failure. It's a big hammer (it means reestablishing pod watches, etc.), but so is having all your pods evicted because you didn't heartbeat.

This intentionally does minimal refactoring/extraction of the cert connection-tracking transport, in case we want to backport this.

* The first commit unconditionally sets up the connection-tracking dialer and moves all the cert management logic inside an if-block that gets skipped if no certificate manager is provided (view with whitespace ignored to see what actually changed)
* The second commit plumbs the connection-closing function into the heartbeat loop and calls it on repeated failures

Follow-ups:
* Consider backporting this to 1.10, 1.9, 1.8
* Refactor the connection-managing dialer so it is not so tightly bound to the client certificate management

/sig node
/sig api-machinery

```release-note
kubelet: fix hangs in updating Node status after network interruptions/changes between the kubelet and API server
```
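A sketch of what "calls it on repeated failures" could look like in the heartbeat loop; the threshold, names, and shortened interval are illustrative, not the actual kubelet code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// syncNodeStatus stands in for the kubelet's node status update; it always
// fails here so the example exercises the connection-closing path.
func syncNodeStatus() error {
	return errors.New("apiserver unreachable")
}

func main() {
	const failureThreshold = 3 // illustrative

	closeAllConns := func() {
		fmt.Println("closing all apiserver connections; the next request will redial")
	}

	failures := 0
	for i := 0; i < 5; i++ { // the kubelet would loop on its heartbeat ticker
		if err := syncNodeStatus(); err != nil {
			failures++
			if failures >= failureThreshold {
				// Heartbeats have failed long enough that the underlying TCP
				// connection may be dead: drop everything and redial.
				closeAllConns()
				failures = 0
			}
		} else {
			failures = 0
		}
		time.Sleep(10 * time.Millisecond) // shortened stand-in for the heartbeat interval
	}
}
```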
@obeattie commented May 16, 2018

Indeed: as far as I understand, the behaviour is not undefined, it's just defined in Linux rather than in Go. I think the Go docs could be clearer on this. Here's the relevant section from dup(2):

After a successful return from one of these system calls, the old and new file descriptors may be used interchangeably. They refer to the same open file description (see open(2)) and thus share file offset and file status flags; for example, if the file offset is modified by using lseek(2) on one of the descriptors, the offset is also changed for the other.

The two descriptors do not share file descriptor flags (the close-on-exec flag).

My code doesn't modify flags after obtaining the fd; instead, its only use is in a call to setsockopt(2). The docs for that call are fairly clear that it modifies properties of the socket referred to by the descriptor, not the descriptor itself:

getsockopt() and setsockopt() manipulate options for the socket referred to by the file descriptor sockfd.

I agree that the original descriptor being set to blocking mode is annoying. Go's code is clear that this will not prevent anything from working, just that more OS threads may be required for I/O:

https://github.com/golang/go/blob/516f5ccf57560ed402cdae28a36e1dc9e81444c3/src/net/fd_unix.go#L313-L315

Given that a single kubelet (or any other use of client-go) establishes a small number of long-lived connections to the apiservers, and that this will be fixed in Go 1.11, I don't think this is a significant issue.

I am happy for this to be fixed in another way, but given we know that this works and does not require invasive changes to the apiserver to achieve, I think it is a reasonable solution. I have heard from several production users of Kubernetes that this has bitten them in the same way it bit us.
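For reference, a Linux-specific sketch of the setsockopt(2) technique under discussion, reaching the raw fd via SyscallConn rather than File(); this illustrates the approach rather than reproducing the Monzo patch:

```go
package main

import (
	"fmt"
	"net"
	"time"

	"golang.org/x/sys/unix"
)

// setTCPUserTimeout sets TCP_USER_TIMEOUT so the kernel aborts the connection
// when transmitted data stays unacknowledged for the given duration, instead
// of retransmitting for ~15 minutes under the default tcp_retries2 setting.
func setTCPUserTimeout(conn *net.TCPConn, d time.Duration) error {
	raw, err := conn.SyscallConn()
	if err != nil {
		return err
	}
	var sockErr error
	if err := raw.Control(func(fd uintptr) {
		sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP,
			unix.TCP_USER_TIMEOUT, int(d.Milliseconds()))
	}); err != nil {
		return err
	}
	return sockErr
}

func main() {
	conn, err := net.Dial("tcp", "example.com:443") // illustrative endpoint
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	if err := setTCPUserTimeout(conn.(*net.TCPConn), 30*time.Second); err != nil {
		fmt.Println("setsockopt:", err)
	}
}
```

As noted in the reservations above, this option only fires when there is unacknowledged outgoing data, so receive-only watch streams would need something else on top of it.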

@liggitt (Member) commented Jun 19, 2018

This issue is resolved for the kubelet in #63492.

#65012 is open for the broader client-go issue; I hoisted the last few comments over to that issue.

/close

@workhardcc commented Oct 11, 2018

Do the scheduler & controller → api-server connections have the same issue?

@redbaron (Contributor) commented Oct 11, 2018

@workhardcc, yes, both use client-go, which AFAIK remains unfixed.

@workhardcc commented Oct 24, 2018

@redbaron Have the scheduler & controller fixed this problem? If not, the kubelet can reconnect to the api-server, but the scheduler & controller connections to the api-server would still fail - that is also a problem.

@corest commented Mar 19, 2019

I'm experiencing this again with k8s 1.13.4, on Azure only (filtered with kubelet_node_status.go):

I0319 17:14:25.579175   68746 kubelet_node_status.go:478] Setting node status at position 6
I0319 17:29:29.987934   68746 kubelet_node_status.go:478] Setting node status at position 7

For 15 minutes the node doesn't update its status. There are no issues at the network level, and the node is actually fully operational. Adjusting --node-monitor-grace-period= on the controller-manager helps, but that is not a solution.
