Low nf_conntrack_tcp_timeout_close_wait default causes intermittent ECONNREFUSED on GKE #32551
Comments
@kubernetes/sig-network
Consider for a 1.4.x patch.
@porridge nice sleuthing. Is it normal for clients to have so much delay in calling close?
My completely unconfirmed gut feeling is that this happens when:
Moving out of v1.4 as this is a P2.
Asking this to be P1 for GKE given how many customers are hitting it, though it's not GKE specific.
@matchstick for prio shuffle shuffle
Two potential solutions I could think of: bump nf_conntrack_tcp_timeout_close_wait, or handle this traffic explicitly in iptables. But bumping the timeout alone only mitigates the problem. The downside of the iptables approach is that it requires adding another chain on POSTROUTING, although the rules in that chain would be limited. Thoughts?
A little from column A and a little from column B? I think we can tune the close_wait timeout to 60 minutes - it will only affect sockets in CLOSE_WAIT, which isn't a normal state to linger in (I think?!). I think the other thing you suggest is the real fix. We want to fix that ANYWAY, but we should probably think hard about how to fix it properly. I don't think a simple flag is enough - I'd love to see this go into the per-node configmaps...
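For concreteness, the tuning being discussed is a one-line sysctl on each node (3600 seconds = 60 minutes); how to roll it out (node image, kube-proxy, or per-node config) is the open question:

  # Keep conntrack entries for half-closed (CLOSE_WAIT) connections around for an
  # hour instead of the 60-second kernel default.
  sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_close_wait=3600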
Note that doing just the second option will not help with cases where the server is something other than the metadata server.

@freehan If I'm not mistaken, on the metadata server's side we're not talking about CLOSE_WAIT, we are talking about the FIN_WAIT_1 state; the state transitions are not symmetric and depend on who closes the connection first, and in our case the metadata server does the close first: https://drupal.star.bnl.gov/STAR/blog-entry/jeromel/2009/feb/18/tcp-parameters-linux-kernel

And here the metadata server is actually being helpful by setting it to 60 minutes. In the case of Linux, there does not even seem to be a timeout for this state!? http://serverfault.com/questions/7689/how-do-i-get-rid-of-sockets-in-fin-wait1-state

Note that net.ipv4.tcp_fin_timeout is something else.
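To keep the knobs straight, these are three separate settings on a Linux node; the defaults noted in the comments are the usual kernel defaults and worth verifying on your own kernel:

  # conntrack's timeout for the half-closed (CLOSE_WAIT) side of a tracked
  # connection; this is the 60-second default the issue is about.
  sysctl net.netfilter.nf_conntrack_tcp_timeout_close_wait

  # conntrack's timeout for the FIN_WAIT side (the peer that closed first);
  # typically 120 seconds.
  sysctl net.netfilter.nf_conntrack_tcp_timeout_fin_wait

  # The TCP stack's own FIN-WAIT-2 timer, unrelated to conntrack; typically 60 seconds.
  sysctl net.ipv4.tcp_fin_timeout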
Correct my understanding if it's off:
metadata server is in FIN_WAIT_1
metadata server is in FIN_WAIT_2, client is in CLOSE_WAIT. The client now needs to send a FIN by calling close() and entering LAST_ACK; this would put the metadata server in TIME_WAIT, which will naturally close on timeout. The problems are:
So we need to set the default nf_conntrack_tcp_timeout_close_wait to however long we expect clients to stall the close() invocation. I think 1m is a decent estimate? If you have a select/read, the select will return when the remote end closes, and the read will fail, right? Why keep it open any longer?
I think we simply don't know why the client waited several minutes to close their end, but it's valid TCP.
Yeah, it's valid to leave the socket in CLOSE_WAIT indefinitely. I think our defaults should guard against this, not encourage it, unless I'm wrong in assuming that most clients call close() pretty quickly.
Clearly not all, and we have seen long CLOSE_WAIT before.
I don't think there's a CLOSE_WAIT timeout. There is:
Increasing the netfilter timeout sounds totally fine, but we need to balance that with other conntrack tunables. In an ideal world we would count connection states.
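In that spirit, a rough way to count conntrack entries per TCP state on a node could look like this (assumes the conntrack CLI is installed; for TCP entries the state is the fourth column of conntrack -L output):

  # Count conntrack entries by TCP state (ESTABLISHED, CLOSE_WAIT, TIME_WAIT, ...).
  sudo conntrack -L -p tcp 2>/dev/null | awk '{print $4}' | sort | uniq -c | sort -rn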
Assigning to @bowei. In the short term, adding a kube-proxy parameter for nf_conntrack_tcp_timeout_close_wait.
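For anyone landing here later: the short-term fix took the shape of a kube-proxy flag; the flag name below is taken from later kube-proxy releases and may differ by version:

  # Added to the existing kube-proxy command line on each node; kube-proxy writes
  # the value into net.netfilter.nf_conntrack_tcp_timeout_close_wait at startup.
  --conntrack-tcp-timeout-close-wait=1h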
Some notes on reproducing the behavior, as the Linux NAT logic seems to be conditional on the flags present in the packet itself.
On the node (10.240.0.4), set the CLOSE_WAIT timeout to something small:
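Something along these lines; the 10-second value is arbitrary, just short enough to make the expiry easy to observe:

  # Shrink the conntrack CLOSE_WAIT timeout so the NAT entry expires quickly.
  sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_close_wait=10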
Connect to the metadata server from within the pod (10.180.3.24), and hold on to the socket:
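A bash-only sketch of that step, assuming bash with /dev/tcp support inside the pod (any client that holds the socket open works just as well):

  # Open a connection to the metadata server on fd 3 and send a single request.
  exec 3<>/dev/tcp/169.254.169.254/80
  printf 'GET / HTTP/1.1\r\nHost: metadata.google.internal\r\nMetadata-Flavor: Google\r\n\r\n' >&3

  # Read a bit of the response, then deliberately keep the socket open so this
  # side sits in CLOSE_WAIT once the server closes its end.
  head -c 512 <&3
  sleep 300

  # Only now close our side.
  exec 3>&-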
Wait for the conntrack entry to expire (it will be in CLOSE_WAIT). Now if we close the socket, we will see that the last packet (it happens to be RST in this case) will NOT be NAT'd:
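To watch both sides of this on the node, something along these lines works (the interface name is an assumption; adjust the filters to your addresses):

  # Stream conntrack events for the pod's connection to the metadata server, so you
  # can see the entry reach CLOSE_WAIT and then get destroyed when the timeout fires.
  sudo conntrack -E -p tcp --orig-dst 169.254.169.254

  # In another terminal, capture the traffic to see whether the final FIN/RST is
  # still source-NATed to the node's IP or leaves with the pod's own IP.
  sudo tcpdump -ni eth0 host 169.254.169.254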
TCPDUMP output for the interaction
Subsequent run of
However, if we send data instead of closing the socket:
Tcpdump output:
(Not) reproducing in our cluster
When trying to recreate the situation inside our cluster, I observed that when a SYN is sent for a stale 5-tuple between our GCI machines (i.e. an initial SYN from machine A to B, while B is holding a connection from A that does not exist anymore), the response is not an RST but an ACK from B->A, which results in an RST A->B. The subsequent SYN retry from A->B succeeds.
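A quick way to check whether a machine is still holding such half-open sockets toward peers that have long forgotten them (filter further by peer address as needed):

  # On machine B: list TCP sockets stuck in half-closed states.
  ss -tan | grep -E 'FIN-WAIT|CLOSE-WAIT'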
Guys, if I am out of order please nuke this comment. We are running:
We have a microservices architecture that allows third-party applications to request telemetry data (we are streaming 1200ms/s via a service that uses 0MQ http://zeromq.org/ as the underlying transport layer; this data is then written to a Cassandra cluster). The stream is also "published", which allows literally thousands of clients to listen to their data stream, potentially creating thousands of connections and triggering the problem above. We also expose a REST web service to request data in case a client lost its connection and wants to "download" the missing data. What we are seeing is that all services run perfectly (multiple replicas, pods exposed via Kubernetes services) for up to 24 hours, and then all connections to the services are dropped. From digging through the logs (I am sure we have not covered all the logs; we are still novices), what we mainly see is client applications getting a "Connection time-out" when trying to retrieve data from the REST web services.
@mtbbiker that sounds like a different problem -- if you can open a separate issue from this one that would be great. This issue is about CLOSE_WAIT, not a timeout in ESTABLISHED state. Also, neither of the timeouts should be hit if you have active traffic going over the connections.
@bowei 100% Thanks I'll open a new Issue |
Automatic merge from submit-queue
Add test image for networking related tests
This test image is to support the e2e test for kubernetes#32551

Automatic merge from submit-queue
Adds TCPCloseWaitTimeout option to kube-proxy for sysctl nf_conntrack_tcp_timeout_close_wait
Fixes issue #32551
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.5", GitCommit:"b0deb2eb8f4037421077f77cb163dbb4c0a2a9f5", GitTreeState:"clean", BuildDate:"2016-08-11T20:29:08Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.5", GitCommit:"b0deb2eb8f4037421077f77cb163dbb4c0a2a9f5", GitTreeState:"clean", BuildDate:"2016-08-11T20:21:58Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}

Environment:
Kernel (e.g. uname -a): 3.16.0-4-amd64

What happened:
We're seeing plenty of "connection refused" errors when connecting to the GCE metadata server from within pods in our GKE cluster. These connections are mostly from OAuth2 client libraries (in Go and Python) trying to fetch an access token when connecting to any GCP service. However, as explained below, this can happen to connections to other services as well.
Example (this one from Cloud SQL proxy):
What you expected to happen:
No intermittent failures.
How to reproduce it (as minimally and precisely as possible):
Let connections sit in CLOSE_WAIT until they expire from node conntrack tables.

Anything else do we need to know:
TL;DR: kernel's default nf_conntrack_tcp_timeout_close_wait of 60s is too low: http://marc.info/?l=netfilter-devel&m=117568928824030&w=2
I caught this with tcpdump on our GKE cluster. Here it is, with additional lines (the ones with 5-second resolution) from conntrack(1) outputs for the interesting 5-tuple.
Actors:
pod 10.192.2.35 starts a connection; it's NATed to node source port 35715
connection is established. HTTP request:
HTTP response:
another HTTP request:
and another HTTP response:
time passes, conntrack sees the connection as established:
after 30s of inactivity the metadata server closes its side of the connection
and the client acks that:
but does not close its side of the connection yet
time passes, and conntrack sees the connection as half-closed (a.k.a. close_wait):
after a minute, though (the nf_conntrack_tcp_timeout_close_wait sysctl knob's default), it loses patience and drops the entry from the conntrack table.
As a result, the connection is no longer NATed, and when the client wakes up 2.5 minutes later to close the connection, it shows up with its non-NATed source IP:
obviously the metadata server's TCP stack explains there is no connection for such a 5-tuple, with an RST bit:
So at this point the "TCP 10.241.0.49.35715 > 169.254.169.254.80" connection is still open in the metadata server's mind, since the FIN above did not change anything.
Therefore 59 minutes later, when the node happily maps a completely different connection (made from a different pod) to the same source port:
However, since the metadata server still remembers the old connection, and the sequence numbers and the state of the connection do not match, it rejects the connection attempt with an RST:
This is seen in userspace as "connection refused".
Interestingly, just ten minutes later there is yet another connection attempt, NATed to the same source port, and at this point apparently the metadata server no longer remembers the old connection, as it accepts the new one.
This suggests the CLOSE_WAIT lifetime on the metadata server is between 59 and 69 minutes, probably 60.