
Low nf_conntrack_tcp_timeout_close_wait default causes intermittent ECONNREFUSED on GKE #32551


Closed
porridge opened this issue Sep 13, 2016 · 23 comments
Assignees
Labels
area/kube-proxy area/kubelet priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/network Categorizes an issue or PR as relevant to SIG Network.
Milestone

Comments

@porridge
Member

Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.5", GitCommit:"b0deb2eb8f4037421077f77cb163dbb4c0a2a9f5", GitTreeState:"clean", BuildDate:"2016-08-11T20:29:08Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.5", GitCommit:"b0deb2eb8f4037421077f77cb163dbb4c0a2a9f5", GitTreeState:"clean", BuildDate:"2016-08-11T20:21:58Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release): PRETTY_NAME="Debian GNU/Linux 7 (wheezy)"
  • Kernel (e.g. uname -a): 3.16.0-4-amd64

What happened:

We're seeing plenty of "connection refused" errors when connecting to the GCE metadata server from within pods in our GKE cluster. These connections mostly come from OAuth2 client libraries (in Go and Python) trying to fetch an access token before connecting to some GCP service. However, as explained below, this can happen to connections to other services as well.

Example (this one from Cloud SQL proxy):

2016/09/08 06:37:10 couldn't connect to "[project]:us-central1:[db-name]": Post https://www.googleapis.com/sql/v1beta4/projects/[project]/instances/[db-name]/createEphemeral?alt=json: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: getsockopt: connection refused

What you expected to happen:

No intermittent failures.

How to reproduce it (as minimally and precisely as possible):

  1. Create plenty of TCP connections from pods to a single host:port. Let the connections be half-closed by the remote end; do not close them from the local side.
  2. Let these CLOSE_WAIT connections expire from the node's conntrack table.
  3. Create more such connections; eventually the node's netfilter will reuse a NAT source port that is still in use in the remote end's mind. Such connections are reset by the remote end, resulting in "connection refused" errors in userspace (see the sketch after this list).
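
For concreteness, here is a minimal sketch of these steps in Python, runnable from inside a pod; the target, counts and sleep duration are arbitrary, and "Connection: close" is just one easy way to make the server half-close first:

import socket
import time

TARGET = ("169.254.169.254", 80)   # any single host:port will do; the metadata server is used here
held = []

# Step 1: many connections the server half-closes, while we keep our side open.
for _ in range(200):
    s = socket.create_connection(TARGET)
    s.sendall(b"GET / HTTP/1.1\r\nHost: 169.254.169.254\r\nConnection: close\r\n\r\n")
    s.recv(4096)        # server responds and sends FIN; we never close(), so we sit in CLOSE_WAIT
    held.append(s)

# Step 2: wait past nf_conntrack_tcp_timeout_close_wait (60s by default) so the node's
# conntrack entries are dropped while the server still remembers the connections.
time.sleep(120)

# Step 3: eventually a new connection gets NATed onto a source port the server still
# considers busy and is reset, which userspace sees as "connection refused".
for _ in range(200):
    try:
        socket.create_connection(TARGET).close()
    except ConnectionRefusedError as exc:
        print("refused:", exc)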

Anything else we need to know:

TL;DR: the kernel's default nf_conntrack_tcp_timeout_close_wait of 60s is too low: http://marc.info/?l=netfilter-devel&m=117568928824030&w=2
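
To check what a given node is actually using, the knob can be read straight from /proc (the value is in seconds):

with open("/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_close_wait") as f:
    print(f.read().strip())   # "60" on an untuned node, i.e. the kernel default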

I caught this with tcpdump on our GKE cluster. Here it is, with additional lines (the ones with 5-second resolution) from conntrack(1) output for the interesting 5-tuple (a sketch of how such lines can be collected follows the actor list).
Actors:

  • 10.192.2.35 is a pod IP
  • 10.241.0.49 is the node IP
  • 169.254.169.254 is the GCE metadata server
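
The conntrack lines interleaved into the trace below could have been gathered with a loop along these lines; the exact conntrack(1) flags used are an assumption:

import subprocess
import time

while True:
    out = subprocess.run(
        ["conntrack", "-L", "-p", "tcp", "-d", "169.254.169.254", "--dport", "80"],
        capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "sport=35715" in line:          # the NATed source port of interest
            print(time.strftime("%H:%M:%S"), line)
    time.sleep(5)                          # 5-second resolution, as in the output below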

pod 10.192.2.35 starts a connection; it's NATed to node source port 35715:

09:13:55.069526 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [S], seq 1235205249, win 29200, options [mss 1460,sackOK,TS val 148646959 ecr 0,nop,wscale 7], length 0
09:13:55.069696 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [S.], seq 3311972373, ack 1235205250, win 65535, options [mss 1420,eol], length 0
09:13:55.069742 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [.], ack 1, win 29200, length 0

connection is established. HTTP request:

09:13:55.069982 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [P.], seq 1:216, ack 1, win 29200, length 215
09:13:55.070112 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [.], ack 216, win 65320, length 0
09:13:55.070127 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [.], ack 216, win 65535, length 0

HTTP response:

09:13:55.070411 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [P.], seq 1:510, ack 216, win 65535, length 509
09:13:55.070441 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [.], ack 510, win 30016, length 0

another HTTP request:

09:13:55.071634 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [P.], seq 216:464, ack 510, win 30016, length 248
09:13:55.071731 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [.], ack 464, win 65287, length 0
09:13:55.071763 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [.], ack 464, win 65535, length 0

and another HTTP response:

09:13:55.071936 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [P.], seq 510:955, ack 464, win 65535, length 445
09:13:55.108816 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [.], ack 955, win 31088, length 0

time passes, conntrack sees the connection as established:

09:13:59    tcp  6 86395 ESTABLISHED src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:04    tcp  6 86390 ESTABLISHED src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:09    tcp  6 86385 ESTABLISHED src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:14    tcp  6 86380 ESTABLISHED src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:19    tcp  6 86375 ESTABLISHED src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:24    tcp  6 86370 ESTABLISHED src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1

after 30s of inactivity the metadata server closes its side of the connection:

09:14:25.073869 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [F.], seq 955, ack 464, win 65535, length 0

and client acks that:

09:14:25.112831 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [.], ack 956, win 31088, length 0

but does not close its side of the connection yet.

time passes, and conntrack sees the connection as half-closed (a.k.a. close_wait):

09:14:29    tcp  6 55 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:34    tcp  6 50 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:40    tcp  6 44 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:45    tcp  6 39 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:50    tcp  6 34 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:14:55    tcp  6 29 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:15:00    tcp  6 24 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:15:05    tcp  6 19 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:15:10    tcp  6 14 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:15:15    tcp  6 9 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1
09:15:21    tcp  6 4 CLOSE_WAIT src=10.192.2.35 dst=169.254.169.254 sport=35715 dport=80 src=169.254.169.254 dst=10.241.0.49 sport=80 dport=35715 [ASSURED] mark=0 use=1

after a minute, though (the default of the nf_conntrack_tcp_timeout_close_wait sysctl knob), conntrack loses patience and drops the entry from its table.
As a result, the connection is no longer NATed, and when the client wakes up 2.5 minutes later to close the connection, its FIN shows up with the non-NATed source IP:

09:17:00.019964 IP 10.192.2.35.35715 > 169.254.169.254.80: Flags [F.], seq 1235205713, ack 3311973329, win 31088, length 0

obviously the metadata server's TCP stack replies that there is no connection for such a 5-tuple, with an RST:

09:17:00.020053 IP 169.254.169.254.80 > 10.192.2.35.35715: Flags [R], seq 3311973329, win 65535, length 0

So at this point the "TCP 10.241.0.49.35715 > 169.254.169.254.80" connection is still open in the metadata server's mind, since the FIN above did not change anything.

Therefore, 59 minutes later, the node happily maps a completely different connection (made from a different pod) to the same source port:

10:14:14.639439 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [S], seq 301038175, win 29200, options [mss 1460,sackOK,TS val 149551851 ecr 0,nop,wscale 7], length 0

However, since the metadata server still remembers the old connection, and the sequence numbers and state of the connection do not match, it rejects the connection attempt with an RST:

10:14:14.639520 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [R.], seq 982994923, ack 3360800223, win 65535, length 0

This is seen in userspace as "connection refused".

Interestingly, just ten minutes later there is yet another connection attempt, NATed to the same source port, and at this point the metadata server apparently no longer remembers the old connection, as it accepts the new one.

This suggests the CLOSE_WAIT lifetime on the metadata server is between 59 and 69 minutes, probably 60.

10:24:18.059755 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [S], seq 3149308939, win 29200, options [mss 1460,sackOK,TS val 149702706 ecr 0,nop,wscale 7], length 0
10:24:18.060016 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [S.], seq 146596554, ack 3149308940, win 65535, options [mss 1420,eol], length 0
10:24:18.060075 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [.], ack 1, win 29200, length 0
10:24:18.060180 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [P.], seq 1:193, ack 1, win 29200, length 192
10:24:18.060375 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [.], ack 193, win 65343, length 0
10:24:18.060392 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [.], ack 193, win 65535, length 0
10:24:18.060661 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [P.], seq 1:271, ack 193, win 65535, length 270
10:24:18.060724 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [.], ack 271, win 30016, length 0
10:24:18.061528 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [F.], seq 193, ack 271, win 30016, length 0
10:24:18.061647 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [.], ack 194, win 65535, length 0
10:24:18.061674 IP 169.254.169.254.80 > 10.241.0.49.35715: Flags [F.], seq 271, ack 194, win 65535, length 0
10:24:18.061695 IP 10.241.0.49.35715 > 169.254.169.254.80: Flags [.], ack 272, win 30016, length 0
@thockin thockin added sig/network Categorizes an issue or PR as relevant to SIG Network. area/kube-proxy area/kubelet labels Sep 13, 2016
@thockin thockin removed the area/dns label Sep 13, 2016
@thockin
Member

thockin commented Sep 13, 2016

@kubernetes/sig-network

@thockin thockin added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Sep 13, 2016
@thockin thockin added this to the v1.4 milestone Sep 13, 2016
@thockin
Member

thockin commented Sep 13, 2016

Consider for a 1.4.x patch

@ravilr
Contributor

ravilr commented Sep 13, 2016

@porridge nice sleuthing. Is it normal for clients to delay calling close() for so long?

@porridge
Member Author

My completely unconfirmed gut feeling is that this happens when:

  • our code makes some long-running GCE operations,
  • for the whole duration of such an operation, it keeps a GCP client connection
    object around (as it's needed to poll operation status),
  • this GCP client connection object probably keeps a reference to the
    oauth2client helper plugin object (which retrieves an auth token from the metadata
    server as needed),
  • this helper plugin object probably keeps the metadata server connection
    open until it's garbage collected (a hypothetical sketch of this pattern follows).
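
Purely as an illustration of this suspicion, a client shaped like the above might look as follows; the names are made up and this is not the actual oauth2client code:

import http.client

class MetadataTokenFetcher:
    def __init__(self):
        # persistent connection to the metadata server, held for the object's lifetime
        self.conn = http.client.HTTPConnection("169.254.169.254", 80)

    def fetch_token(self):
        self.conn.request("GET",
                          "/computeMetadata/v1/instance/service-accounts/default/token",
                          headers={"Metadata-Flavor": "Google"})
        return self.conn.getresponse().read()

    def __del__(self):
        self.conn.close()  # the client's FIN only goes out here, at garbage-collection time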

@goltermann goltermann modified the milestones: v1.5, v1.4 Sep 14, 2016
@goltermann
Contributor

Moving out of v1.4 as this is a P2.

@alex-mohr alex-mohr added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Sep 15, 2016
@alex-mohr
Contributor

Asking this to be P1 for GKE given how many customers are hitting it, though it's not GKE specific.

@thockin
Member

thockin commented Sep 15, 2016

@matchstick for prio shuffle shuffle

@freehan
Contributor

freehan commented Sep 26, 2016

Two potential solutions I can think of:

  • We bump nf_conntrack_tcp_timeout_close_wait to match the metadata server, or we let the metadata server tune down its CLOSE_WAIT timeout. Not sure why they set it to 60 minutes in the first place???

But bumping nf_conntrack_tcp_timeout_close_wait seems risky, since this will make it easier to run out of conntrack entries. https://lists.netfilter.org/pipermail/netfilter-devel/2007-April/027510.html

  • Currently we SNAT all outbound traffic: every connection with a destination NOT in nonMasqueradeCIDR is SNATed, and 169.254.169.254 happens to fall into this category. We could add an excludeMasqueradeCIDRs parameter to kubelet and exclude traffic to 169.254.169.1/24 from SNAT, or turn nonMasqueradeCIDR into nonMasqueradeCIDRs to accept multiple CIDRs.

The downside of this approach is that it requires adding another chain on POSTROUTING, although the rules in that chain would be limited: either 1 rule excluding 169.254.169.1/24 plus 1 rule SNATing the remaining outbound traffic, or 1 rule per nonMasqueradeCIDR plus 1 rule SNATing the rest.
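
As a rough sketch of what that extra POSTROUTING chain could look like (the chain name, CIDRs and rule order are illustrative assumptions, not what kubelet/kube-proxy would actually program):

import subprocess

def ipt(rule):
    # install one nat-table rule; illustration only
    subprocess.check_call("iptables -t nat " + rule, shell=True)

ipt("-N KUBE-OUTBOUND-SNAT")                                  # hypothetical chain name
ipt("-A POSTROUTING -j KUBE-OUTBOUND-SNAT")
ipt("-A KUBE-OUTBOUND-SNAT -d 169.254.169.0/24 -j RETURN")    # 1 rule excluding the metadata range
ipt("-A KUBE-OUTBOUND-SNAT ! -d 10.0.0.0/8 -j MASQUERADE")    # 1 rule SNATing the rest; 10.0.0.0/8 stands in for nonMasqueradeCIDR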

Thoughts?

@thockin
Member

thockin commented Sep 27, 2016

A little from column A and a little from column B? I think we can tune the close_wait timeout to 60 minutes - it will only affect sockets in CLOSE_WAIT, which isn't a normal state to linger in (I think?!). I think the other thing you suggest is the real fix. We want to fix that ANYWAY, but we should probably think hard about how to fix it properly. I don't think a simple flag is enough - I'd love to see this go into the per-node configmaps...

@porridge
Member Author

porridge commented Sep 27, 2016 via email

@bprashanth
Contributor

Correct my understanding if it's off:

after 30s of inactivity metadata server closes its side of the connection

metadata server is in FIN_WAIT_1

and client acks that:

metadata server is in FIN_WAIT_2, client is in CLOSE_WAIT.

The client now needs to send a FIN by calling close() and entering LAST_ACK; this would put the metadata server in TIME_WAIT, which naturally closes on timeout.

The problems are:

  • both FIN_WAIT_2 and CLOSE_WAIT depend on app calling close()
  • netfilter drops close_wait to protect the system against socket leaks from badly behaved apps (attack vector)

So we need to set the default nf_conntrack_tcp_timeout_close_wait to however long we expect clients to stall the close() invocation. I think 1m is a decent estimate? If you have a select/read, the select will return when the remote end closes, and the read will fail, right? Why keep it open any longer?
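
For reference, the "well-behaved client" behaviour described here looks roughly like this in Python (the request itself is only illustrative):

import select
import socket

sock = socket.create_connection(("169.254.169.254", 80))
sock.sendall(b"GET / HTTP/1.1\r\nHost: 169.254.169.254\r\n\r\n")

while True:
    readable, _, _ = select.select([sock], [], [], 60.0)
    if not readable:
        continue                      # idle; keep waiting
    data = sock.recv(4096)
    if not data:                      # peer's FIN arrived: we are in CLOSE_WAIT now
        sock.close()                  # send our FIN promptly -> LAST_ACK, peer reaches TIME_WAIT
        break
    # ... consume response data ...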

@thockin
Member

thockin commented Oct 3, 2016

I think we simply don't know why the client waited several minutes to close their end, but it's valid TCP.

@bprashanth
Contributor

Yeah it's valid to leave the socket in CLOSE_WAIT indefinitely. I think our defaults should guard against this, not encourage it, unless I'm wrong in assuming that most clients call close() pretty quickly.

@thockin
Member

thockin commented Oct 3, 2016

Clearly not all, and we have seen long CLOSE_WAIT before. It seems easy to accidentally have a socket open for an HTTP API or something, not use it for a while, and not notice that it was closed until the next time you try to use it. I think this timeout should be fairly long, actually, since it's valid indefinitely. Or maybe there's a tunable that lets us force-send the last FIN after a timeout?


@bprashanth
Contributor

I don't think there's a CLOSE_WAIT timeout. There is:

  • net.ipv4.tcp_fin_timeout: A FIN_WAIT_2 timeout to force the initiator (server in this case) into TIME_WAIT
  • net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait: TIME_WAIT timeout, usually 2msl

Increasing the netfilter timeout sounds totally fine, but we need to balance that with other conntrack tunables like /proc/sys/net/ipv4/netfilter/ip_conntrack_max and /proc/sys/net/ipv4/netfilter/ip_conntrack_buckets (hashsize).

In an ideal world we would count connection states (maybe from /proc/net/ip_conntrack) and try to account them to pods, so when conntrack tables fill up we can point fingers.
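
A rough sketch of that kind of accounting, assuming the newer /proc/net/nf_conntrack format (older kernels expose /proc/net/ip_conntrack with a slightly different layout):

from collections import Counter

states = Counter()
with open("/proc/net/nf_conntrack") as f:
    for line in f:
        fields = line.split()
        # nf_conntrack tcp lines look like: "ipv4 2 tcp 6 <timeout> <STATE> src=... dst=..."
        if len(fields) > 5 and fields[2] == "tcp":
            states[fields[5]] += 1

for state, count in states.most_common():
    print(f"{count:8d}  {state}")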

@freehan
Contributor

freehan commented Oct 10, 2016

Assigning to @bowei

In the short term, add a kube-proxy parameter for net.netfilter.nf_conntrack_tcp_timeout_close_wait, defaulting to 3600 seconds.
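
On the node itself, such a parameter would boil down to roughly the following (illustrative only, not the actual kube-proxy change):

CLOSE_WAIT_SYSCTL = "/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_close_wait"

def set_close_wait_timeout(seconds=3600):
    # write the new timeout into the conntrack sysctl, replacing the 60s kernel default
    with open(CLOSE_WAIT_SYSCTL, "w") as f:
        f.write(str(seconds))

set_close_wait_timeout()   # the proposed 3600s default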

@freehan freehan assigned bowei and unassigned freehan Oct 10, 2016
@bowei
Member

bowei commented Oct 20, 2016

Some notes on reproducing the behavior, as the Linux NAT logic seems to be conditional on the flags present in the packet itself:

On the node (10.240.0.4), set the CLOSE_WAIT timeout to something small:

$  sudo sysctl net.netfilter.nf_conntrack_tcp_timeout_close_wait=10 

Connect to the metadata server from within the pod (10.180.3.24), and hold on to the socket:

$ cat > /tmp/test.py
import socket

local = ('10.180.3.24', 7777) # pod local ip
remote = ('169.254.169.254', 80) # metadata server

sock = socket.socket()
sock.bind(local)

sock.connect(remote)
sock.send('GET / HTTP/1.1\r\nConnection: close\r\n\r\n')

$ python -i /tmp/test.py

Wait for the conntrack entry to expire (it will be in CLOSE_WAIT).

Now if we close the socket, we will see that the last packet (it happens to be RST in this case) will NOT be NAT'd:

>>> sock.close()

TCPDUMP output for the interaction

18:35:32.316639 IP 10.240.0.4.7777 > 169.254.169.254.80: Flags [S], seq 887387349, win 28400, options [mss 1420,sackOK,TS val 81361042 ecr 0,nop,wscale 7], length 0
18:35:32.316859 IP 169.254.169.254.80 > 10.240.0.4.7777: Flags [S.], seq 2737060503, ack 887387350, win 65535, options [mss 1420,eol], length 0
18:35:32.316946 IP 10.240.0.4.7777 > 169.254.169.254.80: Flags [.], ack 1, win 28400, length 0
18:35:32.317259 IP 10.240.0.4.7777 > 169.254.169.254.80: Flags [P.], seq 1:38, ack 1, win 28400, length 37
...
## DOES NOT GET NAT'd
18:36:00.832768 IP 10.180.3.24.7777 > 169.254.169.254.80: Flags [R.], seq 887387387, ack 2737152397, win 65320, length 0
                   ^^^^^^^^^^^

A subsequent run of test.py will get an RST:

18:36:10.319033 IP 10.240.0.4.7777 > 169.254.169.254.80: Flags [S], seq 1481175026, win 28400, options [mss 1420,sackOK,TS val 81399045 ecr 0,nop,wscale 7], length 0
18:36:10.319215 IP 169.254.169.254.80 > 10.240.0.4.7777: Flags [R.], seq 1557906793, ack 593787678, win 65535, length 0

However, if we send data instead of close(), the entry is recreated and the packet does get NAT'd.

>>> sock.send('x')

Tcpdump output:

# Send data
18:54:03.918358 IP 10.240.0.4.7777 > 169.254.169.254.80: Flags [P.], seq 38:39, ack 91894, win 65320, length 1
                   ^^^^^^^^^^
18:54:03.918532 IP 169.254.169.254.80 > 10.240.0.4.7777: Flags [.], ack 39, win 65534, length 0
18:54:03.918572 IP 169.254.169.254.80 > 10.240.0.4.7777: Flags [.], ack 39, win 65535, length 0
# This is invalid HTTP, so the server drops us by sending a bunch of RSTs
18:54:03.918732 IP 10.240.0.4.7777 > 169.254.169.254.80: Flags [R.], seq 39, ack 91894, win 65320, length 0
18:54:03.918758 IP 169.254.169.254.80 > 10.240.0.4.7777: Flags [R], seq 953376519, win 65535, length 0
18:54:03.918787 IP 169.254.169.254.80 > 10.240.0.4.7777: Flags [R], seq 953376519, win 65535, length 0
18:54:03.918853 IP 169.254.169.254.80 > 10.240.0.4.7777: Flags [R], seq 953376519, win 65535, length 0

(Not) reproducing in our cluster

When trying to recreate the situation inside our cluster, I observed the following: when a SYN is sent for a stale 5-tuple between our GCI machines (i.e. an initial SYN from machine A to B, where B is holding a connection from A that no longer exists), the response is not an RST but an ACK from B->A, which results in an RST A->B. The subsequent SYN retry from A->B succeeds, so in this case the connect() from A does not return an error.

@mtbbiker

Guys, if I am out of order please nuke this comment.
We are new to running Kubernetes (on CoreOS) in our own data centre, and we are experiencing behaviour that seems consistent with this issue:

We are running:

Client Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.4", GitCommit:"3eed1e3be6848b877ff80a93da3785d9034d0a4f", GitTreeState:"clean"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.3+coreos.0", GitCommit:"d8dcde95b396ecd9a74b779cda9bc4d5b71e8550", GitTreeState:"clean"}

CoreOS stable (1068.8.0)

We have a microservices architecture that allows third-party applications to request telemetry data (we are streaming 1200ms/s via a service that uses 0MQ http://zeromq.org/ as the underlying transport layer; this data is then written to a Cassandra cluster). The stream is also "published", which allows literally thousands of clients to listen to their data stream, potentially creating thousands of connections and triggering the problem above. We also expose a REST web service to request data in case a client lost its connection and wants to "download" the missing data.

What we are seeing is that all services run perfectly (multiple replicas, pods exposed via Kubernetes services) for up to 24 hours, and then all connections to the services are dropped. What we could see by digging through the logs (I am sure we have not covered all the logs, we are still novices) is mainly the client applications getting a "Connection Time-out" when trying to retrieve data from the REST web services.
From http://kubernetes.io/docs/admin/kube-proxy/, if I understand the above correctly, the default conntrack-tcp-timeout-established duration is set at 24 hours. Is this just a coincidence with the failures we are experiencing, or is our failure due to the above posts?
If we leave Kubernetes, we are not trying any restarts of services, then it
Any pointers on how and where to look, to help figure out whether the above explains our problem, will be gladly appreciated.

@bowei
Member

bowei commented Oct 24, 2016

@mtbbiker that sounds like a different problem -- if you can open a separate issue from this one that would be great.

This issue is about CLOSE_WAIT, not a timeout in ESTABLISHED state. Also, neither of the timeouts should be hit if you have active traffic going over the connections.

@mtbbiker

@bowei 100% Thanks I'll open a new Issue

jessfraz pushed a commit to jessfraz/kubernetes that referenced this issue Oct 30, 2016
Automatic merge from submit-queue

Add test image for networking related tests

This test image is to support the e2e test for kubernetes#32551
k8s-github-robot pushed a commit that referenced this issue Nov 4, 2016
Automatic merge from submit-queue

Adds TCPCloseWaitTimeout option to kube-proxy for sysctl nf_conntrack…

Adds TCPCloseWaitTimeout option to kube-proxy for sysctl nf_conntrack_tcp_timeout_time_wait
 
Fixes issue #32551
@dims
Member

dims commented Nov 16, 2016

This needs to be triaged as a release-blocker or not for 1.5 @porridge @bowei @thockin

@bowei
Member

bowei commented Nov 16, 2016

@dims Unless you are talking about the issue filed by mtbbiker, this issue has been fixed by #35919

@thockin
Member

thockin commented Nov 16, 2016

@bowei can you tweak the subject of #35919 to be a better release note? Specifically the default value of that param?
