kube-proxy too many files open #5461
Comments
lsof of kube-proxy: |
P1 On Fri, Mar 13, 2015 at 2:34 PM, Brian Grant notifications@github.com
|
@thockin Done. |
@jmreicha Any chance you could try a more recent version of Kubernetes? |
@bgrant0607 that is what I am aiming for at this point. My cluster is basically dead. What version should I move to? |
Try Kubernetes v0.12.0 |
@quinton-hoole I'm going to shoot for v0.12.0, I will update once the new version is up and running. |
Just an update. Site traffic died off and I stopped getting the "too many open files" errors in the kube-proxy logs and I am not experiencing the slowness issues currently. I think I just have to figure out where to increase the limits and how to not break anything else in the process. |
Yes, it sounds like we're either leaking file handles by not closing connections when necessary, or you're simply seeing more concurrent connections than your open-file limit allows. Given the line count of your lsof dump ($ wc -l ~/Downloads/db8afb49.txt), it looks like you're simply hitting a 1024 file handle limit somewhere. I don't know that part of the code very well, but can poke around a bit. |
@dchen1107 What is the intended per-process file handle limit on the node, and where do we set that (if at all - it sounds like we might be hitting the 1024 default limit)? |
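One quick way to answer that on a node, assuming kube-proxy runs as a host process that pidof can find (if it runs inside a container the paths differ):

# Configured open-file limit for the running kube-proxy
PID=$(pidof kube-proxy)
grep "Max open files" /proc/$PID/limits

# Number of descriptors it currently has open
sudo ls /proc/$PID/fd | wc -l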
@quinton-hoole we don't set them at all today. Arguably we should for kube-proxy. I will re-purpose this issue for that (if you don't mind :)) and assign myself. @jmreicha can you provide the output of the following:
|
@timothysc won't that only fix systemd systems though? Since it does address the issue at hand, I'm fine closing :) |
Yes I'm okay with closing. |
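For systemd-managed nodes, the fix referenced above amounts to raising the unit's file-descriptor limit. A minimal sketch, assuming the unit is named kube-proxy.service and that 65536 is an acceptable value (both are assumptions, not the exact change that was merged):

# Drop-in that raises kube-proxy's open-file limit (unit name and value are illustrative)
sudo mkdir -p /etc/systemd/system/kube-proxy.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/kube-proxy.service.d/10-nofile.conf
[Service]
LimitNOFILE=65536
EOF
sudo systemctl daemon-reload
sudo systemctl restart kube-proxy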
I checked a few other services, but not all of them are respecting the 65k limit. Would it be worthwhile to update the getting started guides with the correct file limits, or make a note about the file limit issue somewhere? |
I'm working on a PR to have the kube-proxy monitor its limit and warn if it's too low. We also need to start documenting the bare requirements for each Kube component; I think there is an issue out for that. |
How sure are we that we don't leak FDs? An IRC user (paddyforan) running version 0.13 reported this problem. |
Background: services stopped responding; I was directed by @lavalamp to look into kube-proxy. I'm running v0.13.0 on GCE, using container-vm-v20150305. I'm currently pulling all the files in /var/log off the nodes before restarting them to (hopefully) get things back online, but there are several gigabytes of log files. |
I did test this some time ago, manually; probably should capture it as a test (ls -l /proc/...). On Fri, Apr 10, 2015 at 5:00 PM, Paddy notifications@github.com wrote:
|
|
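A rough version of that manual check, assuming node access and that kube-proxy is visible to pidof (hypothetical commands, not an existing test):

# Count kube-proxy's open descriptors and break them down by type;
# a socket-heavy, ever-growing count points at leaked connections
PID=$(pidof kube-proxy)
sudo ls /proc/$PID/fd | wc -l
sudo lsof -p $PID | awk '{print $5}' | sort | uniq -c | sort -rn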
Jeepers. I'll try to manually rig up a test and see if I can find the leak. On Fri, Apr 10, 2015 at 5:19 PM, Paddy notifications@github.com wrote:
|
I've still got the clusters up and am happy to run tests or grant SSH access if someone wants to poke around a bit with a live version of the problem. It's not the end of the world for me to just spin up a new cluster and update DNS, especially if it helps. [edit] restarting the cluster now, but still have ~12GB of log files, if anyone's interested. |
Interestingly I can not repro this against head.
What you're seeing here is the UDP timeout of 1 minute (which may be too long).
What you see here is that the endpoint exists, but hitting it 100 times ... Can you try these experiments on a freshly killed-and-restarted kube-proxy? On Fri, Apr 10, 2015 at 5:29 PM, Paddy notifications@github.com wrote:
|
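A sketch of those experiments, assuming shell access to the node, a restartable kube-proxy, and a placeholder service address; the 70-second sleep is only meant to outlast the 1 minute UDP timeout mentioned above:

# Restart kube-proxy and take a baseline descriptor count
sudo systemctl restart kube-proxy    # adjust if kube-proxy is not run via systemd
PID=$(pidof kube-proxy)
sudo ls /proc/$PID/fd | wc -l

# Hit the proxied endpoint 100 times (placeholder service IP/port)
for i in $(seq 1 100); do curl -s -o /dev/null http://10.0.0.10:80/; done
sudo ls /proc/$PID/fd | wc -l

# Wait out the UDP timeout and check whether the count falls back to the baseline
sleep 70
sudo ls /proc/$PID/fd | wc -l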
FWIW we don't see leaking fd's on the proxy, going from 0 to 100 pods per node and then back again. Depending on how the user starts it, they will need to up the ulimit; we only changed the systemd unit files... If folks start it another way, then we'll need to doc it. |
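For non-systemd setups, the equivalent is to raise the limit in whatever wrapper launches the proxy; a minimal sketch, with the binary path, flags, and limit value all placeholders:

# Raise the open-file limit for this shell, then launch kube-proxy under it
ulimit -n 65536
exec /opt/kubernetes/bin/kube-proxy --master=http://127.0.0.1:8080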
I've been trying to reproduce this on and off again for a while now, and haven't been able to. There is a non-zero chance that it was at the application level, not at the kubernetes level (e.g., the influxdb monitoring tool went berserk, or something.) |
I am experiencing leaked file handles with 0.17.1's kube-proxy: I turned up logging on kube-proxy to --v=4 and found a reproducible pattern. We are hitting an endpoint through an Apache reverse proxy, which is exposed via publicIP. With every HTTP request we see 2 file handles created, which are never returned. Correlating with the requests and fd openings are entries in the log (1 per request): There seems to be some throttling to this: the 5th request will neither increase the fd count permanently nor trigger "Mapped service". After you wait a while, the leaking starts again. The animation maybe clarifies what I'm talking about: |
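A sketch of how one might watch for that pattern, assuming shell access to the node, that requests go through the reverse proxy URL below (a placeholder), and that kube-proxy logs to journald (also an assumption):

# Issue requests through the reverse proxy and print kube-proxy's fd count after each one
PID=$(pidof kube-proxy)
for i in $(seq 1 10); do
  curl -s -o /dev/null http://reverse-proxy.example.com/app/
  echo "after request $i: $(sudo ls /proc/$PID/fd | wc -l) fds open"
done

# Compare against the per-request "Mapped service" entries in the proxy log
sudo journalctl -u kube-proxy | grep "Mapped service" | tail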
@jayunit100 ^ could you please verify? |
@timothysc @mkulke okay, interesting!
|
This looks like sessions being kept alive at @mkulke's rev-proxy or at @jayunit100's public IP balancer. I can not repro this directly.
|
For stronger evidence:
|
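One way to look for that kind of evidence, assuming shell access to the node where kube-proxy runs:

# Established TCP connections held by kube-proxy; long-lived entries to/from the
# reverse proxy or balancer would point at keepalive sessions pinning descriptors
PID=$(pidof kube-proxy)
sudo ss -tnp | grep "pid=$PID"

# Cross-check against the open socket descriptors themselves
sudo lsof -p $PID -a -i TCP -s TCP:ESTABLISHED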
Yup, shall we close? From what I can see:
At worst, I think there is some possibility that, in some cases, if you have a persistent service, and you keep polling it, and then you stop (while keeping the service around), maybe some open files persist? But it doesn't seem like a flagrant file descriptor leak is occurring inside kube-proxy. Of course, I have only been looking at this for a few hours, but that's my first guess. |
Hi there, please have a look at this one, it is related I guess: #8891. The leaked handles are not returned in our setup; I tried to meditate over the kube-proxy code, but I could not find out why the handles are not closed. It appears that connections between a keepalive'd Tomcat and kube-proxy are not closed, and they are not closed even on deletion of the Tomcat pod. |
@jayunit100 This issue is already closed. :) The only fix here would be to force-close sockets after a timeout, which seems sort of hostile since we are "just" a proxy. @mkulke can you repro my results without your extra reverse-proxy? |
@thockin: I'll try that later, but I suspect everything will be fine. The problems seem to be related to keepalive connections from the proxy to the backend. While those can be turned off, this is a pretty common scenario (vanilla Apache conf on RHEL7 + vanilla Tomcat 7), and users can eventually break/slow down a cluster by submitting misconfigured pods. |
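For reference, a minimal sketch of turning off HTTP keep-alive from Apache's mod_proxy to the backend; the conf path, location, and backend URL are placeholders, and SetEnv proxy-nokeepalive is the documented mod_proxy knob for closing the backend connection after each request:

# Add a proxy stanza that disables keep-alive toward the backend (RHEL7-style paths)
cat <<'EOF' | sudo tee /etc/httpd/conf.d/nokeepalive-proxy.conf
<Location "/app">
    SetEnv proxy-nokeepalive 1
    ProxyPass http://backend.example.com:8080/app
</Location>
EOF
sudo systemctl reload httpd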
ACK. We could close connections from the middle (the kube-proxy), but I am ... On Fri, Jun 1, 2015 at 10:16 AM, Magnus Kulke notifications@github.com wrote:
|
I am hitting an issue that is (presumably) affecting the performance of my Kubernetes cluster. I am running a service on this host that is proxying web traffic and notice that web requests that hit pods/containers on this host are incredibly slow. This seems to happen randomly to different hosts and I can temporarily resolve the issue by killing the sick host, but eventually this issue seems to crop back up on other hosts.
CoreOS - v561
Kubernetes - v0.8.1
I have checked the flannel logs and those all seem to be in order (not erroring at least), but I am not sure flannel can be ruled out as a possible problem yet.
Here is what I am seeing in the kube-proxy service logs: