memory leak in kubelet 1.12.5 #73587
/sig node
@kubernetes/sig-node-bugs
@Shnatsel: Reiterating the mentions to trigger a notification: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I think it might be a side effect of #71731, but I couldn't find any numbers in there. The comments mention an offline discussion; was this brought up there?
What happens (I only investigated it very briefly) is that kubelet creates a bunch of reflectors for every pod — secret mounts, configmap mounts, token mounts, etc. Each reflector registers a number of histograms, gauges, and so on, and even when the reflector is stopped and removed, the metrics are never removed. After a few days on a busy cluster you end up with millions of metrics bloating kubelet and everything else that uses reflectors.
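To make the pattern concrete, here is a minimal illustrative Go sketch (not the actual client-go code; the function name `newReflectorMetrics` is made up): every new reflector registers its own uniquely named collector and nothing ever unregisters it, so the default Prometheus registry only ever grows.

```go
// Illustrative sketch of the leak pattern, not the real client-go implementation.
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// newReflectorMetrics (hypothetical name) registers a gauge for one reflector.
// There is no corresponding prometheus.Unregister call when the reflector stops.
func newReflectorMetrics(name string) prometheus.Gauge {
	g := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "reflector_" + name + "_last_resource_version",
		Help: "Last resource version seen by this reflector.",
	})
	prometheus.MustRegister(g) // stays registered for the life of the process
	return g
}

func main() {
	// Each pod volume (secret, configmap, service account token) gets its own
	// reflector, and each reflector gets its own set of metrics.
	for i := 0; i < 5; i++ {
		_ = newReflectorMetrics(fmt.Sprintf("pod%d_secret", i))
	}
	mfs, _ := prometheus.DefaultGatherer.Gather()
	fmt.Println("registered metric families:", len(mfs)) // grows as pods churn
}
```

On a node where pods churn constantly, this is exactly the "millions of metrics after a few days" behaviour described above.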
This has been broken since v1.12.0, so it's probably unrelated to #71731.
Correct @aermakov-zalando, that PR is only in v1.14.0-alpha.2, v1.14.0-alpha.1, and master.
For everyone that finds this issue and needs a patch to disable the reflector metrics:
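A minimal sketch of one way to do that (not necessarily the exact patch that was posted), assuming the `MetricsProvider` interface exposed by `k8s.io/client-go/tools/cache` at the time: install a no-op provider so reflectors have nothing to register. If the kubelet installs its Prometheus-backed provider via a blank import (as it appears to in this release), a real patch would also need to drop that import, since the setter only honors the first caller.

```go
// Sketch: disable reflector metrics by installing a no-op MetricsProvider.
// Assumes the client-go tools/cache metrics interfaces of this era; check the
// vendored client-go before relying on the exact method set.
package main

import "k8s.io/client-go/tools/cache"

// noopMetric satisfies CounterMetric, SummaryMetric, and GaugeMetric while
// recording nothing.
type noopMetric struct{}

func (noopMetric) Inc()            {}
func (noopMetric) Observe(float64) {}
func (noopMetric) Set(float64)     {}

// noopProvider hands out no-op metrics for every reflector, so nothing is
// registered with Prometheus and nothing can accumulate.
type noopProvider struct{}

func (noopProvider) NewListsMetric(string) cache.CounterMetric             { return noopMetric{} }
func (noopProvider) NewListDurationMetric(string) cache.SummaryMetric      { return noopMetric{} }
func (noopProvider) NewItemsInListMetric(string) cache.SummaryMetric       { return noopMetric{} }
func (noopProvider) NewWatchesMetric(string) cache.CounterMetric           { return noopMetric{} }
func (noopProvider) NewShortWatchesMetric(string) cache.CounterMetric      { return noopMetric{} }
func (noopProvider) NewWatchDurationMetric(string) cache.SummaryMetric     { return noopMetric{} }
func (noopProvider) NewItemsInWatchMetric(string) cache.SummaryMetric      { return noopMetric{} }
func (noopProvider) NewLastResourceVersionMetric(string) cache.GaugeMetric { return noopMetric{} }

func main() {
	// First caller wins (the setter is guarded by a sync.Once), so this must
	// run before anything else installs a provider.
	cache.SetReflectorMetricsProvider(noopProvider{})
	// ... construct and run informers / the component as usual ...
}
```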
@szuecs Do you still have the same metrics after disabling the ReflectorMetricsProvider?
No, it drops the reflector metrics.
@wojtek-t is this indirectly caused by switching to watch-based managers?
Yeah - switching to watch resulted in more extensive use of reflectors. I think we don't really need those metrics, so if we could switch them off in kubelets, that should solve the problem.
#73624 has been sent out for review to fix that.
Actually, I realized that I don't fully understand the problem.
Wouldn't it be better to rewrite the reflector metrics so they're aggregated in a better way rather than relying on people not accidentally enabling them? Or at least put a huge warning on top saying "this will leak memory like crazy" just so the same situation doesn't repeat in other code using client-go?
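For example (a sketch of one possible aggregation, not a concrete proposal from this thread): a single shared CounterVec with a bounded label such as the resource type, instead of a uniquely named metric per reflector instance, keeps cardinality fixed no matter how many reflectors are created and destroyed.

```go
// Sketch of aggregated reflector metrics with bounded cardinality; the metric
// and label names here are illustrative, not actual client-go metrics.
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// One collector shared by every reflector: cardinality is bounded by the
// number of distinct resource types, not by the number of reflector instances.
var reflectorLists = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "reflector_lists_total",
		Help: "Total number of list operations performed by reflectors.",
	},
	[]string{"resource"},
)

func main() {
	prometheus.MustRegister(reflectorLists)

	// Many reflectors for many pods all report into the same series.
	for i := 0; i < 1000; i++ {
		reflectorLists.WithLabelValues("secrets").Inc()
		reflectorLists.WithLabelValues("configmaps").Inc()
	}

	mfs, _ := prometheus.DefaultGatherer.Gather()
	fmt.Println("metric families:", len(mfs)) // stays constant as reflectors churn
}
```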
@aermakov-zalando - whether the metrics are reasonable or not is a separate issue (and not sig-node related).
Our production configuration is this one: https://github.com/zalando-incubator/kubernetes-on-aws/tree/beta We run with an image called
I tried to figure out why these metrics are even there, and I found issues from the past showing leaks already in older versions. It sounds like we need a postmortem to make sure this doesn't happen again. I can't find any issue in kubernetes or client-go that could reasonably explain why these metrics were introduced in the first place. I guess I just didn't find it, and I hope someone can point out where the decision came from.
I am running Kubernetes v1.12.3 and I don't see reflector metrics being used by kubelet. FWIW, the cluster is created using kops. Any idea how reflector metrics get enabled?
@sjenning and I will also look into this and see what we can find.
I would also suggest not using imports for side effects, because they usually create hard-to-debug and non-obvious problems like this one.
@aermakov-zalando So you mean disabling the reflector metrics entirely?
No, I suggest changing the code so that the end users (kubelet, apiserver, etc.) have to explicitly enable the metrics by calling a function, instead of having this happen as a side effect of an import statement.
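A sketch of what that could look like (a hypothetical package and function, not an existing Kubernetes or client-go API): the provider package exposes a Register call that each binary invokes from main(), instead of installing the provider from an init() triggered by a blank import.

```go
// Hypothetical opt-in registration; the package name reflectormetrics and the
// Register function are made up and do not exist in client-go or Kubernetes.
package reflectormetrics

import "k8s.io/client-go/tools/cache"

// Register installs the given reflector metrics provider. Binaries that want
// the metrics call this explicitly from main(); a binary like the kubelet,
// which churns through many short-lived reflectors, would simply never call it.
func Register(p cache.MetricsProvider) {
	cache.SetReflectorMetricsProvider(p)
}
```

The call site is then explicit and greppable, so enabling the metrics becomes a deliberate decision rather than a side effect of a transitive import.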
LGTM. @wojtek-t @derekwaynecarr @yujuhong WDYT? We should probably do this for both kubelet and control plane services (apiserver and controller-manager).
I'm fine with that.
+1
+1
Why wouldn't we want to just actually disable this metric wholesale? Enabling this flag would basically be the same thing as saying enable a memory leak, no?
/assign @logicalhan
Han points out that, since the metric has a random suffix each time, it's not very useful for monitoring anyway.
/sig api-machinery
Referenced in commits ("…emory leak", ref: kubernetes/kubernetes#73587, kubernetes/kubernetes#74636): Origin-commit 01380498b02d6dee75e52d9ce54e9a5dffef24fb; Kubernetes-commits f77a2c16c80223249ead526ca12caa6962117888 and fd85bbcb7e0922b8889c85fad1f5f2d4ca7a3fa7.
What happened:
After upgrading to Kubernetes 1.12.5 we observe failing nodes, caused by kubelet eating up all the memory after some time.
I use the image `k8s.gcr.io/hyperkube:v1.12.5` to run kubelet on 102 clusters, and for about a week we have seen some nodes leaking memory, caused by kubelet. I investigated some of these kubelets with strace and pprof.
Within 3 seconds of running strace I saw >= 50 openat() calls to the same file from the same thread ID (pid) of kubelet:
If I run pprof against kubelet, it shows that client_go metrics and compression are taking up most of the compute time.
Memory profile png:
![mem_profile001](https://user-images.githubusercontent.com/50872/52052850-043d8080-2558-11e9-8c2f-93db2eb56850.png)
The reflector metrics seem to be the problem:
What you expected to happen:
I expect kubelet not to need this much memory.
How to reproduce it (as minimally and precisely as possible):
I don't know
Anything else we need to know?:
One of the affected clusters has only 120 pods, 3 of which are in CrashLoopBackOff; one pod has been in that state for 6 days on the affected node that was investigated.
Environment:
- Kubernetes version (use `kubectl version`): v1.12.5
- Kernel (e.g. `uname -a`): Linux ip-172-31-10-50.eu-central-1.compute.internal 4.14.63-coreos #1 SMP Wed Aug 15 22:26:16 UTC 2018 x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz GenuineIntel GNU/Linux