Potential resource leak #759
Ever since July 10th, we have run into problems on our cluster. Almost immediately, we ran out of inotify watches, which we fixed by just increasing the limit -- see #717.
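For reference, the workaround was roughly along these lines (a sketch; the exact sysctls and values are illustrative rather than a copy of what #717 settled on):

```bash
# Raise the inotify limits on the node (values are illustrative).
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512

# Persist across reboots.
cat <<EOF | sudo tee /etc/sysctl.d/99-inotify.conf
fs.inotify.max_user_watches=524288
fs.inotify.max_user_instances=512
EOF
sudo sysctl --system
```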
Since then, the CPU usage of our nodes has slowly increased -- like a memory leak, but with CPU. The usage can be attributed to the kubelet process, and a profile indicates most of its time is spent in cAdvisor.
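If anyone wants to reproduce the profile, this is roughly how it can be captured (a sketch; it assumes the kubelet's debugging handlers are enabled, which is the default, and that `go` is available):

```bash
# Capture a 30s CPU profile from the kubelet through the API server's node proxy.
NODE=<node-name>
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/debug/pprof/profile?seconds=30" > kubelet.pprof

# Show the hottest functions; in our case most of the samples were under cAdvisor.
go tool pprof -top kubelet.pprof
```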
Here is a graph of the minimum CPU usage of our nodes over a 4 hour interval. During a 4 hour window it is almost certain we will have no test pods scheduled on a given node, so this essentially shows the base CPU overhead of the nodes.
The big drops are times when we got rid of nodes. You can see the problems seem to start right around when we started using kind (note: it is possible this is a coincidence; my only evidence that it is related to kind is the timing).
Within two weeks we see some nodes using 90% of CPU just idling.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
I did run into that while investigating, but it seems the conclusion was
By the way, if it is relevant, we never run `kind delete cluster`.
but ... then the old cluster containers keep running forever ... is it possible to add the delete at the end of your runs?
But we run it in a pod; once the pod is removed, shouldn't everything be cleaned up? Or maybe it is because we have
in our pod spec, so it never gets properly cleaned up.
I'll try adding `kind delete cluster` to the end.
Does Kubernetes prow do this in its tests that use kind? I am worried that if the test crashes partway through, we won't properly clean up.
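Something like the following is what I have in mind, so the cluster gets deleted even when the test run fails partway through (a sketch; the cluster name and test command are placeholders):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical cluster name derived from the CI build ID.
CLUSTER_NAME="ci-${BUILD_ID:-local}"

cleanup() {
  # Best-effort teardown; don't mask the test's exit code.
  kind delete cluster --name "${CLUSTER_NAME}" || true
}
trap cleanup EXIT

kind create cluster --name "${CLUSTER_NAME}"

# ... run the tests against the cluster here ...
```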
@BenTheElder is the authority on this, but the tests that run in CI execute the