Possible goroutines leak from watch events #533
Comments
Could be related to #293, which was never located but was not happening in all circumstances. Good investigation so far!
Yes, this makes sense. If that's okay with you, I will test your proposed fix #534 on a cluster to see if it works and post back the findings/graphs.
#537 has fixed this issue in our test environment.
Describe the bug
Memory consumption of the kuberhealthy deployment increases steadily when running inside a cluster performing several checks. We deployed kuberhealthy together with a daemonset, dns-name-resolution, and network-connection check into multiple clusters. We set a memory limit of 50Mi via Kubernetes resources. A pattern emerged: the main kuberhealthy server deployment was OOM killed after about two days. The following graphs show the `container_memory_usage_bytes` Prometheus metric coming from cAdvisor. On the left are clusters with the limit set to 50Mi; on the right are clusters where the memory limit was removed. From that I assume the memory consumption would rise indefinitely.
After adding pprof to kuberhealthy and looking at the goroutine graph, you can see that an increasing number of goroutines are put into a waiting state at `runtime.gopark`, as they seem to block/wait at the `receive` method of apimachinery's `StreamWatcher`. Here is the corresponding graph:
Steps To Reproduce
Expected behavior
That kuberhealthy does not continuously accumulate goroutines during its runtime.
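As an aside, this expectation could be guarded by a regression test. A sketch, assuming the uber-go/goleak library (which, as far as I know, the project does not currently use), with a hypothetical test name:

```go
package kuberhealthy_test

import (
	"testing"

	"go.uber.org/goleak"
)

// TestWatchDoesNotLeak is hypothetical: it would exercise the watch/notify
// path and fail if any goroutines are still running when the test returns.
func TestWatchDoesNotLeak(t *testing.T) {
	defer goleak.VerifyNone(t)

	// ... start and stop the watcher under test here ...
}
```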
Versions
OS: 5.3.0-46-generic #38-Ubuntu SMP Fri Mar 27 17:37:05 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes: v1.17.6
Kuberhealthy: v2.2.0
Additional context
I tried to pinpoint where the problem occurs, but due to my unfamiliarity with the code I have not yet succeeded. Perhaps someone with more knowledge might chime in.
My first guess was the following lines: https://github.com/Comcast/kuberhealthy/blob/v2.2.0/cmd/kuberhealthy/kuberhealthy.go#L162
The `watchForKHCheckChanges` function creates a watcher (https://github.com/Comcast/kuberhealthy/blob/v2.2.0/cmd/kuberhealthy/kuberhealthy.go#L282) which internally writes `Event`s to an unbuffered channel. Those events are processed and new signals are written to another unbuffered channel (https://github.com/Comcast/kuberhealthy/blob/v2.2.0/cmd/kuberhealthy/kuberhealthy.go#L297). But then `notifyChanLimiter` and `monitorExternalChecks` work with a buffered channel that is processed only every 10 seconds (https://github.com/Comcast/kuberhealthy/blob/v2.2.0/cmd/kuberhealthy/util.go#L47). So it might be that the Watch writes events into the unbuffered channel faster than they are consumed, and goroutines pile up processing them (a minimal reproducer of this pattern is sketched below). But this is all just a theory and probably totally wrong :).