Possible goroutines leak from watch events #533

Closed
johannesfrey opened this issue Jun 17, 2020 · 5 comments · Fixed by #534 or #537

@johannesfrey

johannesfrey commented Jun 17, 2020

Describe the bug
Memory consumption of the kuberhealthy deployment increases steadily when running inside a cluster performing several checks. We deployed kuberhealthy together with a daemonset, dns-name-resolution, and network-connection check into multiple clusters and set a memory limit of 50Mi via Kubernetes resources. A pattern emerged where the main kuberhealthy server deployment was OOM killed after about 2 days. The following shows the graphs for the container_memory_usage_bytes Prometheus metric coming from cadvisor. On the left are clusters with the limit set to 50Mi; on the right are clusters where the memory limit is removed:

[Image: kuberhealthy01 (container_memory_usage_bytes graphs)]

Based on this, I assume that the memory consumption would rise indefinitely.

After adding pprof to kuberhealthy and looking at the goroutine graph, you can see that an increasing number of goroutines are put into a waiting state at runtime.gopark, as they seem to block at the receive method of apimachinery's StreamWatcher:

goroutine 2486 [chan send, 143 minutes]:
k8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc0004e0700)
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190817020851-f2f3a405f61d/pkg/watch/streamwatcher.go:127 +0x15b
created by k8s.io/apimachinery/pkg/watch.NewStreamWatcher
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190817020851-f2f3a405f61d/pkg/watch/streamwatcher.go:71 +0xbe

Here is the corresponding graph:

[Image: kuberhealthy-goroutines (goroutine count graph)]
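For reference, here is a minimal sketch of one way to expose pprof in a Go binary such as kuberhealthy; the listen address and layout are only an example, not necessarily what was used for this investigation:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a side port so they stay separate from
	// the regular kuberhealthy HTTP endpoints.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the rest of the application here ...
	select {} // placeholder to keep the process alive in this sketch
}
```

The goroutine dump above can then be pulled with `go tool pprof http://localhost:6060/debug/pprof/goroutine` or by fetching `/debug/pprof/goroutine?debug=2` directly.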

Steps To Reproduce

  • Let kuberhealthy run freely over a certain amount of time with several checks

Expected behavior
That kuberhealthy does not continuously accumulate goroutines during its runtime.

Versions

  • Cluster OS: 5.3.0-46-generic #38-Ubuntu SMP Fri Mar 27 17:37:05 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Kubernetes Version: v1.17.6
  • Kuberhealthy Release: v2.2.0

Additional context

I tried to pinpoint where the problem occurs, but due to my unfamiliarity with the code I have not yet succeeded. Perhaps someone with more knowledge might chime in.
My first guess was the following lines: https://github.com/Comcast/kuberhealthy/blob/v2.2.0/cmd/kuberhealthy/kuberhealthy.go#L162

The watchForKHCheckChanges function creates a watcher (https://github.com/Comcast/kuberhealthy/blob/v2.2.0/cmd/kuberhealthy/kuberhealthy.go#L282) which internally writes Events to an unbuffered channel. Those Events are processed and new Signals are written to another unbuffered channel (https://github.com/Comcast/kuberhealthy/blob/v2.2.0/cmd/kuberhealthy/kuberhealthy.go#L297). But then the notifyChanLimiter and monitorExternalChecks work with a buffered channel that is only processed every 10 seconds (https://github.com/Comcast/kuberhealthy/blob/v2.2.0/cmd/kuberhealthy/util.go#L47). So it might be that the Watch keeps writing Events into the unbuffered channel while the goroutines processing them pile up. But this is all just a theory and probably totally wrong :).
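To make the suspected mechanism concrete, here is a self-contained sketch (with made-up names, not the actual kuberhealthy code) of producers blocked on an unbuffered channel because the consumer only drains it occasionally:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// producer stands in for a watch goroutine that forwards every event it sees.
// With an unbuffered channel and a slow consumer, the send blocks and the
// goroutine sits parked in "chan send", just like the stack trace above.
func producer(signals chan<- struct{}) {
	for {
		signals <- struct{}{}
	}
}

func main() {
	signals := make(chan struct{}) // unbuffered, like the notify channel in the theory

	// Imagine one producer being started per (re)established watch while the
	// previous ones are still parked on their send.
	for i := 0; i < 5; i++ {
		go producer(signals)
	}

	// The consumer only drains occasionally, so the producers stay blocked
	// most of the time and never go away.
	for i := 0; i < 3; i++ {
		time.Sleep(2 * time.Second)
		<-signals
		fmt.Println("goroutines:", runtime.NumGoroutine())
	}
}
```

The blocked producers here correspond to the StreamWatcher.receive goroutines parked in "chan send" in the stack trace above.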

johannesfrey added the bug label on Jun 17, 2020
@integrii
Collaborator

Could be related to #293, which was never located but also was not happening in all circumstances.

Good investigation so far!

@integrii
Collaborator

Following your path here, I think the problem may be (at least partially) here and here.

In both of these places, a goroutine is launched on each iteration of the for loop that never ends unless the context that started the entire function is cancelled... Thoughts?
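The pattern being described looks roughly like this self-contained sketch (all names are made up for illustration; this is not the actual code in kuberhealthy.go):

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// watchLoop mimics the suspected pattern: on every iteration of the outer
// for loop a new forwarding goroutine is launched, but each one only exits
// when the outer ctx is cancelled, so goroutines from earlier iterations
// keep piling up.
func watchLoop(ctx context.Context, notify chan<- struct{}) {
	for {
		select {
		case <-ctx.Done():
			return
		default:
		}

		events := make(chan struct{}) // stand-in for watcher.ResultChan()

		go func() { // leaked: stays alive until ctx is cancelled
			for {
				select {
				case <-ctx.Done():
					return
				case <-events:
					notify <- struct{}{}
				}
			}
		}()

		time.Sleep(500 * time.Millisecond) // pretend the watch dropped and is restarted
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go watchLoop(ctx, make(chan struct{}))

	// After a few simulated restarts there is one parked goroutine per iteration.
	time.Sleep(3 * time.Second)
	fmt.Println("goroutines:", runtime.NumGoroutine())
}
```

A common way out of this pattern is to reuse a single forwarding goroutine across restarts, or to give each iteration its own child context that is cancelled before the next watch is started.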

integrii self-assigned this on Jun 17, 2020
@johannesfrey
Author

johannesfrey commented Jun 18, 2020

> Following your path here, I think the problem may be (at least partially) here and here.
>
> In both of these places, a goroutine is launched on each iteration of the for loop that never ends unless the context that started the entire function is cancelled... Thoughts?

Yes, this makes sense. If that's okay with you, I will test your fix #534 on a cluster to see if it works and post back the findings/graphs.
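As a cheap cross-check alongside the pprof graphs, the goroutine count can also be logged from inside the process; with the leak present the number climbs steadily, with a working fix it should plateau. This is a hypothetical helper, not part of kuberhealthy:

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// logGoroutines periodically logs the current goroutine count so a leak
// shows up as a steadily climbing number in the pod logs.
func logGoroutines(interval time.Duration) {
	for range time.Tick(interval) {
		log.Printf("goroutines: %d", runtime.NumGoroutine())
	}
}

func main() {
	go logGoroutines(time.Minute)
	// ... run the application ...
	select {}
}
```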

@sbueringer
Contributor

@integrii The goroutine leak is still there; I opened another PR to fix it: #537

@cagdascirit

#537 has fixed this issue in our test environment

integrii added this to the 2.3.0 milestone on Jul 13, 2020