Possible goroutines leak from watch events #533

Closed
johannesfrey opened this issue Jun 17, 2020 · 5 comments · Fixed by #534 or #537

@johannesfrey

johannesfrey commented Jun 17, 2020

Describe the bug
Memory consumption of the kuberhealthy deployment increases steadily when running inside a cluster performing several checks. We deployed kuberhealthy together with a daemonset, dns-name-resolution, and network-connection check into multiple clusters and set a memory limit of 50Mi via Kubernetes resources. A pattern emerged where the main kuberhealthy server deployment was OOM killed after about 2 days. The following shows the graphs for the container_memory_usage_bytes Prometheus metric coming from cadvisor. On the left are clusters with the limit set to 50Mi; on the right are clusters where the memory limit is removed:

[Image: kuberhealthy01 (container_memory_usage_bytes graphs)]

Based on this, I assume that the memory consumption would rise indefinitely.

After adding pprof to kuberhealthy and looking at the goroutine graph, you can see that an increasing number of goroutines are put into a waiting state at runtime.gopark, as they seem to block at the receive method of apimachinery's StreamWatcher:

goroutine 2486 [chan send, 143 minutes]:
k8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc0004e0700)
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190817020851-f2f3a405f61d/pkg/watch/streamwatcher.go:127 +0x15b
created by k8s.io/apimachinery/pkg/watch.NewStreamWatcher
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190817020851-f2f3a405f61d/pkg/watch/streamwatcher.go:71 +0xbe

Here is the corresponding graph:

[Image: kuberhealthy-goroutines (goroutine count graph)]
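For reference, here is a minimal sketch of one way to expose pprof in a Go binary such as kuberhealthy; the listen address and layout are only an example, not necessarily what was used for this investigation:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a side port so they stay separate from
	// the regular kuberhealthy HTTP endpoints.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the rest of the application here ...
	select {} // placeholder to keep the process alive in this sketch
}
```

The goroutine dump above can then be pulled with `go tool pprof http://localhost:6060/debug/pprof/goroutine` or by fetching `/debug/pprof/goroutine?debug=2` directly.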

Steps To Reproduce

  • Let kuberhealthy run freely over a certain amount of time with several checks

Expected behavior
That kuberhealthy does not continuously accumulate goroutines during its runtime.

Versions

  • Cluster OS: 5.3.0-46-generic #38-Ubuntu SMP Fri Mar 27 17:37:05 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Kubernetes Version: v1.17.6
  • Kuberhealthy Release: v2.2.0

Additional context

I tried to pinpoint where the problem occurs, but due to my unfamiliarity with the code I have not yet succeeded. Perhaps someone with more knowledge might chime in.
My first guess was the following lines: https://github.com/Comcast/kuberhealthy/blob/v2.2.0/cmd/kuberhealthy/kuberhealthy.go#L162

The watchForKHCheckChanges function creates a watcher (https://github.com/Comcast/kuberhealthy/blob/v2.2.0/cmd/kuberhealthy/kuberhealthy.go#L282) which internally writes Events to an unbuffered channel. Those Events are processed and new Signals are written to another unbuffered channel (https://github.com/Comcast/kuberhealthy/blob/v2.2.0/cmd/kuberhealthy/kuberhealthy.go#L297). But then the notifyChanLimiter and monitorExternalChecks work with a buffered channel that is only processed every 10 seconds (https://github.com/Comcast/kuberhealthy/blob/v2.2.0/cmd/kuberhealthy/util.go#L47). So it might be that the Watch keeps writing Events into the unbuffered channel while the goroutines processing them pile up. But this is all just a theory and probably totally wrong :).
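To make the suspected mechanism concrete, here is a self-contained sketch (with made-up names, not the actual kuberhealthy code) of producers blocked on an unbuffered channel because the consumer only drains it occasionally:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// producer stands in for a watch goroutine that forwards every event it sees.
// With an unbuffered channel and a slow consumer, the send blocks and the
// goroutine sits parked in "chan send", just like the stack trace above.
func producer(signals chan<- struct{}) {
	for {
		signals <- struct{}{}
	}
}

func main() {
	signals := make(chan struct{}) // unbuffered, like the notify channel in the theory

	// Imagine one producer being started per (re)established watch while the
	// previous ones are still parked on their send.
	for i := 0; i < 5; i++ {
		go producer(signals)
	}

	// The consumer only drains occasionally, so the producers stay blocked
	// most of the time and never go away.
	for i := 0; i < 3; i++ {
		time.Sleep(2 * time.Second)
		<-signals
		fmt.Println("goroutines:", runtime.NumGoroutine())
	}
}
```

The blocked producers here correspond to the StreamWatcher.receive goroutines parked in "chan send" in the stack trace above.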

johannesfrey added the bug label on Jun 17, 2020
@integrii
Collaborator

Could be related to #293, which was never located but also was not happening in all circumstances.

Good investigation so far!

@integrii
Collaborator

Following your path here, I think the problem may be (at least partially) here and here.

In both of these places, a goroutine is launched on each iteration of the for loop that never ends unless the context that started the entire function is cancelled... Thoughts?
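The pattern being described looks roughly like this self-contained sketch (all names are made up for illustration; this is not the actual code in kuberhealthy.go):

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// watchLoop mimics the suspected pattern: on every iteration of the outer
// for loop a new forwarding goroutine is launched, but each one only exits
// when the outer ctx is cancelled, so goroutines from earlier iterations
// keep piling up.
func watchLoop(ctx context.Context, notify chan<- struct{}) {
	for {
		select {
		case <-ctx.Done():
			return
		default:
		}

		events := make(chan struct{}) // stand-in for watcher.ResultChan()

		go func() { // leaked: stays alive until ctx is cancelled
			for {
				select {
				case <-ctx.Done():
					return
				case <-events:
					notify <- struct{}{}
				}
			}
		}()

		time.Sleep(500 * time.Millisecond) // pretend the watch dropped and is restarted
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go watchLoop(ctx, make(chan struct{}))

	// After a few simulated restarts there is one parked goroutine per iteration.
	time.Sleep(3 * time.Second)
	fmt.Println("goroutines:", runtime.NumGoroutine())
}
```

A common way out of this pattern is to reuse a single forwarding goroutine across restarts, or to give each iteration its own child context that is cancelled before the next watch is started.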

integrii self-assigned this on Jun 17, 2020
@johannesfrey
Author

johannesfrey commented Jun 18, 2020

> Following your path here, I think the problem may be (at least partially) here and here.
>
> In both of these places, a goroutine is launched on each iteration of the for loop that never ends unless the context that started the entire function is cancelled... Thoughts?

Yes, this makes sense. If that's okay with you, I will test your fix #534 on a cluster to see if it works and post back the findings/graphs.
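As a cheap cross-check alongside the pprof graphs, the goroutine count can also be logged from inside the process; with the leak present the number climbs steadily, with a working fix it should plateau. This is a hypothetical helper, not part of kuberhealthy:

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// logGoroutines periodically logs the current goroutine count so a leak
// shows up as a steadily climbing number in the pod logs.
func logGoroutines(interval time.Duration) {
	for range time.Tick(interval) {
		log.Printf("goroutines: %d", runtime.NumGoroutine())
	}
}

func main() {
	go logGoroutines(time.Minute)
	// ... run the application ...
	select {}
}
```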

@sbueringer
Contributor

@integrii The goroutine leak is still there; I opened another PR to fix it: #537

@cagdascirit

#537 has fixed this issue in our test environment

integrii added this to the 2.3.0 milestone on Jul 13, 2020