
Goroutine leak in Instance.SetPeers #26

Closed
bohde opened this issue Nov 13, 2019 · 3 comments · Fixed by #28

Comments

@bohde
Contributor

bohde commented Nov 13, 2019

I'm currently testing a Kubernetes deployment of gubernator, and noticed that the number of goroutines is growing linearly over time. Here's the graph we're seeing from our monitoring system.

[Screenshot: monitoring graph of goroutine count growing linearly over time]

This also results in a similar leak in memory. To diagnose it, I deployed a version of gubernator with pprof endpoints enabled and found that goroutines accumulate in three functions:

  1. PeerClient.run
  2. Interval.run
  3. grpc.Dial
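
For reference, a minimal sketch of how pprof endpoints like those mentioned above can be exposed in a Go service (this is not gubernator's actual wiring, and the listen address is an assumption):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a side port; the address here is an assumption.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the service as usual ...
	select {}
}
```

The goroutine counts can then be inspected with `go tool pprof http://localhost:6060/debug/pprof/goroutine`, or by fetching `/debug/pprof/goroutine?debug=1` directly.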

The root cause seems to be that in Instance.SetPeers, a new PeerClient is created for every PeerInfo without reusing any existing PeerClients. This causes a goroutine leak linearly proportional to the number of peers. In addition, there is no shutdown code for removed peers, so a goroutine is also leaked for every peer that is removed.

I suspect that this was exposed by some weird interaction with the Kubernetes integration and our test environment, since I see peer update logs every few minutes.

I have a fork where I'm testing the following changes, which I'd be happy to send as a pull request:

  1. Add a Shutdown method to PeerClient that will close the request queue
  2. Change PeerClient.run to send any enqueued requests when the request queue is closed
  3. Change PeerClient.run to call interval.Stop on return
  4. Reuse existing PeerClients inside Instance.SetPeers
  5. Shutdown any removed PeerClients inside Instance.SetPeers
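
For illustration only, here is a rough sketch of what points 4 and 5 might look like. The type, field, and method names are hypothetical stand-ins, not the code from the fork:

```go
package peers

import "sync"

// Hypothetical stand-ins for the real gubernator types.
type PeerInfo struct{ Address string }

type PeerClient struct{ info PeerInfo }

func NewPeerClient(info PeerInfo) *PeerClient { return &PeerClient{info: info} }
func (p *PeerClient) Shutdown()               { /* close queue, drain, close conn */ }

type Instance struct {
	mutex   sync.Mutex
	clients map[string]*PeerClient
}

// SetPeers reuses clients for peers that are still present (point 4) and
// shuts down clients whose peers were removed (point 5).
func (s *Instance) SetPeers(peers []PeerInfo) {
	s.mutex.Lock()
	newClients := make(map[string]*PeerClient, len(peers))
	for _, info := range peers {
		if existing, ok := s.clients[info.Address]; ok {
			newClients[info.Address] = existing // reuse, keeping its goroutines
			continue
		}
		newClients[info.Address] = NewPeerClient(info)
	}

	// Collect clients that are no longer in the peer set.
	var removed []*PeerClient
	for addr, client := range s.clients {
		if _, ok := newClients[addr]; !ok {
			removed = append(removed, client)
		}
	}
	s.clients = newClients
	s.mutex.Unlock()

	// Shut down removed clients outside the lock so in-flight work can finish.
	for _, client := range removed {
		go client.Shutdown()
	}
}
```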
@thrawn01
Contributor

This would be great! As you mentioned, we are not seeing this in our prod environments because our peer sets are pretty stable. The hard part would be knowing when all in-flight requests to a peer have completed. It's possible that we could just assume the peer is no longer fielding requests because it's been removed from the peer set. Thoughts?

@bohde
Contributor Author

bohde commented Nov 19, 2019

It appears there's an experimental API to wait for a given gRPC connection to enter a state, which could be used to build graceful shutdowns (https://godoc.org/google.golang.org/grpc#ClientConn.WaitForStateChange). A shutdown process that accounts for this may be as follows:

  1. Close the request queue
  2. Send the pending batch
  3. Wait for connection state to be idle
  4. Close the connection

I'll investigate whether this approach is viable.
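
For illustration, a rough sketch of step 3 using that experimental API (assuming a *grpc.ClientConn; this is not code from the fork):

```go
package peers

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// waitForIdle blocks until conn reports Idle or Shutdown, or until ctx expires.
// Sketch only: it relies on the experimental GetState/WaitForStateChange API.
func waitForIdle(ctx context.Context, conn *grpc.ClientConn) error {
	for {
		state := conn.GetState()
		if state == connectivity.Idle || state == connectivity.Shutdown {
			return nil
		}
		// WaitForStateChange returns false if ctx is done before the state changes.
		if !conn.WaitForStateChange(ctx, state) {
			return ctx.Err()
		}
	}
}
```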

@bohde
Contributor Author

bohde commented Nov 20, 2019

I couldn't reliably build graceful shutdowns of the connection using the WaitForStateChange API, so I ended up using a sync.WaitGroup to track in-flight requests instead.

The changes are in my PR #28.
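
For illustration, a minimal sketch of tracking in-flight requests with sync.WaitGroup (the names here are hypothetical; see the PR for the actual implementation):

```go
package peers

import "sync"

// PeerClient tracks in-flight requests so Shutdown can wait for them to drain.
type PeerClient struct {
	wg    sync.WaitGroup
	queue chan request
}

type request struct{ /* rate limit request fields */ }

func NewPeerClient() *PeerClient {
	p := &PeerClient{queue: make(chan request)}
	go p.run()
	return p
}

// Do registers the request as in-flight before handing it to the run loop.
// The real code must also guard against enqueuing after Shutdown has been called.
func (p *PeerClient) Do(r request) {
	p.wg.Add(1)
	p.queue <- r
}

// run processes queued requests and marks each one done when finished.
func (p *PeerClient) run() {
	for r := range p.queue {
		_ = r // ... send the batched request to the peer ...
		p.wg.Done()
	}
}

// Shutdown closes the queue and blocks until all in-flight requests finish.
func (p *PeerClient) Shutdown() {
	close(p.queue)
	p.wg.Wait()
	// ... now it is safe to close the gRPC connection ...
}
```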
