
Goroutine leak in Instance.SetPeers #26

Closed
bohde opened this issue Nov 13, 2019 · 3 comments · Fixed by #28

Comments

@bohde
Contributor

bohde commented Nov 13, 2019

I'm currently testing a Kubernetes deployment of gubernator, and noticed that the number of goroutines is growing linearly over time. Here's the graph we're seeing from our monitoring system.

[Screenshot: monitoring graph of goroutine count growing linearly over time]

This also results in a similar leak in memory. To diagnose it, I deployed a version of gubernator with pprof endpoints enabled and found that goroutines accumulate in three functions:

  1. PeerClient.run
  2. Interval.run
  3. grpc.Dial
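
For reference, a minimal sketch of how pprof endpoints like those mentioned above can be exposed in a Go service (this is not gubernator's actual wiring, and the listen address is an assumption):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a side port; the address here is an assumption.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the service as usual ...
	select {}
}
```

The goroutine counts can then be inspected with `go tool pprof http://localhost:6060/debug/pprof/goroutine`, or by fetching `/debug/pprof/goroutine?debug=1` directly.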

The root cause seems to be that in Instance.SetPeers, a new PeerClient is created for every PeerInfo without reusing any existing PeerClients. This causes a goroutine leak linearly proportional to the number of peers. In addition, there is no shutdown code for removed peers, so a goroutine is also leaked for every peer that is removed.

I suspect that this was exposed by some weird interaction with the Kubernetes integration and our test environment, since I see peer update logs every few minutes.

I have a fork where I'm testing the following changes, which I'd be happy to send as a pull request:

  1. Add a Shutdown method to PeerClient that will close the request queue
  2. Change PeerClient.run to send any enqueued requests when the request queue is closed
  3. Change PeerClient.run to call interval.Stop on return
  4. Reuse existing PeerClients inside Instance.SetPeers
  5. Shutdown any removed PeerClients inside Instance.SetPeers
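
For illustration only, here is a rough sketch of what points 4 and 5 might look like. The type, field, and method names are hypothetical stand-ins, not the code from the fork:

```go
package peers

import "sync"

// Hypothetical stand-ins for the real gubernator types.
type PeerInfo struct{ Address string }

type PeerClient struct{ info PeerInfo }

func NewPeerClient(info PeerInfo) *PeerClient { return &PeerClient{info: info} }
func (p *PeerClient) Shutdown()               { /* close queue, drain, close conn */ }

type Instance struct {
	mutex   sync.Mutex
	clients map[string]*PeerClient
}

// SetPeers reuses clients for peers that are still present (point 4) and
// shuts down clients whose peers were removed (point 5).
func (s *Instance) SetPeers(peers []PeerInfo) {
	s.mutex.Lock()
	newClients := make(map[string]*PeerClient, len(peers))
	for _, info := range peers {
		if existing, ok := s.clients[info.Address]; ok {
			newClients[info.Address] = existing // reuse, keeping its goroutines
			continue
		}
		newClients[info.Address] = NewPeerClient(info)
	}

	// Collect clients that are no longer in the peer set.
	var removed []*PeerClient
	for addr, client := range s.clients {
		if _, ok := newClients[addr]; !ok {
			removed = append(removed, client)
		}
	}
	s.clients = newClients
	s.mutex.Unlock()

	// Shut down removed clients outside the lock so in-flight work can finish.
	for _, client := range removed {
		go client.Shutdown()
	}
}
```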
@thrawn01
Contributor

This would be great! As you mentioned, we are not seeing this in our prod environments because our peer sets are pretty stable. The hard part would be knowing when all in-flight requests to a peer have completed. It's possible that we could just assume the peer is no longer fielding requests because it's been removed from the peer set. Thoughts?

@bohde
Contributor Author

bohde commented Nov 19, 2019

It appears there's an experimental API to wait for a given gRPC connection to enter a state, which could be used to build graceful shutdowns (https://godoc.org/google.golang.org/grpc#ClientConn.WaitForStateChange). A shutdown process that accounts for this may be as follows:

  1. Close the request queue
  2. Send the pending batch
  3. Wait for connection state to be idle
  4. Close the connection

I'll investigate whether this approach is viable.
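
For illustration, a rough sketch of step 3 using that experimental API (assuming a *grpc.ClientConn; this is not code from the fork):

```go
package peers

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// waitForIdle blocks until conn reports Idle or Shutdown, or until ctx expires.
// Sketch only: it relies on the experimental GetState/WaitForStateChange API.
func waitForIdle(ctx context.Context, conn *grpc.ClientConn) error {
	for {
		state := conn.GetState()
		if state == connectivity.Idle || state == connectivity.Shutdown {
			return nil
		}
		// WaitForStateChange returns false if ctx is done before the state changes.
		if !conn.WaitForStateChange(ctx, state) {
			return ctx.Err()
		}
	}
}
```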

@bohde
Contributor Author

bohde commented Nov 20, 2019

I couldn't reliably build graceful shutdowns of the connection using the WaitForStateChange API, so I ended up using a sync.WaitGroup to track in-flight requests instead.

The changes are in my PR #28.
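
For illustration, a minimal sketch of tracking in-flight requests with sync.WaitGroup (the names here are hypothetical; see the PR for the actual implementation):

```go
package peers

import "sync"

// PeerClient tracks in-flight requests so Shutdown can wait for them to drain.
type PeerClient struct {
	wg    sync.WaitGroup
	queue chan request
}

type request struct{ /* rate limit request fields */ }

func NewPeerClient() *PeerClient {
	p := &PeerClient{queue: make(chan request)}
	go p.run()
	return p
}

// Do registers the request as in-flight before handing it to the run loop.
// The real code must also guard against enqueuing after Shutdown has been called.
func (p *PeerClient) Do(r request) {
	p.wg.Add(1)
	p.queue <- r
}

// run processes queued requests and marks each one done when finished.
func (p *PeerClient) run() {
	for r := range p.queue {
		_ = r // ... send the batched request to the peer ...
		p.wg.Done()
	}
}

// Shutdown closes the queue and blocks until all in-flight requests finish.
func (p *PeerClient) Shutdown() {
	close(p.queue)
	p.wg.Wait()
	// ... now it is safe to close the gRPC connection ...
}
```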
