I'm currently testing a Kubernetes deployment of gubernator and noticed that the number of goroutines is growing linearly over time. Here's the graph we're seeing from our monitoring system.
This also results in a similar leak in memory. To diagnose it, I deployed a version of gubernator with the pprof endpoints enabled and found that goroutines accumulate in three functions:
PeerClient.run
Interval.run
grpc.Dial
The root cause seems to be that in Instance.SetPeers, a new PeerClient is created for every PeerInfo without reusing any existing PeerClients, so the goroutine leak grows in linear proportion to the number of peers. In addition, there is no shutdown code for removed peers, so this code should also leak a goroutine for every peer that is removed.
I suspect that this was exposed by some weird interaction between the Kubernetes integration and our test environment, since I see peer update logs every few minutes.
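For concreteness, here's a minimal sketch of what reuse could look like, assuming clients are keyed by peer address; all of the names here (peerClients, NewPeerClient, Shutdown) are stand-ins for illustration, not gubernator's actual internals:

```go
package gubernator // illustrative sketch, not the real source file

import "sync"

// Stand-ins for the real types; only Address matters for this sketch.
type PeerInfo struct{ Address string }

type PeerClient struct{ /* request queue, gRPC conn, goroutines... */ }

func NewPeerClient(info PeerInfo) *PeerClient { return &PeerClient{} }
func (c *PeerClient) Shutdown()               { /* close queue, drain, close conn */ }

type Instance struct {
	mu          sync.Mutex
	peerClients map[string]*PeerClient
}

// SetPeers reuses clients for peers that are still present and shuts down
// clients for peers that were removed, instead of dialing every peer anew.
func (s *Instance) SetPeers(peers []PeerInfo) {
	s.mu.Lock()
	defer s.mu.Unlock()

	updated := make(map[string]*PeerClient, len(peers))
	for _, info := range peers {
		if existing, ok := s.peerClients[info.Address]; ok {
			updated[info.Address] = existing // reuse: no new goroutines or conns
		} else {
			updated[info.Address] = NewPeerClient(info) // dial only genuinely new peers
		}
	}
	// Anything not carried over was removed from the peer set; release it.
	for addr, client := range s.peerClients {
		if _, ok := updated[addr]; !ok {
			go client.Shutdown()
		}
	}
	s.peerClients = updated
}
```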
I have a fork where I'm testing the following changes, and I would be happy to send a pull request (a sketch of how they fit together follows the list):
Add a Shutdown method to PeerClient that will close the request queue
Change PeerClient.run to send any enqueued requests when the request queue is closed
Change PeerClient.run to call interval.Stop on return
Reuse existing PeerClients inside Instance.SetPeers
Shut down removed PeerClients inside Instance.SetPeers
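Roughly, the first three changes could fit together like this; the Interval stand-in, the channel layout, and helper names such as sendBatch are assumptions for illustration, not gubernator's actual internals:

```go
package gubernator // illustrative sketch, not the real source file

import (
	"sync"
	"time"

	"google.golang.org/grpc"
)

type request struct{ /* rate-limit request plus a response channel */ }

// Interval is a stand-in for gubernator's Interval type, backed by a ticker.
type Interval struct{ ticker *time.Ticker }

func NewInterval(d time.Duration) *Interval { return &Interval{time.NewTicker(d)} }
func (i *Interval) C() <-chan time.Time     { return i.ticker.C }
func (i *Interval) Stop()                   { i.ticker.Stop() }

type PeerClient struct {
	queue chan *request
	conn  *grpc.ClientConn
	wg    sync.WaitGroup // in-flight requests; see the WaitGroup note below
}

// Shutdown closes the request queue, which tells run() to flush and return,
// then waits for in-flight requests before releasing the connection.
func (c *PeerClient) Shutdown() {
	close(c.queue)
	c.wg.Wait()
	c.conn.Close()
}

func (c *PeerClient) run() {
	interval := NewInterval(100 * time.Millisecond)
	defer interval.Stop() // change 3: stop the interval's goroutine on return

	var batch []*request
	for {
		select {
		case r, ok := <-c.queue:
			if !ok { // change 1: queue closed by Shutdown
				c.sendBatch(batch) // change 2: send anything still enqueued
				return
			}
			batch = append(batch, r)
		case <-interval.C():
			if len(batch) > 0 {
				c.sendBatch(batch)
				batch = nil
			}
		}
	}
}

func (c *PeerClient) sendBatch(batch []*request) { /* issue the batched gRPC call */ }
```

Closing the channel doubles as the shutdown signal, so run() needs no separate quit channel and can't miss requests that were enqueued before Shutdown was called.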
This would be great! As you mentioned, we are not seeing this in our prod environments because our peer sets are pretty stable. The hard part would be knowing when all in-flight requests to a peer have completed. It's possible that we could just assume the peer is no longer fielding requests because it's been removed from the peer set. Thoughts?
I couldn't reliably build a graceful shutdown of the connection using the WaitForStateChange API, so I ended up using a sync.WaitGroup to track in-flight requests instead.
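Extending the sketch above (plus a "context" import), the tracking looks roughly like this; getPeerRateLimit is a placeholder name, and a real implementation also has to prevent wg.Add from racing with wg.Wait (and the send from hitting a closed queue) once shutdown has begun, e.g. with a mutex or a closed flag:

```go
// Each caller registers itself before enqueueing, so Shutdown's wg.Wait()
// returns only once every outstanding request has finished.
func (c *PeerClient) getPeerRateLimit(ctx context.Context, r *request) error {
	c.wg.Add(1)
	defer c.wg.Done()

	select {
	case c.queue <- r: // hand off to run() for batching
	case <-ctx.Done():
		return ctx.Err()
	}
	// ... wait for the batched response or ctx.Done() ...
	return nil
}
```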