cache: prevent goroutine leak in agent cache #14908
Conversation
Force-pushed from 27877b5 to b20c6ae
Thank you both for finding and fixing this @rboyer and @boxofrad
I'm wondering if we can fix this in a simpler way by using singleflight on the goroutine spawned by `fetch` 🤔. We already have a key passed to `fetch` that we can leverage for this.
I think you'd need something like "preemptible singleflight" to match what this cache is currently doing. Basically if you try to do a fetch for
Mostly for my own knowledge, but why do we need to cancel the old one and start a new one? Wouldn't that result in the same thing, i.e. the element pointed to by that key being stored/refreshed in the cache?
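For context, here is a minimal, self-contained sketch of the plain singleflight pattern being discussed, using golang.org/x/sync/singleflight; the helper names (fetchFromServers, get) are illustrative and not part of the Consul agent cache:

```go
package main

import (
	"fmt"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// fetchFromServers stands in for the real blocking-query RPC; it is a
// hypothetical placeholder, not a Consul function.
func fetchFromServers(key string) (interface{}, error) {
	return "result for " + key, nil
}

// get coalesces concurrent callers for the same key into a single fetch.
// Plain singleflight waits for the in-flight call rather than preempting
// it, which is why the thread above talks about needing something like
// "preemptible singleflight" to match what the cache currently does.
func get(key string) (interface{}, error) {
	v, err, _ := group.Do(key, func() (interface{}, error) {
		return fetchFromServers(key)
	})
	return v, err
}

func main() {
	v, _ := get("service:web")
	fmt.Println(v)
}
```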
@rboyer I had a few questions about this. I'm not sure I understand the logic for `notifyBlockingQueryOnce`.
If you think it's easier we can do a synchronous review of this early next week.
The more I stare at this cache code (which I don't fully understand), the more I suspect that the solution I have here is wrong, the existing cache code is similarly wrong, and the two wrongs combine into a bigger wrong (removing the cacheability of good requests at the expense of fixing this perceived bug).
OK, I changed this 100%. Now it solely deals with the goroutine that `fetch` spawns. If another fetch for the same key gets to the line of code which fires off the goroutine, it cancels the former one before spawning the new one. To ensure the tracking map is cleaned up at the right time, the unique identity is used in a defer in the goroutine so that the item in the map for this key is only deleted if the identity matches.
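A rough sketch of that pattern follows, assuming a Cache stripped down to only the fetch-tracking state; getOrReplaceFetchHandle and fetchHandle appear in the PR itself, while newCache and deleteFetchHandle are illustrative names for this sketch only:

```go
package cache

import "sync"

// fetchHandle identifies one spawned fetch goroutine for a key.
type fetchHandle struct {
	id     uint64
	stopCh chan struct{}
}

// Cache is reduced here to just the fetch-tracking state for this sketch.
type Cache struct {
	mu           sync.Mutex
	nextFetchID  uint64
	fetchHandles map[string]fetchHandle
}

func newCache() *Cache {
	return &Cache{fetchHandles: make(map[string]fetchHandle)}
}

// getOrReplaceFetchHandle terminates any prior fetcher for key by closing
// its stop channel, then registers a new handle with a unique identity.
func (c *Cache) getOrReplaceFetchHandle(key string) fetchHandle {
	c.mu.Lock()
	defer c.mu.Unlock()
	if prev, ok := c.fetchHandles[key]; ok {
		close(prev.stopCh) // depose the previous goroutine
	}
	c.nextFetchID++
	h := fetchHandle{id: c.nextFetchID, stopCh: make(chan struct{})}
	c.fetchHandles[key] = h
	return h
}

// deleteFetchHandle is the deferred cleanup described above: the map entry
// is removed only if it still belongs to the goroutine identified by id,
// so an older, deposed goroutine never evicts its replacement's entry.
func (c *Cache) deleteFetchHandle(key string, id uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if h, ok := c.fetchHandles[key]; ok && h.id == id {
		delete(c.fetchHandles, key)
	}
}
```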
```go
// If we're over the attempt minimum, start an exponential backoff.
if wait := backOffWait(attempt); wait > 0 {
	time.Sleep(wait)
```
After the initial refactor it turned out that the goroutine leak was back!
Then I changed the `time.Sleep` calls into cancellable ones via the `select` pattern and the leak went away.
```diff
@@ -18,6 +18,7 @@ import (
 	"github.com/hashicorp/consul/acl"
 	"github.com/hashicorp/consul/lib/ttlcache"
 	"github.com/hashicorp/consul/sdk/testutil"
+	"github.com/hashicorp/consul/sdk/testutil/retry"
```
The changes in this file remove a flake from the `-race` version of these tests.
Thanks all for finding this and working on it. At first glance I'm not super clear on where the original leak happens - the cache was supposed to single-flight requests, so we didn't support cancellation because there should only ever be at most one fetch in flight per key.

The PR description mentions error handling, but I'm not quite sure what the actual steps to cause the leak are? Could you explain how the leak happens as far as you understand it, RB? Is the problem that we are removing the item from the cache on an error but not stopping its background fetch path? If so that's surely a bug, but that's not what this code change is fixing... If the entry is still there, how are we ending up with a different fetch loop for the same key?
My mental model for the cache was that it should not be possible for there to be any other goroutine for the same key if we ever get to the line that spawns a new one... The PR here looks reasonable, but I'm not really clear on how the leak shows up so it's a little hard to judge. I'm also interested in the details being super clearly documented here because I'm super wary of how complex the agent cache is - we've often ended up introducing other bugs when we try to fix bugs in here! I'm assuming it is a correct fix, but it would be awesome to have really good info about exactly why we think it's the right fix for the next time we have to come back here and understand it!
```
@@ -836,11 +872,46 @@ func (c *Cache) fetch(key string, r getOptions, allowNew bool, attempt uint, ign
	r.Info.MinIndex = 0
	c.fetch(key, r, false, attempt, true)
```
This is where `fetch` calls itself, but it does so in a goroutine and then returns, which exits the original goroutine. The net effect at the end of this call is that you have the same number of goroutines, because one exits as the other starts.
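As a toy illustration of that shape (not the Consul code), each iteration hands off to a new goroutine and then exits, so the live goroutine count stays flat even though the chain keeps running:

```go
package main

import (
	"fmt"
	"time"
)

// refresh is a toy stand-in for the cache's background refresh: it "recurses"
// by spawning the next iteration in a new goroutine and returning, so the
// current goroutine exits as the next one starts.
func refresh(key string, stop <-chan struct{}) {
	select {
	case <-time.After(100 * time.Millisecond): // pretend blocking query + wait
	case <-stop:
		return
	}
	go refresh(key, stop) // tail call via goroutine; this goroutine now exits
}

func main() {
	stop := make(chan struct{})
	go refresh("service:web", stop)

	time.Sleep(time.Second)
	close(stop) // analogous to evicting the cache entry: breaks the chain
	fmt.Println("stopped")
}
```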
@rboyer and I looked at this together yesterday. I want to record my understanding before it slips through my fingers again! The intended behaviour (simplified from actual code for clarity) here for refreshing cache types is roughly:

1. a `Get` for a refresh-enabled cache type looks up the entry; if a fetch is already in flight for that key, the caller just waits on the entry's waiter channel
2. otherwise `fetch` marks the entry as fetching and spawns a goroutine that runs the blocking query against the servers
3. when the blocking query returns, that goroutine updates the entry, notifies the waiters, and then tail-calls `fetch` again (in a fresh goroutine) with the new index so the entry keeps refreshing in the background
The intent of all this is that the cache has "single flighted" the request so that many concurrent requests will just see the existing entry has a waiter and will wait on that. The background refresh "loop" (which is really tail call recursion) continues until the cache entry is evicted (which breaks the tail call), which won't happen until the TTL has passed without an active request for that entry.

The Bug

So the intention is that only a single fetching routine is active at any one time for a given cache key... but there is a major flaw in the design that makes that not true in actually quite a lot of cases! For the whole time that the fetching goroutine is waiting out an error backoff, the entry's `Fetching` flag has already been flipped back to false, so a concurrent request for the same key will spawn a second fetching goroutine. So it's fatally flawed! But because of the complexities of the recursive calls, it's really hard to fix cleanly.
@rboyer and I concluded that this whole thing could probably be a lot simpler to reason about if we refactored it so that the background refresh was a single method with a for loop instead of recursive calls. I don't know if RB has looked at that since we spoke, but our guess is that it's a big enough change that we'd be very nervous about rushing it in, given the current history of breaking the agent cache whenever we touch it and the desire to fix this issue soon. So the approach @rboyer took here fixes the bug (we think without causing any new ones) specifically because:

- every spawned fetching goroutine now gets its own handle with a unique identity and a stop channel
- spawning a replacement fetcher for the same key closes the previous handle's stop channel first
- the waits are now cancellable selects, so a deposed goroutine exits promptly instead of leaking
The tl;dr - this is very much a band-aid fix for a bug that is a violation of the core design goal of the cache. But we think it's the least likely thing to cause additional bugs, and so the safest to merge quickly (this week) so it makes the latest release. As noted inline, while this seems safer than some alternatives to reason about, it does probably have complex consequences - like my inline hypothesis that it could cause us to bypass backoff waiting on errors in some cases, which could cause a DDoS of the servers.
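For reference, a hypothetical sketch of the flattened for-loop shape being discussed (roughly the direction #14956 takes, not code from either PR); it reuses the Cache/fetchHandle types from the earlier sketch (add "time" to its imports), the blocking query itself is elided, and backOffWait is a placeholder rather than Consul's implementation:

```go
// runFetcher is an illustrative shape only: one goroutine per cache entry,
// looping until it is deposed or the entry is evicted, with every wait
// expressed as a cancellable select.
func (c *Cache) runFetcher(key string, handle fetchHandle) {
	// Only clean up our own tracking entry, never a replacement's.
	defer c.deleteFetchHandle(key, handle.id)

	var attempt uint
	for {
		// Issue the blocking query and update the cache entry here
		// (omitted); bump attempt on error, reset it on success.

		if wait := backOffWait(attempt); wait > 0 {
			select {
			case <-time.After(wait):
				// Backoff elapsed; loop around for the next fetch.
			case <-handle.stopCh:
				// Deposed by a newer fetcher, or the entry was evicted.
				return
			}
		}
	}
}

// backOffWait is a placeholder with the same shape as the helper quoted
// earlier in the review: no wait for the first attempts, then a capped
// exponential backoff.
func backOffWait(attempt uint) time.Duration {
	if attempt < 2 {
		return 0
	}
	wait := time.Duration(1<<(attempt-2)) * time.Second
	if wait > time.Minute {
		wait = time.Minute
	}
	return wait
}
```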
```go
select {
case <-time.After(totalWait):
case <-handle.stopCh:
	return
```
What happens if another `fetch` happens in another goroutine right after this point but before our own `fetch` call?

- the other `fetch` will see our handle, close our chan (no effect) and replace it with their own with a greater ID (say 10)
- the other `fetch` will then spawn `fetch.internal` and start an RPC
- we continue into our own `fetch`, see the handle with ID 10 and close its chan, make our new one with ID 11, and then re-enter `fetch.internal`
- the other `fetch.internal` returns from its RPC, sees its chan closed and exits (wasted RPC)
- we complete our RPC and, assuming no new one has started, update the cache entry

Conclusion: seems safe, although it does mean we still end up doing some redundant RPCs. I think that's OK as a quick fix.
Hmmm, so one thing this could possibly cause would be effectively a DoS of the servers on error. Normally we'd have the retry backoff protecting the servers if repeated errors happen. It would look like this:

- `fetch.internal 1` gets an error, updates the entry, starts its backoff wait
- the client loop has a smaller backoff (or none) so it retries, which results in a new `fetch 2`
- `fetch 2` will see `Fetching == false` and so spawn a `fetch.internal 2` while closing the stop chan for `fetch.internal 1`. It will immediately make another request to the servers even though we are supposed to be waiting.
- eventually `fetch.internal 1` finishes its backoff wait, sees it's been deposed and so just returns
- `fetch.internal 2` also hits an error and starts its backoff
- the client retries sooner again, etc.

We might be OK because our internal cache clients generally use `Notify`, which probably does the right thing in that it does its own backoff loop IIRC. But it seems at least a bit risky that this fix potentially means we'll bypass error backoffs?
> Hmmm, so one thing this could possibly cause would be effectively a DoS of the servers on error.

That's actually already happening on error with the existing code, since getting an error sets `Fetching=false`.
agent/cache/cache.go (outdated)
```go
}

select {
case <-time.After(totalWait):
```
I find this a little subtle to read - it takes me a while to convince myself that the first case is not falling through to the `return` as well...

It's a nit, as most Go programmers are probably OK with this, but I'd consider adding a comment to explicitly call out that the first case just continues execution. Or re-order the cases?
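For example, the nit above could look something like this (purely illustrative, reusing the totalWait and handle names from the quoted snippet):

```go
select {
case <-handle.stopCh:
	// A newer fetch replaced us (or the entry expired); stop here rather
	// than leak.
	return
case <-time.After(totalWait):
	// The wait elapsed; deliberately no return here, execution simply
	// continues below with the next fetch.
}
```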
This looks good! I think this approach would work. I have a few questions which could lead to simplifications.
Here is the more complete attempt at fixing this problem by removing the weird recursion and then introducing a new tracking variable to the `cacheEntry` type: #14956
Thanks again @rboyer and @banks for prioritizing this, and the thorough write-up ⭐
It seems safe to me (acknowledging the DoS vector you've called out), but with that said, #14924 solves the immediate problem for agentless so if you'd rather hold off for the more complete fix in #14956 that'd be fine for us too.
I've rephrased the follow-on PR to better state that it fixes the RPC issue since this PR fixes the goroutine leak. That should make it easier to rationalize landing both changes on
There is a bug in the error handling code for the Agent cache subsystem discovered:

1. NotifyCallback calls notifyBlockingQuery which calls getWithIndex in a loop (which backs off on-error up to 1 minute)
2. getWithIndex calls fetch if there’s no valid entry in the cache
3. fetch starts a goroutine which calls Fetch on the cache-type, waits for a while (again with backoff up to 1 minute for errors) and then calls fetch to trigger a refresh

The end result being that every 1 minute notifyBlockingQuery spawns an ancestry of goroutines that essentially lives forever.

This PR ensures that the goroutine started by `fetch` cancels any prior goroutine spawned by the same line for the same key.

In isolated testing where a cache type was tweaked to indefinitely error, this patch prevented goroutine counts from skyrocketing.
Force-pushed from e21545b to 0831352
Thank you @rboyer for chasing this.
I have a small aesthetic comment, but it's not blocking.
```go
// By the time we get here the system WANTS to make a replacement fetcher, so
// we terminate the prior one and replace it.
handle := c.getOrReplaceFetchHandle(key)
go func(handle fetchHandle) {
```
Maybe we can consider extracting this giant anonymous function into a named function.
That's a side effect of the other PR (#14956) already.
Actually doing it here gets fiddly: there's some tracking state that is currently part of the closure which would need to be adjusted into input/output parameters.
I did it on the road to #14956 and I wouldn't want to do that here on the band-aid for fear of introducing an inadvertent bug in the data flow.
…n error (#14956)

This continues the work done in #14908 where a crude solution to prevent a goroutine leak was implemented. The former code would launch a perpetual goroutine family every iteration (+1 +1) and the fixed code simply caused a new goroutine family to first cancel the prior one to prevent the leak (-1 +1 == 0).

This PR refactors this code completely to:

- make it more understandable
- remove the recursion-via-goroutine strangeness
- prevent unnecessary RPC fetches when the prior one has errored

The core issue arose from a conflation of the entry.Fetching field to mean:

1. there is an RPC (blocking query) in flight right now
2. there is a goroutine running to manage the RPC fetch retry loop

The problem is that the goroutine-leak-avoidance check would treat Fetching like (2), but within the body of a goroutine it would flip that boolean back to false before the retry sleep. This would cause a new chain of goroutines to launch which #14908 would correct crudely.

The refactored code uses a plain for-loop and changes the semantics to track state for "is there a goroutine associated with this cache entry" instead of the former. We use a uint64 unique identity per goroutine instead of a boolean so that any orphaned goroutines can tell when they've been replaced when the expiry loop deletes a cache entry while the goroutine is still running and is later replaced.
Description
There is a bug in the error handling code for the Agent cache subsystem discovered by @boxofrad:

1. `NotifyCallback` calls `notifyBlockingQuery` which calls `getWithIndex` in a loop (which backs off on-error up to 1 minute)
2. `getWithIndex` calls `fetch` if there’s no valid entry in the cache
3. `fetch` starts a goroutine which calls `Fetch` on the cache-type, waits for a while (again with backoff up to 1 minute for errors) and then calls `fetch` to trigger a refresh

The end result being that every 1 minute `notifyBlockingQuery` spawns an ancestry of goroutines that essentially lives forever.

This PR ensures that the goroutine started by `fetch` cancels any prior goroutine spawned by the same line for the same key.

In isolated testing where a cache type was tweaked to indefinitely error, this patch prevented goroutine counts from skyrocketing.
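As a self-contained toy (not the Consul test code) of the kind of check described above, the before/after goroutine counts can be compared with runtime.NumGoroutine; fetchLeaky and fetchFixed below are illustrative stand-ins for the pre-fix and post-fix behaviour:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// fetchLeaky mimics the pre-fix shape: every call spawns a background
// goroutine that sleeps out a long error backoff with no way to cancel it.
func fetchLeaky() {
	go func() {
		time.Sleep(time.Hour)
	}()
}

// fetchFixed mimics the fix: spawning a replacement closes the previous
// goroutine's stop channel, and the wait is a cancellable select.
func fetchFixed(prev chan struct{}) chan struct{} {
	if prev != nil {
		close(prev) // depose the previous fetcher for this key
	}
	stop := make(chan struct{})
	go func() {
		select {
		case <-time.After(time.Hour):
		case <-stop:
		}
	}()
	return stop
}

func main() {
	before := runtime.NumGoroutine()
	for i := 0; i < 100; i++ {
		fetchLeaky()
	}
	fmt.Printf("leaky: before=%d after=%d\n", before, runtime.NumGoroutine())

	before = runtime.NumGoroutine()
	var stop chan struct{}
	for i := 0; i < 100; i++ {
		stop = fetchFixed(stop)
	}
	time.Sleep(100 * time.Millisecond) // let the deposed goroutines exit
	fmt.Printf("fixed: before=%d after=%d\n", before, runtime.NumGoroutine())
}
```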