cache: prevent goroutine leak in agent cache #14908

Merged: 1 commit merged into main from fix-agent-cache-leak on Oct 17, 2022

Conversation

@rboyer (Member) commented Oct 6, 2022

Description

There is a bug in the error handling code for the Agent cache subsystem discovered by @boxofrad:

  1. NotifyCallback calls notifyBlockingQuery which calls getWithIndex in a loop (which backs off on error up to 1 minute)
  2. getWithIndex calls fetch if there’s no valid entry in the cache
  3. fetch starts a goroutine which calls Fetch on the cache-type, waits for a while (again with backoff up to 1 minute for errors) and then calls fetch to trigger a refresh

The end result is that every minute notifyBlockingQuery spawns an ancestry of goroutines that essentially lives forever.

This PR ensures that the goroutine started by fetch cancels any prior goroutine spawned by the same line for the same key.

In isolated testing where a cache type was tweaked to indefinitely error, this patch prevented goroutine counts from skyrocketing.
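
A hypothetical sketch of that kind of isolated check (not the actual test from this PR; startWatchAgainstAlwaysErroringType is a stand-in for wiring up a watch against a cache type whose Fetch always errors):

```go
package cache

import (
	"context"
	"runtime"
	"testing"
	"time"
)

func TestWatchDoesNotLeakGoroutines(t *testing.T) {
	start := runtime.NumGoroutine()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Stand-in for registering a cache type whose Fetch always returns an
	// error and starting a Notify watch against it.
	startWatchAgainstAlwaysErroringType(ctx, t)

	// Let several error/backoff cycles elapse.
	time.Sleep(5 * time.Second)

	// Without the fix the goroutine count keeps climbing; with it the count
	// should stay roughly flat.
	if now := runtime.NumGoroutine(); now > start+10 {
		t.Fatalf("goroutine count grew from %d to %d; likely leak", start, now)
	}
}
```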

@rboyer rboyer requested a review from a team October 6, 2022 21:38
@rboyer rboyer self-assigned this Oct 6, 2022
@rboyer rboyer requested review from dhiaayachi, boxofrad and riddhi89 and removed request for a team October 6, 2022 21:38
@github-actions github-actions bot added the theme/agent-cache Agent Cache label Oct 6, 2022
@dhiaayachi (Contributor) left a comment:

Thank you both for finding and fixing this, @rboyer and @boxofrad.
I'm wondering if we can fix this in a simpler way by using singleflight on the goroutine spawned by fetch 🤔. We already have a key passed to fetch that we can leverage for this.

@rboyer (Member Author) commented Oct 7, 2022

I think you'd need something like "preemptible singleflight" to match what this cache is currently doing. Basically, if you try to do a fetch for X and there is already a fetch for X in flight, cancel the old one and make a new one. singleflight would instead avoid making a new one and wait for the old one.
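
For illustration, a rough sketch of the difference, assuming golang.org/x/sync/singleflight for the plain behaviour; expensiveFetch and preemptAndFetch are illustrative stand-ins, not Consul code:

```go
package cache

import (
	"context"
	"sync"

	"golang.org/x/sync/singleflight"
)

// expensiveFetch stands in for the real cache-type Fetch call.
func expensiveFetch(ctx context.Context, key string) (interface{}, error) {
	return nil, ctx.Err()
}

// Plain singleflight: a second caller for key "X" waits for the in-flight
// call's result rather than starting another one.
var group singleflight.Group

func getViaSingleflight(key string) (interface{}, error) {
	v, err, _ := group.Do(key, func() (interface{}, error) {
		return expensiveFetch(context.Background(), key)
	})
	return v, err
}

// "Preemptible singleflight": a second caller for key "X" cancels the
// in-flight fetch and starts a fresh one, which is the behaviour the agent
// cache currently relies on.
var (
	mu      sync.Mutex
	cancels = map[string]context.CancelFunc{}
)

func preemptAndFetch(key string) (interface{}, error) {
	mu.Lock()
	if cancel, ok := cancels[key]; ok {
		cancel() // preempt the previous fetch for this key
	}
	ctx, cancel := context.WithCancel(context.Background())
	cancels[key] = cancel
	mu.Unlock()

	return expensiveFetch(ctx, key)
}
```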

@dhiaayachi (Contributor) commented Oct 7, 2022

> I think you'd need something like "preemptible singleflight" to match what this cache is currently doing. Basically, if you try to do a fetch for X and there is already a fetch for X in flight, cancel the old one and make a new one. singleflight would instead avoid making a new one and wait for the old one.

Mostly for my own knowledge: why do we need to cancel the old one and start a new one? Wouldn't they produce the same result, namely the element for that key being stored/refreshed in the cache?

@dhiaayachi (Contributor) left a comment:

@rboyer I had a few questions about this. I'm not sure I understand the logic for notifyBlockingQueryOnce.
If you think it's easier we can do a synchronous review of this early next week.

agent/cache/watch.go: 4 resolved review threads (outdated)
@boxofrad (Contributor) left a comment:

Thanks for picking this up @rboyer 🙌🏻

This approach makes sense to me, but I'll reiterate that I don't fully understand everything that's going on here (so take my approval with a pinch of salt!).

Is there somebody else who has more context? Maybe @banks or @mkeeler?

agent/cache/cache.go, agent/cache/watch.go: resolved review threads (outdated)
@rboyer (Member Author) commented Oct 10, 2022

> Mostly for my own knowledge: why do we need to cancel the old one and start a new one? Wouldn't they produce the same result, namely the element for that key being stored/refreshed in the cache?
@dhiaayachi

The more I stare at this cache code (which I don't fully understand), the more I feel like perhaps this solution I have here is wrong, the actual cache code itself is similarly wrong, and the two wrongs cancel out to make a bigger wrong (removing the cache-ability of good requests at the expense of fixing this perceived bug).
🤔

@rboyer (Member Author) commented Oct 10, 2022

OK I changed this 100%.

Now it solely deals with the goroutine that fetch spawns. Rather than handling the cancellation at the level of getWithIndex, the fetch-spawned goroutine (along with the chain of child goroutines it spawns) is assigned both a unique identity and a stop channel, which are stored in a map.

If another fetch for the same key reaches the line of code that fires off the goroutine, it cancels the former goroutine before spawning the new one.

To ensure the tracking map is cleaned up at the right time, the unique identity is used in a defer inside the goroutine so that the item in the map for this key is only deleted if the identity matches.
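
A minimal sketch of that shape (fetchHandle and getOrReplaceFetchHandle are names that appear in the diff, but the field layout and bodies below are an illustrative reconstruction, not the actual change):

```go
package cache

import "sync"

type fetchHandle struct {
	id     uint64
	stopCh chan struct{}
}

type Cache struct {
	fetchLock    sync.Mutex
	lastFetchID  uint64
	fetchHandles map[string]fetchHandle
	// ...the rest of the real Cache fields are elided...
}

// getOrReplaceFetchHandle stops any goroutine chain already registered for
// key and registers a fresh handle for the caller.
func (c *Cache) getOrReplaceFetchHandle(key string) fetchHandle {
	c.fetchLock.Lock()
	defer c.fetchLock.Unlock()

	if prev, ok := c.fetchHandles[key]; ok {
		close(prev.stopCh) // tell the previous chain to stop
	}

	c.lastFetchID++
	h := fetchHandle{id: c.lastFetchID, stopCh: make(chan struct{})}
	c.fetchHandles[key] = h
	return h
}

// deleteFetchHandle is what the deferred cleanup would call: it removes the
// tracking entry only if it still belongs to the goroutine identified by id,
// so a newer fetch that has taken over the key is left untouched.
func (c *Cache) deleteFetchHandle(key string, id uint64) {
	c.fetchLock.Lock()
	defer c.fetchLock.Unlock()

	if h, ok := c.fetchHandles[key]; ok && h.id == id {
		delete(c.fetchHandles, key)
	}
}
```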

// If we're over the attempt minimum, start an exponential backoff.
if wait := backOffWait(attempt); wait > 0 {
	time.Sleep(wait)
@rboyer (Member Author) commented on the diff snippet above:

After the initial refactor it turned out that the goroutine leak was back!

Then I changed the time.Sleep calls into cancellable waits via the select pattern and the leak went away.
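
A minimal sketch of the pattern, wrapped in a hypothetical helper rather than copied from the diff:

```go
package cache

import "time"

// waitOrStop shows the cancellable replacement for a bare time.Sleep(wait):
// it returns false when the stop channel fires first, meaning a newer fetch
// has taken over this key and the caller should return instead of re-entering
// fetch.
func waitOrStop(wait time.Duration, stopCh <-chan struct{}) bool {
	select {
	case <-time.After(wait):
		return true // backoff (or refresh timer) elapsed; continue refreshing
	case <-stopCh:
		return false // deposed by a newer fetch; stop this goroutine chain
	}
}
```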

@@ -18,6 +18,7 @@ import (
 	"github.com/hashicorp/consul/acl"
 	"github.com/hashicorp/consul/lib/ttlcache"
 	"github.com/hashicorp/consul/sdk/testutil"
+	"github.com/hashicorp/consul/sdk/testutil/retry"
@rboyer (Member Author) commented on the diff snippet above:

The changes in this file remove a flake from the -race version of these tests.

@banks (Member) commented Oct 10, 2022

Thanks all for finding this and working on it.

At first glance I'm not super clear on where the original leak happens. The cache was supposed to single-flight requests, so we didn't support cancellation, because there should only ever be at most one Fetch call in flight at any one time for a given type and key. Those goroutines might continue for a long time (the refresh loop) even if the item is no longer being watched, but they don't "leak" in the sense that when the cache item's TTL expires (72 hours in some cases) the entry is removed and the background refresh loop stops the next time it unblocks (at most 10 minutes later).

The PR description mentions error handling, but I'm not quite sure what the actual steps to cause the leak are. Could you explain how the leak happens as far as you understand it, RB?

Is the problem that we are removing the item from cache on an error but not stopping its background fetch path? If so that's surely a bug but that's not what this code change is fixing... If the entry is still there, how are we ending up with a different fetch loop for the same key?

> This PR ensures that the goroutine started by fetch cancels any prior goroutine spawned by the same line for the same key.

My mental model for the cache was that it should not be possible for there to be any other goroutine for the same key if we ever get to the line that spawns a new one...

The PR here looks reasonable, but I'm not really clear on how the leak shows up, so it's a little hard to judge. I'm also interested in the details being super clearly documented here because I'm super wary of how complex the agent cache is - we've often ended up introducing other bugs when we try to fix bugs in here! I'm assuming it is a correct fix, but it would be awesome to have really good info about exactly why we think it's the right fix for the next time we have to come back here and understand it!

@@ -836,11 +872,46 @@ func (c *Cache) fetch(key string, r getOptions, allowNew bool, attempt uint, ign
	r.Info.MinIndex = 0
	c.fetch(key, r, false, attempt, true)
@rboyer (Member Author) commented on the diff snippet above:

This is where fetch calls itself, but it does so in a goroutine and then returns, which exits the original goroutine. The net effect of this call is that the total goroutine count stays the same: a new goroutine starts as the old one exits.

@banks (Member) commented Oct 11, 2022

@rboyer and I looked at this together yesterday. I want to record my understanding before it slips through my fingers again!

The intended behaviour (simplified from actual code for clarity) here for refreshing cache types is:

  1. Client fetches (via either Get or Notify; both internally call getWithIndex).
  2. getWithIndex calls fetch, which returns the waiter for the cache item (creating the item if it didn't already exist, but just returning the waiter if it did). The client goroutine (G1) is now blocked on that Waiter.
  3. Meanwhile, if fetch found no existing "fetching" cache entry, it created one with Fetching == true and spawned a new anonymous goroutine (G2) that I'll refer to as fetch.internal.
  4. fetch.internal does the actual blocking RPC to the servers (possibly blocking for a while). When it is done, either with a new result, the same result (blocking timeout), or an error, it updates the cache entry and sets Fetching == false. Once it's released the lock, it signals the old Waiter, which unblocks any waiting goroutines.
  5. G1 (the client goroutine) unblocks, finds the update or error, and returns (usually).
  6. Meanwhile... fetch.internal prepares to continue. It waits if there was an error and some backoff is needed; it also waits if there is a RefreshTimer configured for that type. All the time it's waiting, the cache entry exists but with Fetching == false.
  7. When it's done waiting, the fetch.internal goroutine recursively calls fetch in the same goroutine (G2).
  8. G2 now makes it through the fetch method back to the point where it decides if it should continue. Since Fetching == false (probably; we'll get to that), it effectively goes back to step 3 above: it sets the entry to Fetching == true and spawns a new goroutine (G3) to run a new fetch.internal.

The intent of all this is that the cache has "single flighted" the request, so that many concurrent requests will just see that the existing entry has a waiter and will wait on that. The background refresh "loop" (which is really tail-call recursion) continues until the cache entry is evicted (which breaks the tail call), which won't happen until the TTL has passed without an active request for that entry.
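
A heavily simplified sketch of that intended design, using stand-in types and helpers rather than the real source:

```go
package cache

import "sync"

type cacheEntry struct {
	Value    interface{}
	Fetching bool
	Waiter   chan struct{} // closed when the in-flight fetch completes
}

type Cache struct {
	lock    sync.Mutex
	entries map[string]cacheEntry
	typ     interface {
		Fetch(key string) (interface{}, error)
	} // stand-in for the registered cache type
}

// waitForBackoffOrRefreshTimer stands in for the error backoff / RefreshTimer wait.
func waitForBackoffOrRefreshTimer(err error) {}

func (c *Cache) fetch(key string) <-chan struct{} {
	c.lock.Lock()
	defer c.lock.Unlock()

	entry, ok := c.entries[key]
	if ok && entry.Fetching {
		return entry.Waiter // single-flight: just wait on the fetch in flight
	}

	entry = cacheEntry{Fetching: true, Waiter: make(chan struct{})}
	c.entries[key] = entry

	go func() { // "fetch.internal" in the description above
		result, err := c.typ.Fetch(key) // the actual blocking RPC to the servers

		c.lock.Lock()
		c.entries[key] = cacheEntry{Value: result, Fetching: false, Waiter: make(chan struct{})}
		c.lock.Unlock()
		close(entry.Waiter) // unblock every goroutine waiting on this fetch

		waitForBackoffOrRefreshTimer(err) // Fetching == false this whole time
		c.fetch(key)                      // recursive "refresh loop" (step 7 above)
	}()

	return entry.Waiter
}
```

The window flagged by the comments on the last two lines of the goroutine is exactly what "The Bug" below describes.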

The Bug

So the intention is that only a single fetching routine is active at any one time for a given cache key... but there is a major flaw in the design that makes that not true in quite a lot of cases!

For the whole time that the fetch.internal call is waiting before recursing back into fetch for a refresh, a new request for the same key will see Fetching == false and so will spawn a new, separate fetch.internal.

So it's fatally flawed! But because of the complexities of the recursive calls, it's really hard to fix cleanly.

  • If we just don't set Fetching == false once the RPC is done, then when the same goroutine re-enters fetch it will see Fetching == true and so will just return the existing waiter instead of starting a new fetch, breaking the background refresh loop.
  • If we add another field that explicitly marks the entry as "refreshing" and also use that to decide whether to spawn a new goroutine, you hit the same problem as above.
  • Singleflight doesn't work because that also breaks the recursive refresh loop in a similar way: it sees the request as already "in progress" when it re-enters and so stops, breaking the refresh loop for everyone.
  • If we just assume that a refreshing cache type always has a running background goroutine (generally true) and so only ever spawn one when we actually create a new entry, that is probably right almost always, except that we currently exit the background refresh loop on an "ACL not found" error. Which means that in some edge cases where ACL tokens were not found on servers because they were slightly stale, the background loop would silently go away and never recover for the lifetime of that agent (or until clients stopped requesting it for the TTL). We probably implicitly rely on subsequent fetches re-spawning the background routine to mask temporary "ACL not found" errors due to replication lag/stale queries.
    • We didn't verify this; it's possible it would be OK, but it's another risk factor for "breaking" something that works.
    • We also originally thought it would be a problem if the ACL token was deleted and recreated, but that can't happen: if it's deleted then it's OK to stop forever, because ACL tokens can't be recreated with the same token secret, and because the token is part of the cache key, so if a new token is created and distributed to the client it will be a different cache entry anyway.
    • I also wondered if this would be impacted by other ACL failures which can happen, like the token having a policy removed and re-added, but the code specifically only breaks on "ACL (token) not found" and not just any permission-denied error.

@rboyer and I concluded that this whole thing could probably be a lot simpler to reason about if we refactored it so that the background refresh was a single method with a for loop instead of recursive calls. I don't know if RB has looked at that since we spoke, but our guess is that it's a big enough change that, given the current history of breaking the agent cache whenever we touch it and the desire to fix this issue soon, we'd be very nervous about rushing it in.

So the approach @rboyer took here fixes the bug (we think without causing any new ones) specifically because:

  1. There is a separate data structure tracking which keys have active fetches. (The main cache could do this but it's complicated because we replace the cache entry each time it updates)
  2. The separate data structure tracks not just boolean state, but also identifies each entrance into fetch.internal with a unique ID.
  3. Each entrance into fetch.internal closes the chan of the previous handle, which is checked both after blocking in the RPC and after waiting for any backoff, before re-entering fetch. This breaks the old goroutine chain.
  4. Each fetch.internal defers a cleanup of its own handle, but leaves the state alone if another routine has taken over since then, which it can tell because it has a known ID.

The tl;dr: this is very much a band-aid fix for a bug that violates the core design goal of the cache. But we think it's the option least likely to cause additional bugs, and so the safest to merge quickly (this week) so it can be in the latest release.

As noted inline, while this seems easier to reason about than some alternatives, it probably does have complex consequences, like my inline hypothesis that it could cause us to bypass backoff waits on errors in some cases, which could DDoS the servers.

select {
case <-time.After(totalWait):
case <-handle.stopCh:
	return
A reviewer (Member) commented on the diff snippet above:

What happens if another fetch happens in another goroutine right after this point but before our own fetch call?

  • The other fetch will see our handle, close our chan (no effect) and replace it with their own with a greater ID (say 10).
  • The other fetch will then spawn fetch.internal and start an RPC.
  • We continue into our own fetch, see the handle with ID 10 and close its chan, make our new one with ID 11, and then re-enter fetch.internal.
  • The other fetch.internal returns from its RPC, sees its chan closed, and exits (wasted RPC).
  • We complete our RPC and, assuming no new one has started, update the cache entry.

Conclusion: seems safe although it does mean we still end up doing some redundant RPCs. I think that's OK as a quick fix.

Hmmm, so one thing this could possibly cause would be effectively a DoS of the servers on error. Normally we'd have the retry backoff protecting the servers if repeated errors happen. It would look like this:

  1. fetch.internal 1 gets an error, updates the entry, starts its backoff wait
  2. the client loop has a smaller backoff (or none), so it retries, which results in a new fetch 2
  3. fetch 2 will see Fetching == false and so spawn a fetch.internal 2 while closing the stop chan for fetch.internal 1. It will immediately make another request to the servers even though we are supposed to be waiting.
  4. eventually fetch.internal 1 finishes its backoff wait, sees it's been deposed, and just returns
  5. fetch.internal 2 also hits an error, starts backoff
  6. the client retries sooner again, etc.

We might be OK because our internal cache clients generally use Notify, which probably does the right thing in that it does its own backoff loop IIRC. But it seems at least a bit risky that this fix potentially means we'll bypass error backoffs?

@rboyer (Member Author) replied Oct 17, 2022:

> Hmmm, so one thing this could possibly cause would be effectively a DoS of the servers on error.

That's actually already happening on error with the existing code, since an error leaves the entry with Fetching=false.

}

select {
case <-time.After(totalWait):
A reviewer (Member) commented on the diff snippet above:

I find this a little subtle to read - it takes me a while to convince myself that the first case is not falling through to the return as well...

It's a nit, as most Go programmers are probably OK with this, but I'd consider adding a comment to explicitly call out that the first case just continues execution. Or re-order the cases?
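
For example (illustrative only; totalWait and handle.stopCh are the names from the snippet above, and re-ordering puts the early return first):

```go
select {
case <-handle.stopCh:
	// A newer fetch has taken over this key; stop here.
	return
case <-time.After(totalWait):
	// The wait elapsed; execution simply continues below this select.
}
```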

@dhiaayachi (Contributor) left a comment:

This looks good! I think this approach would work. I have a few questions which could lead to simplifications.

agent/cache/cache.go: 3 review threads (resolved; 2 outdated)
@rboyer (Member Author) commented Oct 11, 2022

Here is the more complete attempt at fixing this problem by removing the weird recursion and then introducing a new tracking variable to the cacheEntry type: #14956

@boxofrad (Contributor) left a comment:

Thanks again @rboyer and @banks for prioritizing this, and the thorough write-up ⭐

It seems safe to me (acknowledging the DoS vector you've called out), but with that said, #14924 solves the immediate problem for agentless so if you'd rather hold off for the more complete fix in #14956 that'd be fine for us too.

@rboyer added the backport/1.12, backport/1.13, and backport/1.14 labels and removed backport/1.14, Oct 17, 2022
@rboyer (Member Author) commented Oct 17, 2022

I've rephrased the follow-on PR to better state that it fixes the RPC issue, since this PR fixes the goroutine leak. That should make it easier to rationalize landing both changes on main for 1.15.0.

@rboyer rboyer changed the title cache: prevent goroutine leak in agent cache when fetching results errors cache: prevent goroutine leak in agent cache Oct 17, 2022
@dhiaayachi (Contributor) left a comment:

Thank you @rboyer for chasing this.
I have a small aesthetic comment, but it's not blocking.

// By the time we get here the system WANTS to make a replacement fetcher, so
// we terminate the prior one and replace it.
handle := c.getOrReplaceFetchHandle(key)
go func(handle fetchHandle) {
A reviewer (Contributor) commented on the snippet above:

Maybe we can consider extracting this giant anonymous function into a named function.

@rboyer (Member Author) replied:

That's a side effect of the other PR (#14956) already.

Actually doing it is fiddly: some tracking state that is currently part of the closure would need to be adjusted into input/output parameters.

I did it on the road to #14956, and I wouldn't want to do that here in the band-aid fix for fear of introducing an inadvertent bug in the data flow.

@rboyer rboyer merged commit fe2d41d into main Oct 17, 2022
@rboyer rboyer deleted the fix-agent-cache-leak branch October 17, 2022 19:38
rboyer added a commit that referenced this pull request Oct 17, 2022
rboyer added a commit that referenced this pull request Oct 24, 2022
rboyer added a commit that referenced this pull request Oct 24, 2022
rboyer added a commit that referenced this pull request Oct 25, 2022
…n error (#14956)

This continues the work done in #14908 where a crude solution to prevent a
goroutine leak was implemented. The former code would launch an additional
perpetual goroutine family every iteration (+1, +1), and the fixed code simply
has each new goroutine family first cancel the prior one to prevent the
leak (-1 +1 == 0).

This PR refactors this code completely to:

- make it more understandable
- remove the recursion-via-goroutine strangeness
- prevent unnecessary RPC fetches when the prior one has errored.

The core issue arose from a conflation of the entry.Fetching field to mean:

- there is an RPC (blocking query) in flight right now
- there is a goroutine running to manage the RPC fetch retry loop

The problem is that the goroutine-leak-avoidance check would treat
Fetching like (2), but within the body of a goroutine it would flip that
boolean back to false before the retry sleep. This would cause a new
chain of goroutines to launch which #14908 would correct crudely.

The refactored code uses a plain for-loop and changes the semantics
to track state for "is there a goroutine associated with this cache entry"
instead of the former.

We use a uint64 unique identity per goroutine instead of a boolean so that
an orphaned goroutine can tell that it has been replaced, such as when the
expiry loop deletes a cache entry while the goroutine is still running and
the entry is later replaced.
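
An illustrative sketch of that loop-based shape (not the actual #14956 diff; storeResult, entryStillOwnedBy, sleepForBackoffOrRefreshTimer, and typ are hypothetical stand-ins):

```go
// runBackgroundFetcher is a sketch of the refactored refresh loop: one
// goroutine per cache entry, identified by a uint64, that keeps refreshing
// until it notices it has been replaced or its entry has been evicted.
func (c *Cache) runBackgroundFetcher(key string, myID uint64) {
	go func() {
		for {
			result, err := c.typ.Fetch(key) // blocking query against the servers
			c.storeResult(key, result, err)

			// The uint64 identity lets an orphaned goroutine notice that the
			// expiry loop evicted its entry, or that a replacement goroutine
			// now owns the key, and exit instead of refreshing forever.
			if !c.entryStillOwnedBy(key, myID) {
				return
			}

			sleepForBackoffOrRefreshTimer(err)
		}
	}()
}
```
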
Labels
backport/1.12, backport/1.13, pr/no-metrics-test, theme/agent-cache (Agent Cache)