cache: prevent goroutine leak in agent cache #14908

Merged: 1 commit merged into main from fix-agent-cache-leak on Oct 17, 2022

Conversation

@rboyer (Member) commented Oct 6, 2022

Description

There is a bug in the error handling code for the Agent cache subsystem discovered by @boxofrad:

  1. NotifyCallback calls notifyBlockingQuery which calls getWithIndex in a loop (which backs off on error up to 1 minute)
  2. getWithIndex calls fetch if there’s no valid entry in the cache
  3. fetch starts a goroutine which calls Fetch on the cache-type, waits for a while (again with backoff up to 1 minute for errors) and then calls fetch to trigger a refresh

The end result is that every minute notifyBlockingQuery spawns an ancestry of goroutines that essentially lives forever.

This PR ensures that the goroutine started by fetch cancels any prior goroutine spawned by the same line for the same key.

In isolated testing where a cache type was tweaked to indefinitely error, this patch prevented goroutine counts from skyrocketing.
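
A hypothetical sketch of that kind of isolated check (not the actual test from this PR; startWatchAgainstAlwaysErroringType is a stand-in for wiring up a watch against a cache type whose Fetch always errors):

```go
package cache

import (
	"context"
	"runtime"
	"testing"
	"time"
)

func TestWatchDoesNotLeakGoroutines(t *testing.T) {
	start := runtime.NumGoroutine()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Stand-in for registering a cache type whose Fetch always returns an
	// error and starting a Notify watch against it.
	startWatchAgainstAlwaysErroringType(ctx, t)

	// Let several error/backoff cycles elapse.
	time.Sleep(5 * time.Second)

	// Without the fix the goroutine count keeps climbing; with it the count
	// should stay roughly flat.
	if now := runtime.NumGoroutine(); now > start+10 {
		t.Fatalf("goroutine count grew from %d to %d; likely leak", start, now)
	}
}
```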

@rboyer rboyer requested a review from a team October 6, 2022 21:38
@rboyer rboyer self-assigned this Oct 6, 2022
@rboyer rboyer requested review from dhiaayachi, boxofrad and riddhi89 and removed request for a team October 6, 2022 21:38
@github-actions github-actions bot added the theme/agent-cache Agent Cache label Oct 6, 2022
@dhiaayachi (Contributor) left a comment:

Thank you both for finding and fixing this, @rboyer and @boxofrad.
I'm wondering if we can fix this in a simpler way by using singleflight on the goroutine spawned by fetch 🤔. We already have a key passed to fetch that we can leverage for this.

@rboyer (Member Author) commented Oct 7, 2022

I think you'd need something like "preemptible singleflight" to match what this cache is currently doing. Basically, if you try to do a fetch for X and there is already a fetch for X in flight, cancel the old one and make a new one. singleflight would instead avoid making a new one and wait for the old one.
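
For illustration, a rough sketch of the difference, assuming golang.org/x/sync/singleflight for the plain behaviour; expensiveFetch and preemptAndFetch are illustrative stand-ins, not Consul code:

```go
package cache

import (
	"context"
	"sync"

	"golang.org/x/sync/singleflight"
)

// expensiveFetch stands in for the real cache-type Fetch call.
func expensiveFetch(ctx context.Context, key string) (interface{}, error) {
	return nil, ctx.Err()
}

// Plain singleflight: a second caller for key "X" waits for the in-flight
// call's result rather than starting another one.
var group singleflight.Group

func getViaSingleflight(key string) (interface{}, error) {
	v, err, _ := group.Do(key, func() (interface{}, error) {
		return expensiveFetch(context.Background(), key)
	})
	return v, err
}

// "Preemptible singleflight": a second caller for key "X" cancels the
// in-flight fetch and starts a fresh one, which is the behaviour the agent
// cache currently relies on.
var (
	mu      sync.Mutex
	cancels = map[string]context.CancelFunc{}
)

func preemptAndFetch(key string) (interface{}, error) {
	mu.Lock()
	if cancel, ok := cancels[key]; ok {
		cancel() // preempt the previous fetch for this key
	}
	ctx, cancel := context.WithCancel(context.Background())
	cancels[key] = cancel
	mu.Unlock()

	return expensiveFetch(ctx, key)
}
```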

@dhiaayachi (Contributor) commented Oct 7, 2022

> I think you'd need something like "preemptible singleflight" to match what this cache is currently doing. Basically, if you try to do a fetch for X and there is already a fetch for X in flight, cancel the old one and make a new one. singleflight would instead avoid making a new one and wait for the old one.

Mostly for my own knowledge: why do we need to cancel the old one and start a new one? Wouldn't they produce the same result, namely the element for that key being stored/refreshed in the cache?

@dhiaayachi (Contributor) left a comment:

@rboyer I had a few questions about this. I'm not sure I understand the logic for notifyBlockingQueryOnce.
If you think it's easier we can do a synchronous review of this early next week.

agent/cache/watch.go: 4 resolved review threads (outdated)
@boxofrad (Contributor) left a comment:

Thanks for picking this up @rboyer 🙌🏻

This approach makes sense to me, but I'll reiterate that I don't fully understand everything that's going on here (so take my approval with a pinch of salt!).

Is there somebody else who has more context? Maybe @banks or @mkeeler?

agent/cache/cache.go, agent/cache/watch.go: resolved review threads (outdated)
@rboyer (Member Author) commented Oct 10, 2022

> Mostly for my own knowledge: why do we need to cancel the old one and start a new one? Wouldn't they produce the same result, namely the element for that key being stored/refreshed in the cache?
@dhiaayachi

The more I stare at this cache code (which I don't fully understand), the more I feel like perhaps this solution I have here is wrong, the actual cache code itself is similarly wrong, and the two wrongs cancel out to make a bigger wrong (removing the cache-ability of good requests at the expense of fixing this perceived bug).
🤔

@rboyer (Member Author) commented Oct 10, 2022

OK I changed this 100%.

Now it solely deals with the goroutine that fetch spawns. Rather than handling the cancellation at the level of getWithIndex, the fetch-spawned goroutine (along with the chain of child goroutines it spawns) is assigned both a unique identity and a stop channel, which are stored in a map.

If another fetch for the same key reaches the line of code that fires off the goroutine, it cancels the former goroutine before spawning the new one.

To ensure the tracking map is cleaned up at the right time, the unique identity is used in a defer inside the goroutine so that the item in the map for this key is only deleted if the identity matches.
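
A minimal sketch of that shape (fetchHandle and getOrReplaceFetchHandle are names that appear in the diff, but the field layout and bodies below are an illustrative reconstruction, not the actual change):

```go
package cache

import "sync"

type fetchHandle struct {
	id     uint64
	stopCh chan struct{}
}

type Cache struct {
	fetchLock    sync.Mutex
	lastFetchID  uint64
	fetchHandles map[string]fetchHandle
	// ...the rest of the real Cache fields are elided...
}

// getOrReplaceFetchHandle stops any goroutine chain already registered for
// key and registers a fresh handle for the caller.
func (c *Cache) getOrReplaceFetchHandle(key string) fetchHandle {
	c.fetchLock.Lock()
	defer c.fetchLock.Unlock()

	if prev, ok := c.fetchHandles[key]; ok {
		close(prev.stopCh) // tell the previous chain to stop
	}

	c.lastFetchID++
	h := fetchHandle{id: c.lastFetchID, stopCh: make(chan struct{})}
	c.fetchHandles[key] = h
	return h
}

// deleteFetchHandle is what the deferred cleanup would call: it removes the
// tracking entry only if it still belongs to the goroutine identified by id,
// so a newer fetch that has taken over the key is left untouched.
func (c *Cache) deleteFetchHandle(key string, id uint64) {
	c.fetchLock.Lock()
	defer c.fetchLock.Unlock()

	if h, ok := c.fetchHandles[key]; ok && h.id == id {
		delete(c.fetchHandles, key)
	}
}
```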

// If we're over the attempt minimum, start an exponential backoff.
if wait := backOffWait(attempt); wait > 0 {
	time.Sleep(wait)
@rboyer (Member Author) commented on the diff snippet above:

After the initial refactor it turned out that the goroutine leak was back!

Then I changed the time.Sleep calls into cancellable waits via the select pattern and the leak went away.
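
A minimal sketch of the pattern, wrapped in a hypothetical helper rather than copied from the diff:

```go
package cache

import "time"

// waitOrStop shows the cancellable replacement for a bare time.Sleep(wait):
// it returns false when the stop channel fires first, meaning a newer fetch
// has taken over this key and the caller should return instead of re-entering
// fetch.
func waitOrStop(wait time.Duration, stopCh <-chan struct{}) bool {
	select {
	case <-time.After(wait):
		return true // backoff (or refresh timer) elapsed; continue refreshing
	case <-stopCh:
		return false // deposed by a newer fetch; stop this goroutine chain
	}
}
```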

@@ -18,6 +18,7 @@ import (
 	"github.com/hashicorp/consul/acl"
 	"github.com/hashicorp/consul/lib/ttlcache"
 	"github.com/hashicorp/consul/sdk/testutil"
+	"github.com/hashicorp/consul/sdk/testutil/retry"
@rboyer (Member Author) commented on the diff snippet above:

The changes in this file remove a flake from the -race version of these tests.

@banks (Member) commented Oct 10, 2022

Thanks all for finding this and working on it.

At first glance I'm not super clear on where the original leak happens. The cache was supposed to single-flight requests, so we didn't support cancellation, because there should only ever be at most one Fetch call in flight at any one time for a given type and key. Those goroutines might continue for a long time (the refresh loop) even if the item is no longer being watched, but they don't "leak" in the sense that when the cache item's TTL expires (72 hours in some cases) the entry is removed and the background refresh loop stops the next time it unblocks (at most 10 minutes later).

The PR description mentions error handling, but I'm not quite sure what the actual steps to cause the leak are. Could you explain how the leak happens as far as you understand it, RB?

Is the problem that we are removing the item from cache on an error but not stopping its background fetch path? If so that's surely a bug but that's not what this code change is fixing... If the entry is still there, how are we ending up with a different fetch loop for the same key?

> This PR ensures that the goroutine started by fetch cancels any prior goroutine spawned by the same line for the same key.

My mental model for the cache was that it should not be possible for there to be any other goroutine for the same key if we ever get to the line that spawns a new one...

The PR here looks reasonable, but I'm not really clear on how the leak shows up, so it's a little hard to judge. I'm also interested in the details being super clearly documented here because I'm super wary of how complex the agent cache is - we've often ended up introducing other bugs when we try to fix bugs in here! I'm assuming it is a correct fix, but it would be awesome to have really good info about exactly why we think it's the right fix for the next time we have to come back here and understand it!

@@ -836,11 +872,46 @@ func (c *Cache) fetch(key string, r getOptions, allowNew bool, attempt uint, ign
	r.Info.MinIndex = 0
	c.fetch(key, r, false, attempt, true)
@rboyer (Member Author) commented on the diff snippet above:

This is where fetch calls itself, but it does so in a goroutine and then returns, which exits the original goroutine. The net effect of this call is that the total goroutine count stays the same: a new goroutine starts as the old one exits.

@banks (Member) commented Oct 11, 2022

@rboyer and I looked at this together yesterday. I want to record my understanding before it slips through my fingers again!

The intended behaviour (simplified from actual code for clarity) here for refreshing cache types is:

  1. Client fetches (via either Get or Notify; both internally call getWithIndex).
  2. getWithIndex calls fetch, which returns the waiter for the cache item (creating the item if it didn't already exist, but just returning the waiter if it did). The client goroutine (G1) is now blocked on that Waiter.
  3. Meanwhile, if fetch found no existing "fetching" cache entry, it created one with Fetching == true and spawned a new anonymous goroutine (G2) that I'll refer to as fetch.internal.
  4. fetch.internal does the actual blocking RPC to the servers (possibly blocking for a while). When it is done, either with a new result, the same result (blocking timeout), or an error, it updates the cache entry and sets Fetching == false. Once it's released the lock, it signals the old Waiter, which unblocks any waiting goroutines.
  5. G1 (the client goroutine) unblocks, finds the update or error, and returns (usually).
  6. Meanwhile... fetch.internal prepares to continue. It waits if there was an error and some backoff is needed; it also waits if there is a RefreshTimer configured for that type. All the time it's waiting, the cache entry exists but with Fetching == false.
  7. When it's done waiting, the fetch.internal goroutine recursively calls fetch in the same goroutine (G2).
  8. G2 now makes it through the fetch method back to the point where it decides if it should continue. Since Fetching == false (probably; we'll get to that), it effectively goes back to step 3 above: it sets the entry to Fetching == true and spawns a new goroutine (G3) to run a new fetch.internal.

The intent of all this is that the cache has "single flighted" the request, so that many concurrent requests will just see that the existing entry has a waiter and will wait on that. The background refresh "loop" (which is really tail-call recursion) continues until the cache entry is evicted (which breaks the tail call), which won't happen until the TTL has passed without an active request for that entry.
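
A heavily simplified sketch of that intended design, using stand-in types and helpers rather than the real source:

```go
package cache

import "sync"

type cacheEntry struct {
	Value    interface{}
	Fetching bool
	Waiter   chan struct{} // closed when the in-flight fetch completes
}

type Cache struct {
	lock    sync.Mutex
	entries map[string]cacheEntry
	typ     interface {
		Fetch(key string) (interface{}, error)
	} // stand-in for the registered cache type
}

// waitForBackoffOrRefreshTimer stands in for the error backoff / RefreshTimer wait.
func waitForBackoffOrRefreshTimer(err error) {}

func (c *Cache) fetch(key string) <-chan struct{} {
	c.lock.Lock()
	defer c.lock.Unlock()

	entry, ok := c.entries[key]
	if ok && entry.Fetching {
		return entry.Waiter // single-flight: just wait on the fetch in flight
	}

	entry = cacheEntry{Fetching: true, Waiter: make(chan struct{})}
	c.entries[key] = entry

	go func() { // "fetch.internal" in the description above
		result, err := c.typ.Fetch(key) // the actual blocking RPC to the servers

		c.lock.Lock()
		c.entries[key] = cacheEntry{Value: result, Fetching: false, Waiter: make(chan struct{})}
		c.lock.Unlock()
		close(entry.Waiter) // unblock every goroutine waiting on this fetch

		waitForBackoffOrRefreshTimer(err) // Fetching == false this whole time
		c.fetch(key)                      // recursive "refresh loop" (step 7 above)
	}()

	return entry.Waiter
}
```

The window flagged by the comments on the last two lines of the goroutine is exactly what "The Bug" below describes.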

The Bug

So the intention is that only a single fetching routine is active at any one time for a given cache key... but there is a major flaw in the design that makes that not true in quite a lot of cases!

For the whole time that the fetch.internal call is waiting before recursing back into fetch for a refresh, a new request for the same key will see Fetching == false and so will spawn a new, separate fetch.internal.

So it's fatally flawed! But because of the complexities of the recursive calls, it's really hard to fix cleanly.

  • If we just don't set Fetching == false once the RPC is done, then when the same goroutine re-enters fetch it will see Fetching == true and so will just return the existing waiter instead of starting a new fetch, breaking the background refresh loop.
  • If we add another field that explicitly marks the entry as "refreshing" and also use that to decide whether to spawn a new goroutine, you hit the same problem as above.
  • Singleflight doesn't work because that also breaks the recursive refresh loop in a similar way: it sees the request as already "in progress" when it re-enters and so stops, breaking the refresh loop for everyone.
  • If we just assume that a refreshing cache type always has a running background goroutine (generally true) and so only ever spawn one when we actually create a new entry, that is probably right almost always, except that we currently exit the background refresh loop on an "ACL not found" error. Which means that in some edge cases where ACL tokens were not found on servers because they were slightly stale, the background loop would silently go away and never recover for the lifetime of that agent (or until clients stopped requesting it for the TTL). We probably implicitly rely on subsequent fetches re-spawning the background routine to mask temporary "ACL not found" errors due to replication lag/stale queries.
    • We didn't verify this; it's possible it would be OK, but it's another risk factor for "breaking" something that works.
    • We also originally thought it would be a problem if the ACL token was deleted and recreated, but that can't happen: if it's deleted then it's OK to stop forever, because ACL tokens can't be recreated with the same token secret, and because the token is part of the cache key, so if a new token is created and distributed to the client it will be a different cache entry anyway.
    • I also wondered if this would be impacted by other ACL failures which can happen, like the token having a policy removed and re-added, but the code specifically only breaks on "ACL (token) not found" and not just any permission-denied error.

@rboyer and I concluded that this whole thing could probably be a lot simpler to reason about if we refactored it so that the background refresh was a single method with a for loop instead of recursive calls. I don't know if RB has looked at that since we spoke, but our guess is that it's a big enough change that, given the current history of breaking the agent cache whenever we touch it and the desire to fix this issue soon, we'd be very nervous about rushing it in.

So the approach @rboyer took here fixes the bug (we think without causing any new ones) specifically because:

  1. There is a separate data structure tracking which keys have active fetches. (The main cache could do this but it's complicated because we replace the cache entry each time it updates)
  2. The separate data structure tracks not just boolean state, but also identifies each entrance into fetch.internal with a unique ID.
  3. Each entrance into fetch.internal closes the chan of the previous handle, which is checked both after blocking in the RPC and after waiting for any backoff, before re-entering fetch. This breaks the old goroutine chain.
  4. Each fetch.internal defers a cleanup of its own handle, but leaves the state alone if another routine has taken over since then, which it can tell because it has a known ID.

The tl;dr: this is very much a band-aid fix for a bug that violates the core design goal of the cache. But we think it's the option least likely to cause additional bugs, and so the safest to merge quickly (this week) so it can be in the latest release.

As noted inline, while this seems easier to reason about than some alternatives, it probably does have complex consequences, like my inline hypothesis that it could cause us to bypass backoff waits on errors in some cases, which could DDoS the servers.

select {
case <-time.After(totalWait):
case <-handle.stopCh:
	return
A reviewer (Member) commented on the diff snippet above:

What happens if another fetch happens in another goroutine right after this point but before our own fetch call?

  • The other fetch will see our handle, close our chan (no effect) and replace it with their own with a greater ID (say 10).
  • The other fetch will then spawn fetch.internal and start an RPC.
  • We continue into our own fetch, see the handle with ID 10 and close its chan, make our new one with ID 11, and then re-enter fetch.internal.
  • The other fetch.internal returns from its RPC, sees its chan closed, and exits (wasted RPC).
  • We complete our RPC and, assuming no new one has started, update the cache entry.

Conclusion: seems safe although it does mean we still end up doing some redundant RPCs. I think that's OK as a quick fix.

Hmmm, so one thing this could possibly cause would be effectively a DoS of the servers on error. Normally we'd have the retry backoff protecting the servers if repeated errors happen. It would look like this:

  1. fetch.internal 1 gets an error, updates the entry, starts its backoff wait
  2. the client loop has a smaller backoff (or none), so it retries, which results in a new fetch 2
  3. fetch 2 will see Fetching == false and so spawn a fetch.internal 2 while closing the stop chan for fetch.internal 1. It will immediately make another request to the servers even though we are supposed to be waiting.
  4. eventually fetch.internal 1 finishes its backoff wait, sees it's been deposed, and just returns
  5. fetch.internal 2 also hits an error, starts backoff
  6. the client retries sooner again, etc.

We might be OK because our internal cache clients generally use Notify, which probably does the right thing in that it does its own backoff loop IIRC. But it seems at least a bit risky that this fix potentially means we'll bypass error backoffs?

@rboyer (Member Author) replied Oct 17, 2022:

> Hmmm, so one thing this could possibly cause would be effectively a DoS of the servers on error.

That's actually already happening on error with the existing code, since an error leaves the entry with Fetching=false.

}

select {
case <-time.After(totalWait):
A reviewer (Member) commented on the diff snippet above:

I find this a little subtle to read - it takes me a while to convince myself that the first case is not falling through to the return as well...

It's a nit, as most Go programmers are probably OK with this, but I'd consider adding a comment to explicitly call out that the first case just continues execution. Or re-order the cases?
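
For example (illustrative only; totalWait and handle.stopCh are the names from the snippet above, and re-ordering puts the early return first):

```go
select {
case <-handle.stopCh:
	// A newer fetch has taken over this key; stop here.
	return
case <-time.After(totalWait):
	// The wait elapsed; execution simply continues below this select.
}
```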

@dhiaayachi (Contributor) left a comment:

This looks good! I think this approach would work. I have a few questions which could lead to simplifications.

agent/cache/cache.go: 3 review threads (resolved; 2 outdated)
@rboyer (Member Author) commented Oct 11, 2022

Here is the more complete attempt at fixing this problem by removing the weird recursion and then introducing a new tracking variable to the cacheEntry type: #14956

@boxofrad (Contributor) left a comment:

Thanks again @rboyer and @banks for prioritizing this, and the thorough write-up ⭐

It seems safe to me (acknowledging the DoS vector you've called out), but with that said, #14924 solves the immediate problem for agentless so if you'd rather hold off for the more complete fix in #14956 that'd be fine for us too.

@rboyer added the backport/1.12, backport/1.13, and backport/1.14 labels and removed backport/1.14, Oct 17, 2022
@rboyer (Member Author) commented Oct 17, 2022

I've rephrased the follow-on PR to better state that it fixes the RPC issue, since this PR fixes the goroutine leak. That should make it easier to rationalize landing both changes on main for 1.15.0.

@rboyer rboyer changed the title cache: prevent goroutine leak in agent cache when fetching results errors cache: prevent goroutine leak in agent cache Oct 17, 2022
@dhiaayachi (Contributor) left a comment:

Thank you @rboyer for chasing this.
I have a small aesthetic comment, but it's not blocking.

// By the time we get here the system WANTS to make a replacement fetcher, so
// we terminate the prior one and replace it.
handle := c.getOrReplaceFetchHandle(key)
go func(handle fetchHandle) {
A reviewer (Contributor) commented on the snippet above:

Maybe we can consider extracting this giant anonymous function into a named function.

@rboyer (Member Author) replied:

That's a side effect of the other PR (#14956) already.

Actually doing it is fiddly: some tracking state that is currently part of the closure would need to be adjusted into input/output parameters.

I did it on the road to #14956, and I wouldn't want to do that here in the band-aid fix for fear of introducing an inadvertent bug in the data flow.

@rboyer rboyer merged commit fe2d41d into main Oct 17, 2022
@rboyer rboyer deleted the fix-agent-cache-leak branch October 17, 2022 19:38
rboyer added a commit that referenced this pull request Oct 17, 2022
rboyer added a commit that referenced this pull request Oct 24, 2022
rboyer added a commit that referenced this pull request Oct 24, 2022
rboyer added a commit that referenced this pull request Oct 25, 2022
…n error (#14956)

This continues the work done in #14908 where a crude solution to prevent a
goroutine leak was implemented. The former code would launch an additional
perpetual goroutine family every iteration (+1, +1), and the fixed code simply
has each new goroutine family first cancel the prior one to prevent the
leak (-1 +1 == 0).

This PR refactors this code completely to:

- make it more understandable
- remove the recursion-via-goroutine strangeness
- prevent unnecessary RPC fetches when the prior one has errored.

The core issue arose from a conflation of the entry.Fetching field to mean:

- there is an RPC (blocking query) in flight right now
- there is a goroutine running to manage the RPC fetch retry loop

The problem is that the goroutine-leak-avoidance check would treat
Fetching like (2), but within the body of a goroutine it would flip that
boolean back to false before the retry sleep. This would cause a new
chain of goroutines to launch which #14908 would correct crudely.

The refactored code uses a plain for-loop and changes the semantics
to track state for "is there a goroutine associated with this cache entry"
instead of the former.

We use a uint64 unique identity per goroutine instead of a boolean so that
an orphaned goroutine can tell that it has been replaced, such as when the
expiry loop deletes a cache entry while the goroutine is still running and
the entry is later replaced.
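
An illustrative sketch of that loop-based shape (not the actual #14956 diff; storeResult, entryStillOwnedBy, sleepForBackoffOrRefreshTimer, and typ are hypothetical stand-ins):

```go
// runBackgroundFetcher is a sketch of the refactored refresh loop: one
// goroutine per cache entry, identified by a uint64, that keeps refreshing
// until it notices it has been replaced or its entry has been evicted.
func (c *Cache) runBackgroundFetcher(key string, myID uint64) {
	go func() {
		for {
			result, err := c.typ.Fetch(key) // blocking query against the servers
			c.storeResult(key, result, err)

			// The uint64 identity lets an orphaned goroutine notice that the
			// expiry loop evicted its entry, or that a replacement goroutine
			// now owns the key, and exit instead of refreshing forever.
			if !c.entryStillOwnedBy(key, myID) {
				return
			}

			sleepForBackoffOrRefreshTimer(err)
		}
	}()
}
```
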
Labels
backport/1.12, backport/1.13, pr/no-metrics-test, theme/agent-cache (Agent Cache)