
agent: add an inflight cache for better concurrent request handling #10705

Merged

merged 20 commits into master from agent-inflight-cache on Jan 26, 2021

Conversation
Conversation

calvn (Member) commented Jan 15, 2021

This PR refactors LeaseCache.Send to better handle identical concurrent requests that are processed at this layer.

The original code relied solely on an idLock, which would get updated to a write lock if the request resulted in a cache miss, meaning that non-cacheable identical concurrent requests (e.g. Vault KV reads) would be processed serially. It also was not properly protecting identical concurrent cacheable requests on a clean cache from being proxied more than once.

This PR introduces an inflight cache that allows one of these requests to win and finish processing before the others proceed. The implementation takes a fan-out approach to ensure that a request is processed at least once (and that only one of them does the initial processing) so that LeaseCache has a chance to cache its response.

The changes in this PR also ensure that identical concurrent cacheable requests whose response is not yet cached only ever have the request proxied once, since the inflight cache only allows one winner to fully process the request before the others can proceed.
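
A minimal Go sketch of the fan-out approach described above, assuming an inflightRequest type with a channel the winner closes and an atomic reference counter; the names and structure are illustrative, not the exact code in this PR:

```go
package cache

import (
	"sync"
	"sync/atomic"
)

// inflightRequest tracks identical concurrent requests for a single request ID.
type inflightRequest struct {
	ch        chan struct{} // closed by the winner once its response has had a chance to be cached
	remaining atomic.Uint64 // requests still holding a reference to this entry
}

// send sketches the winner/waiter split: the first request for an id proceeds
// immediately; the rest block until the winner closes ch, then continue.
func send(inflightCache *sync.Map, id string, process func()) {
	entry, loaded := inflightCache.LoadOrStore(id, &inflightRequest{ch: make(chan struct{})})
	inflight := entry.(*inflightRequest)
	inflight.remaining.Add(1)

	defer func() {
		// Drop the entry once the last identical request is done with it.
		if inflight.remaining.Add(^uint64(0)) == 0 {
			inflightCache.Delete(id)
		}
	}()

	if !loaded {
		// Winner: process (and possibly cache) the response, then release the waiters.
		defer close(inflight.ch)
		process()
		return
	}

	// Waiter: wait for the winner to finish, then process normally
	// (cache hit if the winner's response was cacheable, proxy otherwise).
	<-inflight.ch
	process()
}
```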

@calvn calvn added this to the 1.7 milestone Jan 15, 2021
ncabatoff (Collaborator):

Just to confirm my understanding: let's assume we have a steady stream of non-cacheable requests coming in for the exact same URL, say 1/ms, and they take 20ms each for Vault to process.

The first request to hit Agent will get added to the cache. The next 19 requests will queue up behind it. After 20ms the first request is done and Agent will then immediately forward all 19 other requests that came in while we were waiting.

As new requests come in every 1ms, each of them will be forwarded immediately since the inflight req ch is closed.

If there were an interruption in the flow of incoming requests long enough for the cache entry to be removed, the whole sequence would restart, with an initial delay, then all pending requests released at once, then no more delays.

This approach is motivated by the fact that we don't cache some requests, and we don't know at the outset whether a request will be cacheable.

I don't think it's bad, it's certainly an improvement over what we have now, and I don't have a better idea right now, but it does lead to some burstiness as well as somewhat unpredictable latency.

calvn (Member Author) commented Jan 15, 2021

As new requests come in every 1ms, each of them will be forwarded immediately since the inflight req ch is closed.

Yes, that's correct.

If there were an interruption in the flow of incoming requests long enough for the cache entry to be removed, the whole sequence would restart, with an initial delay, then all pending requests released at once, then no more delays.

Yes, that's the case as well. Although, if there is no interruption and this is a continuous steady stream of requests, the inflight cache entry might never get cleaned up. On the bright side, the delta here is just on the remaining counter and not an increase in the cache size. In theory that prevents the inflight cache from shrinking back down, but in practice I don't expect this scenario (numerous distinct, steady-stream concurrent requests against a single agent instance) to be large enough to overwhelm the cache/memory. You'd need a lot of these steady-stream requests happening at the same time, on unique requests, to populate a large enough cache for it to become an issue.

Good point on the burstiness aspect of this logic. I don't have a straight answer to this other than introducing a jitter, but this would also add additional latency. Maybe that's a fine trade-off to avoid overwhelming the Vault server.

ncabatoff (Collaborator):

Good point on the burstiness aspect of this logic. I don't have a straight answer to this other than introducing a jitter, but this would also add additional latency. Maybe that's a fine trade-off to avoid overwhelming the Vault server.

I wasn't asking for any changes, just making sure we're clear on the new behaviour. Like I say, I don't think it's particularly bad, it's just potentially a bit surprising. I vote against adding more complexity to the solution unless it improves the user experience. I don't think there's a straightforward way to stagger the backlogged requests, at least I don't see a way to improve things in general - it feels like any attempt we make could help in some cases and hurt in others.

defer func() {
// Cleanup on the cache if there are no remaining inflight requests.
// This is the last step, so we defer the call first
if inflight != nil && inflight.remaining.Load() == 0 {
Contributor:
When would inflight be nil at this point?

calvn (Member Author), Jan 19, 2021:
It shouldn't be nil because the conditional further down always assigns this value to something, but I'm nil-checking just in case, since this is within a defer that's registered right after the variable is declared but before the value is assigned.
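
For illustration, a small self-contained sketch (not the PR's actual code; leaseCache and inflightRequest here are stand-ins) of why the defer needs the nil check: the deferred func is registered while inflight is still nil, so an early return before the assignment would otherwise dereference a nil pointer:

```go
package cache

import (
	"errors"
	"sync"
	"sync/atomic"
)

type inflightRequest struct {
	ch        chan struct{}
	remaining atomic.Uint64
}

type leaseCache struct {
	inflightCache sync.Map
}

func (c *leaseCache) send(id string, fail bool) error {
	var inflight *inflightRequest

	// Registered while inflight is still nil; without the nil check, the early
	// return below would panic with a nil pointer dereference inside this defer.
	defer func() {
		if inflight != nil && inflight.remaining.Load() == 0 {
			c.inflightCache.Delete(id)
		}
	}()

	if fail {
		return errors.New("returning before inflight is assigned")
	}

	inflight = &inflightRequest{ch: make(chan struct{})}
	// ... remainder of request handling ...
	return nil
}
```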

// Cleanup on the cache if there are no remaining inflight requests.
// This is the last step, so we defer the call first
if inflight != nil && inflight.remaining.Load() == 0 {
c.inflightCache.Delete(id)
Contributor:
This felt a little racy since we can be here multiple times for the same id if a request comes in right after the condition Load() == 0 is checked. But I've gone through it and don't think there is any harmful behavior. The inflight object still exists even if it's deleted from the cache so the final request can complete, and calling Delete() on an id not present is a no-op.
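
The two properties being relied on here are easy to confirm with a standalone check against the standard library (this is not Vault code): sync.Map.Delete on an absent key is a no-op, and a value loaded before the Delete stays usable through the pointer already held:

```go
package main

import (
	"fmt"
	"sync"
)

type entry struct{ n int }

func main() {
	var m sync.Map
	m.Store("id", &entry{n: 1})

	v, _ := m.Load("id")
	e := v.(*entry)

	m.Delete("id")
	m.Delete("id") // deleting an id that's no longer present is a no-op

	// The value removed from the map is still reachable via the pointer we
	// hold, so the final in-flight request can finish using it.
	fmt.Println(e.n) // prints 1
}
```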

calvn (Member Author) commented Jan 25, 2021

@ncabatoff @kalafut @briankassouf thoughts on having this backported to 1.6.x? It's technically a bug fix, but at the same time it's not a common case and not noticeable except under a specific scenario. We're also tight on time for the upcoming patch release.

select {
case <-ctx.Done():
return nil, ctx.Err()
case <-inflight.ch:
Member:
So right now if we detect here that the thread processing the request has completed (channel has been closed) then we simply continue. But once we get down to:

cachedResp, err := c.checkCacheForRequest(id)

We'd see a nil cachedResp since that is still only going to cache leased values. Then, I think, we'd simply re-send the request to the Vault server. I think this fix is missing a step where we store the resulting response in the inflightRequest object and access it here when the channel is closed. Thoughts?

calvn (Member Author):

The winner who had closed the channel will have cached the response before this thread gets to call c.checkCacheForRequest(id) so it will result in a cache hit. In the case that the request resulted in a non-cacheable response, it would proxy to Vault as it should.

The changes in the PR don't actually prevent identical non-cacheable requests from being proxied to Vault; they simply allow one of the requests to be processed first (since we don't know whether it's cacheable) before opening the floodgate to let the other identical requests be processed concurrently. I don't think there's a need to store the actual request/response object in the inflightRequest.
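
To make the waiter path concrete, a hedged sketch follows; checkCacheForRequest and proxy are simplified stand-ins, not the real LeaseCache signatures. The waiter blocks until the winner closes the channel, then checks the cache and, on a miss, proxies to Vault like any other non-cacheable request:

```go
package cache

import "context"

type Response struct{ /* proxied or cached response */ }

type inflightRequest struct{ ch chan struct{} }

type leaseCache struct{ /* ... */ }

// Stand-ins for the real LeaseCache methods; signatures here are assumptions.
func (c *leaseCache) checkCacheForRequest(id string) (*Response, error) { return nil, nil }
func (c *leaseCache) proxy(ctx context.Context, id string) (*Response, error) {
	return &Response{}, nil
}

// waiterSend blocks until the winner for this id has finished, then serves
// from the cache if the winner's response was cacheable, or proxies otherwise.
func (c *leaseCache) waiterSend(ctx context.Context, inflight *inflightRequest, id string) (*Response, error) {
	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	case <-inflight.ch:
		// Winner is done; if its response was cacheable it is already cached.
	}

	if cachedResp, err := c.checkCacheForRequest(id); err == nil && cachedResp != nil {
		return cachedResp, nil
	}
	return c.proxy(ctx, id)
}
```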

@calvn calvn merged commit df51db7 into master Jan 26, 2021
@calvn calvn deleted the agent-inflight-cache branch January 26, 2021 20:09
jartek pushed a commit to jartek/vault that referenced this pull request Sep 11, 2021
…icorp#10705)

* agent: do not grap idLock writelock until caching entry

* agent: inflight cache using sync.Map

* agent: implement an inflight caching mechanism

* agent/lease: add lock for inflight cache to prevent simultaneous Set calls

* agent/lease: lock on a per-ID basis so unique requests can be processed independently

* agent/lease: add some concurrency tests

* test: use lease_id for uniqueness

* agent: remove env flags, add comments around locks

* agent: clean up test comment

* agent: clean up test comment

* agent: remove commented debug code

* agent/lease: word-smithing

* Update command/agent/cache/lease_cache.go

Co-authored-by: Nick Cabatoff <ncabatoff@hashicorp.com>

* agent/lease: return the context error if the Done ch got closed

* agent/lease: fix data race in concurrency tests

* agent/lease: mockDelayProxier: return ctx.Err() if context got canceled

* agent/lease: remove unused inflightCacheLock

* agent/lease: test: bump context timeout to 3s

Co-authored-by: Nick Cabatoff <ncabatoff@hashicorp.com>