Fix several ACL token/policy resolution issues. #5246

mkeeler · 2019-01-22T15:08:14Z

The main issue was that token specific issues (not able to access a particular policy or the token being deleted after initial fetching) were poisoning the policy cache.

A second issue was that for concurrent token resolutions, the first resolution to get started would go fetch all the policies. If before the policies were retrieved a second resolution request came in, the new request would register watchers for those policies but then never block waiting for them to complete. This resulted in using the default policy when it shouldn't have.

This PR fixes those issues. The second issue was simplest to fix. We were using the wrong value to determine the number of times to block on our wait channel for async policy resolution responses. Previously we looped newAsyncFetchIDs number of times which when other policy resolutions for required policies are already ongoing will be less than the total number of policies needed. This was changed to the number of policies that need to be fetched (regardless of which resolution request is the one to initiate the policy fetch)

The first issue required a few changes.

On the server side, the ACL.PolicyResolve endpoint was modified to return permission denied errors when trying to resolve a policy that the token is not linked with. This way we can differentiate between non-existent policies and policies the token just isn't allowed to retrieve.

On the client side token/policy got a little more complex. Token errors during policy resolution (both not found and permission denied) are now handled better. First they no longer modify the policy cache for these errors. Instead for a not found error, it overrides the identity used for the request in the local cache to nullify it and store the not found error. For a permission denied errors, the cached identity is just removed which will allow subsequent requests to be fetched again. For both token related errors it will cause policy resolution to stop. For permission denied errors and not found errors for other tokens, the request will be retried.

For example:

Token A is linked to policies 1, 2 and 3
Token B is linked to policies 2, 3 and 4

Then the following flow is how things will work:

ResolveToken('') is started
Token A's identity gets fetched and cached
ResolveToken('') is started
Token B's identity gets fetched and cached
Some external entity modifies Token A to unlink policy 3.
Async request to resolve policies 1,2 and 3 is started using Token A
Async request to resolve policy 4 is started using Token B
Request to resolve with Token A returns Permission Denied
Token A gets removed from the identity cache
Both policy resolution requests get stopped due to the token error for A
In the background the Token B policy resolution is still ongoing (or may have finished). This will populate the cache for policy 4.
Token A's identity must be re-fetched as it is no longer cached.
Token B's identity is cached and will be used for policy resolution.
Something external deletes Token B.
Policy resolution of policies 2 and 3 is started for Token B.
Policy resolution of policy 1 is started for Token A.
Policy resolution of 2 and 3 returns with a not found error.
Both policy resolution requests get stopped due to the token error for B.
In the background the Token A policy resolution of policy 1 is still ongoing (or may have finished). This will populate the cache for policy 1.
Because of the not found token error, token B resolution will stop and propagate the not found error as if the very first identity resolution requests had come up with the not found error.
The Token A resolution request will be retried.
Token A's identity will be retrieved from the cache.
Token A will start a policy resolution for policy 2.
The server returns policy 2
Token A policy resolution completes and policies 1 and 2 get compiled into the Authorizer and are cached/returned.

1 - Use the right method to fire async not found errors when the ACL.PolicyResolve RPC returns that error. This was previously accidentally firing a token result instead of a policy result which would have effectively done nothing (unless there happened to be a token with a secret id == the policy id being resolved. 2. When concurrent policy resolution is being done we single flight the requests. The bug before was that for the policy resolution that was going to piggy back on anothers RPC results it wasn’t waiting long enough for the results to come back due to looping with the wrong variable.

hashicorp-cla · 2019-01-22T15:08:22Z

All committers have signed the CLA.

banks

Wow.

To the extent that I understand this it seems good to me. Your description makes it clear what the issue is and the code looks plausibly like it solves it and the test verifies that. That said, it's all subtle enough that I can't say with high-degree fo confidence that this is now fully correct in all possible concurrent executions!

If you are happy then I think it's fine as it is. If you feel like there is risk of introducing a worse CVE (i.e. fail open/use wrong policy) than the current fail-closed bug, we can think about other ways we could build confidence in this case.

agent/consul/acl_test.go

Co-Authored-By: mkeeler <mkeeler@users.noreply.github.com>

mkeeler · 2019-01-22T16:40:45Z

@banks The only potential for granting more access than desired is really an issue with caching tokens at all.

If for example you have two tokens that link the same policies and you resolve them both only 1 of them is going to be used to resolve policies. If between the original identity fetch/cache the second token gets modified to remove one of those policies the new state will not be detected until the identity cache expires. I don't see this as an issue. So long as we cache tokens/identities and policies we will have to deal with stale data which is what this scenario is.

Also this isn't new with this code or even with the new ACL system. I am confident that this introduces no risk of failing open. In order to be granted access you still have to have a policy that grants you that access. This PR just helps to ensure that we don't fail unnecessarily to resolve the policy causing denial of access erroneously.

banks · 2019-01-22T17:03:30Z

Gotcha sounds reasonable to me!

…

On Tue, Jan 22, 2019 at 4:40 PM Matt Keeler ***@***.***> wrote: @banks <https://github.com/banks> The only potential for granting more access than desired is really an issue with caching tokens at all. If for example you have two tokens that link the same policies and you resolve them both only 1 of them is going to be used to resolve policies. If between the original identity fetch/cache the second token gets modified to remove one of those policies the new state will not be detected until the identity cache expires. I don't see this as an issue. So long as we cache tokens/identities and policies we will have to deal with stale data which is what this scenario is. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#5246 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAHYU5TZ43HQxMZ1D8KFlvEqlOVmzfcWks5vFz8PgaJpZM4aM3ny> .

…andling.

mkeeler added 2 commits January 18, 2019 10:33

Fix a handful of other edge case ACL scenarios

8095a2f

mkeeler added this to the 1.4.1 milestone Jan 22, 2019

banks approved these changes Jan 22, 2019

View reviewed changes

agent/consul/acl_test.go Outdated Show resolved Hide resolved

Update agent/consul/acl_test.go

0a60325

Co-Authored-By: mkeeler <mkeeler@users.noreply.github.com>

Add a few more unit tests to handle the concurrent resolution error h…

e8a00c6

…andling.

mkeeler merged commit 579a8b3 into master Jan 22, 2019

mkeeler deleted the bugfix/acl-fixes branch January 23, 2019 14:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix several ACL token/policy resolution issues. #5246

Fix several ACL token/policy resolution issues. #5246

mkeeler commented Jan 22, 2019

hashicorp-cla commented Jan 22, 2019 •

edited

Loading

banks left a comment

mkeeler commented Jan 22, 2019 •

edited

Loading

banks commented Jan 22, 2019 via email

Fix several ACL token/policy resolution issues. #5246

Fix several ACL token/policy resolution issues. #5246

Conversation

mkeeler commented Jan 22, 2019

hashicorp-cla commented Jan 22, 2019 • edited Loading

banks left a comment

Choose a reason for hiding this comment

mkeeler commented Jan 22, 2019 • edited Loading

banks commented Jan 22, 2019 via email

hashicorp-cla commented Jan 22, 2019 •

edited

Loading

mkeeler commented Jan 22, 2019 •

edited

Loading