Fix several ACL token/policy resolution issues. #5246

Merged 4 commits on Jan 22, 2019


@mkeeler mkeeler commented Jan 22, 2019

Fixes #5219

The main issue was that token-specific errors (being unable to access a particular policy, or the token being deleted after its initial fetch) were poisoning the policy cache.

A second issue involved concurrent token resolutions. The first resolution to start would go fetch all the policies. If a second resolution request came in before those policies were retrieved, it would register watchers for those policies but then never block waiting for them to complete. As a result, the default policy was used when it shouldn't have been.

This PR fixes those issues. The second issue was the simplest to fix: we were using the wrong value to determine how many times to block on our wait channel for async policy resolution responses. Previously we looped newAsyncFetchIDs number of times, which is less than the total number of policies needed whenever other resolutions are already fetching some of the required policies. The loop now runs once for each policy that needs to be fetched, regardless of which resolution request initiated the fetch.
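A minimal sketch of the corrected wait loop, not the actual Consul code. The types `policyResult` and `fetchPlan` and the function `waitForPolicies` are hypothetical names chosen for illustration; only the idea of counting needed policies rather than newly started fetches comes from the description above.

```go
package main

import "fmt"

// policyResult is a hypothetical message delivered when one policy fetch finishes.
type policyResult struct {
	policyID string
	err      error
}

// fetchPlan is a hypothetical summary of what one token resolution needs.
type fetchPlan struct {
	needed       []string // every policy this request must wait for
	newlyStarted []string // the subset whose fetch this request initiated
}

// waitForPolicies blocks until one result per needed policy has arrived.
func waitForPolicies(plan fetchPlan, waiter <-chan policyResult) map[string]error {
	results := make(map[string]error, len(plan.needed))
	// Looping len(plan.newlyStarted) times here would return early whenever
	// another request had already started some of the fetches.
	for i := 0; i < len(plan.needed); i++ {
		res := <-waiter
		results[res.policyID] = res.err
	}
	return results
}

func main() {
	plan := fetchPlan{
		needed:       []string{"policy-2", "policy-3", "policy-4"},
		newlyStarted: []string{"policy-4"}, // 2 and 3 are already in flight
	}
	waiter := make(chan policyResult, len(plan.needed))
	for _, id := range plan.needed {
		waiter <- policyResult{policyID: id}
	}
	fmt.Println(len(waitForPolicies(plan, waiter)), "policies resolved")
}
```

With the old loop bound, this request would have read a single result and moved on without policies 2 and 3, which matches the "default policy" symptom described above.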

The first issue required a few changes.

On the server side, the ACL.PolicyResolve endpoint was modified to return permission denied errors when trying to resolve a policy that the token is not linked with. This way we can differentiate between non-existent policies and policies the token just isn't allowed to retrieve.
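The server-side check could look roughly like the sketch below. It is not the real ACL.PolicyResolve handler: `token`, `policy`, `resolvePolicies`, and `errPermissionDenied` are hypothetical, and treating a linked-but-missing policy as "silently omitted" is an assumption made to show the contrast with the permission denied case.

```go
package main

import (
	"errors"
	"fmt"
)

// errPermissionDenied is a hypothetical sentinel error for this sketch.
var errPermissionDenied = errors.New("permission denied")

type token struct{ policyIDs []string }

type policy struct{ ID string }

// resolvePolicies returns the requested policies, rejecting the whole request
// if any requested policy is not linked to the token.
func resolvePolicies(tok token, store map[string]policy, requested []string) ([]policy, error) {
	linked := make(map[string]bool, len(tok.policyIDs))
	for _, id := range tok.policyIDs {
		linked[id] = true
	}
	var out []policy
	for _, id := range requested {
		if !linked[id] {
			// The token is not allowed to see this policy at all.
			return nil, errPermissionDenied
		}
		if p, ok := store[id]; ok {
			out = append(out, p)
		}
		// Assumption: a linked policy that no longer exists is just omitted,
		// so the caller can tell "missing" apart from "not allowed".
	}
	return out, nil
}

func main() {
	store := map[string]policy{"policy-1": {ID: "policy-1"}, "policy-2": {ID: "policy-2"}}
	tokA := token{policyIDs: []string{"policy-1", "policy-2"}}

	_, err := resolvePolicies(tokA, store, []string{"policy-1", "policy-3"})
	fmt.Println("unlinked policy denied:", errors.Is(err, errPermissionDenied))

	ps, err := resolvePolicies(tokA, store, []string{"policy-1", "policy-2"})
	fmt.Println(len(ps), err)
}
```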

On the client side, token/policy resolution got a little more complex. Token errors during policy resolution (both not found and permission denied) are now handled better. First, these errors no longer modify the policy cache. Instead, for a not found error, the identity used for the request is overridden in the local cache to nullify it and store the not found error. For a permission denied error, the cached identity is simply removed, which allows subsequent requests to fetch it again. Both token-related errors cause policy resolution to stop. For permission denied errors, and for not found errors belonging to other tokens, the request will be retried.
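The client-side rules can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `identityCache`, `identityEntry`, `handleTokenError`, and both sentinel errors are hypothetical names, and the cache is a plain map rather than Consul's real cache type.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors for this sketch.
var (
	errNotFound         = errors.New("ACL not found")
	errPermissionDenied = errors.New("permission denied")
)

// identityEntry is a hypothetical cached identity; a non-nil err marks the
// token as known to be missing.
type identityEntry struct {
	policyIDs []string
	err       error
}

type identityCache map[string]*identityEntry

// handleTokenError updates only the identity cache (never the policy cache)
// and reports whether the request that used secretID should be retried.
func handleTokenError(cache identityCache, secretID string, err error) (retry bool) {
	switch {
	case errors.Is(err, errNotFound):
		// Nullify the identity and remember the error so it propagates
		// as if the first identity lookup had failed.
		cache[secretID] = &identityEntry{err: errNotFound}
		return false
	case errors.Is(err, errPermissionDenied):
		// Drop the stale identity so the next attempt re-fetches it.
		delete(cache, secretID)
		return true
	default:
		return false
	}
}

func main() {
	cache := identityCache{
		"token-a": &identityEntry{policyIDs: []string{"policy-1", "policy-2", "policy-3"}},
		"token-b": &identityEntry{policyIDs: []string{"policy-2", "policy-3", "policy-4"}},
	}

	fmt.Println("retry A:", handleTokenError(cache, "token-a", errPermissionDenied))
	_, stillCached := cache["token-a"]
	fmt.Println("A still cached:", stillCached)

	fmt.Println("retry B:", handleTokenError(cache, "token-b", errNotFound))
	fmt.Println("B cached error:", cache["token-b"].err)
}
```

In this sketch a request that was stopped because of another token's error is retried by its caller; that decision lives outside `handleTokenError`.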

For example:

Token A is linked to policies 1, 2 and 3
Token B is linked to policies 2, 3 and 4

Then the following flow is how things will work:

  1. ResolveToken('') is started
  2. Token A's identity gets fetched and cached
  3. ResolveToken('') is started
  4. Token B's identity gets fetched and cached
  5. Some external entity modifies Token A to unlink policy 3.
  6. Async request to resolve policies 1, 2 and 3 is started using Token A
  7. Async request to resolve policy 4 is started using Token B
  8. Request to resolve with Token A returns Permission Denied
  9. Token A gets removed from the identity cache
  10. Both policy resolution requests get stopped due to the token error for A
  11. In the background the Token B policy resolution is still ongoing (or may have finished). This will populate the cache for policy 4.
  12. Token A's identity must be re-fetched as it is no longer cached.
  13. Token B's identity is cached and will be used for policy resolution.
  14. Something external deletes Token B.
  15. Policy resolution of policies 2 and 3 is started for Token B.
  16. Policy resolution of policy 1 is started for Token A.
  17. Policy resolution of 2 and 3 returns with a not found error.
  18. Both policy resolution requests get stopped due to the token error for B.
  19. In the background the Token A policy resolution of policy 1 is still ongoing (or may have finished). This will populate the cache for policy 1.
  20. Because of the not found token error, token B resolution will stop and propagate the not found error as if the very first identity resolution requests had come up with the not found error.
  21. The Token A resolution request will be retried.
  22. Token A's identity will be retrieved from the cache.
  23. Token A will start a policy resolution for policy 2.
  24. The server returns policy 2
  25. Token A policy resolution completes and policies 1 and 2 get compiled into the Authorizer and are cached/returned.

1. Use the right method to fire async not found errors when the ACL.PolicyResolve RPC returns that error. Previously this accidentally fired a token result instead of a policy result, which would have effectively done nothing (unless there happened to be a token with a secret ID equal to the policy ID being resolved).

2. When concurrent policy resolution is being done, we single-flight the requests. The bug was that a policy resolution piggybacking on another's RPC results wasn't waiting long enough for the results to come back, because it looped with the wrong variable.
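To make the single-flight behaviour concrete, here is a small sketch of waiter registration. `fetchGroup`, `register`, and `complete` are hypothetical names and this is not how Consul structures it; the point is only that a request which piggybacks on an in-flight fetch starts fewer fetches than it must wait for.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchGroup is a hypothetical registry of in-flight policy fetches.
type fetchGroup struct {
	mu      sync.Mutex
	waiters map[string][]chan<- string
}

// register adds ch as a waiter for policyID and reports whether the caller
// is the first one and therefore has to start the fetch itself.
func (g *fetchGroup) register(policyID string, ch chan<- string) (startFetch bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	_, inFlight := g.waiters[policyID]
	g.waiters[policyID] = append(g.waiters[policyID], ch)
	return !inFlight
}

// complete notifies every waiter for policyID and clears the in-flight entry.
func (g *fetchGroup) complete(policyID string) {
	g.mu.Lock()
	chans := g.waiters[policyID]
	delete(g.waiters, policyID)
	g.mu.Unlock()
	for _, ch := range chans {
		ch <- policyID
	}
}

func main() {
	g := &fetchGroup{waiters: make(map[string][]chan<- string)}
	reqA := make(chan string, 2) // buffered so complete never blocks here
	reqB := make(chan string, 2)

	startedA, startedB := 0, 0
	for _, id := range []string{"policy-2", "policy-3"} {
		if g.register(id, reqA) {
			startedA++
		}
	}
	for _, id := range []string{"policy-3", "policy-4"} {
		if g.register(id, reqB) {
			startedB++
		}
	}
	fmt.Println("fetches started by A and B:", startedA, startedB)

	for _, id := range []string{"policy-2", "policy-3", "policy-4"} {
		g.complete(id)
	}
	// Request B started one fetch but has to receive two results.
	fmt.Println("B received:", <-reqB, <-reqB)
}
```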

hashicorp-cla commented Jan 22, 2019

CLA assistant check
All committers have signed the CLA.

@mkeeler mkeeler added this to the 1.4.1 milestone Jan 22, 2019

@banks banks left a comment


Wow.

To the extent that I understand this, it seems good to me. Your description makes it clear what the issue is, the code looks plausibly like it solves it, and the test verifies that. That said, it's all subtle enough that I can't say with a high degree of confidence that this is now fully correct in all possible concurrent executions!

If you are happy then I think it's fine as it is. If you feel like there is risk of introducing a worse CVE (i.e. fail open/use wrong policy) than the current fail-closed bug, we can think about other ways we could build confidence in this case.

Co-Authored-By: mkeeler <mkeeler@users.noreply.github.com>

mkeeler commented Jan 22, 2019

@banks The only potential for granting more access than desired is really an issue with caching tokens at all.

If, for example, you have two tokens that link the same policies and you resolve them both, only one of them is going to be used to resolve policies. If the second token gets modified to remove one of those policies after the original identity fetch/cache, the new state will not be detected until the identity cache expires. I don't see this as an issue. So long as we cache tokens/identities and policies, we will have to deal with stale data, which is what this scenario is.

Also this isn't new with this code or even with the new ACL system. I am confident that this introduces no risk of failing open. In order to be granted access you still have to have a policy that grants you that access. This PR just helps to ensure that we don't fail unnecessarily to resolve the policy causing denial of access erroneously.


banks commented Jan 22, 2019 via email

@mkeeler mkeeler merged commit 579a8b3 into master Jan 22, 2019
@mkeeler mkeeler deleted the bugfix/acl-fixes branch January 23, 2019 14:19