Fix issue with changing the agent token causing failure to renew the auto-encrypt certificate #8311

Merged
merged 2 commits into from
Jul 21, 2020

mkeeler
Member

@mkeeler mkeeler commented Jul 14, 2020

The fallback method would still work, but it would get into a state where it let the certificate stay expired for 10s before getting a new one, and the new one was obtained through the less secure RPC endpoint.

This is also a pretty large refactoring of the auto-encrypt code. I was going to write some tests around the certificate monitoring, but it was going to be impossible to configure a TestAgent in such a way that a test exercising the functionality ran in less than an hour or two.

Moving the certificate monitoring into its own package will allow for dependency injection and in particular mocking the cache types to control how it hands back certificates and how long those certificates should live. This will allow for exercising the main loop more than would be possible with it coupled so tightly with the Agent.
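
A minimal sketch of the dependency-injection idea, not the code in this PR: the monitor depends on a small interface, and a fake implementation decides how long each certificate lives, so several renewals fit into a few milliseconds. All names here (`Cert`, `CertSource`, `fakeSource`, `Monitor`, `runDemo`) are hypothetical, and the "renew at half the lifetime" rule is an assumption made only for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Cert is a hypothetical stand-in for a leaf certificate.
type Cert struct {
	Serial   int
	ValidFor time.Duration
}

// CertSource is the (hypothetical) dependency the monitor is given,
// so tests can swap the real cache for a fake.
type CertSource interface {
	Fetch(ctx context.Context, token string) (Cert, error)
}

// fakeSource hands back certificates with a test-chosen lifetime and
// records the token used for each request.
type fakeSource struct {
	ttl    time.Duration
	serial int
	tokens []string
}

func (f *fakeSource) Fetch(ctx context.Context, token string) (Cert, error) {
	if err := ctx.Err(); err != nil {
		return Cert{}, err
	}
	f.serial++
	f.tokens = append(f.tokens, token)
	return Cert{Serial: f.serial, ValidFor: f.ttl}, nil
}

// Monitor renews the certificate at half of its lifetime
// (an assumption of this sketch).
type Monitor struct {
	Source CertSource
	Token  string
}

// Run fetches until count certificates were obtained or ctx ends.
func (m *Monitor) Run(ctx context.Context, count int) ([]Cert, error) {
	var certs []Cert
	for len(certs) < count {
		cert, err := m.Source.Fetch(ctx, m.Token)
		if err != nil {
			return certs, err
		}
		certs = append(certs, cert)
		select {
		case <-time.After(cert.ValidFor / 2): // time to renew
		case <-ctx.Done():
			return certs, ctx.Err()
		}
	}
	return certs, nil
}

// runDemo drives three renewals in a few milliseconds by giving the
// fake a 4ms certificate lifetime.
func runDemo() (int, []string) {
	src := &fakeSource{ttl: 4 * time.Millisecond}
	m := &Monitor{Source: src, Token: "agent-token"}
	certs, err := m.Run(context.Background(), 3)
	if err != nil {
		panic(err)
	}
	return certs[len(certs)-1].Serial, src.tokens
}

func main() {
	serial, tokens := runDemo()
	fmt.Println(serial, tokens)
}
```

Because the lifetime comes from the fake rather than from a real CA, the renewal loop can be exercised without waiting for real certificate expiry.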

@mkeeler mkeeler force-pushed the bugfix/auto-encrypt-token-update branch 5 times, most recently from d6695c0 to 40a3686 Compare July 17, 2020 17:23
@mkeeler mkeeler marked this pull request as ready for review July 17, 2020 17:24
@mkeeler mkeeler requested a review from a team July 17, 2020 17:24
@mkeeler mkeeler force-pushed the bugfix/auto-encrypt-token-update branch from 40a3686 to e0baa39 Compare July 17, 2020 18:08
@mkeeler mkeeler force-pushed the bugfix/auto-encrypt-token-update branch 2 times, most recently from 9178841 to a80e8a4 Compare July 17, 2020 20:53
@mkeeler mkeeler force-pushed the bugfix/auto-encrypt-token-update branch from a80e8a4 to 32ead2e Compare July 20, 2020 15:19
@crhino
Contributor

crhino commented Jul 20, 2020

Saw some interesting behavior while testing this out locally.

The scenario is as follows:

  1. Update agent token to non-existent ACL token: curl -sv -H"X-Consul-Token: roottoken" localhost:30100/v1/agent/token/agent -XPUT -d '{"token": "adfasdf"}'
  2. Create new token with correct privileges (I just created another token with global-management permissions to test)
  3. Update agent token again to valid token: curl -sv -H"X-Consul-Token: roottoken" localhost:30100/v1/agent/token/agent -XPUT -d '{"token": "f725a93c-3b16-5ee3-ef20-7f31c2725d39"}'

After updating to a new, valid token, I still see log lines on the client stating:

    2020-07-20T17:09:12.962Z [ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=172.24.0.5:8300 error="rpc error making call: ACL not found"

After putting a debug log line into the ConnectCA.Sign endpoint, I can see that somehow the client is still using the adfasdf token to attempt to retrieve a leaf cert. This is after we have already successfully retrieved a leaf certificate with the valid token:

    2020-07-20T17:04:29.687Z [DEBUG] agent.server: ConnectCA.Sign: what token am I using?: token=roottoken service_id=<nil> agent_id="&{bd5896ce-0f58-b0b3-128d-23e379782001.consul chris1 consul-dc1-client0}"
    2020-07-20T17:08:26.770Z [DEBUG] agent.server: ConnectCA.Sign: what token am I using?: token=adfasdf service_id=<nil> agent_id="&{bd5896ce-0f58-b0b3-128d-23e379782001.consul chris1 consul-dc1-client0}"
...
    2020-07-20T17:08:39.336Z [DEBUG] agent.server: ConnectCA.Sign: what token am I using?: token=f725a93c-3b16-5ee3-ef20-7f31c2725d39 service_id=<nil> agent_id="&{bd5896ce-0f58-b0b3-128d-23e379782001.consul chris1 consul-dc1-client0}"
    2020-07-20T17:08:39.385Z [DEBUG] agent.server: ConnectCA.Sign: what token am I using?: token=adfasdf service_id=<nil> agent_id="&{bd5896ce-0f58-b0b3-128d-23e379782001.consul chris1 consul-dc1-client0}"

It feels like there is still a goroutine running for the leaf certificate cache request? Looking through the code, I was not able to figure out why this was happening, though; it seems like we are correctly cancelling contexts during the token updates.

@mkeeler mkeeler force-pushed the bugfix/auto-encrypt-token-update branch from 32ead2e to eecc925 Compare July 20, 2020 18:54
@crhino
Contributor

crhino commented Jul 20, 2020

For posterity, we figured out that the issue above is related to the cache package keeping around a background refresh of the certificate, which is currently expected behavior. So, not necessary to address in this PR.

Contributor

@crhino crhino left a comment

LGTM! Works as expected and nicer to reason about. :shipit:

Member

@hanshasselberg hanshasselberg left a comment

LGTM, left a small comment about a comment.

agent/cert-monitor/cert_monitor.go (outdated review comment, resolved)
@mkeeler mkeeler force-pushed the bugfix/auto-encrypt-token-update branch from 9269ef7 to 0e770fa Compare July 21, 2020 13:19
@mkeeler
Member Author

mkeeler commented Jul 21, 2020

It turned out we could just disable the background refresh for leaf certs as that functionality was unused for that type and it was causing issues.
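
A toy model of the behavior discussed above, under stated assumptions: it is not Consul's agent/cache package, and `entry`, `cache`, `Tick`, and `simulate` are hypothetical names. It assumes a cache entry remembers the token of the request that created it and that background refreshes are owned by the cache rather than by the caller's context. `Tick` stands in for one background refresh cycle so the example stays deterministic.

```go
package main

import (
	"context"
	"fmt"
)

// entry is a hypothetical cache entry; it remembers the token of the
// request that created it.
type entry struct{ token string }

// cache is a toy model, not Consul's agent/cache package.
type cache struct {
	refresh bool     // whether entries are refreshed in the background
	entries []entry  // entries kept alive for background refresh
	signed  []string // tokens the simulated signing endpoint saw
}

// Get simulates a foreground fetch. The caller's ctx only bounds this call.
func (c *cache) Get(ctx context.Context, token string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	c.signed = append(c.signed, token)
	if c.refresh {
		c.entries = append(c.entries, entry{token: token})
	}
	return nil
}

// Tick simulates one background refresh cycle. It is owned by the
// cache, so cancelling a caller's context does not stop it.
func (c *cache) Tick() {
	for _, e := range c.entries {
		c.signed = append(c.signed, e.token)
	}
}

// simulate replays the token update scenario and returns the tokens
// seen by the simulated signing endpoint, in order.
func simulate(refresh bool) []string {
	c := &cache{refresh: refresh}

	ctx, cancel := context.WithCancel(context.Background())
	_ = c.Get(ctx, "adfasdf") // request made with the bad token
	cancel()                  // the token update cancels the old request

	_ = c.Get(context.Background(), "valid-token")
	c.Tick() // background refresh fires after the update
	return c.signed
}

func main() {
	fmt.Println(simulate(true))  // [adfasdf valid-token adfasdf valid-token]
	fmt.Println(simulate(false)) // [adfasdf valid-token]
}
```

With refresh enabled the stale token shows up again after the update even though the caller's context was cancelled; with refresh disabled it does not.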

@mkeeler mkeeler force-pushed the bugfix/auto-encrypt-token-update branch 3 times, most recently from 13ce08a to e956565 Compare July 21, 2020 15:45
The rationale behind removing them (the background refreshes for leaf certs) is that all of our own code (xDS, builtin connect proxy) uses the cache notification mechanism. This ensures that the blocking fetch behind the scenes is always executing. Therefore the only ways you might go to get a certificate and have to wait are when 1) the request has never been made for that cert before or 2) you are using the v1/agent/connect/ca/leaf API for retrieving the cert yourself.

In the first case, the refresh change doesn’t alter the behavior. In the second case, it can be mitigated by using blocking queries with that API, which, just like the normal cache notification mechanism, will cause the blocking fetch to be initiated and leaf certs to be fetched as soon as they are needed.

If you are not using blocking queries, Envoy/xDS, or the builtin connect proxy, but are retrieving the certs yourself, then the HTTP endpoint might take a little longer to respond.
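
For readers unfamiliar with blocking queries, here is an in-process sketch of the index-based long-poll pattern. It is only a model: the real HTTP API carries the index in the `index` query parameter and the `X-Consul-Index` response header, while `leafStore`, `Rotate`, `Get`, and `watchOnce` below are hypothetical names invented for this example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// leafStore is a hypothetical holder of the current certificate and
// a monotonically increasing index that changes on every rotation.
type leafStore struct {
	mu     sync.Mutex
	index  uint64
	pem    string
	waitCh chan struct{} // closed on every change to wake waiters
}

func newLeafStore(pem string) *leafStore {
	return &leafStore{index: 1, pem: pem, waitCh: make(chan struct{})}
}

// Rotate installs a new certificate and wakes all blocked callers.
func (s *leafStore) Rotate(pem string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.index++
	s.pem = pem
	close(s.waitCh)
	s.waitCh = make(chan struct{})
}

// Get returns immediately if the stored index is newer than minIndex.
// Otherwise it blocks until a change happens or the timeout elapses,
// in which case the last observed value is returned.
func (s *leafStore) Get(minIndex uint64, timeout time.Duration) (string, uint64) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		s.mu.Lock()
		idx, pem, ch := s.index, s.pem, s.waitCh
		s.mu.Unlock()
		if idx > minIndex {
			return pem, idx
		}
		select {
		case <-ch: // something changed, re-check the index
		case <-timer.C:
			return pem, idx
		}
	}
}

// watchOnce fetches the current cert, then blocks until it rotates.
func watchOnce() (string, uint64) {
	s := newLeafStore("cert-v1")
	_, idx := s.Get(0, time.Second) // returns immediately with index 1
	go func() {
		time.Sleep(10 * time.Millisecond)
		s.Rotate("cert-v2")
	}()
	return s.Get(idx, 5*time.Second) // blocks until the rotation
}

func main() {
	pem, idx := watchOnce()
	fmt.Println(pem, idx)
}
```

A caller that loops on `Get` with the last index it saw always has a request outstanding, which is the property the commit message relies on.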

This also renames the RefreshTimeout field on the register options to QueryTimeout to more accurately reflect that it is used for any type that supports blocking queries.
@mkeeler mkeeler force-pushed the bugfix/auto-encrypt-token-update branch from e956565 to 12acdd7 Compare July 21, 2020 16:19
@mkeeler mkeeler merged commit 3c09482 into master Jul 21, 2020
@mkeeler mkeeler deleted the bugfix/auto-encrypt-token-update branch July 21, 2020 17:15
@hashicorp-ci
Contributor

🍒❌ Cherry pick of commit 3c09482 onto release/1.8.x failed! Build Log

@mkeeler
Member Author

mkeeler commented Jul 21, 2020

Backport PR is here: #8352
