Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OIDC key rotations happen up to 60 seconds after they are scheduled #11438

Closed
ianferguson opened this issue Apr 22, 2021 · 8 comments
Closed

Comments

@ianferguson
Copy link
Contributor

ianferguson commented Apr 22, 2021

Describe the bug

The identity backend's JWT signing keys are rotated in the PeriodicFunc of the identity backend, which is triggered every 60 seconds by the RollbackManager. The PeriodicFunc only rotates keys that are overdue for rotation, which means that in practice there is a gap of up to 60 seconds between when a key is scheduled to be rotated and when it actually is.

This becomes an issue because the Cache-Control header returned on calls to identity/oidc/.well-known/keys appears to be calculated based on the scheduled rotation time, and will return Cache-Control: no-store for the 0 to 60 seconds between the scheduled rotation time and when the RollbackManager calls the Identity backend's PeriodicFunc and the rotation happens.

When normal longer signing key rotation periods, such as 24h or 7d, are used there is still a gap between the scheduled key rotation time and the actual rotation. This means that for up to 60 seconds every service that follows the Cache-Control directives for managing their active JWKS sets for verification will potentially make a call to Vault to fetch the JWKS for every single inbound call made to the service.

The service has to choose between caching the JWKS received with a Cache-Control: no-store for 1 second or 1 minute and potentially rejecting some valid requests signed with a new Key ID not in the old (improperly cached) JWKS set, or potentially slamming vault with thousands of requests, one per every request received by the service.

** Original Description **

note: I originally filed this issue as a quirk of sub 2 minute rotation configurations, but on further investigation it has broader consequences. The original description is below.

The identity backend allows users to configure signing key rotations as frequently as once a minute, but in practice cannot rotate keys more frequently than 1m, and will only be rotated every 2 minutes.

I believe this is due to the key rotation being handled by the PeriodicFunc for the identity backend, which is in turn managed/scheduled by the RollbackManager, which is fixed at a once a minute tick speed:

rollbackPeriod = time.Minute

when the periodic frequency is set to 60-121 seconds (< 120), when the rollback manager runs at time + 60 seconds, it is not time to rotate the keys yet. It isn't until the RollbackManager ticks a second time (time + 120 total seconds) that the key will be eligible for rotation

To Reproduce
Steps to reproduce the behavior:
run a vault dev server with debug log level set:

vault server -dev -dev-root-token-id=dev-root-token  -log-level=debug                  

In another terminal/tab configure a signing key with a rotation period of 1m (or any value from 60-119s):

VAULT_ADDR=http://127.0.0.1:8200
VAULT_TOKEN=dev-root-token
vault write identity/oidc/key/a-key rotation_period=1m verification_ttl=1m allowed_client_ids="*"  

if you watch the dev server logs at that point, it will print rotation/deletion secrets every 2 minutes, rather than every 1 minute:

2021-04-22T11:42:53.687-0400 [DEBUG] secrets.identity.identity_218e930c: rotating OIDC key: key=notary
2021-04-22T11:44:53.688-0400 [DEBUG] secrets.identity.identity_218e930c: rotating OIDC key: key=notary
2021-04-22T11:44:53.792-0400 [DEBUG] secrets.identity.identity_218e930c: deleted OIDC public key: key_id=080f4b88-8104-8159-97b9-542d07195614
2021-04-22T11:46:53.687-0400 [DEBUG] secrets.identity.identity_218e930c: rotating OIDC key: key=notary
2021-04-22T11:46:53.959-0400 [DEBUG] secrets.identity.identity_218e930c: deleted OIDC public key: key_id=a79baa4a-c76c-a38e-9968-58d436a56525
2021-04-22T11:48:53.687-0400 [DEBUG] secrets.identity.identity_218e930c: rotating OIDC key: key=notary
2021-04-22T11:48:54.010-0400 [DEBUG] secrets.identity.identity_218e930c: deleted OIDC public key: key_id=91eacb84-9cf5-d2e1-b973-9fc61aa09c41
2021-04-22T11:50:53.688-0400 [DEBUG] secrets.identity.identity_218e930c: rotating OIDC key: key=notary
2021-04-22T11:50:53.907-0400 [DEBUG] secrets.identity.identity_218e930c: deleted OIDC public key: key_id=0ad16d14-3a2a-2db4-ce68-93c82488c122

Additionally, when polling the server's JWKS endpoint with curl, key ids will be visible for longer than the configured (verification_ttl + rotation_period) time (1m each, for a total of 2m) that they should be visible for

watch -n10 'curl --silent $VAULT_ADDR/v1/identity/oidc/.well-known/keys | jq -r ".keys[].kid"'

The cache control behavior can be observed by running a command like the following:

while true; do
  echo -n $(date)
  curl --verbose --silent $VAULT_ADDR/v1/identity/oidc/.well-known/keys 2>&1 | grep Cache-C; 
  sleep 1;
done

Which will output something like

Apr 22 14:31:51 < Cache-Control: max-age=2                                                                                                                                                                                                                                                    
Apr 22 14:31:52 < Cache-Control: max-age=1
Apr 22 14:31:53 < Cache-Control: max-age=0
Apr 22 14:31:54 < Cache-Control: no-store
# ~58 lines of "Cache-Control: no-store"  trimmed out
Apr 22 14:32:53 < Cache-Control: no-store
Apr 22 14:32:54 < Cache-Control: max-age=60
Apr 22 14:32:55 < Cache-Control: max-age=59
Apr 22 14:32:56 < Cache-Control: max-age=58

Expected behavior
Vault would ideally rotate the signing keys no later than the specified frequency. I think if the rotation frequency is 24 hours, it is better to rotate at 23 hours, 59 minutes and 59 seconds than at 24 hours and 1 second since the last rotation, all things considered

This would both ensure that people could assert that their keys are rotated within a certain time, which may be important for compliance reasons, and more importantly would remove the period of time where services verifying JWTs issued by vault would potentially do a sideband network call per incoming call to Vault's identity/oidc/.well-known/keys end point.

The crossed out suggestion above would not work, because a server verifying JWTs would potentially read the JWKS keys, get a Cache-Control: 25 header back, only to have Vault rotate and add a new signing key the next second. During those 25 seconds that one server would continue using its cached copy of the old JWKS set, and reject JWTs signed with the new key.

Based on that, I think the only way to remove both the Cache-Control: no-data state and jitter/spread out when services retrieve new JWKS sets from Vault is to implement publish new JWKS keys well ahead of starting to sign JWTs with them as documented by @evanj in #9201.

Environment:

  • Vault Server Version (retrieve with vault status): v1.6.2
  • Vault CLI Version (retrieve with vault version): v1.6.2
  • Server Operating System/Architecture: OS X

Vault server configuration file(s):

none, can be done using vault server -dev

Additional context

This issue partially duplicates the narrower half of #9201, but is focused only on the potential bug related to actual key rotation happening after scheduled rotation and doesn't include the feature request for pre-publishing future keys that #9201 includes

@ianferguson ianferguson changed the title OIDC key rotation periods below 2m are not honored OIDC key rotations happen up to 60 seconds after they are scheduled Apr 22, 2021
@ianferguson
Copy link
Contributor Author

@kalafut sorry to ping you, but after initially filing this issue I did some more investigation and it changed from what I think was a curiosity (you can't rotate JWT signing keys more frequently than 2m) to more of an issue, due to the up to 60 seconds where the /identity/oidc/.well-known/keys endpoint returns Cache-Control: no-cache at rotation time which can cause a stampede of calls from services verifying JWTs to Vault

as a result, I heavily rewrote the initial issue to include the increased impact

@ianferguson
Copy link
Contributor Author

I've put together a prototype for advertising new keys 1 rotation_period of time in advance of them being used on this branch: https://github.com/hashicorp/vault/compare/DataDog:ianferguson/prepublish_jwt_signing_keys

The branch additionally updates the Cache-Control logic to return a random value between 0 and the lowest rotation_period and validation_ttl of all keys on the vault cluster, to smooth out the rate servers call Vault for JWKS keys and avoid the thundering herd issues mentioned in #9201

It appears to work when I run it locally: with a dev server:

2021-04-22T19:58:24.626-0400 [DEBUG] secrets.identity.identity_7602cd79: generated OIDC public key to sign JWTs: key_id=f0b37182-c588-ff0c-4153-cd0a527eada7
2021-04-22T19:58:24.763-0400 [DEBUG] secrets.identity.identity_7602cd79: generated OIDC public key for future use: key_id=c4806dcd-3e2a-6eaf-5743-61ae431eb664
2021-04-22T20:00:22.960-0400 [DEBUG] secrets.identity.identity_7602cd79: rotating OIDC key: key=notary
2021-04-22T20:00:23.226-0400 [DEBUG] secrets.identity.identity_7602cd79: generated OIDC public key for future use: key_id=16b5f20f-1ac2-6d64-d1e5-6ddcd774f679
2021-04-22T20:00:23.227-0400 [DEBUG] secrets.identity.identity_7602cd79: rotated OIDC public key. now using: key_id=c4806dcd-3e2a-6eaf-5743-61ae431eb664
2021-04-22T20:01:22.962-0400 [DEBUG] secrets.identity.identity_7602cd79: deleted OIDC public key: key_id=f0b37182-c588-ff0c-4153-cd0a527eada7
2021-04-22T20:02:22.960-0400 [DEBUG] secrets.identity.identity_7602cd79: rotating OIDC key: key=notary
2021-04-22T20:02:23.312-0400 [DEBUG] secrets.identity.identity_7602cd79: generated OIDC public key for future use: key_id=258e47cc-0040-bda3-a559-1bba38d4d701
2021-04-22T20:02:23.313-0400 [DEBUG] secrets.identity.identity_7602cd79: rotated OIDC public key. now using: key_id=16b5f20f-1ac2-6d64-d1e5-6ddcd774f679
2021-04-22T20:02:23.313-0400 [DEBUG] secrets.identity.identity_7602cd79: deleted OIDC public key: key_id=c4806dcd-3e2a-6eaf-5743-61ae431eb664

and a sample of cache control values:

Apr 22 20:03:43 Thu Apr 22 20:03:43 EDT 2021< Cache-Control: max-age=58
Apr 22 20:03:45 Thu Apr 22 20:03:45 EDT 2021< Cache-Control: max-age=28
Apr 22 20:03:46 Thu Apr 22 20:03:46 EDT 2021< Cache-Control: max-age=62
Apr 22 20:03:47 Thu Apr 22 20:03:47 EDT 2021< Cache-Control: max-age=38
Apr 22 20:03:48 Thu Apr 22 20:03:48 EDT 2021< Cache-Control: max-age=38
Apr 22 20:03:49 Thu Apr 22 20:03:49 EDT 2021< Cache-Control: max-age=16
Apr 22 20:03:50 Thu Apr 22 20:03:50 EDT 2021< Cache-Control: max-age=2

But the code could probably use more tests and some cleanup.

Let me know if this is an approach y'all are interested in, and if so I can add tests, clean up the code and open that branch as a pull request

@ianferguson
Copy link
Contributor Author

I saw the OIDC description of a rotation strategy (https://openid.net/specs/openid-connect-core-1_0.html#RotateSigKeys) referenced via the coreos PR hashicorp/cap#36 references,

That strategy addresses the cases in this issue and #9201 where the Cache-Control header would be 0 or no-store, but still requires all servers verifying JWTs from an Vault issuer to successfully read the JWKS endpoint on the Vault cluster at nearly the same time (when a new KID appears), which is not ideal.

Additionally, using the user provided kid claim to trigger a network call to the JWKS endpoint prior to the JWT being authenticated may allow an attacker to generate load or induce denial of service attacks by spamming a service with a stream of JWTs that are invalid but have unique kid claims in their JWT headers, and trigger a network call from the service to the JWKS endpoint.

For those reasons, I think it is still advantageous for clients to utilize a more evenly spread rotation process with new keys being published for some period prior to being used.

@fairclothjm
Copy link
Contributor

@ianferguson Thanks for reporting this issue! If you would be willing to put up a PR for your branch, I would be happy to work with you to get it merged.

@ianferguson
Copy link
Contributor Author

@fairclothjm excellent! That branch was a very rough proof of concept, I should have time to clean it up next week and get it opened as a pull request

Please do let me know if y'all have any preferences around the design or configurable options, happy to incorporate any preferences y'all have

@fairclothjm
Copy link
Contributor

@ianferguson Sounds good, thanks!

@ianferguson
Copy link
Contributor Author

sorry I wasn't able to clean my branch up for a PR and thank you so much for taking this across the line!

@fairclothjm
Copy link
Contributor

Thanks for reporting and providing the initial groundwork!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants