
core: add jwks rpc and http api #18035

Merged
merged 6 commits into main on Jul 27, 2023

Conversation

schmichael
Member

@schmichael schmichael commented Jul 22, 2023

Yanked out of #17434

JWKS endpoint to expose the public keys used to sign workload identity JWTs. The ultimate purpose of this endpoint is to enable integrations with 3rd party JWT auth methods such as Consul's and Vault's.

Not documenting this yet, but have a note on the project board. It's not really possible to use safely until we add audiences, expiration, and rotation to workload identities, so I'd like to leave it "hidden" or at least "unsupported" until we can offer a secure experience.

nomad/keyring_endpoint.go
Comment on lines 407 to 410
if keyMeta.Inactive() {
// Skip inactive keys
continue
}
Member

I went back to the RFC on this and it's not totally clear to me how this will work with root key rotation and identities being pulled from the client asynchronously. Suppose the following scenario:

  • Allocation is placed with identity signed with root key A
  • Root key is rotated to key B
  • Allocation makes a request to third party with identity signed by A
  • Third-party hasn't seen this key before, and makes the JWKS request, but public key A is missing

I think we still need to show the public key for any key that can possibly have an unexpired JWT? If we cap the maximum JWT expiry to the root key rotation window (30d), that would suggest we need to display the active key + at least one previous key.

(I might suggest there's no harm in showing all the inactive keys, but because the root key is used to encrypt Variables too and we can't GC them without a full rekey, this could be a fairly large set over time.)
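One way to express the "active key plus any key that could still back an unexpired JWT" rule is sketched below. The types and the 2×-window cutoff are my own illustration (assuming JWT expiry is capped at the rotation window, so a key created at T may sign until T+window and its last JWTs expire by T+2×window), not Nomad's actual keyring code:

```go
package main

import (
	"fmt"
	"time"
)

// keyMeta and publishable are hypothetical illustrations, not Nomad's types.
type keyMeta struct {
	KeyID      string
	Active     bool
	CreateTime time.Time
}

// publishable returns the active key plus any inactive key young enough that
// a JWT it signed could still be unexpired. If JWT expiry is capped at the
// rotation window, a key created at T may sign JWTs until T+window, and those
// JWTs live until T+2*window, so that is the publication cutoff.
func publishable(keys []keyMeta, now time.Time, rotationWindow time.Duration) []keyMeta {
	var out []keyMeta
	for _, k := range keys {
		if k.Active || now.Sub(k.CreateTime) < 2*rotationWindow {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	now := time.Now()
	day := 24 * time.Hour
	keys := []keyMeta{
		{KeyID: "a", Active: false, CreateTime: now.Add(-70 * day)}, // too old, pruned
		{KeyID: "b", Active: false, CreateTime: now.Add(-20 * day)}, // recent inactive, kept
		{KeyID: "c", Active: true, CreateTime: now},                 // active, kept
	}
	fmt.Println(len(publishable(keys, now, 30*day)))
}
```

With the default 30-day window this keeps the active key and one previous key, matching the "active key + at least one previous key" observation above.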

Member Author

I went back to the RFC on this and it's not totally clear to me how this will work

Yeah when writing the RFC I assumed inactive meant invalid and that is not the case: inactive keys are valid.

When fixing this up I was going to switch this logic to only skip deprecated keys, but it appears that state is deprecated: https://github.com/hashicorp/nomad/blob/v1.6.1/nomad/structs/keyring.go#L61-L64

JWKS must include the key for any credential that should be considered valid.

JWKS should not include keys for any credentials that should already be expired.

I can think of 2 paths forward, but both will require a new pending key status:

pending keys

Where "creation" is cluster creation (or at least when they first upgrade).

  1. At creation + rotation_threshold / 2 the leader should create a new key with the pending status. Pending keys must never be used for signing!
  2. At creation + rotation_threshold the current active key becomes inactive, and the current pending key becomes active.
  3. For rotation_threshold / 2 amount of time there is no pending key (we want to limit how early we create it a bit because the longer a root key exists, the more likely it is to get leaked).
  4. Goto 1

This will be our pre-publishing mechanism like Vault implemented in hashicorp/vault#12414 (although the description makes it sound caching related, the code is pre-creating the next key).
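The schedule in steps 1–4 could be sketched roughly as follows; the function name, signature, and state strings are illustrative only, not Nomad's actual keyring implementation:

```go
package main

import (
	"fmt"
	"time"
)

// nextTransition sketches the pending-key schedule: a pending key is created
// at creation + rotationThreshold/2 (pre-published, never used for signing),
// then promoted to active at creation + rotationThreshold, demoting the
// previous active key to inactive.
func nextTransition(keyAge, rotationThreshold time.Duration) string {
	switch {
	case keyAge < rotationThreshold/2:
		return "no pending key yet"
	case keyAge < rotationThreshold:
		return "pending key exists, pre-published via JWKS"
	default:
		return "rotate: pending becomes active, active becomes inactive"
	}
}

func main() {
	// A 20-day-old active key with the default 30-day rotation window.
	fmt.Println(nextTransition(20*24*time.Hour, 30*24*time.Hour))
}
```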

A. Redefine Key GC

This approach adjusts our existing root_key_gc_threshold to account for JWT's new expiration behavior:

  1. Maintain the existing root_key_rotation_threshold = "30 days"
  2. Change root_key_gc_threshold to 30 days from 1 hour.
  3. If gc_threshold < rotation_threshold, then force gc_threshold = rotation_threshold and spew a big warning in server logs.
  4. Validate that identity.ttl <= gc_threshold.

This means the JWKS endpoint can expose all non-deprecated keys (and if someday deprecated keys go away, so can that if statement here).

Users who manually configured their rotation_threshold > root_key_gc_threshold will get the gc_threshold forcibly bumped on upgrade and a warning in their logs. This sucks as it's changing behavior that a user specifically selected, but I think it's our best choice because:

  1. JWT expiration fundamentally impacts how the keyring is used, so I think it's appropriate to adapt the keyring's operation to properly handle that.
  2. Functionally there should be no user visible changes: variables will still work as before. JWTs will work as before. nomad operator root keyring rotate will still work. inactive keys will just exist for much longer.
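A minimal sketch of the validation in steps 3 and 4, with hypothetical config field names (Nomad's real server config structs differ):

```go
package main

import (
	"fmt"
	"time"
)

// keyringConfig is a hypothetical stand-in for the settings discussed above.
type keyringConfig struct {
	RotationThreshold time.Duration // root_key_rotation_threshold
	GCThreshold       time.Duration // root_key_gc_threshold
	IdentityTTL       time.Duration // identity.ttl
}

// validate forces gc_threshold up to rotation_threshold with a warning
// (step 3) and rejects identity TTLs longer than the GC threshold (step 4).
func (c *keyringConfig) validate() (warnings []string, err error) {
	if c.GCThreshold < c.RotationThreshold {
		warnings = append(warnings, fmt.Sprintf(
			"root_key_gc_threshold %s < root_key_rotation_threshold %s; forcing gc threshold up",
			c.GCThreshold, c.RotationThreshold))
		c.GCThreshold = c.RotationThreshold
	}
	if c.IdentityTTL > c.GCThreshold {
		return warnings, fmt.Errorf("identity ttl %s exceeds root_key_gc_threshold %s",
			c.IdentityTTL, c.GCThreshold)
	}
	return warnings, nil
}

func main() {
	c := &keyringConfig{
		RotationThreshold: 30 * 24 * time.Hour,
		GCThreshold:       time.Hour, // the old 1-hour default
		IdentityTTL:       24 * time.Hour,
	}
	warns, err := c.validate()
	fmt.Println(len(warns), err, c.GCThreshold)
}
```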

B. New Key Status

Instead of changing the GC threshold we could add an invalid or expired status for keys in addition to the new pending status both approaches require. (Simply un-deprecating deprecated might also be sufficient. I'm unclear on the plan/story there.)

  1. When a key is rotated its status still becomes inactive. These keys are still valid and therefore included in this endpoint.
  2. After rotation_threshold passes the key becomes invalid (or whatever name we choose). These keys are invalid and therefore excluded from this endpoint.
  3. After gc_threshold elapses we GC invalid keys, not inactive keys.

The pro here is that we don't change anyone's settings out from underneath them...

...the con is that we do that by just injecting a completely new intermediate state they have no control over. 😅

Functionally the approaches seem the same to me. I think the new status (B) might be more user friendly as it makes invalid keys observable instead of gc'ing them away for good. However, new states offer opportunities for new bugs! If we forget that we now store untrusted invalid keys and accidentally use one somewhere, that's a potentially serious problem.

For this reason I think I have a slight preference for (A).

Member

@tgross tgross Jul 27, 2023

Some additional background to consider when choosing between these two options:

  • Because it's such an expensive operation we don't currently, by default, ever re-key Variables. The periodic rotation just rotates the key for new Variables we encrypt, and even nomad operator root keyring rotate requires a -full flag to kick off a re-key. So keys aren't going to get GC'd unless an org's Variables are ephemeral or get frequently updated (the updates will get encrypted with the active key).
  • The original intent of deprecated was that after all Variables for a key were re-keyed and Workload Identities expired, the key could be safely deleted manually. I can't find in the RFC or the original PRs why we didn't simply delete the key at that point but it might have been paranoia/debuggability.
  • We intentionally punted on trying to figure out a better solution to the full re-key because as you noted at the time a 30d rotation means we end up with 120 keys after 10 years, and that's just not enough to care about.

The pending key state sounds great. The rotation_threshold / 2 does mean we're publishing 15 days in advance by default, which seems like a lot, but that's a nitpick based only on vibes and I'm fine with it if Vault was.

Option A. Redefine Key GC changes the threshold but otherwise leaves the existing GC behavior intact. That includes not re-keying by default and leaving the inactive keys around for potentially quite a long while.

Option B. New Key Status is arguably rolling back to what we were originally trying to do with the deprecated state, except with a working and reasonable expiration policy on Allocations because the JWTs expire now instead of living the lifetime of the Allocation. But the key can't really be considered deprecated until we've done a re-key, and that makes its value more dubious.

A third option would be to have an expired state (same as Option B) and when the GC finds keys in the expired state it kicks off a very slow re-key of all the Variables using that key, and then the expired key can be deleted. But we could also do the exact same thing with the inactive state by checking the rotation threshold first to ensure no JWTs are using the key.

So all that being said and one option not being "obviously" better than another, I'd advocate for option A with a follow-up down the road to do a slow re-key of Variables using inactive keys that are older than the GC threshold.

Member Author

The pending key state sounds great. The rotation_threshold / 2 does mean we're publishing 15 days in advance by default, which seems like a lot, but that's a nitpick based only on vibes and I'm fine with it if Vault was.

Good point. I think we'll probably want to see if we can keep rotations on the order of single-digit-days.

To be clear the / 2 was just a nice round number to go for. My goal was to keep it as simple, proportional, and deterministic as possible without just forcing every little decision on users via config parameters. We could really easily put users in the position of:

Ok so this alloc's client disconnected 3 days ago, has a max disconnect of 6 days, root keys are rotated every 14 days, and my identities have ttl's of 4 days so.........

That scenario is like 4 time-based config parameters already and doesn't cover JWKS or the behavior of 3rd party systems... so I kind of hope to just keep renewal logic proportional.

All that to say I'll make a note to revisit this when the server rotation PR goes up.

I'd advocate for option A with a follow-up down the road to do a slow re-key of Variables using inactive keys that are older than the GC threshold.

Thanks for all of the context. It is interesting how distinct the root Variable symmetric key is from the root Workload Identity asymmetric key... we could also "just" decouple those since Variables are stateful and Workload Identities are not (soon). As usual "just" is papering over the huge amount of work and complexity that would entail, so I doubt that's the right option.

That being said leaving keys-used-for-Variables-but-not-for-valid-WIs around indefinitely seems fine as long as we prune them from JWKS reasonably.

So yeah, I think we have some wiggle room here both in initial (upcoming!) implementation as well as how to evolve it over time.

nomad/structs/keyring.go
Comment on lines +55 to +61
switch alg := pubKey.Algorithm; alg {
case structs.PubKeyAlgEdDSA:
// Convert public key bytes to an ed25519 public key
jwk.Key = ed25519.PublicKey(pubKey.PublicKey)
default:
s.logger.Warn("unknown public key algorithm. server is likely newer than client", "alg", alg)
}
Member

I think this error message is suggesting that the request came to a Nomad client and was forwarded to a server that was newer, but the text isn't super clear and it's not necessarily the case (ex. the request was sent to a Nomad server that hadn't been upgraded and then that was forwarded to the leader which was).

What if the RPC returned both the []byte and the string representation needed by downstream consumers, so that we didn't have to worry about this conversion at all? I know this isn't how we'd normally handle this kind of problem, but it seems like it'd cut out a lot of upgrade-related worries in the future.

Member Author

Yeah I struggled with what responsibilities to put in the RPC vs HTTP handlers. If I understand your proposal it would mean the KeyringPublicKey struct the RPC handler returns would get changed to something like:

type KeyringPublicKey struct {
	KeyID string `json:"kid"`

	// Always "OKP" (octet key pair) currently
	KeyType string `json:"kty"`

	PublicKey []byte `json:"x"`

	// Always "EdDSA" currently
	Algorithm string `json:"alg"`

	// Always "Ed25519" currently, must match alg
	Curve string `json:"crv"`

	// Always "sig" currently
	Use string `json:"use"`

	// Purely for Nomad's cache control use. Not part of JWK spec.
	CreateTime int64 `json:"-"`
}

Then we would just JSON marshal struct { Keys []KeyringPublicKey } in the HTTP handler and not even touch go-jose.JSONWebKey.
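A rough sketch of that DIY marshalling, with hypothetical names. One wrinkle worth noting: JWKs use unpadded base64url for the "x" member, while encoding/json marshals a raw `[]byte` as standard padded base64, so the encoding has to be explicit:

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// jwk is a hypothetical hand-rolled JWK for an Ed25519 signing key, roughly
// the shape sketched above; it is not Nomad's actual type.
type jwk struct {
	KeyID     string `json:"kid"`
	KeyType   string `json:"kty"` // always "OKP" for Ed25519
	Curve     string `json:"crv"` // always "Ed25519"
	Algorithm string `json:"alg"` // always "EdDSA"
	Use       string `json:"use"` // always "sig"
	// JWKs require unpadded base64url; a []byte field would be marshalled
	// as standard base64 by encoding/json, so encode explicitly.
	X string `json:"x"`
}

func toJWK(kid string, pub ed25519.PublicKey) jwk {
	return jwk{
		KeyID:     kid,
		KeyType:   "OKP",
		Curve:     "Ed25519",
		Algorithm: "EdDSA",
		Use:       "sig",
		X:         base64.RawURLEncoding.EncodeToString(pub),
	}
}

func main() {
	// Fixed bytes instead of a random key so the output is deterministic.
	pub := ed25519.PublicKey(bytes.Repeat([]byte{0x01}, ed25519.PublicKeySize))
	set := struct {
		Keys []jwk `json:"keys"`
	}{Keys: []jwk{toJWK("example-kid", pub)}}
	out, _ := json.Marshal(set)
	fmt.Println(string(out))
}
```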

Reason 1 Against: Too DIY

I avoided this initially because it felt too DIY. The nice thing about go-jose.JSONWebKey is that it knows how to convert key material to the appropriate algorithm (alg), curve (crv), and type (kty) fields. In the future if we add other key types, as long as we updated the mapping of Algorithm -> Go Type, go-jose would ensure it was marshalled properly.

That being said go-jose is "saving" us 2 fields currently (kty and crv)... and crv we're effectively setting ourselves by doing the type conversion (that as you know is easy to forget!).

So my "too DIY to skip go-jose" argument seems weak at best...

Reason 2 Against: Validating JWTs in the Agent

But there's another reason to do the conversion here: if we ever want to validate JWTs in the Agent, the Agent needs to do the conversion we're doing here. There's no way to avoid converting Key []byte -> ed25519.PublicKey at some point in order to use it.

The auth middleware used for the Task API and the WhoAmI RPC would both benefit from being able to validate JWTs using these PublicKeys. In my other branch I have a Keyring.ListPublic -> PublicKeyCache conversion implemented so the JWKS endpoint and auth middleware could share the underlying public keys in their native Go types.

I think this is reason enough to keep go-jose.JSONWebKey around even though the type conversions are gross and error prone? I think we can keep the problematic conversion to 1 or 2 places in code.

Alternative: JSONWebKeySet internally

Plan: have the RPC return go-jose.JSONWebKeySet and let the HTTP handler serialize it.

Pro: All of the key handling is in the encrypter and keyring_endpoint files.

Con:

I don't think it's safe to send go-jose's JSONWebKey struct over our msgpack RPC. Maybe it is, and I could test it, but even if it is today I'm worried a change in either go-jose, msgpack, or even Go's crypto library could break things.

My main concern is type confusion with JSONWebKey.Key's `any` type: go-jose relies on the key's concrete type to properly emit the right JWKS JSON (through a custom marshal implementation and internal struct). If the RPC/msgpack layer were to lose the concrete type information, go-jose may guess that it's a symmetric encryption key and cause a really confusing error (which I caused the other day and you helped me debug!).

I'm also a little worried that if we started using X509 certificates in the future that the x509.Certificate might not roundtrip cleanly through msgpack, so we'd silently change things like timestamps?

I don't know... these may be unreasonable concerns, but explicit is better than implicit seems doubly true to me when it comes to cryptographic materials and Key interface{} is the definition of implicit.

Member

If I understand your proposal it would mean the KeyringPublicKey struct the RPC handler returns would get changed to something like:
...

Yeah that's pretty much what I was thinking. To be clear, I'm less concerned about where in the code the conversion is happening than avoiding weird cross-version problems in doing that conversion with info we got from the server. But...

The auth middleware used for the Task API and the WhoAmI RPC would both benefit from being able to validate JWTs using these PublicKeys.

I think I'm probably sold on the basis of this alone. We know we're likely to want to do this in the near-ish future.

Member

What do you think about my comment about the text of the error message though?

Comment on lines +504 to +505
// JWKS Handler
s.mux.HandleFunc("/.well-known/jwks.json", s.wrap(s.JWKSRequest))
Member

Typically the HTTP server will be protected with mTLS, which won't be reachable by third parties. So this probably needs to get served on a separate HTTP listener. (On its own port, I suppose?) This PR is already a nice size so we don't have to solve that in this PR, but just wanted to make sure we didn't forget about that problem.

Member Author

Yeah in the RFC (internal only, sorry folks) I call this out as a problem.

The Task API (+Workload Identity) gets around the mTLS problem for things running in Nomad... which our friends Consul and Vault usually are not.

...so then you're left with the option of running an mTLS terminating proxy just for this endpoint which is a hassle to say the least. How the proxy and the JWT's issuer field interacted could be problematic as well, although I don't think for Consul and Vault because their JWT auth methods allow you to specify the JWKSURL explicitly and only seem to use the issuer claim for mapping.

That being said this is only a problem for folks running with tls.verify_https_client = true explicitly as it defaults to false even with mTLS enabled. I think this is still an ok default and recommendation as the HTTP endpoints will still be protected by (asymmetric) TLS, and the Consul and Vault JWT auth methods support setting a custom CA cert for that.

.......even after all of that we should probably add a "Only TLS, Not mTLS" HTTP listener for this, the eventual OIDC discovery endpoint, and maybe other "low risk" endpoints like metrics.

While I feel like the situation has fairly reasonable options, I think it's clearly too complicated. Somebody hardening their cluster by setting tls.verify_https_client = true one day breaking their Consul integration due to the JWT auth method lacking a client certificate would just be hellish to debug. Definitely a : (╯°□°)╯︵ ┻━┻ situation I want to avoid.

Member

@tgross tgross Jul 27, 2023

I keep forgetting about tls.verify_https_client = false being the default. Even Vault's equivalent is set to false.

We definitely don't need to solve this in this PR for sure 😀

schmichael and others added 2 commits July 24, 2023 16:49
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Comment on lines +58 to +60
// When we fallback to the min wait it means we miss the expiration, but this
// is necessary to prevent stampedes after outages and partitions.
must.GreaterEq(t, exp.Sub(now()), renew)
Member

It would be nice to have this comment in the docstring for ExpiryToRenewTime rather than the test, because the "why" wasn't clear looking at the code.

Comment on lines +15 to +16
// If the expiration is in the past or less than the min wait, then the min
// wait time will be used with jitter.
Member

But doesn't this behavior floor the result at 15min? If so, when would the client ever renew if the time to renew is always 15min in the future? Is the assumption that clients will call this function (or something that calls it, like the JWKS endpoint) once and then set a timer on that value?

Member Author

Is the assumption that clients will call this function (or something that calls it, like the JWKS endpoint) once and then set a timer on that value?

Yes, but oof you're right this function has a sort of Fence Post Problem: if you call it before each fetch to determine whether it's time to fetch: you'll never renew!

However if you call it after you fetch it will give you a reasonable time to fetch in. JWKS is the latter, so I think it's safe:

  1. Consul fetches JWKS v1 response, Cache-Control=15 minutes
  2. Consul uses JWKS v1 for 15 minutes
  3. Consul fetches JWKS v2 response, Cache-Control=...

If Consul was using something like ExpiryToRenewTime prior to each fetch I think we'd have the problem you mention.

I'll see if there's a safer way to write this function when I post the next PR that references it. It's possible we don't want to actually share this code at all and can just make it a private function for JWKS to use.
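A sketch of what such a helper might look like. The name, the min-wait fallback, and the half-life-plus-jitter policy are all illustrative, not the actual ExpiryToRenewTime implementation:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// renewWait returns how long to wait before re-fetching, given the expiry of
// what we just fetched. If the expiration is in the past or closer than
// minWait, fall back to minWait to prevent stampedes after outages and
// partitions. Call it *after* a successful fetch, then set a timer on the
// result; calling it *before* each fetch hits the fence-post problem above.
func renewWait(exp, now time.Time, minWait time.Duration) time.Duration {
	until := exp.Sub(now)
	if until < minWait {
		until = minWait
	}
	// Renew at roughly half the remaining lifetime, with a little jitter so
	// a fleet of consumers doesn't refresh in lockstep.
	wait := until / 2
	jitter := time.Duration(rand.Int63n(int64(wait)/10 + 1))
	return wait + jitter
}

func main() {
	now := time.Now()
	// An already-expired response still yields a sane wait (>= minWait/2).
	fmt.Println(renewWait(now.Add(-time.Minute), now, 15*time.Minute) >= 7*time.Minute)
}
```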

Member Author

(also just realized I named it "...to renew time" when it returns a duration, which is a bit silly)

Member

@tgross tgross left a comment

LGTM! 👍 I've left a couple remaining questions but I don't think there's anything blocking here at this point.

@schmichael schmichael merged commit d14362e into main Jul 27, 2023
24 of 25 checks passed
@schmichael schmichael deleted the f-jwks branch July 27, 2023 18:27