Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automated Key Rotation Not Occurring Every 30 Days #23528

Closed
ChefAustin opened this issue Jul 10, 2024 · 4 comments · Fixed by #23577
Closed

Automated Key Rotation Not Occurring Every 30 Days #23528

ChefAustin opened this issue Jul 10, 2024 · 4 comments · Fixed by #23577
Assignees
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/keyring type/bug
Milestone

Comments

@ChefAustin
Copy link

ChefAustin commented Jul 10, 2024

Nomad version

$ nomad version
Nomad v1.8.0+ent
BuildDate 2024-05-28T18:08:23Z
Revision 5a7a5d80944f1568c8bcd4e71a7f923fc04f2461

Operating system and Environment details

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.4 LTS"

Issue

[Possibly related to #19669 ? ¯\_(ツ)_/¯]

According to the ‘Key Rotation’ section of Hashicorp Nomad's Key Management Operation Docs, “Nomad automatically rotates the active encryption key every 30 days”.

$ nomad operator root keyring list -verbose
Key                                   State   Create Time
c5b22e0f-0fdf-5ae6-6b81-8bec90df57c9  active  2023-08-03T08:23:36+02:00

Note how on my cluster, the key's Create Time is greater than 30 days (indicating that, perhaps, the recurring 30 day task for key rotation is not occurring as it should be).

Note: It is entirely possible that I am misunderstanding some nuances of how the keyring and key rotation work.

(Also: This issue dupes internal JIRA ticket NET-10398 && Hashicorp Support request #154558.)

Reproduction steps

  • Bring online a Nomad cluster using v1.6.1
  • Upgrade to v1.7.3
  • Upgrade to v1.7.3-ent
  • Upgrade to v1.7.4-ent
  • Create a second cluster (v1.7.4-ent) in a different region and federate it with the first cluster
  • Upgrade to v1.8.0-ent
  • Wait 31 days
  • Check keyring list for active key's "Create Time"

Expected Result

Active key's Create Time should not be more than 30 days old.

Actual Result

Active key's Create Time is older than 30 days (342 days old in my case).

@tgross
Copy link
Member

tgross commented Jul 10, 2024

Hi @ChefAustin! I don't think federation is really involved here, as we don't replicate the keyring between regions (see #20123 for more discussion of that). But there's obviously something broken here for sure. I'm actively working on #19669 (and a host of other keyring-related issues), so I'll investigate this as part of that body of work.

(Also, thanks for the pointer to NET-10398; it hadn't shown up on the Nomad engineering board yet.)

@tgross tgross self-assigned this Jul 10, 2024
@tgross tgross added this to the 1.8.x milestone Jul 10, 2024
@tgross tgross added stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/keyring and removed theme/workload-identity labels Jul 10, 2024
@tgross
Copy link
Member

tgross commented Jul 10, 2024

Hi @ChefAustin just a heads up that I've figured out this is happening because of #16359. I'm not going to block fixing this on solving #16359 more generally though; we have the CreateTime on the key metadata already so I can use that instead of the CreateIndex to check the rotation threshold.

@ChefAustin
Copy link
Author

Hi @ChefAustin just a heads up that I've figured out this is happening because of #16359. I'm not going to block fixing this on solving #16359 more generally though; we have the CreateTime on the key metadata already so I can use that instead of the CreateIndex to check the rotation threshold.

Great to hear that this bugfix isn't blocked by a larger effort. I appreciate your digging into this! Thanks.

tgross added a commit that referenced this issue Jul 12, 2024
When a root key is rotated, the servers immediately start signing Workload
Identities with the new active key. But workloads may be using those WI tokens
to sign into external services, which may not have had time to fetch the new
public key and which might try to fetch new keys as needed.

Add support for prepublishing keys. Prepublished keys will be visible in the
JWKS endpoint but will not be used for signing or encryption until their
`PublishTime`. Update the periodic key rotation to prepublish keys at half the
`root_key_rotation_threshold` window, and promote prepublished keys to active
after the `PublishTime`.

This changeset also fixes two bugs in periodic root key rotation and garbage
collection, both of which can't be safely fixed without implementing
prepublishing:

* Periodic root key rotation would never happen because the default
  `root_key_rotation_threshold` of 720h exceeds the 72h maximum window of the FSM
  time table. We now compare the `CreateTime` against the wall clock time instead
  of the time table. (We expect to remove the time table in future work, ref
  #16359)
* Root key garbage collection could GC keys that were used to sign
  identities. We now wait until `root_key_rotation_threshold` +
  `root_key_gc_threshold` before GC'ing a key.

Ref: https://hashicorp.atlassian.net/browse/NET-10398
Ref: https://hashicorp.atlassian.net/browse/NET-10280
Fixes: #19669
Fixes: #23528
tgross added a commit that referenced this issue Jul 12, 2024
When a root key is rotated, the servers immediately start signing Workload
Identities with the new active key. But workloads may be using those WI tokens
to sign into external services, which may not have had time to fetch the new
public key and which might try to fetch new keys as needed.

Add support for prepublishing keys. Prepublished keys will be visible in the
JWKS endpoint but will not be used for signing or encryption until their
`PublishTime`. Update the periodic key rotation to prepublish keys at half the
`root_key_rotation_threshold` window, and promote prepublished keys to active
after the `PublishTime`.

This changeset also fixes two bugs in periodic root key rotation and garbage
collection, both of which can't be safely fixed without implementing
prepublishing:

* Periodic root key rotation would never happen because the default
  `root_key_rotation_threshold` of 720h exceeds the 72h maximum window of the FSM
  time table. We now compare the `CreateTime` against the wall clock time instead
  of the time table. (We expect to remove the time table in future work, ref
  #16359)
* Root key garbage collection could GC keys that were used to sign
  identities. We now wait until `root_key_rotation_threshold` +
  `root_key_gc_threshold` before GC'ing a key.

Ref: https://hashicorp.atlassian.net/browse/NET-10398
Ref: https://hashicorp.atlassian.net/browse/NET-10280
Fixes: #19669
Fixes: #23528
tgross added a commit that referenced this issue Jul 12, 2024
When a root key is rotated, the servers immediately start signing Workload
Identities with the new active key. But workloads may be using those WI tokens
to sign into external services, which may not have had time to fetch the new
public key and which might try to fetch new keys as needed.

Add support for prepublishing keys. Prepublished keys will be visible in the
JWKS endpoint but will not be used for signing or encryption until their
`PublishTime`. Update the periodic key rotation to prepublish keys at half the
`root_key_rotation_threshold` window, and promote prepublished keys to active
after the `PublishTime`.

This changeset also fixes two bugs in periodic root key rotation and garbage
collection, both of which can't be safely fixed without implementing
prepublishing:

* Periodic root key rotation would never happen because the default
  `root_key_rotation_threshold` of 720h exceeds the 72h maximum window of the FSM
  time table. We now compare the `CreateTime` against the wall clock time instead
  of the time table. (We expect to remove the time table in future work, ref
  #16359)
* Root key garbage collection could GC keys that were used to sign
  identities. We now wait until `root_key_rotation_threshold` +
  `root_key_gc_threshold` before GC'ing a key.
* When rekeying a root key, the core job did not mark the key as inactive after
  the rekey was complete.

Ref: https://hashicorp.atlassian.net/browse/NET-10398
Ref: https://hashicorp.atlassian.net/browse/NET-10280
Fixes: #19669
Fixes: #23528
Fixes: #19368
tgross added a commit that referenced this issue Jul 12, 2024
When a root key is rotated, the servers immediately start signing Workload
Identities with the new active key. But workloads may be using those WI tokens
to sign into external services, which may not have had time to fetch the new
public key and which might try to fetch new keys as needed.

Add support for prepublishing keys. Prepublished keys will be visible in the
JWKS endpoint but will not be used for signing or encryption until their
`PublishTime`. Update the periodic key rotation to prepublish keys at half the
`root_key_rotation_threshold` window, and promote prepublished keys to active
after the `PublishTime`.

This changeset also fixes two bugs in periodic root key rotation and garbage
collection, both of which can't be safely fixed without implementing
prepublishing:

* Periodic root key rotation would never happen because the default
  `root_key_rotation_threshold` of 720h exceeds the 72h maximum window of the FSM
  time table. We now compare the `CreateTime` against the wall clock time instead
  of the time table. (We expect to remove the time table in future work, ref
  #16359)
* Root key garbage collection could GC keys that were used to sign
  identities. We now wait until `root_key_rotation_threshold` +
  `root_key_gc_threshold` before GC'ing a key.
* When rekeying a root key, the core job did not mark the key as inactive after
  the rekey was complete.

Ref: https://hashicorp.atlassian.net/browse/NET-10398
Ref: https://hashicorp.atlassian.net/browse/NET-10280
Fixes: #19669
Fixes: #23528
Fixes: #19368
tgross added a commit that referenced this issue Jul 18, 2024
When a root key is rotated, the servers immediately start signing Workload
Identities with the new active key. But workloads may be using those WI tokens
to sign into external services, which may not have had time to fetch the new
public key and which might try to fetch new keys as needed.

Add support for prepublishing keys. Prepublished keys will be visible in the
JWKS endpoint but will not be used for signing or encryption until their
`PublishTime`. Update the periodic key rotation to prepublish keys at half the
`root_key_rotation_threshold` window, and promote prepublished keys to active
after the `PublishTime`.

This changeset also fixes two bugs in periodic root key rotation and garbage
collection, both of which can't be safely fixed without implementing
prepublishing:

* Periodic root key rotation would never happen because the default
  `root_key_rotation_threshold` of 720h exceeds the 72h maximum window of the FSM
  time table. We now compare the `CreateTime` against the wall clock time instead
  of the time table. (We expect to remove the time table in future work, ref
  #16359)
* Root key garbage collection could GC keys that were used to sign
  identities. We now wait until `root_key_rotation_threshold` +
  `root_key_gc_threshold` before GC'ing a key.
* When rekeying a root key, the core job did not mark the key as inactive after
  the rekey was complete.

Ref: https://hashicorp.atlassian.net/browse/NET-10398
Ref: https://hashicorp.atlassian.net/browse/NET-10280
Fixes: #19669
Fixes: #23528
Fixes: #19368
tgross added a commit that referenced this issue Jul 18, 2024
When a root key is rotated, the servers immediately start signing Workload
Identities with the new active key. But workloads may be using those WI tokens
to sign into external services, which may not have had time to fetch the new
public key and which might try to fetch new keys as needed.

Add support for prepublishing keys. Prepublished keys will be visible in the
JWKS endpoint but will not be used for signing or encryption until their
`PublishTime`. Update the periodic key rotation to prepublish keys at half the
`root_key_rotation_threshold` window, and promote prepublished keys to active
after the `PublishTime`.

This changeset also fixes two bugs in periodic root key rotation and garbage
collection, both of which can't be safely fixed without implementing
prepublishing:

* Periodic root key rotation would never happen because the default
  `root_key_rotation_threshold` of 720h exceeds the 72h maximum window of the FSM
  time table. We now compare the `CreateTime` against the wall clock time instead
  of the time table. (We expect to remove the time table in future work, ref
  #16359)
* Root key garbage collection could GC keys that were used to sign
  identities. We now wait until `root_key_rotation_threshold` +
  `root_key_gc_threshold` before GC'ing a key.
* When rekeying a root key, the core job did not mark the key as inactive after
  the rekey was complete.

Ref: https://hashicorp.atlassian.net/browse/NET-10398
Ref: https://hashicorp.atlassian.net/browse/NET-10280
Fixes: #19669
Fixes: #23528
Fixes: #19368
tgross added a commit that referenced this issue Jul 19, 2024
When a root key is rotated, the servers immediately start signing Workload
Identities with the new active key. But workloads may be using those WI tokens
to sign into external services, which may not have had time to fetch the new
public key and which might try to fetch new keys as needed.

Add support for prepublishing keys. Prepublished keys will be visible in the
JWKS endpoint but will not be used for signing or encryption until their
`PublishTime`. Update the periodic key rotation to prepublish keys at half the
`root_key_rotation_threshold` window, and promote prepublished keys to active
after the `PublishTime`.

This changeset also fixes two bugs in periodic root key rotation and garbage
collection, both of which can't be safely fixed without implementing
prepublishing:

* Periodic root key rotation would never happen because the default
  `root_key_rotation_threshold` of 720h exceeds the 72h maximum window of the FSM
  time table. We now compare the `CreateTime` against the wall clock time instead
  of the time table. (We expect to remove the time table in future work, ref
  #16359)
* Root key garbage collection could GC keys that were used to sign
  identities. We now wait until `root_key_rotation_threshold` +
  `root_key_gc_threshold` before GC'ing a key.
* When rekeying a root key, the core job did not mark the key as inactive after
  the rekey was complete.

Ref: https://hashicorp.atlassian.net/browse/NET-10398
Ref: https://hashicorp.atlassian.net/browse/NET-10280
Fixes: #19669
Fixes: #23528
Fixes: #19368
@tgross tgross closed this as completed in 2f43534 Jul 19, 2024
tgross added a commit that referenced this issue Jul 19, 2024
…23651)

When a root key is rotated, the servers immediately start signing Workload Identities with the new active key. But workloads may be using those WI tokens to sign into external services, which may not have had time to fetch the new public key and which might try to fetch new keys as needed.

Add support for prepublishing keys. Prepublished keys will be visible in the JWKS endpoint but will not be used for signing or encryption until their `PublishTime`. Update the periodic key rotation to prepublish keys at half the `root_key_rotation_threshold` window, and promote prepublished keys to active after the `PublishTime`.

This changeset also fixes three bugs in periodic root key rotation and garbage collection, none of which can be safely fixed without implementing prepublishing:

* Periodic root key rotation would never happen because the default `root_key_rotation_threshold` of 720h exceeds the 72h maximum window of the FSM time table. We now compare the `CreateTime` against the wall clock time instead of the time table. (We expect to remove the time table in future work, ref #16359)
* Root key garbage collection could GC keys that were used to sign identities. We now wait until `root_key_rotation_threshold` + `root_key_gc_threshold` before GC'ing a key.
*  When rekeying a root key, the core job did not mark the key as inactive after the rekey was complete.

Ref: https://hashicorp.atlassian.net/browse/NET-10398
Ref: https://hashicorp.atlassian.net/browse/NET-10280
Fixes: #19669
Fixes: #23528
Fixes: #19368

Co-authored-by: Tim Gross <tgross@hashicorp.com>
@tgross tgross modified the milestones: 1.8.x, 1.8.3 Jul 19, 2024
@tgross
Copy link
Member

tgross commented Jul 19, 2024

Fixed in #23577 and will ship in the next regular release of Nomad 1.8.x, with backports to Nomad 1.7.x/1.6.x Enterprise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/keyring type/bug
Projects
Development

Successfully merging a pull request may close this issue.

2 participants