Skip to content

CNTRLPLANE-2993: Add KMS TPv2 foundations#1960

Merged
openshift-merge-bot[bot] merged 10 commits intoopenshift:masterfrom
ardaguclu:cntrlplane-2993
Apr 14, 2026
Merged

CNTRLPLANE-2993: Add KMS TPv2 foundations#1960
openshift-merge-bot[bot] merged 10 commits intoopenshift:masterfrom
ardaguclu:cntrlplane-2993

Conversation

@ardaguclu
Copy link
Copy Markdown
Member

This PR adds KMS foundations in encryption controllers in library-go for Tech Preview v2 work.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 17, 2026
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 17, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Mar 17, 2026

@ardaguclu: This pull request references CNTRLPLANE-2993 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

This PR adds KMS foundations in encryption controllers in library-go for Tech Preview v2 work.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@ardaguclu ardaguclu changed the title CNTRLPLANE-2993: Add KMS TPv2 foundations WIP: CNTRLPLANE-2993: Add KMS TPv2 foundations Mar 17, 2026
@ardaguclu ardaguclu marked this pull request as ready for review March 17, 2026 07:39
@openshift-ci openshift-ci bot requested review from cybertron and jsafrane March 17, 2026 07:39
@ardaguclu ardaguclu changed the title WIP: CNTRLPLANE-2993: Add KMS TPv2 foundations CNTRLPLANE-2993: Add KMS TPv2 foundations Mar 17, 2026
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2026
@ardaguclu
Copy link
Copy Markdown
Member Author

/uncc @cybertron @jsafrane

Comment thread enhancements/kube-apiserver/kms-encryption-foundations.md Outdated
@ardaguclu
Copy link
Copy Markdown
Member Author

Copy link
Copy Markdown
Member

@flavianmissi flavianmissi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great, left some minor comments.

Comment thread enhancements/kube-apiserver/kms-encryption-foundations.md Outdated
Comment thread enhancements/kube-apiserver/kms-encryption-foundations.md Outdated
Copy link
Copy Markdown
Contributor

@p0lyn0mial p0lyn0mial left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did a first quick pass, will revisit this pr.

- Recovery from KMS key loss (separate EP for GA)
- Automatic `key_id` rotation detection (Tech Preview v2)
- KMS plugin deployment/lifecycle management (covered by a separate EP)
- KMS plugin health checks (GA)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it mean we will update this document in the future with details on plugin HC ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think yes, it would be better to reflect the HC mechanism (including monitoring the key_id) here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so this is our goal. Would it make sense to slightly restructure the docs and maybe introduce phases or different beta stages e.g. Beta1/Beta2) and then specify which work belongs to each stage?

That could help highlight our goals and non-goals in one section, while also making it easier to track progress across the beta stages.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be better to track the progress and reflect our future plans. I've updated based on it. There is new TechPreview v3 section and some of the goals are defined in there.

I've added configuring timeout via unsupportedOverrides in TP v3 as well.

- Automatic `key_id` rotation detection (Tech Preview v2)
- KMS plugin deployment/lifecycle management (covered by a separate EP)
- KMS plugin health checks (GA)
- Recovery from KMS key loss (GA)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last time we discussed this, we said we didn't want to support this recovery mode.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Since this is already in non-goals section, I'll just remove the GA.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just keep it in non-goals section. I think that we could justify the decision so that future reads know why we decided to not implement the support for it.

Users specify plugin-specific configuration for managed KMS provider types (e.g. Vault) via the APIServer resource (API fields covered by a separate EP).
Encryption controllers split the KMS configuration API into two parts stored atomically in encryption key secrets:

1. `kms-config` — fields for EncryptionConfiguration (apiVersion, name, endpoint, timeout)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field will hold common configuration for all KMS providers. I’m wondering if we should expose it to admins in case the default values (e.g., timeout) are incorrect.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field will hold common configuration for all KMS providers.

nit: might be worth mentioning that

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apiVersion is static v2. Name and endpoints are generated by us. Only remaining field is timeout which can be discussed #1954. If we decide to expose timeout, it will be a part of in-place fields.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to discuss during the standup meeting?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this field will hold structured versioned data, right ? if yes, would it make sense to emphasize that ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I've added configuring timeout via unsupportedOverrides as a goal for TP v3.

#### Encryption Controllers

**keyController** manages encryption key lifecycle. Creates encryption key secrets in `openshift-config-managed` namespace. For KMS mode, creates secrets storing KMS configuration.
For Tech Preview v2, also splits configuration into `kms-config` and `kms-sidecar-config`, performs field-level comparison, and validates credential secrets.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to support migration from v1 to v2? For example, what happens if a user enables TP v1 and then provides a Vault KMS configuration that indicates v2?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to my analysis, key controller can successfully decide to generate new key for the new configuration and every other controller will handle this correct. Of course, it would be desirable to test this.

annotations:
encryption.apiserver.operator.openshift.io/mode: "KMS"
type: Opaque
data:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we describe and show how credentials for the KMS plugin will be stored?
Should we also explain how the public key used to verify the remote KMS instance will be stored?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point. I've updated the EP by mentioning how we are planning to manage Secret and ConfigMap references and keep the our Secrets (i.e. key, encryption-config) up to date.

There are no preconditions for enabling KMS for the first time.

#### Variation: Migration Between Encryption Modes
#### Variation: Updates Requiring Migration (Tech Preview v2)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we describe what happens if the previous key/plugin hasn’t finished (our invariants)?

Should we describe how we are going to recover from incorrect configuration in various fields? (the ones that lead to new key generation and the ones that just update existing cfg)

Should we describe how we are going to recover if an admin overwrites the current configuration?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I've briefly mentioned about these points based on the design document.

**Note:** Automatic weekly key rotation (used for aescbc/aesgcm) is disabled for KMS since rotation is triggered externally.

#### Variation: KMS Key Rotation (Tech Preview v2)
#### Steps for Enabling KMS Encryption (Tech Preview v2)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need a section that explains how we will report the current progress to platform users (e.g. which KMS plugins are in use)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about we handle this when we design the removal of the unused kms plugins #1960 (comment). So that it lists active kms plugins as well as the ones that can be decommisioned.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, this could be Beta2/3/4. I would just mark this is TODO.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. This is added a new goal under TP v3.


When the user sets the encryption mode to identity, keyController creates a new encryption key secret for identity mode. The EncryptionConfiguration contains identity as write provider and the KMS plugin as read provider until migration completes.

After migration, the unused KMS plugin is removed from EncryptionConfiguration and status conditions notify the admin. Backups encrypted with the previous KMS plugin are not restorable without access to that plugin. This removal mechanism is out of scope in Tech Preview v2.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could explain why this (removing unused plugins from EC) is important.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to share details on how are we going to implement the removal ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could explain why this (removing unused plugins from EC) is important.

Yes, we should. I've updated the EP.

do we want to share details on how are we going to implement the removal ?

I think we are not ready yet to design this flow, so maybe we should add it later?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, this could be Beta2/3/4. I would just mark this is TODO.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. This is added a new goal under TP v3.

- KMS plugin health checks (Tech Preview v2)
- Recovery from KMS key loss (separate EP for GA)
- Automatic `key_id` rotation detection (Tech Preview v2)
- KMS plugin deployment/lifecycle management (covered by a separate EP)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to mark this as our goal and point to an external doc/EP?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it makes sense. I've moved API definition and plugin lifecycle under goals section. I've referenced API definition EP. Since plugin lifecycle does not have any doc or EP, I couldn't reference it.

- KMS plugin deployment/lifecycle management (covered by a separate EP)
- KMS plugin health checks (GA)
- Recovery from KMS key loss
- Automatic `key_id` rotation detection (GA)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to mark this as our goal and point to an external doc/EP, Beta2/BetaN seciton?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved this as a goal under TP v3. But there is no public doc or EP for it.

Encryption controllers split the KMS configuration API into multiple parts stored atomically in encryption key secrets:

1. `kms-config` — fields for EncryptionConfiguration (apiVersion, name, endpoint, timeout)
2. `kms-sidecar-config` — provider-specific fields for sidecar containers (image, vault-address, listen-address, transit-mount, transit-key, etc.)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this could hold structured per provider data, maybe even versioned. Thoughts ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that this will hold generic structured KMSConfig https://github.com/openshift/api/blob/master/config/v1/types_kmsencryption.go#L7. So everyone can access all the data after deserializing it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that kms-sidecar-config will always be deserialized to same KMSConfig but the internal data may vary per provider.

Users specify plugin-specific configuration for managed KMS provider types (e.g. Vault) via the APIServer resource (API fields covered by a separate EP).
Encryption controllers split the KMS configuration API into two parts stored atomically in encryption key secrets:

1. `kms-config` — fields for EncryptionConfiguration (apiVersion, name, endpoint, timeout)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this field will hold structured versioned data, right ? if yes, would it make sense to emphasize that ?

Users specify plugin-specific configuration for managed KMS provider types (e.g. Vault) via the APIServer resource (API fields covered by a separate EP).
Encryption controllers split the KMS configuration API into multiple parts stored atomically in encryption key secrets:

1. `kms-config` — fields for EncryptionConfiguration (apiVersion, name, endpoint, timeout)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to rename this field to kms-encryption-configuration ?

Copy link
Copy Markdown
Member Author

@ardaguclu ardaguclu Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed all the fields;

Please let me know your thoughts.


1. `kms-config` — fields for EncryptionConfiguration (apiVersion, name, endpoint, timeout)
2. `kms-sidecar-config` — provider-specific fields for sidecar containers (image, vault-address, listen-address, transit-mount, transit-key, etc.)
3. `kms-credentials` — credential data fetched from referenced secrets (e.g., approle credentials from `openshift-config` namespace)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering if we could encode the type information and the resource name, so we could build a small utility function that automatically resolves the type, name, and content. wdyt ?

Copy link
Copy Markdown
Member Author

@ardaguclu ardaguclu Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, I don't understand the motivation and mechanism here. Would you mind elaborating?

3. `kms-credentials` — credential data fetched from referenced secrets (e.g., approle credentials from `openshift-config` namespace)
4. `kms-configmap-data` — ConfigMap data needed by KMS plugins (e.g., CA bundles)

Storing all in the same secret avoids race conditions where EncryptionConfiguration references a KMS plugin whose sidecar configuration or credentials are not yet available.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case we would like to be more specific, we could say:

Storing all related data in a single secret avoids race conditions caused by reading live, independently changing configuration. 

In kas-o, the targetConfigController operates on live data and may generate a manifest based on the current sidecar configuration. However, this configuration can change before the RevisionController creates a revision. As a result, the generated manifest may no longer match the actual configuration state at the time the revision is created. 

Keeping all dependent configuration in a single secret ensures consistency and guarantees that both controllers operate on the same, atomic snapshot of data. 

Additionally, consolidating the data in a single secret leverages existing revisioning and cleanup mechanisms.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better. Updated.

@@ -65,18 +66,37 @@ Encryption controllers use the static endpoint in EncryptionConfiguration. KMS-t

**Tech Preview v2 (Managed Plugin Lifecycle):**
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this paragraph is too detailed. Would it make sense to keep it high-level similar to the Tech Preview v1 paragraph?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated by removing redundant details. Please let me know your opinions.

Users specify plugin-specific configuration for managed KMS provider types (e.g. Vault) via the APIServer resource (API fields covered by a separate EP).
Encryption controllers split the KMS configuration API into two parts stored atomically in encryption key secrets:

1. `kms-config` — fields for EncryptionConfiguration (apiVersion, name, endpoint, timeout)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field will hold common configuration for all KMS providers.

nit: might be worth mentioning that


1. `kms-config` — fields for EncryptionConfiguration (apiVersion, name, endpoint, timeout)
2. `kms-sidecar-config` — provider-specific fields for sidecar containers (image, vault-address, listen-address, transit-mount, transit-key, etc.)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not sure if kms-config and kms-sidecar-config express well the diff between these 2 configurations. Maybe something like kms-api-config and kms-provider-config would be better?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about these #1960 (comment)?

# Vault API specific fields
```

2. keyController detects the configuration, splits it into `kms-config` and `kms-sidecar-config`, and creates an encryption key secret:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually splits the configuration into more fields (other than these 2), right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. I've updated that section to reflect it better.

kind: Secret
metadata:
name: encryption-config-kube-apiserver-9
namespace: openshift-kube-apiserver
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused: the state controller doesn't seem to create the secret with the key in the name, nor it creates the secret in this namespace. Or am I missing something here?

As far as I see, the state controller creates this secret instead:

name: encryption-config-openshift-kube-apiserver
namespace: openshift-config-managed

Copy link
Copy Markdown
Member Author

@ardaguclu ardaguclu Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. As far as I understand there is a copying mechanism (in operators) that automatically copies encryption-config-openshift-*-apiserver Secrets in openshift-config-managed (which are created by state_controller as you said) under dedicated operator namespaces in revisioned format.

So when state_controller updates encryption-config-openshift-kube-apiserver in openshift-config-managed, the mechanism will generated new encryption-config-kube-apiserver-10 under openshift-kube-apiserver ns.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Just for completeness, the copying is done by the resourcesynccontroller: https://github.com/openshift/library-go/blob/HEAD/pkg/operator/resourcesynccontroller/resourcesync_controller.go#L31-L31

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is good to know that. Thank you

namespace: openshift-kube-apiserver
type: Opaque
data:
encryption-config: <EncryptionConfiguration>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my knowledge, could you explain why this isn't kms-ec-config-1?

Copy link
Copy Markdown
Member Author

@ardaguclu ardaguclu Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field encryption-config has existed for a long time and it is used by all encryption modes. Due to backwards compatibility reasons, we need to keep it. It is identical to kms-ec-config-1.

kind: Secret
metadata:
name: encryption-config-kube-apiserver-9
namespace: openshift-kube-apiserver
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is the synced secret, not the secret created by the state controller, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, exactly. But content is identical of the Secret that is managed by state controller.

Fields that only affect the container spec (e.g., image for CVE fixes) do not change the KEK:

1. keyController updates the existing encryption key secret in-place. No new secret is created.
2. stateController detects the change and triggers a new revision with the updated `kms-sidecar-config`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically the state controller just updates the secret (and the revision controller actually triggers the new revision), correct?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, correct


When the user sets the encryption mode to identity, keyController creates a new encryption key secret for identity mode. The EncryptionConfiguration contains identity as write provider and the KMS plugin as read provider until migration completes.

After migration, the unused KMS plugin is removed from EncryptionConfiguration. This is important because leaving stale providers in EncryptionConfiguration means the API server will continue attempting to connect to the old KMS plugin at startup, blocking readiness if the plugin is no longer available. Status conditions notify the admin that the KMS plugin can be safely decommissioned. Backups encrypted with the previous KMS plugin are not restorable without access to that plugin. The removal mechanism is out of scope in Tech Preview v2.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Status conditions notify the admin that the KMS plugin can be safely decommissioned.

Is the admin responsible for decommissioning the KSM plugin? Shouldn't the lifecycle management take care of that?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lifecycle management will only run kms plugins listed in encryption-config Secret. So once encryption controllers do not populate the old kms plugin in the Secret, Lifecycle won't run old kms plugin anyway.

But decommisioning of the kms plugin requires some action on the respective kms provider. So we can only notify about the current state and it is up to cluster admin. I'd like to hear opinions from @p0lyn0mial

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lifecycle management will only run kms plugins listed in encryption-config Secret.

That's a good point, the plugin will simply not be started. Thanks

Copy link
Copy Markdown
Contributor

@p0lyn0mial p0lyn0mial left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did another pass this morning. ptal and let me know if my comments make sense. thanks!

- Support updating the KMS timeout field via `unsupportedConfigOverrides`

**GA — Goals:**
- Failure mode coverage: loss of access to KMS service, lost encryption keys, loss of credentials
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for each failure mode we would like to have detection and mitigation mechanism. maybe worth mentioning.

the failures modes we discussed were captured at https://docs.google.com/document/d/1htjUQmCk60AVkcOe1QzLSYNrBhO7T8hVsFDiTbR4AJM/edit?tab=t.0#heading=h.7q1jphnlk300

would it make sense to group supported failures (by type/category)and be more specific which modes exactly we would like to support ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drilled down to subsections.

- KMS plugin health checks (Tech Preview v2)
- Recovery from KMS key loss (separate EP for GA)
- Automatic `key_id` rotation detection (Tech Preview v2)
- Recovery from KMS key loss
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should state why we don't want to support this mode (for future readers) ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a statement about how we exclude this.

- Seamless migration between encryption modes (aescbc ↔ KMS, KMS ↔ KMS)
- Field-level comparison to distinguish migration-requiring vs. in-place changes
- Pre-flight checks before generating new encryption keys
- Credential/ConfigMap validation with degraded status reporting
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does it mean ? how can we validate credentials ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By validation of Secret/ConfigMap, I meant this #1960 (comment). I will clarify this statement.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need another design session about this.

- API field definitions for KMS provider configuration in APIServer resource (covered by a [separate EP](https://github.com/openshift/enhancements/pull/1954))

**Tech Preview v3 — Goals:**
- Report current KMS encryption status to platform users (e.g., active KMS plugins, migration progress)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't migration progress be already reported ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed migration progress to not confuse

- Report current KMS encryption status to platform users (e.g., active KMS plugins, migration progress)
- Automatic `key_id` rotation detection
- KMS plugin health checks
- Feature parity with existing modes (monitoring, migration, key rotation)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wdym by that ?
migration will be supported because we mentioned it in the prev section v2 goals.
key rotation is already mentioned in this section.
do we have special monitoring for encryption ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This became redundant. Removed.

3. `kms-secret-data` — content of the referenced Secret (e.g., approle credentials)
4. `kms-configmap-data` — content of the referenced ConfigMap (e.g., CA bundles)

Storing all related data in a single secret ensures consistency and leverages existing revisioning and cleanup mechanisms.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pasting my prev comment just in case we would like to add more context (justifies why):

Storing all related data in a single secret avoids race conditions caused by reading live, independently changing configuration. 

In kas-o, the targetConfigController operates on live data and may generate a manifest based on the current sidecar configuration. However, this configuration can change before the RevisionController creates a revision. As a result, the generated manifest may no longer match the actual configuration state at the time the revision is created. 

Keeping all dependent configuration in a single secret ensures consistency and guarantees that both controllers operate on the same, atomic snapshot of data. 

Additionally, consolidating the data in a single secret leverages existing revisioning and cleanup mechanisms.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

**Key changes in library-go:**
1. Add KMS mode constant and track KMS configuration in encryption key secrets
2. Split configuration into kms-encryption-config, kms-provider-config, kms-secret-data, and kms-configmap-data; copy with keyID suffix to encryption-configuration secrets (Tech Preview v2)
3. Field-level comparison, credential/ConfigMap validation, and periodic sync of referenced resources to all active key secrets (Tech Preview v2)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can we validate credential/ConfigMap ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation of Secret/ConfigMap means that we'll check the existence of the Secret/ConfigMap and if it doesn't exist, we'll degrade until cluster admin recreates the same resource. I think, I should clarify this, since it is ambiguous.

**Key changes in library-go:**
1. Add KMS mode constant and track KMS configuration in encryption key secrets
2. Split configuration into kms-encryption-config, kms-provider-config, kms-secret-data, and kms-configmap-data; copy with keyID suffix to encryption-configuration secrets (Tech Preview v2)
3. Field-level comparison, credential/ConfigMap validation, and periodic sync of referenced resources to all active key secrets (Tech Preview v2)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also mention that controller will propagate updates from the API configuration ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Updated.


7. conditionController updates status conditions: `EncryptionInProgress`, then `EncryptionCompleted`.

There are no preconditions for enabling KMS for the first time.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the precondition of passing the preflight-checks before the configured KMS plugin can be enabled for the first time, no ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated statement by mentioning about preflight checks.

As a side note: Maybe we should apply pre-flight checks not only key creation but all of them (including the updates). So that we can always verify that any update will work properly. WDYT?

#### Variation: Updates Requiring Migration (Tech Preview v2)

When external KMS rotates the key internally:
If a field affecting the KEK is changed (**vault-address**, **vault-namespace**, **transit-key**, **transit-mount**), keyController creates a new encryption key secret with the next keyID.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sens to link to the Preconditions for Configuration Changes (Tech Preview v2) section for more details on when a new key will be created ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Referenced.

- KMS plugin health checks (Tech Preview v2)
- Recovery from KMS key loss (separate EP for GA)
- Automatic `key_id` rotation detection (Tech Preview v2)
- Recovery from KMS key loss (if the key is deleted externally, recovery is equivalent to bootstrapping the cluster from scratch)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should provide more context.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example:

Recovery from this situation would require deleting all resources that we are unable to decode and then recreating them from scratch. This process is costly and complex to implement (for example, all certificates would need to be reissued, the etcd cluster rebuilt, etc.), and is comparable in effort to implementing a full re-bootstrap. Additionally, the recovery flow would need to be covered by CI tests to catch potential regressions.

Moreover, the platform itself would not be able to recreate resources required by user workloads, since only users have the necessary knowledge about them. In practice, this means users must have their own mechanisms for restoring these resources.

On the Vault side, the key (the KEK used to encrypt the cluster seed, which Kubernetes then uses to generate DEKs for encrypting cluster data) is stored in Vault’s Transit secrets engine. By default, keys in Transit have `deletion_allowed set` to `false`. A Vault administrator would need to explicitly change this setting to true in order to allow key deletion. In general standard best practices should be followed. This includes enforcing least-privilege access to sensitive API endpoints, such as those used for key deletion or key configuration updates. It is also recommended to periodically back up keys, so they can be restored if needed.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've created a dedicated section for this and referenced from the non-goal.

@p0lyn0mial
Copy link
Copy Markdown
Contributor

I like this document. I’m convinced the mechanics described in it will work. There are still sections that require experimentation and design (e.g. key rotation), but those can be updated later. I’d like to focus on the internal v2 version for now. Thank you for putting this document together and thanks to everyone who reviewed it.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Apr 13, 2026

@ardaguclu: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@benluddy
Copy link
Copy Markdown

/approve

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Apr 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 13, 2026
@p0lyn0mial
Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 14, 2026
@openshift-merge-bot openshift-merge-bot bot merged commit 482fb5b into openshift:master Apr 14, 2026
2 checks passed
@ardaguclu ardaguclu deleted the cntrlplane-2993 branch April 14, 2026 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants