test/encryption: add InvalidConfigRecoveryScenario for KMS plugin image by gangwgr · Pull Request #2249 · openshift/library-go

gangwgr · 2026-05-29T05:10:32Z

test/encryption: add InvalidImageRecoveryScenario for KMS plugin image

Summary by CodeRabbit

Tests
- Added new encryption recovery test scenarios for validating system behavior during invalid configuration states and recovery processes.

coderabbitai · 2026-05-29T05:14:08Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

This PR adds a new test scenario and helper function to validate encryption recovery workflows. The InvalidImageRecoveryScenario type extends the base BasicScenario with fields for a valid KMS provider and a degradation wait callback. The TestEncryptionInvalidImageRecovery function orchestrates a recovery flow: it waits for the operator to enter a degraded state, asserts encryption remains on KMS during degradation, then switches to a valid KMS provider using existing encryption helpers.

Changes

Invalid Image Recovery Test Scenario

| Layer / File(s) | Summary |
| --- | --- |
| InvalidImageRecoveryScenario type and test function `test/library/encryption/scenarios.go` | Imports `k8s.io/apimachinery/pkg/runtime/schema`, introduces `InvalidImageRecoveryScenario` type with `ValidImageProvider` and `WaitForDegraded` fields, and implements `TestEncryptionInvalidImageRecovery` to validate required fields, wait for operator degradation, assert KMS encryption persists during degradation, and apply a valid KMS provider to complete recovery. |

🎯 2 (Simple) | ⏱️ ~8 minutes

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 3 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Container-Privileges | ❌ Error | PR adds DaemonSet manifest with privileged: true in two containers without inline justification; justification exists only in separate rolebinding comment. | Add inline comments in k8s_mock_kms_plugin_daemonset.yaml explaining why privileged: true is required for Unix socket creation on host filesystem. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | Missing post-recovery validation of operator readiness. SetAndWaitForEncryptionType only ensures encryption key migration, not that operator degradation is cleared. | Add explicit post-recovery assertion/wait to verify operator is no longer degraded, symmetric to WaitForDegraded callback. |
| Title check | ⚠️ Warning | The PR title mentions 'InvalidConfigRecoveryScenario' but the actual added type is 'InvalidImageRecoveryScenario', creating a naming mismatch. | Update the PR title to 'test/encryption: add InvalidImageRecoveryScenario for KMS plugin image' to match the actual exported type name added in the changeset. |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | New code has no Ginkgo test definitions and uses only static strings in logging, with no dynamic values that could vary between test runs. |
| Microshift Test Compatibility | ✅ Passed | No Ginkgo e2e tests (It/Describe/Context/When) added. Changes add helper functions in test/library/, not e2e tests. MicroShift check applies to e2e tests, not helpers. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | Code is test helper library in test/library/encryption/scenarios.go with no Ginkgo test declarations (It(), Describe(), etc.), so SNO compatibility check does not apply to infrastructure library code. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR modifies only test code (test/library/encryption/scenarios.go), not deployment manifests, operator code, or controllers. The custom check explicitly applies only to those categories. |
| Ote Binary Stdout Contract | ✅ Passed | No process-level stdout writes detected. Code is library package with test helper using only t.Log (test-safe), no fmt.Print or klog at process level. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The added code is a helper library (not Ginkgo e2e test) with no IPv4 hardcoded addresses, external connectivity requirements, or IPv6-unsafe patterns. |
| No-Weak-Crypto | ✅ Passed | PR adds test scenario code with no weak crypto algorithms, custom crypto implementations, or insecure secret comparisons detected. |
| No-Sensitive-Data-In-Logs | ✅ Passed | All logging statements use constant string literals describing test steps. No sensitive data (passwords, tokens, keys, PII, hostnames, credentials) is logged. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/library/encryption/scenarios.go`:
- Around line 334-344: After scenario.WaitForDegraded(ctx, e) add an explicit
assertion that the cluster is still using KMS before performing recovery: fetch
the current encryption provider/mode from the test environment (using the
existing test env variable e / scenario helpers) and assert it equals the KMS
provider (e.g., compare against the expected KMS string or enum used elsewhere
in tests), then only proceed to call SetAndWaitForEncryptionType(...). Place
this check immediately after WaitForDegraded(ctx, e) so the test explicitly
verifies the rollout remained stuck on KMS prior to applying the valid KMS
image.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3e514184-6cde-4c92-a790-09007c2462a2

📥 Commits

Reviewing files that changed from the base of the PR and between c7d4322 and 7099f22.

📒 Files selected for processing (3)

test/library/encryption/helpers.go
test/library/encryption/kms/vault.go
test/library/encryption/scenarios.go

ardaguclu · 2026-05-29T05:51:34Z

+	// InvalidVaultKMSPluginImage is an OCI image reference that passes API validation
+	// (correct format, sha256 digest, sufficient length) but does not exist in any registry.
+	// Use this to test degradation when the KMS plugin image cannot be pulled.
+	InvalidVaultKMSPluginImage = "quay.io/openshifttest/mock-kms-plugin-nonexistent@sha256:0000000000000000000000000000000000000000000000000000000000000000"


We don't need to define a new variable for this. We can just use an invalid image where we need it.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/library/encryption/scenarios.go`:
- Around line 306-309: The test calls scenario.AssertFunc(...) without checking
it; add a nil-guard before invoking it to avoid panics. Update the test near the
existing preconditions (where WaitForDegraded, InvalidImageProvider.Type and
ValidImageProvider.Type are asserted) to assert or check that
scenario.AssertFunc is not nil (e.g. require.NotNil(t, scenario.AssertFunc,
"...") or if scenario.AssertFunc != nil { ... }) and then call
scenario.AssertFunc, ensuring the call only happens when AssertFunc is present.
- Around line 323-337: The test asserts the cluster stayed on KMS without first
exercising the AESCBC switch attempt; insert a call that triggers the AESCBC
switch attempt (the helper used in the scenario flow to start a switch to
AESCBC) immediately after scenario.WaitForDegraded(...) and before
scenario.AssertFunc(...), so the test actually attempts the switch path and then
verifies KMS stickiness; keep the final recovery call to
SetAndWaitForEncryptionType(...) as-is.
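
The nil guard suggested in the first item above could look roughly like the following. This is a minimal sketch under stated assumptions: the `scenario` type, its two callback fields, and `validate` are simplified, hypothetical stand-ins, not the actual definitions in `test/library/encryption/scenarios.go`.

```go
package main

import (
	"errors"
	"fmt"
)

// scenario is a hypothetical, simplified stand-in for the real scenario type.
// Both callbacks are supplied by the caller and may be left unset by mistake.
type scenario struct {
	WaitForDegraded func() error
	AssertFunc      func() error
}

// validate reports every required callback that is unset, so a test can fail
// with a clear message instead of panicking on a nil function call.
func validate(s scenario) error {
	var errs []error
	if s.WaitForDegraded == nil {
		errs = append(errs, errors.New("WaitForDegraded must be set"))
	}
	if s.AssertFunc == nil {
		errs = append(errs, errors.New("AssertFunc must be set"))
	}
	// errors.Join returns nil when there is nothing to report.
	return errors.Join(errs...)
}

func main() {
	// AssertFunc is deliberately missing here.
	err := validate(scenario{WaitForDegraded: func() error { return nil }})
	fmt.Println(err)
}
```

In a real test the same check would more likely be a `require.NotNil` call next to the existing preconditions, as the comment suggests; the standalone function is used here only so the sketch runs without test dependencies.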

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 599ae865-8045-47c1-9b17-fb860b6458d4

📥 Commits

Reviewing files that changed from the base of the PR and between 7099f22 and 105508c.

📒 Files selected for processing (3)

test/library/encryption/helpers.go
test/library/encryption/kms/vault.go
test/library/encryption/scenarios.go

🚧 Files skipped from review as they are similar to previous changes (2)

test/library/encryption/helpers.go
test/library/encryption/kms/vault.go

ardaguclu · 2026-06-01T06:01:26Z


+// InvalidImageVaultEncryptionProvider is a Vault KMS EncryptionProvider with a
+// non-existent plugin image. Use this for testing degradation when the image cannot be pulled.
+var InvalidImageVaultEncryptionProvider = library.EncryptionProvider{


Do we really need this? We can use the current valid encryption provider and then modify it with an invalid image.

ardaguclu · 2026-06-01T06:03:58Z

+
+// InvalidImageVaultKMSPluginConfig is identical to DefaultVaultKMSPluginConfig
+// but uses a non-existent image that will fail to pull.
+var InvalidImageVaultKMSPluginConfig = configv1.APIServerEncryption{


I think we don't need this. We can use the current Vault definition.

ardaguclu · 2026-06-01T06:07:28Z

+// WaitForPodImagePullBackOff polls pods in the given namespace until at least one pod
+// has an init container or container stuck in ImagePullBackOff. This is useful for
+// detecting that an invalid KMS plugin image is causing static pod failure.
+func WaitForPodImagePullBackOff(ctx context.Context, t testing.TB, kubeClient kubernetes.Interface, namespace, labelSelector string, timeout time.Duration) {


That is too specific an error to wait for. When the KMS plugin is unreachable, I'd expect the operator to go degraded. We should wait for the degradation instead of a specific pod condition: since the apiserver cannot communicate with the plugin, it should report unhealthy.

What is the result of your tests? Did it not go degraded?

Also, no one uses this function currently.

In our tests, the operator does not go Degraded when the KMS plugin image is invalid. It just stays stuck in Progressing (NodeInstallerProgressing) because the static pod revision can never complete. The apiserver never actually starts with the bad plugin, so it never gets a chance to report unhealthy; the image pull fails before the pod runs.

If we are planning to fix that, fine; I have removed the image-pull function.
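
The stuck state described above, where pods fail the image pull before they ever run, can be detected from container waiting reasons. The following is a minimal sketch, assuming simplified stand-in types instead of the real `corev1.Pod`. The struct names, `stuckOnImagePull`, and the pod and container names in `main` are hypothetical, and this is not the removed `WaitForPodImagePullBackOff` helper.

```go
package main

import "fmt"

// The types below are simplified stand-ins for the relevant fields of
// corev1.Pod (k8s.io/api/core/v1); only what the check needs is modelled.
type containerStatus struct {
	Name          string
	WaitingReason string // empty when the container is not waiting
}

type podStatus struct {
	Name                  string
	InitContainerStatuses []containerStatus
	ContainerStatuses     []containerStatus
}

// stuckOnImagePull returns the names of pods that have at least one init
// container or container waiting with reason ImagePullBackOff or ErrImagePull.
func stuckOnImagePull(pods []podStatus) []string {
	var stuck []string
	for _, p := range pods {
		// Copy before appending so the caller's slices are never modified.
		statuses := append([]containerStatus{}, p.InitContainerStatuses...)
		statuses = append(statuses, p.ContainerStatuses...)
		for _, c := range statuses {
			if c.WaitingReason == "ImagePullBackOff" || c.WaitingReason == "ErrImagePull" {
				stuck = append(stuck, p.Name)
				break // one failing container is enough to flag the pod
			}
		}
	}
	return stuck
}

func main() {
	pods := []podStatus{
		{
			Name:                  "kube-apiserver-node-a",
			InitContainerStatuses: []containerStatus{{Name: "kms-plugin", WaitingReason: "ImagePullBackOff"}},
		},
		{
			Name:              "kube-apiserver-node-b",
			ContainerStatuses: []containerStatus{{Name: "kube-apiserver"}},
		},
	}
	fmt.Println(stuckOnImagePull(pods))
}
```

A real helper would list pods through a Kubernetes client and poll until the result is non-empty; that part is left out so the sketch stays free of cluster dependencies.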

ardaguclu · 2026-06-01T06:07:44Z

+	BasicScenario
+	// InvalidImageProvider is the KMS EncryptionProvider configured with an invalid
+	// (non-existent or broken) KMS plugin image. Enabling this should cause degradation.
+	InvalidImageProvider EncryptionProvider


I think we don't need this.

ardaguclu · 2026-06-01T06:08:39Z

+	// step 2: wait for degraded — the operator is stuck because the invalid KMS image
+	// prevents the NodeInstaller from completing the rollout
+	t.Log("Waiting for operator to report degraded status")
+	scenario.WaitForDegraded(ctx, e)


That should be sufficient instead of waiting for imagepullerr

Not sure it is sufficient, for the same reason as in #2249 (comment).

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/library/encryption/scenarios.go`:
- Around line 309-320: TestEncryptionInvalidImageRecovery currently fixes the
invalid image by calling SetAndWaitForEncryptionType (which may only call
waitForNoNewEncryptionKey) but never asserts that the operator cleared its
degraded state or that the plugin rollout completed; add a post-recovery
wait/assert symmetric to WaitForDegraded: after SetAndWaitForEncryptionType in
TestEncryptionInvalidImageRecovery call a helper (or extend helpers.go) such as
WaitForNotDegraded / WaitForOperatorAvailable that checks the operator's
Degraded condition is cleared and the plugin deployment(s) reached desired
rollout (no failing pods, available replicas) — use the same client helpers used
by WaitForDegraded and reference SetAndWaitForEncryptionType, WaitForDegraded,
and waitForNoNewEncryptionKey when implementing the new post-recovery assertion
so the test only passes once the operator is healthy and the plugin rollout
finished.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e90f5601-4663-4ad5-9a30-a1126553b141

📥 Commits

Reviewing files that changed from the base of the PR and between cd32cc3 and e11492b.

📒 Files selected for processing (1)

test/library/encryption/scenarios.go

ardaguclu · 2026-06-01T07:01:26Z

+	// KMS plugin image. Applying this after degradation should restore the cluster.
+	ValidImageProvider EncryptionProvider
+	// WaitForDegraded should block until the operator reports a degraded condition.
+	WaitForDegraded func(ctx context.Context, t testing.TB)


Where is the implementation of this function?

It will be assigned by the caller?

Yes, it's assigned by the caller in kas-o

ardaguclu · 2026-06-01T07:02:31Z

+
+	// step 1: wait for degraded — the operator is stuck because the invalid KMS image
+	// prevents the NodeInstaller from completing the rollout
+	t.Log("Waiting for operator to report degraded status")


We haven't set the invalid image yet?

We can set the invalid image from the calling test.

Due to the nature of this scenario, in my opinion we should set it here to be explicit about it.

We could also add the invalid image setup in library-go; both work. Should we add it here or keep it in the caller?

I think we should have it here. This is the core part of the scenario.

gangwgr · 2026-06-01T08:34:48Z

/test unit

ardaguclu · 2026-06-01T10:14:43Z

@gangwgr you are right. It is not degrading; it is stuck in Progressing:

$ oc get co
NAME                                       VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.23.0-0-2026-06-01-074800-test-ci-ln-iqf6b22-latest   True        True          False      95m     APIServerDeploymentProgressing: deployment/apiserver.openshift-oauth-apiserver: 1/3 pods have been updated to the latest generation and 2/3 pods are available
kube-apiserver                             4.23.0-0-2026-06-01-074800-test-ci-ln-iqf6b22-latest   True        True          False      112m    NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 9
openshift-apiserver                        4.23.0-0-2026-06-01-074800-test-ci-ln-iqf6b22-latest   True        True          False      107m    APIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: 1/3 pods have been updated to the latest generation and 2/3 pods are available

But in reality the pods are stuck in an error state:

$ oc get pods -n openshift-kube-apiserver
NAME                                                              READY   STATUS                  RESTARTS   AGE
kube-apiserver-ip-10-0-102-172.us-west-1.compute.internal         0/6     Init:ImagePullBackOff   0          8m11s
$ oc get pods -n openshift-apiserver
NAME                         READY   STATUS                  RESTARTS      AGE
apiserver-6c94754994-fks4q   0/3     Init:ImagePullBackOff   0             17m
$ oc get pods -n openshift-oauth-apiserver
NAME                         READY   STATUS                  RESTARTS   AGE
apiserver-65f95f8675-x77jh   0/2     Init:ImagePullBackOff   0          17m

So we should wait for the ImagePullBackOff error as you did previously, then patch with the correct image and wait for success.
/cc @p0lyn0mial

ardaguclu · 2026-06-01T10:15:56Z

BTW, it depends on which field we use for in-place updates. The image field produces this error, but there might be another field that causes degradation.

p0lyn0mial · 2026-06-01T10:33:16Z

+//  1. Apply invalid KMS encryption config (causes operator to get stuck)
+//  2. Wait for degraded/stuck state
+//  3. Fix by applying valid KMS image and wait for full encryption migration
+type InvalidImageRecoveryScenario struct {


wondering if we could create a more generic scenario that would accept an invalid configuration (not necessarily invalid image) and possibly inject the config at different stages: before encryption has been enabled, after encryption has been enabled, during migration etc.

WDYT ?

I think this is a good idea. Although the image field is currently our only in-place field (all other Vault fields trigger a migration), a generic approach would be better.

Yes, updated

I think we can test the combinations in one scenario:

1. Enable KMS with invalid image
2. Update to aescbc and see that there is no switch to aescbc
3. Patch KMS with correct image with previous configuration (because apiserver/cluster is on aescbc)
4. See that everything works
5. Patch again with invalid KMS image
6. See that revision is stuck
7. Patch with correct KMS image
8. Immediately patch with incorrect KMS image -- during migration
9. Patch with correct KMS image again

Do we need to apply the invalid image multiple times? It will increase the test duration.

I think we need to test all cases.

Can we also combine the cases below into one test case?

- Unreachable vault
- Bad creds
- Invalid transit mount

Unreachable vault: this can be tested if it is caused by invalid configuration.

Bad creds: updating the content of referenced Secrets hasn't been implemented yet. When it is implemented, this can be a separate scenario.

Invalid transit mount: this needs to be caught by the preflight checker.

ardaguclu · 2026-06-03T07:14:04Z

+
+// TestEncryptionConfigScenario executes each step in sequence, stopping
+// on the first failure.
+func TestEncryptionConfigScenario(ctx context.Context, t testing.TB, scenario EncryptionConfigScenario) {


@gangwgr I've implemented my suggested changes for everything in this PR in #2265. Could you please take a look?

…image fix

Adds a test scenario that verifies the cluster can recover when a KMS encryption config is set with an invalid plugin image:

1. Caller applies invalid KMS image config (operator degrades)
2. TestEncryptionInvalidImageRecovery waits for degraded
3. Fixes config with valid KMS image and waits for full migration

The scenario is deliberately minimal: the caller is responsible for applying the invalid config, making it provider-agnostic and avoiding pre-baked invalid image constants in library-go.

openshift-ci · 2026-06-03T10:16:55Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gangwgr
Once this PR has been reviewed and has the lgtm label, please assign p0lyn0mial for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

test/library/encryption/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2026-06-03T10:23:48Z

@gangwgr: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot requested review from dgrisonnet and p0lyn0mial May 29, 2026 05:10

coderabbitai Bot reviewed May 29, 2026

Comment thread test/library/encryption/scenarios.go Outdated

ardaguclu reviewed May 29, 2026

gangwgr force-pushed the kms-invalid-image-test branch from 7099f22 to 105508c Compare May 29, 2026 06:01

coderabbitai Bot reviewed May 29, 2026

Comment thread test/library/encryption/scenarios.go Outdated

Comment thread test/library/encryption/scenarios.go Outdated

gangwgr force-pushed the kms-invalid-image-test branch 3 times, most recently from d8669c0 to cd32cc3 Compare May 29, 2026 06:21

ardaguclu reviewed Jun 1, 2026

gangwgr force-pushed the kms-invalid-image-test branch from cd32cc3 to e11492b Compare June 1, 2026 06:37

coderabbitai Bot reviewed Jun 1, 2026

Comment thread test/library/encryption/scenarios.go Outdated

ardaguclu reviewed Jun 1, 2026

gangwgr force-pushed the kms-invalid-image-test branch from e11492b to 8553200 Compare June 1, 2026 07:36

p0lyn0mial reviewed Jun 1, 2026

gangwgr force-pushed the kms-invalid-image-test branch from 8553200 to c0c5c6a Compare June 1, 2026 10:51

gangwgr changed the title ~~test/encryption: add InvalidImageRecoveryScenario for KMS plugin image~~ test/encryption: add InvalidConfigRecoveryScenario for KMS plugin image Jun 1, 2026

gangwgr force-pushed the kms-invalid-image-test branch 2 times, most recently from bb4d377 to 192cc25 Compare June 2, 2026 06:07

ardaguclu mentioned this pull request Jun 3, 2026

Review #2265

Closed

ardaguclu reviewed Jun 3, 2026

This was referenced Jun 3, 2026

[Auto-Generated] KMS Team PR Dashboard #2266

Closed

[Auto-Generated] KMS Team PR Dashboard #2267

Closed

[Auto-Generated] KMS Team PR Dashboard #2268

Open

openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 3, 2026

gangwgr force-pushed the kms-invalid-image-test branch from 192cc25 to f6868ba Compare June 3, 2026 10:16

openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 3, 2026
