test/encryption: add InvalidConfigRecoveryScenario for KMS plugin image#2249
test/encryption: add InvalidConfigRecoveryScenario for KMS plugin image#2249gangwgr wants to merge 1 commit into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR adds a new test scenario and helper function to validate encryption recovery workflows. The ChangesInvalid Image Recovery Test Scenario
🎯 2 (Simple) | ⏱️ ~8 minutes Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 3 warnings)
✅ Passed checks (11 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/library/encryption/scenarios.go`:
- Around line 334-344: After scenario.WaitForDegraded(ctx, e) add an explicit
assertion that the cluster is still using KMS before performing recovery: fetch
the current encryption provider/mode from the test environment (using the
existing test env variable e / scenario helpers) and assert it equals the KMS
provider (e.g., compare against the expected KMS string or enum used elsewhere
in tests), then only proceed to call SetAndWaitForEncryptionType(...). Place
this check immediately after WaitForDegraded(ctx, e) so the test explicitly
verifies the rollout remained stuck on KMS prior to applying the valid KMS
image.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3e514184-6cde-4c92-a790-09007c2462a2
📒 Files selected for processing (3)
test/library/encryption/helpers.gotest/library/encryption/kms/vault.gotest/library/encryption/scenarios.go
| // InvalidVaultKMSPluginImage is an OCI image reference that passes API validation | ||
| // (correct format, sha256 digest, sufficient length) but does not exist in any registry. | ||
| // Use this to test degradation when the KMS plugin image cannot be pulled. | ||
| InvalidVaultKMSPluginImage = "quay.io/openshifttest/mock-kms-plugin-nonexistent@sha256:0000000000000000000000000000000000000000000000000000000000000000" |
There was a problem hiding this comment.
We don't need to define a new variable for this. We can just an invalid image where we need it.
7099f22 to
105508c
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/library/encryption/scenarios.go`:
- Around line 306-309: The test calls scenario.AssertFunc(...) without checking
it; add a nil-guard before invoking it to avoid panics. Update the test near the
existing preconditions (where WaitForDegraded, InvalidImageProvider.Type and
ValidImageProvider.Type are asserted) to assert or check that
scenario.AssertFunc is not nil (e.g. require.NotNil(t, scenario.AssertFunc,
"...") or if scenario.AssertFunc != nil { ... }) and then call
scenario.AssertFunc, ensuring the call only happens when AssertFunc is present.
- Around line 323-337: The test asserts the cluster stayed on KMS without first
exercising the AESCBC switch attempt; insert a call that triggers the AESCBC
switch attempt (the helper used in the scenario flow to start a switch to
AESCBC) immediately after scenario.WaitForDegraded(...) and before
scenario.AssertFunc(...), so the test actually attempts the switch path and then
verifies KMS stickiness; keep the final recovery call to
SetAndWaitForEncryptionType(...) as-is.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 599ae865-8045-47c1-9b17-fb860b6458d4
📒 Files selected for processing (3)
test/library/encryption/helpers.gotest/library/encryption/kms/vault.gotest/library/encryption/scenarios.go
🚧 Files skipped from review as they are similar to previous changes (2)
- test/library/encryption/helpers.go
- test/library/encryption/kms/vault.go
d8669c0 to
cd32cc3
Compare
|
|
||
| // InvalidImageVaultEncryptionProvider is a Vault KMS EncryptionProvider with a | ||
| // non-existent plugin image. Use this for testing degradation when the image cannot be pulled. | ||
| var InvalidImageVaultEncryptionProvider = library.EncryptionProvider{ |
There was a problem hiding this comment.
Do we really need this?. We can use the current valid encryption provider. After that we can modify it with invalid image.
|
|
||
| // InvalidImageVaultKMSPluginConfig is identical to DefaultVaultKMSPluginConfig | ||
| // but uses a non-existent image that will fail to pull. | ||
| var InvalidImageVaultKMSPluginConfig = configv1.APIServerEncryption{ |
There was a problem hiding this comment.
I think we don't need this. We can use the current Vault definition.
| // WaitForPodImagePullBackOff polls pods in the given namespace until at least one pod | ||
| // has an init container or container stuck in ImagePullBackOff. This is useful for | ||
| // detecting that an invalid KMS plugin image is causing static pod failure. | ||
| func WaitForPodImagePullBackOff(ctx context.Context, t testing.TB, kubeClient kubernetes.Interface, namespace, labelSelector string, timeout time.Duration) { |
There was a problem hiding this comment.
That is too specific error to be waiting. When kms plugin is unreachable, I'd expect that operator goes degraded. We should wait for the degradation instead of pod's specific condition. Because since apiserver can not communicate with the plugin, it should report unhealthy.
What is the result of your tests?. It didn't go degraded?.
There was a problem hiding this comment.
Also noone uses this function currently
There was a problem hiding this comment.
In our tests, the operator does not go Degraded when the KMS plugin image is invalid. It just stays stuck in Progressing (NodeInstallerProgressing) because the static pod revision can never complete. The apiserver never actually starts with the bad plugin, so it never gets a chance to report unhealthy, the image pull fails before the pod runs.
There was a problem hiding this comment.
if we are planning to fix that, so i removed imagepull func
| BasicScenario | ||
| // InvalidImageProvider is the KMS EncryptionProvider configured with an invalid | ||
| // (non-existent or broken) KMS plugin image. Enabling this should cause degradation. | ||
| InvalidImageProvider EncryptionProvider |
| // step 2: wait for degraded — the operator is stuck because the invalid KMS image | ||
| // prevents the NodeInstaller from completing the rollout | ||
| t.Log("Waiting for operator to report degraded status") | ||
| scenario.WaitForDegraded(ctx, e) |
There was a problem hiding this comment.
That should be sufficient instead of waiting for imagepullerr
There was a problem hiding this comment.
not sure it is sufficient, same reason #2249 (comment)
cd32cc3 to
e11492b
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/library/encryption/scenarios.go`:
- Around line 309-320: TestEncryptionInvalidImageRecovery currently fixes the
invalid image by calling SetAndWaitForEncryptionType (which may only call
waitForNoNewEncryptionKey) but never asserts that the operator cleared its
degraded state or that the plugin rollout completed; add a post-recovery
wait/assert symmetric to WaitForDegraded: after SetAndWaitForEncryptionType in
TestEncryptionInvalidImageRecovery call a helper (or extend helpers.go) such as
WaitForNotDegraded / WaitForOperatorAvailable that checks the operator's
Degraded condition is cleared and the plugin deployment(s) reached desired
rollout (no failing pods, available replicas) — use the same client helpers used
by WaitForDegraded and reference SetAndWaitForEncryptionType, WaitForDegraded,
and waitForNoNewEncryptionKey when implementing the new post-recovery assertion
so the test only passes once the operator is healthy and the plugin rollout
finished.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e90f5601-4663-4ad5-9a30-a1126553b141
📒 Files selected for processing (1)
test/library/encryption/scenarios.go
| // KMS plugin image. Applying this after degradation should restore the cluster. | ||
| ValidImageProvider EncryptionProvider | ||
| // WaitForDegraded should block until the operator reports a degraded condition. | ||
| WaitForDegraded func(ctx context.Context, t testing.TB) |
There was a problem hiding this comment.
Where is the implementation of this function?
There was a problem hiding this comment.
It will be assigned by the caller?
There was a problem hiding this comment.
Yes, it's assigned by the caller in kas-o
|
|
||
| // step 1: wait for degraded — the operator is stuck because the invalid KMS image | ||
| // prevents the NodeInstaller from completing the rollout | ||
| t.Log("Waiting for operator to report degraded status") |
There was a problem hiding this comment.
We haven't set invalid image yet?
There was a problem hiding this comment.
we can set the invalid image by call in test
There was a problem hiding this comment.
Due to the nature of this Scenario, in my opinion we should set it here to be explicit about it.
There was a problem hiding this comment.
we can add invalid image setup in library-go also, both works
do we need add here or keep it in caller?
There was a problem hiding this comment.
I think, we should have it here. This is the core part of the scenario
e11492b to
8553200
Compare
|
/test unit |
|
@gangwgr you are right. It is not degrading, it stuck in progressing; but in reality pod is stuck in error; So we should wait for ImagePullBackOff error as you did previously, after that patching with correct image and wait for the success. |
|
BTW, it depends on the field we use for in-place field updates. Image field gets this error but there might be another field that causes degradation. |
| // 1. Apply invalid KMS encryption config (causes operator to get stuck) | ||
| // 2. Wait for degraded/stuck state | ||
| // 3. Fix by applying valid KMS image and wait for full encryption migration | ||
| type InvalidImageRecoveryScenario struct { |
There was a problem hiding this comment.
wondering if we could create a more generic scenario that would accept an invalid configuration (not necessarily invalid image) and possibly inject the config at different stages: before encryption has been enabled, after encryption has been enabled, during migration etc.
WDYT ?
There was a problem hiding this comment.
I think, this is a good idea. Although currently we only have image field as in-place field (other fields in Vault are all trigerring migration), it would be better to have generic approach.
There was a problem hiding this comment.
I think we can test the combinations in one scenario;
- Enable KMS with invalid image
- Update to aescbc and see that there is no switch to aescbc
- Patch KMS with correct image with previous configuration (because apiserver/cluster is on aescbc)
- See that everything works
- Patch again with invalid KMS image
- See that revision is stuck
- Patch with correct KMS image
- Immediately patch with incorrect KMS image -- during migration
- Patch with correct KMS image again
There was a problem hiding this comment.
do we need multiple times invalid image? it will increase case duration
There was a problem hiding this comment.
I think we need to test all cases.
There was a problem hiding this comment.
can we combine below also in one case?
- Unreachable vault
- Bad creds
- invalid transit mount
There was a problem hiding this comment.
Unreachable vault
This can be tested, if this is caused by invalid configuration
Bad creds
Updating content of referenced Secrets hasn't been implemented yet. But when it is implemented, this can be a separate scenario.
Invalid transit mount
Invalid transit mount needs to be caught by preflight checker.
8553200 to
c0c5c6a
Compare
bb4d377 to
192cc25
Compare
|
|
||
| // TestEncryptionConfigScenario executes each step in sequence, stopping | ||
| // on the first failure. | ||
| func TestEncryptionConfigScenario(ctx context.Context, t testing.TB, scenario EncryptionConfigScenario) { |
…image fix Adds a test scenario that verifies the cluster can recover when a KMS encryption config is set with an invalid plugin image: 1. Caller applies invalid KMS image config (operator degrades) 2. TestEncryptionInvalidImageRecovery waits for degraded 3. Fixes config with valid KMS image and waits for full migration The scenario is deliberately minimal: the caller is responsible for applying the invalid config, making it provider-agnostic and avoiding pre-baked invalid image constants in library-go.
192cc25 to
f6868ba
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: gangwgr The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@gangwgr: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
test/encryption: add InvalidImageRecoveryScenario for KMS plugin image
Summary by CodeRabbit