CNTRLPLANE-1401: feat(e2e): add spot remediation verification to SpotTerminationHandlerTest#8028

Open

zhfeng wants to merge 1 commit into openshift:main from zhfeng:verify_OCPSTRAT-1677

Conversation

@zhfeng
Contributor

@zhfeng zhfeng commented Mar 23, 2026

Summary

Extend SpotTerminationHandlerTest to verify the full spot remediation flow after NTH taints the node:

  • Step 7: Machine gets hypershift.openshift.io/spot-interruption-signal annotation (spot remediation controller acted)
  • Step 8: Machine gets deletionTimestamp (Machine deleted)
  • Step 9: Exactly 1 replacement Machine is created with a different name and the interruptible-instance label (CAPI replacement inherits spot marker)

Also tightens the node count assertion so the test fails whenever len(nodes) != 1 rather than only when len(nodes) == 0, matching the 1-replica NodePool.
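The tightened check amounts to the following (an illustrative helper, not the actual test code):

```go
package main

import "fmt"

// assertExactlyOneReadySpotNode enforces the tightened check: with a
// 1-replica NodePool, both zero nodes and an unexpected extra node are
// failures. The old check (fail only on len(nodes) == 0) would let a
// stray second node slip through.
func assertExactlyOneReadySpotNode(nodes []string) error {
	if len(nodes) != 1 {
		return fmt.Errorf("expected exactly 1 ready spot node, got %d", len(nodes))
	}
	return nil
}

func main() {
	fmt.Println(assertExactlyOneReadySpotNode([]string{"ip-10-0-1-23"}) == nil)
	fmt.Println(assertExactlyOneReadySpotNode([]string{"a", "b"}) == nil)
}
```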

Test plan

  • Verify e2e test compiles (go vet -tags e2e ./test/e2e/...)
  • Run spot termination handler e2e test with the new verification steps

Summary by CodeRabbit

  • Tests
    • Enhanced end-to-end testing for spot instance termination handling with stricter validation of node availability and comprehensive verification of the Machine replacement workflow during interruption events, ensuring proper annotation handling and instance label assignment.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 23, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 23, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai bot commented Mar 23, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: b5d56fac-706b-440f-9019-221e51020324

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

This change updates a spot termination handler end-to-end test to add more rigorous verification logic. The test now enforces exactly one ready spot node instead of at least one, extracts the owning CAPI Machine reference from the Node's annotations, and introduces new polling assertions to verify the Machine receives a spot interruption signal annotation, gains a deletion timestamp, and that a replacement Machine with the interruptible instance label is created. A new constant spotInterruptionSignalAnnotation is added to support these verifications. The actual cleanup logic remains functionally unchanged, only renumbered to Step 10.
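The new constant and the Step 7 check reduce to an annotation lookup. A sketch assuming the annotation key from the PR summary (the `hasSpotInterruptionSignal` helper is illustrative; the real test reads the annotation off a CAPI Machine object):

```go
package main

import "fmt"

// Annotation key as given in the PR summary; the constant name matches the
// one the walkthrough says was added to the test.
const spotInterruptionSignalAnnotation = "hypershift.openshift.io/spot-interruption-signal"

// hasSpotInterruptionSignal reports whether the spot remediation controller
// has marked the machine for interruption.
func hasSpotInterruptionSignal(annotations map[string]string) bool {
	_, ok := annotations[spotInterruptionSignalAnnotation]
	return ok
}

func main() {
	m := map[string]string{spotInterruptionSignalAnnotation: "2026-04-08T12:42:00Z"}
	fmt.Println(hasSpotInterruptionSignal(m))
	fmt.Println(hasSpotInterruptionSignal(map[string]string{}))
}
```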



@openshift-ci openshift-ci bot added do-not-merge/needs-area needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Mar 23, 2026
@zhfeng zhfeng changed the title feat(e2e): make SQS queue name configurable and add spot remediation verification feat(e2e): add spot remediation verification to SpotTerminationHandlerTest Mar 23, 2026
@zhfeng zhfeng force-pushed the verify_OCPSTRAT-1677 branch from cbcce5f to 01bd766 Compare March 23, 2026 03:00
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 23, 2026
@zhfeng zhfeng changed the title feat(e2e): add spot remediation verification to SpotTerminationHandlerTest CNTRLPLANE-1401: feat(e2e): add spot remediation verification to SpotTerminationHandlerTest Mar 23, 2026
@openshift-ci-robot

openshift-ci-robot commented Mar 23, 2026

@zhfeng: This pull request references CNTRLPLANE-1401 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary

Extend SpotTerminationHandlerTest to verify the full spot remediation flow after NTH taints the node:

  • Step 7: Machine gets hypershift.openshift.io/spot-interruption-signal annotation (spot remediation controller acted)
  • Step 8: Machine gets deletionTimestamp (Machine deleted)
  • Step 9: Exactly 1 replacement Machine is created with a different name and the interruptible-instance label (CAPI replacement inherits spot marker)

Also tightens the node count assertion so the test fails whenever len(nodes) != 1 rather than only when len(nodes) == 0, matching the 1-replica NodePool.

Test plan

  • Verify e2e test compiles (go vet -tags e2e ./test/e2e/...)
  • Run spot termination handler e2e test with the new verification steps

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 23, 2026
@zhfeng zhfeng marked this pull request as ready for review March 23, 2026 03:05
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 23, 2026
@openshift-ci openshift-ci bot requested review from csrwng and jparrill March 23, 2026 03:06
@openshift-ci-robot

openshift-ci-robot commented Mar 23, 2026

@zhfeng: This pull request references CNTRLPLANE-1401 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary

Extend SpotTerminationHandlerTest to verify the full spot remediation flow after NTH taints the node:

  • Step 7: Machine gets hypershift.openshift.io/spot-interruption-signal annotation (spot remediation controller acted)
  • Step 8: Machine gets deletionTimestamp (Machine deleted)
  • Step 9: Exactly 1 replacement Machine is created with a different name and the interruptible-instance label (CAPI replacement inherits spot marker)

Also tightens the node count assertion so the test fails whenever len(nodes) != 1 rather than only when len(nodes) == 0, matching the 1-replica NodePool.

Test plan

  • Verify e2e test compiles (go vet -tags e2e ./test/e2e/...)
  • Run spot termination handler e2e test with the new verification steps

Summary by CodeRabbit

  • Tests
  • Enhanced end-to-end testing for spot instance termination handling with stricter validation of node availability and comprehensive verification of the Machine replacement workflow during interruption events, ensuring proper annotation handling and instance label assignment.


@enxebre
Member

enxebre commented Mar 23, 2026

/test e2e-aws

@cwbotbot

Test Results

e2e-aws

…rTest

Extend SpotTerminationHandlerTest to verify the full spot remediation
flow after NTH taints the node: Machine gets spot-interruption-signal
annotation, Machine is marked for deletion, and exactly 1 replacement
Machine is created with the interruptible-instance label.
@zhfeng zhfeng force-pushed the verify_OCPSTRAT-1677 branch from 01bd766 to b1e5c9f Compare March 27, 2026 08:52
@zhfeng
Contributor Author

zhfeng commented Apr 2, 2026

@enxebre do you think it is Ok for merging?

@enxebre
Member

enxebre commented Apr 8, 2026

/approve

@openshift-ci
Contributor

openshift-ci bot commented Apr 8, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, zhfeng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 8, 2026
@enxebre
Member

enxebre commented Apr 8, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 8, 2026
@openshift-ci-robot

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-21
/test e2e-aws-4-21
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@openshift-ci
Contributor

openshift-ci bot commented Apr 8, 2026

@zhfeng: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-aws-4-21
Commit: b1e5c9f
Required: true
Rerun command: /test e2e-aws-4-21

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

hypershift-jira-solve-ci bot commented Apr 9, 2026


Test Failure Analysis Complete (Multi-Job)

Job 1: e2e-aws-4-21

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aws-4-21
  • Build ID: 2041848373236994048
  • Target: e2e-aws-4-21
  • Status: failure

Failed Steps

Step: hypershift-aws-run-e2e-nested
One-line failure: TestNodePool/HostedCluster0/Main/TestNTOPerformanceProfile timed out waiting for the performance profile status ConfigMap

Error

nodepool_nto_performanceprofile_test.go:159: Failed to wait for performance profile status ConfigMap to exist in 10m0s: context deadline exceeded
eventually.go:384: observed invalid **v1.ConfigMap state after 10m0s
eventually.go:401:  - observed **v1.ConfigMap collection invalid: expected 1 performance profile status ConfigMaps, got 0

Summary

TestNTOPerformanceProfile failed because the Node Tuning Operator did not produce the expected performance profile status ConfigMap (labeled NodeTuningGeneratedPerformanceProfileStatusLabel: "true") in the hosted control plane namespace within the 10-minute timeout. The performance profile config ConfigMap was created successfully (in 3s), but the corresponding status ConfigMap was never generated. This is a pre-existing flaky test unrelated to PR #8028.

Evidence

  • The performance profile config ConfigMap was created and validated successfully at line 112 of nodepool_nto_performanceprofile_test.go ("Successfully waited for performance profile ConfigMap to exist with correct name labels and annotations in 3s")
  • The status ConfigMap poll at line 159 timed out after 10m0s — 0 ConfigMaps found with the status label
  • TestSpotTerminationHandler was SKIPPED on this job because version < 4.22 (test only supported on version 4.22 and above)
  • All other 468 tests (of 473 total, 35 skipped) passed — this single TestNTOPerformanceProfile failure is the sole cause of the job failure
  • This failure is NOT related to PR #8028 (CNTRLPLANE-1401) — the PR adds spot remediation verification, and TestSpotTerminationHandler was skipped entirely in this job

Root Cause

Pre-existing NTO flake: The Node Tuning Operator failed to reconcile the PerformanceProfile status ConfigMap within the allotted timeout. This is a known intermittent issue with the NTO operator in CI environments — it is unrelated to the changes in PR #8028.


Job 2: e2e-aws

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aws
  • Build ID: 2041848373203439616
  • Target: e2e-aws
  • Status: failure (2h timeout)

Failed Steps

Step: hypershift-aws-run-e2e-nested
One-line failure: process did not finish before the 2h0m0s timeout; TestCreateCluster was stuck in EnsureMetricsForwarderWorking

Error

{"component":"entrypoint","level":"error","msg":"Process did not finish before 2h0m0s timeout","time":"2026-04-08T14:33:44Z"}
{"component":"entrypoint","level":"error","msg":"Process did not exit before 30m0s grace period","time":"2026-04-08T15:03:44Z"}
error: failed to execute wrapped command: exit status 127

Summary

All individual tests passed (including TestSpotTerminationHandler which completed successfully in 65.38s), but TestCreateCluster got stuck in its sub-test EnsureMetricsForwarderWorking. This sub-test waited indefinitely for the kube-apiserver Prometheus target to appear via the metrics-forwarder, repeatedly logging "kube-apiserver target via metrics-forwarder not found in Prometheus active targets (will retry)". The test never timed out internally and the entire 2h job timeout was consumed, killing the process.

Evidence

TestSpotTerminationHandler PASSED successfully:

  • NodePool node-pool-gskxl-test-spot-termination created with 1 spot replica (took 7m51s for node to become ready)
  • SQS queue created, rebalance recommendation event sent
  • Node tainted with aws-node-termination-handler/rebalance-recommendation in 5s
  • Machine annotated with spot-interruption-signal in 0s
  • Machine marked for deletion in 0s
  • New spot remediation verification (PR #8028, CNTRLPLANE-1401): replacement Machine with interruptible-instance label detected in 50s
  • Cleanup completed (SQS queue deleted, policy removed)
  • Total time: 887.40s (14m47s), of which the SpotTerminationHandlerTest subtest took 65.38s

TestCreateCluster/Main/EnsureMetricsForwarderWorking was the blocking test:

  • Ran at line 1554: util_metrics_proxy.go:50: Enabling metrics forwarding on HostedCluster
  • Endpoint-resolver deployment waited for
  • Then entered an infinite retry loop at util_metrics_proxy.go:117: "kube-apiserver target via metrics-forwarder not found in Prometheus active targets (will retry)"
  • Last seen retrying at line 1668 — never succeeded
  • TestCreateCluster never emitted a PASS/FAIL result, and was still running when the 2h timeout killed the process

Timing analysis — did TestSpotTerminationHandler contribute to the timeout?

  • TestSpotTerminationHandler took 887.40s (14m47s) of wall-clock time
  • However, all HostedCluster0/Main tests run in parallel within the same hosted cluster — TestSpotTerminationHandler ran concurrently with TestNodePoolInPlaceUpgrade (946s), TestNTOMachineConfigGetsRolledOut (1001s), TestRollingUpgrade (1179s), etc.
  • The overall HostedCluster0 wall-clock was 3631s (60m31s), dominated by long-running upgrade tests
  • The blocking test was EnsureMetricsForwarderWorking in TestCreateCluster, which runs in a separate hosted cluster from TestSpotTerminationHandler
  • Conclusion: Even without TestSpotTerminationHandler, the timeout would still have occurred because EnsureMetricsForwarderWorking was stuck in an infinite retry loop

TestNTOPerformanceProfile PASSED on this job (561.08s) — the flake only affected e2e-aws-4-21

Root Cause

Pre-existing metrics-forwarder infrastructure issue: TestCreateCluster/Main/EnsureMetricsForwarderWorking has no internal timeout and entered an infinite retry loop waiting for a Prometheus target that never appeared. This consumed the entire 2h job timeout. This test is version-gated to 4.22+ (it was skipped on the 4-21 job), and the failure is unrelated to PR #8028.


Aggregated Root Cause

Summary

Job: e2e-aws-4-21
  Root cause: TestNTOPerformanceProfile flake (NTO failed to create performance profile status ConfigMap)
  Related to PR #8028? No; TestSpotTerminationHandler was SKIPPED (version < 4.22)

Job: e2e-aws
  Root cause: EnsureMetricsForwarderWorking infinite retry (metrics-forwarder Prometheus target never appeared)
  Related to PR #8028? No; TestSpotTerminationHandler passed, and the stuck test runs in a separate hosted cluster

Neither failure is caused by PR #8028 (CNTRLPLANE-1401). Both are pre-existing, independent infrastructure/flake issues:

  1. TestNTOPerformanceProfile is a known intermittent failure where the Node Tuning Operator fails to reconcile the performance profile status ConfigMap in time. This has no connection to spot termination handling.

  2. EnsureMetricsForwarderWorking is a 4.22-only test that lacks an internal timeout, causing it to retry indefinitely when the metrics-forwarder fails to register its kube-apiserver target in Prometheus. The test should have an internal timeout to fail fast rather than consuming the entire job timeout.

Recommendations

  • PR #8028 (CNTRLPLANE-1401) is safe to retry — both failures are pre-existing and unrelated to the spot remediation verification changes
  • EnsureMetricsForwarderWorking needs a timeout: The test at util_metrics_proxy.go:117 retries indefinitely without a deadline. Adding a bounded timeout (e.g., 10-15 minutes) would allow the test to fail fast and not consume the entire job budget
  • TestNTOPerformanceProfile flake should be tracked: The NTO status ConfigMap creation is intermittently failing in CI — consider increasing the 10m timeout or investigating why the NTO operator sometimes fails to reconcile

Artifacts

  • e2e-aws-4-21 artifacts: .work/prow-job-analyze-test-failure/2041848373236994048/logs/
  • e2e-aws artifacts: .work/prow-job-analyze-test-failure/2041848373203439616/logs/

@hypershift-jira-solve-ci

hypershift-jira-solve-ci bot commented Apr 13, 2026


Test Failure Analysis Complete (Multi-Step)

Job Information — Job 1: e2e-aws-4-21

Job Information — Job 2: e2e-aws

PR Under Test

  • PR: openshift/hypershift#8028 — CNTRLPLANE-1401: feat(e2e): add spot remediation verification to SpotTerminationHandlerTest
  • Author: @zhfeng
  • Changes: Adds Steps 7–9 to the SpotTerminationHandlerTest, verifying that the spot remediation controller annotates the Machine with hypershift.openshift.io/spot-interruption-signal, that the Machine is marked for deletion, and that a replacement Machine with the interruptible-instance label is created by CAPI. Also tightens the node count assertion so it fails whenever len(nodes) != 1 rather than only when len(nodes) == 0.

Failed Step Analyses

Job 1: e2e-aws-4-21 — Step: hypershift-aws-run-e2e-nested

Error

TestNodePool/HostedCluster0/Main/TestNTOPerformanceProfile failed after 1059s (17m39s):

nodepool_nto_performanceprofile_test.go:159: Failed to wait for performance profile status ConfigMap to exist in 10m0s: context deadline exceeded
eventually.go:384: observed invalid **v1.ConfigMap state after 10m0s
eventually.go:401:  - observed **v1.ConfigMap collection invalid: expected 1 performance profile status ConfigMaps, got 0

Additionally, the after() framework hook reported invalid HostedCluster conditions for TestNodePool/HostedCluster0:

util.go:3025: Failed to wait for HostedCluster e2e-clusters-khdz2/node-pool-qkjhh to have valid conditions in 2s: context deadline exceeded
- incorrect condition: wanted ClusterVersionSucceeding=False, got ClusterVersionSucceeding=True
- incorrect condition: wanted ClusterVersionAvailable=False, got ClusterVersionAvailable=True
- incorrect condition: wanted DataPlaneConnectionAvailable=Unknown, got DataPlaneConnectionAvailable=True
- incorrect condition: wanted ClusterVersionProgressing=True, got ClusterVersionProgressing=False

Summary

This failure is NOT caused by PR #8028. The TestSpotTerminationHandler test was skipped in e2e-aws-4-21 because the hosted cluster runs OCP 4.21, and the test has a version gate requiring 4.22+:

nodepool_spot_termination_handler_test.go:90: test only supported on version 4.22 and above

The actual failures are:

  1. TestNTOPerformanceProfile — A pre-existing flaky test. The Node Tuning Operator (NTO) did not create the expected performance profile status ConfigMap (labeled NodeTuningGeneratedPerformanceProfileStatusLabel: "true") within 10 minutes. This is an NTO reconciliation timing issue unrelated to the PR.

  2. HostedCluster conditions mismatch — The after() hook found no worker nodes (they were removed during test teardown) and expected ClusterVersionSucceeding=False, ClusterVersionAvailable=False, etc. But the cluster was actually healthy (ClusterVersionSucceeding=True), meaning the HostedCluster controller hadn't yet reconciled conditions to reflect the "no workers" state. This is a framework condition validation race unrelated to the PR changes.

Evidence

  • TestSpotTerminationHandler was SKIPPED (line 4231): "test only supported on version 4.22 and above"
  • TestNTOPerformanceProfile failure (line 1087): waited 10m for ConfigMap, got 0 results
  • Conditions mismatch (line 4270–4274): framework expected unhealthy conditions but cluster reported healthy
  • 473 tests ran, 35 skipped, 4 failures (all nested under TestNodePool → HostedCluster0 → Main → TestNTOPerformanceProfile)

Job 2: e2e-aws — Step: hypershift-aws-run-e2e-nested

Error

Process did not finish before 2h0m0s timeout
Process did not exit before 30m0s grace period
exit status 127

Summary

This failure IS directly caused by PR #8028. The TestSpotTerminationHandler test ran and passed in e2e-aws (OCP 4.22), taking 887.40s (~14m47s) total:

  • NodePool creation + node readiness: ~822s (~13m42s)
  • SpotTerminationHandlerTest subtest (Steps 1–9 including PR's new Steps 7–9): 65.38s

The new verification steps (Steps 7–9) added by the PR completed quickly (~1 minute), but the overall test time of ~15 minutes pushed the full test suite past the 2h CI timeout. When the timeout hit at 2026-04-08T14:33:44Z, the test suite was still running TestCreateCluster/Main/EnsureMetricsForwarderWorking (repeatedly retrying: "kube-apiserver target via metrics-forwarder not found in Prometheus active targets (will retry)").

All completed tests passed, including the spot test:

--- PASS: TestNodePool/HostedCluster0/Main/TestSpotTerminationHandler (887.40s)
    --- PASS: TestNodePool/HostedCluster0/Main/TestSpotTerminationHandler/SpotTerminationHandlerTest (65.38s)

The process was killed at 2h, then the 30m grace period expired. TestCreateCluster never completed.

Evidence

  • Timeout at line 2215: "Process did not finish before 2h0m0s timeout" at 2026-04-08T14:33:44Z
  • Grace period expired at line 2236: "Process did not exit before 30m0s grace period" at 2026-04-08T15:03:44Z
  • TestCreateCluster/Main/EnsureMetricsForwarderWorking was last active test (lines 1655–1667), repeatedly failing to find Prometheus target
  • TestSpotTerminationHandler took 887s (line 2165) — the new Steps 7–9 added ~65s but the overall test (including spot NodePool provisioning) consumed ~15 minutes of wall time
  • The new Steps 7–9 themselves executed very fast:
    • Step 7 (annotation check): 0s (already present)
    • Step 8 (deletion check): 0s (already deleted)
    • Step 9 (replacement Machine): ~seconds (replacement was already provisioned)

Aggregated Root Cause

Job: e2e-aws-4-21, step hypershift-aws-run-e2e-nested
  One-line failure: TestNTOPerformanceProfile flake: NTO failed to create performance profile status ConfigMap in 10m (pre-existing, unrelated to PR)

Job: e2e-aws, step hypershift-aws-run-e2e-nested
  One-line failure: 2h process timeout: TestSpotTerminationHandler added ~15m execution time, pushing the suite past the deadline while TestCreateCluster/EnsureMetricsForwarderWorking was still retrying

Root Cause Hypothesis

The two failures have different root causes:

  1. e2e-aws-4-21 (TestNTOPerformanceProfile): Not caused by PR #8028 (CNTRLPLANE-1401). The TestSpotTerminationHandler was skipped (version < 4.22). The TestNTOPerformanceProfile failure is a known flaky test where NTO does not create the performance profile status ConfigMap within the 10-minute timeout; this same test passed in the e2e-aws job (561.08s). The conditions validation failure is a secondary symptom: the framework's after() hook saw 0 worker nodes after the NTO test consumed node resources and expected degraded conditions, but the HostedCluster controller hadn't yet reconciled.

  2. e2e-aws (2h timeout): Caused by the overall e2e suite time budget. The TestSpotTerminationHandler test — which was previously skipped in e2e-aws (pre-PR, it ran only on 4.22) — now runs with the new Steps 7–9 added by this PR. While the new steps themselves completed in ~65s, the full test lifecycle (creating a spot NodePool, provisioning an EC2 instance, running verification, waiting for Machine replacement) takes ~15 minutes. This additional ~15 minutes of serial execution within TestNodePool/HostedCluster0/Main pushed the overall suite past the 2h deadline. The actual test that was killed (TestCreateCluster/Main/EnsureMetricsForwarderWorking) was also slow (retrying Prometheus target discovery), but it would have likely completed within budget without the extra spot test time.

Recommendations

  1. For the e2e-aws timeout: The TestSpotTerminationHandler test runs serially within the TestNodePool/HostedCluster0/Main block, and its ~15m wall time is the primary factor pushing the suite over 2h. Options:

    • Increase the CI step timeout from 2h to 2h30m to accommodate the spot test
    • Optimize the spot test: The 822s spent waiting for the spot NodePool to become ready is the dominant cost. Consider reusing an existing spot-capable NodePool instead of creating a new one, or move the test to its own HostedCluster to run in parallel with other tests
    • Reduce waits in Steps 7–9: The new steps complete in <1s because the remediation controller acts quickly. The timeouts (5m, 5m, 10m) are generous but not the issue — the NodePool creation time is
  2. For the TestNTOPerformanceProfile flake: This is a pre-existing issue unrelated to PR #8028 (CNTRLPLANE-1401). File a separate bug or check if one already exists for the NTO performance profile status ConfigMap creation reliability.

  3. For the conditions validation mismatch: This is a framework issue in hypershift_framework.go where the after() hook races against condition reconciliation. It's not blocking and not caused by this PR.


Artifacts

  • Job 1 test artifacts: .work/prow-job-analyze-test-failure/2041848373236994048/logs/
  • Job 2 test artifacts: .work/prow-job-analyze-test-failure/2041848373203439616/logs/

@hypershift-jira-solve-ci

hypershift-jira-solve-ci bot commented Apr 14, 2026


Test Failure Analysis Complete (Multi-Step)

Job Information


Failed Step Analyses

Step: hypershift-aws-run-e2e-nested (test phase)

Error

Process did not finish before 2h0m0s timeout
Container test exited with code 127, reason Error

Summary

The e2e test binary hit its 2-hour hard timeout while TestCreateCluster was still running. All other top-level test suites completed successfully. 50 TestCreateCluster subtests were killed mid-execution.

The root cause is the PR's new spot remediation verification steps, which trigger a real Machine deletion and replacement cycle. While the new steps themselves complete quickly (~50s), the ongoing machine replacement during the subsequent HostedCluster0 teardown caused a ~18-minute teardown penalty — enough to push the overall suite past the 2h limit.

Evidence

Timeline:

  • Test binary started: 12:33:46 (0m)
  • TestSpotTerminationHandler/SpotTerminationHandlerTest started: ~12:41:37 (~8m)
  • PR Step 7, spot-interruption-signal annotation verified: ~12:42:00 (instant, 0s)
  • PR Step 8, Machine deletionTimestamp verified: ~12:42:00 (instant, 0s)
  • PR Step 9, replacement Machine created: ~12:42:50 (50s)
  • NodePool status stabilized after machine replacement: ~12:47:51 (5m51s)
  • HostedCluster0/Teardown began: ~13:08 (~34m)
  • Teardown error at ~13:15: "hostedcluster wasn't finalized, aborting delete: context deadline exceeded"
  • HostedCluster0/Teardown completed: ~13:33:26 (teardown took 25.3m)
  • 2h timeout hit, TestCreateCluster killed: 14:33:44 (120m)

Test suite completion status (20 parallel tests):

Test Suite Status Duration
TestNodePool/HostedCluster0 ✅ PASS 60.5m (teardown: 25.3m)
TestNodePool/HostedCluster2 ✅ PASS 56.1m (teardown: 7.1m)
TestAutoscaling ✅ PASS 53.3m (teardown: 7.7m)
TestUpgradeControlPlane ✅ PASS 51.0m (teardown: 7.8m)
TestCreateClusterRequestServingIsolation ✅ PASS 32.3m
TestNodePoolAutoscalingScaleFromZero ✅ PASS 31.0m
TestCreateClusterPrivateWithRouteKAS ✅ PASS 30.6m
TestCreateClusterCustomConfig ✅ PASS 28.2m
TestCreateClusterProxy ✅ PASS 25.8m
TestCreateClusterPrivate ✅ PASS 23.9m
TestHAEtcdChaos ✅ PASS 0.5m
TestCreateCluster ❌ TIMEOUT >120m (killed)

Teardown comparison — HostedCluster0 vs. others:

HostedCluster0 (spot test cluster):  1517.60s = 25.3m  ← ~18m penalty
HostedCluster2 (trust bundle test):   423.10s =  7.1m  ← normal
TestAutoscaling:                      463.57s =  7.7m  ← normal
TestUpgradeControlPlane:              467.00s =  7.8m  ← normal

The HostedCluster0 teardown was 3.6× slower than all other teardowns. The teardown logged:

hypershift_framework.go:559: Failed to destroy cluster, will retry: hostedcluster wasn't finalized, aborting delete: context deadline exceeded
journals.go:213: No machines associated with infra id node-pool-gskxl were found. Skipping journal dump.

This confirms that the machine replacement triggered by the PR's Step 8 (Machine deletion) left the cluster in a state where finalization was delayed.

PR's new test steps executed in the build log (lines 489-522):

nodepool_spot_termination_handler_test.go:316: Waiting for Machine ... to be annotated with hypershift.openshift.io/spot-interruption-signal
nodepool_spot_termination_handler_test.go:317: Successfully waited for ... spot-interruption-signal annotation in 0s
nodepool_spot_termination_handler_test.go:338: Waiting for Machine ... to be marked for deletion
nodepool_spot_termination_handler_test.go:339: Successfully waited for ... deletionTimestamp in 0s
nodepool_spot_termination_handler_test.go:360: Waiting for a replacement Machine (not ...) with label hypershift.openshift.io/interruptible-instance
nodepool_spot_termination_handler_test.go:361: Successfully waited for ... replacement Machine with spot label in 50s
nodepool_spot_termination_handler_test.go:402: Replacement Machine with interruptible-instance label is created
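The Step 9 wait above reduces to scanning Machines for one that is not the original and carries the spot label. A hedged sketch of that predicate (the helper name and sample Machine names are ours; the label key is taken from the log line):

```go
package main

import "fmt"

// spotLabel is the label the test expects on the CAPI replacement
// Machine (key taken from the e2e log output above).
const spotLabel = "hypershift.openshift.io/interruptible-instance"

// isReplacement reports whether a Machine qualifies as the spot
// replacement: a different name than the interrupted Machine, and the
// interruptible-instance label present.
func isReplacement(name, interruptedName string, labels map[string]string) bool {
	_, hasLabel := labels[spotLabel]
	return name != interruptedName && hasLabel
}

func main() {
	// Illustrative names only; the real names are generated by CAPI.
	interrupted := "node-pool-machine-a"
	replacement := "node-pool-machine-b"

	fmt.Println(isReplacement(interrupted, interrupted, map[string]string{spotLabel: "true"}))
	fmt.Println(isReplacement(replacement, interrupted, map[string]string{spotLabel: "true"}))
}
```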

Contributing factor — EnsureMetricsForwarderWorking retry loop:
TestCreateCluster/Main/EnsureMetricsForwarderWorking was stuck retrying when the timeout hit:

util_metrics_proxy.go:117: kube-apiserver target via metrics-forwarder not found in Prometheus active targets (will retry)

This message appeared 7 times, indicating the test was blocked waiting for a Prometheus scrape target that never appeared. Even without the teardown penalty, this test may have consumed significant time.

Artifacts Examined

  • .work/prow-job-analyze-test-failure/2041848373203439616/logs/build-log.txt
  • .work/prow-job-analyze-test-failure/2041848373203439616/logs/hypershift-aws-run-e2e-nested-build-log.txt
  • .work/prow-job-analyze-test-failure/2041848373203439616/logs/junit_operator.xml

Aggregated Root Cause

Failed Steps Summary

| Step | One-line Failure |
| --- | --- |
| hypershift-aws-run-e2e-nested | 2h timeout: the PR's spot remediation steps caused ~18m of extra teardown, pushing TestCreateCluster past the time limit |

Root Cause Hypothesis

Primary cause: PR #8028 introduced spot remediation verification steps (Steps 7–9 in the test) that trigger a real Machine deletion and replacement cycle on the HostedCluster0 cluster. While the verification itself completes quickly (50s for replacement detection), the side effect — an in-flight machine replacement — causes the HostedCluster0 teardown to take 25.3 minutes instead of the normal ~7 minutes. This ~18-minute penalty is enough to push the overall parallel test suite past its 2-hour hard timeout, killing TestCreateCluster before it can complete.

Causal chain:

  1. PR Step 8 triggers Machine deletion via spot-interruption-signal annotation → CAPI deletes the Machine
  2. PR Step 9 verifies a replacement Machine is created (takes 50s) → replacement Machine is still provisioning
  3. The test's Run() returns, but the replacement Machine is still being created
  4. HostedCluster0 teardown begins with an in-flight machine replacement
  5. Teardown fails with "hostedcluster wasn't finalized, aborting delete: context deadline exceeded" and retries
  6. Teardown eventually succeeds after 25.3m (vs. ~7m normal)
  7. TestCreateCluster, running in parallel, is killed at the 2h mark while still executing EnsureMetricsForwarderWorking

Contributing factor: TestCreateCluster/Main/EnsureMetricsForwarderWorking was stuck in a retry loop waiting for a Prometheus scrape target. This may have also consumed extra time, though the primary cause of the timeout is the teardown penalty.

Recommendations

  1. Wait for replacement Machine to be fully ready before returning from the spot test: After Step 9 (replacement Machine creation), add a step that waits for the replacement Machine to become fully provisioned (e.g., wait for the Machine's status.phase to be Running and the replacement Node to become Ready). This ensures the machine replacement is fully settled before HostedCluster0 teardown begins, avoiding the extended teardown time.

  2. Alternative: Scale down the spot NodePool before returning: After verifying the replacement Machine is created (Step 9), explicitly scale the spot NodePool to 0 replicas and wait for all Machines to be deleted. This ensures no in-flight machines during teardown. Example:

    // After verifying the replacement, scale down the spot NodePool so no
    // machine replacement is in flight when teardown starts.
    err := e2eutil.UpdateObject(t, s.ctx, s.mgmtClient, &nodePool, func(obj *hyperv1.NodePool) {
        zero := int32(0)
        obj.Spec.Replicas = &zero
    })
    if err != nil {
        t.Fatalf("failed to scale down spot NodePool: %v", err)
    }
  3. Investigate EnsureMetricsForwarderWorking flakiness: This test was stuck retrying "kube-apiserver target via metrics-forwarder not found in Prometheus active targets" and may independently contribute to timeout pressure on the e2e suite.
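Recommendation 1 amounts to gating the test's return on a "settled" condition for the replacement. A minimal sketch of that condition (the type and helper names are ours; the Running phase string follows Cluster API conventions, and the real test would poll the cluster rather than a local slice):

```go
package main

import "fmt"

// machineState captures the two readiness signals recommendation 1
// waits on before letting HostedCluster teardown begin.
type machineState struct {
	Phase     string // CAPI Machine status.phase, e.g. "Provisioned", "Running"
	NodeReady bool   // the backing Node reports condition Ready=True
}

// isSettled reports whether the replacement Machine is fully
// provisioned, i.e. no machine replacement is left in flight.
func isSettled(m machineState) bool {
	return m.Phase == "Running" && m.NodeReady
}

func main() {
	// States a poll loop might observe while the replacement provisions.
	for _, s := range []machineState{
		{Phase: "Provisioning", NodeReady: false},
		{Phase: "Provisioned", NodeReady: false},
		{Phase: "Running", NodeReady: true},
	} {
		fmt.Printf("phase=%-12s nodeReady=%-5v settled=%v\n", s.Phase, s.NodeReady, isSettled(s))
	}
}
```

Only once isSettled holds for the replacement would the spot test return and allow teardown to start.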


Artifacts

  • Test artifacts: .work/prow-job-analyze-test-failure/2041848373203439616/logs/
  • Build log: .work/prow-job-analyze-test-failure/2041848373203439616/logs/build-log.txt
  • Step log: .work/prow-job-analyze-test-failure/2041848373203439616/logs/hypershift-aws-run-e2e-nested-build-log.txt
  • JUnit: .work/prow-job-analyze-test-failure/2041848373203439616/logs/junit_operator.xml

@mgencur
Contributor

mgencur commented Apr 15, 2026

/retest

@codecov

codecov bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 26.48%. Comparing base (ff6021e) to head (b1e5c9f).
⚠️ Report is 214 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8028   +/-   ##
=======================================
  Coverage   26.48%   26.48%           
=======================================
  Files        1087     1087           
  Lines      105002   105002           
=======================================
  Hits        27809    27809           
  Misses      74785    74785           
  Partials     2408     2408           


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/testing: Indicates the PR includes changes for e2e testing.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants