CNTRLPLANE-1401: feat(e2e): add spot remediation verification to SpotTerminationHandlerTest#8028

Open

zhfeng wants to merge 1 commit into openshift:main from zhfeng:verify_OCPSTRAT-1677

Conversation

@zhfeng
Contributor

@zhfeng zhfeng commented Mar 23, 2026

Summary

Extend SpotTerminationHandlerTest to verify the full spot remediation flow after NTH taints the node:

  • Step 7: Machine gets hypershift.openshift.io/spot-interruption-signal annotation (spot remediation controller acted)
  • Step 8: Machine gets deletionTimestamp (Machine deleted)
  • Step 9: Exactly 1 replacement Machine is created with a different name and the interruptible-instance label (CAPI replacement inherits spot marker)

Also tightens the node count assertion so the test fails whenever len(nodes) != 1 rather than only when len(nodes) == 0, matching the 1-replica NodePool.
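The tightened check amounts to the following (an illustrative helper, not the actual test code):

```go
package main

import "fmt"

// assertExactlyOneReadySpotNode enforces the tightened check: with a
// 1-replica NodePool, both zero nodes and an unexpected extra node are
// failures. The old check (fail only on len(nodes) == 0) would let a
// stray second node slip through.
func assertExactlyOneReadySpotNode(nodes []string) error {
	if len(nodes) != 1 {
		return fmt.Errorf("expected exactly 1 ready spot node, got %d", len(nodes))
	}
	return nil
}

func main() {
	fmt.Println(assertExactlyOneReadySpotNode([]string{"ip-10-0-1-23"}) == nil)
	fmt.Println(assertExactlyOneReadySpotNode([]string{"a", "b"}) == nil)
}
```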

Test plan

  • Verify e2e test compiles (go vet -tags e2e ./test/e2e/...)
  • Run spot termination handler e2e test with the new verification steps

Summary by CodeRabbit

  • Tests
    • Enhanced end-to-end testing for spot instance termination handling with stricter validation of node availability and comprehensive verification of the Machine replacement workflow during interruption events, ensuring proper annotation handling and instance label assignment.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 23, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 23, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai bot commented Mar 23, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: b5d56fac-706b-440f-9019-221e51020324

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

This change updates a spot termination handler end-to-end test to add more rigorous verification logic. The test now enforces exactly one ready spot node instead of at least one, extracts the owning CAPI Machine reference from the Node's annotations, and introduces new polling assertions to verify the Machine receives a spot interruption signal annotation, gains a deletion timestamp, and that a replacement Machine with the interruptible instance label is created. A new constant spotInterruptionSignalAnnotation is added to support these verifications. The actual cleanup logic remains functionally unchanged, only renumbered to Step 10.
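The new constant and the Step 7 check reduce to an annotation lookup. A sketch assuming the annotation key from the PR summary (the `hasSpotInterruptionSignal` helper is illustrative; the real test reads the annotation off a CAPI Machine object):

```go
package main

import "fmt"

// Annotation key as given in the PR summary; the constant name matches the
// one the walkthrough says was added to the test.
const spotInterruptionSignalAnnotation = "hypershift.openshift.io/spot-interruption-signal"

// hasSpotInterruptionSignal reports whether the spot remediation controller
// has marked the machine for interruption.
func hasSpotInterruptionSignal(annotations map[string]string) bool {
	_, ok := annotations[spotInterruptionSignalAnnotation]
	return ok
}

func main() {
	m := map[string]string{spotInterruptionSignalAnnotation: "2026-04-08T12:42:00Z"}
	fmt.Println(hasSpotInterruptionSignal(m))
	fmt.Println(hasSpotInterruptionSignal(map[string]string{}))
}
```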



@openshift-ci openshift-ci bot added do-not-merge/needs-area needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Mar 23, 2026
@zhfeng zhfeng changed the title feat(e2e): make SQS queue name configurable and add spot remediation verification feat(e2e): add spot remediation verification to SpotTerminationHandlerTest Mar 23, 2026
@zhfeng zhfeng force-pushed the verify_OCPSTRAT-1677 branch from cbcce5f to 01bd766 Compare March 23, 2026 03:00
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 23, 2026
@zhfeng zhfeng changed the title feat(e2e): add spot remediation verification to SpotTerminationHandlerTest CNTRLPLANE-1401: feat(e2e): add spot remediation verification to SpotTerminationHandlerTest Mar 23, 2026
@openshift-ci-robot

openshift-ci-robot commented Mar 23, 2026

@zhfeng: This pull request references CNTRLPLANE-1401 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary

Extend SpotTerminationHandlerTest to verify the full spot remediation flow after NTH taints the node:

  • Step 7: Machine gets hypershift.openshift.io/spot-interruption-signal annotation (spot remediation controller acted)
  • Step 8: Machine gets deletionTimestamp (Machine deleted)
  • Step 9: Exactly 1 replacement Machine is created with a different name and the interruptible-instance label (CAPI replacement inherits spot marker)

Also tightens the node count assertion so the test fails whenever len(nodes) != 1 rather than only when len(nodes) == 0, matching the 1-replica NodePool.

Test plan

  • Verify e2e test compiles (go vet -tags e2e ./test/e2e/...)
  • Run spot termination handler e2e test with the new verification steps

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 23, 2026
@zhfeng zhfeng marked this pull request as ready for review March 23, 2026 03:05
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 23, 2026
@openshift-ci openshift-ci bot requested review from csrwng and jparrill March 23, 2026 03:06
@openshift-ci-robot

openshift-ci-robot commented Mar 23, 2026

@zhfeng: This pull request references CNTRLPLANE-1401 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary

Extend SpotTerminationHandlerTest to verify the full spot remediation flow after NTH taints the node:

  • Step 7: Machine gets hypershift.openshift.io/spot-interruption-signal annotation (spot remediation controller acted)
  • Step 8: Machine gets deletionTimestamp (Machine deleted)
  • Step 9: Exactly 1 replacement Machine is created with a different name and the interruptible-instance label (CAPI replacement inherits spot marker)

Also tightens the node count assertion so the test fails whenever len(nodes) != 1 rather than only when len(nodes) == 0, matching the 1-replica NodePool.

Test plan

  • Verify e2e test compiles (go vet -tags e2e ./test/e2e/...)
  • Run spot termination handler e2e test with the new verification steps

Summary by CodeRabbit

  • Tests
  • Enhanced end-to-end testing for spot instance termination handling with stricter validation of node availability and comprehensive verification of the Machine replacement workflow during interruption events, ensuring proper annotation handling and instance label assignment.


@enxebre
Member

enxebre commented Mar 23, 2026

/test e2e-aws

@cwbotbot

Test Results

e2e-aws

…rTest

Extend SpotTerminationHandlerTest to verify the full spot remediation
flow after NTH taints the node: Machine gets spot-interruption-signal
annotation, Machine is marked for deletion, and exactly 1 replacement
Machine is created with the interruptible-instance label.
@zhfeng zhfeng force-pushed the verify_OCPSTRAT-1677 branch from 01bd766 to b1e5c9f Compare March 27, 2026 08:52
@zhfeng
Contributor Author

zhfeng commented Apr 2, 2026

@enxebre do you think it is Ok for merging?

@enxebre
Member

enxebre commented Apr 8, 2026

/approve

@openshift-ci
Contributor

openshift-ci bot commented Apr 8, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, zhfeng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 8, 2026
@enxebre
Member

enxebre commented Apr 8, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 8, 2026
@openshift-ci-robot

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-21
/test e2e-aws-4-21
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@openshift-ci
Contributor

openshift-ci bot commented Apr 8, 2026

@zhfeng: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-aws-4-21
Commit: b1e5c9f
Required: true
Rerun command: /test e2e-aws-4-21

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

hypershift-jira-solve-ci bot commented Apr 9, 2026


Test Failure Analysis Complete (Multi-Job)

Job 1: e2e-aws-4-21

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aws-4-21
  • Build ID: 2041848373236994048
  • Target: e2e-aws-4-21
  • Status: failure

Failed Steps

Step: hypershift-aws-run-e2e-nested
One-line failure: TestNodePool/HostedCluster0/Main/TestNTOPerformanceProfile timed out waiting for the performance profile status ConfigMap

Error

nodepool_nto_performanceprofile_test.go:159: Failed to wait for performance profile status ConfigMap to exist in 10m0s: context deadline exceeded
eventually.go:384: observed invalid **v1.ConfigMap state after 10m0s
eventually.go:401:  - observed **v1.ConfigMap collection invalid: expected 1 performance profile status ConfigMaps, got 0

Summary

TestNTOPerformanceProfile failed because the Node Tuning Operator did not produce the expected performance profile status ConfigMap (labeled NodeTuningGeneratedPerformanceProfileStatusLabel: "true") in the hosted control plane namespace within the 10-minute timeout. The performance profile config ConfigMap was created successfully (in 3s), but the corresponding status ConfigMap was never generated. This is a pre-existing flaky test unrelated to PR #8028.

Evidence

  • The performance profile config ConfigMap was created and validated successfully at line 112 of nodepool_nto_performanceprofile_test.go ("Successfully waited for performance profile ConfigMap to exist with correct name labels and annotations in 3s")
  • The status ConfigMap poll at line 159 timed out after 10m0s — 0 ConfigMaps found with the status label
  • TestSpotTerminationHandler was SKIPPED on this job because version < 4.22 (test only supported on version 4.22 and above)
  • All other 468 tests (of 473 total, 35 skipped) passed — this single TestNTOPerformanceProfile failure is the sole cause of the job failure
  • This failure is NOT related to PR #8028 (CNTRLPLANE-1401) — the PR adds spot remediation verification, and TestSpotTerminationHandler was skipped entirely in this job

Root Cause

Pre-existing NTO flake: The Node Tuning Operator failed to reconcile the PerformanceProfile status ConfigMap within the allotted timeout. This is a known intermittent issue with the NTO operator in CI environments — it is unrelated to the changes in PR #8028.


Job 2: e2e-aws

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aws
  • Build ID: 2041848373203439616
  • Target: e2e-aws
  • Status: failure (2h timeout)

Failed Steps

Step: hypershift-aws-run-e2e-nested
One-line failure: process did not finish before the 2h0m0s timeout; TestCreateCluster was stuck in EnsureMetricsForwarderWorking

Error

{"component":"entrypoint","level":"error","msg":"Process did not finish before 2h0m0s timeout","time":"2026-04-08T14:33:44Z"}
{"component":"entrypoint","level":"error","msg":"Process did not exit before 30m0s grace period","time":"2026-04-08T15:03:44Z"}
error: failed to execute wrapped command: exit status 127

Summary

All individual tests passed (including TestSpotTerminationHandler which completed successfully in 65.38s), but TestCreateCluster got stuck in its sub-test EnsureMetricsForwarderWorking. This sub-test waited indefinitely for the kube-apiserver Prometheus target to appear via the metrics-forwarder, repeatedly logging "kube-apiserver target via metrics-forwarder not found in Prometheus active targets (will retry)". The test never timed out internally and the entire 2h job timeout was consumed, killing the process.

Evidence

TestSpotTerminationHandler PASSED successfully:

  • NodePool node-pool-gskxl-test-spot-termination created with 1 spot replica (took 7m51s for node to become ready)
  • SQS queue created, rebalance recommendation event sent
  • Node tainted with aws-node-termination-handler/rebalance-recommendation in 5s
  • Machine annotated with spot-interruption-signal in 0s
  • Machine marked for deletion in 0s
  • New spot remediation verification (PR #8028, CNTRLPLANE-1401): replacement Machine with interruptible-instance label detected in 50s
  • Cleanup completed (SQS queue deleted, policy removed)
  • Total time: 887.40s (14m47s), of which the SpotTerminationHandlerTest subtest took 65.38s

TestCreateCluster/Main/EnsureMetricsForwarderWorking was the blocking test:

  • Ran at line 1554: util_metrics_proxy.go:50: Enabling metrics forwarding on HostedCluster
  • Endpoint-resolver deployment waited for
  • Then entered an infinite retry loop at util_metrics_proxy.go:117: "kube-apiserver target via metrics-forwarder not found in Prometheus active targets (will retry)"
  • Last seen retrying at line 1668 — never succeeded
  • TestCreateCluster never emitted a PASS/FAIL result, and was still running when the 2h timeout killed the process

Timing analysis — did TestSpotTerminationHandler contribute to the timeout?

  • TestSpotTerminationHandler took 887.40s (14m47s) of wall-clock time
  • However, all HostedCluster0/Main tests run in parallel within the same hosted cluster — TestSpotTerminationHandler ran concurrently with TestNodePoolInPlaceUpgrade (946s), TestNTOMachineConfigGetsRolledOut (1001s), TestRollingUpgrade (1179s), etc.
  • The overall HostedCluster0 wall-clock was 3631s (60m31s), dominated by long-running upgrade tests
  • The blocking test was EnsureMetricsForwarderWorking in TestCreateCluster, which runs in a separate hosted cluster from TestSpotTerminationHandler
  • Conclusion: Even without TestSpotTerminationHandler, the timeout would still have occurred because EnsureMetricsForwarderWorking was stuck in an infinite retry loop

TestNTOPerformanceProfile PASSED on this job (561.08s) — the flake only affected e2e-aws-4-21

Root Cause

Pre-existing metrics-forwarder infrastructure issue: TestCreateCluster/Main/EnsureMetricsForwarderWorking has no internal timeout and entered an infinite retry loop waiting for a Prometheus target that never appeared. This consumed the entire 2h job timeout. This test is version-gated to 4.22+ (it was skipped on the 4-21 job), and the failure is unrelated to PR #8028.


Aggregated Root Cause

Summary

Job: e2e-aws-4-21
  Root cause: TestNTOPerformanceProfile flake (NTO failed to create performance profile status ConfigMap)
  Related to PR #8028? No; TestSpotTerminationHandler was SKIPPED (version < 4.22)

Job: e2e-aws
  Root cause: EnsureMetricsForwarderWorking infinite retry (metrics-forwarder Prometheus target never appeared)
  Related to PR #8028? No; TestSpotTerminationHandler passed, and the stuck test runs in a separate hosted cluster

Neither failure is caused by PR #8028 (CNTRLPLANE-1401). Both are pre-existing, independent infrastructure/flake issues:

  1. TestNTOPerformanceProfile is a known intermittent failure where the Node Tuning Operator fails to reconcile the performance profile status ConfigMap in time. This has no connection to spot termination handling.

  2. EnsureMetricsForwarderWorking is a 4.22-only test that lacks an internal timeout, causing it to retry indefinitely when the metrics-forwarder fails to register its kube-apiserver target in Prometheus. The test should have an internal timeout to fail fast rather than consuming the entire job timeout.

Recommendations

  • PR #8028 (CNTRLPLANE-1401) is safe to retry — both failures are pre-existing and unrelated to the spot remediation verification changes
  • EnsureMetricsForwarderWorking needs a timeout: The test at util_metrics_proxy.go:117 retries indefinitely without a deadline. Adding a bounded timeout (e.g., 10-15 minutes) would allow the test to fail fast and not consume the entire job budget
  • TestNTOPerformanceProfile flake should be tracked: The NTO status ConfigMap creation is intermittently failing in CI — consider increasing the 10m timeout or investigating why the NTO operator sometimes fails to reconcile

Artifacts

  • e2e-aws-4-21 artifacts: .work/prow-job-analyze-test-failure/2041848373236994048/logs/
  • e2e-aws artifacts: .work/prow-job-analyze-test-failure/2041848373203439616/logs/

@hypershift-jira-solve-ci

hypershift-jira-solve-ci bot commented Apr 13, 2026


Test Failure Analysis Complete (Multi-Step)

Job Information — Job 1: e2e-aws-4-21

Job Information — Job 2: e2e-aws

PR Under Test

  • PR: openshift/hypershift#8028 — CNTRLPLANE-1401: feat(e2e): add spot remediation verification to SpotTerminationHandlerTest
  • Author: @zhfeng
  • Changes: Adds Steps 7–9 to the SpotTerminationHandlerTest, verifying that the spot remediation controller annotates the Machine with hypershift.openshift.io/spot-interruption-signal, that the Machine is marked for deletion, and that a replacement Machine with the interruptible-instance label is created by CAPI. Also tightens the node count assertion so it fails whenever len(nodes) != 1 rather than only when len(nodes) == 0.

Failed Step Analyses

Job 1: e2e-aws-4-21 — Step: hypershift-aws-run-e2e-nested

Error

TestNodePool/HostedCluster0/Main/TestNTOPerformanceProfile failed after 1059s (17m39s):

nodepool_nto_performanceprofile_test.go:159: Failed to wait for performance profile status ConfigMap to exist in 10m0s: context deadline exceeded
eventually.go:384: observed invalid **v1.ConfigMap state after 10m0s
eventually.go:401:  - observed **v1.ConfigMap collection invalid: expected 1 performance profile status ConfigMaps, got 0

Additionally, the after() framework hook reported invalid HostedCluster conditions for TestNodePool/HostedCluster0:

util.go:3025: Failed to wait for HostedCluster e2e-clusters-khdz2/node-pool-qkjhh to have valid conditions in 2s: context deadline exceeded
- incorrect condition: wanted ClusterVersionSucceeding=False, got ClusterVersionSucceeding=True
- incorrect condition: wanted ClusterVersionAvailable=False, got ClusterVersionAvailable=True
- incorrect condition: wanted DataPlaneConnectionAvailable=Unknown, got DataPlaneConnectionAvailable=True
- incorrect condition: wanted ClusterVersionProgressing=True, got ClusterVersionProgressing=False

Summary

This failure is NOT caused by PR #8028. The TestSpotTerminationHandler test was skipped in e2e-aws-4-21 because the hosted cluster runs OCP 4.21, and the test has a version gate requiring 4.22+:

nodepool_spot_termination_handler_test.go:90: test only supported on version 4.22 and above

The actual failures are:

  1. TestNTOPerformanceProfile — A pre-existing flaky test. The Node Tuning Operator (NTO) did not create the expected performance profile status ConfigMap (labeled NodeTuningGeneratedPerformanceProfileStatusLabel: "true") within 10 minutes. This is an NTO reconciliation timing issue unrelated to the PR.

  2. HostedCluster conditions mismatch — The after() hook found no worker nodes (they were removed during test teardown) and expected ClusterVersionSucceeding=False, ClusterVersionAvailable=False, etc. But the cluster was actually healthy (ClusterVersionSucceeding=True), meaning the HostedCluster controller hadn't yet reconciled conditions to reflect the "no workers" state. This is a framework condition validation race unrelated to the PR changes.

Evidence

  • TestSpotTerminationHandler was SKIPPED (line 4231): "test only supported on version 4.22 and above"
  • TestNTOPerformanceProfile failure (line 1087): waited 10m for ConfigMap, got 0 results
  • Conditions mismatch (line 4270–4274): framework expected unhealthy conditions but cluster reported healthy
  • 473 tests ran, 35 skipped, 4 failures (all nested under TestNodePool → HostedCluster0 → Main → TestNTOPerformanceProfile)

Job 2: e2e-aws — Step: hypershift-aws-run-e2e-nested

Error

Process did not finish before 2h0m0s timeout
Process did not exit before 30m0s grace period
exit status 127

Summary

This failure IS directly caused by PR #8028. The TestSpotTerminationHandler test ran and passed in e2e-aws (OCP 4.22), taking 887.40s (~14m47s) total:

  • NodePool creation + node readiness: ~822s (~13m42s)
  • SpotTerminationHandlerTest subtest (Steps 1–9 including PR's new Steps 7–9): 65.38s

The new verification steps (Steps 7–9) added by the PR completed quickly (~1 minute), but the overall test time of ~15 minutes pushed the full test suite past the 2h CI timeout. When the timeout hit at 2026-04-08T14:33:44Z, the test suite was still running TestCreateCluster/Main/EnsureMetricsForwarderWorking (repeatedly retrying: "kube-apiserver target via metrics-forwarder not found in Prometheus active targets (will retry)").

All completed tests passed, including the spot test:

--- PASS: TestNodePool/HostedCluster0/Main/TestSpotTerminationHandler (887.40s)
    --- PASS: TestNodePool/HostedCluster0/Main/TestSpotTerminationHandler/SpotTerminationHandlerTest (65.38s)

The process was killed at 2h, then the 30m grace period expired. TestCreateCluster never completed.

Evidence

  • Timeout at line 2215: "Process did not finish before 2h0m0s timeout" at 2026-04-08T14:33:44Z
  • Grace period expired at line 2236: "Process did not exit before 30m0s grace period" at 2026-04-08T15:03:44Z
  • TestCreateCluster/Main/EnsureMetricsForwarderWorking was last active test (lines 1655–1667), repeatedly failing to find Prometheus target
  • TestSpotTerminationHandler took 887s (line 2165) — the new Steps 7–9 added ~65s but the overall test (including spot NodePool provisioning) consumed ~15 minutes of wall time
  • The new Steps 7–9 themselves executed very fast:
    • Step 7 (annotation check): 0s (already present)
    • Step 8 (deletion check): 0s (already deleted)
    • Step 9 (replacement Machine): ~seconds (replacement was already provisioned)

Aggregated Root Cause

Job: e2e-aws-4-21, step hypershift-aws-run-e2e-nested
  One-line failure: TestNTOPerformanceProfile flake: NTO failed to create performance profile status ConfigMap in 10m (pre-existing, unrelated to PR)

Job: e2e-aws, step hypershift-aws-run-e2e-nested
  One-line failure: 2h process timeout: TestSpotTerminationHandler added ~15m execution time, pushing the suite past the deadline while TestCreateCluster/EnsureMetricsForwarderWorking was still retrying

Root Cause Hypothesis

The two failures have different root causes:

  1. e2e-aws-4-21 (TestNTOPerformanceProfile): Not caused by PR #8028 (CNTRLPLANE-1401). The TestSpotTerminationHandler was skipped (version < 4.22). The TestNTOPerformanceProfile failure is a known flaky test where NTO does not create the performance profile status ConfigMap within the 10-minute timeout; this same test passed in the e2e-aws job (561.08s). The conditions validation failure is a secondary symptom: the framework's after() hook saw 0 worker nodes after the NTO test consumed node resources and expected degraded conditions, but the HostedCluster controller hadn't yet reconciled.

  2. e2e-aws (2h timeout): Caused by the overall e2e suite time budget. The TestSpotTerminationHandler test — which was previously skipped in e2e-aws (pre-PR, it ran only on 4.22) — now runs with the new Steps 7–9 added by this PR. While the new steps themselves completed in ~65s, the full test lifecycle (creating a spot NodePool, provisioning an EC2 instance, running verification, waiting for Machine replacement) takes ~15 minutes. This additional ~15 minutes of serial execution within TestNodePool/HostedCluster0/Main pushed the overall suite past the 2h deadline. The actual test that was killed (TestCreateCluster/Main/EnsureMetricsForwarderWorking) was also slow (retrying Prometheus target discovery), but it would have likely completed within budget without the extra spot test time.

Recommendations

  1. For the e2e-aws timeout: The TestSpotTerminationHandler test runs serially within the TestNodePool/HostedCluster0/Main block, and its ~15m wall time is the primary factor pushing the suite over 2h. Options:

    • Increase the CI step timeout from 2h to 2h30m to accommodate the spot test
    • Optimize the spot test: The 822s spent waiting for the spot NodePool to become ready is the dominant cost. Consider reusing an existing spot-capable NodePool instead of creating a new one, or move the test to its own HostedCluster to run in parallel with other tests
    • Reduce waits in Steps 7–9: The new steps complete in <1s because the remediation controller acts quickly. The timeouts (5m, 5m, 10m) are generous but not the issue — the NodePool creation time is
  2. For the TestNTOPerformanceProfile flake: This is a pre-existing issue unrelated to PR #8028 (CNTRLPLANE-1401). File a separate bug or check if one already exists for the NTO performance profile status ConfigMap creation reliability.

  3. For the conditions validation mismatch: This is a framework issue in hypershift_framework.go where the after() hook races against condition reconciliation. It's not blocking and not caused by this PR.


Artifacts

  • Job 1 test artifacts: .work/prow-job-analyze-test-failure/2041848373236994048/logs/
  • Job 2 test artifacts: .work/prow-job-analyze-test-failure/2041848373203439616/logs/

@hypershift-jira-solve-ci

hypershift-jira-solve-ci bot commented Apr 14, 2026


Test Failure Analysis Complete (Multi-Step)

Job Information


Failed Step Analyses

Step: hypershift-aws-run-e2e-nested (test phase)

Error

Process did not finish before 2h0m0s timeout
Container test exited with code 127, reason Error

Summary

The e2e test binary hit its 2-hour hard timeout while TestCreateCluster was still running. All other top-level test suites completed successfully. 50 TestCreateCluster subtests were killed mid-execution.

The root cause is the PR's new spot remediation verification steps, which trigger a real Machine deletion and replacement cycle. While the new steps themselves complete quickly (~50s), the ongoing machine replacement during the subsequent HostedCluster0 teardown caused a ~18-minute teardown penalty — enough to push the overall suite past the 2h limit.

Evidence

Timeline:

  • Test binary started: 12:33:46 (0m)
  • TestSpotTerminationHandler/SpotTerminationHandlerTest started: ~12:41:37 (~8m)
  • PR Step 7, spot-interruption-signal annotation verified: ~12:42:00 (instant, 0s)
  • PR Step 8, Machine deletionTimestamp verified: ~12:42:00 (instant, 0s)
  • PR Step 9, replacement Machine created: ~12:42:50 (50s)
  • NodePool status stabilized after machine replacement: ~12:47:51 (5m51s)
  • HostedCluster0/Teardown began: ~13:08 (~34m)
  • Teardown error at ~13:15: "hostedcluster wasn't finalized, aborting delete: context deadline exceeded"
  • HostedCluster0/Teardown completed: ~13:33:26 (teardown took 25.3m)
  • 2h timeout hit, TestCreateCluster killed: 14:33:44 (120m)

Test suite completion status (20 parallel tests):

Test Suite Status Duration
TestNodePool/HostedCluster0 ✅ PASS 60.5m (teardown: 25.3m)
TestNodePool/HostedCluster2 ✅ PASS 56.1m (teardown: 7.1m)
TestAutoscaling ✅ PASS 53.3m (teardown: 7.7m)
TestUpgradeControlPlane ✅ PASS 51.0m (teardown: 7.8m)
TestCreateClusterRequestServingIsolation ✅ PASS 32.3m
TestNodePoolAutoscalingScaleFromZero ✅ PASS 31.0m
TestCreateClusterPrivateWithRouteKAS ✅ PASS 30.6m
TestCreateClusterCustomConfig ✅ PASS 28.2m
TestCreateClusterProxy ✅ PASS 25.8m
TestCreateClusterPrivate ✅ PASS 23.9m
TestHAEtcdChaos ✅ PASS 0.5m
TestCreateCluster ❌ TIMEOUT >120m (killed)

Teardown comparison — HostedCluster0 vs. others:

HostedCluster0 (spot test cluster):  1517.60s = 25.3m  ← ~18m penalty
HostedCluster2 (trust bundle test):   423.10s =  7.1m  ← normal
TestAutoscaling:                      463.57s =  7.7m  ← normal
TestUpgradeControlPlane:              467.00s =  7.8m  ← normal

The HostedCluster0 teardown was 3.6× slower than all other teardowns. The teardown logged:

hypershift_framework.go:559: Failed to destroy cluster, will retry: hostedcluster wasn't finalized, aborting delete: context deadline exceeded
journals.go:213: No machines associated with infra id node-pool-gskxl were found. Skipping journal dump.

This confirms that the machine replacement triggered by the PR's Step 8 (Machine deletion) left the cluster in a state where finalization was delayed.

PR's new test steps executed in the build log (lines 489-522):

nodepool_spot_termination_handler_test.go:316: Waiting for Machine ... to be annotated with hypershift.openshift.io/spot-interruption-signal
nodepool_spot_termination_handler_test.go:317: Successfully waited for ... spot-interruption-signal annotation in 0s
nodepool_spot_termination_handler_test.go:338: Waiting for Machine ... to be marked for deletion
nodepool_spot_termination_handler_test.go:339: Successfully waited for ... deletionTimestamp in 0s
nodepool_spot_termination_handler_test.go:360: Waiting for a replacement Machine (not ...) with label hypershift.openshift.io/interruptible-instance
nodepool_spot_termination_handler_test.go:361: Successfully waited for ... replacement Machine with spot label in 50s
nodepool_spot_termination_handler_test.go:402: Replacement Machine with interruptible-instance label is created
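The Step 9 wait above reduces to scanning Machines for one that is not the original and carries the spot label. A hedged sketch of that predicate (the helper name and sample Machine names are ours; the label key is taken from the log line):

```go
package main

import "fmt"

// spotLabel is the label the test expects on the CAPI replacement
// Machine (key taken from the e2e log output above).
const spotLabel = "hypershift.openshift.io/interruptible-instance"

// isReplacement reports whether a Machine qualifies as the spot
// replacement: a different name than the interrupted Machine, and the
// interruptible-instance label present.
func isReplacement(name, interruptedName string, labels map[string]string) bool {
	_, hasLabel := labels[spotLabel]
	return name != interruptedName && hasLabel
}

func main() {
	// Illustrative names only; the real names are generated by CAPI.
	interrupted := "node-pool-machine-a"
	replacement := "node-pool-machine-b"

	fmt.Println(isReplacement(interrupted, interrupted, map[string]string{spotLabel: "true"}))
	fmt.Println(isReplacement(replacement, interrupted, map[string]string{spotLabel: "true"}))
}
```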

Contributing factor — EnsureMetricsForwarderWorking retry loop:
TestCreateCluster/Main/EnsureMetricsForwarderWorking was stuck retrying when the timeout hit:

util_metrics_proxy.go:117: kube-apiserver target via metrics-forwarder not found in Prometheus active targets (will retry)

This message appeared 7 times, indicating the test was blocked waiting for a Prometheus scrape target that never appeared. Even without the teardown penalty, this test may have consumed significant time.

Artifacts Examined

  • .work/prow-job-analyze-test-failure/2041848373203439616/logs/build-log.txt
  • .work/prow-job-analyze-test-failure/2041848373203439616/logs/hypershift-aws-run-e2e-nested-build-log.txt
  • .work/prow-job-analyze-test-failure/2041848373203439616/logs/junit_operator.xml

Aggregated Root Cause

Failed Steps Summary

| Step | One-line Failure |
| --- | --- |
| hypershift-aws-run-e2e-nested | 2h timeout: the PR's spot remediation steps caused ~18m of extra teardown, pushing TestCreateCluster past the time limit |

Root Cause Hypothesis

Primary cause: PR #8028 introduced spot remediation verification steps (Steps 7–9 in the test) that trigger a real Machine deletion and replacement cycle on the HostedCluster0 cluster. While the verification itself completes quickly (50s for replacement detection), the side effect — an in-flight machine replacement — causes the HostedCluster0 teardown to take 25.3 minutes instead of the normal ~7 minutes. This ~18-minute penalty is enough to push the overall parallel test suite past its 2-hour hard timeout, killing TestCreateCluster before it can complete.

Causal chain:

  1. PR Step 8 triggers Machine deletion via spot-interruption-signal annotation → CAPI deletes the Machine
  2. PR Step 9 verifies a replacement Machine is created (takes 50s) → replacement Machine is still provisioning
  3. The test's Run() returns, but the replacement Machine is still being created
  4. HostedCluster0 teardown begins with an in-flight machine replacement
  5. Teardown fails with "hostedcluster wasn't finalized, aborting delete: context deadline exceeded" and retries
  6. Teardown eventually succeeds after 25.3m (vs. ~7m normal)
  7. TestCreateCluster, running in parallel, is killed at the 2h mark while still executing EnsureMetricsForwarderWorking

Contributing factor: TestCreateCluster/Main/EnsureMetricsForwarderWorking was stuck in a retry loop waiting for a Prometheus scrape target. This may have also consumed extra time, though the primary cause of the timeout is the teardown penalty.

Recommendations

  1. Wait for replacement Machine to be fully ready before returning from the spot test: After Step 9 (replacement Machine creation), add a step that waits for the replacement Machine to become fully provisioned (e.g., wait for the Machine's status.phase to be Running and the replacement Node to become Ready). This ensures the machine replacement is fully settled before HostedCluster0 teardown begins, avoiding the extended teardown time.

  2. Alternative: Scale down the spot NodePool before returning: After verifying the replacement Machine is created (Step 9), explicitly scale the spot NodePool to 0 replicas and wait for all Machines to be deleted. This ensures no in-flight machines during teardown. Example:

    // After verifying the replacement, scale down the spot NodePool so no
    // machine replacement is in flight when teardown starts.
    err := e2eutil.UpdateObject(t, s.ctx, s.mgmtClient, &nodePool, func(obj *hyperv1.NodePool) {
        zero := int32(0)
        obj.Spec.Replicas = &zero
    })
    if err != nil {
        t.Fatalf("failed to scale down spot NodePool: %v", err)
    }
  3. Investigate EnsureMetricsForwarderWorking flakiness: This test was stuck retrying "kube-apiserver target via metrics-forwarder not found in Prometheus active targets" and may independently contribute to timeout pressure on the e2e suite.
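Recommendation 1 amounts to gating the test's return on a "settled" condition for the replacement. A minimal sketch of that condition (the type and helper names are ours; the Running phase string follows Cluster API conventions, and the real test would poll the cluster rather than a local slice):

```go
package main

import "fmt"

// machineState captures the two readiness signals recommendation 1
// waits on before letting HostedCluster teardown begin.
type machineState struct {
	Phase     string // CAPI Machine status.phase, e.g. "Provisioned", "Running"
	NodeReady bool   // the backing Node reports condition Ready=True
}

// isSettled reports whether the replacement Machine is fully
// provisioned, i.e. no machine replacement is left in flight.
func isSettled(m machineState) bool {
	return m.Phase == "Running" && m.NodeReady
}

func main() {
	// States a poll loop might observe while the replacement provisions.
	for _, s := range []machineState{
		{Phase: "Provisioning", NodeReady: false},
		{Phase: "Provisioned", NodeReady: false},
		{Phase: "Running", NodeReady: true},
	} {
		fmt.Printf("phase=%-12s nodeReady=%-5v settled=%v\n", s.Phase, s.NodeReady, isSettled(s))
	}
}
```

Only once isSettled holds for the replacement would the spot test return and allow teardown to start.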


Artifacts

  • Test artifacts: .work/prow-job-analyze-test-failure/2041848373203439616/logs/
  • Build log: .work/prow-job-analyze-test-failure/2041848373203439616/logs/build-log.txt
  • Step log: .work/prow-job-analyze-test-failure/2041848373203439616/logs/hypershift-aws-run-e2e-nested-build-log.txt
  • JUnit: .work/prow-job-analyze-test-failure/2041848373203439616/logs/junit_operator.xml

@mgencur
Contributor

mgencur commented Apr 15, 2026

/retest

@codecov

codecov bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 26.48%. Comparing base (ff6021e) to head (b1e5c9f).
⚠️ Report is 214 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8028   +/-   ##
=======================================
  Coverage   26.48%   26.48%           
=======================================
  Files        1087     1087           
  Lines      105002   105002           
=======================================
  Hits        27809    27809           
  Misses      74785    74785           
  Partials     2408     2408           


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/testing: Indicates the PR includes changes for e2e testing.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants