CNTRLPLANE-1401: feat(e2e): add spot remediation verification to SpotTerminationHandlerTest #8028

zhfeng wants to merge 1 commit into openshift:main
Conversation
📝 Walkthrough

This change updates a spot termination handler end-to-end test to add more rigorous verification logic. The test now enforces exactly one ready spot node instead of at least one, extracts the owning CAPI Machine reference from the Node's annotations, and introduces new polling assertions to verify that the Machine receives a spot interruption signal annotation, gains a deletion timestamp, and that a replacement Machine with the interruptible-instance label is created. A new constant is also introduced.
Force-pushed cbcce5f to 01bd766 (Compare)
@zhfeng: This pull request references CNTRLPLANE-1401, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
/test e2e-aws

Test Results: e2e-aws
…rTest Extend SpotTerminationHandlerTest to verify the full spot remediation flow after NTH taints the node: the Machine gets the spot-interruption-signal annotation, the Machine is marked for deletion, and exactly one replacement Machine is created with the interruptible-instance label.
Force-pushed 01bd766 to b1e5c9f (Compare)
@enxebre do you think it is OK to merge?
/approve |
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: enxebre, zhfeng.
/lgtm |
@zhfeng: The following test failed.
Now I have comprehensive evidence. Let me produce the final report:

Test Failure Analysis Complete (Multi-Job)

Job 1: e2e-aws-4-21

Job Information

Failed Steps

Error Summary

Evidence

Root Cause

Pre-existing NTO flake: The Node Tuning Operator failed to reconcile the PerformanceProfile status ConfigMap within the allotted timeout. This is a known intermittent issue with the NTO operator in CI environments and is unrelated to the changes in PR #8028.

Job 2: e2e-aws

Job Information

Failed Steps

Error Summary

All individual tests passed (including the spot test).

Evidence

Timing analysis

Root Cause

Pre-existing metrics-forwarder infrastructure issue.

Aggregated Root Cause

Summary

Neither failure is caused by PR #8028 (CNTRLPLANE-1401). Both are pre-existing, independent infrastructure/flake issues.

Recommendations

Artifacts
Now I have all the evidence. Let me compile the final report:

Test Failure Analysis Complete (Multi-Step)

Job Information — Job 1: e2e-aws-4-21

Job Information — Job 2: e2e-aws

PR Under Test

Failed Step Analyses

Job 1: e2e-aws-4-21 — Step: hypershift-aws-run-e2e-nested

Error

Summary

This failure is NOT caused by PR #8028. The actual failures are:

Evidence

Job 2: e2e-aws — Step: hypershift-aws-run-e2e-nested

Error Summary

This failure IS directly caused by PR #8028.

The new verification steps (Steps 7–9) added by the PR completed quickly (~1 minute), but the overall test time of ~15 minutes pushed the full test suite past the 2h CI timeout. All completed tests passed, including the spot test. The process was killed at 2h, then the 30m grace period expired.

Evidence

Aggregated Root Cause

Root Cause Hypothesis

The two failures have different root causes:

Recommendations

Artifacts
Now I have all the evidence. Let me produce the final structured report.

Test Failure Analysis Complete (Multi-Step)

Job Information

Failed Step Analyses

Step: hypershift-aws-run-e2e-nested
| Event | Timestamp | Elapsed |
|---|---|---|
| Test binary started | 12:33:46 | 0m |
| TestSpotTerminationHandler/SpotTerminationHandlerTest started | ~12:41:37 | ~8m |
| PR Step 7: spot-interruption-signal annotation verified | ~12:42:00 | instant (0s) |
| PR Step 8: Machine deletionTimestamp verified | ~12:42:00 | instant (0s) |
| PR Step 9: Replacement Machine created | ~12:42:50 | 50s |
| NodePool status stabilized after machine replacement | ~12:47:51 | 5m 51s |
| HostedCluster0/Teardown began | ~13:08 | ~34m |
| Teardown error: "hostedcluster wasn't finalized, aborting delete: context deadline exceeded" | ~13:15 | - |
| HostedCluster0/Teardown completed | ~13:33:26 | 25.3m teardown |
| 2h timeout hit — TestCreateCluster killed | 14:33:44 | 120m |
Test suite completion status (20 parallel tests):
| Test Suite | Status | Duration |
|---|---|---|
| TestNodePool/HostedCluster0 | ✅ PASS | 60.5m (teardown: 25.3m) |
| TestNodePool/HostedCluster2 | ✅ PASS | 56.1m (teardown: 7.1m) |
| TestAutoscaling | ✅ PASS | 53.3m (teardown: 7.7m) |
| TestUpgradeControlPlane | ✅ PASS | 51.0m (teardown: 7.8m) |
| TestCreateClusterRequestServingIsolation | ✅ PASS | 32.3m |
| TestNodePoolAutoscalingScaleFromZero | ✅ PASS | 31.0m |
| TestCreateClusterPrivateWithRouteKAS | ✅ PASS | 30.6m |
| TestCreateClusterCustomConfig | ✅ PASS | 28.2m |
| TestCreateClusterProxy | ✅ PASS | 25.8m |
| TestCreateClusterPrivate | ✅ PASS | 23.9m |
| TestHAEtcdChaos | ✅ PASS | 0.5m |
| TestCreateCluster | ❌ TIMEOUT | >120m (killed) |
Teardown comparison — HostedCluster0 vs. others:
HostedCluster0 (spot test cluster): 1517.60s = 25.3m ← ~18m penalty
HostedCluster2 (trust bundle test): 423.10s = 7.1m ← normal
TestAutoscaling: 463.57s = 7.7m ← normal
TestUpgradeControlPlane: 467.00s = 7.8m ← normal
The HostedCluster0 teardown was 3.6× slower than all other teardowns. The teardown logged:
hypershift_framework.go:559: Failed to destroy cluster, will retry: hostedcluster wasn't finalized, aborting delete: context deadline exceeded
journals.go:213: No machines associated with infra id node-pool-gskxl were found. Skipping journal dump.
This confirms that the machine replacement triggered by the PR's Step 8 (Machine deletion) left the cluster in a state where finalization was delayed.
PR's new test steps executed in the build log (lines 489-522):
nodepool_spot_termination_handler_test.go:316: Waiting for Machine ... to be annotated with hypershift.openshift.io/spot-interruption-signal
nodepool_spot_termination_handler_test.go:317: Successfully waited for ... spot-interruption-signal annotation in 0s
nodepool_spot_termination_handler_test.go:338: Waiting for Machine ... to be marked for deletion
nodepool_spot_termination_handler_test.go:339: Successfully waited for ... deletionTimestamp in 0s
nodepool_spot_termination_handler_test.go:360: Waiting for a replacement Machine (not ...) with label hypershift.openshift.io/interruptible-instance
nodepool_spot_termination_handler_test.go:361: Successfully waited for ... replacement Machine with spot label in 50s
nodepool_spot_termination_handler_test.go:402: Replacement Machine with interruptible-instance label is created
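The waits shown in the log excerpt above hinge on the `hypershift.openshift.io/spot-interruption-signal` annotation appearing on the owning Machine. A minimal sketch of that check; the annotation key comes straight from the test output, but the helper itself is illustrative, not the test's actual code:

```go
package main

import "fmt"

// spotInterruptionAnnotation is the annotation the test waits for; the key is
// taken verbatim from the build-log excerpt above.
const spotInterruptionAnnotation = "hypershift.openshift.io/spot-interruption-signal"

// hasSpotInterruptionSignal reports whether a Machine's annotations carry the
// spot interruption signal set by the remediation controller. This helper is
// a stand-in for the test's real polling predicate.
func hasSpotInterruptionSignal(annotations map[string]string) bool {
	_, ok := annotations[spotInterruptionAnnotation]
	return ok
}

func main() {
	before := map[string]string{}
	after := map[string]string{spotInterruptionAnnotation: "true"}
	fmt.Println(hasSpotInterruptionSignal(before), hasSpotInterruptionSignal(after))
}
```

In the real test this predicate runs inside a polling loop against the live Machine object, which is why the log reports "Successfully waited for ... in 0s" when the controller has already acted.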
Contributing factor — EnsureMetricsForwarderWorking retry loop:
TestCreateCluster/Main/EnsureMetricsForwarderWorking was stuck retrying when the timeout hit:
util_metrics_proxy.go:117: kube-apiserver target via metrics-forwarder not found in Prometheus active targets (will retry)
This message appeared 7 times, indicating the test was blocked waiting for a Prometheus scrape target that never appeared. Even without the teardown penalty, this test may have consumed significant time.
Artifacts Examined

- .work/prow-job-analyze-test-failure/2041848373203439616/logs/build-log.txt
- .work/prow-job-analyze-test-failure/2041848373203439616/logs/hypershift-aws-run-e2e-nested-build-log.txt
- .work/prow-job-analyze-test-failure/2041848373203439616/logs/junit_operator.xml
Aggregated Root Cause
Failed Steps Summary
| Step | One-line Failure |
|---|---|
| hypershift-aws-run-e2e-nested | 2h timeout: PR's spot remediation steps caused 18m extra teardown, pushing TestCreateCluster past the time limit |
Root Cause Hypothesis
Primary cause: PR #8028 introduced spot remediation verification steps (Steps 7–9 in the test) that trigger a real Machine deletion and replacement cycle on the HostedCluster0 cluster. While the verification itself completes quickly (50s for replacement detection), the side effect — an in-flight machine replacement — causes the HostedCluster0 teardown to take 25.3 minutes instead of the normal ~7 minutes. This ~18-minute penalty is enough to push the overall parallel test suite past its 2-hour hard timeout, killing TestCreateCluster before it can complete.
Causal chain:

- PR Step 8 triggers Machine deletion via the spot-interruption-signal annotation → CAPI deletes the Machine
- PR Step 9 verifies a replacement Machine is created (takes 50s) → the replacement Machine is still provisioning
- The test's Run() returns, but the replacement Machine is still being created
- HostedCluster0 teardown begins with an in-flight machine replacement
- Teardown fails with "hostedcluster wasn't finalized, aborting delete: context deadline exceeded" and retries
- Teardown eventually succeeds after 25.3m (vs. ~7m normal)
- TestCreateCluster, running in parallel, is killed at the 2h mark while still executing EnsureMetricsForwarderWorking
Contributing factor: TestCreateCluster/Main/EnsureMetricsForwarderWorking was stuck in a retry loop waiting for a Prometheus scrape target. This may have also consumed extra time, though the primary cause of the timeout is the teardown penalty.
Recommendations
- Wait for the replacement Machine to be fully ready before returning from the spot test: after Step 9 (replacement Machine creation), add a step that waits for the replacement Machine to become fully provisioned (e.g., wait for the Machine's status.phase to be Running and the replacement Node to become Ready). This ensures the machine replacement is fully settled before the HostedCluster0 teardown begins, avoiding the extended teardown time.

- Alternative: scale down the spot NodePool before returning: after verifying the replacement Machine is created (Step 9), explicitly scale the spot NodePool to 0 replicas and wait for all Machines to be deleted. This ensures no in-flight machines during teardown. Example:

  // After verifying replacement, scale down the NodePool to avoid teardown delays
  err = e2eutil.UpdateObject(t, s.ctx, s.mgmtClient, &nodePool, func(obj *hyperv1.NodePool) {
      zero := int32(0)
      obj.Spec.Replicas = &zero
  })

- Investigate EnsureMetricsForwarderWorking flakiness: this test was stuck retrying "kube-apiserver target via metrics-forwarder not found in Prometheus active targets" and may independently contribute to timeout pressure on the e2e suite.
Artifacts

- Test artifacts: .work/prow-job-analyze-test-failure/2041848373203439616/logs/
- Build log: .work/prow-job-analyze-test-failure/2041848373203439616/logs/build-log.txt
- Step log: .work/prow-job-analyze-test-failure/2041848373203439616/logs/hypershift-aws-run-e2e-nested-build-log.txt
- JUnit: .work/prow-job-analyze-test-failure/2041848373203439616/logs/junit_operator.xml
/retest |
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@ Coverage Diff @@
## main #8028 +/- ##
=======================================
Coverage 26.48% 26.48%
=======================================
Files 1087 1087
Lines 105002 105002
=======================================
Hits 27809 27809
Misses 74785 74785
Partials 2408 2408
Summary

Extend SpotTerminationHandlerTest to verify the full spot remediation flow after NTH taints the node:

- hypershift.openshift.io/spot-interruption-signal annotation (spot remediation controller acted)
- deletionTimestamp (Machine deleted)
- interruptible-instance label (CAPI replacement inherits spot marker)

Also tightens the node count assertion from len(nodes) == 0 to len(nodes) != 1 to match the 1-replica NodePool.

Test plan

- go vet -tags e2e ./test/e2e/...