CNTRLPLANE-2204: improve AllNodesHealthy and AllMachinesReady conditions#8247
Conversation
Pipeline controller notification: this repository is configured in LGTM mode.
Force-pushed: 5812e24 to f6a66cc
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #8247      +/-   ##
==========================================
+ Coverage   35.62%   35.66%   +0.03%
==========================================
  Files         767      767
  Lines       93333    93387      +54
==========================================
+ Hits        33248    33302      +54
  Misses      57396    57396
  Partials     2689     2689
```
Force-pushed: f6a66cc to 4040c92
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Sequence diagram:

```mermaid
sequenceDiagram
    participant Controller as NodePoolController
    participant API as MachineAPI
    participant Finder as findMachineStatusCondition
    participant Aggregator as aggregateMachineReasonsAndMessages
    participant NP as NodePoolStatus
    Controller->>API: list Machines
    API-->>Controller: Machines[]
    loop for each Machine
        Controller->>Finder: findMachineStatusCondition(machine, conditionType)
        Finder-->>Controller: {Status, Reason, Message} or nil
    end
    Controller->>Aggregator: aggregateMachineReasonsAndMessages(per-machine results)
    Aggregator-->>Controller: aggregatedReasons, aggregatedMessage (truncated)
    Controller->>NP: set NodePool conditions (Healthy/Ready) with aggregated info
    NP-->>Controller: status updated
```
🚥 Pre-merge checks: ✅ 9 passed, ❌ 1 failed (1 warning).
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 1
🧹 Nitpick comments (1)
hypershift-operator/controllers/nodepool/nodepool_controller.go (1)
1123-1128: Sort each reason bucket before building the message.

The reason keys are deterministic now, but `messageMap[reason]` still reflects machine list order. Since Kubernetes list order is not guaranteed, the per-machine lines inside a single reason can still reshuffle between reconciles and trigger needless status patches.

💡 Proposed fix

```diff
 for _, reason := range reasons {
+	sort.Strings(messageMap[reason])
 	reasonBlock := aggregateMachineMessages(messageMap[reason])
 	if msgBuilder.Len()+len(reasonBlock) > maxGlobalMessageLength {
 		msgBuilder.WriteString(endOfGlobalMessage)
 		break
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/nodepool/nodepool_controller.go` around lines 1123 - 1128, The per-reason message bucket (messageMap[reason]) is using machine list order which can vary; before calling aggregateMachineMessages and appending to msgBuilder you should sort the items in that bucket deterministically (e.g. by machine name or another stable key) so aggregateMachineMessages receives a stable, ordered slice; update the code around reasonBlock creation (where reasonBlock := aggregateMachineMessages(messageMap[reason]) is called) to sort messageMap[reason] in-place or pass a sorted copy, keeping references to messageMap, reason, aggregateMachineMessages, msgBuilder, maxGlobalMessageLength, and endOfGlobalMessage to locate the change.
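The determinism concern above can be sketched outside the controller. The following is a minimal, self-contained illustration; the `buildMessage` helper and its shapes are hypothetical stand-ins for the aggregation step, not the HyperShift code. Sorting the reason keys and each per-reason bucket yields byte-identical output regardless of the order the Machine list arrives in.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildMessage groups per-machine messages by reason and emits one block per
// reason. Sorting the reason keys AND each bucket makes the output identical
// across reconciles, no matter what order the Machine list is returned in.
func buildMessage(messageMap map[string][]string) string {
	reasons := make([]string, 0, len(messageMap))
	for r := range messageMap {
		reasons = append(reasons, r)
	}
	sort.Strings(reasons) // deterministic reason order

	var b strings.Builder
	for _, reason := range reasons {
		msgs := append([]string(nil), messageMap[reason]...) // copy, don't mutate input
		sort.Strings(msgs)                                   // deterministic order inside the bucket
		b.WriteString(reason + ": " + strings.Join(msgs, "; ") + "\n")
	}
	return b.String()
}

func main() {
	// Same map contents, regardless of insertion or list order, yield one output.
	m := map[string][]string{
		"NodeConditionsFailed": {"machine-b: kubelet not ready", "machine-a: kubelet not ready"},
		"WaitingForNodeRef":    {"machine-c: no node ref"},
	}
	fmt.Print(buildMessage(m))
}
```

A stable message means the status comparison sees no diff, so no needless patch is issued.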
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hypershift-operator/controllers/nodepool/nodepool_controller.go`:
- Around line 1124-1126: The truncation suffix is appended after checking if
adding reasonBlock would overflow, which can still push the final message past
maxGlobalMessageLength; update the logic around msgBuilder, reasonBlock,
maxGlobalMessageLength and endOfGlobalMessage so the check considers the suffix
size too (e.g., test msgBuilder.Len()+len(reasonBlock)+len(endOfGlobalMessage)
against maxGlobalMessageLength), and if it would overflow either truncate
reasonBlock to leave room for endOfGlobalMessage or write only
endOfGlobalMessage and break so the global cap is never exceeded.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: a7e8efa0-4f57-4e56-b0d9-ab1911343902
📒 Files selected for processing (4)
- hypershift-operator/controllers/nodepool/conditions.go
- hypershift-operator/controllers/nodepool/nodepool_controller.go
- hypershift-operator/controllers/nodepool/nodepool_controller_test.go
- hypershift-operator/controllers/nodepool/version.go
Force-pushed: 4040c92 to 4b86074
🧹 Nitpick comments (1)
hypershift-operator/controllers/nodepool/conditions.go (1)
111-123: Consider adding a nil guard for defensive coding.

The function doesn't check if `machine` is nil before accessing `machine.Status.Conditions`. While current callers iterate over slices that shouldn't contain nil pointers, a nil check would make this helper more robust for future use.

🛡️ Optional defensive fix

```diff
 func findMachineStatusCondition(machine *capiv1.Machine, conditionType string) *machineConditionResult {
+	if machine == nil {
+		return nil
+	}
 	for i := range machine.Status.Conditions {
 		if string(machine.Status.Conditions[i].Type) == conditionType {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/nodepool/conditions.go` around lines 111 - 123, Add a nil guard at the start of findMachineStatusCondition so it returns nil immediately if the input machine is nil; this prevents dereferencing a nil pointer when accessing machine.Status.Conditions and keeps callers safe. Update the function findMachineStatusCondition to check "if machine == nil { return nil }" before iterating, while leaving the existing construction of machineConditionResult (Status, Reason, Message) unchanged.
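A minimal sketch of the suggested guard, using local stand-in types rather than the real `capiv1.Machine` so it runs on its own:

```go
package main

import "fmt"

// Local stand-ins for the CAPI types; the real helper takes *capiv1.Machine.
type condition struct{ Type, Status, Reason, Message string }
type machine struct{ Conditions []condition }
type machineConditionResult struct{ Status, Reason, Message string }

// findMachineStatusCondition with the suggested nil guard: a nil machine
// yields no result instead of a nil-pointer dereference.
func findMachineStatusCondition(m *machine, conditionType string) *machineConditionResult {
	if m == nil {
		return nil
	}
	for i := range m.Conditions {
		if m.Conditions[i].Type == conditionType {
			c := m.Conditions[i]
			return &machineConditionResult{Status: c.Status, Reason: c.Reason, Message: c.Message}
		}
	}
	return nil
}

func main() {
	fmt.Println(findMachineStatusCondition(nil, "Ready") == nil) // no panic on nil input
	m := &machine{Conditions: []condition{{Type: "Ready", Status: "True"}}}
	fmt.Println(findMachineStatusCondition(m, "Ready").Status)
}
```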
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: f582d6b2-f86e-4725-b26e-211aec3904dd
📒 Files selected for processing (4)
- hypershift-operator/controllers/nodepool/conditions.go
- hypershift-operator/controllers/nodepool/nodepool_controller.go
- hypershift-operator/controllers/nodepool/nodepool_controller_test.go
- hypershift-operator/controllers/nodepool/version.go
✅ Files skipped from review due to trivial changes (1)
- hypershift-operator/controllers/nodepool/version.go
🚧 Files skipped from review as they are similar to previous changes (1)
- hypershift-operator/controllers/nodepool/nodepool_controller.go
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: csrwng, sdminonne. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
```go
},
expectedAllNodes: &testCondition{
	Status: corev1.ConditionFalse,
	Reason: capiv1.NodeConditionsFailedReason + "," + capiv1.WaitingForNodeRefReason,
```
Is this truncated in any way so we don't break the MaxLength=1024 of the Reason field?

Same question for the Message MaxLength.
Oh cool! Thanks @enxebre.

While we truncate the messages in this PR, for the Reason there's no specific logic to keep it under 1024. Even though I'm not sure there are enough Reasons to reach 1024, it's a theoretical gap, I agree.

Reworking it.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hypershift-operator/controllers/nodepool/nodepool_controller_test.go`:
- Around line 2054-2055: The test currently skips the exact-length assertion for
the "When empty reasons..." case because the check uses `if expectExactLen > 0`
which ignores `expectExactLen == 0`; update the assertion guard so zero-length
expectations are asserted (e.g., change the condition to `if expectExactLen >=
0` or adopt a sentinel like `-1` for "no-check" and adjust cases accordingly),
and ensure the test case named "When empty reasons..." (where `expectExactLen:
0`) triggers the `assert.Len(..., expectExactLen)` branch; reference the
`expectExactLen` field and the assertion block in the
nodepool_controller_test.go test function to implement this change.
In `@hypershift-operator/controllers/nodepool/nodepool_controller.go`:
- Around line 1156-1158: In truncateReasons the logic that writes endOfReasons
can produce a leading separator when nothing else fits; change the branch that
appends endOfReasons (using builder, separator, endOfReasons, maxReasonLength)
to detect if builder.Len() == 0 and, in that case, write endOfReasons without
the separator (or otherwise trim the leading separator) so you never produce a
comma-only prefix like ",ReasonsTruncated". Ensure the check still enforces
maxReasonLength and preserves existing behavior when builder already contains
entries.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 12a91ca8-7b68-4451-b490-dc61bd20cbd9
📒 Files selected for processing (2)
hypershift-operator/controllers/nodepool/nodepool_controller.gohypershift-operator/controllers/nodepool/nodepool_controller_test.go
```go
expectExactLen int // if > 0, expect exact length
}{
```

Empty-reasons case is not actually asserted.

Line 2137 checks `expectExactLen > 0`, so the "When empty reasons..." case (`expectExactLen: 0`) skips its exact-length assertion.

💡 Proposed fix

```diff
-	expectExactLen int // if > 0, expect exact length
+	expectExactLen *int // if non-nil, expect exact length
@@
-	expectExactLen: 0,
+	expectExactLen: ptr.To(0),
@@
-	if tc.expectExactLen > 0 {
-		g.Expect(len(result)).To(Equal(tc.expectExactLen))
+	if tc.expectExactLen != nil {
+		g.Expect(len(result)).To(Equal(*tc.expectExactLen))
 	}
```

Also applies to: 2137-2139
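The sentinel pattern this fix proposes can be sketched in isolation. The `checkLen` helper and `intPtr` are illustrative, not the test file's actual code: a `*int` cleanly separates "expect exactly length 0" from "no expectation at all".

```go
package main

import "fmt"

// intPtr plays the role of ptr.To in this sketch.
func intPtr(v int) *int { return &v }

// checkLen mirrors the suggested test pattern: a *int expectation makes
// "expect exactly 0" distinguishable from "don't check the length at all".
func checkLen(result string, expectExactLen *int) bool {
	if expectExactLen == nil {
		return true // no expectation configured
	}
	return len(result) == *expectExactLen
}

func main() {
	fmt.Println(checkLen("", intPtr(0)))    // zero-length case is now asserted
	fmt.Println(checkLen("abc", nil))       // no-check case still passes
	fmt.Println(checkLen("abc", intPtr(0))) // mismatch is detected
}
```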
```go
if builder.Len()+len(separator)+len(reason)+len(endOfReasons) > maxReasonLength {
	builder.WriteString(endOfReasons)
	break
```
Avoid leading comma when only the truncation suffix fits.

If no reason can fit (e.g., an exceptionally long first reason), `truncateReasons` returns `,ReasonsTruncated`. That produces a malformed comma-separated list.
💡 Proposed fix

```diff
 if builder.Len()+len(separator)+len(reason)+len(endOfReasons) > maxReasonLength {
-	builder.WriteString(endOfReasons)
+	if builder.Len() == 0 {
+		builder.WriteString(strings.TrimPrefix(endOfReasons, ","))
+	} else {
+		builder.WriteString(endOfReasons)
+	}
 	break
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
if builder.Len()+len(separator)+len(reason)+len(endOfReasons) > maxReasonLength {
	if builder.Len() == 0 {
		builder.WriteString(strings.TrimPrefix(endOfReasons, ","))
	} else {
		builder.WriteString(endOfReasons)
	}
	break
}
```
@CodeRabbit I have a hard time imagining a 1024-character-long reason. I would say it is very, very unlikely.
@sdminonne, you're right — a single CAPI condition reason exceeding 1006 characters (i.e. maxReasonLength - len(",ReasonsTruncated")) is essentially impossible in practice given how CAPI reasons are structured (short identifiers like WaitingForInfrastructure, NodeNotFound, etc.). The concern is purely theoretical. Happy to leave the code as-is.
Force-pushed: 8167954 to 61ee334
@enxebre ptal

/lgtm

Scheduling tests matching the
/retest |
Test Results: e2e-aws, e2e-aks
AI Test Failure Analysis (generated by the hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6).

AI Test Failure Analysis (generated by the hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6).
Scheduling tests matching the

/verified by nodepool_test.go

@mgencur: This PR has been marked as verified.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test e2e-aws-4-22

/retest

/test e2e-kubevirt-aws-ovn-reduced

/retest

/test e2e-kubevirt-aws-ovn-reduced
@sdminonne: This pull request references CNTRLPLANE-2204 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.21" instead.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@sdminonne: This pull request references CNTRLPLANE-2204 which is a valid jira issue.
/rerun

/retest-required
/hold

Revision e977a39 was retested 3 times: holding.
/test e2e-kubevirt-aws-ovn-reduced

/retest

1 similar comment:

/retest
Now I have all the evidence needed for a comprehensive report. Here it is: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryBoth TestCreateCluster and TestAutoscaling failed identically: HostedClusters rolled out successfully (nodes ready, CVO completed, all operators available) but the new Root CauseThe root cause is a platform-awareness bug in The bug: When This causes the Why both tests failed identically: Both TestCreateCluster and TestAutoscaling create KubeVirt-based HostedClusters, so both hit the same CNO rollout check bug. The CVO (old version tracking) completed successfully ( PR #8247 is NOT the cause: The PR modifies only NodePool condition logic ( Recommendations
Evidence
|
/test e2e-kubevirt-aws-ovn-reduced

/hold cancel

@sdminonne: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Summary

- `findMachineStatusCondition` to look up conditions on a CAPI Machine and normalize them into a common `machineConditionResult` struct, replacing the previous `findCAPIStatusCondition` helper
- `setAllNodesHealthyCondition` to correctly treat machines with no `NodeHealthy` condition as not healthy (reason `WaitingForNodeRef`) instead of silently ignoring them
- `setAllNodesHealthyCondition` to use the same `aggregateMachineReasonsAndMessages` pattern as `setAllMachinesReadyCondition`, reporting e.g. "2 of 3 machines are not healthy"
- `setAllMachinesReadyCondition` to treat machines with no `Ready` condition as not ready (reason `WaitingForInfrastructure`) instead of silently ignoring them
- `Message` in the Ready condition fallback path when present and meaningful

Test plan

- `findMachineStatusCondition` covering condition found, not found, and specific values
- `setAllNodesHealthyCondition` when machines have no `NodeHealthy` condition
- `setAllMachinesReadyCondition` when machines have no `Ready` condition
- `make test` to verify all unit tests pass
- `make verify` to check formatting and generation

Ref: CNTRLPLANE-2204
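The "2 of 3 machines are not healthy" wording from the summary can be illustrated with a hypothetical sketch; the `summarize` helper is illustrative, and the real message is assembled inside the NodePool condition setters:

```go
package main

import "fmt"

// summarize is an illustrative reconstruction of the "N of M" wording used
// when setting the AllNodesHealthy / AllMachinesReady conditions.
func summarize(total, notHealthy int) string {
	if notHealthy == 0 {
		return fmt.Sprintf("All %d machines are healthy", total)
	}
	return fmt.Sprintf("%d of %d machines are not healthy", notHealthy, total)
}

func main() {
	fmt.Println(summarize(3, 2)) // 2 of 3 machines are not healthy
	fmt.Println(summarize(3, 0))
}
```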
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests