OCPBUGS-86306: Stop controllers fighting over HCP status#8562
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@jkyros: This pull request references Jira Issue OCPBUGS-86306, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (7)
📝 WalkthroughWalkthroughThis pull request replaces direct HostedControlPlane status updates with optimistic-locking status patches across multiple controllers. The hcpstatus controller now generates RFC‑6902 JSON Patch ops (including a resourceVersion test) and issues Status().Patch. Other controllers (hostedcontrolplane, resources, etcdbackup, ignition-server) deep-copy original HCP objects and persist condition changes via Status().Patch with MergeFromWithOptimisticLock. New unit tests validate patch generation and backup-condition patch behavior. Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jkyros The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@jkyros: This pull request references Jira Issue OCPBUGS-86306, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ignition-server/controllers/local_ignitionprovider.go (1)
768-788:⚠️ Potential issue | 🟠 Major | ⚡ Quick winWrap reconcileValidReleaseInfoCondition in retry logic for optimistic-lock conflicts.
This helper now uses
Status().Patch(..., client.MergeFromWithOptimisticLock{}), which returns 409 conflicts during concurrent HCP status writes. However, the caller at lines 451-452 only logs the error and continues, leavingIgnitionServerValidReleaseInfostale indefinitely—there's no guaranteed near-term retry.Wrap the helper logic in
retry.RetryOnConflict(retry.DefaultBackoff, func() { ... })and re-fetch the HostedControlPlane inside the retry block. This pattern is already used insupport/k8sutil/object.goand aligns with Kubernetes best practices.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ignition-server/controllers/local_ignitionprovider.go` around lines 768 - 788, Wrap the existing status-patch logic in retry.RetryOnConflict(retry.DefaultBackoff, func() error { ... }) so optimistic-lock 409s are retried: inside the retry block re-fetch the latest HostedControlPlane using r.Client.Get into hostedControlPlane, create originalHCP := hostedControlPlane.DeepCopy(), compute/set the IgnitionServerValidReleaseInfo condition (use releaseImageProvider.GetMissingImages() and meta.SetStatusCondition as before), then call r.Client.Status().Patch(ctx, &hostedControlPlane, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{})) and return the patch error; let RetryOnConflict surface a final error if retries fail.
🧹 Nitpick comments (3)
control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go (2)
19-160: ⚡ Quick winAdd parallelization to align with unit-test execution guidance.
This test suite does not use
t.Parallel(), so it misses the repository’s unit-test parallelism requirement.Suggested change
func TestMergeFromPatchDoesNotStompUnrelatedFields(t *testing.T) { + t.Parallel() t.Run("When hcpstatus controller patches conditions and versionStatus, it should not include autoNode in the patch", func(t *testing.T) { + t.Parallel() g := NewWithT(t) @@ t.Run("When hcpstatus controller makes no changes, it should produce an empty patch", func(t *testing.T) { + t.Parallel() g := NewWithT(t) @@ t.Run("When a condition-only controller patches, the conditions array should contain ALL conditions from the read", func(t *testing.T) { + t.Parallel() g := NewWithT(t)As per coding guidelines, "Use race detection and parallel execution for unit tests".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go` around lines 19 - 160, The test suite TestMergeFromPatchDoesNotStompUnrelatedFields is missing parallelization; add t.Parallel() at the start of TestMergeFromPatchDoesNotStompUnrelatedFields and inside each subtest closure (the three t.Run anonymous funcs) so each subtest and the parent test run with testing parallelism and race detection enabled; look for the TestMergeFromPatchDoesNotStompUnrelatedFields function and the three t.Run blocks to insert the t.Parallel() calls.
151-152: ⚡ Quick winAvoid panic-prone type assertions in tests.
These direct assertions can panic and hide the real failure context. Prefer checked assertions so failures stay diagnostic.
Suggested change
- status := patchMap["status"].(map[string]interface{}) - conditions := status["conditions"].([]interface{}) + status, ok := patchMap["status"].(map[string]interface{}) + g.Expect(ok).To(BeTrue(), "patch should contain a status object") + conditions, ok := status["conditions"].([]interface{}) + g.Expect(ok).To(BeTrue(), "status.conditions should be a JSON array")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go` around lines 151 - 152, Replace the unchecked type assertions on patchMap["status"] and status["conditions"] with guarded checks so the test fails with a clear message instead of panicking: retrieve the value using the comma-ok form (e.g., statusIface, ok := patchMap["status"]; if !ok { t.Fatalf(...) } then status, ok := statusIface.(map[string]interface{}); if !ok { t.Fatalf(...) }) and likewise check conditionsIface, ok := status["conditions"]; if !ok { t.Fatalf(...) } then conditions, ok := conditionsIface.([]interface{}); if !ok { t.Fatalf(...) }); alternatively use your test helper (e.g., require.IsType/require.NotNil) around the same symbols (patchMap, status, conditions) to make assertions explicit and diagnostic.control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go (1)
570-576: ⚡ Quick winSkip no-op recovery status patches.
reconcileClusterRecoveryalways patches aftermeta.SetStatusCondition, even when the condition is unchanged. Because this path polls every 120 seconds, steady-state recovery checks become repeated write attempts and avoidable optimistic-lock conflicts on the same HCP status object. Guard on the boolean returned bymeta.SetStatusConditionto skip writing unchanged conditions.♻️ Suggested change
originalHCP := hcp.DeepCopy() - meta.SetStatusCondition(&hcp.Status.Conditions, *condition) + if !meta.SetStatusCondition(&hcp.Status.Conditions, *condition) { + if !finished { + return ctrl.Result{RequeueAfter: 120 * time.Second}, nil + } + return ctrl.Result{}, nil + } log.Info("setting condition", "type", condition.Type, "status", condition.Status, "message", condition.Message) if err := r.cpClient.Status().Patch(ctx, hcp, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{})); err != nil { return ctrl.Result{}, fmt.Errorf("failed to patch status on hcp for hosted cluster recovery: %w. Condition error message: %v", err, condition.Message) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go` around lines 570 - 576, reconcileClusterRecovery is always patching HCP status even when meta.SetStatusCondition didn't change the condition; capture the boolean return from meta.SetStatusCondition and only call r.cpClient.Status().Patch (using originalHCP and the same MergeFromWithOptions/OptimisticLock) if that boolean is true to skip no-op writes and avoid redundant optimistic-lock updates.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go`:
- Around line 79-80: The patch call on h.mgtClusterClient.Status().Patch (using
crclient.MergeFromWithOptions and MergeFromWithOptimisticLock) must treat
optimistic-lock conflicts differently: after the Patch returns an error, detect
conflicts with apierrors.IsConflict(err) and, if true, return reconcile.Result{}
with a nil error so the controller runtime can requeue with backoff; for
non-conflict errors continue returning the wrapped fmt.Errorf("failed to patch
hcp status: %w", err). Locate the error handling directly after the Patch
invocation in the hcp status reconciliation code and insert the
apierrors.IsConflict check there.
In `@hypershift-operator/controllers/etcdbackup/reconciler.go`:
- Around line 302-305: The patch that sets meta.SetStatusCondition and calls
r.Status().Patch(hcp, client.MergeFromWithOptions(originalHCP,
client.MergeFromWithOptimisticLock{})) should be wrapped in
retry.RetryOnConflict so optimistic-lock 409s are retried: on each retry re-get
the latest HCP (fresh read into a new variable), reapply
condition.ObservedGeneration = hcp.Generation and
meta.SetStatusCondition(&hcp.Status.Conditions, condition) against the fresh
HCP, then call r.Status().Patch with client.MergeFromWithOptions(originalHCP,
client.MergeFromWithOptimisticLock{}) (or use a new merge-from base as
appropriate) until success or non-conflict error; ensure errors are returned if
retries exhaust.
---
Outside diff comments:
In `@ignition-server/controllers/local_ignitionprovider.go`:
- Around line 768-788: Wrap the existing status-patch logic in
retry.RetryOnConflict(retry.DefaultBackoff, func() error { ... }) so
optimistic-lock 409s are retried: inside the retry block re-fetch the latest
HostedControlPlane using r.Client.Get into hostedControlPlane, create
originalHCP := hostedControlPlane.DeepCopy(), compute/set the
IgnitionServerValidReleaseInfo condition (use
releaseImageProvider.GetMissingImages() and meta.SetStatusCondition as before),
then call r.Client.Status().Patch(ctx, &hostedControlPlane,
client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{}))
and return the patch error; let RetryOnConflict surface a final error if retries
fail.
---
Nitpick comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go`:
- Around line 19-160: The test suite
TestMergeFromPatchDoesNotStompUnrelatedFields is missing parallelization; add
t.Parallel() at the start of TestMergeFromPatchDoesNotStompUnrelatedFields and
inside each subtest closure (the three t.Run anonymous funcs) so each subtest
and the parent test run with testing parallelism and race detection enabled;
look for the TestMergeFromPatchDoesNotStompUnrelatedFields function and the
three t.Run blocks to insert the t.Parallel() calls.
- Around line 151-152: Replace the unchecked type assertions on
patchMap["status"] and status["conditions"] with guarded checks so the test
fails with a clear message instead of panicking: retrieve the value using the
comma-ok form (e.g., statusIface, ok := patchMap["status"]; if !ok {
t.Fatalf(...) } then status, ok := statusIface.(map[string]interface{}); if !ok
{ t.Fatalf(...) }) and likewise check conditionsIface, ok :=
status["conditions"]; if !ok { t.Fatalf(...) } then conditions, ok :=
conditionsIface.([]interface{}); if !ok { t.Fatalf(...) }); alternatively use
your test helper (e.g., require.IsType/require.NotNil) around the same symbols
(patchMap, status, conditions) to make assertions explicit and diagnostic.
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`:
- Around line 570-576: reconcileClusterRecovery is always patching HCP status
even when meta.SetStatusCondition didn't change the condition; capture the
boolean return from meta.SetStatusCondition and only call
r.cpClient.Status().Patch (using originalHCP and the same
MergeFromWithOptions/OptimisticLock) if that boolean is true to skip no-op
writes and avoid redundant optimistic-lock updates.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: d0b2014e-1faa-4d10-a400-26d60e220b42
📒 Files selected for processing (6)
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.gohypershift-operator/controllers/etcdbackup/reconciler.goignition-server/controllers/local_ignitionprovider.go
| if err := h.mgtClusterClient.Status().Patch(ctx, hcp, crclient.MergeFromWithOptions(originalHCP, crclient.MergeFromWithOptimisticLock{})); err != nil { | ||
| return reconcile.Result{}, fmt.Errorf("failed to patch hcp status: %w", err) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the file at the specified lines
cat -n control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go | sed -n '70,90p'Repository: openshift/hypershift
Length of output: 1230
🏁 Script executed:
# Let's also see more context around the patch operation
cat -n control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go | sed -n '60,100p'Repository: openshift/hypershift
Length of output: 2002
🏁 Script executed:
# Check if there are other patch operations in the file to understand the pattern
rg -n "Status\(\)\.Patch|MergeFromWithOptimisticLock" control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go -A 2 -B 2Repository: openshift/hypershift
Length of output: 450
🏁 Script executed:
# Search for how conflicts are handled elsewhere in the codebase
rg -n "409|Conflict|optimistic" control-plane-operator/ --type go -i | head -20Repository: openshift/hypershift
Length of output: 4005
🏁 Script executed:
# Check the controller-runtime and kubernetes patterns for conflict handling
rg -n "MergeFromWithOptimisticLock" . --type go | head -10Repository: openshift/hypershift
Length of output: 1802
🏁 Script executed:
# Check the vendored controller-runtime to understand conflict handling patterns
cat vendor/sigs.k8s.io/controller-runtime/pkg/client/patch.go | sed -n '50,80p'Repository: openshift/hypershift
Length of output: 1445
🏁 Script executed:
# Search for error handling patterns around optimistic locks
rg -n "apiutil|IsConflict|StatusError" control-plane-operator/ --type go | head -15Repository: openshift/hypershift
Length of output: 1952
🏁 Script executed:
# Check if there are any examples of conflict detection in the codebase
rg -n "apierrors.IsConflict|StatusError" . --type go | head -20Repository: openshift/hypershift
Length of output: 3023
🌐 Web query:
Kubernetes controller-runtime conflict 409 error handling optimistic locking best practices
💡 Result:
Handling 409 Conflict errors in Kubernetes is an essential aspect of managing optimistic concurrency [1][2]. When multiple clients (e.g., controllers, users, HPA) modify the same resource, the Kubernetes API server uses the resourceVersion field to ensure updates are based on the latest state [1][3]. If a conflict occurs, the API server rejects the update with a 409 Conflict status [3][4]. Best practices for handling these conflicts include: 1. Let the Controller Reconcile (Recommended for most cases) In standard controller-runtime development, the most idiomatic approach is to return the error from your Reconcile function [5][3]. This allows the controller-runtime's built-in work queue to catch the error, log it, and automatically re-enqueue the request with exponential backoff [6][7][8]. This avoids manual retry loops inside your reconciliation logic and keeps the controller simple and idempotent [9][10]. 2. Use RetryOnConflict (For specific update operations) If you are performing a complex update operation within a reconciliation loop that must succeed before continuing, you can use the k8s.io/client-go/util/retry helper [5][11][12]. This helper manages the re-fetching of the object to get the latest resourceVersion before attempting the update again [1][12][13]. Example: err:= retry.RetryOnConflict(retry.DefaultRetry, func error { // 1. Re-fetch the latest version of the object if err:= r.Get(ctx, req.NamespacedName, latestObj); err!= nil { return err } // 2. Apply modifications to the fresh object latestObj.Status.Phase = "Running" // 3. Attempt update return r.Status.Update(ctx, latestObj) }) 3. Consider Server-Side Apply For complex objects with many fields modified by multiple controllers, Server-Side Apply (SSA) is the preferred alternative to traditional Update calls [5]. SSA allows for field-level ownership, significantly reducing the frequency of 409 conflicts compared to full object updates [5][12]. Key Principles: - Avoid manual, infinite retry loops. - Do not manually manipulate or suppress the resourceVersion field to bypass concurrency checks [2]. - Separate spec updates from status updates; status subresources help minimize conflicts [2]. - Leverage the framework: Return errors to the reconciler to benefit from built-in rate-limiting and backoff [6][8].
Citations:
- 1: https://medium.com/@pankajhasija2109/mastering-kubernetes-api-concurrency-7cc3083ffbb9
- 2: https://oneuptime.com/blog/post/2026-02-09-optimistic-concurrency-resourceversion/view
- 3: https://github.com/gardener/gardener/blob/master/docs/development/kubernetes-clients.md
- 4: https://medium.com/@timebertt/kubernetes-controllers-at-scale-clients-caches-conflicts-patches-explained-aa0f7a8b4332
- 5: How to elegantly solve the update conflict problem kubernetes-sigs/controller-runtime#1748
- 6: https://dev.to/jamesli/client-go-deep-dive-workqueue-the-reliable-task-queue-for-kubernetes-controllers-3pjc
- 7: https://oneuptime.com/blog/post/2026-02-09-reconciliation-loops-exponential-backoff/view
- 8: https://stuartleeks.com/posts/error-back-off-with-controller-runtime/
- 9: Docs suggestion: Optimistic Concurrency kubernetes-sigs/kubebuilder#617
- 10: https://book.kubebuilder.io/cronjob-tutorial/controller-implementation.html
- 11: https://oneuptime.com/blog/post/2026-02-09-retry-logic-api-conflicts-controllers/view
- 12: https://platformwale.blog/2026/02/24/kubernetes-operators-a-deep-dive-into-the-internals/
- 13: https://medium.com/@vamshitejanizam/building-resilient-kubernetes-controllers-a-practical-guide-to-retry-mechanisms-0d689160fa51
🏁 Script executed:
# Check for controller-runtime configuration or custom backoff settings
rg -n "DefaultRetry|ExponentialBackoff|SetupWithManager" control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go -A 3Repository: openshift/hypershift
Length of output: 46
🏁 Script executed:
# Check if there's any custom error handling wrapper in the codebase
rg -n "apierrors.IsConflict" . --type go -B 3 -A 3 | head -40Repository: openshift/hypershift
Length of output: 3994
Handle optimistic-lock conflicts distinctly from other patch failures.
The current code treats 409 conflicts from optimistic locking the same as other patch errors, which can create misleading error metrics and logs. Conflicts are expected under contention and should be handled separately. Detect conflicts using apierrors.IsConflict(err) and return nil error to allow the framework's work queue to naturally requeue with exponential backoff, keeping other patch errors as genuine failures. See reference implementation in vendor/sigs.k8s.io/cluster-api/util/patch/patch.go for this pattern.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.go`
around lines 79 - 80, The patch call on h.mgtClusterClient.Status().Patch (using
crclient.MergeFromWithOptions and MergeFromWithOptimisticLock) must treat
optimistic-lock conflicts differently: after the Patch returns an error, detect
conflicts with apierrors.IsConflict(err) and, if true, return reconcile.Result{}
with a nil error so the controller runtime can requeue with backoff; for
non-conflict errors continue returning the wrapped fmt.Errorf("failed to patch
hcp status: %w", err). Locate the error handling directly after the Patch
invocation in the hcp status reconciliation code and insert the
apierrors.IsConflict check there.
| originalHCP := hcp.DeepCopy() | ||
| condition.ObservedGeneration = hcp.Generation | ||
| meta.SetStatusCondition(&hcp.Status.Conditions, condition) | ||
| return r.Status().Update(ctx, hcp) | ||
| return r.Status().Patch(ctx, hcp, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{})) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cd hypershift-operator/controllers/etcdbackup && wc -l reconciler.goRepository: openshift/hypershift
Length of output: 82
🏁 Script executed:
# First, let's look at the updateHCPBackupCondition function
cd hypershift-operator/controllers/etcdbackup && sed -n '295,310p' reconciler.goRepository: openshift/hypershift
Length of output: 968
🏁 Script executed:
# Now examine the callers at lines 193, 452, and 484
cd hypershift-operator/controllers/etcdbackup && sed -n '188,198p' reconciler.goRepository: openshift/hypershift
Length of output: 396
🏁 Script executed:
cd hypershift-operator/controllers/etcdbackup && sed -n '447,457p' reconciler.goRepository: openshift/hypershift
Length of output: 391
🏁 Script executed:
cd hypershift-operator/controllers/etcdbackup && sed -n '479,489p' reconciler.goRepository: openshift/hypershift
Length of output: 393
🏁 Script executed:
# Check imports to verify retry is available
cd hypershift-operator/controllers/etcdbackup && head -30 reconciler.goRepository: openshift/hypershift
Length of output: 950
Retry optimistic-lock conflicts instead of silently dropping the HCP condition update.
MergeFromWithOptimisticLock{} surfaces 409 Conflict errors when the HCP is concurrently modified, but all callers at lines 193, 452, and 484 only log the error and continue. This means conflicts can permanently stale EtcdBackupSucceeded conditions on all three terminal paths (in-progress, success, failure).
The adjacent function updateHostedClusterBackupURL already handles this correctly with retry.RetryOnConflict, reflecting the same concurrency concern. Wrap this function's patch in retry.RetryOnConflict with a fresh HCP read to safely retry on conflicts.
Proposed fix
func (r *HCPEtcdBackupReconciler) updateHCPBackupCondition(ctx context.Context, hcp *hyperv1.HostedControlPlane, condition metav1.Condition) error {
- originalHCP := hcp.DeepCopy()
- condition.ObservedGeneration = hcp.Generation
- meta.SetStatusCondition(&hcp.Status.Conditions, condition)
- return r.Status().Patch(ctx, hcp, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{}))
+ return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
+ current := &hyperv1.HostedControlPlane{}
+ if err := r.Get(ctx, client.ObjectKeyFromObject(hcp), current); err != nil {
+ return err
+ }
+
+ originalHCP := current.DeepCopy()
+ condition.ObservedGeneration = current.Generation
+ meta.SetStatusCondition(¤t.Status.Conditions, condition)
+ return r.Status().Patch(ctx, current, client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{}))
+ })
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@hypershift-operator/controllers/etcdbackup/reconciler.go` around lines 302 -
305, The patch that sets meta.SetStatusCondition and calls r.Status().Patch(hcp,
client.MergeFromWithOptions(originalHCP, client.MergeFromWithOptimisticLock{}))
should be wrapped in retry.RetryOnConflict so optimistic-lock 409s are retried:
on each retry re-get the latest HCP (fresh read into a new variable), reapply
condition.ObservedGeneration = hcp.Generation and
meta.SetStatusCondition(&hcp.Status.Conditions, condition) against the fresh
HCP, then call r.Status().Patch with client.MergeFromWithOptions(originalHCP,
client.MergeFromWithOptimisticLock{}) (or use a new merge-from base as
appropriate) until success or non-conflict error; ensure errors are returned if
retries exhaust.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #8562 +/- ##
==========================================
+ Coverage 40.34% 40.43% +0.08%
==========================================
Files 755 755
Lines 93167 93298 +131
==========================================
+ Hits 37587 37723 +136
+ Misses 52877 52871 -6
- Partials 2703 2704 +1
... and 3 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
58ac9ee to
10d5332
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ignition-server/controllers/local_ignitionprovider.go (1)
767-788:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse
retry.RetryOnConflictto prevent silent condition update loss on 409 conflicts.This function uses optimistic locking but doesn't retry on conflicts. Since
runMCOat lines 450-452 only logs the error, any 409 conflict will silently drop theIgnitionServerValidReleaseInfocondition update. Wrap the patch inretry.RetryOnConflict(retry.DefaultBackoff, func() error {...})with a fresh HCP read inside the closure, following the pattern used insupport/k8sutil.UpdateObjectand elsewhere in the codebase.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ignition-server/controllers/local_ignitionprovider.go` around lines 767 - 788, The patch of hostedControlPlane’s status can silently fail on 409 conflicts; wrap the r.Client.Status().Patch call in retry.RetryOnConflict(retry.DefaultBackoff, func() error { ... }) and inside the closure re-read the current HostedControlPlane (hostedControlPlane) from the API to compute/merge the IgnitionServerValidReleaseInfo condition (using releaseImageProvider.GetMissingImages() as before), then call r.Client.Status().Patch (or reuse support/k8sutil.UpdateObject pattern) returning any error so conflicts are retried; keep originalHCP only for the MergeFrom base inside the closure.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.go`:
- Around line 20-161: Add parallelization by calling t.Parallel() at the start
of the top-level TestMergeFromPatchDoesNotStompUnrelatedFields function and
inside each subtest's closure (the anonymous func passed to t.Run).
Specifically, insert t.Parallel() as the first statement in
TestMergeFromPatchDoesNotStompUnrelatedFields and as the first statement in each
subtest function (the func(t *testing.T) for the three t.Run blocks) so the test
and its subtests run with Go's t.Parallel.
---
Outside diff comments:
In `@ignition-server/controllers/local_ignitionprovider.go`:
- Around line 767-788: The patch of hostedControlPlane’s status can silently
fail on 409 conflicts; wrap the r.Client.Status().Patch call in
retry.RetryOnConflict(retry.DefaultBackoff, func() error { ... }) and inside the
closure re-read the current HostedControlPlane (hostedControlPlane) from the API
to compute/merge the IgnitionServerValidReleaseInfo condition (using
releaseImageProvider.GetMissingImages() as before), then call
r.Client.Status().Patch (or reuse support/k8sutil.UpdateObject pattern)
returning any error so conflicts are retried; keep originalHCP only for the
MergeFrom base inside the closure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 6b55879f-4645-4cc2-9269-fc6811eee015
📒 Files selected for processing (6)
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/hcpstatus/hcpstatus_test.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.gohypershift-operator/controllers/etcdbackup/reconciler.goignition-server/controllers/local_ignitionprovider.go
|
/test e2e-aws-autonode |
So right now we have multiple controllers writing HCP status, and several of them are doing full Update() calls instead of patches, which means that the whole object gets overwritten. This results in race conditions such that things like e.g. vcpus get wiped from the status object when something else writes the object. This just swaps all those Status().Update() with Patch + MergeFrom + oplocks so the controllers should get 409'd and retry instead of stomping on fields updated by other controllers. Signed-off-by: John Kyros <jkyros@redhat.com>
|
Now I have the complete picture. Let me compile the final report. Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryTwo independent failures — neither caused by PR #8562's code changes. 1. 2. Root Causee2e-aws-autonode — Pre-existing flake, unrelated to PR #8562: The CVO in the hosted cluster failed to complete its initial rollout. The failure sequence:
The PR changes the hcpstatus controller from codecov/patch — Coverage gap in modified files: The PR modified 6 files but added unit tests only for Recommendations
Evidence
|
So our conversion from Update() to Patch() was clever, but our patch calls can't write a nil, the field just get omitted, and we have fields where that nil is normal, expected, and necessary, so this tries to work around it. This is gross but if it works it at least we know we understand the problem. Signed-off-by: John Kyros <jkyros@redhat.com>
10d5332 to
9995cf5
Compare
|
/test e2e-aws-autonode |
|
@jkyros: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Ugh that passed. Great. There's something in here with Patch and fields that are |
What this PR does / why we need it:
Status().Update()calls with calls toPatchcalls that useclient.MergeFromWithOptions()along withclient.MergeFromWithOptimisticLock{}so that the full object updates don't wipe other controllers' status updateWhich issue(s) this PR fixes:
Fixes OCPBUGS-86306
Special notes for your reviewer:
Checklist:
Summary by CodeRabbit
Refactor
Tests