Reconcile `Set` types until all patching/creation is completed by nagadeesh-nagaraja · Pull Request #676 · ironcore-dev/metal-operator

nagadeesh-nagaraja · 2026-02-13T13:18:05Z

Summary by CodeRabbit

New Features
- Configurable resynchronization interval for reconciliations.
- Explicit pending-patch tracking to drive requeue timing.
Bug Fixes
- Better handling of referenced resources to avoid unnecessary recreation.
- Requeue behavior refined to wait when resources are not fully available or patching is in progress.
Tests
- Added comprehensive end-to-end reconciliation scenarios covering update, pause/resume, deletion, and recovery.

Fixes #620

coderabbitai · 2026-02-13T13:22:35Z

Walkthrough

Adds a ResyncInterval field to BIOS/BMC Settings/Version reconcilers, updates patch*FromTemplate functions to return (bool, error) indicating pending patching, propagates pending state into reconcile to RequeueAfter using ResyncInterval, and adds/updates tests exercising wait-and-reconcile behavior.

Changes

Cohort / File(s)	Summary
Reconciler initialization `cmd/main.go`, `internal/controller/suite_test.go`	Set `ResyncInterval` on BIOS/BMC Settings/Version reconcilers (`maintenanceResyncInterval` in cmd, `10ms` in tests).
BIOS Settings controller `internal/controller/biossettingsset_controller.go`	Add `ResyncInterval time.Duration`; change `patchBIOSSettingsFromTemplate` → `(bool, error)` to indicate pending patching; propagate pending flag to requeue logic; refine create-missing logic to respect referenced BIOSSettings and handle fetch errors.
BIOS Settings tests `internal/controller/biossettingsset_controller_test.go`	Add comprehensive wait/reconcile test ("Should successfully wait and reconcile until all resources are updated") (appears duplicated).
BIOS Version controller `internal/controller/biosversionset_controller.go`	Add `ResyncInterval`; change `patchBIOSVersionFromTemplate` → `(bool, error)` and requeue when labeling incomplete or patching pending.
BMC Settings controller `internal/controller/bmcsettingsset_controller.go`	Add `ResyncInterval` and imports; change `patchBMCSettingsFromTemplate` → `(bool, error)` with pending-patching tracking; check referenced BMCSettings before creating new; propagate pending state to requeue; watcher predicates updated.
BMC Settings tests `internal/controller/bmcsettingsset_controller_test.go`	Add comprehensive wait/reconcile test (duplicate block present) exercising multi-set interactions, pause/resume, deletions, and final validation.
BMC Version controller `internal/controller/bmcversionset_controller.go`	Add `ResyncInterval`; change `patchBMCVersionfromTemplate` → `(bool, error)` with pending-patching detection; requeue when labeling incomplete or patching pending.
Helpers `internal/controller/helper.go`	Add `labelChangeOrAnyFieldChangeInObject` helper to detect label or field changes; adds `maps` import.

Sequence Diagram(s)

sequenceDiagram
    participant Reconciler
    participant K8sAPI as "Kubernetes API"
    participant ChildCR as "Child CRs (BIOS/BMC)"
    participant Timer as "Resync Timer"

    Reconciler->>K8sAPI: Get Set resource & templates
    Reconciler->>K8sAPI: List/Fetch Child CRs
    Reconciler->>ChildCR: Create or Patch per template
    ChildCR-->>Reconciler: Report state (Available | InProgress | Failed)
    alt any InProgress or labeling incomplete
        Reconciler->>Timer: RequeueAfter(ResyncInterval)
    else all Available & labeled
        Reconciler->>K8sAPI: Update Set status -> Completed
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Add BMCSettingsSet type and controller #471 — Overlaps changes to BMCSettingsSet reconciler, ResyncInterval and patch-signature updates.
Refactor BIOSVersionSetReconciler #572 — Modifies BIOS version reconciler logic and patch/patching flow related to this PR.
Kubebuilder update from v4.11.1 to v4.12.0 #686 — Addresses setting ResyncInterval on reconciler initializations similar to this change.

Suggested labels

bug

Suggested reviewers

afritzler
Nuckal777

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description only contains 'Fixes `#620`' without any detail about the changes made, implementation approach, or what was modified.	Expand the description to explain the solution: added ResyncInterval field to reconcilers for periodic polling, updated patching logic to track pending state, and enhanced creation logic for referenced child CRDs.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Out of Scope Changes check	❓ Inconclusive	The PR adds a new helper function labelChangeOrAnyFieldChangeInObject in helper.go that appears related to watch predicates but lacks clear documentation of its intended use in the codebase.	Clarify the purpose of labelChangeOrAnyFieldChangeInObject and its integration with the Set reconciliation flow; confirm whether it's currently used or intended for future watch predicate enhancements.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: implementing periodic reconciliation of Set resource types until patching/creation completes, which directly addresses the linked issue.
Linked Issues check	✅ Passed	The PR successfully addresses issue `#620` by implementing ResyncInterval polling as a safety mechanism: adds pending-patching tracking to ensure continued reconciliation when child CRDs are in-progress, updates patching function signatures to return pending state, and enhances creation logic to handle referenced child CRD edge cases where watches don't trigger.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch setCRD

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

internal/controller/suite_test.go (1)
250-297: ⚠️ Potential issue | 🟠 Major

Missing ResyncInterval for BMCSettingsSetReconciler, BMCVersionSetReconciler, and BIOSVersionSetReconciler in test setup.

ResyncInterval is set only for BIOSSettingsSetReconciler (Line 296) but not for the other three Set reconcilers (Lines 250–253, 283–286, 288–291). Since all four controllers now use RequeueAfter: r.ResyncInterval, a zero-value time.Duration will cause immediate requeue in tests—effectively a hot loop. This is especially relevant for BMCSettingsSetReconciler, which has a new test exercising the pending-patching requeue path.
Proposed fix
 		Expect((&BIOSVersionSetReconciler{
-			Client: k8sManager.GetClient(),
-			Scheme: k8sManager.GetScheme(),
+			Client:         k8sManager.GetClient(),
+			Scheme:         k8sManager.GetScheme(),
+			ResyncInterval: 10 * time.Millisecond,
 		}).SetupWithManager(k8sManager)).To(Succeed())

 		// ...

 		Expect((&BMCSettingsSetReconciler{
-			Client: k8sManager.GetClient(),
-			Scheme: k8sManager.GetScheme(),
+			Client:         k8sManager.GetClient(),
+			Scheme:         k8sManager.GetScheme(),
+			ResyncInterval: 10 * time.Millisecond,
 		}).SetupWithManager(k8sManager)).To(Succeed())

 		Expect((&BMCVersionSetReconciler{
-			Client: k8sManager.GetClient(),
-			Scheme: k8sManager.GetScheme(),
+			Client:         k8sManager.GetClient(),
+			Scheme:         k8sManager.GetScheme(),
+			ResyncInterval: 10 * time.Millisecond,
 		}).SetupWithManager(k8sManager)).To(Succeed())
Based on learnings: "Prevent hot loops caused by status-only changes triggering reconciliation."

🤖 Fix all issues with AI agents

In `@internal/controller/biossettingsset_controller.go`:
- Around line 222-231: The current r.Get call in the BIOSSettings reconcile
block (checking server.Spec.BIOSSettingsRef) conflates three cases; change the
logic to explicitly handle them: if err == nil then the referenced BIOSSettings
exists and you should log/continue (keep the existing log.V(1).Info and
continue), if apierrors.IsNotFound(err) then proceed to create a new
BIOSSettings for this Server (as intended), and for any other non-nil err (e.g.,
network/RBAC) log the error via log.Error including context (Server and
BIOSSettings name) and return/requeue the error so reconciliation doesn’t
silently skip creation; use the same symbols r.Get, server.Spec.BIOSSettingsRef,
apierrors.IsNotFound, and log.Error to locate and update the code.

In `@internal/controller/bmcsettingsset_controller.go`:
- Around line 252-259: The Get call for BMCSettings currently conflates err==nil
and non-NotFound errors; update the r.Get(...) handling in the reconcile loop to
check three cases explicitly: if apierrors.IsNotFound(err) then proceed to
create a new metalv1alpha1.BMCSettings; else if err != nil then log the error
with context (include bmc.Name and bmc.Spec.BMCSettingRef.Name) and
return/requeue the reconcile (do not silently continue); else (err == nil) log
that the referenced BMCSettings exists and continue. Also fix the comment that
mentions "BIOSSettings" to correctly reference "BMCSettings".

internal/controller/biossettingsset_controller.go

internal/controller/bmcsettingsset_controller.go

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

internal/controller/biossettingsset_controller.go (1)
300-306: Missing log message when skipping an InProgress BIOSSettings.

bmcsettingsset_controller.go logs "Skipping BMCSettings spec patching as it is in InProgress state" in the equivalent path; biossettingsset_controller.go is silent. This makes it harder to trace why certain BIOSSettings objects are not receiving spec updates during a reconcile.
💬 Suggested addition
 		if settings.Status.State == metalv1alpha1.BIOSSettingsStateInProgress {
+			log.V(1).Info("Skipping BIOSSettings spec patching as it is in InProgress state", "BIOSSettings", settings.Name)
 			pendingPatchingSettings = true
 			continue
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/biossettingsset_controller.go` around lines 300 - 306,
The loop over settingsList.Items skips items whose Status.State ==
metalv1alpha1.BIOSSettingsStateInProgress but currently emits no log; add a log
call (matching the BMCSettings message) inside the conditional before continue
so the controller records skipping of that BIOSSettings object (include
identifying info like settings.Name/Namespace and a message such as "Skipping
BIOSSettings spec patching as it is in InProgress state"); update references in
this block that use pendingPatchingSettings and settingsList.Items to ensure the
logging is executed for each skipped item.
internal/controller/biosversionset_controller.go (1)
261-267: Missing log message when skipping an InProgress BIOSVersion.

Consistent with bmcversionset_controller.go and bmcsettingsset_controller.go, add a log.V(1).Info to make it visible in controller logs when a BIOSVersion is skipped due to its InProgress state.
💬 Suggested addition
 		if version.Status.State == metalv1alpha1.BIOSVersionStateInProgress {
+			log.V(1).Info("Skipping BIOSVersion spec patching as it is in InProgress state", "BIOSVersion", version.Name)
 			pendingPatchingVersion = true
 			continue
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/biosversionset_controller.go` around lines 261 - 267, The
loop over versions.Items skips items whose status equals
metalv1alpha1.BIOSVersionStateInProgress but currently emits no log; add a
log.V(1).Info call just before setting pendingPatchingVersion=true and continue
(use the controller logger variable used in this file, e.g., log or r.Log) to
record a message like "skipping BIOSVersion in progress" including identifying
keys such as version.Name and version.Namespace and the state
(metalv1alpha1.BIOSVersionStateInProgress) so skipped BIOSVersion objects are
visible in controller logs.
internal/controller/bmcversionset_controller.go (1)
286-289: Document the intentional UpgradeTask != nil guard to prevent future confusion.

The skip condition here is stricter than in the other three controllers (BIOSVersionSet, BMCSettingsSet, BIOSSettingsSet), which skip patching for any InProgress state. The extra bmcVersion.Status.UpgradeTask != nil check means a BMCVersion that is InProgress but has no active task will be patched. If this is intentional (e.g., UpgradeTask being nil means the actual upgrade workload hasn't started yet), a brief comment would prevent future contributors from removing the guard as "dead code".
💬 Suggested documentation comment
-		if bmcVersion.Status.State == metalv1alpha1.BMCVersionStateInProgress && bmcVersion.Status.UpgradeTask != nil {
+		// Only skip patching when an actual upgrade task is running (UpgradeTask != nil).
+		// A BMCVersion that is InProgress but has no UpgradeTask yet is safe to patch.
+		if bmcVersion.Status.State == metalv1alpha1.BMCVersionStateInProgress && bmcVersion.Status.UpgradeTask != nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcversionset_controller.go` around lines 286 - 289, The
if that skips patching only when bmcVersion.Status.State ==
BMCVersionStateInProgress AND bmcVersion.Status.UpgradeTask != nil is
intentionally stricter than other controllers; add a concise comment above this
conditional explaining that an InProgress BMCVersion with a nil UpgradeTask
should still be patched because a nil UpgradeTask indicates the upgrade workload
hasn't started (so patching is safe), and note that the extra guard prevents
skipping when there is an active UpgradeTask; reference
bmcVersion.Status.UpgradeTask, BMCVersionStateInProgress, and
pendingPatchingVersion in the comment so future contributors understand why the
nil check must remain.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/bmcsettingsset_controller.go`:
- Around line 251-264: Change the NotFound logging and fix the stale comment in
the BMCSettings lookup block: when r.Get returns an error and
apierrors.IsNotFound(err) is true, replace the log.Error call with log.V(1).Info
to reflect the expected recovery path (include context keys "BMC", bmc.Name and
"BMCSettings", bmc.Spec.BMCSettingRef.Name); for the branch where err == nil
(the else after r.Get), update the comment to accurately state that the
referenced BMCSettings exists (remove "or unable to determining" language) and
keep the existing log.V(1).Info call that notes the BMC already has a
BMCSettings ref.

---

Duplicate comments:
In `@internal/controller/biossettingsset_controller.go`:
- Around line 231-234: The non-NotFound error handling in the reconciliation
loop (where log.Error(err, "error when trying to get BIOSSettings referenced by
Server", "Server", server.Name, "BIOSSettings",
server.Spec.BIOSSettingsRef.Name) is called) currently just logs and continues;
instead append that error to the errs slice (the same error-collection used by
this controller) so callers receive a retry signal—i.e., when the Get of the
BIOSSettings reference fails with a non-NotFound error, add errs = append(errs,
err) (or wrap with context) before continuing. Ensure you update the same errs
used elsewhere in this reconciler so rate-limited requeueing occurs.
- Around line 227-240: Change the NotFound path in the BIOSSettings lookup to
stop logging a real error: when r.Get(...) returns apierrors.IsNotFound(err)
replace the log.Error call with log.V(1).Info (preserving the message/context
and the fact that you'll create a new BIOSSettings), and update the comment on
the `else` branch (the branch where err == nil after `r.Get`) to remove the
stale phrase "or unable to determining" and instead state clearly that the
referenced BIOSSettings exists (e.g., "the referenced BIOSSettings exists, so
skip creating a new one"); use the existing symbols r.Get,
client.ObjectKey{Name: server.Spec.BIOSSettingsRef.Name},
metalv1alpha1.BIOSSettings, log.Error -> log.V(1).Info, and
server.Name/server.Spec.BIOSSettingsRef.Name to locate and change the code and
comment.

In `@internal/controller/biosversionset_controller.go`:
- Around line 184-187: The requeue uses r.ResyncInterval which can be zero;
update the conditional branch in the reconcile path that returns
ctrl.Result{RequeueAfter: r.ResyncInterval} (the block comparing
status.FullyLabeledServers, status.AvailableBIOSVersion and
pendingPatchingVersion) to guard against a 0 interval by substituting a safe
default (e.g., a small duration constant) when r.ResyncInterval == 0 before
returning the ctrl.Result so the controller actually requeues as intended.

In `@internal/controller/bmcsettingsset_controller.go`:
- Around line 255-258: The code currently logs non-NotFound Get errors
(log.Error(err, ...)) and then uses continue, which swallows transient API
errors instead of propagating them; update the block handling the Get of the
referenced BMCSettings (the section using log.Error, "Server", bmc.Name,
"BMCSettings", bmc.Spec.BMCSettingRef.Name and the continue) to append the error
to the errs slice (errs = append(errs, err)) before continuing so the caller
receives the error and controller-runtime can trigger a rate-limited retry;
ensure you only append for actual Get errors (not NotFound) and keep the
existing log statement.

---

Nitpick comments:
In `@internal/controller/biossettingsset_controller.go`:
- Around line 300-306: The loop over settingsList.Items skips items whose
Status.State == metalv1alpha1.BIOSSettingsStateInProgress but currently emits no
log; add a log call (matching the BMCSettings message) inside the conditional
before continue so the controller records skipping of that BIOSSettings object
(include identifying info like settings.Name/Namespace and a message such as
"Skipping BIOSSettings spec patching as it is in InProgress state"); update
references in this block that use pendingPatchingSettings and settingsList.Items
to ensure the logging is executed for each skipped item.

In `@internal/controller/biosversionset_controller.go`:
- Around line 261-267: The loop over versions.Items skips items whose status
equals metalv1alpha1.BIOSVersionStateInProgress but currently emits no log; add
a log.V(1).Info call just before setting pendingPatchingVersion=true and
continue (use the controller logger variable used in this file, e.g., log or
r.Log) to record a message like "skipping BIOSVersion in progress" including
identifying keys such as version.Name and version.Namespace and the state
(metalv1alpha1.BIOSVersionStateInProgress) so skipped BIOSVersion objects are
visible in controller logs.

In `@internal/controller/bmcversionset_controller.go`:
- Around line 286-289: The if that skips patching only when
bmcVersion.Status.State == BMCVersionStateInProgress AND
bmcVersion.Status.UpgradeTask != nil is intentionally stricter than other
controllers; add a concise comment above this conditional explaining that an
InProgress BMCVersion with a nil UpgradeTask should still be patched because a
nil UpgradeTask indicates the upgrade workload hasn't started (so patching is
safe), and note that the extra guard prevents skipping when there is an active
UpgradeTask; reference bmcVersion.Status.UpgradeTask, BMCVersionStateInProgress,
and pendingPatchingVersion in the comment so future contributors understand why
the nil check must remain.

internal/controller/bmcsettingsset_controller.go

internal/controller/biossettingsset_controller_test.go

afritzler · 2026-03-02T10:00:54Z

@coderabbitai instead of using a resync interval wouldn't it be better to have a proper watch setup on the Set resource to be re-triggered when a change occurs?

coderabbitai · 2026-03-02T10:02:45Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Find the SetupWithManager methods for the Set reconcilers to understand current watch setup
echo "=== BIOSSettingsSetReconciler SetupWithManager ==="
rg -A 30 "func \(r \*BIOSSettingsSetReconciler\) SetupWithManager" --type go

echo -e "\n=== BIOSVersionSetReconciler SetupWithManager ==="
rg -A 30 "func \(r \*BIOSVersionSetReconciler\) SetupWithManager" --type go

echo -e "\n=== BMCSettingsSetReconciler SetupWithManager ==="
rg -A 30 "func \(r \*BMCSettingsSetReconciler\) SetupWithManager" --type go

echo -e "\n=== BMCVersionSetReconciler SetupWithManager ==="
rg -A 30 "func \(r \*BMCVersionSetReconciler\) SetupWithManager" --type go

Length of output: 5054

🏁 Script executed:

#!/bin/bash

# Find the enqueueFromChildObjUpdatesExceptAnnotation function
echo "=== enqueueFromChildObjUpdatesExceptAnnotation function ==="
rg -A 20 "func enqueueFromChildObjUpdatesExceptAnnotation" --type go

# Check if there are watches on child resource status changes
echo -e "\n=== Looking for status update handling in child watches ==="
rg -B 5 -A 15 "UpdateFunc.*func\(e event.UpdateEvent\)" internal/controller/biossettingsset_controller.go internal/controller/bmcsettingsset_controller.go

# Check the Owns() implementation to understand what it watches
echo -e "\n=== Looking at how child resources trigger reconciliation ==="
rg -B 3 -A 10 "Owns\(" --type go internal/controller/

Length of output: 11216

🏁 Script executed:

#!/bin/bash

# Look at the patching logic to understand when children are skipped
echo "=== BIOSSettings patching logic - InProgress handling ==="
rg -B 5 -A 10 "InProgress" internal/controller/biossettingsset_controller.go | head -60

echo -e "\n=== BMCVersion patching logic - InProgress handling ==="
rg -B 5 -A 10 "InProgress" internal/controller/bmcversionset_controller.go | head -60

# Check the requeue logic
echo -e "\n=== Requeue logic with ResyncInterval ==="
rg -B 3 -A 5 "RequeueAfter.*ResyncInterval" internal/controller/

Length of output: 14950

@afritzler Great question! You're absolutely right to question the polling approach. Let me explain what I found:

Current Architecture

The Set reconcilers already have watches in place:

For() watches the Set resource itself
Owns() watches child resources (BIOSSettings, BMCSettings, etc.) and triggers Set reconciliation on any child changes
Additional watches on Server/BMC resources for label changes

Why ResyncInterval is Currently Used

Looking at the patching logic, when a Set spec is updated while child CRDs are InProgress:

The controller intentionally skips patching those InProgress children (by design)
It sets a pendingPatching flag and returns RequeueAfter: ResyncInterval
The expectation is that it will come back later to patch them once they're done

The Key Question

Shouldn't the existing Owns() watch handle this?

Yes, theoretically it should! When a child resource's status changes from InProgress → Completed/Failed, the Owns() watch should trigger the Set reconciliation automatically. At that point, the controller would:

See the child is no longer InProgress
Patch it with the updated Set spec
No polling needed

Potential Issues

The ResyncInterval appears to be a defensive backup mechanism. This might be addressing:

Watch reliability concerns - ensuring reconciliation happens even if watch events are missed
Status-only updates - in some Kubernetes setups, status-only updates might not always trigger watches as expected
Race conditions - multiple children transitioning states simultaneously

Recommendation

I'd suggest investigating whether the Owns() watch is truly insufficient before relying on polling. You could:

Test watch behavior: Verify that child status changes properly trigger Set reconciliation via the existing watch
Improve watch predicates: If needed, add explicit predicates on the child watches (similar to BMCVersionSet's custom predicates) to ensure status changes trigger reconciliation
Remove ResyncInterval if watches work: If watches handle all cases reliably, the polling becomes unnecessary overhead

The watch-based approach you're suggesting would be cleaner, more event-driven, and more resource-efficient than polling. Would you like me to help investigate whether the existing watches are sufficient, or if specific watch predicates need to be added?

nagadeesh-nagaraja · 2026-03-02T15:25:37Z

@coderabbitai the polling to wait until all expected task is completed is not only about the template patching. It is also used for creation and deletion.
Assume the BIOSSettingsSet (ABC) is present and matches with 1 server and creation of BIOSSettings object failed because of the any error (for example: server already having (wrong) BIOSSettings object (not owned by this set) reference) the creation of BIOSSettings is skipped and reconcile stops.

later when the (wrong) BIOSSettings object is deleted or reference removed from server, the BIOSSettingsSet (ABC) does not reconcile as its not its own object, and the Server label did not change.

Which is better? reconcile for ever Server CRD changes ( which can be constant and unneccessary due to constant update of server object as it reconcile every 2 min. or reconcile until BIOSSettingsSet is completed?

coderabbitai · 2026-03-02T15:26:29Z

Tip

For best results, initiate chat on the files or code changes.

@nagadeesh-nagaraja Thank you for the excellent clarification! You've identified a real gap in the watch-based approach.

The Creation/Deletion Gap You Described

Your scenario is spot-on:

BIOSSettingsSet matches a Server
Creation fails because Server already has a BIOSSettingsRef to a "wrong" BIOSSettings (not owned by the Set)
When that wrong BIOSSettings is later deleted or the reference is removed:
- Owns() watch won't trigger (not owned by the Set)
- Server label watch won't trigger (labels unchanged)
- Set reconciliation never happens → the Set remains incomplete indefinitely

Trade-offs Analysis

Option A: Watch all Server changes

✅ Event-driven, catches the BIOSSettingsRef removal immediately
❌ Constant unnecessary reconciliations (every 2 min as Servers reconcile)
❌ High overhead for all Sets to react to mostly-irrelevant Server updates

Option B: ResyncInterval polling (your approach)

✅ Avoids constant reconciliation noise from irrelevant Server updates
✅ Eventually reaches consistency once the blocking condition is resolved
❌ Delayed response (up to ResyncInterval duration)
✅ Simpler to reason about - reconciles until complete

Option C: Smarter watch predicates
You could add a watch on Servers with predicates that only trigger when:

Server.Spec.BIOSSettingsRef changes (including deletion)
Server.Spec.BMCSettingsRef changes
Relevant labels change

This would give you event-driven behavior without the noise.

Recommendation

Given your constraints (Servers update every 2 min), I believe your ResyncInterval approach is reasonable because:

Watching all Server changes would be too noisy - every Set would reconcile every 2 min for every Server update, even when nothing relevant changed
Field-specific watch predicates could work but add complexity and might miss edge cases
Your approach is eventually consistent - the Set keeps reconciling until all resources are created/patched/ready
The delay is acceptable for these operations (BIOS/BMC settings aren't real-time critical)

One Enhancement Suggestion

You could consider combining both approaches:

Keep the ResyncInterval as a safety net
Add a watch with a predicate on Server's BIOSSettingsRef/BMCSettingsRef field changes

Watches(
    &metalv1alpha1.Server{},
    handler.EnqueueRequestsFromMapFunc(r.findSetsForServer),
    builder.WithPredicates(predicate.Funcs{
        UpdateFunc: func(e event.UpdateEvent) bool {
            oldServer := e.ObjectOld.(*metalv1alpha1.Server)
            newServer := e.ObjectNew.(*metalv1alpha1.Server)
            // Only trigger if BIOSSettingsRef changed
            return oldServer.Spec.BIOSSettingsRef != newServer.Spec.BIOSSettingsRef
        },
    }),
)

This would give you fast event-driven updates when refs change while avoiding the constant reconciliation noise. But if you prefer to keep it simple with just ResyncInterval, that's a valid choice given your requirements.

Does this align with your thinking?

🧠 Learnings used

Learnt from: CR
Repo: ironcore-dev/metal-operator PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-17T13:12:50.124Z
Learning: Applies to internal/controller/**/*_controller.go : Watch secondary resources using `.Owns()` or `.Watches()` instead of relying solely on `RequeueAfter`

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

internal/controller/bmcsettingsset_controller.go (1)

252-260: ⚠️ Potential issue | 🟠 Major

Do not swallow non-NotFound lookup failures in createMissingBMCSettings.

The non-NotFound branch logs and continues, so reconciliation can return success while read failures are masked. This weakens retry/backoff semantics and observability. Also, Line 257 logs "Server" for a BMC object.

Suggested fix

 			if bmc.Spec.BMCSettingRef != nil {
 				if err := r.Get(ctx, client.ObjectKey{Name: bmc.Spec.BMCSettingRef.Name}, &metalv1alpha1.BMCSettings{}); err != nil {
 					if apierrors.IsNotFound(err) {
 						log.V(1).Info("BMCSettings referenced by BMC not found, will create a new one", "BMC", bmc.Name, "BMCSettings", bmc.Spec.BMCSettingRef.Name)
 						// proceed to create a new BMCSettings; the ref will be updated when it is created
 					} else {
-						log.Error(err, "error when trying to get BMCSettings referenced by Server", "Server", bmc.Name, "BMCSettings", bmc.Spec.BMCSettingRef.Name)
-						// we will try this again in next reconciliation loop
+						log.Error(err, "Failed to get BMCSettings referenced by BMC", "BMC", bmc.Name, "BMCSettings", bmc.Spec.BMCSettingRef.Name)
+						errs = append(errs, fmt.Errorf("failed to get BMCSettings %s referenced by BMC %s: %w", bmc.Spec.BMCSettingRef.Name, bmc.Name, err))
 						continue
 					}
 				} else {

As per coding guidelines: "Follow Kubernetes logging message style guidelines - start with capital letter, use active voice, use past tense for errors, and specify object types."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmcsettingsset_controller.go` around lines 252 - 260, The
code in createMissingBMCSettings swallows non-NotFound errors from r.Get by
logging and doing continue which masks transient read failures and mislabels the
object as "Server"; change the non-NotFound branch to return the error (or
requeue) so the controller retries with backoff instead of silently proceeding,
and update the log message in the error branch to use the correct object type
"BMC" and follow Kubernetes logging style (capitalize, active voice, past tense)
when calling log.Error for the r.Get failure; locate the failing call via
r.Get(...) and the log site that currently logs "Server" and replace the
continue with a propagated error return.

internal/controller/biossettingsset_controller.go (1)

228-236: ⚠️ Potential issue | 🟠 Major

Propagate non-NotFound lookup failures instead of silently skipping creation.

Transient/API/RBAC errors are currently logged and skipped, which hides reconciliation failure and delays recovery behind polling.

✅ Proposed fix

 				if err := r.Get(ctx, client.ObjectKey{Name: server.Spec.BIOSSettingsRef.Name}, &metalv1alpha1.BIOSSettings{}); err != nil {
 					if apierrors.IsNotFound(err) {
 						log.Error(err, "failed to get BIOSSettings referenced by Server", "Server", server.Name, "BIOSSettings", server.Spec.BIOSSettingsRef.Name)
 						// we will go ahead and create a new BIOSSettings for this server. the ref will be updated when the new BIOSSettings is created
 					} else {
-						log.Error(err, "error when trying to get BIOSSettings referenced by Server", "Server", server.Name, "BIOSSettings", server.Spec.BIOSSettingsRef.Name)
-						// we will try this again in next reconciliation loop
+						errs = append(errs, fmt.Errorf("failed to get BIOSSettings %s referenced by Server %s: %w", server.Spec.BIOSSettingsRef.Name, server.Name, err))
 						continue
 					}
 				} else {

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/controller/biossettingsset_controller.go` around lines 228 - 236,
The reconciliation currently swallows non-NotFound errors from r.Get when
looking up the BIOSSettingsRef (inside the loop around r.Get and handling Server
and metalv1alpha1.BIOSSettings), logging them and using continue; instead return
or requeue the error so controller machinery retries immediately. Change the
else branch that handles errors from r.Get(...) (where apierrors.IsNotFound(err)
is false) to return ctrl.Result{}, err (or wrap with apierrors.NewInternalError)
so the Reconcile function propagates the failure instead of silently skipping
creation for Server.Spec.BIOSSettingsRef.Name.

🧹 Nitpick comments (1)

internal/controller/biossettingsset_controller.go (1)
35-36: Guard the requeue fallback against an unset interval.

If ResyncInterval is 0, the fallback requeue path is effectively disabled. Adding a default/validation here would make the safety-net behavior robust.
🔁 Suggested safeguard
-	if status.FullyLabeledServers != status.AvailableBIOSSettings || pendingPatchingSettings {
+	if status.FullyLabeledServers != status.AvailableBIOSSettings || pendingPatchingSettings {
+		requeueAfter := r.ResyncInterval
+		if requeueAfter <= 0 {
+			requeueAfter = 30 * time.Second
+		}
 		log.V(1).Info("Waiting for all BIOSSettings to be created/Patched for the labeled Servers", "Status", status)
-		return ctrl.Result{RequeueAfter: r.ResyncInterval}, nil
+		return ctrl.Result{RequeueAfter: requeueAfter}, nil
 	}
Also applies to: 209-212

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/biossettingsset_controller_test.go`:
- Around line 524-530: The delete assertions for biosSettingsSet1,
biosSettings02, and biosSettings03 can flake if resources were already removed;
update the cleanup to treat NotFound as success by capturing the error from
k8sClient.Delete and asserting that err == nil || apierrors.IsNotFound(err)
(instead of Expect(...).To(Succeed())), and keep the subsequent
Eventually(Get(...)).Should(Satisfy(apierrors.IsNotFound)) checks as-is; apply
the same change for the similar block referencing the same resources around the
602-607 area.

---

Duplicate comments:
In `@internal/controller/biossettingsset_controller.go`:
- Around line 228-236: The reconciliation currently swallows non-NotFound errors
from r.Get when looking up the BIOSSettingsRef (inside the loop around r.Get and
handling Server and metalv1alpha1.BIOSSettings), logging them and using
continue; instead return or requeue the error so controller machinery retries
immediately. Change the else branch that handles errors from r.Get(...) (where
apierrors.IsNotFound(err) is false) to return ctrl.Result{}, err (or wrap with
apierrors.NewInternalError) so the Reconcile function propagates the failure
instead of silently skipping creation for Server.Spec.BIOSSettingsRef.Name.

In `@internal/controller/bmcsettingsset_controller.go`:
- Around line 252-260: The code in createMissingBMCSettings swallows
non-NotFound errors from r.Get by logging and doing continue which masks
transient read failures and mislabels the object as "Server"; change the
non-NotFound branch to return the error (or requeue) so the controller retries
with backoff instead of silently proceeding, and update the log message in the
error branch to use the correct object type "BMC" and follow Kubernetes logging
style (capitalize, active voice, past tense) when calling log.Error for the
r.Get failure; locate the failing call via r.Get(...) and the log site that
currently logs "Server" and replace the continue with a propagated error return.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 73990a3 and 2f940f1.

📒 Files selected for processing (5)

internal/controller/biossettingsset_controller.go
internal/controller/biossettingsset_controller_test.go
internal/controller/bmcsettingsset_controller.go
internal/controller/bmcsettingsset_controller_test.go
internal/controller/helper.go

🚧 Files skipped from review as they are similar to previous changes (1)

internal/controller/bmcsettingsset_controller_test.go

coderabbitai · 2026-03-02T16:31:32Z

internal/controller/biossettingsset_controller_test.go

+		By("Deleting the 1st BIOSSettingsSet")
+		Expect(k8sClient.Delete(ctx, biosSettingsSet1)).To(Succeed())
+		Eventually(Get(biosSettingsSet1)).Should(Satisfy(apierrors.IsNotFound))
+		Expect(k8sClient.Delete(ctx, biosSettings02)).To(Succeed())
+		Eventually(Get(biosSettings02)).Should(Satisfy(apierrors.IsNotFound))
+		Expect(k8sClient.Delete(ctx, biosSettings03)).To(Succeed())
+		Eventually(Get(biosSettings03)).Should(Satisfy(apierrors.IsNotFound))


⚠️ Potential issue | 🟡 Minor

Make cleanup deletions tolerant to already-removed resources.

These Delete assertions can flake when dependents are removed before the explicit delete call. Consider ignoring NotFound in cleanup paths.

🔧 Suggested hardening

+import "sigs.k8s.io/controller-runtime/pkg/client"

- Expect(k8sClient.Delete(ctx, biosSettings02)).To(Succeed()) + Expect(client.IgnoreNotFound(k8sClient.Delete(ctx, biosSettings02))).To(Succeed()) Eventually(Get(biosSettings02)).Should(Satisfy(apierrors.IsNotFound)) - Expect(k8sClient.Delete(ctx, biosSettings03)).To(Succeed()) + Expect(client.IgnoreNotFound(k8sClient.Delete(ctx, biosSettings03))).To(Succeed()) Eventually(Get(biosSettings03)).Should(Satisfy(apierrors.IsNotFound)) ... - Expect(k8sClient.Delete(ctx, biosSettings02_02)).To(Succeed()) + Expect(client.IgnoreNotFound(k8sClient.Delete(ctx, biosSettings02_02))).To(Succeed()) Eventually(Get(biosSettings02_02)).Should(Satisfy(apierrors.IsNotFound)) - Expect(k8sClient.Delete(ctx, biosSettings03_02)).To(Succeed()) + Expect(client.IgnoreNotFound(k8sClient.Delete(ctx, biosSettings03_02))).To(Succeed()) Eventually(Get(biosSettings03_02)).Should(Satisfy(apierrors.IsNotFound))

Also applies to: 602-607

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/controller/biossettingsset_controller_test.go` around lines 524 - 530, The delete assertions for biosSettingsSet1, biosSettings02, and biosSettings03 can flake if resources were already removed; update the cleanup to treat NotFound as success by capturing the error from k8sClient.Delete and asserting that err == nil || apierrors.IsNotFound(err) (instead of Expect(...).To(Succeed())), and keep the subsequent Eventually(Get(...)).Should(Satisfy(apierrors.IsNotFound)) checks as-is; apply the same change for the similar block referencing the same resources around the 602-607 area.

@coderabbitai the unit test setup does not automatically delete the child resources, hence these object cant not be "already removed" ?

🧩 Analysis chain

🌐 Web query:

envtest garbage collection owner references enabled

💡 Result:

In controller-runtime/pkg/envtest, garbage collection is effectively not enabled because envtest only starts etcd + kube-apiserver, not kube-controller-manager (garbage-collector-controller). So even if you set metadata.ownerReferences, dependents generally won’t be deleted in envtest the way they would be in a real cluster. [1]

How to deal with this:

Recommended for envtest: assert that the child objects have the expected ownerReferences (ownership), rather than expecting Kubernetes to actually delete them. [1]

If you need real GC behavior: run tests against a cluster that includes the controller manager (e.g., Kind/minikube/real cluster), or otherwise run kube-controller-manager yourself with the garbage-collector-controller enabled (it’s part of the --controllers set). [2]

Sources:
[1] Kubebuilder book – envtest “Testing considerations” (no built-in controllers; GC example)
[2] Kubernetes docs – kube-controller-manager controllers list (includes garbage-collector-controller)

@nagadeesh-nagaraja You're absolutely right! I've confirmed that envtest does not enable garbage collection by default because it only runs etcd and kube-apiserver, not kube-controller-manager. Without the garbage collector controller running, owner references won't trigger automatic deletion of child resources.

In your test setup, the child BIOSSettings objects will remain until explicitly deleted, so they cannot be "already removed" before your delete calls. The current implementation without client.IgnoreNotFound() is appropriate for this environment.

My suggestion would only apply in scenarios where:

Tests run against a full cluster (Kind/minikube) with garbage collection enabled

There's potential for race conditions between parent and child deletion

Tests need to be more defensive against future environment changes

For your current envtest-based setup, the explicit deletes as written are correct and won't flake. Thank you for the clarification!

🧠 Learnings used

Learnt from: CR Repo: ironcore-dev/metal-operator PR: 0 File: AGENTS.md:0-0 Timestamp: 2026-02-17T13:12:50.124Z Learning: Applies to internal/controller/**/*_controller.go : Implement finalizers to clean up external resources (buckets, VMs, DNS entries) before resource deletion

Reconcile SET CRD until all patching/creation is completed

ebdf9f6

nagadeesh-nagaraja self-assigned this Feb 13, 2026

nagadeesh-nagaraja requested a review from a team as a code owner February 13, 2026 13:18

github-project-automation bot added this to Roadmap Feb 13, 2026

github-actions bot added the size/XL label Feb 13, 2026

coderabbitai bot reviewed Feb 13, 2026

View reviewed changes

internal/controller/biossettingsset_controller.go Show resolved Hide resolved

internal/controller/bmcsettingsset_controller.go Outdated Show resolved Hide resolved

Review fix from AI

43b11d6

hardikdr added the area/metal-automation label Feb 14, 2026

Merge remote-tracking branch 'origin/main' into setCRD

73990a3

coderabbitai bot reviewed Feb 18, 2026

View reviewed changes

internal/controller/bmcsettingsset_controller.go Show resolved Hide resolved

afritzler changed the title ~~Reconcile SET CRD until all patching/creation is completed~~ Reconcile Set types until all patching/creation is completed Mar 2, 2026

afritzler requested changes Mar 2, 2026

View reviewed changes

Review comments

2f940f1

nagadeesh-nagaraja requested a review from afritzler March 2, 2026 16:25

coderabbitai bot reviewed Mar 2, 2026

View reviewed changes

afritzler added the enhancement New feature or request label Mar 3, 2026

afritzler approved these changes Mar 3, 2026

View reviewed changes

afritzler merged commit 56e4833 into main Mar 3, 2026
18 checks passed

afritzler deleted the setCRD branch March 3, 2026 10:12

github-project-automation bot moved this to Done in Roadmap Mar 3, 2026

coderabbitai bot mentioned this pull request Mar 4, 2026

Add ability to auto retry failed resources #695

Open

Conversation

nagadeesh-nagaraja commented Feb 13, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

afritzler commented Mar 2, 2026

Uh oh!

coderabbitai bot commented Mar 2, 2026

Current Architecture

Why ResyncInterval is Currently Used

The Key Question

Potential Issues

Recommendation

Uh oh!

nagadeesh-nagaraja commented Mar 2, 2026

Uh oh!

coderabbitai bot commented Mar 2, 2026

The Creation/Deletion Gap You Described

Trade-offs Analysis

Recommendation

One Enhancement Suggestion

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nagadeesh-nagaraja Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nagadeesh-nagaraja commented Feb 13, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 13, 2026 •

edited

Loading

coderabbitai bot Mar 2, 2026 •

edited

Loading