OCPBUGS-63048: Kube object asserts do not handle case where object is replaced#6059

Open
dkhater-redhat wants to merge 4 commits into
openshift:mainfrom
dkhater-redhat:fix-assertion-uid-check

Conversation

@dkhater-redhat
Contributor

@dkhater-redhat dkhater-redhat commented May 18, 2026

The DoesNotExist assertion methods only checked whether a lookup of an object by name returned a NotFound error. As a result, they failed to detect the case where an object was deleted and then replaced by a new object with the same name but a different UID.

This commonly occurs in tests like TestRebuildAnnotationRestartsBuild where:

  1. A MachineOSBuild/Job/Pod is deleted
  2. A rebuild creates new objects with same names but different UIDs
  3. DoesNotExist assertions see the new objects and keep polling

Fix by:

  • Capturing the original object's UID before polling
  • Checking both NotFound errors AND UID changes
  • Treating UID mismatch as deletion (object was replaced)

Updated methods:

  • PodDoesNotExist
  • JobDoesNotExist
  • SecretDoesNotExist
  • ConfigMapDoesNotExist
  • MachineOSBuildDoesNotExist
  • MachineOSConfigDoesNotExist
  • MachineConfigDoesNotExist
  • ImageDoesNotExist

This fixes the flaky failures in TestRebuildAnnotationRestartsBuild.
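
For readers following along, the UID-aware check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in test/helpers/assertions.go: the `object` type, `errNotFound`, and `isDeletedOrReplaced` are hypothetical stand-ins, and real code would presumably rely on `k8serrors.IsNotFound` and the object's UID from its metadata.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for a Kubernetes NotFound API error.
var errNotFound = errors.New("not found")

// object is a hypothetical stand-in for any Kubernetes object with a UID.
type object struct {
	Name string
	UID  string
}

// isDeletedOrReplaced reports whether the object originally identified by
// originalUID is gone, given the result of a fresh lookup by name.
func isDeletedOrReplaced(current *object, getErr error, originalUID string) (bool, error) {
	// NotFound means the original object no longer exists.
	if errors.Is(getErr, errNotFound) {
		return true, nil
	}
	// Any other error is surfaced to the caller unchanged.
	if getErr != nil {
		return false, getErr
	}
	// Same name but a different UID: the original was deleted and replaced.
	// An empty originalUID falls back to the name-only behavior.
	if current != nil && originalUID != "" && current.UID != originalUID {
		return true, nil
	}
	return false, nil
}

func main() {
	gone, err := isDeletedOrReplaced(nil, errNotFound, "uid-1")
	fmt.Println("not found:", gone, err)

	gone, err = isDeletedOrReplaced(&object{Name: "build", UID: "uid-2"}, nil, "uid-1")
	fmt.Println("replaced:", gone, err)

	gone, err = isDeletedOrReplaced(&object{Name: "build", UID: "uid-1"}, nil, "uid-1")
	fmt.Println("still present:", gone, err)
}
```

The key design choice is that a UID mismatch is reported the same way as NotFound, so a poll loop built on top of it stops as soon as the original object is gone, even if a same-named replacement already exists.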

Summary by CodeRabbit

  • New Features

    • Machine build deletion now uses a finalizer to ensure related image/artifact cleanup before removal and runs cleanup immediately when deletion starts.
  • Tests

    • Enhanced deletion verification by tracking resource identifiers to distinguish deleted vs. replaced objects.
    • Improved diagnostic logging and assertions around build lifecycle, deletion polling, and image/artifact pruning.
    • Deletion checks honor polling vs. non-polling modes for fail vs. wait behavior.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) and the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on May 18, 2026
@openshift-ci-robot
Contributor

@dkhater-redhat: This pull request references Jira Issue OCPBUGS-63048, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 18, 2026

Walkthrough

Adds UID-aware deletion checks to various test assertions, introduces a MachineOSBuild deletion finalizer and reconciler handling, and strengthens E2E test logging and polling around MOSB deletion and ImageStreamTag pruning.

Changes

UID-aware deletion + MOSB finalizer + E2E updates

| Layer / File(s) | Summary |
| --- | --- |
| Core UID-aware deletion checker (test/helpers/assertions.go) | New deletedWithUID(obj, err, originalUID) treats NotFound as deleted, treats UID changes as deletion (replacement), and returns waiting vs immediate error depending on polling mode. |
| Namespaced resource assertions with UID capture (test/helpers/assertions.go) | SecretDoesNotExist, ConfigMapDoesNotExist, PodDoesNotExist, JobDoesNotExist, and ImageDoesNotExist perform a GET to capture original UID and use deletedWithUID for deletion/replacement detection. |
| Direct object assertions with UID extraction (test/helpers/assertions.go) | MachineOSConfigDoesNotExist, MachineOSBuildDoesNotExist, and MachineConfigDoesNotExist extract original UID from the provided object and use deletedWithUID. |
| MachineOSBuild finalizer constant (pkg/controller/build/constants/constants.go) | Adds exported MachineOSBuildDeletionFinalizer constant for MOSB cleanup ordering. |
| MOSB reconciler finalizer handling (pkg/controller/build/reconciler.go) | Adds helpers to ensure the MOSB deletion finalizer is present, runs cleanup when DeletionTimestamp is set (delete associated ImageStreamTag/artifacts and digest ConfigMap), and removes the finalizer to allow deletion; attaches finalizer to reused and synthetic MOSBs. |
| E2E test diagnostics and deletion polling (test/e2e-ocl-1of2/onclusterlayering_test.go) | Adds diagnostic logging for MOSB objects and replaces immediate deletion with a poll loop that waits for full MOSB resource removal; pre- and post-deletion ImageStreamTag existence checks and updated assertions. |

Sequence Diagram(s)

sequenceDiagram
  participant Controller as MachineOSBuildController
  participant API as KubernetesAPI
  participant IST as ImageStreamTag
  participant CM as DigestConfigMap
  Controller->>API: Observe MachineOSBuild (syncMachineOSBuild)
  alt DeletionTimestamp set
    Controller->>IST: Delete ImageStreamTag / artifacts
    Controller->>CM: Delete digest ConfigMap
    Controller->>API: Update MachineOSBuild to remove MachineOSBuildDeletionFinalizer
  else Normal flow
    Controller->>API: Ensure MachineOSBuildDeletionFinalizer present on object
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • bfournie
🚥 Pre-merge checks | ✅ 9 | ❌ 3

❌ Failed checks (3 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | TestMissingImageIsRebuilt violates timeout requirements: kubeassert created with context.WithCancel (no deadline), polling until timeout occurs but comment recommends UID-aware deletion poll instead. | Add UID comparison check to the MOSB deletion poll (lines 271-289) before declaring success, matching UID-aware pattern used elsewhere in PR. |
| Microshift Test Compatibility | ⚠️ Warning | New e2e test uses MicroShift-incompatible APIs (MachineOSBuild, MachineOSConfig, MachineConfig, ImageStreamTag) without [Skipped:MicroShift], [apigroup:...], or IsMicroShiftCluster checks. | Add [apigroup:machineconfiguration.openshift.io] tag to test function names, or add [Skipped:MicroShift] labels, or add IsMicroShiftCluster() runtime check with Skip(). |

✅ Passed checks (9 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and specifically addresses the main issue: assertion helpers failing to detect when objects are deleted and replaced with new ones having different UIDs. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR contains no Ginkgo tests. All test declarations are standard Go tests (func Test*) with stable, descriptive names containing no dynamic content. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. The test file modifications use standard Go testing (func Test...), not Ginkgo, so the SNO compatibility check does not apply. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only Go source code (test helpers, constants, controller logic). No deployment manifests, scheduling constraints, affinity rules, or topology-dependent configurations. |
| Ote Binary Stdout Contract | ✅ Passed | All PR changes respect OTE stdout contract: test file uses only t.Logf() (intercepted by framework), no fmt.Print or uncontrolled stdout writes in main/init/suite code. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The custom check applies to "new Ginkgo e2e tests" but this PR only modifies existing Go test functions without adding new tests. The check is not applicable. |

@openshift-ci (bot) requested review from RishabhSaini and yuqi-zhang on May 18, 2026 18:33
@openshift-ci (bot) added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on May 18, 2026

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
test/helpers/assertions.go (1)

849-854: 💤 Low value

Remove dead conditional branch.

Both branches of the if a.poll check return the same value (false, err), making the conditional unnecessary.

Suggested simplification
 	// If there's another error, return it
 	if err != nil {
-		if a.poll {
-			return false, err
-		}
 		return false, err
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/helpers/assertions.go` around lines 849 - 854, The conditional checking
a.poll is dead because both branches return the same tuple; simplify by removing
the if block and directly returning (false, err) where that code appears (the
block referencing a.poll in assertions.go), i.e., replace the if err != nil { if
a.poll { return false, err } return false, err } with a single if err != nil {
return false, err } to eliminate redundant logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6a2b52cb-997f-422f-bcd4-8b6ffb21692f

📥 Commits

Reviewing files that changed from the base of the PR and between a3a68a8 and 4e13ea9.

📒 Files selected for processing (1)
  • test/helpers/assertions.go

@dkhater-redhat
Contributor Author

/test e2e-gcp-op-ocl-part1
/test e2e-gcp-op-ocl-part2

1 similar comment
@dkhater-redhat
Contributor Author

/test e2e-gcp-op-ocl-part1
/test e2e-gcp-op-ocl-part2

Member

@isabella-janssen isabella-janssen left a comment

/lgtm

@isabella-janssen
Member

/jira refresh

@openshift-ci-robot added the jira/valid-bug label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) and removed the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on May 19, 2026
@openshift-ci-robot
Contributor

@isabella-janssen: This pull request references Jira Issue OCPBUGS-63048, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

@openshift-ci (bot) requested a review from sergiordlr on May 19, 2026 16:24
@openshift-ci (bot) added the lgtm label (indicates that a PR is ready to be merged) on May 19, 2026
@openshift-merge-bot
Contributor

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

@openshift-ci
Contributor

openshift-ci Bot commented May 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dkhater-redhat, isabella-janssen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [dkhater-redhat,isabella-janssen]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

TestMissingImageIsRebuilt was timing out waiting for ImageStreamTags
to be pruned after deleting a MachineOSBuild. The test expects that
when a MOSB is deleted via the API, its associated ImageStreamTag in
the internal registry is automatically cleaned up.

Problem:
ImageStreamTags are created by the build process (podman push) and
cannot have the MOSB as an owner reference (different API group).
When a MOSB is deleted externally (via kubectl or test), the
controller never sees or handles the deletion, leaving orphaned
ImageStreamTags in the registry.

The existing deleteMOSBImage logic only runs when the controller
itself initiates deletion (e.g., during rebuild), not when users
delete MOSBs directly.

Solution:
Add a custom finalizer (machine-os-build-deletion) to all MOSBs.
When a MOSB is deleted:
1. Controller detects DeletionTimestamp in syncMachineOSBuild
2. Calls handleMachineOSBuildDeletion which:
   - Deletes the ImageStreamTag via deleteMOSBImage
   - Deletes the digest ConfigMap
   - Removes the finalizer
3. Kubernetes proceeds with MOSB deletion

This is the standard Kubernetes pattern for cleanup and ensures
registry pruning works regardless of how the MOSB is deleted.
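
As a rough illustration of the finalizer flow in this commit message, the sketch below models deletion handling with plain Go types. Everything here is hypothetical: `build`, `cleanupFinalizer`, `errCleanup`, and `handleDeletion` are illustrative names rather than the reconciler's real API, and `Deleting` stands in for a non-nil DeletionTimestamp. Note one deliberate assumption: the sketch keeps the finalizer when a cleanup step fails, which is a common choice for this pattern and not necessarily what this PR does.

```go
package main

import (
	"errors"
	"fmt"
)

// cleanupFinalizer is a hypothetical finalizer name used only in this sketch.
const cleanupFinalizer = "example.com/build-cleanup"

// errCleanup is a hypothetical cleanup failure used by the demo in main.
var errCleanup = errors.New("registry unavailable")

// build is a hypothetical stand-in for a MachineOSBuild.
type build struct {
	Name       string
	Deleting   bool // stands in for a non-nil DeletionTimestamp
	Finalizers []string
}

func hasFinalizer(b *build, name string) bool {
	for _, f := range b.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

// handleDeletion runs each cleanup step in order and removes the finalizer
// only after all of them succeed. b is assumed to be a private copy.
func handleDeletion(b *build, cleanups ...func(*build) error) error {
	if !b.Deleting || !hasFinalizer(b, cleanupFinalizer) {
		return nil // nothing to do for live objects or objects we do not own
	}
	for _, cleanup := range cleanups {
		if err := cleanup(b); err != nil {
			return fmt.Errorf("cleanup for %q failed, keeping finalizer: %w", b.Name, err)
		}
	}
	kept := make([]string, 0, len(b.Finalizers))
	for _, f := range b.Finalizers {
		if f != cleanupFinalizer {
			kept = append(kept, f)
		}
	}
	b.Finalizers = kept
	return nil
}

func main() {
	b := &build{Name: "my-build", Deleting: true, Finalizers: []string{cleanupFinalizer}}

	failing := func(*build) error { return errCleanup }
	fmt.Println(handleDeletion(b, failing), b.Finalizers)

	succeeding := func(*build) error { return nil }
	fmt.Println(handleDeletion(b, succeeding), b.Finalizers)
}
```

In a real controller the returned error would typically cause a requeue, so cleanup is retried until it succeeds and the object can finally be removed.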

Fixes: TestMissingImageIsRebuilt timeout at image pruning assertion
@openshift-ci (bot) removed the lgtm label (indicates that a PR is ready to be merged) on May 19, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 19, 2026

New changes are detected. LGTM label has been removed.

@dkhater-redhat
Contributor Author

/test e2e-gcp-op-ocl-part1

1 similar comment
@dkhater-redhat
Contributor Author

/test e2e-gcp-op-ocl-part1

@dkhater-redhat
Contributor Author

/test unit

The previous fix only added the finalizer to newly created MOSBs,
but TestMissingImageIsRebuilt deletes MOSBs that were created earlier
in the test run before our controller code executed.

Add ensureMachineOSBuildFinalizer() which runs during syncMachineOSBuild
to migrate existing MOSBs. This ensures all MOSBs have the deletion
finalizer, whether they were created before or after the code change.

This handles the migration path and ensures the test will pass even
when deleting MOSBs created before the finalizer logic was added.
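
The migration step described in this commit message needs to be idempotent, since it runs on every sync. A minimal sketch of that idea, assuming a hypothetical helper `withFinalizer` and a made-up finalizer name (neither is taken from the PR), might look like this:

```go
package main

import "fmt"

// withFinalizer returns the finalizer list with name present exactly once and
// reports whether anything had to be added. The input slice is never modified.
func withFinalizer(finalizers []string, name string) ([]string, bool) {
	for _, f := range finalizers {
		if f == name {
			return finalizers, false // already present, no update needed
		}
	}
	out := make([]string, 0, len(finalizers)+1)
	out = append(out, finalizers...)
	return append(out, name), true
}

func main() {
	const finalizer = "example.com/build-cleanup" // hypothetical finalizer name

	existing := []string{"example.com/other"}
	first, changed := withFinalizer(existing, finalizer)
	fmt.Println(first, changed)

	// A second pass over the already-migrated list is a no-op.
	second, changedAgain := withFinalizer(first, finalizer)
	fmt.Println(second, changedAgain)
}
```

Returning a `changed` flag lets the caller skip the API update entirely when the finalizer is already present, which keeps repeated syncs cheap.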

Added comprehensive logging to diagnose the image pruning issue:

1. Log finalizers on all three MOSBs when they're created and at key
   points (fetch, start, completion)

2. Log ImageStreamTag state before and after MOSB deletion to see if
   the finalizer is doing its job

3. Poll the MOSB during deletion to see DeletionTimestamp and finalizer
   removal progress

4. Log exact timing of when MOSB is fully deleted vs when IST should
   be pruned

This will help us understand:
- Whether the finalizer is being added to MOSBs
- Whether the finalizer handler is running
- Whether the ImageStreamTag is being deleted by the finalizer
- If there's a timing issue between MOSB deletion and IST pruning

The logs will show the full lifecycle and help identify where the
pruning is failing.
@dkhater-redhat
Contributor Author

/test e2e-gcp-op-ocl-part1

@openshift-ci-robot
Contributor

@dkhater-redhat: This pull request references Jira Issue OCPBUGS-63048, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controller/build/reconciler.go`:
- Around line 1024-1034: The finalizer is being removed even when cleanup fails;
change the delete flow in reconciler.go so that you only remove the
MachineOSBuild finalizer when all cleanup steps succeeded: check the return from
deleteMOSBImage(mosb) and from kubeclient.CoreV1().ConfigMaps(...).Delete(...)
(treat IsNotFound as success) and if any non-tolerated error occurs return that
error (or requeue) instead of proceeding to clear the finalizer; apply the same
guard for the other cleanup block around lines 1045-1054 so finalizer removal
happens only after successful image and configmap deletion.
- Around line 1195-1198: The current code only ensures the deletion finalizer
via ensureMachineOSBuildFinalizer during sync (migration), which leaves race
windows where newly-created MachineOSBuilds (created in normal/rebuild create
paths) can be deleted before their first reconcile and skip
handleMachineOSBuildDeletion; to fix, set
constants.MachineOSBuildDeletionFinalizer on every MachineOSBuild object
immediately before any client.Create call that creates a MachineOSBuild (i.e.,
in the normal create and rebuild create paths/functions that construct and call
Create for MachineOSBuild), removing reliance on the migration-only
ensureMachineOSBuildFinalizer for new objects and keeping that helper only for
existing-resource migration.
- Around line 981-986: The code mutates an informer-backed MachineOSBuild (mosb)
in-place when adding finalizers in
ensureMachineOSBuildFinalizer/handleMachineOSBuildDeletion; instead, obtain a
copy before mutating by calling getMachineOSBuildForUpdate(ctx, mosb.Name)
(which performs DeepCopy) or by using mosb = mosb.DeepCopy(), then call
mosb.SetFinalizers(append(..., constants.MachineOSBuildDeletionFinalizer)) and
pass that copy to
b.mcfgclient.MachineconfigurationV1().MachineOSBuilds().Update(...); update
syncMachineOSBuild/ensureMachineOSBuildFinalizer to operate on the copied object
to avoid mutating the informer cache.

In `@test/e2e-ocl-1of2/onclusterlayering_test.go`:
- Around line 271-289: The deletion poll treats NotFound as the only success
condition and can misinterpret a recreated MachineOSBuild with the same name as
the original; modify the wait.PollImmediate closure that calls
cs.GetMcfgclient().MachineconfigurationV1().MachineOSBuilds().Get to also check
the retrieved object's UID against the original thirdMOSB.UID (or reuse the
existing waitForMOSBToBeDeleted helper) so the poll returns success when the
original thirdMOSB's UID is absent/equivalently when a returned MOSB has a
different UID, logging the UID mismatch and exiting early instead of waiting the
full timeout.
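
The UID-aware poll suggested for the E2E test could look roughly like the sketch below. It is an illustration only: `getter`, `errNotFound`, `waitForDeletion`, and `runDemo` are hypothetical names, the fake lookup replaces the real MachineOSBuild client call, and a plain ticker loop is used instead of the `wait` helpers so the example stays self-contained.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotFound is a hypothetical stand-in for a Kubernetes NotFound API error.
var errNotFound = errors.New("not found")

// getter is a hypothetical lookup that returns the UID of the named object.
type getter func(ctx context.Context, name string) (uid string, err error)

// waitForDeletion polls until the object with originalUID is gone, either
// because the lookup returns NotFound or because a different UID is returned.
func waitForDeletion(ctx context.Context, get getter, name, originalUID string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		uid, err := get(ctx, name)
		switch {
		case errors.Is(err, errNotFound):
			return nil // original object is gone
		case err != nil:
			return err
		case uid != originalUID:
			return nil // same name, new UID: the original was replaced
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("waiting for %q (UID %s) to be deleted: %w", name, originalUID, ctx.Err())
		case <-ticker.C:
		}
	}
}

// runDemo drives waitForDeletion with a fake lookup and returns the number of
// lookups performed.
func runDemo() (int, error) {
	calls := 0
	// The original object is visible twice, then a same-named replacement
	// with a different UID appears.
	get := func(_ context.Context, _ string) (string, error) {
		calls++
		if calls < 3 {
			return "uid-original", nil
		}
		return "uid-replacement", nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := waitForDeletion(ctx, get, "my-build", "uid-original", time.Millisecond)
	return calls, err
}

func main() {
	calls, err := runDemo()
	fmt.Println("lookups:", calls, "error:", err)
}
```

The context carries a deadline here, so the loop cannot spin forever if the object is never deleted.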

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3aa149ea-170a-400f-b6f9-602c0756e514

📥 Commits

Reviewing files that changed from the base of the PR and between 4e13ea9 and 27c60e2.

📒 Files selected for processing (3)
  • pkg/controller/build/constants/constants.go
  • pkg/controller/build/reconciler.go
  • test/e2e-ocl-1of2/onclusterlayering_test.go

Comment on lines +981 to +986
// Add the finalizer
klog.V(4).Infof("Adding deletion finalizer to MachineOSBuild %q", mosb.Name)
mosb.SetFinalizers(append(mosb.GetFinalizers(), constants.MachineOSBuildDeletionFinalizer))

// Update the MOSB
_, err := b.mcfgclient.MachineconfigurationV1().MachineOSBuilds().Update(ctx, mosb, metav1.UpdateOptions{})

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

# First, let me check the file and the specific lines mentioned
wc -l pkg/controller/build/reconciler.go

Repository: openshift/machine-config-operator

Length of output: 117


🏁 Script executed:

# Read the context around the lines mentioned (981-986 and 1043-1046)
sed -n '960,1000p' pkg/controller/build/reconciler.go | cat -n

Repository: openshift/machine-config-operator

Length of output: 1738


🏁 Script executed:

# Also check lines 1043-1046
sed -n '1030,1060p' pkg/controller/build/reconciler.go | cat -n

Repository: openshift/machine-config-operator

Length of output: 1489


🏁 Script executed:

# Find the function that contains these lines to understand the context
sed -n '900,990p' pkg/controller/build/reconciler.go | cat -n

Repository: openshift/machine-config-operator

Length of output: 4650


🏁 Script executed:

# Search for getMachineOSBuildForUpdate function
rg "getMachineOSBuildForUpdate" pkg/controller/build/

Repository: openshift/machine-config-operator

Length of output: 403


🏁 Script executed:

# Check if there's a lister pattern and how objects are obtained
rg -A 5 "syncMachineOSBuild" pkg/controller/build/reconciler.go | head -50

Repository: openshift/machine-config-operator

Length of output: 1371


🏁 Script executed:

# Get the full syncMachineOSBuild function
sed -n '820,1000p' pkg/controller/build/reconciler.go | cat -n

Repository: openshift/machine-config-operator

Length of output: 8899


🏁 Script executed:

# Get the getMachineOSBuildForUpdate function definition
rg -A 10 "func.*getMachineOSBuildForUpdate" pkg/controller/build/reconciler.go

Repository: openshift/machine-config-operator

Length of output: 385


🏁 Script executed:

# See where ensureMachineOSBuildFinalizer is called
rg -B 5 -A 2 "ensureMachineOSBuildFinalizer" pkg/controller/build/reconciler.go

Repository: openshift/machine-config-operator

Length of output: 799


Avoid mutating informer-backed MachineOSBuild objects in place.

syncMachineOSBuild receives mosb from the lister (b.machineOSBuildLister.List(labels.Everything())), then ensureMachineOSBuildFinalizer and handleMachineOSBuildDeletion mutate it with SetFinalizers(...) before the API update. This mutates the shared cache entry, creating stale-cache side effects and potential races. Use getMachineOSBuildForUpdate() (which already exists and does a DeepCopy) or call DeepCopy() explicitly before modifying finalizers.
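
To make the concern concrete, here is a small sketch of the copy-before-mutate pattern using plain Go types. The names `build`, `sharedCache`, `deepCopy`, and `prepareUpdate` are hypothetical stand-ins for the informer cache and the generated DeepCopy method; they are not the reconciler's actual helpers.

```go
package main

import "fmt"

// build is a hypothetical stand-in for a MachineOSBuild.
type build struct {
	Name       string
	Finalizers []string
}

// deepCopy returns a copy that shares no memory with the receiver.
func (b *build) deepCopy() *build {
	out := *b
	out.Finalizers = append([]string(nil), b.Finalizers...)
	return &out
}

// sharedCache is a hypothetical stand-in for an informer-backed lister whose
// entries are shared with other readers and must not be modified.
type sharedCache map[string]*build

// prepareUpdate copies the cached object and adds the finalizer to the copy,
// leaving the cache entry untouched.
func prepareUpdate(cache sharedCache, name, finalizer string) (*build, error) {
	cached, ok := cache[name]
	if !ok {
		return nil, fmt.Errorf("build %q not in cache", name)
	}
	updated := cached.deepCopy()
	updated.Finalizers = append(updated.Finalizers, finalizer)
	return updated, nil
}

func main() {
	cache := sharedCache{"my-build": &build{Name: "my-build"}}
	updated, err := prepareUpdate(cache, "my-build", "example.com/build-cleanup")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("copy:", updated.Finalizers, "cache:", cache["my-build"].Finalizers)
}
```

Only the copy would be sent to the API server; the cache is refreshed later by the informer when the update event arrives.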

Comment on lines +1024 to +1034
// Delete the ImageStreamTag
if err := b.deleteMOSBImage(ctx, mosb, moscName); err != nil {
// Log the error but don't block deletion - the image might already be gone
klog.Warningf("Failed to delete image for MachineOSBuild %q during deletion: %v", mosb.Name, err)
}

// Delete the digest configmap if it exists
err = b.kubeclient.CoreV1().ConfigMaps(ctrlcommon.MCONamespace).Delete(ctx, utils.GetDigestConfigMapName(mosb), metav1.DeleteOptions{})
if err != nil && !k8serrors.IsNotFound(err) {
klog.Warningf("Failed to delete digest configmap for MachineOSBuild %q during deletion: %v", mosb.Name, err)
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t remove the finalizer after a cleanup failure.

Any non-tolerated error from deleteMOSBImage or the digest ConfigMap delete means cleanup did not finish, but this code only logs and still clears the finalizer. That can permanently leak artifacts and defeats the ordering guarantee the finalizer is meant to provide.

💡 Suggested change
 	// Delete the ImageStreamTag
-	if err := b.deleteMOSBImage(ctx, mosb, moscName); err != nil {
-		// Log the error but don't block deletion - the image might already be gone
-		klog.Warningf("Failed to delete image for MachineOSBuild %q during deletion: %v", mosb.Name, err)
-	}
+	if err := b.deleteMOSBImage(ctx, mosb, moscName); err != nil {
+		return fmt.Errorf("failed to delete image for MachineOSBuild %q during deletion: %w", mosb.Name, err)
+	}

 	// Delete the digest configmap if it exists
 	err = b.kubeclient.CoreV1().ConfigMaps(ctrlcommon.MCONamespace).Delete(ctx, utils.GetDigestConfigMapName(mosb), metav1.DeleteOptions{})
 	if err != nil && !k8serrors.IsNotFound(err) {
-		klog.Warningf("Failed to delete digest configmap for MachineOSBuild %q during deletion: %v", mosb.Name, err)
+		return fmt.Errorf("failed to delete digest configmap for MachineOSBuild %q during deletion: %w", mosb.Name, err)
 	}

Also applies to: 1045-1054

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/build/reconciler.go` around lines 1024 - 1034, The finalizer
is being removed even when cleanup fails; change the delete flow in
reconciler.go so that you only remove the MachineOSBuild finalizer when all
cleanup steps succeeded: check the return from deleteMOSBImage(mosb) and from
kubeclient.CoreV1().ConfigMaps(...).Delete(...) (treat IsNotFound as success)
and if any non-tolerated error occurs return that error (or requeue) instead of
proceeding to clear the finalizer; apply the same guard for the other cleanup
block around lines 1045-1054 so finalizer removal happens only after successful
image and configmap deletion.
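
The ordering described above can be sketched without any Kubernetes dependencies. This is a minimal illustration, not the reconciler's actual code: `build`, `cleanupStep`, `finalize`, the two error values, and the finalizer string are all hypothetical stand-ins (real code would check `k8serrors.IsNotFound` on the API error). It only shows the guard: a non-tolerated cleanup error returns early and leaves the finalizer in place.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for API errors; real code would use k8serrors.IsNotFound instead.
var (
	errNotFound    = errors.New("not found")
	errUnavailable = errors.New("registry unavailable")
)

// deletionFinalizer is a hypothetical value, not the one used by the PR.
const deletionFinalizer = "example.test/build-cleanup"

// build is a minimal stand-in for a MachineOSBuild.
type build struct {
	Name       string
	Finalizers []string
}

// cleanupStep is one named cleanup action, such as deleting the image.
type cleanupStep struct {
	Desc string
	Run  func() error
}

// finalize runs every step and drops the finalizer only if all succeeded.
// NotFound is tolerated because the artifact is already gone.
func finalize(b *build, steps []cleanupStep) error {
	for _, s := range steps {
		if err := s.Run(); err != nil && !errors.Is(err, errNotFound) {
			// Keep the finalizer so a later reconcile can retry.
			return fmt.Errorf("failed to %s for %q: %w", s.Desc, b.Name, err)
		}
	}
	kept := make([]string, 0, len(b.Finalizers))
	for _, f := range b.Finalizers {
		if f != deletionFinalizer {
			kept = append(kept, f)
		}
	}
	b.Finalizers = kept
	return nil
}

func main() {
	b := &build{Name: "mosb-1", Finalizers: []string{deletionFinalizer}}

	failing := []cleanupStep{{Desc: "delete image", Run: func() error { return errUnavailable }}}
	fmt.Println(finalize(b, failing) != nil, len(b.Finalizers))

	tolerated := []cleanupStep{{Desc: "delete digest configmap", Run: func() error { return errNotFound }}}
	fmt.Println(finalize(b, tolerated) != nil, len(b.Finalizers))
}
```

Returning the error, rather than logging it, is what lets the controller requeue the object and retry the cleanup while the finalizer still blocks deletion.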

Comment on lines +1195 to +1198
// Ensure finalizer is present on existing MOSBs (migration for pre-existing builds)
if err := b.ensureMachineOSBuildFinalizer(ctx, mosb); err != nil {
return fmt.Errorf("could not ensure finalizer on MachineOSBuild %q: %w", mosb.Name, err)
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Attach the deletion finalizer before every MachineOSBuild create.

Adding it here during sync still leaves the normal/rebuild create paths with a race: a freshly-created MachineOSBuild can be deleted before its first reconcile, handleMachineOSBuildDeletion sees no finalizer, and the image/configmap cleanup is skipped entirely. Set constants.MachineOSBuildDeletionFinalizer on the object before each Create, and keep this helper for migration only.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/build/reconciler.go` around lines 1195 - 1198, The current
code only ensures the deletion finalizer via ensureMachineOSBuildFinalizer
during sync (migration), which leaves race windows where newly-created
MachineOSBuilds (created in normal/rebuild create paths) can be deleted before
their first reconcile and skip handleMachineOSBuildDeletion; to fix, set
constants.MachineOSBuildDeletionFinalizer on every MachineOSBuild object
immediately before any client.Create call that creates a MachineOSBuild (i.e.,
in the normal create and rebuild create paths/functions that construct and call
Create for MachineOSBuild), removing reliance on the migration-only
ensureMachineOSBuildFinalizer for new objects and keeping that helper only for
existing-resource migration.
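
A rough sketch of the create-time approach, assuming nothing about the real client: `build`, `fakeClient`, `withFinalizer`, `createBuild`, and the finalizer string below are hypothetical and stand in for the MachineOSBuild type, the API client, and `constants.MachineOSBuildDeletionFinalizer`. The point is only that the finalizer is already on the object when it is first sent to the API, so there is no window between creation and the first reconcile.

```go
package main

import "fmt"

// deletionFinalizer is a hypothetical value standing in for
// constants.MachineOSBuildDeletionFinalizer.
const deletionFinalizer = "example.test/build-cleanup"

// build is a minimal stand-in for a MachineOSBuild.
type build struct {
	Name       string
	Finalizers []string
}

// fakeClient records created objects in place of a real API client.
type fakeClient struct {
	created []build
}

// Create stores the object exactly as it was submitted.
func (c *fakeClient) Create(b build) {
	c.created = append(c.created, b)
}

// withFinalizer appends f unless it is already present, so it is safe to
// call from both the create path and the migration path.
func withFinalizer(finalizers []string, f string) []string {
	for _, existing := range finalizers {
		if existing == f {
			return finalizers
		}
	}
	return append(finalizers, f)
}

// createBuild sets the finalizer before the object is sent to the API,
// so deletion is guarded without waiting for a reconcile.
func createBuild(c *fakeClient, name string) build {
	b := build{Name: name}
	b.Finalizers = withFinalizer(b.Finalizers, deletionFinalizer)
	c.Create(b)
	return b
}

func main() {
	c := &fakeClient{}
	createBuild(c, "mosb-1")
	fmt.Println(c.created[0].Finalizers)
}
```

Because `withFinalizer` is idempotent, a migration helper that runs during sync can call it on pre-existing objects without adding the finalizer twice.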

Comment on lines +271 to +289
// Poll to see the MOSB state during deletion
t.Logf("Waiting for MachineOSBuild %q to be fully deleted...", thirdMOSB.Name)
deletionStart := time.Now()
pollErr := wait.PollImmediate(time.Second*2, time.Minute*5, func() (bool, error) {
mosb, err := cs.GetMcfgclient().MachineconfigurationV1().MachineOSBuilds().Get(ctx, thirdMOSB.Name, metav1.GetOptions{})
if k8serrors.IsNotFound(err) {
t.Logf("MachineOSBuild %q is now fully deleted (took %s)", thirdMOSB.Name, time.Since(deletionStart))
return true, nil
}
if err != nil {
return false, err
}
if mosb.DeletionTimestamp != nil {
t.Logf("MachineOSBuild %q has DeletionTimestamp=%s, finalizers=%v (waiting for finalizer cleanup...)",
thirdMOSB.Name, mosb.DeletionTimestamp, mosb.Finalizers)
}
return false, nil
})
require.NoError(t, pollErr)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep this deletion poll UID-aware as well.

This loop only treats NotFound as success. If the original thirdMOSB is deleted and another MachineOSBuild with the same name appears, the poll waits the full 5 minutes even though the target object is gone—the same failure mode this PR is fixing elsewhere. Compare against thirdMOSB.UID, or reuse waitForMOSBToBeDeleted.

💡 Suggested change
 	t.Logf("Waiting for MachineOSBuild %q to be fully deleted...", thirdMOSB.Name)
 	deletionStart := time.Now()
+	expectedUID := thirdMOSB.UID
 	pollErr := wait.PollImmediate(time.Second*2, time.Minute*5, func() (bool, error) {
 		mosb, err := cs.GetMcfgclient().MachineconfigurationV1().MachineOSBuilds().Get(ctx, thirdMOSB.Name, metav1.GetOptions{})
 		if k8serrors.IsNotFound(err) {
 			t.Logf("MachineOSBuild %q is now fully deleted (took %s)", thirdMOSB.Name, time.Since(deletionStart))
 			return true, nil
 		}
 		if err != nil {
 			return false, err
 		}
+		if mosb.UID != expectedUID {
+			t.Logf("MachineOSBuild %q was replaced with UID %s after %s", thirdMOSB.Name, mosb.UID, time.Since(deletionStart))
+			return true, nil
+		}
 		if mosb.DeletionTimestamp != nil {
 			t.Logf("MachineOSBuild %q has DeletionTimestamp=%s, finalizers=%v (waiting for finalizer cleanup...)",
 				thirdMOSB.Name, mosb.DeletionTimestamp, mosb.Finalizers)
 		}
 		return false, nil
 	})
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Poll to see the MOSB state during deletion
t.Logf("Waiting for MachineOSBuild %q to be fully deleted...", thirdMOSB.Name)
deletionStart := time.Now()
expectedUID := thirdMOSB.UID
pollErr := wait.PollImmediate(time.Second*2, time.Minute*5, func() (bool, error) {
mosb, err := cs.GetMcfgclient().MachineconfigurationV1().MachineOSBuilds().Get(ctx, thirdMOSB.Name, metav1.GetOptions{})
if k8serrors.IsNotFound(err) {
t.Logf("MachineOSBuild %q is now fully deleted (took %s)", thirdMOSB.Name, time.Since(deletionStart))
return true, nil
}
if err != nil {
return false, err
}
if mosb.UID != expectedUID {
t.Logf("MachineOSBuild %q was replaced with UID %s after %s", thirdMOSB.Name, mosb.UID, time.Since(deletionStart))
return true, nil
}
if mosb.DeletionTimestamp != nil {
t.Logf("MachineOSBuild %q has DeletionTimestamp=%s, finalizers=%v (waiting for finalizer cleanup...)",
thirdMOSB.Name, mosb.DeletionTimestamp, mosb.Finalizers)
}
return false, nil
})
require.NoError(t, pollErr)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e-ocl-1of2/onclusterlayering_test.go` around lines 271 - 289, The
deletion poll treats NotFound as the only success condition, so it keeps waiting
when the original MachineOSBuild is deleted and another one is created with the
same name; modify the wait.PollImmediate closure that calls
cs.GetMcfgclient().MachineconfigurationV1().MachineOSBuilds().Get to also compare
the retrieved object's UID against the original thirdMOSB.UID (or reuse the
existing waitForMOSBToBeDeleted helper) so the poll returns success when the
returned MOSB has a different UID, logging the UID mismatch and exiting early
instead of waiting the full timeout.
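
The UID comparison can be isolated into a small predicate, sketched here with hypothetical names (`object`, `getFunc`, `originalIsGone`, `errNotFound`); the real test would call the MachineOSBuilds client and check `k8serrors.IsNotFound`. It treats both a NotFound error and a UID mismatch as "the original object is gone", and passes any other error through.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a Kubernetes NotFound API error.
var errNotFound = errors.New("not found")

// object carries only the fields the check needs.
type object struct {
	Name string
	UID  string
}

// getFunc mimics a client Get call that looks an object up by name.
type getFunc func(name string) (*object, error)

// originalIsGone reports whether the object that had expectedUID no longer
// exists: either nothing has that name, or a different object now does.
func originalIsGone(get getFunc, name, expectedUID string) (bool, error) {
	obj, err := get(name)
	if errors.Is(err, errNotFound) {
		return true, nil
	}
	if err != nil {
		// Unexpected errors are surfaced to the caller, not treated as deletion.
		return false, err
	}
	return obj.UID != expectedUID, nil
}

func main() {
	// A replacement object with the same name but a new UID.
	replaced := func(name string) (*object, error) {
		return &object{Name: name, UID: "uid-new"}, nil
	}
	done, err := originalIsGone(replaced, "mosb-1", "uid-old")
	fmt.Println(done, err)
}
```

A predicate of this shape can be returned directly from a poll closure, with the expected UID captured once before polling starts.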

@openshift-ci
Contributor

openshift-ci Bot commented May 20, 2026

@dkhater-redhat: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-op-ocl-part1 | 27c60e2 | link | true | /test e2e-gcp-op-ocl-part1 |



Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
