OCPBUGS-84218: Fix units rollback if update failure #5876

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from pablintino:ocpbugs-84218
Apr 24, 2026

@pablintino
Contributor

@pablintino pablintino commented Apr 23, 2026

Fixes: #OCPBUGS-84218

- What I did

This change fixes an issue that happened only during rollbacks of updates that had modified existing units. The modified units were never rolled back because the write logic used the non-rollback arguments, effectively writing the target units twice instead of going back to the previous units.
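
The failure mode described above can be modeled in a few lines. This is only a minimal sketch under stated assumptions: `Unit`, `writeUnits`, and `simulateFailedUpdate` are hypothetical names invented for illustration, and a map stands in for the node's filesystem. It is not the machine-config-daemon code.

```go
package main

import "fmt"

// Unit is a simplified, hypothetical stand-in for a systemd unit entry.
type Unit struct {
	Name     string
	Contents string
}

// writeUnits mimics writing unit files by storing their contents in a map
// keyed by unit name.
func writeUnits(disk map[string]string, units []Unit) {
	for _, u := range units {
		disk[u.Name] = u.Contents
	}
}

// simulateFailedUpdate applies target on top of previous and then rolls back.
// With useRollbackUnits=false the rollback reuses the target units, which
// models the behaviour described in the PR text.
func simulateFailedUpdate(previous, target []Unit, useRollbackUnits bool) map[string]string {
	disk := map[string]string{}
	writeUnits(disk, previous) // state before the update
	writeUnits(disk, target)   // the update writes the target units, then fails later

	if useRollbackUnits {
		writeUnits(disk, previous) // rollback restores the previous contents
	} else {
		writeUnits(disk, target) // rollback writes the target contents a second time
	}
	return disk
}

func main() {
	const name = "wait-for-ipsec-connect.service"
	previous := []Unit{{Name: name, Contents: "original"}}
	target := []Unit{{Name: name, Contents: "modified"}}

	fmt.Println("before fix:", simulateFailedUpdate(previous, target, false)[name])
	fmt.Println("after fix: ", simulateFailedUpdate(previous, target, true)[name])
}
```

In this model the first call leaves the modified contents on disk after the "rollback", while the second call restores the original contents.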

- How to verify it

(credit goes to @sergiordlr)

  1. Deploy a cluster with this change on it
  2. Apply the following MC
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: not-pullable-image-tc54054-ygo9u9tc
spec:
  config:
    ignition:
      version: 3.5.0
    passwd:
      users: []
    storage:
      files:
        - contents:
            source: data:,test-content-for-debugging%0A
          mode: 0644
          overwrite: true
          path: /etc/mco-test-file.conf
    systemd:
      units:
        - contents: |
            [Unit]
            Description=Ensure IKE SA established for existing IPsec connections.
            After=ipsec.service
            Before=kubelet-dependencies.target node-valid-hostname.service

            [Service]
            Type=oneshot
            ExecStart=/usr/local/bin/ipsec-connect-wait.sh
            Environment="MCO_TEST=true"
            StandardOutput=journal+console
            StandardError=journal+console

            [Install]
            WantedBy=ipsec.service
          enabled: true
          name: wait-for-ipsec-connect.service
  extensions: []
  kernelArguments: []
  osImageURL: quay.io/openshifttest/tc54054fakeimage:latest
  3. The MCP should degrade, but the node should try again to pull the wrong image without facing pre-flight errors.

- Description for the changelog

Fix the rollback logic of systemd units in case of a mid-update error.

Summary by CodeRabbit

Bug Fixes

  • Improved rollback mechanism to more accurately identify and target configuration units that require reverting during system recovery operations.

This change fixes an issue that happened only during rollbacks of updates
that had modified existing units. The modified units were never rolled back
because the write logic used the non-rollback arguments, effectively
writing the target units twice instead of going back to the previous
units.

Signed-off-by: Pablo Rodriguez Nava <git@amail.pablintino.eu>
@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 23, 2026
@openshift-ci-robot
Contributor

@pablintino: This pull request references Jira Issue OCPBUGS-84218, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Walkthrough

The rollback logic in Daemon.update and Daemon.updateHypershift now determines the rollback unit set based on the reverse config diff rather than reusing the forward diff, ensuring rollback targets only units that differ when comparing new config against old config.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Rollback Logic Refinement (pkg/daemon/update.go) | Modified rollback unit determination to compute diff from reverse config direction (newIgnConfig vs oldIgnConfig) instead of reusing forward diff, ensuring rollback operations target only the units that actually differ in the reverse direction. |
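
To illustrate why the diff direction matters, here is a minimal sketch of a directional unit diff. `Unit`, `UnitDiff`, and `diffUnits` are hypothetical and only loosely modeled on the `Added`/`Updated` fields mentioned in this PR; the real `ctrlcommon.GetChangedConfigUnitsByType` may behave differently. The assumption is that `Added` and `Updated` carry the contents of the config being moved *to*.

```go
package main

import "fmt"

// Unit is a simplified, hypothetical stand-in for a systemd unit entry.
type Unit struct {
	Name     string
	Contents string
}

// UnitDiff groups units by how they change when moving from one config to another.
type UnitDiff struct {
	Added   []Unit
	Updated []Unit
	Removed []Unit
}

// diffUnits returns what must change to move from `from` to `to`.
// Added and Updated hold the units as they appear in `to`.
func diffUnits(from, to []Unit) UnitDiff {
	fromByName := make(map[string]Unit, len(from))
	for _, u := range from {
		fromByName[u.Name] = u
	}
	toNames := make(map[string]bool, len(to))

	var d UnitDiff
	for _, u := range to {
		toNames[u.Name] = true
		prev, ok := fromByName[u.Name]
		switch {
		case !ok:
			d.Added = append(d.Added, u)
		case prev.Contents != u.Contents:
			d.Updated = append(d.Updated, u)
		}
	}
	for _, u := range from {
		if !toNames[u.Name] {
			d.Removed = append(d.Removed, u)
		}
	}
	return d
}

func main() {
	oldCfg := []Unit{{Name: "mco-test.service", Contents: "Description=MCO test unit"}}
	newCfg := []Unit{{Name: "mco-test.service", Contents: "Description=MCO test unit v2"}}

	forward := diffUnits(oldCfg, newCfg) // what the update writes
	reverse := diffUnits(newCfg, oldCfg) // what a rollback should write

	fmt.Println("forward:", forward.Updated[0].Contents) // the v2 description
	fmt.Println("reverse:", reverse.Updated[0].Contents) // the original description
}
```

Both directions report the same unit name as updated, but only the reverse diff carries the contents that a rollback needs to write back.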

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly identifies the specific bug being fixed (OCPBUGS-84218: Fix units rollback if update failure) and accurately reflects the main change, which is correcting the rollback logic for systemd units when updates fail. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR only modifies pkg/daemon/update.go (implementation file), not test files. No Ginkgo test names are present or changed in this PR. |
| Test Structure And Quality | ✅ Passed | The PR uses standard Go testing with testing.T, not Ginkgo, so the Ginkgo-specific quality checks are not applicable. |
| Microshift Test Compatibility | ✅ Passed | This PR does not add any new Ginkgo e2e tests. Only pkg/daemon/update.go was modified with +4/-2 lines for rollback logic, and no test files were added or modified. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added. Changes are limited to rollback logic fixes in production code within pkg/daemon/update.go. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR only modifies daemon rollback logic without introducing any deployment manifests, scheduling constraints, or topology-aware configurations. |
| Ote Binary Stdout Contract | ✅ Passed | The PR modifies pkg/daemon/update.go to fix rollback logic for systemd units. No stdout writes (fmt.Print*, klog, log calls, or os.Stdout) were found in the code. The machine-config-operator is a production daemon, not an OTE binary. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR modifies only pkg/daemon/update.go with rollback logic changes; no new Ginkgo e2e tests added. |

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 23, 2026

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/daemon/update.go (1)

1206-1207: Consider centralizing rollback unit selection to reduce future regressions.

The same reverse-diff snippet is duplicated in two paths, and argument order is semantically non-obvious. A helper would make intent explicit.

♻️ Suggested refactor
+func unitsToRestoreOnRollback(failedIgnConfig, previousIgnConfig ign3types.Config) []ign3types.Unit {
+	rollbackUnitDiff := ctrlcommon.GetChangedConfigUnitsByType(&failedIgnConfig, &previousIgnConfig)
+	return slices.Concat(rollbackUnitDiff.Added, rollbackUnitDiff.Updated)
+}
...
-rollbackUnitDiff := ctrlcommon.GetChangedConfigUnitsByType(&newIgnConfig, &oldIgnConfig)
-if err := dn.updateFiles(newIgnConfig, oldIgnConfig, slices.Concat(rollbackUnitDiff.Added, rollbackUnitDiff.Updated), skipCertificateWrite, false); err != nil {
+if err := dn.updateFiles(newIgnConfig, oldIgnConfig, unitsToRestoreOnRollback(newIgnConfig, oldIgnConfig), skipCertificateWrite, false); err != nil {
 	...
 }
...
-rollbackUnitDiff := ctrlcommon.GetChangedConfigUnitsByType(&newIgnConfig, &oldIgnConfig)
-if err := dn.updateFiles(newIgnConfig, oldIgnConfig, slices.Concat(rollbackUnitDiff.Added, rollbackUnitDiff.Updated), false, false); err != nil {
+if err := dn.updateFiles(newIgnConfig, oldIgnConfig, unitsToRestoreOnRollback(newIgnConfig, oldIgnConfig), false, false); err != nil {
 	...
 }

Also applies to: 1427-1428

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/daemon/update.go` around lines 1206 - 1207, The duplicated reverse-diff
logic should be extracted into a single helper to make intent explicit and avoid
argument-order mistakes: create a function (e.g., buildRollbackUnits or
selectRollbackUnits) that accepts oldIgnConfig, newIgnConfig and returns the
concat of rollback unit names (using ctrlcommon.GetChangedConfigUnitsByType and
slices.Concat internally) and use it in both call sites instead of duplicating
the snippet; then call dn.updateFiles(newIgnConfig, oldIgnConfig, helper(...),
skipCertificateWrite, false) so the selection logic is centralized and the
meaning of the unit list is unambiguous.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 1b0af6c2-0d56-409f-bd9b-5f18f649ef9d

📥 Commits

Reviewing files that changed from the base of the PR and between c3a9db7 and fd094b0.

📒 Files selected for processing (1)
  • pkg/daemon/update.go

@yuqi-zhang
Contributor

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 23, 2026
@openshift-ci-robot
Contributor

@yuqi-zhang: This pull request references Jira Issue OCPBUGS-84218, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Contributor

@yuqi-zhang yuqi-zhang left a comment

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 23, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-gcp-op-part1
/test e2e-gcp-op-part2
/test e2e-gcp-op-single-node
/test e2e-hypershift

@openshift-ci
Contributor

openshift-ci Bot commented Apr 23, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pablintino, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [pablintino,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sergiordlr
Contributor

sergiordlr commented Apr 23, 2026

Verified using IPI on AWS with a proxy.

  1. Verify that the MC mentioned in the description works as expected, i.e. it reports a problem when the MCD tries to pull the image, but it does not get stuck in the config drift logic after the rollback; it executes the rollback properly.

  2. We verified it using passwords, sshkeys, units, dropins and files too. To do so we created 2 configs:

A first one to create the elements that we will modify later in the test

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-test-full
spec:
  config:
    ignition:
      version: 3.5.0
    passwd:
      users:
        - name: core
          passwordHash: $6$rounds=5000$saltsalt$abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRS
          sshAuthorizedKeys:
            - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDtest1234567890 test@mco
    storage:
      files:
        - contents:
            source: data:,mco-test-content%0A
          mode: 0644
          overwrite: true
          path: /etc/mco-test-file.conf
    systemd:
      units:
        - contents: |
            [Unit]
            Description=MCO test unit
            [Service]
            Type=oneshot
            ExecStart=/bin/true
            [Install]
            WantedBy=multi-user.target
          enabled: true
          name: mco-test.service
        - dropins:
            - contents: |
                [Service]
                Environment="MCO_TEST=true"
              name: 99-mco-test.conf
          name: wait-for-ipsec-connect.service

A second one modifying those values and forcing a failure in the image (not pullable) to trigger the rollback

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-test-full-v2
spec:
  config:
    ignition:
      version: 3.5.0
    passwd:
      users:
        - name: core
          passwordHash: $6$rounds=5000$saltsalt$zyxwvutsrqponmlkjihgfedcba9876543210ZYXWVUTSRQPONMLKJI
          sshAuthorizedKeys:
            - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDmodified9876543210 test@mco-v2
    storage:
      files:
        - contents:
            source: data:,mco-test-content-v2%0A
          mode: 0644
          overwrite: true
          path: /etc/mco-test-file.conf
    systemd:
      units:
        - contents: |
            [Unit]
            Description=MCO test unit v2
            [Service]
            Type=oneshot
            ExecStart=/bin/true
            [Install]
            WantedBy=multi-user.target
          enabled: true
          name: mco-test.service
        - dropins:
            - contents: |
                [Service]
                Environment="MCO_TEST=v2"
              name: 99-mco-test.conf
          name: wait-for-ipsec-connect.service
  extensions: []
  kernelArguments: []
  osImageURL: quay.io/openshifttest/tc54054fakeimage:latest


We verified that the rollback worked properly and that it neither failed nor blocked the workflow in the config drift checks.

 $  while oc debug -q node/ip-10-0-52-81.us-east-2.compute.internal -- chroot /host cat /etc/mco-test-file.conf; do :; done
 $  while oc debug -q node/ip-10-0-52-81.us-east-2.compute.internal -- chroot /host cat /etc/systemd/system/mco-test.service; do :; done
 $  while oc debug -q node/ip-10-0-52-81.us-east-2.compute.internal -- chroot /host cat /etc/systemd/system/wait-for-ipsec-connect.service.d/99-mco-test.conf; do :; done

We found that the password and SSH keys are likely not being restored. However, this does not make the config drift check fail, and it does not block the workflow. This issue can be fixed in another PR.

We are currently running the regression tests related to "units" and "config drift". Once those test cases pass, we can add the verified label.

PS: all "units" and "config drift" test cases passed. We added the verified label. Thank you for the quick fix!!

@sergiordlr
Contributor

/retest-required

/verified by @sergiordlr

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 24, 2026
@openshift-ci-robot
Contributor

@sergiordlr: This PR has been marked as verified by @sergiordlr.

@pablintino
Contributor Author

/retest-required

@pablintino
Contributor Author

/override ci/prow/e2e-hypershift
Hypershift is red in general and the error seems consistent across changes

@openshift-ci
Contributor

openshift-ci Bot commented Apr 24, 2026

@pablintino: Overrode contexts on behalf of pablintino: ci/prow/e2e-hypershift

@pablintino
Contributor Author

/test unit

@openshift-ci
Contributor

openshift-ci Bot commented Apr 24, 2026

@pablintino: all tests passed!

@openshift-merge-bot openshift-merge-bot Bot merged commit b5d1685 into openshift:main Apr 24, 2026
18 checks passed
@openshift-ci-robot
Contributor

@pablintino: Jira Issue Verification Checks: Jira Issue OCPBUGS-84218
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-84218 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

@openshift-merge-robot
Contributor

Fix included in release 5.0.0-0.nightly-2026-04-25-015310

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria

5 participants