[release-4.22] OCPBUGS-83540: order corosync after OVS configuration on TNF clusters #5852
On fencing-induced reboots, configure-ovs.sh creates br-ex and migrates the corosync NIC (enp2s0) to an OVS port, stripping the corosync address. knet detects link-down within 84ms, the 3000ms TOTEM token timeout fires, and the surviving node fences the rebooted node again, creating a reboot loop (up to 8 cycles observed in CI).

Add a systemd drop-in for corosync.service on TNF clusters that orders it after ovs-configuration.service. Both Wants= and After= are required: After= alone is insufficient because ovs-configuration is in a different activation chain (kubelet-dependencies.target vs multi-user.target), so systemd ignores the ordering unless Wants= pulls ovs-configuration into corosync's start transaction.

Verified on a live two-node cluster: with the drop-in, corosync starts only after ovs-configuration completes. No knet link-down, no TOTEM timeout, no fencing loop.
The Wants= directive in the corosync drop-in pulls ovs-configuration into corosync's start transaction. On initial install, corosync starts ~22 minutes after ovs-configuration (after CEO handover), creating a new systemd transaction that re-triggers the oneshot service. The second run of configure-ovs.sh clears ovn-remote (line 13, unconditional), breaking OVN pod networking during installation.

Add a ConditionPathExists guard on ovs-configuration.service for TNF clusters. configure-ovs.sh already creates /var/run/ovs-config-executed (line 22) on every successful run. Since /var/run is tmpfs, the marker resets on reboot, allowing normal execution on fencing-induced reboots while preventing re-execution within the same boot.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The drop-in YAML used the same filename (ovs-configuration.service.yaml)
as the base unit in templates/common/_base/units/. MCO's template renderer
keys by filename with last-write-wins semantics (render.go:223), so the
TNF drop-in-only file replaced the base unit entirely — losing ExecStart,
Type=oneshot, and all ordering directives. This prevented configure-ovs.sh
from running at all, causing bootstrap failures on TNF clusters.
Rename to ovs-configuration-skip-rerun.service.yaml to avoid the collision.
The YAML's internal name field ("ovs-configuration.service") still correctly
targets the unit for the drop-in.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
/lgtm
/approve
/jira refresh
@fonta-rh: This pull request references Jira Issue OCPBUGS-81572, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/label backport-risk-assessed
/retitle [[release-4.22] OCPBUGS-83540: order corosync after OVS configuration on TNF clusters]
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-83540, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@fonta-rh: This pull request references Jira Issue OCPBUGS-83540, which is invalid:
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: eggfoobar, fonta-rh, openshift-cherrypick-robot, pablintino
The full list of commands accepted by this bot can be found here. The pull request process is described here.
/verified by @fonta-rh in original PR
@fonta-rh: This PR has been marked as verified.
/jira refresh
@fonta-rh: This pull request references Jira Issue OCPBUGS-83540, which is invalid:
/jira refresh
@fonta-rh: This pull request references Jira Issue OCPBUGS-83540, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug
/retitle [release-4.22] OCPBUGS-83540: order corosync after OVS configuration on TNF clusters]
@openshift-cherrypick-robot: The following test failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Merged commit efd28f7 into openshift:release-4.22
@openshift-cherrypick-robot: Jira Issue Verification Checks: Jira Issue OCPBUGS-83540
Jira Issue OCPBUGS-83540 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload.
Fix included in release 4.22.0-0.nightly-2026-04-16-185359
This is an automated cherry-pick of #5834
/assign eggfoobar