TELCODOCS-2039: Adding LACP status feature #99902

rohennes · 2025-10-01T10:21:40Z

TELCODOCS-2039: Adding LACP status feature

Version(s):
4.20

Issue:
https://issues.redhat.com/browse/TELCODOCS-2039

Link to docs preview:
https://99902--ocpdocs-pr.netlify.app/openshift-enterprise/latest/networking/hardware_networks/configure-lacp-for-sriov.html

QE review:

QE has approved this change.

Additional information:

openshift-ci-robot · 2025-10-01T10:21:45Z

@rohennes: This pull request references TELCODOCS-2039 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

TELCODOCS-2039: Adding LACP status feature

Version(s):
4.20

Issue:
https://issues.redhat.com/browse/TELCODOCS-2039

Link to docs preview:

QE review:

QE has approved this change.

Additional information:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

rohennes · 2025-10-01T10:23:24Z

networking/hardware_networks/configure-lacp-for-sriov.adoc

@@ -0,0 +1,16 @@
+:_mod-docs-content-type: ASSEMBLY
+[id="sriov-lacp-sriov"]
+= Switch failure detection for bonded SR-IOV networks


Is this heading too specific? Are there other use cases for this?

If it is too specific, would something like "High-availability for bonded SR-IOV networks" work better?


ocpdocs-previewbot · 2025-10-01T10:30:23Z

🤖 Wed Oct 15 18:59:38 - Prow CI generated the docs preview:

https://99902--ocpdocs-pr.netlify.app/openshift-enterprise/latest/networking/hardware_networks/configure-lacp-for-sriov.html

rohennes · 2025-10-01T10:39:29Z

modules/lacp-switch-monitoring.adoc

+  nodeName: worker-1
+  containers:
+    - name: client-pod
+      image: quay.io/openshifttest/hello-openshift:openshift


The QE flow used an nginx server, I think. I removed it because it had an internal resource address. Would an nginx image from quay.io work? I wasn't sure whether we would add a step to test traffic, so I didn't proceed further. Thoughts?

In QE we use a custom client with testpmd for sending TCP traffic, but I think nginx will work fine.

Changed to: quay.io/nginx/nginx-unprivileged
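
For reference, a minimal sketch of how the client pod might look with the replacement image. The `nodeName`, container name, and image come from this thread, and the pod name `client-bond` appears elsewhere in the module. The namespace and the network attachment names in the annotation are hypothetical placeholders, not values from the PR.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: client-bond
  namespace: example-project   # hypothetical project namespace
  annotations:
    # Hypothetical attachment names: two SR-IOV networks plus the bond built on top of them.
    k8s.v1.cni.cncf.io/networks: example-sriov-net-f0, example-sriov-net-f1, example-bond-net
spec:
  nodeName: worker-1
  containers:
    - name: client-pod
      image: quay.io/nginx/nginx-unprivileged
```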


gkopels · 2025-10-08T08:48:50Z

modules/lacp-switch-monitoring.adoc

+
+[IMPORTANT]
+====
+Use only one `PFLACPMonitor` custom resource to monitor each network interface on a node. If you create multiple resources that target the same interface, the PF Status Relay Operator will not process the conflicting configurations and will mark them as `Degraded`.


NIT: it does show as degraded, but if you do an `oc get` you will see it as running.

$ oc get pflacpmonitors.pfstatusrelay.openshift.io pflacpmonitor-duplicate-worker-0 -n openshift-pf-status-relay-operator -o yaml
Status:
  Degraded: true
  Error Message: interfaces [ens5f0 ens5f1] conflict with the ones from PFLACPMonitor pflacpmonitor-worker-0

$ oc get pod -n openshift-pf-status-relay-operator
NAME                                                          READY   STATUS    RESTARTS   AGE
pf-status-relay-ds-pflacpmonitor-worker-0-t9fsd               1/1     Running   0          3m56s
pf-status-relay-operator-controller-manager-9fb548bcf-rt5n4   2/2     Running   0          44h

I've removed the bit about being degraded and kept that it will not process the conflicting configurations.

gkopels · 2025-10-08T08:55:14Z

modules/lacp-switch-monitoring.adoc

+  nodeName: worker-1
+  containers:
+    - name: client-pod
+      image: quay.io/openshifttest/hello-openshift:openshift


In QE we use a custom client with testpmd for sending TCP traffic, but I think nginx will work fine.

gkopels · 2025-10-08T08:58:41Z

modules/lacp-switch-monitoring.adoc

+
+.. Exit the pod shell.
+
+.. Simulate an LACP failure on your upstream physical switch.


Here, if you just bring the interface down, then there is no need for LACP. You need to either block all traffic or block just LACP on the switch interface. But the interface needs to be up.

@mlguerrero12 Is there any general direction we can give about how to simulate this "silent failure", where the carrier is up but the switch is unresponsive in some way? Or should we go into detail about blocking a port or something?

gkopels · 2025-10-08T09:01:38Z

modules/lacp-switch-monitoring.adoc

+Slave queue ID: 0
+----
+
+The client-bond pod detects the link state change and switches to the backup network path without packet loss.


There may be one or two packets lost as the MAC is refreshed on the switch.

Removed "without packet loss"
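
As context for the failover behavior discussed here, the pod-side bond could be sketched as a Bond CNI `NetworkAttachmentDefinition` in `active-backup` mode. This is an illustrative sketch only: the resource name, namespace, and the `ipam` subnet are hypothetical, and `net1` and `net2` assume that the two SR-IOV attachments are the first two secondary networks listed for the pod.

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: example-bond-net       # hypothetical name
  namespace: example-project   # hypothetical project namespace
spec:
  # The bond enslaves the two SR-IOV VF interfaces that already exist inside the pod.
  # With miimon set, the bond checks link state periodically and fails over
  # when the active VF link goes down.
  config: |-
    {
      "type": "bond",
      "cniVersion": "0.3.1",
      "name": "example-bond-net",
      "mode": "active-backup",
      "failOverMac": 1,
      "linksInContainer": true,
      "miimon": "100",
      "links": [
        {"name": "net1"},
        {"name": "net2"}
      ],
      "ipam": {
        "type": "host-local",
        "subnet": "192.0.2.0/24"
      }
    }
```

Because failover depends on the bond noticing the VF link change and the switch relearning the MAC address, a small amount of packet loss during the switchover is consistent with this configuration.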

gkopels · 2025-10-08T09:06:50Z

networking/hardware_networks/configure-lacp-for-sriov.adoc

+
+For workloads using pod-level bonding with SR-IOV virtual functions (VFs), despite an upstream switch failure, an underlying physical function (PF) might still report an `up` state. This creates a silent failure, as attached VFs remain up and pods continue to send traffic to a dead endpoint, causing packet loss.
+
+The PF Status Relay Operator solves this issue by using Link Aggregation Control Protocol (LACP) as an active health check. In this configuration, each physical function (PF) is placed in its own single-member LACP bond with the upstream switch. When the Operator detects an LACP failure on a PF's bond, it propagates this status to the attached VFs. This action triggers the pod's `active-backup` bond to fail over to its backup network path, maintaining high availability.


I'm not sure we should say it propagates the status to the VF. It makes it sound as if there is some type of control plane process between the Operator and the VF in the pod. Rather, it is simply changing the status of the VF on the node from auto to disabled. What do you think?

Updated to:

When the Operator detects an LACP failure on a PF's bond, it changes the link state of the attached VFs from auto to disabled.

gkopels · 2025-10-08T12:57:00Z

/lgtm

openshift-ci · 2025-10-08T12:57:12Z

@gkopels: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

mlguerrero12 · 2025-10-09T07:21:17Z

modules/installing-pfsr-operator-cli.adoc

+
+.Procedure
+
+. Create the `openshift-pfsr-operator` namespace by entering the following command:


The namespace should be `openshift-pf-status-relay-operator`.
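
For illustration, the declarative equivalent of the namespace step with the corrected name might look like the following sketch. Only the namespace name is taken from this thread. The procedure itself uses a command, so treat the manifest form as an assumption.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-pf-status-relay-operator
```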

mlguerrero12 · 2025-10-09T07:23:23Z

modules/installing-pfsr-operator-console.adoc

+
+.. Select *PF Status Relay Operator* from the list of available Operators, and then click *Install*.
+
+.. On the *Install Operator* page, under *Installed Namespace*, select *Operator recommended Namespace*.


There is a problem here, but I will fix it today and backport it to 4.20. I believe it is safe to leave this as it is.

mlguerrero12 · 2025-10-09T07:26:59Z

modules/lacp-switch-monitoring.adoc

+          mode: 802.3ad
+          options:
+            miimon: '100'
+            lacp_rate: 'fast'


Somewhere we need to mention, perhaps in a note, that the fast rate must be configured on both sides. It is most important to have it on the switch, but let's just say both sides need the fast rate.

Used callouts and added this as a prerequisite.
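
A minimal sketch of the node-side half of that requirement, assuming an NMState `NodeNetworkConfigurationPolicy` that places one PF in a single-member LACP bond. The `mode`, `miimon`, and `lacp_rate` values, the policy name `example-bond-f0`, the node `worker-1`, and the interface `ens5f0` appear in this PR. The bond interface name and the node selector label are assumptions.

```yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: example-bond-f0
spec:
  nodeSelector:
    kubernetes.io/hostname: worker-1   # assumed selector for the node in the example
  desiredState:
    interfaces:
      - name: bond-f0                  # hypothetical bond interface name
        type: bond
        state: up
        link-aggregation:
          mode: 802.3ad
          options:
            miimon: '100'
            # The upstream switch port must also be configured for the fast LACP rate.
            lacp_rate: 'fast'
          port:
            - ens5f0                   # single-member bond: one PF per bond
```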

mlguerrero12 · 2025-10-09T07:31:47Z

modules/lacp-switch-monitoring.adoc

+.Example `sriovnetworkpolicy-client.yaml` file
+[source,yaml]
+----
+apiVersion: sriovnetwork.openshift.io/v1


you don't need this sriov network

removed obsolete resource

mlguerrero12 · 2025-10-09T07:33:02Z

modules/lacp-switch-monitoring.adoc

+
+.. Create a YAML file that defines the `SriovNetwork` resource for the VFs created on `ens5f0` on `worker-1`:
+
+.Example `sriovnetwork-client.yaml` file


you don't need this one

removed obsolete resource

mlguerrero12 · 2025-10-09T07:33:43Z

modules/lacp-switch-monitoring.adoc

+.Example `client-pod.yaml` file
+[source,yaml]
+----
+apiVersion: v1


removed obsolete resource

mlguerrero12 · 2025-10-09T07:39:37Z

@karampok, this is almost ready, but it would be nice if you have a look at it as well.

@rohennes, @karampok will be the person assigned to this operator from next week on

karampok · 2025-10-09T09:49:34Z

modules/installing-pfsr-operator-cli.adoc

+
+* You configured pod-level bonding for your SR-IOV networks.
+
+* You installeed the OpenShift CLI (`oc`).


karampok · 2025-10-09T09:54:42Z

modules/lacp-switch-monitoring.adoc

+apiVersion: nmstate.io/v1
+kind: NodeNetworkConfigurationPolicy
+metadata:
+  name: bond10


IMO the name should probably be something like "dummySingleInterfaceBondForPFStatusRelayOperator".

Obviously a cuter name than that, one that tells a user or admin who did not create the bond why a bond with a single interface exists.

Updated to `example-bond-f0` and `example-bond-f1`.

rohennes · 2025-10-09T10:35:01Z

FYI @karampok - I updated all the resources with numbered callouts to explain the configuration a bit more. Let me know if there are any issues or if anything else should be called out. Thanks!

mlguerrero12 · 2025-10-09T12:03:09Z

modules/lacp-switch-monitoring.adoc

+
+* The physical switch ports connected to the worker nodes are configured for LACP with a fast polling rate.
+
+* The `linkState` is set to `auto` for the SR-IOV VFs that you want to monitor. The default value for SR-IOV VFs is `linkState: auto`.


`auto` or `disable`. VFs with link state set to `enable` are ignored.

Something like that needs to be mentioned here.
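
To make the prerequisite concrete, here is a hedged sketch of where the link state is set on an `SriovNetwork` resource. The resource name, `resourceName`, and target namespace are hypothetical. Only the `linkState` field and its values are the point of the example.

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: example-sriov-net-f0            # hypothetical name
  namespace: openshift-sriov-network-operator
spec:
  resourceName: examplenicf0            # hypothetical resource name from an SR-IOV node policy
  networkNamespace: example-project     # hypothetical project namespace
  # auto is the default. Per the comment above, VFs set to enable are ignored by the monitor.
  linkState: auto
```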

mlguerrero12 · 2025-10-09T13:30:47Z

/lgtm

openshift-ci · 2025-10-13T14:30:09Z

New changes are detected. LGTM label has been removed.

slovern · 2025-10-15T15:38:24Z

modules/lacp-switch-monitoring.adoc

+
+.Procedure
+
+. Create the project namespace by creating a namespace.yaml file such as the following example:


Suggested change

. Create the project namespace by creating a namespace.yaml file such as the following example:

. Create the project namespace by creating a `namespace.yaml` file such as the following example:

nit

Updated, thanks

openshift-ci · 2025-10-15T19:04:08Z

@rohennes: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

slovern · 2025-10-15T20:43:28Z

/cherrypick enterprise-4.20

openshift-cherrypick-robot · 2025-10-15T20:44:19Z

@slovern: new pull request created: #100592

In response to this:

/cherrypick enterprise-4.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) Oct 1, 2025

openshift-ci bot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) Oct 1, 2025

rohennes force-pushed the TELCODOCS-2039 branch from b3fd9ce to 9b4c0d4 on October 1, 2025 10:22

rohennes commented Oct 1, 2025

rohennes commented Oct 1, 2025

rohennes force-pushed the TELCODOCS-2039 branch 2 times, most recently from 0d1defd to 8847478 on October 7, 2025 16:52

openshift-ci bot added the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files.) and removed the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) Oct 7, 2025

ocpdocs-vale-bot reviewed Oct 7, 2025 (modules/installing-pfsr-operator-cli.adoc, outdated)

ocpdocs-vale-bot reviewed Oct 7, 2025 (modules/installing-pfsr-operator-console.adoc, outdated)

rohennes force-pushed the TELCODOCS-2039 branch from 8847478 to 6277397 on October 7, 2025 18:04

gkopels reviewed Oct 8, 2025

rohennes force-pushed the TELCODOCS-2039 branch 2 times, most recently from 26c5c33 to ae31a91 on October 8, 2025 12:27

mlguerrero12 reviewed Oct 9, 2025

rohennes force-pushed the TELCODOCS-2039 branch from ae31a91 to a2ed7db on October 9, 2025 09:39

karampok reviewed Oct 9, 2025

rohennes force-pushed the TELCODOCS-2039 branch from a2ed7db to 8bc5a52 on October 9, 2025 10:33

mlguerrero12 reviewed Oct 9, 2025

rohennes force-pushed the TELCODOCS-2039 branch from 8bc5a52 to cce4ae1 on October 9, 2025 13:09

openshift-ci bot assigned mlguerrero12 Oct 9, 2025

openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Oct 9, 2025

rohennes force-pushed the TELCODOCS-2039 branch from cce4ae1 to 62178c5 on October 13, 2025 14:30

openshift-ci bot removed the lgtm label (Indicates that a PR is ready to be merged.) Oct 13, 2025

slovern added the branch/enterprise-4.20 label Oct 15, 2025

slovern added this to the Planned for 4.20 GA milestone Oct 15, 2025

slovern reviewed Oct 15, 2025

Commit d33e5e7: TELCODOCS-2039: Adding LACP status feature

rohennes force-pushed the TELCODOCS-2039 branch from 62178c5 to d33e5e7 on October 15, 2025 18:45

slovern merged commit 60e03d9 into openshift:main Oct 15, 2025 (2 checks passed)

openshift-cherrypick-robot mentioned this pull request Oct 15, 2025: [enterprise-4.20] TELCODOCS-2039: Adding LACP status feature #100592 (Merged)

