
@rohennes rohennes commented Oct 1, 2025

TELCODOCS-2039: Adding LACP status feature

Version(s):
4.20

Issue:
https://issues.redhat.com/browse/TELCODOCS-2039

Link to docs preview:
https://99902--ocpdocs-pr.netlify.app/openshift-enterprise/latest/networking/hardware_networks/configure-lacp-for-sriov.html

QE review:

  • QE has approved this change.

Additional information:

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 1, 2025
openshift-ci-robot commented Oct 1, 2025

@rohennes: This pull request references TELCODOCS-2039 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 1, 2025
@@ -0,0 +1,16 @@
:_mod-docs-content-type: ASSEMBLY
[id="sriov-lacp-sriov"]
= Switch failure detection for bonded SR-IOV networks
Contributor Author

Is this heading too specific? Are there other use cases for this?

If it is too specific, would something like "High-availability for bonded SR-IOV networks" work better?

nodeName: worker-1
containers:
- name: client-pod
  image: quay.io/openshifttest/hello-openshift:openshift
Contributor Author

@rohennes rohennes Oct 1, 2025

The QE flow used an nginx server, I think. I removed it because it had an internal resource address. Would an nginx image from quay.io work? I wasn't sure whether we would add a step to test traffic, so I didn't proceed further. Thoughts?


In QE we use a custom client with testpmd for sending TCP traffic, but I think an nginx image will work fine.

Contributor Author

Changed to: quay.io/nginx/nginx-unprivileged

@rohennes rohennes force-pushed the TELCODOCS-2039 branch 2 times, most recently from 0d1defd to 8847478 Compare October 7, 2025 16:52
@openshift-ci openshift-ci bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 7, 2025
+
[IMPORTANT]
====
Use only one `PFLACPMonitor` custom resource to monitor each network interface on a node. If you create multiple resources that target the same interface, the PF Status Relay Operator will not process the conflicting configurations and will mark them as `Degraded`.

Nit: it does show as degraded, but if you do an `oc get` you will see it as running.

oc get pflacpmonitors.pfstatusrelay.openshift.io pflacpmonitor-duplicate-worker-0 -n openshift-pf-status-relay-operator -o yaml
Status:
  Degraded:  true
  Error Message:  interfaces [ens5f0 ens5f1] conflict with the ones from PFLACPMonitor pflacpmonitor-worker-0

$ oc get pod -n openshift-pf-status-relay-operator
NAME                                                          READY   STATUS    RESTARTS   AGE
pf-status-relay-ds-pflacpmonitor-worker-0-t9fsd               1/1     Running   0          3m56s
pf-status-relay-operator-controller-manager-9fb548bcf-rt5n4   2/2     Running   0          44h

Contributor Author

I've removed the bit about being degraded and kept that it will not process the conflicting configurations.
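
For illustration, the one-monitor-per-interface rule discussed in this thread can be modeled as a small overlap check. This is a hypothetical Python sketch, not the Operator's implementation: the function name `find_conflicts`, the "first resource wins" ordering, and the plain dictionary input are all assumptions. Only the resource names, interface names, and error message wording come from the output quoted above.

```python
from typing import Dict, List


def find_conflicts(monitors: Dict[str, List[str]]) -> Dict[str, str]:
    """Return {rejected_monitor: error_message} for monitors whose interfaces overlap.

    Monitors are visited in insertion order; the first one to claim an
    interface keeps it (an assumption made for this sketch).
    """
    owner: Dict[str, str] = {}      # interface -> monitor that claimed it
    rejected: Dict[str, str] = {}
    for name, interfaces in monitors.items():
        clashes = [i for i in interfaces if i in owner]
        if clashes:
            first = owner[clashes[0]]
            rejected[name] = (
                f"interfaces [{' '.join(clashes)}] conflict with the ones "
                f"from PFLACPMonitor {first}"
            )
            continue  # a rejected monitor claims no interfaces
        for i in interfaces:
            owner[i] = name
    return rejected


# Names taken from the reviewer's output; both monitors are assumed to target the same node.
monitors = {
    "pflacpmonitor-worker-0": ["ens5f0", "ens5f1"],
    "pflacpmonitor-duplicate-worker-0": ["ens5f0", "ens5f1"],
}
conflicts = find_conflicts(monitors)
print(conflicts)
```

In this model only the later, duplicate resource is rejected, which matches the quoted output where the pod for `pflacpmonitor-worker-0` keeps running.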


.. Exit the pod shell.

.. Simulate an LACP failure on your upstream physical switch.

Here, if you just bring the interface down, then there is no need for LACP. You need to either block all traffic or block only LACP on the switch interface, but the interface needs to stay up.

Contributor Author

@mlguerrero12 Is there any general direction we can give about how to simulate this "silent failure", where the carrier is up but the switch is unresponsive in some way? Or should we go into detail about blocking a port or something?

Slave queue ID: 0
----
+
The client-bond pod detects the link state change and switches to the backup network path without packet loss.

There may be one or two packets lost as the MAC address is refreshed on the switch.

Contributor Author

Removed "without packet loss"
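
To check the failover described in this thread, the bonding status text (the kind of output that ends with `Slave queue ID: 0` in the excerpt above) can be parsed for the active member. The following Python sketch assumes the standard Linux `/proc/net/bonding/<bond>` text layout; the sample text and the interface names `net1` and `net2` are made up for illustration and do not come from the PR.

```python
from typing import Dict, Optional, Tuple

# Illustrative sample only; real output contains more fields per interface.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: net2
MII Status: up

Slave Interface: net1
MII Status: down
Slave queue ID: 0

Slave Interface: net2
MII Status: up
Slave queue ID: 0
"""


def parse_bond_status(text: str) -> Tuple[Optional[str], Dict[str, str]]:
    """Return (active_member, {member: mii_status}) from bonding status text."""
    active: Optional[str] = None
    members: Dict[str, str] = {}
    current: Optional[str] = None   # member section being read, if any
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue                # blank line or a line without "key: value"
        key, value = key.strip(), value.strip()
        if key == "Currently Active Slave":
            active = value
        elif key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            members[current] = value  # bond-level MII status is skipped
    return active, members


active, members = parse_bond_status(SAMPLE)
print(active, members)
```

In the sample, `net1` reports `down` and the bond has moved to `net2`, which is the state you would expect to see after the VF behind `net1` is disabled.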


For workloads using pod-level bonding with SR-IOV virtual functions (VFs), despite an upstream switch failure, an underlying physical function (PF) might still report an `up` state. This creates a silent failure, as attached VFs remain up and pods continue to send traffic to a dead endpoint, causing packet loss.

The PF Status Relay Operator solves this issue by using Link Aggregation Control Protocol (LACP) as an active health check. In this configuration, each physical function (PF) is placed in its own single-member LACP bond with the upstream switch. When the Operator detects an LACP failure on a PF's bond, it propagates this status to the attached VFs. This action triggers the pod's `active-backup` bond to fail over to its backup network path, maintaining high availability.

I'm not sure we should say it propagates the status to the VF. It makes it sound as if there is some type of control plane process between the Operator and the VF on the pod. Rather, it is simply changing the status of the VF on the node from auto to disabled. What do you think?

Contributor Author

Updated to:

When the Operator detects an LACP failure on a PF's bond, it changes the link state of the attached VFs from auto to disabled.
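
The behavior agreed on here can be sketched as a small decision function. This is a hypothetical Python model under stated assumptions, not the Operator's code: the function names are invented, LACP health is passed in as a boolean rather than detected, and restoring `disable` back to `auto` when LACP recovers is an assumption. The rule that VFs set to `enable` are ignored comes from a reviewer comment elsewhere in this conversation. The generated strings use the iproute2 `ip link set <pf> vf <n> state ...` syntax, but the sketch only builds them and does not run anything.

```python
from typing import List, Optional


def desired_vf_state(lacp_up: bool, current: str) -> Optional[str]:
    """Return the link state a VF should move to, or None to leave it alone."""
    if current == "enable":
        return None                       # VFs forced to "enable" are ignored
    target = "auto" if lacp_up else "disable"
    return None if current == target else target


def relay_commands(pf: str, lacp_up: bool, vf_states: List[str]) -> List[str]:
    """Build the iproute2 commands that would apply the desired VF states."""
    commands = []
    for index, current in enumerate(vf_states):
        target = desired_vf_state(lacp_up, current)
        if target is not None:
            commands.append(f"ip link set {pf} vf {index} state {target}")
    return commands


# LACP fails on the PF: VFs in "auto" are disabled, the "enable" VF is untouched.
failover = relay_commands("ens5f0", lacp_up=False, vf_states=["auto", "enable", "auto"])
# LACP recovers (assumed behavior): disabled VFs return to "auto".
recovery = relay_commands("ens5f0", lacp_up=True, vf_states=["disable", "enable", "disable"])
print(failover)
print(recovery)
```

Because the change is made to the VF link state on the node, the pod-side bond only sees its member link go down; no messaging between the Operator and the pod is involved in this model.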

@rohennes rohennes force-pushed the TELCODOCS-2039 branch 2 times, most recently from 26c5c33 to ae31a91 Compare October 8, 2025 12:27
gkopels commented Oct 8, 2025

/lgtm

openshift-ci bot commented Oct 8, 2025

@gkopels: changing LGTM is restricted to collaborators



.Procedure

. Create the `openshift-pfsr-operator` namespace by entering the following command:
Member

openshift-pf-status-relay-operator


.. Select *PF Status Relay Operator* from the list of available Operators, and then click *Install*.

.. On the *Install Operator* page, under *Installed Namespace*, select *Operator recommended Namespace*.
Member

There is a problem here, but I will fix it today and backport it to 4.20. I believe it is safe to leave this as it is.

mode: 802.3ad
options:
  miimon: '100'
  lacp_rate: 'fast'
Member

Somewhere we need to mention, perhaps with a note, that the fast rate must be configured on both sides. It is most important to have it on the switch, but let's just say that both sides need to have the fast rate.

Contributor Author

used callouts and added this as prerequisite
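
As background for the fast-rate requirement, the arithmetic below shows how much the LACP rate changes the time before a silent partner is declared expired. It is a minimal sketch that uses the timer values from the LACP standard (LACPDUs every 1 second at the fast rate and every 30 seconds at the slow rate, with expiry after three missed intervals). The function name is hypothetical, and the sketch leaves out any additional delay from the Operator's own polling, which this conversation does not specify.

```python
# Timer values defined by the LACP standard (IEEE 802.1AX, formerly 802.3ad).
PERIODIC_SECONDS = {"fast": 1, "slow": 30}   # interval between LACPDUs
TIMEOUT_MULTIPLIER = 3                        # partner expires after 3 missed intervals


def lacp_timeout_seconds(rate: str) -> int:
    """Seconds without LACPDUs before the partner is considered expired."""
    return PERIODIC_SECONDS[rate] * TIMEOUT_MULTIPLIER


fast_timeout = lacp_timeout_seconds("fast")
slow_timeout = lacp_timeout_seconds("slow")
print(f"fast: {fast_timeout}s, slow: {slow_timeout}s")
```

The gap between the two values is why a slow rate on either side can leave pods sending traffic to an unresponsive switch for much longer before the failure is noticed.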

.Example `sriovnetworkpolicy-client.yaml` file
[source,yaml]
----
apiVersion: sriovnetwork.openshift.io/v1
Member

you don't need this sriov network

Contributor Author

removed obsolete resource


.. Create a YAML file that defines the `SriovNetwork` resource for the VFs created on `ens5f0` on `worker-1`:
+
.Example `sriovnetwork-client.yaml` file
Member

you don't need this one

Contributor Author

removed obsolete resource

.Example `client-pod.yaml` file
[source,yaml]
----
apiVersion: v1
Member

not needed

Contributor Author

removed obsolete resource

@mlguerrero12
Member

@karampok, this is almost ready, but it would be nice if you have a look at it as well.

@rohennes, @karampok will be the person assigned to this operator from next week on


* You configured pod-level bonding for your SR-IOV networks.

* You installeed the OpenShift CLI (`oc`).

typo

apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: bond10

imo the name probably should be something like "dummySingleInterfaceBondForPFStatusRelayOperator".

Obviously a name cuter than that, that will indicate to the user/admin (admin that did not create it) why that bond with single interface exists

Contributor Author

Updated to example-bond-f0 and example-bond-f1

@rohennes
Contributor Author

rohennes commented Oct 9, 2025

FYI @karampok - I updated all the resources with numbered callouts to explain the configuration a bit more. So let me know if any issues or anything else should be called out. Thanks!


* The physical switch ports connected to the worker nodes are configured for LACP with a fast polling rate.

* The `linkState` is set to `auto` for the SR-IOV VFs that you want to monitor. The default value for SR-IOV VFs is `linkState: auto`.
Member

auto or disable. VFs with link state set to Enable are ignored.

Something like that needs to be mentioned here

@mlguerrero12
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 9, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 13, 2025
openshift-ci bot commented Oct 13, 2025

New changes are detected. LGTM label has been removed.


.Procedure

. Create the project namespace by creating a namespace.yaml file such as the following example:
Contributor

Suggested change
. Create the project namespace by creating a namespace.yaml file such as the following example:
. Create the project namespace by creating a `namespace.yaml` file such as the following example:

nit

Contributor Author

Updated, thanks

openshift-ci bot commented Oct 15, 2025

@rohennes: all tests passed!


@slovern slovern merged commit 60e03d9 into openshift:main Oct 15, 2025
2 checks passed
Contributor

slovern commented Oct 15, 2025

/cherrypick enterprise-4.20

@openshift-cherrypick-robot

@slovern: new pull request created: #100592


Labels

branch/enterprise-4.20 jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
