-
Notifications
You must be signed in to change notification settings - Fork 1.8k
TELCODOCS-2039: Adding LACP status feature #99902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@rohennes: This pull request references TELCODOCS-2039 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
1 similar comment
|
@rohennes: This pull request references TELCODOCS-2039 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
b3fd9ce to
9b4c0d4
Compare
| @@ -0,0 +1,16 @@ | |||
| :_mod-docs-content-type: ASSEMBLY | |||
| [id="sriov-lacp-sriov"] | |||
| = Switch failure detection for bonded SR-IOV networks | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this heading too specific? Are there other use cases for this?
If it is too specific, would something like "High-availability for bonded SR-IOV networks" work better?
|
🤖 Wed Oct 15 18:59:38 - Prow CI generated the docs preview: |
modules/lacp-switch-monitoring.adoc
Outdated
| nodeName: worker-1 | ||
| containers: | ||
| - name: client-pod | ||
| image: quay.io/openshifttest/hello-openshift:openshift |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The QE flow used an nginix server I think. I just removed it as it had internal resource address. Would a nginix image from quay.io work? I wasn't sure if we would add a step to test traffic or not so I didn't proceed further. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In QE we use a custom client with testpmd for sending tcp traffic but I think a nginx will work fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to: quay.io/nginx/nginx-unprivileged
|
@rohennes: This pull request references TELCODOCS-2039 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
0d1defd to
8847478
Compare
8847478 to
6277397
Compare
modules/lacp-switch-monitoring.adoc
Outdated
| + | ||
| [IMPORTANT] | ||
| ==== | ||
| Use only one `PFLACPMonitor` custom resource to monitor each network interface on a node. If you create multiple resources that target the same interface, the the PF Status Relay Operator will not process the conflicting configurations and will mark them as `Degraded`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT it does show as degraded but if you do a oc get you will see it as running
oc get pflacpmonitors.pfstatusrelay.openshift.io pflacpmonitor-duplicate-worker-0 -n openshift-pf-status-relay-operator -o yaml
Status: Degraded: true Error Message: interfaces [ens5f0 ens5f1] conflict with the ones from PFLACPMonitor pflacpmonitor-worker-0
$ oc get pod -n openshift-pf-status-relay-operator NAME READY STATUS RESTARTS AGE pf-status-relay-ds-pflacpmonitor-worker-0-t9fsd 1/1 Running 0 3m56s pf-status-relay-operator-controller-manager-9fb548bcf-rt5n4 2/2 Running 0 44h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed the bit about being degraded and kept that it will not process the conflicting configurations.
modules/lacp-switch-monitoring.adoc
Outdated
| nodeName: worker-1 | ||
| containers: | ||
| - name: client-pod | ||
| image: quay.io/openshifttest/hello-openshift:openshift |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In QE we use a custom client with testpmd for sending tcp traffic but I think a nginx will work fine.
modules/lacp-switch-monitoring.adoc
Outdated
|
|
||
| .. Exit the pod shell. | ||
|
|
||
| .. Simulate an LACP failure on your upstream physical switch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here if you just bring the interface down then there is no need for LACP. You need to either block all traffic or just LACP on the switch interface. BUt the interface needs to be up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mlguerrero12 Is there any general direction we can give about how to simulate this "silent failure" where the carrier is up but the switch is unresponsive in some way. Or should we go into detail about blocking a port or something? (edited)
modules/lacp-switch-monitoring.adoc
Outdated
| Slave queue ID: 0 | ||
| ---- | ||
| + | ||
| The client-bond pod detects the link state change and switches to the backup network path without packet loss. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There maybe one or two packets loss as the mac is refreshed on the switch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed "without packet loss"
|
|
||
| For workloads using pod-level bonding with SR-IOV virtual functions (VFs), despite an upstream switch failure, an underlying physical function (PF) might still report an `up` state. This creates a silent failure, as attached VFs remain up and pods continue to send traffic to a dead endpoint, causing packet loss. | ||
|
|
||
| The PF Status Relay Operator solves this issue by using Link Aggregation Control Protocol (LACP) as an active health check. In this configuration, each physical function (PF) is placed in its own single-member LACP bond with the upstream switch. When the Operator detects an LACP failure on a PF's bond, it propagates this status to the attached VFs. This action triggers the pod's `active-backup` bond to fail over to its backup network path, maintaining high availability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im not sure we should say it propagates the status to the VF. It makes it sounds as if there is some type of control plane process between the operator and the VF on the pod. Rather is simply changing the status of the VF on the node from auto to disabled. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to:
When the Operator detects an LACP failure on a PF's bond, it changes the link state of the attached VFs from auto to disabled.
26c5c33 to
ae31a91
Compare
|
/lgtm |
|
@gkopels: changing LGTM is restricted to collaborators In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
|
||
| .Procedure | ||
|
|
||
| . Create the `openshift-pfsr-operator` namespace by entering the following command: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
openshift-pf-status-relay-operator
|
|
||
| .. Select *PF Status Relay Operator* from the list of available Operators, and then click *Install*. | ||
|
|
||
| .. On the *Install Operator* page, under *Installed Namespace*, select *Operator recommended Namespace*. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a problem here but I will fix it today and backport it to 4.20. I believe it is safe to leave this as it is
modules/lacp-switch-monitoring.adoc
Outdated
| mode: 802.3ad | ||
| options: | ||
| miimon: '100' | ||
| lacp_rate: 'fast' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
somewhere we need to mention, perhaps with a note that we need fast rate to be configured on both sides. I mean, it is very important to have it on the switch but let's just say both sides need to have fast rate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
used callouts and added this as prerequisite
modules/lacp-switch-monitoring.adoc
Outdated
| .Example `sriovnetworkpolicy-client.yaml` file | ||
| [source,yaml] | ||
| ---- | ||
| apiVersion: sriovnetwork.openshift.io/v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need this sriov network
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed obsolete resource
modules/lacp-switch-monitoring.adoc
Outdated
|
|
||
| .. Create a YAML file that defines the `SriovNetwork` resource for the VFs created on `ens5f0` on `worker-1`: | ||
| + | ||
| .Example `sriovnetwork-client.yaml` file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed obsolete resource
modules/lacp-switch-monitoring.adoc
Outdated
| .Example `client-pod.yaml` file | ||
| [source,yaml] | ||
| ---- | ||
| apiVersion: v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed obsolete resource
ae31a91 to
a2ed7db
Compare
|
|
||
| * You configured pod-level bonding for your SR-IOV networks. | ||
|
|
||
| * You installeed the OpenShift CLI (`oc`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
modules/lacp-switch-monitoring.adoc
Outdated
| apiVersion: nmstate.io/v1 | ||
| kind: NodeNetworkConfigurationPolicy | ||
| metadata: | ||
| name: bond10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo the name probably should be something like "dummySingleInterfaceBondForPFStatusRelayOperator".
Obviously a name cuter than that, that will indicate to the user/admin (admin that did not create it) why that bond with single interface exists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to example-bond-f0 and example-bond-f1
|
@rohennes: This pull request references TELCODOCS-2039 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
a2ed7db to
8bc5a52
Compare
|
FYI @karampok - I updated all the resources with numbered callouts to explain the configuration a bit more. So let me know if any issues or anything else should be called out. Thanks! |
modules/lacp-switch-monitoring.adoc
Outdated
|
|
||
| * The physical switch ports connected to the worker nodes are configured for LACP with a fast polling rate. | ||
|
|
||
| * The `linkState` is set to `auto` for the SR-IOV VFs that you want to monitor. The default value for SR-IOV VFs is `linkState: auto`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto or disable. VFs with link state set to Enable are ignored.
Something like that needs to be mentioned here
8bc5a52 to
cce4ae1
Compare
|
/lgtm |
cce4ae1 to
62178c5
Compare
|
New changes are detected. LGTM label has been removed. |
modules/lacp-switch-monitoring.adoc
Outdated
|
|
||
| .Procedure | ||
|
|
||
| . Create the project namespace by creating a namespace.yaml file such as the following example: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| . Create the project namespace by creating a namespace.yaml file such as the following example: | |
| . Create the project namespace by creating a `namespace.yaml` file such as the following example: |
nit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks
62178c5 to
d33e5e7
Compare
|
@rohennes: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/cherrypick enterprise-4.20 |
|
@slovern: new pull request created: #100592 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
TELCODOCS-2039: Adding LACP status feature
Version(s):
4.20
Issue:
https://issues.redhat.com/browse/TELCODOCS-2039
Link to docs preview:
https://99902--ocpdocs-pr.netlify.app/openshift-enterprise/latest/networking/hardware_networks/configure-lacp-for-sriov.html
QE review:
Additional information: