
Adding CFM-1.1: CFM over ETHoCWoMPLSoGRE #4859

Open
ASHNA-AGGARWAL-KEYSIGHT wants to merge 5 commits into openconfig:main from open-traffic-generator:fpt_otg_CFM_1_1

Conversation

Contributor

ASHNA-AGGARWAL-KEYSIGHT commented Nov 26, 2025

ASHNA-AGGARWAL-KEYSIGHT requested a review from a team as a code owner November 26, 2025 15:39

OpenConfigBot commented Nov 26, 2025

Pull Request Functional Test Report for #4859 / 2c93e90

Virtual Devices

| Device | Test | Test Documentation | Job | Raw Log |
| --- | --- | --- | --- | --- |
| Arista cEOS | status | CFM-1.1: CFM over ETHoCWoMPLSoGRE | | |
| Cisco 8000E | status | CFM-1.1: CFM over ETHoCWoMPLSoGRE | | |
| Cisco XRd | status | CFM-1.1: CFM over ETHoCWoMPLSoGRE | | |
| Juniper ncPTX | status | CFM-1.1: CFM over ETHoCWoMPLSoGRE | | |
| Nokia SR Linux | status | CFM-1.1: CFM over ETHoCWoMPLSoGRE | | |
| Openconfig Lemming | status | CFM-1.1: CFM over ETHoCWoMPLSoGRE | | |

Hardware Devices

| Device | Test | Test Documentation | Raw Log |
| --- | --- | --- | --- |
| Arista 7808 | status | CFM-1.1: CFM over ETHoCWoMPLSoGRE | |
| Cisco 8808 | status | CFM-1.1: CFM over ETHoCWoMPLSoGRE | |
| Juniper PTX10008 | status | CFM-1.1: CFM over ETHoCWoMPLSoGRE | |
| Nokia 7250 IXR-10e | status | CFM-1.1: CFM over ETHoCWoMPLSoGRE | |


@gemini-code-assist
Contributor

Summary of Changes

Hello @ASHNA-AGGARWAL-KEYSIGHT, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a new, extensive feature test for Connectivity Fault Management (CFM) within a complex network topology utilizing Ethernet over Control Word over MPLS over GRE encapsulation. The changes enable robust testing of CFM functionalities, including session lifecycle, packet correctness, fault detection, and performance metrics, by enhancing existing configuration utilities and introducing new validation mechanisms to ensure proper operation across diverse network devices.

Highlights

  • New CFM Feature Test: Introduced a comprehensive feature test for Connectivity Fault Management (CFM) over Ethernet with Control Word over MPLS over GRE (ETHoCWoMPLSoGRE) encapsulation. This test suite covers CFM session establishment, packet integrity, alarm detection, loss threshold configuration, and performance measurements (delay and loss).
  • CFM Configuration and Validation Utilities: Added new Go functions and structs in internal/cfgplugins/cfm.go to streamline CFM configuration (domains, associations, measurement profiles) and validation (session status, dead timer, alarm detection, delay/loss measurements) on DUTs, including CLI-based fallbacks for platforms that do not fully support OpenConfig for CFM.
  • Enhanced Network Configuration Plugins: Updated existing configuration plugins to support the complex networking setup required for this CFM test, including flexible aggregate interface handling, VLAN client encapsulation, MPLS static pseudowires, and GRE next-hop groups with multiple IPs. Platform-specific CLI configurations are provided where OpenConfig models are not yet supported.
  • Packet Validation Improvements: Extended packet validation helpers to include VLAN header validation and specific checks for MPLS Control Word headers, crucial for verifying the ETHoCWoMPLSoGRE encapsulation.
  • Platform Deviations: Added new deviation flags for mpls_static_pseudowire_oc_unsupported and vlan_client_encapsulation_oc_unsupported to metadata.proto, reflecting platform-specific limitations and the need for CLI-based configurations on certain devices (e.g., Arista).

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive new test suite for CFM over ETHoCWoMPLSoGRE (CFM-1.1), including a large test file, numerous helper functions, and necessary deviations, particularly for Arista devices which rely on CLI-based configurations. While the changes are extensive and cover a new feature area, the review identified several critical and high-severity issues. These include a critical bug in the OpenConfig implementation for GRE next-hop groups, incorrect packet validation logic for CCM intervals and control words, and inverted logic in test validation functions. Additionally, there are significant concerns regarding maintainability and test isolation due to the use of global variables and stateful package design. Several style guide violations, such as the use of time.Sleep and disallowed IP address ranges, were also noted. Addressing these issues is crucial for the correctness, reliability, and maintainability of this new test suite.

Comment on lines +378 to +385
ueh1 := params.NetworkInstance.GetOrCreateStatic().GetOrCreateNextHop(params.NexthopGroupName).GetOrCreateEncapHeader(1)
for _, addr := range params.DstAddr {
    ueh1.GetOrCreateUdpV4().SetDstIp(addr)
}

for _, addr := range params.SrcAddr {
    ueh1.GetOrCreateUdpV4().SetSrcIp(addr)
}
Contributor

critical

The OpenConfig implementation for configuring multiple next-hops is incorrect. The loops for DstAddr and SrcAddr repeatedly call SetDstIp and SetSrcIp on the same UdpV4 object (ueh1). This overwrites the values in each iteration, meaning only the last IP from each slice will be configured. To configure multiple next-hops for the group, you should create multiple NextHop entries, each with its own encapsulation header configuration.
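
To make the overwrite problem concrete, here is a minimal sketch of building one entry per address pair rather than reusing a single header. The `nextHop` struct and `buildNextHops` function are hypothetical stand-ins, not the ygot-generated `oc` types, and pairing `srcs[i]` with `dsts[i]` is an assumption about the intended mapping.

```go
package main

import (
	"errors"
	"fmt"
)

// nextHop is a hypothetical stand-in for one next-hop entry that carries its
// own encapsulation header. It is not the ygot-generated oc type.
type nextHop struct {
	Index uint64
	SrcIP string
	DstIP string
}

// buildNextHops emits one entry per source/destination pair, so no address
// is overwritten by a later loop iteration.
// Assumption: srcs[i] belongs with dsts[i].
func buildNextHops(srcs, dsts []string) ([]nextHop, error) {
	if len(srcs) != len(dsts) {
		return nil, errors.New("source and destination lists differ in length")
	}
	hops := make([]nextHop, 0, len(dsts))
	for i := range dsts {
		hops = append(hops, nextHop{Index: uint64(i), SrcIP: srcs[i], DstIP: dsts[i]})
	}
	return hops, nil
}

func main() {
	hops, err := buildNextHops(
		[]string{"198.51.100.1", "198.51.100.2"},
		[]string{"203.0.113.1", "203.0.113.2"},
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, h := range hops {
		fmt.Printf("next-hop %d: %s -> %s\n", h.Index, h.SrcIP, h.DstIP)
	}
}
```

In the real plugin each entry would presumably map to its own `GetOrCreateNextHop(...)` call; how the entries are then attached to the next-hop group depends on the model and is not shown here.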

Comment on lines 96 to 98
var (
    sfBatch *gnmi.SetBatch
    oam *oc.Oam
Contributor

high

The use of package-level variables sfBatch and oam introduces global state into the test file. This can lead to unpredictable behavior, race conditions if tests are run in parallel, and makes tests difficult to reason about individually. These variables should be scoped locally within the functions that use them (e.g., configureDut or the test functions themselves) to ensure proper test isolation.
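
One possible shape for the local scoping suggested above is sketched below. `testState`, `newTestState`, and `configureDomain` are hypothetical names, and the plain slice and map only stand in for `*gnmi.SetBatch` and `*oc.Oam` so that the example runs with the standard library alone.

```go
package main

import "fmt"

// testState is a hypothetical container for what the PR keeps in
// package-level variables.
type testState struct {
	pending []string          // stand-in for *gnmi.SetBatch
	oam     map[string]string // stand-in for *oc.Oam
}

// newTestState returns a fresh, independent state for one test.
func newTestState() *testState {
	return &testState{oam: map[string]string{}}
}

// configureDomain records a change on the state it is given, never on a
// shared global, so two tests cannot observe each other's configuration.
func configureDomain(s *testState, name, level string) {
	s.oam[name] = level
	s.pending = append(s.pending, "domain:"+name)
}

func main() {
	a, b := newTestState(), newTestState()
	configureDomain(a, "D1", "5")
	fmt.Println(len(a.pending), len(b.pending))
}
```

Each test function would create its own state and pass it down, which also keeps the helpers safe if tests are ever run in parallel.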

Comment on lines 567 to 572
if uniqueCount < tunnelCount {
    t.Log("flows are not ECMP'd across all available tunnels as expected")
    return
}

t.Errorf("error: traffic was load-balanced across %d GRE sources.", uniqueCount)
Contributor

high

The logic in this function appears to be inverted. It logs a message and returns successfully if uniqueCount < tunnelCount (a failure condition), but reports an error via t.Errorf if uniqueCount >= tunnelCount (the success condition). This will cause the test to fail when it should pass and pass when it should fail. The logic should be corrected to report an error when the number of unique source IPs does not match the expected tunnelCount.

Suggested change
-if uniqueCount < tunnelCount {
-    t.Log("flows are not ECMP'd across all available tunnels as expected")
-    return
-}
-t.Errorf("error: traffic was load-balanced across %d GRE sources.", uniqueCount)
+if uniqueCount != tunnelCount {
+    t.Errorf("traffic was not load-balanced across all %d GRE sources, found %d unique sources", tunnelCount, uniqueCount)
+    return
+}
+t.Logf("traffic was correctly load-balanced across all %d GRE sources.", uniqueCount)

Contributor

If this is always one tunnel and the CFM is not ECMP'd, do we need the error? Under what condition is this expected to be an error?

Contributor Author

We do not require an error notification when CFM is not ECMP'd. Instead, we need an error report when we receive PDUs on all 16 tunnels. The README specifies that we should verify that CCM PDUs are flowing consistently and not being ECMP'd across all available tunnels. I have added the necessary check accordingly.
Let me know if this clarifies your question.
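
The intent described in this reply could be expressed along the following lines. This is a sketch only: `uniqueSources` and `checkNotECMP` are hypothetical helpers, and treating "PDUs seen on every tunnel" as the failure case is an assumption based on the explanation above.

```go
package main

import "fmt"

// uniqueSources counts the distinct GRE outer source addresses in a capture.
func uniqueSources(srcs []string) int {
	seen := make(map[string]struct{}, len(srcs))
	for _, s := range srcs {
		seen[s] = struct{}{}
	}
	return len(seen)
}

// checkNotECMP returns an error when CCM PDUs were spread over every tunnel.
// Assumption: with more than one tunnel, seeing all of them is the failure.
func checkNotECMP(srcs []string, tunnelCount int) error {
	n := uniqueSources(srcs)
	if tunnelCount > 1 && n >= tunnelCount {
		return fmt.Errorf("CCM PDUs were load-balanced across %d of %d GRE sources", n, tunnelCount)
	}
	return nil
}

func main() {
	pinned := []string{"198.51.100.1", "198.51.100.1", "198.51.100.1"}
	fmt.Println(checkNotECMP(pinned, 16))
}
```

Returning an `error` rather than calling `t.Errorf` inside the helper keeps the pass and fail branches explicit at the call site, which makes an inverted condition easier to spot.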

Comment on lines 632 to 636
if cfmData[2]&0x0F != byte(expectedInterval) {
    t.Errorf("ccm interval mismatch on packet; expected: %d, got: %d", expectedInterval, cfmData[2]&0x07)
} else {
    t.Logf("packet has the correct CCM interval: %d", cfmData[2]&0x07)
}
Contributor

high

The CCM interval is encoded in the last 3 bits of the flags field (octet 3 of the CCM PDU). The check cfmData[2]&0x0F is incorrect as it uses a 4-bit mask. The correct mask to extract the 3-bit interval value is 0x07. The logging statement already uses the correct mask, but the validation logic is flawed.

Suggested change
-if cfmData[2]&0x0F != byte(expectedInterval) {
-    t.Errorf("ccm interval mismatch on packet; expected: %d, got: %d", expectedInterval, cfmData[2]&0x07)
-} else {
-    t.Logf("packet has the correct CCM interval: %d", cfmData[2]&0x07)
-}
+// Verify interval field in CCM packet.
+if cfmData[2]&0x07 != byte(expectedInterval) {
+    t.Errorf("ccm interval mismatch on packet; expected: %d, got: %d", expectedInterval, cfmData[2]&0x07)
+} else {
+    t.Logf("packet has the correct CCM interval: %d", cfmData[2]&0x07)
+}
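
A small standalone sketch shows why the mask width matters. The helper name `ccmInterval` and the `intervalDurations` table are illustrative and not part of the PR; the durations are the nominal IEEE 802.1ag values for interval codes 1 to 7, with 3.33 ms rounded.

```go
package main

import (
	"fmt"
	"time"
)

// ccmIntervalMask selects the three low-order bits of the CCM flags octet.
const ccmIntervalMask = 0x07

// ccmInterval extracts the interval code from the flags octet.
func ccmInterval(flags byte) uint8 { return flags & ccmIntervalMask }

// intervalDurations maps interval codes to their nominal transmission period.
var intervalDurations = map[uint8]time.Duration{
	1: 3330 * time.Microsecond, // 3.33 ms, rounded
	2: 10 * time.Millisecond,
	3: 100 * time.Millisecond,
	4: time.Second,
	5: 10 * time.Second,
	6: time.Minute,
	7: 10 * time.Minute,
}

func main() {
	// RDI set, one reserved bit set, interval code 4.
	flags := byte(0x8C)
	fmt.Println("3-bit mask:", ccmInterval(flags), intervalDurations[ccmInterval(flags)])
	fmt.Println("4-bit mask:", flags&0x0F)
}
```

With a reserved bit set, the 4-bit mask yields a value that is not a valid interval code, while the 3-bit mask still recovers the configured interval.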

Comment on lines 640 to 647
rdiBitSet := (cfmData[2] & 0x80) == 1
t.Log("*******************************")
t.Log(rdiBitSet)
t.Log("*******************************")
if rdiBitSet {
    t.Errorf("rdi bit verification failed on packet. Expected: %v, Got: %v", !verifyRDIBit, rdiBitSet)
}
t.Logf("packet RDI bit is correctly set to %v", verifyRDIBit)
Contributor

high

The check for the RDI bit is incorrect. The expression (cfmData[2] & 0x80) == 1 will always be false when the RDI bit is set, as 0x80 (128) is not equal to 1. The correct way to check if the bit is set is (cfmData[2] & 0x80) != 0. Additionally, the debug logs using t.Log("*******************************") should be removed from the final code.

rdiBitSet := (cfmData[2] & 0x80) != 0
if rdiBitSet != verifyRDIBit {
    t.Errorf("rdi bit verification failed on packet. Expected: %v, Got: %v", verifyRDIBit, rdiBitSet)
}
t.Logf("packet RDI bit is correctly set to %v", rdiBitSet)
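
The difference between the two comparisons can be checked in isolation. In this sketch `rdiSet` and `checkRDI` are hypothetical helpers; the only fact relied on is that RDI is the most significant bit of the CCM flags octet.

```go
package main

import "fmt"

// rdiMask selects the most significant bit of the CCM flags octet.
const rdiMask = 0x80

// rdiSet reports whether the RDI bit is set.
func rdiSet(flags byte) bool { return flags&rdiMask != 0 }

// checkRDI compares the observed RDI bit with the expected value.
func checkRDI(flags byte, want bool) error {
	if got := rdiSet(flags); got != want {
		return fmt.Errorf("rdi bit mismatch: got %v, want %v", got, want)
	}
	return nil
}

func main() {
	flags := byte(0x84) // RDI set, interval code 4
	fmt.Println("masked value == 1:", flags&rdiMask == 1)
	fmt.Println("masked value != 0:", rdiSet(flags))
	fmt.Println("check:", checkRDI(flags, true))
}
```

Because the masked value is either 0 or 0x80, comparing it with 1 can never succeed, whereas the non-zero comparison tracks the bit.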

     return fmt.Errorf("could not write bytes to pcap file: %v", err)
 }
-defer os.Remove(f.Name()) // Clean up the temporary file
+// defer os.Remove(f.Name()) // Clean up the temporary file
Contributor

high

This line, which is responsible for cleaning up the temporary pcap file, is commented out. This will result in a resource leak, as temporary files will be left on the system after the test completes. This line should be uncommented to ensure proper cleanup.

Suggested change
-// defer os.Remove(f.Name()) // Clean up the temporary file
+defer os.Remove(f.Name()) // Clean up the temporary file

Comment on lines +406 to +415
if packetVal.MPLSLayer.ControlWordHeader {
    if len(mpls.Payload) >= 4 {
        controlWord := mpls.Payload[:4]
        if uint16(controlWord[0])<<8|uint16(controlWord[1]) == uint16(packetVal.MPLSLayer.ControlWordSequence) {
            t.Logf("%v (32-bit field ) control word is inserted between the MPLS label stack and the Layer 2 payload (the Ethernet frame).0", packetVal.MPLSLayer.ControlWordSequence)
        }
    } else {
        t.Errorf("Control Word header not found")
    }
}
Contributor

high

The validation logic for the MPLS control word is incorrect.

  1. It only checks the first 16 bits of the sequence number by casting a uint32 to uint16. The full 32-bit sequence number should be validated.
  2. According to RFC 4385, the first 4 bits of the control word must be zero. This check is missing.

Please update the logic to perform a complete and correct validation of the 32-bit control word.

if packetVal.MPLSLayer.ControlWordHeader {
    if len(mpls.Payload) < 4 {
        return fmt.Errorf("payload too short for Control Word header")
    }
    controlWord := mpls.Payload[:4]
    // Per RFC 4385, the first 4 bits of the control word MUST be 0.
    if (controlWord[0] & 0xF0) != 0 {
        return fmt.Errorf("invalid control word format, first 4 bits are not zero: 0x%x", controlWord[0])
    }
    sequenceNumber := binary.BigEndian.Uint32(controlWord)
    if sequenceNumber != packetVal.MPLSLayer.ControlWordSequence {
        return fmt.Errorf("control word sequence number mismatch. Got: %d, Want: %d", sequenceNumber, packetVal.MPLSLayer.ControlWordSequence)
    }
    t.Logf("Control Word with sequence number %d found as expected.", sequenceNumber)
}
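
For reference, here is a sketch of control-word parsing that assumes the RFC 4385 generic layout: 4 zero bits, 4 flag bits, 2 fragmentation bits, 6 length bits, and a 16-bit sequence number. Under that layout the sequence number occupies only the low 16 bits of the word, so a check may want to compare those bits rather than the whole 32-bit value. The `controlWord` struct and `parseControlWord` function are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// controlWord holds the fields of a PW MPLS control word, assuming the
// RFC 4385 generic layout.
type controlWord struct {
	Flags    uint8  // 4 bits
	Frag     uint8  // 2 bits
	Length   uint8  // 6 bits
	Sequence uint16 // low 16 bits of the word
}

// parseControlWord decodes the first four bytes of an MPLS payload.
func parseControlWord(payload []byte) (controlWord, error) {
	if len(payload) < 4 {
		return controlWord{}, errors.New("payload too short for control word")
	}
	w := binary.BigEndian.Uint32(payload[:4])
	if w>>28 != 0 {
		return controlWord{}, fmt.Errorf("first 4 bits are not zero: 0x%x", w>>28)
	}
	return controlWord{
		Flags:    uint8((w >> 24) & 0x0F),
		Frag:     uint8((w >> 22) & 0x03),
		Length:   uint8((w >> 16) & 0x3F),
		Sequence: uint16(w), // truncation keeps the low 16 bits
	}, nil
}

func main() {
	cw, err := parseControlWord([]byte{0x00, 0x00, 0x00, 0x2A})
	fmt.Println(cw.Sequence, err)
}
```

A caller could then compare `cw.Sequence` with the expected value and, if desired, also require the flag, fragmentation, and length fields to be zero.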

Comment on lines 546 to 548
time.Sleep(10 * time.Second)
ate.OTG().StopTraffic(t)
time.Sleep(10 * time.Second)
Contributor

medium

The use of time.Sleep is discouraged by the repository style guide (line 53), which recommends using gnmi.Watch with .Await for waiting on conditions. Using fixed sleeps can lead to flaky tests that are either too slow or fail intermittently. Please refactor this to use a more deterministic waiting mechanism, such as waiting for a specific OTG state.

    }
    helpers.GnmiCLIConfig(t, dut, cli)
} else {
    // OC is not available
Contributor

medium

The comment // OC is not available is confusing because it's inside an else block that is meant to handle the OpenConfig case. If there is no OC path for this configuration, the function should probably not have this else block. If there is a path, the implementation should be added. Please clarify or implement the intended OC configuration.


// GetGnmiCLIOutput sets config built with buildCliConfigRequest and returns the output.
func GetGnmiCLIOutput(t testing.TB, dut *ondatra.DUTDevice, config string) *gpb.GetResponse {
    GnmiCLIConfig(t, dut, config)
Contributor

medium

The function GetGnmiCLIOutput first calls GnmiCLIConfig, which performs a gNMI Set operation. This is incorrect for show commands, which should only be performed via a Get request. Sending a show command as part of a Set request is unconventional and may lead to errors or unintended side effects. The call to GnmiCLIConfig should be removed from this helper.

func GetGnmiCLIOutput(t testing.TB, dut *ondatra.DUTDevice, config string) *gpb.GetResponse {

Contributor

balaji6 left a comment

Added a few comments on SLM and the error message. Please take a look, thanks.

ASHNA-AGGARWAL-KEYSIGHT
Contributor Author

> Added few comments on SLM and error message. Please take a look, thanks.

Logs attached: https://partnerissuetracker.corp.google.com/issues/415458482#comment166

// Arista : https://partnerissuetracker.corp.google.com/issues/434922681
bool otn_to_eth_assignment = 317;

// Devices that do not support import export policies configured in network instance
Contributor

@ASHNA-AGGARWAL-KEYSIGHT - Can you please add the vendor bug ID for this deviation? Without the bug ID we cannot approve this PR.

Contributor Author

ASHNA-AGGARWAL-KEYSIGHT, Mar 2, 2026

@balaji6 Could you provide the bug IDs for the two deviations?

bool reduced_ecmp_set_on_mixed_encap_decap_nh = 378;

// Devices that do not support mpls static pseudowire OC
bool mpls_static_pseudowire_oc_unsupported = 379;
Contributor

@ASHNA-AGGARWAL-KEYSIGHT - Please add the relevant bug ID as mentioned above.

Contributor

ram-mac commented Feb 26, 2026

@balaji6 - Can you please help validate this PR in the Google environment, so we at least know it is passing?

7 participants