CNTR: fix Arista client setup and add Arista deviations#5507

Open: pjacakArista wants to merge 2 commits into openconfig:main from pjacakArista:arista-CNTR-2-fix-gnmi-cli-config

@pjacakArista

Three CNTR tests (container lifecycle, container connectivity, supervisor failover) share containerztest.Client(), which fails on Arista for several independent reasons. This PR fixes the shared infrastructure and unblocks CNTR-1.1-1.7 and all of CNTR-2.

Root causes fixed:

  • containerztest.Client() called dut.Config().New().WithAristaText(...).Append(t) to push the containerz gNOI service config. The Arista binding rejects this (PushConfig is unimplemented), so the containerz gNOI service was never enabled and port 60061 was never opened in the management namespace firewall. Replaced with helpers.GnmiCLIConfig for a reliable CLI-origin gNMI config push. Added gnmiSaveWithRetry and waitForGNOI polling to handle Octa restarts triggered by enabling the containerz service.
  • The management namespace firewall (ns-mgmt) blocks port 60061 even after the containerz ACL config is applied, because the system control-plane ACL mechanism does not create kernel ip6tables rules on all EOS versions. Added a direct ip6tables ACCEPT rule into ns-mgmt as a reliable fallback.
  • Ondatra caches the gNOI connection; after the config push causes Octa to restart, the cached connection is stale. Added a re-dial step after config push to get a fresh gNOI connection.
  • Setup() did not call RemoveContainer on teardown, so stopped containers could contaminate subsequent tests via Docker's restart policy.
  • WaitForRunning used a single-deadline context and was failing on the first stream error. Improved with per-call timeouts and stream error retry.
  • DialGRPCWithPort in topologies/binding/binding.go combined per-RPC credentials with an insecure transport, producing a gRPC errCredentialsConflict. Fixed by stripping per-RPC credentials and defaulting to insecure transport when the device config specifies none.
  • cntr_test.go (container_connectivity): dialContainer used insecure transport, but cntrsrv serves a self-signed TLS certificate. Added TLS skip-verify credentials for Arista. Guarded AddNH/AddNHG with a new GribiDecapInDefaultNiUnsupported deviation for platforms that cannot FIB-program a Decap NH in the DEFAULT network instance.
  • CNTR-2 metadata was missing the containerz_oc_unsupported deviation that CNTR-1 already had, so the gNOI service enablement block was never entered. Added. Also added gribi_decap_in_default_ni_unsupported.
  • Added AwaitSwitchoverReady to internal/components for reuse across CNTR tests (used by a follow up PR).
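
The stale-connection item in the list above can be illustrated with a small model. This is a minimal sketch, not the Ondatra or binding code: `server`, `conn`, and `cachingDialer` are hypothetical stand-ins for the DUT agent, a gRPC connection, and a binding that caches its first dial.

```go
package main

import (
	"errors"
	"fmt"
)

// server is a hypothetical model of a DUT agent. Each restart bumps generation.
type server struct{ generation int }

// conn is a hypothetical connection that remembers which generation it was dialed against.
type conn struct{ generation int }

func (s *server) dial() *conn { return &conn{generation: s.generation} }

// call fails when the connection predates the most recent restart.
func (s *server) call(c *conn) error {
	if c.generation != s.generation {
		return errors.New("stale connection: server restarted since dial")
	}
	return nil
}

// cachingDialer mimics a binding that dials once and reuses the result.
type cachingDialer struct {
	srv    *server
	cached *conn
}

// get returns the cached connection, dialing only on first use.
func (d *cachingDialer) get() *conn {
	if d.cached == nil {
		d.cached = d.srv.dial()
	}
	return d.cached
}

// redial bypasses the cache and replaces it with a fresh connection.
func (d *cachingDialer) redial() *conn {
	d.cached = d.srv.dial()
	return d.cached
}

func main() {
	srv := &server{}
	d := &cachingDialer{srv: srv}
	fmt.Println("before restart:         ", srv.call(d.get()))
	srv.generation++ // the config push restarts the agent
	fmt.Println("cached after restart:   ", srv.call(d.get()))
	fmt.Println("re-dialed after restart:", srv.call(d.redial()))
}
```

The point of the model is only that a cache keyed on "have I dialed before" cannot notice a server restart, so the caller has to force a new dial after any step known to restart the agent.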

This is PR 1 of 2. PR 2 will add the cold-reboot / supervisor failover gNOI cache fix (ClientWithoutConfig, WaitForReboot) needed for CNTR-1.8 and all of CNTR-3.

Validated manually on Arista EOS:

  • CNTR-1 (container_lifecycle): subtests 1.1-1.7 pass (TestDeployAndStart, TestRetrieveLogs, TestListContainers, TestStopContainer, TestVolumes, TestUpgrade, TestPlugins). Subtest 1.8 (TestContainerPersistenceAfterColdReboot) requires PR2 changes and fails as expected on this branch alone.
  • CNTR-2 (container_connectivity): TestDial (PASS) and TestDialLocal (PASS, subtests: dial_gNMI, dial_gRIBI, dial_something_not_listening). TestDialLocal/dial_gRIBI passes with AddNH/AddNHG skipped via GribiDecapInDefaultNiUnsupported deviation. TestConnectRemote is a pre-existing hardcoded skip (TODO in test source).

The three CNTR tests share containerztest.Client(), which fails on Arista because the binding rejects dut.Config().New().WithAristaText(...).Append(t) (PushConfig is unimplemented). As a result the containerz gNOI service is never configured and the management namespace firewall blocks container port 60061. CNTR-2 additionally lacks the containerz_oc_unsupported deviation, so the gNOI service enablement block is never entered. A credential conflict in binding.DialGRPCWithPort (per-RPC credentials combined with a caller-supplied insecure transport) prevents CNTR-2 from reaching the container even when the port is open.

Replace the Append call with helpers.GnmiCLIConfig for reliable CLI-origin gNMI config push. Add gnmiSaveWithRetry and waitForGNOI polling to handle Octa restarts triggered by enabling the containerz gNOI service. Insert a direct ip6tables ACCEPT rule into EOS's ns-mgmt namespace as a reliable fallback for port 60061; the system control-plane ACL mechanism does not create kernel firewall rules on all EOS versions. Re-dial gNOI after config push to bypass Ondatra's stale connection cache. Fix Setup() teardown to call RemoveContainer so stopped containers do not contaminate subsequent tests via Docker's restart policy. Improve WaitForRunning with per-call timeouts and stream error retry instead of a fatal on the first error.
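
The "per-call timeouts and stream error retry" idea can be sketched as follows. This is an illustration under assumptions, not the code in this PR: `retryWithPerCallTimeout` and `demoRetry` are hypothetical names, and the durations are arbitrary.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithPerCallTimeout calls op until it succeeds or the overall budget is spent.
// Every attempt gets its own context, so one hung call cannot consume the whole budget.
func retryWithPerCallTimeout(overall, perCall, backoff time.Duration, op func(context.Context) error) error {
	deadline := time.Now().Add(overall)
	lastErr := errors.New("no attempt was made")
	for time.Now().Before(deadline) {
		ctx, cancel := context.WithTimeout(context.Background(), perCall)
		lastErr = op(ctx)
		cancel()
		if lastErr == nil {
			return nil
		}
		time.Sleep(backoff) // transient error: wait briefly and try again
	}
	return fmt.Errorf("gave up after %v: %w", overall, lastErr)
}

// demoRetry simulates one stream error, one hung call, then success.
func demoRetry() (int, error) {
	calls := 0
	err := retryWithPerCallTimeout(2*time.Second, 50*time.Millisecond, 10*time.Millisecond,
		func(ctx context.Context) error {
			calls++
			switch calls {
			case 1:
				return errors.New("stream error") // fails fast
			case 2:
				<-ctx.Done() // hangs until the per-call timeout fires
				return ctx.Err()
			default:
				return nil
			}
		})
	return calls, err
}

func main() {
	calls, err := demoRetry()
	fmt.Printf("attempts=%d err=%v\n", calls, err)
}
```

With a single shared deadline, the hung second call would have held the context until the whole budget expired. Scoping the timeout to each attempt is what lets the third attempt run.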

Fix DialGRPCWithPort to strip per-RPC credentials and default to insecure transport when the device config specifies none; this resolves the gRPC errCredentialsConflict when the caller supplies its own transport credentials.
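
The option handling described above can be modelled without gRPC. The sketch below is a simplified illustration, not the `DialGRPCWithPort` implementation: `dialOpt`, `optKind`, and `sanitize` are hypothetical names standing in for gRPC dial options and the filtering step.

```go
package main

import "fmt"

// optKind classifies a dial option. These are hypothetical labels, not grpc-go types.
type optKind int

const (
	transportTLS optKind = iota
	transportInsecure
	perRPCCreds
	otherOpt
)

// dialOpt is a hypothetical stand-in for a gRPC dial option.
type dialOpt struct {
	kind optKind
	name string
}

// sanitize drops per-RPC credentials and, if no transport credentials remain,
// appends an insecure transport as the default.
func sanitize(opts []dialOpt) []dialOpt {
	out := make([]dialOpt, 0, len(opts)+1)
	hasTransport := false
	for _, o := range opts {
		switch o.kind {
		case perRPCCreds:
			continue // skip: per-RPC credentials are not forwarded to arbitrary ports
		case transportTLS, transportInsecure:
			hasTransport = true
		}
		out = append(out, o)
	}
	if !hasTransport {
		out = append(out, dialOpt{kind: transportInsecure, name: "insecure (default)"})
	}
	return out
}

func main() {
	fromDevice := []dialOpt{{perRPCCreds, "device password"}, {otherOpt, "keepalive"}}
	for _, o := range sanitize(fromDevice) {
		fmt.Println(o.name)
	}
}
```

When the input already carries a transport option (for example a caller-supplied TLS option), `sanitize` leaves it in place and adds nothing, which is the case the commit message describes for callers that bring their own transport credentials.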

In cntr_test.go, pass TLS skip-verify credentials in dialContainer for Arista (cntrsrv serves a self-signed certificate). Guard AddNH/AddNHG with the new GribiDecapInDefaultNiUnsupported deviation for platforms that cannot FIB-program a Decap NH in the DEFAULT network instance. Add AwaitSwitchoverReady to internal/components for reuse across CNTR tests. Add containerz_oc_unsupported and gribi_decap_in_default_ni_unsupported to CNTR-2 metadata. Add the GribiDecapInDefaultNiUnsupported deviation accessor and proto field 428.
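
To show why skip-verify is needed against a server whose certificate the client cannot verify, here is a standard-library sketch. It is an assumption-laden illustration rather than the `dialContainer` code: `containerTLSConfig`, `handshake`, and `startTestServer` are hypothetical helpers, and the `httptest` server stands in for cntrsrv.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// containerTLSConfig builds the client TLS config. skipVerify stands in for a
// deviation flag; setting it disables certificate verification (test use only).
func containerTLSConfig(skipVerify bool) *tls.Config {
	return &tls.Config{InsecureSkipVerify: skipVerify}
}

// handshake dials addr over TLS and returns any handshake error.
func handshake(addr string, cfg *tls.Config) error {
	conn, err := tls.Dial("tcp", addr, cfg)
	if err != nil {
		return err
	}
	return conn.Close()
}

// startTestServer starts a local TLS server whose certificate is not in the
// client's default trust store, and returns its address and a stop function.
func startTestServer() (addr string, stop func()) {
	srv := httptest.NewTLSServer(http.NotFoundHandler())
	return srv.Listener.Addr().String(), srv.Close
}

func main() {
	addr, stop := startTestServer()
	defer stop()
	fmt.Println("with verification:", handshake(addr, containerTLSConfig(false)))
	fmt.Println("skip verification:", handshake(addr, containerTLSConfig(true)))
}
```

Gating the flag behind a deviation, as the PR does, keeps verification on by default and limits the weaker setting to platforms that declare they need it.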

With these changes, CNTR-1.1-1.7 (TestDeployAndStart, TestRetrieveLogs, TestListContainers, TestStopContainer, TestVolumes, TestUpgrade, TestPlugins) and all of CNTR-2 pass on Arista EOS.
@pjacakArista pjacakArista requested a review from a team as a code owner May 26, 2026 07:45
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses several infrastructure failures preventing CNTR tests from running on Arista hardware. By refining the container service configuration, improving connection handling during device restarts, and adding necessary platform-specific deviations, this change stabilizes the container lifecycle and connectivity tests. These updates ensure that the test suite can reliably interact with Arista devices despite differences in gNOI service implementation and management namespace firewall policies.

Highlights

  • Arista Infrastructure Fixes: Implemented reliable CLI-origin gNMI config pushing, added polling for gNOI service availability, and introduced a direct ip6tables fallback rule to ensure container port 60061 is accessible in the management namespace.
  • Connection Stability: Added gNOI re-dial logic to handle stale connections following Octa restarts and improved container lifecycle management by ensuring containers are removed on teardown.
  • Test Robustness: Enhanced container polling with per-call timeouts and retry logic, and fixed gRPC credential conflicts in the binding layer by stripping per-RPC credentials for arbitrary services.
  • Arista Deviations: Added TLS skip-verify for Arista container connections and introduced the gribi_decap_in_default_ni_unsupported deviation to handle platform-specific FIB programming limitations.

@OpenConfigBot

OpenConfigBot commented May 26, 2026

Pull Request Functional Test Report for #5507 / 771915d

Virtual Devices

| Device | Test | Test Documentation | Job | Raw Log |
| --- | --- | --- | --- | --- |
| Arista cEOS | status | CNTR-2: Container network connectivity tests | | |
| Cisco 8000E | status | CNTR-2: Container network connectivity tests | | |
| Cisco XRd | status | CNTR-2: Container network connectivity tests | | |
| Juniper ncPTX | status | CNTR-2: Container network connectivity tests | | |
| Nokia SR Linux | status | CNTR-2: Container network connectivity tests | | |
| Openconfig Lemming | status | CNTR-2: Container network connectivity tests | | |

Hardware Devices

| Device | Test | Test Documentation | Raw Log |
| --- | --- | --- | --- |
| Arista 7808 | status | CNTR-2: Container network connectivity tests | |
| Cisco 8808 | status | CNTR-2: Container network connectivity tests | |
| Juniper PTX10008 | status | CNTR-2: Container network connectivity tests | |
| Nokia 7250 IXR-10e | status | CNTR-2: Container network connectivity tests | |

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces several improvements to container connectivity testing and gNOI/gNMI operations. Key changes include adding a new deviation gribi_decap_in_default_ni_unsupported to skip Decap NH programming when unsupported, implementing robust gNOI re-dialing and retry mechanisms for "write memory" on Arista devices, and stripping per-RPC credentials in DialGRPCWithPort to avoid credentials conflicts. The review feedback highlights three important improvement opportunities: using a dedicated timeout context in the Setup cleanup function to ensure container removal succeeds during context cancellation, abstracting the vendor-specific TLS skip-verify check behind a new deviation to maintain test portability, and adding a timeout to the gNMI Set call in the retry loop to prevent potential hangs.

Comment thread internal/containerztest/containerztest.go
Comment thread feature/container/networking/tests/container_connectivity/cntr_test.go Outdated
Comment thread internal/containerztest/containerztest.go Outdated
Three issues were flagged in review of the initial CNTR infrastructure fixes:
- the `Setup()` cleanup closure captured the caller's test context, which may already be canceled when the cleanup runs (e.g. on test timeout), causing `Stop` and `RemoveContainer` to fail immediately and leave the container running on the DUT
- the gNMI Set call in `gnmiSaveWithRetry` used `context.Background()` with no timeout, which can hang indefinitely if gNMI is unresponsive during an Octa restart
- the `dialContainer` function in `cntr_test.go` used `dut.Vendor()` directly to select TLS credentials, which bypasses the deviation abstraction that featureprofiles tests are expected to use

Fixed all 3:
- give the `Setup` cleanup closure a fresh `context.WithTimeout` of one minute so `Stop` and `RemoveContainer` always execute regardless of the caller's context state.
- add a 10-second per-call timeout around the gNMI Set in `gnmiSaveWithRetry` so the retry loop can make progress even when the server is temporarily unresponsive.
- replace the `dut.Vendor()` switch in `dialContainer` with a new `ContainerzTLSInsecureSkipVerify` deviation and set it true for Arista in the CNTR-2 metadata.
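
The first fix (a fresh context for cleanup) can be demonstrated in isolation. This is a minimal sketch, not the `Setup()` code: `removeContainer`, `cleanup`, and `runCleanup` are hypothetical, with `removeContainer` standing in for the `Stop`/`RemoveContainer` RPCs.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// removeContainer is a hypothetical stand-in for the teardown RPCs.
// Like a real RPC, it fails immediately if its context is already done.
func removeContainer(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(10 * time.Millisecond): // simulated RPC latency
		return nil
	}
}

// cleanup runs teardown. With fresh=true it ignores the test context, which may
// already be canceled, and uses its own one-minute budget instead.
func cleanup(testCtx context.Context, fresh bool) error {
	ctx := testCtx
	if fresh {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(context.Background(), time.Minute)
		defer cancel()
	}
	return removeContainer(ctx)
}

// runCleanup cancels the test context first, as a timed-out test would,
// and then runs cleanup with or without a fresh context.
func runCleanup(fresh bool) error {
	testCtx, cancel := context.WithCancel(context.Background())
	cancel()
	return cleanup(testCtx, fresh)
}

func main() {
	fmt.Println("captured context:", runCleanup(false))
	fmt.Println("fresh context:   ", runCleanup(true))
}
```

The timeout on the fresh context still bounds the teardown, so an unresponsive DUT cannot stall the test binary indefinitely.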
@pjacakArista
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces several changes to container connectivity tests and related helpers, including support for insecure TLS skip-verify in containerz connections, handling unsupported gRIBI Decap NH in the default network instance, and robust container cleanup and status polling. Additionally, it updates the gRPC dialer in the topology binding to strip per-RPC credentials for generic ports. The review feedback highlights that the newly added deviations GribiDecapInDefaultNiUnsupported and ContainerzTLSInsecureSkipVerify are missing tracking issue URLs in their comments, violating the repository style guide. It also suggests adding a timeout to the gNOI re-dial context in Client() to prevent potential test hangs.

Comment on lines +1883 to +1889
// GribiDecapInDefaultNiUnsupported returns true if the device does not support
// programming a Decap NH in the DEFAULT network instance via gRIBI. When true,
// the test skips AddNH/AddNHG and accepts an empty gRIBI RIB (EOF) as proof of
// connectivity. Default value is false.
func GribiDecapInDefaultNiUnsupported(dut *ondatra.DUTDevice) bool {
	return lookupDUTDeviations(dut).GetGribiDecapInDefaultNiUnsupported()
}

medium

The new deviation GribiDecapInDefaultNiUnsupported is missing a comment containing a URL link to an issue tracker tracking its removal, which violates the repository style guide (Section 2, Deviation Guidelines, Step 5). Please add a tracking URL in the format https://issuetracker.google.com/xxxxx.

Suggested change
// GribiDecapInDefaultNiUnsupported returns true if the device does not support
// programming a Decap NH in the DEFAULT network instance via gRIBI. When true,
// the test skips AddNH/AddNHG and accepts an empty gRIBI RIB (EOF) as proof of
// connectivity. Default value is false.
//
// See: https://issuetracker.google.com/xxxxx
func GribiDecapInDefaultNiUnsupported(dut *ondatra.DUTDevice) bool {
	return lookupDUTDeviations(dut).GetGribiDecapInDefaultNiUnsupported()
}
References
  1. Add a comment to the accessor function containing a URL link to an issue tracker which tracks removal of the deviation. The format should be https://issuetracker.google.com/xxxxx. (link)
  2. Issue tracker URLs in deviation comments are acceptable in the format 'https://issuetracker.google.com/xxxxx' (omitting the '/issues/' segment), provided they are validated.

Comment on lines +1891 to +1897
// ContainerzTLSInsecureSkipVerify returns true if the device's containerz
// service presents a self-signed TLS certificate that cannot be verified
// against a trusted CA. When true, dialContainer uses TLS with
// InsecureSkipVerify. Default value is false.
func ContainerzTLSInsecureSkipVerify(dut *ondatra.DUTDevice) bool {
	return lookupDUTDeviations(dut).GetContainerzTlsInsecureSkipVerify()
}

medium

The new deviation ContainerzTLSInsecureSkipVerify is missing a comment containing a URL link to an issue tracker tracking its removal, which violates the repository style guide (Section 2, Deviation Guidelines, Step 5). Please add a tracking URL in the format https://issuetracker.google.com/xxxxx.

Suggested change
// ContainerzTLSInsecureSkipVerify returns true if the device's containerz
// service presents a self-signed TLS certificate that cannot be verified
// against a trusted CA. When true, dialContainer uses TLS with
// InsecureSkipVerify. Default value is false.
//
// See: https://issuetracker.google.com/xxxxx
func ContainerzTLSInsecureSkipVerify(dut *ondatra.DUTDevice) bool {
	return lookupDUTDeviations(dut).GetContainerzTlsInsecureSkipVerify()
}
References
  1. Add a comment to the accessor function containing a URL link to an issue tracker which tracks removal of the deviation. The format should be https://issuetracker.google.com/xxxxx. (link)
  2. Issue tracker URLs in deviation comments are acceptable in the format 'https://issuetracker.google.com/xxxxx' (omitting the '/issues/' segment), provided they are validated.

Comment on lines +95 to +99
	freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(context.Background())
	if err != nil {
		t.Logf("gNOI re-dial in Client() failed (non-fatal): %v", err)
		freshClients = dut.RawAPIs().GNOI(t)
	}

medium

Using context.Background() without a timeout when calling DialGNOI can cause the test to hang indefinitely if the connection or the gNOI service is unresponsive. It is highly recommended to use a context with a reasonable timeout (e.g., 30 seconds) to ensure the test fails fast in case of network or service issues.

	dialCtx, dialCancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer dialCancel()
	freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(dialCtx)
	if err != nil {
		t.Logf("gNOI re-dial in Client() failed (non-fatal): %v", err)
		freshClients = dut.RawAPIs().GNOI(t)
	}
