CNTR: fix Arista client setup and add Arista deviations#5507
CNTR: fix Arista client setup and add Arista deviations#5507pjacakArista wants to merge 2 commits into
Conversation
The three CNTR tests share containerztest.Client(), which fails on Arista because the binding rejects dut.Config().New().WithAristaText(...).Append(t) (PushConfig is unimplemented). As a result the containerz gNOI service is never configured and the management namespace firewall blocks container port 60061. CNTR-2 additionally lacks the containerz_oc_unsupported deviation, so the gNOI service enablement block is never entered. A credential conflict in binding.DialGRPCWithPort (per-RPC credentials combined with a caller-supplied insecure transport) prevents CNTR-2 from reaching the container even when the port is open. Replace the Append call with helpers.GnmiCLIConfig for reliable CLI-origin gNMI config push. Add gnmiSaveWithRetry and waitForGNOI polling to handle Octa restarts triggered by enabling the containerz gNOI service. Insert a direct ip6tables ACCEPT rule into EOS's ns-mgmt namespace as a reliable fallback for port 60061; the system control-plane ACL mechanism does not create kernel firewall rules on all EOS versions. Re-dial gNOI after config push to bypass Ondatra's stale connection cache. Fix Setup() teardown to call RemoveContainer so stopped containers do not contaminate subsequent tests via Docker's restart policy. Improve WaitForRunning with per-call timeouts and stream error retry instead of a fatal on the first error. Fix DialGRPCWithPort to strip per-RPC credentials and default to insecure transport when the device config specifies none; this resolves the gRPC errCredentialsConflict when the caller supplies its own transport credentials. In cntr_test.go, pass TLS skip-verify credentials in dialContainer for Arista (cntrsrv serves a self-signed certificate). Guard AddNH/AddNHG with the new GribiDecapInDefaultNiUnsupported deviation for platforms that cannot FIB-program a Decap NH in the DEFAULT network instance. Add AwaitSwitchoverReady to internal/components for reuse across CNTR tests. Add containerz_oc_unsupported and gribi_decap_in_default_ni_unsupported to CNTR-2 metadata. Add the GribiDecapInDefaultNiUnsupported deviation accessor and proto field 428. With those changes CNTR-1.1-1.7 ( TestDeployAndStart, TestRetrieveLogs, TestListContainers, TestStopContainer, TestVolumes, TestUpgrade, TestPlugins ) and a whole CNTR-2 passes on Arista's EOS.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses several infrastructure failures preventing CNTR tests from running on Arista hardware. By refining the container service configuration, improving connection handling during device restarts, and adding necessary platform-specific deviations, this change stabilizes the container lifecycle and connectivity tests. These updates ensure that the test suite can reliably interact with Arista devices despite differences in gNOI service implementation and management namespace firewall policies. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
Pull Request Functional Test Report for #5507 / 771915dVirtual Devices
Hardware Devices
|
There was a problem hiding this comment.
Code Review
This pull request introduces several improvements to container connectivity testing and gNOI/gNMI operations. Key changes include adding a new deviation gribi_decap_in_default_ni_unsupported to skip Decap NH programming when unsupported, implementing robust gNOI re-dialing and retry mechanisms for "write memory" on Arista devices, and stripping per-RPC credentials in DialGRPCWithPort to avoid credentials conflicts. The review feedback highlights three important improvement opportunities: using a dedicated timeout context in the Setup cleanup function to ensure container removal succeeds during context cancellation, abstracting the vendor-specific TLS skip-verify check behind a new deviation to maintain test portability, and adding a timeout to the gNMI Set call in the retry loop to prevent potential hangs.
Three issues were flagged in review of the initial CNTR infrastructure fixes: - the `Setup()` cleanup closure captured the caller's test context, which may already be canceled when the cleanup runs (e.g. on test timeout), causing `Stop` and `RemoveContainer` to fail immediately and leave the container running on the DUT - the gNMI Set call in `gnmiSaveWithRetry` used `context.Background()` with no timeout, which can hang indefinitely if gNMI is unresponsive during an Octa restart - the `dialContainer` function in `cntr_test.go` used `dut.Vendor()` directly to select TLS credentials, which bypasses the deviation abstraction that featureprofiles tests are expected to use Fixed all 3: - give the `Setup` cleanup closure a fresh `context.WithTimeout` of one minute so `Stop` and `RemoveContainer` always execute regardless of the caller's context state. - add a 10-second per-call timeout around the gNMI Set in `gnmiSaveWithRetry` so the retry loop can make progress even when the server is temporarily unresponsive. - replace the `dut.Vendor()` switch in `dialContainer` with a new `ContainerzTLSInsecureSkipVerify` deviation and set it true for Arista in the CNTR-2 metadata.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces several changes to container connectivity tests and related helpers, including support for insecure TLS skip-verify in containerz connections, handling unsupported gRIBI Decap NH in the default network instance, and robust container cleanup and status polling. Additionally, it updates the gRPC dialer in the topology binding to strip per-RPC credentials for generic ports. The review feedback highlights that the newly added deviations GribiDecapInDefaultNiUnsupported and ContainerzTLSInsecureSkipVerify are missing tracking issue URLs in their comments, violating the repository style guide. It also suggests adding a timeout to the gNOI re-dial context in Client() to prevent potential test hangs.
| // GribiDecapInDefaultNiUnsupported returns true if the device does not support | ||
| // programming a Decap NH in the DEFAULT network instance via gRIBI. When true, | ||
| // the test skips AddNH/AddNHG and accepts an empty gRIBI RIB (EOF) as proof of | ||
| // connectivity. Default value is false. | ||
| func GribiDecapInDefaultNiUnsupported(dut *ondatra.DUTDevice) bool { | ||
| return lookupDUTDeviations(dut).GetGribiDecapInDefaultNiUnsupported() | ||
| } |
There was a problem hiding this comment.
The new deviation GribiDecapInDefaultNiUnsupported is missing a comment containing a URL link to an issue tracker tracking its removal, which violates the repository style guide (Section 2, Deviation Guidelines, Step 5). Please add a tracking URL in the format https://issuetracker.google.com/xxxxx.
| // GribiDecapInDefaultNiUnsupported returns true if the device does not support | |
| // programming a Decap NH in the DEFAULT network instance via gRIBI. When true, | |
| // the test skips AddNH/AddNHG and accepts an empty gRIBI RIB (EOF) as proof of | |
| // connectivity. Default value is false. | |
| func GribiDecapInDefaultNiUnsupported(dut *ondatra.DUTDevice) bool { | |
| return lookupDUTDeviations(dut).GetGribiDecapInDefaultNiUnsupported() | |
| } | |
| // GribiDecapInDefaultNiUnsupported returns true if the device does not support | |
| // programming a Decap NH in the DEFAULT network instance via gRIBI. When true, | |
| // the test skips AddNH/AddNHG and accepts an empty gRIBI RIB (EOF) as proof of | |
| // connectivity. Default value is false. | |
| // | |
| // See: https://issuetracker.google.com/xxxxx | |
| func GribiDecapInDefaultNiUnsupported(dut *ondatra.DUTDevice) bool { | |
| return lookupDUTDeviations(dut).GetGribiDecapInDefaultNiUnsupported() | |
| } |
References
- Add a comment to the accessor function containing a URL link to an issue tracker which tracks removal of the deviation. The format should be https://issuetracker.google.com/xxxxx. (link)
- Issue tracker URLs in deviation comments are acceptable in the format 'https://issuetracker.google.com/xxxxx' (omitting the '/issues/' segment), provided they are validated.
| // ContainerzTLSInsecureSkipVerify returns true if the device's containerz | ||
| // service presents a self-signed TLS certificate that cannot be verified | ||
| // against a trusted CA. When true, dialContainer uses TLS with | ||
| // InsecureSkipVerify. Default value is false. | ||
| func ContainerzTLSInsecureSkipVerify(dut *ondatra.DUTDevice) bool { | ||
| return lookupDUTDeviations(dut).GetContainerzTlsInsecureSkipVerify() | ||
| } |
There was a problem hiding this comment.
The new deviation ContainerzTLSInsecureSkipVerify is missing a comment containing a URL link to an issue tracker tracking its removal, which violates the repository style guide (Section 2, Deviation Guidelines, Step 5). Please add a tracking URL in the format https://issuetracker.google.com/xxxxx.
| // ContainerzTLSInsecureSkipVerify returns true if the device's containerz | |
| // service presents a self-signed TLS certificate that cannot be verified | |
| // against a trusted CA. When true, dialContainer uses TLS with | |
| // InsecureSkipVerify. Default value is false. | |
| func ContainerzTLSInsecureSkipVerify(dut *ondatra.DUTDevice) bool { | |
| return lookupDUTDeviations(dut).GetContainerzTlsInsecureSkipVerify() | |
| } | |
| // ContainerzTLSInsecureSkipVerify returns true if the device's containerz | |
| // service presents a self-signed TLS certificate that cannot be verified | |
| // against a trusted CA. When true, dialContainer uses TLS with | |
| // InsecureSkipVerify. Default value is false. | |
| // | |
| // See: https://issuetracker.google.com/xxxxx | |
| func ContainerzTLSInsecureSkipVerify(dut *ondatra.DUTDevice) bool { | |
| return lookupDUTDeviations(dut).GetContainerzTlsInsecureSkipVerify() | |
| } |
References
- Add a comment to the accessor function containing a URL link to an issue tracker which tracks removal of the deviation. The format should be https://issuetracker.google.com/xxxxx. (link)
- Issue tracker URLs in deviation comments are acceptable in the format 'https://issuetracker.google.com/xxxxx' (omitting the '/issues/' segment), provided they are validated.
| freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(context.Background()) | ||
| if err != nil { | ||
| t.Logf("gNOI re-dial in Client() failed (non-fatal): %v", err) | ||
| freshClients = dut.RawAPIs().GNOI(t) | ||
| } |
There was a problem hiding this comment.
Using context.Background() without a timeout when calling DialGNOI can cause the test to hang indefinitely if the connection or the gNOI service is unresponsive. It is highly recommended to use a context with a reasonable timeout (e.g., 30 seconds) to ensure the test fails fast in case of network or service issues.
dialCtx, dialCancel := context.WithTimeout(context.Background(), 30*time.Second)
defer dialCancel()
freshClients, err := dut.RawAPIs().BindingDUT().DialGNOI(dialCtx)
if err != nil {
t.Logf("gNOI re-dial in Client() failed (non-fatal): %v", err)
freshClients = dut.RawAPIs().GNOI(t)
}
Three CNTR tests (container lifecycle, container connectivity, supervisor failover) share containerztest.Client(), which fails on Arista for several independent reasons. This PR fixes the shared infrastructure and unblocks CNTR-1.1-1.7 and all of CNTR-2.
Root causes fixed:
containerztest.Client()calleddut.Config().New().WithAristaText(...).Append(t)to push the containerz gNOI service config. The Arista binding rejects this (PushConfigis unimplemented), so the containerz gNOI service was never enabled and port 60061 was never opened in the management namespace firewall. Replaced withhelpers.GnmiCLIConfigfor a reliable CLI-origin gNMI config push. AddedgnmiSaveWithRetryandwaitForGNOIpolling to handle Octa restarts triggered by enabling the containerz service.ns-mgmt) blocks port 60061 even after the containerz ACL config is applied, because the system control-plane ACL mechanism does not create kernelip6tablesrules on all EOS versions. Added a directip6tablesACCEPTrule intons-mgmtas a reliable fallback.Setup()did not callRemoveContaineron teardown, so stopped containers could contaminate subsequent tests via Docker's restart policy.WaitForRunningused a single-deadline context and was failing on the first stream error. Improved with per-call timeouts and stream error retry.DialGRPCWithPortintopologies/binding/binding.gocombined per-RPC credentials with an insecure transport, producing agRPCerrCredentialsConflict. Fixed by stripping per-RPC credentials and defaulting to insecure transport when the device config specifies none.cntr_test.go(container_connectivity):dialContainerused insecure transport;cntrsrvserves a self-signed TLS certificate. Added TLS skip-verify credentials for Arista. GuardedAddNH/AddNHGwith a newGribiDecapInDefaultNiUnsupporteddeviation for platforms that cannot FIB-program a Decap NH in the DEFAULT network instance.CNTR-2metadata was missing thecontainerz_oc_unsupporteddeviation thatCNTR-1already had, so the gNOI service enablement block was never entered. Added. Also addedgribi_decap_in_default_ni_unsupported.AwaitSwitchoverReadyto internal/components for reuse across CNTR tests (used by a follow up PR).This is PR 1 of 2. The PR2 will add the cold-reboot / supervisor failover gNOI cache fix (ClientWithoutConfig, WaitForReboot) needed for CNTR-1.8 and all of CNTR-3.
Validated manually on Arista EOS: