Problem
internal/rawapis/rawapis.go caches gNOI (and gNMI/gNSI/gRIBI/P4RT) clients per DUT in a package-level map with no invalidation mechanism. After a DUT reboot or supervisor switchover, GNOI(t) returns the pre-reboot connection. RPCs on that connection hang until the test's context expires and then fail with context deadline exceeded. The underlying TCP socket is a zombie: the peer rebooted without sending FIN/RST, and gRPC has no keepalive configured to detect this.
This is not vendor-specific. It affects any test that reboots a DUT and then uses GNOI(t).
I've observed it while working on CNTR-1 and CNTR-3 for Arista's EOS (dual-RP hardware DCS-7804-CH):
- grpcurl with a fresh TLS connection to the device returned the expected response immediately after reboot.
- The same RPC via the test's GNOI(t) client hung for the full test timeout, then failed with context deadline exceeded.
- A docker ps on the device confirmed the containerz service was healthy and responding - the cached connection was the sole problem.
Suggested fix
Two coordinated changes to internal/rawapis/rawapis.go:
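- Add gRPC keepalive to CommonDialOpts - something like:
    // Needs the "time" and "google.golang.org/grpc/keepalive" imports.
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:                30 * time.Second, // ping after 30s without activity
        Timeout:             10 * time.Second, // fail the connection if no ack arrives within 10s
        PermitWithoutStream: true,             // ping even when no RPC is in flight
    })
This causes gRPC to send HTTP/2 PING frames on idle connections. If the peer doesn't respond within Timeout, gRPC marks the connection failed and subsequent RPCs return codes.Unavailable quickly (about 40s after peer death with these values: 30s Time plus 10s Timeout) instead of hanging indefinitely.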
- Add a unary + stream interceptor that evicts on codes.Unavailable:
When any RPC returns codes.Unavailable, the interceptor removes the cached client for that DUT. The next FetchGNOI / FetchGNMI / etc. call re-dials cleanly.
Why both are needed together:
- Keepalive alone: the dead connection eventually fails, but the dead entry stays in the cache - the next call still uses it, gets Unavailable, and returns the error to the caller.
- Auto-evict alone: never triggers on a TCP-zombie connection, because gRPC never produces Unavailable on a half-open socket - it just hangs - unless keepalive forces the failure first.
Together they make the cache self-healing: reboot causes keepalive failure -> Unavailable -> interceptor evicts -> next RPC re-dials transparently, with no test changes required.
codes.DeadlineExceeded probably should not be used as the eviction signal - it is ambiguous (slow server, slow RPC, or dead connection are indistinguishable), and evicting on it would cause spurious cache misses on legitimately slow RPCs.
Workaround
openconfig/featureprofiles#5509 works around this by calling dut.RawAPIs().BindingDUT().DialGNOI(ctx) directly after each reboot/switchover and threading the resulting fresh clients through to the affected code paths. This avoids touching the stale cache entry, but requires every test author working with rebooting DUTs to know about and apply the pattern manually. That does not scale well, because the problem can surface in any test that reboots a DUT. An Ondatra-level fix would solve it for all tests at once.