rawapis: cached gNOI/gNMI clients become zombie connections after DUT reboot or supervisor switchover #145

@pjacakArista

Problem

internal/rawapis/rawapis.go caches gNOI (and gNMI/gNSI/gRIBI/P4RT) clients per DUT in a package-level map with no invalidation mechanism. After a DUT reboot or supervisor switchover, GNOI(t) still returns the pre-reboot connection. RPCs on that connection hang until the test's context expires and then fail with context deadline exceeded. The underlying TCP socket is a zombie: the peer rebooted without sending FIN/RST, and gRPC has no keepalive configured to detect this.

This is not vendor-specific. It affects any test that reboots a DUT and then uses GNOI(t).

I observed it while working on CNTR-1 and CNTR-3 for Arista's EOS (dual-RP hardware DCS-7804-CH):

  • grpcurl with a fresh TLS connection to the device returned the expected response immediately after reboot.
  • The same RPC via the test's GNOI(t) client hung for the full test timeout and then failed with context deadline exceeded.
  • A docker ps on the device confirmed the containerz service was healthy and responding, so the cached connection was the sole problem.

Suggested fix

Two coordinated changes to internal/rawapis/rawapis.go:

  1. Add gRPC keepalive to CommonDialOpts - something like:
  grpc.WithKeepaliveParams(keepalive.ClientParameters{
      Time:                30 * time.Second,
      Timeout:             10 * time.Second,
      PermitWithoutStream: true,
  })

This causes gRPC to send HTTP/2 PING frames on idle connections. If the peer doesn't respond within Timeout, gRPC marks the connection failed and subsequent RPCs return codes.Unavailable quickly (~40s after peer death) instead of hanging until the caller's deadline expires.
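The ~40s figure follows from the two durations in the snippet: up to one full Time interval before the next PING is sent, plus Timeout waiting for the ack. A minimal sketch of that arithmetic, using only the standard library: clientParams is a local stand-in that mirrors the three fields of keepalive.ClientParameters used above, and suggestedParams and worstCaseDetection are hypothetical helpers, not part of gRPC or Ondatra.

```go
package main

import "time"

// clientParams is a local stand-in mirroring the three fields of
// keepalive.ClientParameters used in the suggested dial option.
type clientParams struct {
	Time                time.Duration
	Timeout             time.Duration
	PermitWithoutStream bool
}

// suggestedParams returns the values proposed in this issue.
func suggestedParams() clientParams {
	return clientParams{
		Time:                30 * time.Second,
		Timeout:             10 * time.Second,
		PermitWithoutStream: true,
	}
}

// worstCaseDetection estimates how long a dead peer can go unnoticed on a
// quiet connection: one full ping interval plus the ack timeout.
// ok is false when no ping would be sent at all, which is the case when
// there are no active RPCs and PermitWithoutStream is unset.
func worstCaseDetection(p clientParams, activeStreams int) (d time.Duration, ok bool) {
	if activeStreams == 0 && !p.PermitWithoutStream {
		return 0, false
	}
	return p.Time + p.Timeout, true
}
```

With the suggested values the bound is 30s + 10s = 40s. The second return value illustrates why PermitWithoutStream matters here: a cached client is typically idle between test steps, and without that flag an idle connection would not be probed at all.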

  2. Add a unary + stream interceptor that evicts on codes.Unavailable:
    When any RPC returns codes.Unavailable, the interceptor removes the cached client for that DUT. The next FetchGNOI / FetchGNMI / etc. call re-dials cleanly.

Why both are needed together:

  • Keepalive alone: the dead connection eventually fails, but the dead entry stays in the cache - the next call still uses it, gets Unavailable, and returns the error to the caller.
  • Auto-evict alone: never triggers on a TCP-zombie connection, because gRPC never produces Unavailable on a half-open socket - it just hangs - unless keepalive forces the failure first.

Together they make the cache self-healing: reboot causes keepalive failure -> Unavailable -> interceptor evicts -> next RPC re-dials transparently, with no test changes required.

codes.DeadlineExceeded probably should not be used as the eviction signal - it is ambiguous (slow server, slow RPC, or dead connection are indistinguishable), and evicting on it would cause spurious cache misses on legitimately slow RPCs.
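The eviction behaviour described above, including the choice to ignore DeadlineExceeded, can be sketched with the standard library alone. This is a simplified model, not the rawapis code: clientCache, conn, invoke, and the two sentinel errors are hypothetical stand-ins. A real version would presumably live in a grpc.UnaryClientInterceptor (plus a stream counterpart) and compare status.Code(err) against codes.Unavailable.

```go
package main

import (
	"errors"
	"sync"
)

// Stand-ins for gRPC status codes. A real interceptor would inspect
// status.Code(err) instead of comparing sentinel errors.
var (
	errUnavailable      = errors.New("unavailable")
	errDeadlineExceeded = errors.New("deadline exceeded")
)

// conn is a hypothetical placeholder for a cached client connection.
type conn struct{ id int }

// clientCache keeps one conn per DUT name and dials on demand.
type clientCache struct {
	mu    sync.Mutex
	conns map[string]*conn
	dial  func(dut string) *conn
}

func newClientCache(dial func(string) *conn) *clientCache {
	return &clientCache{conns: make(map[string]*conn), dial: dial}
}

// fetch returns the cached conn for dut, dialing if none is cached.
func (c *clientCache) fetch(dut string) *conn {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cn, ok := c.conns[dut]; ok {
		return cn
	}
	cn := c.dial(dut)
	c.conns[dut] = cn
	return cn
}

// evict drops the entry for dut only if it still points at stale, so an
// entry that was already replaced by a fresh dial is left alone.
func (c *clientCache) evict(dut string, stale *conn) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.conns[dut] == stale {
		delete(c.conns, dut)
	}
}

// invoke plays the role of the interceptor: it runs rpc on the cached conn
// and evicts on Unavailable only. DeadlineExceeded is ambiguous, so the
// entry is kept. The error is always returned to the caller unchanged.
func (c *clientCache) invoke(dut string, rpc func(*conn) error) error {
	cn := c.fetch(dut)
	err := rpc(cn)
	if errors.Is(err, errUnavailable) {
		c.evict(dut, cn)
	}
	return err
}
```

Note that in this sketch the failing call still surfaces Unavailable to the caller; only the following fetch re-dials. Comparing the pointer in evict is one way to avoid discarding a connection that another goroutine has already re-dialed.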

Workaround

openconfig/featureprofiles#5509 works around this by calling dut.RawAPIs().BindingDUT().DialGNOI(ctx) directly after each reboot/switchover and threading the resulting fresh clients through to the affected code paths. This avoids touching the stale cache entry, but it requires every test author working with rebooting DUTs to know about the pattern and apply it manually. That does not scale, since the problem can surface in any test that reboots a DUT. An Ondatra-level fix would solve it for all tests at once.
