rawapis: cached gNOI/gNMI clients become zombie connections after DUT reboot or supervisor switchover

### Problem

`internal/rawapis/rawapis.go` caches gNOI (and gNMI/gNSI/gRIBI/P4RT) clients per DUT in a package-level map with no invalidation mechanism. After a DUT reboot or supervisor switchover, `GNOI(t)` returns the pre-reboot connection. RPCs on that connection hang until the test's context expires and then fail with `context deadline exceeded`. The underlying TCP socket is a zombie: the peer rebooted without sending FIN/RST, and gRPC has no keepalive configured to detect this.

This is not vendor-specific. It affects any test that reboots a DUT and then uses `GNOI(t)`.

I observed this while working on CNTR-1 and CNTR-3 on Arista EOS (dual-RP hardware, DCS-7804-CH):
- `grpcurl` with a fresh TLS connection to the device returned the expected response immediately after reboot.
- The same RPC via the test's `GNOI(t)` client hung and failed with `context deadline exceeded` after the full test timeout.
- `docker ps` on the device confirmed that the containerz service was healthy and responding, so the cached connection was the sole problem.

### Suggested fix

Two coordinated changes to `internal/rawapis/rawapis.go`:

1. Add gRPC keepalive to `CommonDialOpts` - something like:
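```go
  // Needs imports "time" and "google.golang.org/grpc/keepalive".
  grpc.WithKeepaliveParams(keepalive.ClientParameters{
      Time:                30 * time.Second, // send a PING after 30s without activity
      Timeout:             10 * time.Second, // close the connection if the PING is not acked within 10s
      PermitWithoutStream: true,             // send PINGs even when no RPC is in flight
  })
```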

This causes gRPC to send HTTP/2 PING frames on idle connections. If the peer doesn't respond within `Timeout`, gRPC marks the connection failed about 40s after peer death (`Time` + `Timeout`). RPCs then return `codes.Unavailable` quickly instead of hanging until the context deadline.

2. Add a unary + stream interceptor that evicts on `codes.Unavailable`:
When any RPC returns `codes.Unavailable`, the interceptor removes the cached client for that DUT. The next `FetchGNOI` / `FetchGNMI` / etc. call re-dials cleanly.

Why both are needed together:
- Keepalive alone: the dead connection eventually fails, but the dead entry stays in the cache - the next call still uses it, gets `Unavailable`, and returns the error to the caller.
- Auto-evict alone: never triggers on a TCP-zombie connection. Without keepalive, gRPC does not produce `Unavailable` on a half-open socket; the RPC just hangs until its deadline.

Together they make the cache self-healing: reboot causes keepalive failure -> `Unavailable` -> interceptor evicts -> next RPC re-dials transparently, with no test changes required.

`codes.DeadlineExceeded` probably should not be used as the eviction signal - it is ambiguous (slow server, slow RPC, or dead connection are indistinguishable), and evicting on it would cause spurious cache misses on legitimately slow RPCs.

### Workaround

[openconfig/featureprofiles#5509](https://github.com/openconfig/featureprofiles/pull/5509) works around this by calling `dut.RawAPIs().BindingDUT().DialGNOI(ctx)` directly after each reboot/switchover and threading the resulting fresh clients through to the affected code paths. This avoids touching the stale cache entry, but requires every test author working with rebooting DUTs to know about and apply the pattern manually. That does not scale, since the problem can surface in any test that reboots a DUT. An Ondatra-level fix would solve it for all tests.
