### Problem

When running gNMIC in a Kubernetes StatefulSet cluster (15–25 replicas), the `leader-wait-timer` creates an unavoidable trade-off between cold-start safety and rolling-restart speed.

**Cold start scenario:** All pods start simultaneously (`podManagementPolicy: Parallel`). If the leader dispatches targets before other pods have registered with Consul, a small number of early pods receive all targets and OOM. A long `leader-wait-timer` (e.g. 300s) prevents this.

**Rolling restart scenario:** Pods restart one at a time. When the leader pod restarts and a new leader is elected, 14–24 other pods are already running and registered. The long `leader-wait-timer` still fires, causing an unnecessary 5-minute metrics collection gap despite no OOM risk.

There is no way to configure gNMIC to distinguish between these two scenarios: the timer is a fixed delay regardless of cluster state.
### Current behavior

In `pkg/app/clustering.go`, after a pod wins the leader lock, it unconditionally sleeps for `LeaderWaitTimer` before starting the loader and dispatching targets:

```go
go func() {
	go a.watchMembers(ctx)
	a.Logger.Printf("leader waiting %s before dispatching targets",
		a.Config.Clustering.LeaderWaitTimer)
	time.Sleep(a.Config.Clustering.LeaderWaitTimer) // fixed delay
	a.Logger.Printf("leader done waiting, starting loader and dispatching targets")
	go a.startLoader(ctx)
	go a.dispatchTargets(ctx)
}()
```

Meanwhile, `watchMembers()` is already running concurrently and populating `a.apiServices` with healthy registered instances (via Consul TTL health checks). The leader already knows how many cluster members are ready; it just doesn't use that information.
### Proposed solution

Add a new clustering config field, `min-ready-instances`, that allows the leader to dispatch targets as soon as a sufficient number of cluster members have registered, while keeping `leader-wait-timer` as a maximum timeout.

#### Config example
```yaml
clustering:
  leader-wait-timer: 300s     # maximum wait (safety net / timeout)
  min-ready-instances: 12     # dispatch as soon as 12 members registered
```

#### Behavior

- If `min-ready-instances` is set, the leader polls `len(a.apiServices)` during the wait period.
- As soon as `len(a.apiServices) >= min-ready-instances`, dispatch begins immediately.
- If the threshold isn't reached within `leader-wait-timer`, dispatch proceeds anyway (current behavior, prevents indefinite blocking).
- If `min-ready-instances` is not set (default `0`), behavior is unchanged: a pure timer-based wait.
### Implementation sketch

The change is localized to `startCluster()` in `pkg/app/clustering.go` and the config struct in `pkg/config/clustering.go`:
```go
// In pkg/config/clustering.go, add to struct:
MinReadyInstances int `mapstructure:"min-ready-instances,omitempty" ...`
```

```go
// In pkg/app/clustering.go, replace time.Sleep with:
deadline := time.After(a.Config.Clustering.LeaderWaitTimer)
ticker := time.NewTicker(2 * time.Second)
defer ticker.Stop()
WAIT:
for {
	select {
	case <-deadline:
		a.configLock.RLock()
		n := len(a.apiServices)
		a.configLock.RUnlock()
		a.Logger.Printf("leader-wait-timer expired, dispatching with %d instances", n)
		break WAIT
	case <-ticker.C:
		a.configLock.RLock()
		n := len(a.apiServices)
		a.configLock.RUnlock()
		if a.Config.Clustering.MinReadyInstances > 0 && n >= a.Config.Clustering.MinReadyInstances {
			a.Logger.Printf("min-ready-instances threshold met (%d/%d), dispatching",
				n, a.Config.Clustering.MinReadyInstances)
			break WAIT
		}
	case <-ctx.Done():
		return
	}
}
// threshold met or timer expired: start loader and dispatch targets
```

(The sketch uses a labeled `break` instead of `goto`, and takes the read lock in the deadline branch as well, for consistency with the ticker branch.)

### Impact
| Scenario | Current (300s timer) | With `min-ready-instances` |
|---|---|---|
| Cold start (15 pods) | 5 min delay | ~30-60s (pods register quickly with Parallel policy) |
| Rolling restart | 5 min delay | ~2-4s (14 pods already registered) |
| Partial failure | 5 min delay | Waits until threshold OR timeout |
### Our deployment context

We run gNMIC v0.43.0 in production across multiple Kubernetes clusters:

- 15–25 replicas per cluster
- 200+ Arista/Junos targets per cluster
- Consul-based clustering with TTL health checks
- `podManagementPolicy: Parallel` StatefulSets
- The 5-minute gap during rolling restarts is our primary pain point
We're happy to contribute a PR if the maintainers are open to this approach.