Description
Related issue: tikv/client-go#1626
A few days ago, one TiKV store became unavailable in our JuiceFS TiKV cluster, and Grafana showed a large number of connections to that node. Even after the store was marked offline, the connections persisted until the tombstone store was removed.
The issue comes from the TiKV client's health check mechanism: upon detecting an unreachable store (likely due to a network error), it enters a health check loop with a 1-second timeout.
With 3,000 JuiceFS clients (using tikv-client-go), this escalated to potentially 3,000 health checks per second, overwhelming the system.
Tcpdump analysis showed that 77% of connections lasted under 0.1 seconds (starting with a TLS Client Hello and closing within 3 ms), while 23% timed out after 1 second. The origin of the short-lived connections remains unclear.
However, increasing the livenessTimeout (e.g., to 5 seconds) might reduce the pressure.
A possible change:
In tkv_tikv.go, func newTikvClient:
tikv.SetStoreLivenessTimeout(5 * time.Second)
client, err := txnkv.NewClient(strings.Split(tUrl.Host, ","))
We can also wait for the TiKV client community to respond to the referenced issue, tikv/client-go#1626,
and decide whether to adopt this change based on their feedback.