TiKV Client Health Checks May Cause TiKV Unavailability with a Large Number of JuiceFS Clients #5954

@SonglinLife

Related issue: tikv/client-go#1626

A few days ago, one TiKV store became unavailable in our JuiceFS TiKV cluster, and Grafana showed a large number of connections to that node. Even after we marked the store offline, the connections persisted until the tombstone store was removed.

The issue comes from the TiKV client's health check mechanism: once a client detects an unreachable store (likely due to a network error), it enters a health-check loop with a 1-second timeout.

With 3,000 JuiceFS clients (all using tikv/client-go), this adds up to potentially 3,000 health checks per second against a single store, overwhelming the system.

Tcpdump analysis showed that 77% of connections lasted under 0.1 seconds (each started with a TLS Client Hello and closed within 3 ms), while 23% timed out after 1 second. The origin of the short-lived connections remains unclear.

Increasing the store liveness timeout (e.g., to 5 seconds) might reduce the pressure.

A possible change, in func newTikvClient in tkv_tikv.go:

// Raise the store liveness timeout to 5 seconds before creating the client.
tikv.SetStoreLivenessTimeout(5 * time.Second)
client, err := txnkv.NewClient(strings.Split(tUrl.Host, ","))

We can also wait for the TiKV client community to respond to the related issue, tikv/client-go#1626, and decide whether to adopt this change based on their feedback.

Labels: area/third-party (Issues or PRs related to third party product or project)
