Support failover for across Kubernetes clusters deployment #2935
Comments
Upgrading and scaling should work as expected; we just need to verify them. Failover for PD, TiDB, TiFlash, and TiKV is currently based on failureMembers (PD/TiDB) or failureStores (TiKV/TiFlash). Because the TidbClusters in the different Kubernetes clusters now form a single TiDB cluster, a Pod failure in one TidbCluster may also be seen by the other TidbClusters and recorded in their status, which would trigger failover for the same Pod in every TidbCluster. So we have to update the logic to record only the failed Pods that belong to this cluster, maybe by adding the cluster domain to the match pattern here https://github.com/pingcap/tidb-operator/blob/master/pkg/manager/member/tikv_member_manager.go#L52 for TiKV, and apply similar filters for the other components.
Actually, the |
It works normally. The info about failureMembers is gathered from tc.Status.PD.Members. These are in-cluster operations.
We need to make sure that failover works as expected for PD, TiDB, TiKV, and TiFlash.
The manual test result is here:
TiDB then failed over successfully before PD recovered, which verifies the design of the TiDB startup scripts. After that, PD recovered, which verifies the design of the PDClient retry logic in PDControl.
Feature Request
Is your feature request related to a problem? Please describe:
Need to make sure that the upgrading, scaling, and failover functions work as expected for deployments across Kubernetes clusters.
Describe the feature you'd like:
Describe alternatives you've considered:
Teachability, Documentation, Adoption, Migration Strategy:
Related Issues and PR:
#3661
#3556
#3530