
Support failover for across Kubernetes clusters deployment #2935

Closed
DanielZhangQD opened this issue Jul 15, 2020 · 5 comments
Assignees
Labels
status/help-wanted Extra attention is needed

Comments

@DanielZhangQD
Contributor

DanielZhangQD commented Jul 15, 2020

Feature Request

Is your feature request related to a problem? Please describe:

We need to make sure that upgrading, scaling, and failover work as expected for deployments across Kubernetes clusters.
Describe the feature you'd like:

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

Related issues and PRs:
#3661
#3556
#3530

@DanielZhangQD
Contributor Author

Upgrading and scaling should work as expected; we just need to verify them.

For the failover function (covering PD, TiDB, TiFlash, and TiKV), the current implementation is based on failureMembers (PD/TiDB) or failureStores (TiKV/TiFlash). However, since the multiple TidbClusters in different Kubernetes clusters now form a single TiDB cluster, if one Pod in one TidbCluster fails, the other TidbClusters may also see the failure and record it in their status, which would trigger failover in every TidbCluster for the same Pod. We therefore have to update the logic so that only failed Pods belonging to this cluster are recorded, for example by adding the cluster domain to the match pattern at https://github.com/pingcap/tidb-operator/blob/master/pkg/manager/member/tikv_member_manager.go#L52 for TiKV, and applying similar filters for the other components.
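
A minimal sketch of what such a per-cluster filter could look like is below. The function names and the exact store address format are assumptions for illustration, not the actual code at the linked line:

```go
package member

import (
	"fmt"
	"regexp"
)

// buildLocalStorePattern builds a regular expression that only matches TiKV
// store addresses belonging to this TidbCluster by including the cluster
// domain in the expected Pod FQDN. All names here are illustrative.
func buildLocalStorePattern(tcName, namespace, clusterDomain string) (*regexp.Regexp, error) {
	// Assumed store address form:
	//   <tcName>-tikv-<ordinal>.<tcName>-tikv-peer.<namespace>.svc[.<clusterDomain>]:20160
	suffix := ""
	if clusterDomain != "" {
		suffix = `\.` + regexp.QuoteMeta(clusterDomain)
	}
	pattern := fmt.Sprintf(`^%s-tikv-\d+\.%s-tikv-peer\.%s\.svc%s:20160$`,
		regexp.QuoteMeta(tcName), regexp.QuoteMeta(tcName), regexp.QuoteMeta(namespace), suffix)
	return regexp.Compile(pattern)
}

// isLocalStore reports whether a store address belongs to this TidbCluster,
// so only failures of locally managed Pods would be recorded in failureStores.
func isLocalStore(re *regexp.Regexp, storeAddr string) bool {
	return re.MatchString(storeAddr)
}
```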

@DanielZhangQD
Contributor Author

Actually, the status of each TidbCluster should only include the components managed by this TidbCluster.

@handlerww
Contributor

handlerww commented Nov 25, 2020


Regarding the failover concern above: it works normally. The info about failureMembers is gathered from tc.Status.PD.Members, and these are in-cluster operations.
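
As an illustration of that in-cluster scoping, a simplified sketch is below; the types and field names are stand-ins rather than the actual TidbCluster API definitions:

```go
package member

// Simplified stand-ins for the operator's status types; the real definitions
// live in the TidbCluster API package.
type PDMember struct {
	Name   string
	Health bool
}

type PDStatus struct {
	Members        map[string]PDMember
	FailureMembers map[string]PDMember
}

// recordFailureMembers marks unhealthy members as failure members. Because it
// only walks the members already recorded in this TidbCluster's status
// (tc.Status.PD.Members), failover stays an in-cluster operation.
func recordFailureMembers(status *PDStatus) {
	if status.FailureMembers == nil {
		status.FailureMembers = map[string]PDMember{}
	}
	for name, m := range status.Members {
		if !m.Health {
			status.FailureMembers[name] = m
		}
	}
}
```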

@DanielZhangQD
Contributor Author

We need to make sure that failover works as expected for PD, TiDB, TiKV, and TiFlash.

@handlerww
Contributor

handlerww commented Jan 14, 2021

The manual test results are below.
There is 1 PD in cluster1 and 2 PDs in cluster2; cluster1-pd-0 and cluster1-tidb-0 were disabled.

cluster1-discovery-7d8dc88cbb-hl2p4        1/1     Running            0          153m
cluster1-pd-0                              0/1     ErrImagePull       0          35m
cluster1-tidb-0                            1/2     ImagePullBackOff   0          47m
cluster1-tikv-0                            1/1     Running            0          8m43s
cluster1-tikv-1                            1/1     Running            0          48m
cluster1-tikv-2                            1/1     Running            0          34m

TiDB failover then completed successfully before PD recovered, which verifies the design of the TiDB startup scripts. PD then recovered as well, which verifies the PDClient retry logic in PDControl.
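
For context on the retry behavior being verified, a rough sketch of a retry wrapper of this kind is below; the package, function, and parameter names are assumptions rather than the actual PDControl implementation:

```go
package pdapi

import (
	"fmt"
	"time"
)

// callWithRetry retries a PD API call a few times before giving up, so a
// transiently unavailable PD (for example during failover) does not
// immediately fail the reconcile loop. Names and values are illustrative.
func callWithRetry(attempts int, delay time.Duration, call func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = call(); err == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("PD call failed after %d attempts: %w", attempts, err)
}
```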
