
Support failover for across Kubernetes clusters deployment #2935

Closed
DanielZhangQD opened this issue Jul 15, 2020 · 5 comments
Assignees
Labels
status/help-wanted Extra attention is needed

Comments

@DanielZhangQD
Contributor

DanielZhangQD commented Jul 15, 2020

Feature Request

Is your feature request related to a problem? Please describe:

We need to make sure that upgrading, scaling, and failover work as expected for deployments across Kubernetes clusters.
Describe the feature you'd like:

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

Related issues and PRs:
#3661
#3556
#3530

@DanielZhangQD
Contributor Author

Upgrading and scaling should work as expected; we just need to verify them.

For the failover function (covering PD, TiDB, TiFlash, and TiKV), the current implementation is based on failureMembers (PD/TiDB) or failureStores (TiKV/TiFlash). However, since the multiple TidbClusters in different Kubernetes clusters now form a single TiDB cluster, if one Pod in one TidbCluster fails, the other TidbClusters may also see the failure and record it in their status, which would trigger failover in every TidbCluster for the same Pod. We therefore have to update the logic so that only failed Pods belonging to this cluster are recorded, for example by adding the cluster domain to the match pattern at https://github.com/pingcap/tidb-operator/blob/master/pkg/manager/member/tikv_member_manager.go#L52 for TiKV, and applying similar filters for the other components.
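
A minimal sketch of what such a per-cluster filter could look like is below. The function names and the exact store address format are assumptions for illustration, not the actual code at the linked line:

```go
package member

import (
	"fmt"
	"regexp"
)

// buildLocalStorePattern builds a regular expression that only matches TiKV
// store addresses belonging to this TidbCluster by including the cluster
// domain in the expected Pod FQDN. All names here are illustrative.
func buildLocalStorePattern(tcName, namespace, clusterDomain string) (*regexp.Regexp, error) {
	// Assumed store address form:
	//   <tcName>-tikv-<ordinal>.<tcName>-tikv-peer.<namespace>.svc[.<clusterDomain>]:20160
	suffix := ""
	if clusterDomain != "" {
		suffix = `\.` + regexp.QuoteMeta(clusterDomain)
	}
	pattern := fmt.Sprintf(`^%s-tikv-\d+\.%s-tikv-peer\.%s\.svc%s:20160$`,
		regexp.QuoteMeta(tcName), regexp.QuoteMeta(tcName), regexp.QuoteMeta(namespace), suffix)
	return regexp.Compile(pattern)
}

// isLocalStore reports whether a store address belongs to this TidbCluster,
// so only failures of locally managed Pods would be recorded in failureStores.
func isLocalStore(re *regexp.Regexp, storeAddr string) bool {
	return re.MatchString(storeAddr)
}
```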

@DanielZhangQD
Contributor Author

Actually, the status of each TidbCluster should only include the components managed by this TidbCluster.

@handlerww
Contributor

handlerww commented Nov 25, 2020


Regarding the failover concern above: it works normally. The info about failureMembers is gathered from tc.Status.PD.Members, and these are in-cluster operations.
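
As an illustration of that in-cluster scoping, a simplified sketch is below; the types and field names are stand-ins rather than the actual TidbCluster API definitions:

```go
package member

// Simplified stand-ins for the operator's status types; the real definitions
// live in the TidbCluster API package.
type PDMember struct {
	Name   string
	Health bool
}

type PDStatus struct {
	Members        map[string]PDMember
	FailureMembers map[string]PDMember
}

// recordFailureMembers marks unhealthy members as failure members. Because it
// only walks the members already recorded in this TidbCluster's status
// (tc.Status.PD.Members), failover stays an in-cluster operation.
func recordFailureMembers(status *PDStatus) {
	if status.FailureMembers == nil {
		status.FailureMembers = map[string]PDMember{}
	}
	for name, m := range status.Members {
		if !m.Health {
			status.FailureMembers[name] = m
		}
	}
}
```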

@DanielZhangQD
Contributor Author

We need to make sure that failover works as expected for PD, TiDB, TiKV, and TiFlash.

@handlerww
Contributor

handlerww commented Jan 14, 2021

The manual test results are below.
There is 1 PD in cluster1 and 2 PDs in cluster2; cluster1-pd-0 and cluster1-tidb-0 were disabled.

cluster1-discovery-7d8dc88cbb-hl2p4        1/1     Running            0          153m
cluster1-pd-0                              0/1     ErrImagePull       0          35m
cluster1-tidb-0                            1/2     ImagePullBackOff   0          47m
cluster1-tikv-0                            1/1     Running            0          8m43s
cluster1-tikv-1                            1/1     Running            0          48m
cluster1-tikv-2                            1/1     Running            0          34m

TiDB failover then completed successfully before PD recovered, which verifies the design of the TiDB startup scripts. PD then recovered as well, which verifies the PDClient retry logic in PDControl.
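
For context on the retry behavior being verified, a rough sketch of a retry wrapper of this kind is below; the package, function, and parameter names are assumptions rather than the actual PDControl implementation:

```go
package pdapi

import (
	"fmt"
	"time"
)

// callWithRetry retries a PD API call a few times before giving up, so a
// transiently unavailable PD (for example during failover) does not
// immediately fail the reconcile loop. Names and values are illustrative.
func callWithRetry(attempts int, delay time.Duration, call func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = call(); err == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("PD call failed after %d attempts: %w", attempts, err)
}
```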
