New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PD failover #74
PD failover #74
Conversation
pkg/manager/member/failover.go
Outdated
// Failover implements the logic for pd/tikv/tidb's failover and recovery. | ||
type Failover interface { | ||
Failover(*v1alpha1.TidbCluster) error | ||
Recovery(*v1alpha1.TidbCluster) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use Recover
pkg/manager/member/failover.go
Outdated
// mark a peer as failure | ||
for podName, pdMember := range tc.Status.PD.Members { | ||
deadline := pdMember.LastTransitionTime.Add(pf.pdFailoverPeriod) | ||
_, exist := tc.Status.PD.FailureMembers[podName] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could FailureMembers
be nil
?
pkg/manager/member/failover.go
Outdated
deadline := pdMember.LastTransitionTime.Add(pf.pdFailoverPeriod) | ||
_, exist := tc.Status.PD.FailureMembers[podName] | ||
if !pdMember.Health && time.Now().After(deadline) && !exist && notDeletedCount == 0 { | ||
if !inQuorum { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be moved to the top level immediately after inQuorum
is computed.
pkg/manager/member/failover.go
Outdated
for podName, pdMember := range tc.Status.PD.Members { | ||
deadline := pdMember.LastTransitionTime.Add(pf.pdFailoverPeriod) | ||
_, exist := tc.Status.PD.FailureMembers[podName] | ||
if !pdMember.Health && time.Now().After(deadline) && !exist && notDeletedCount == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
notDeletedCount
can be moved outside of the for
loop, it only needs to be checked once.
pkg/manager/member/failover.go
Outdated
} | ||
inQuorum := healthCount > int(tc.Spec.PD.Replicas/2) | ||
|
||
notDeletedCount := 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the code, I guess this is used to indicate that we can only failover one at a time. So it's better to add a comment for this.
if tc.Status.PD.FailureMembers != nil { | ||
pmm.pdFailover.Recovery(tc) | ||
} | ||
} else if int(tc.Spec.PD.Replicas) == int(tc.Status.PD.StatefulSet.Replicas) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary to cast replicas to int
, they're both int32
.
pkg/manager/meta/meta_manager.go
Outdated
@@ -74,8 +74,8 @@ func (pmm *metaManager) Sync(tc *v1alpha1.TidbCluster) error { | |||
} | |||
// update meta info for pvc | |||
pvc, err := pmm.resolvePVCFromPod(pod) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err
defined but not used. Replace it with _
if you don't want to check it.
@@ -104,6 +104,8 @@ func (pmm *metaManager) resolvePVCFromPod(pod *corev1.Pod) (*corev1.PersistentVo | |||
pvcName = vol.PersistentVolumeClaim.ClaimName | |||
break | |||
} | |||
case v1alpha1.TiDBMemberType.String(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should make a default
for TiDB
and others.
@@ -35,6 +35,12 @@ spec: | |||
- /usr/local/bin/tidb-controller-manager | |||
- -default-storage-class-name={{ .Values.defaultStorageClassName }} | |||
- -cluster-scoped={{ .Values.clusterScoped }} | |||
{{- if .Values.controllerManager.autoFailover }} | |||
- -auto-failover={{ .Values.controllerManager.autoFailover }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use {{ .Values.controllerManager.autoFailover | default false }}
.
- -auto-failover={{ .Values.controllerManager.autoFailover }} | ||
{{- end }} | ||
{{- if .Values.controllerManager.pdFailoverPeriod }} | ||
- -pd-failover-period={{ .Values.controllerManager.pdFailoverPeriod }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
then | ||
ARGS="${ARGS}--initial-cluster=${HOSTNAME}=http://${HOSTNAME}.${PEER_SERVICE_NAME}.${NAMESPACE}.svc:2380" | ||
else | ||
if [[ ${ORDINAL} -eq 0 ]] | ||
then | ||
TOP=$((replicas-1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replicas
is undefined, define it like following:
replicas=${replicas:-3}
/run-e2e-tests |
/run-e2e-tests |
/run-e2e-tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest LGTM
if err != nil && !errors.IsNotFound(err) { | ||
return err | ||
} | ||
// start all the pd members together, if this is a old cluster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/a/an/
/run-e2e-tests |
/run-e2e-tests |
* *: pd failover
This PR fixes #57, adds PD failover feature to TiDB Operator.
@tennix @onlymellb @xiaojingchen @gregwebs PTAL