Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PD failover #74

Merged
merged 16 commits into from Sep 7, 2018
Merged

PD failover #74

merged 16 commits into from Sep 7, 2018

Conversation

weekface
Copy link
Contributor

@weekface weekface commented Sep 3, 2018

This PR fixes #57, adds PD failover feature to TiDB Operator.

@tennix @onlymellb @xiaojingchen @gregwebs PTAL

@weekface weekface mentioned this pull request Sep 3, 2018
4 tasks
// Failover implements the logic for pd/tikv/tidb's failover and recovery.
type Failover interface {
Failover(*v1alpha1.TidbCluster) error
Recovery(*v1alpha1.TidbCluster)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Recover

// mark a peer as failure
for podName, pdMember := range tc.Status.PD.Members {
deadline := pdMember.LastTransitionTime.Add(pf.pdFailoverPeriod)
_, exist := tc.Status.PD.FailureMembers[podName]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could FailureMembers be nil?

deadline := pdMember.LastTransitionTime.Add(pf.pdFailoverPeriod)
_, exist := tc.Status.PD.FailureMembers[podName]
if !pdMember.Health && time.Now().After(deadline) && !exist && notDeletedCount == 0 {
if !inQuorum {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be moved to the top level immediately after inQuorum is computed.

for podName, pdMember := range tc.Status.PD.Members {
deadline := pdMember.LastTransitionTime.Add(pf.pdFailoverPeriod)
_, exist := tc.Status.PD.FailureMembers[podName]
if !pdMember.Health && time.Now().After(deadline) && !exist && notDeletedCount == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notDeletedCount can be moved outside of the for loop, it only needs to be checked once.

}
inQuorum := healthCount > int(tc.Spec.PD.Replicas/2)

notDeletedCount := 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the code, I guess this is used to indicate that we can only failover one at a time. So it's better to add a comment for this.

if tc.Status.PD.FailureMembers != nil {
pmm.pdFailover.Recovery(tc)
}
} else if int(tc.Spec.PD.Replicas) == int(tc.Status.PD.StatefulSet.Replicas) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary to cast replicas to int, they're both int32.

@@ -74,8 +74,8 @@ func (pmm *metaManager) Sync(tc *v1alpha1.TidbCluster) error {
}
// update meta info for pvc
pvc, err := pmm.resolvePVCFromPod(pod)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err defined but not used. Replace it with _ if you don't want to check it.

@@ -104,6 +104,8 @@ func (pmm *metaManager) resolvePVCFromPod(pod *corev1.Pod) (*corev1.PersistentVo
pvcName = vol.PersistentVolumeClaim.ClaimName
break
}
case v1alpha1.TiDBMemberType.String():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should make a default for TiDB and others.

@@ -35,6 +35,12 @@ spec:
- /usr/local/bin/tidb-controller-manager
- -default-storage-class-name={{ .Values.defaultStorageClassName }}
- -cluster-scoped={{ .Values.clusterScoped }}
{{- if .Values.controllerManager.autoFailover }}
- -auto-failover={{ .Values.controllerManager.autoFailover }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use {{ .Values.controllerManager.autoFailover | default false }}.

- -auto-failover={{ .Values.controllerManager.autoFailover }}
{{- end }}
{{- if .Values.controllerManager.pdFailoverPeriod }}
- -pd-failover-period={{ .Values.controllerManager.pdFailoverPeriod }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

then
ARGS="${ARGS}--initial-cluster=${HOSTNAME}=http://${HOSTNAME}.${PEER_SERVICE_NAME}.${NAMESPACE}.svc:2380"
else
if [[ ${ORDINAL} -eq 0 ]]
then
TOP=$((replicas-1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replicas is undefined, define it like following:

replicas=${replicas:-3}

@weekface
Copy link
Contributor Author

weekface commented Sep 5, 2018

/run-e2e-tests

@weekface
Copy link
Contributor Author

weekface commented Sep 5, 2018

/run-e2e-tests

@weekface
Copy link
Contributor Author

weekface commented Sep 6, 2018

/run-e2e-tests

Copy link
Member

@tennix tennix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

if err != nil && !errors.IsNotFound(err) {
return err
}
// start all the pd members together, if this is a old cluster
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/a/an/

@weekface
Copy link
Contributor Author

weekface commented Sep 6, 2018

/run-e2e-tests

@weekface
Copy link
Contributor Author

weekface commented Sep 7, 2018

/run-e2e-tests

@weekface weekface merged commit 4885371 into pingcap:master Sep 7, 2018
@weekface weekface deleted the pd-failover branch September 7, 2018 03:13
queenliuxx pushed a commit to queenliuxx/tidb-operator that referenced this pull request Dec 19, 2018
* *: pd failover
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PD automatic failover
3 participants