
Pod failing to initialize due to no prior node #250

Closed · pedep opened this issue Mar 8, 2019 · 2 comments · Fixed by #252

pedep (Contributor) commented Mar 8, 2019

I have set up a 3-node MySQL cluster to play around with mysql-operator.

When the node containing mysql-0 is drained, the pod seems unable to restore from a sibling/master in the cluster after being rescheduled onto another node.
On inspection, the sidecar fails with this error:
https://github.com/presslabs/mysql-operator/blob/c26526ee7be6e22d0f2825e7ce33ae71781b87e3/pkg/sidecar/appclone/appclone.go#L73

Since I am using emptyDir, the clone-mysql sidecar should download from the current master or a sibling, but because the serverId is 100, it goes straight to the error message above:
https://github.com/presslabs/mysql-operator/blob/c26526ee7be6e22d0f2825e7ce33ae71781b87e3/pkg/sidecar/appclone/appclone.go#L65
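For context, the sidecar assigns server IDs as 100 plus the pod's ordinal index, so mysql-0 always ends up with exactly 100. A minimal sketch of that convention, assuming hostnames of the form <cluster>-mysql-<ordinal> and with illustrative helper names (not necessarily the operator's actual ones):

package util

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// GetServerID derives the MySQL server ID from the pod hostname,
// e.g. "my-cluster-mysql-2" -> ordinal 2 -> server ID 102.
// Pod 0 therefore always receives the base ID of 100.
func GetServerID() int {
	hostname, _ := os.Hostname()
	parts := strings.Split(hostname, "-")
	ordinal, _ := strconv.Atoi(parts[len(parts)-1])
	return 100 + ordinal
}

// GetHostFor maps a server ID back to the stable per-pod name, so
// GetHostFor(GetServerID()-1) addresses the previous StatefulSet
// sibling. The cluster name is hard-coded here purely for brevity.
func GetHostFor(id int) string {
	return fmt.Sprintf("my-cluster-mysql-%d", id-100)
}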


It seems some kind of recovery path for pod 0 is needed.
I would suggest something along these lines:

if util.GetServerID() > 100 {
	sourceHost := util.GetHostFor(util.GetServerID() - 1)
	err := cloneFromSource(sourceHost)
	if err != nil {
		return fmt.Errorf("failed to clone from %s, err: %s", sourceHost, err)
	}
+} else if util.GetServerID() == 100 {
+	// pod 0 has no prior node to clone from; fall back to the
+	// current master known to the orchestrator
+	sourceHost := util.GetMasterHost()
+	err := cloneFromSource(sourceHost)
+	if err != nil {
+		return fmt.Errorf("failed to clone from %s, err: %s", sourceHost, err)
+	}
} else {
	return fmt.Errorf(
		"failed to initialize because no of no prior node exists, check orchestrator maybe",
	)
}

I don't think this will result in the pod trying to connect to itself for recovery, due to this check above:
https://github.com/presslabs/mysql-operator/blob/c26526ee7be6e22d0f2825e7ce33ae71781b87e3/pkg/sidecar/appclone/appclone.go#L52
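Presumably that guard compares the clone source against the pod's own hostname and skips cloning on a match. A rough sketch of the idea, where shouldClone is a hypothetical name rather than the operator's actual function:

// shouldClone reports whether this pod must clone data from another
// node before starting MySQL (sketch only).
func shouldClone() bool {
	hostname, _ := os.Hostname()
	// if this pod is itself the current master there is no other
	// node to clone from, so the clone step is skipped entirely
	return !strings.Contains(util.GetMasterHost(), hostname)
}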


The easiest way to reproduce this behaviour is to create a new cluster with volumeSpec.emptyDir: {} and a few replicas, then delete the my-cluster-mysql-0 pod.
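For reference, a cluster manifest along those lines might look like the following. This is a sketch based on the mysql-operator examples of that era; the secret name is illustrative and exact fields may differ between versions:

apiVersion: mysql.presslabs.org/v1alpha1
kind: MysqlCluster
metadata:
  name: my-cluster
spec:
  replicas: 3
  secretName: my-cluster-secret
  volumeSpec:
    emptyDir: {}

Deleting the first pod with kubectl delete pod my-cluster-mysql-0 should then trigger the failing clone path.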

AMecea added this to the 0.2.x milestone Mar 11, 2019
AMecea (Contributor) commented Mar 11, 2019

Nice catch, @pedep! Indeed this is a bug; I didn't test much with emptyDir.

I think your patch should fix this issue.

I will be happy to review and merge a PR with the fix.

pedep (Contributor, Author) commented Mar 12, 2019

@AMecea Thanks 😄

I will try my hand at a PR in a moment.
