
Pod failing to initialize due to no prior node #250

Closed · pedep opened this issue Mar 8, 2019 · 2 comments · Fixed by #252

pedep (Contributor) commented Mar 8, 2019

I have set up a 3-node MySQL cluster to play around with mysql-operator.

When the node containing mysql-0 is drained, the pod seems unable to restore from a sibling/master in the cluster after being rescheduled onto another node.
On inspection, the sidecar fails with this error:
https://github.com/presslabs/mysql-operator/blob/c26526ee7be6e22d0f2825e7ce33ae71781b87e3/pkg/sidecar/appclone/appclone.go#L73

Since I am using emptyDir, the clone-mysql sidecar should download from the current master or a sibling, but because the serverId is 100, it goes straight to the error message above:
https://github.com/presslabs/mysql-operator/blob/c26526ee7be6e22d0f2825e7ce33ae71781b87e3/pkg/sidecar/appclone/appclone.go#L65
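For context, the sidecar assigns server IDs as 100 plus the pod's ordinal index, so mysql-0 always ends up with exactly 100. A minimal sketch of that convention, assuming hostnames of the form <cluster>-mysql-<ordinal> and with illustrative helper names (not necessarily the operator's actual ones):

package util

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// GetServerID derives the MySQL server ID from the pod hostname,
// e.g. "my-cluster-mysql-2" -> ordinal 2 -> server ID 102.
// Pod 0 therefore always receives the base ID of 100.
func GetServerID() int {
	hostname, _ := os.Hostname()
	parts := strings.Split(hostname, "-")
	ordinal, _ := strconv.Atoi(parts[len(parts)-1])
	return 100 + ordinal
}

// GetHostFor maps a server ID back to the stable per-pod name, so
// GetHostFor(GetServerID()-1) addresses the previous StatefulSet
// sibling. The cluster name is hard-coded here purely for brevity.
func GetHostFor(id int) string {
	return fmt.Sprintf("my-cluster-mysql-%d", id-100)
}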


It seems some kind of recovery path for pod 0 is needed.
I would suggest something along these lines:

if util.GetServerID() > 100 {
	sourceHost := util.GetHostFor(util.GetServerID() - 1)
	err := cloneFromSource(sourceHost)
	if err != nil {
		return fmt.Errorf("failed to clone from %s, err: %s", sourceHost, err)
	}
+} else if util.GetServerID() == 100 {
+	// pod 0 has no prior node to clone from; fall back to the
+	// current master known to the orchestrator
+	sourceHost := util.GetMasterHost()
+	err := cloneFromSource(sourceHost)
+	if err != nil {
+		return fmt.Errorf("failed to clone from %s, err: %s", sourceHost, err)
+	}
} else {
	return fmt.Errorf(
		"failed to initialize because no of no prior node exists, check orchestrator maybe",
	)
}

I don't think this will result in the pod trying to connect to itself for recovery, due to this check above:
https://github.com/presslabs/mysql-operator/blob/c26526ee7be6e22d0f2825e7ce33ae71781b87e3/pkg/sidecar/appclone/appclone.go#L52
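Presumably that guard compares the clone source against the pod's own hostname and skips cloning on a match. A rough sketch of the idea, where shouldClone is a hypothetical name rather than the operator's actual function:

// shouldClone reports whether this pod must clone data from another
// node before starting MySQL (sketch only).
func shouldClone() bool {
	hostname, _ := os.Hostname()
	// if this pod is itself the current master there is no other
	// node to clone from, so the clone step is skipped entirely
	return !strings.Contains(util.GetMasterHost(), hostname)
}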


The easiest way to reproduce this behaviour is to create a new cluster with volumeSpec.emptyDir: {} and a few replicas, then delete the my-cluster-mysql-0 pod.
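For reference, a cluster manifest along those lines might look like the following. This is a sketch based on the mysql-operator examples of that era; the secret name is illustrative and exact fields may differ between versions:

apiVersion: mysql.presslabs.org/v1alpha1
kind: MysqlCluster
metadata:
  name: my-cluster
spec:
  replicas: 3
  secretName: my-cluster-secret
  volumeSpec:
    emptyDir: {}

Deleting the first pod with kubectl delete pod my-cluster-mysql-0 should then trigger the failing clone path.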

AMecea added this to the 0.2.x milestone Mar 11, 2019
AMecea (Contributor) commented Mar 11, 2019

Nice catch, @pedep! Indeed this is a bug; I didn't test much with emptyDir.

I think your patch should fix this issue.

I will be happy to review and merge a PR with the fix.

pedep (Contributor, Author) commented Mar 12, 2019

@AMecea Thanks 😄

I will try my hand at a PR in a moment.
