
Fix/raft rejoin errors #170

Merged · 7 commits merged into ipfs-cluster:master from fix/raft-rejoin-errors on Oct 12, 2017
Conversation

@ZenGround0 (Collaborator) commented Oct 11, 2017

Addresses #112 by storing the current peers as bootstrappers when a node is shut down. This makes rejoining the cluster pain-free for the departing node (there is no chance for bad raft state to cause errors during rejoin). It also strengthens the implicit assumption that departing nodes rejoin the cluster by default, since starting the departing node as its own cluster now requires modifying the config.
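For illustration only, here is a toy Go sketch of the flow described above. Apart from the Bootstrap config key and the idea of saving peers on shutdown, the types and method bodies below are simplified stand-ins, not the real ipfs-cluster code (the actual change is in the diff further down):

```go
package main

import "fmt"

// Toy stand-ins for the real cluster and config types. Only the Bootstrap key
// and the shutdown/restart flow come from the PR description; everything else
// here is a simplification.
type Config struct {
	Bootstrap []string // multiaddresses to bootstrap from on the next start
}

type Cluster struct {
	config *Config
	peers  []string // current peer multiaddresses (peerManager.peersAddrs() in the real code)
}

// Shutdown records the current peers as bootstrappers so that a later restart
// rejoins the same cluster instead of coming up as its own one-node cluster.
func (c *Cluster) Shutdown() {
	c.config.Bootstrap = c.peers
	fmt.Println("saved bootstrap peers:", c.config.Bootstrap)
}

// Start rejoins via the saved bootstrappers when any are present.
func (c *Cluster) Start() {
	if len(c.config.Bootstrap) > 0 {
		fmt.Println("bootstrapping to:", c.config.Bootstrap)
		return
	}
	fmt.Println("starting a fresh single-node cluster")
}

func main() {
	cfg := &Config{}
	node := &Cluster{config: cfg, peers: []string{"/ip4/10.0.0.2/tcp/9096/ipfs/QmPeer"}}
	node.Shutdown()                 // peers recorded in cfg.Bootstrap
	(&Cluster{config: cfg}).Start() // a restarted node rejoins via those peers
}
```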

@hsanjuan I may be using Shadow in an unconventional way, feedback is welcome.

This PR exposed a flaw in the tests that is still a WIP. Specifically, in k8s-ipfs, restarting the cluster is currently handled by reinitializing ipfs-cluster-service before starting the daemon, which clears listeners bound to sockets left behind by the old daemon. Using the saved config between restarts means reinitializing is not an option, so the network state needs to be cleaned up elsewhere. I will comment when this is resolved and the full k8s test suite correctly preserves configs.

Edit: Upon further inspection this looks like just a timing consideration (the re-init simply gave the system more time to clean things up). The newest commit doubles the restart timing buffer to help with this.

@hsanjuan (Collaborator) left a comment

See comments.

cluster.go Outdated

@@ -423,6 +423,10 @@ func (c *Cluster) Shutdown() error {
 	} else {
 		time.Sleep(2 * time.Second)
 	}
+	/* set c.Config.Bootstrap to current peers */
+	c.config.Bootstrap = c.peerManager.peersAddrs()
+	c.config.Shadow()
@hsanjuan (Collaborator)
Hmm yeah, unconventional :)

There is a behaviour change here. Before, if the user passed a flag like --leave while leave_on_shutdown was false in the configuration, the configuration stayed false when the cluster was shut down. With this change, however, it will switch to true when this code runs.

I think the cleanest approach here is to call c.config.unshadow() before editing the Bootstrap key. In both cases you are saving right afterwards, and that is when unshadow is supposed to happen anyway. Since you have already done it, whatever you write to .Bootstrap will stay.
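To make the ordering concrete, here is a toy Go sketch of the suggestion above. It is not the real ipfs-cluster Config implementation; the shadow field, unshadow and save below are simplified placeholders that only illustrate why calling unshadow before editing Bootstrap keeps the Bootstrap edit while discarding runtime-only flag changes such as --leave:

```go
package main

import "fmt"

// Toy model of the shadow/unshadow idea, NOT the actual ipfs-cluster Config;
// it only demonstrates the order "unshadow, then edit Bootstrap, then save".
type Config struct {
	Bootstrap       []string
	LeaveOnShutdown bool

	shadow *Config // snapshot of the values as they were loaded from disk
}

// unshadow restores the on-disk values, discarding runtime changes made by
// flags such as --leave.
func (c *Config) unshadow() {
	if c.shadow != nil {
		c.LeaveOnShutdown = c.shadow.LeaveOnShutdown
		c.shadow = nil
	}
}

// save stands in for persisting the config to disk.
func (c *Config) save() {
	fmt.Printf("persisted: leave_on_shutdown=%v bootstrap=%v\n",
		c.LeaveOnShutdown, c.Bootstrap)
}

func main() {
	// Loaded from disk with leave_on_shutdown=false, then overridden by --leave.
	cfg := &Config{LeaveOnShutdown: false}
	cfg.shadow = &Config{LeaveOnShutdown: cfg.LeaveOnShutdown}
	cfg.LeaveOnShutdown = true // runtime-only flag change

	cfg.unshadow()                                     // 1. drop runtime overrides first
	cfg.Bootstrap = []string{"/ip4/10.0.0.2/tcp/9096"} // 2. then record current peers
	cfg.save()                                         // 3. Bootstrap stays, leave_on_shutdown stays false
}
```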

@hsanjuan (Collaborator)

You may have broken sharness too

@hsanjuan (Collaborator)

LGTM now, but tests should pass

@hsanjuan (Collaborator)

Maybe sharness will be fixed upon rebasing.

@ZenGround0 note that you can work directly on ipfs-cluster branches and don't need to do it on your own fork. This would also allow me to push to them more easily.

@ZenGround0 (Collaborator, Author)

@hsanjuan ack

Adding in sharness test fixes to pass tests
@ZenGround0 ZenGround0 merged commit 180807b into ipfs-cluster:master Oct 12, 2017
@ZenGround0 ZenGround0 deleted the fix/raft-rejoin-errors branch October 12, 2017 13:54
@hsanjuan hsanjuan mentioned this pull request Oct 17, 2017