
Initial Raft bootstrap using multiple arbitrarily-started nodes should be supported #853

nathanleclaire opened this issue Jun 8, 2016 · 4 comments

nathanleclaire commented Jun 8, 2016

Consider:

You want to start a swarmkit cluster consisting of, say, 3 manager nodes and 7 worker nodes.

You boot the manager and worker nodes up at the same time, expecting the workers to join once the initial bootstrap and leader election are complete. You want the leader election to be arbitrary, i.e. handled by swarmkit's Raft implementation and not dictated ahead of time from on high.

In the 3 Nodes Cluster example, --join-addr specifies only one address, so it's not clear how an initial leader election is intended to be handled. (EDIT: If I understand correctly, the node where docker swarm init is run simply becomes the leader by default.)
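For reference, the join step in that example takes a single manager address, something like this (going from memory of the README, so the exact flags may be off):

$ swarmd -d /tmp/node-2 --hostname node-2 --join-addr 127.0.0.1:4242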

How will this be handled in swarmkit? Bootstraps of this nature seem to be supported by etcd and Consul, so perhaps swarmkit should support them as well?

In Consul, when starting an agent for an initial bootstrap, you pass the -bootstrap-expect N flag to delay leader election until N peers have connected. It seems that these agents then discover each other by one of them running consul join with multiple addresses:

$ consul join <Node A Address> <Node B Address> <Node C Address>
Successfully joined cluster by contacting 3 nodes.

(I have no idea whether the daemon retries the join if the IPs cannot be contacted. Retrying a set number of times seems like the prudent approach.)
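For concreteness, my understanding is that the server-side bootstrap on each of the three Consul servers looks roughly like this (flag names from the Consul agent docs; if I'm reading them right, -retry-join also addresses the retry concern above by retrying failed joins inside the agent):

$ consul agent -server -bootstrap-expect 3 -data-dir /tmp/consul \
    -retry-join <Node A Address> -retry-join <Node B Address>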

Interestingly, they note:

Since a join operation is symmetric, it does not matter which node initiates it.

Then, for the workers:

... [when] the servers are all started and replicating to each other, all the remaining clients can be joined. Clients are much easier as they can join against any existing node.

Another example: in etcd, the static case seems to be handled by the --initial-cluster flag:

As we know the cluster members, their addresses and the size of the cluster before starting, we can use an offline bootstrap configuration by setting the initial-cluster flag.
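For reference, the static bootstrap from etcd's clustering guide looks roughly like this for one member (the other two run the same command with their own --name and peer URLs; flag names as I recall them from the etcd docs):

$ etcd --name infra0 --initial-advertise-peer-urls http://10.0.1.10:2380 \
    --listen-peer-urls http://10.0.1.10:2380 \
    --initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
    --initial-cluster-state new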

There is also support for a "discovery service" and for DNS SRV records. I'm sure each has its tradeoffs, so maybe swarmkit could implement a simple solution to this problem, like static configuration (or Consul's multi-node join), to start with?
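To sketch what I mean (this is purely hypothetical; as far as I know, neither a multi-address --join-addr nor a --bootstrap-expect flag exists in swarmkit today), a static bootstrap could look something like:

$ swarmd --hostname node-1 --join-addr 10.0.1.10:4242,10.0.1.11:4242,10.0.1.12:4242 --bootstrap-expect 3

where each manager delays leader election until the expected number of peers is reachable, as Consul does.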

In summary:

  • Without such bootstrap functionality accepting multiple node addresses, will end users be expected to bear the burden of this initial distributed consensus and discovery instead of swarmkit? What's the intended workflow/solution here? How do downstream nodes discover the upstream ones?
  • Perhaps more importantly, what happens if the initial node that other nodes --join to goes away? How are downstream joining nodes intended to get the address of a new node to --join to?

@aluzzardi @vieux @dongluochen @stevvooe Thanks, I hope I am understanding the internals correctly and have written a thorough examination of the issue at play.

abronan (Contributor) commented Jun 8, 2016

Agreed. I think it's related to #342 :)

nathanleclaire (Author) commented

Oh, nice!

dongluochen (Contributor) commented

I think providing a pool of addresses is a good idea. N/2 + 1 nodes would have to show up before bootstrapping; this would prevent a network partition from starting 2 different clusters.

jarrydfillmore commented

100%!
