
Autodiscovery - autosetup of cluster members #24

Closed
hsanjuan opened this issue Jan 2, 2017 · 10 comments

@hsanjuan
Collaborator

hsanjuan commented Jan 2, 2017

Given a single existing cluster member, a new cluster node should be able to set itself up, retrieve and connect to all members of the cluster.

Note the trickiness of this:

  • Some components (e.g. consensus) are not ready, or should not attempt, to work before receiving the list of peers.
  • It's not as easy as connecting to someone and retrieving the current cluster members (e.g. that would fail when starting a bunch of new members at the same time). It also involves broadcasting that a new node is available.
  • This needs more thought. We should aim for something simple and straightforward; errors during automated setup are the worst.
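
As a rough illustration of the goal, here is a minimal Go sketch (all types and names are hypothetical, not the ipfs-cluster API) of a newcomer joining through a single known member. Note that it ignores the simultaneous-start race flagged in the list above:

```go
package main

import "fmt"

// Peer is a hypothetical stand-in for a cluster member reachable over
// RPC; the real ipfs-cluster types are different.
type Peer struct {
	ID    string
	Known map[string]*Peer // other members this peer knows about
}

// PeerList returns every member this peer knows about, itself included.
func (p *Peer) PeerList() []*Peer {
	out := []*Peer{p}
	for _, q := range p.Known {
		out = append(out, q)
	}
	return out
}

// Join bootstraps a newcomer from one existing member: fetch the
// current peer set, then connect to each peer and announce ourselves.
// It deliberately ignores several newcomers joining at once.
func (p *Peer) Join(bootstrap *Peer) {
	for _, q := range bootstrap.PeerList() {
		if q.ID == p.ID {
			continue
		}
		p.Known[q.ID] = q // connect to the existing member...
		q.Known[p.ID] = p // ...and broadcast that a new node is available
	}
}

func main() {
	b := &Peer{ID: "B", Known: map[string]*Peer{}}
	c := &Peer{ID: "C", Known: map[string]*Peer{"B": b}}
	b.Known["C"] = c

	a := &Peer{ID: "A", Known: map[string]*Peer{}}
	a.Join(c) // A only needs to know C to end up connected to B as well
	fmt.Println(len(a.Known), len(b.Known), len(c.Known)) // 2 2 2
}
```
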
@meyerzinn

Is it necessary for every peer to be aware of every other peer in ipfs-cluster? I was imagining something like a private DHT that assigns blocks to nodes.

@hsanjuan
Collaborator Author

Cluster shares state (list of pinned things) using a consensus algorithm (Raft) so all nodes need to know each other.

@meyerzinn

meyerzinn commented Jan 30, 2017

@hsanjuan When starting a bunch at once, why not seed with one node like so:

Here's a system I was thinking about for autodiscovery. Suppose we have three nodes: A, B, and C. Discovery takes place as follows:

Step 1: Seeding

In order for the system to function, all nodes must be "seeded" with at least one other node. In this diagram, node A is joining the cluster by contacting C.

                            +--------+
                            |        |
                            | Node B |
                            |        |
+--------+                  +--------+
|        |
| Node A |
|        |
+----+---+
     |
     |
     |                        +--------+
     |                        |        |
     +------------------------> Node C |
                              |        |
                              +--------+

Step 2: Pass-through

In this step, peers wishing to join the cluster contact a current member of the cluster. In this diagram, B wishes to join the cluster; thus, B contacts C.

                            +--------+
                            |        |
                            | Node B |
                            |        |
+--------+                  +----+---+
|        |                       |
| Node A |                       |
|        |                       |
+----+---+                       |
     |                           |
     |                           |
     |                        +--v-----+
     |                        |        |
     +------------------------> Node C |
                              |        |
                              +--------+

Step 3: Repeat

C returns a list of peers it is immediately connected to, and B goes to each of those peers individually and requests a list of peers. In this diagram, B contacts A, but since the only peer A will return is C, and B is already connected to C, this "trail" has gone cold.

                            +--------+
                            |        |
     +----------------------+ Node B |
     |                      |        |
+----v---+                  +----+---+
|        |                       |
| Node A |                       |
|        |                       |
+----+---+                       |
     |                           |
     |                           |
     |                        +--v-----+
     |                        |        |
     +------------------------> Node C |
                              |        |
                              +--------+

This mechanism works really well when considering a more complicated model, like so:

                       +--------+
                       |        |
                       | Node A |
                       |        |
                       +---+----+
                           |
+--------+             +---v----+           +--------+
|        |             |        |           |        |
| Node D +-------------> Node B |           | Node F |
|        |             |        |           |        |
+---^----+             +----^---+           +---+----+
    |                       |                   |
    |                  +----+---+               |
    |                  |        |               |
    |                  | Node C <---------------+
    |                  |        |
+---+----+             +--------+
|        |
| Node E |
|        |
+--------+

Let's assume that nodes are seeded with Node B. No matter what order nodes connect in, they should all end up in the same cluster.
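
A minimal sketch of this crawl in Go (the adjacency map peersOf and the discover function are invented stand-ins for real RPC calls; the graph matches the diagram above once links are bidirectional):

```go
package main

import "fmt"

// peersOf simulates asking a node for the peers it is immediately
// connected to, mirroring the seeding diagram above.
var peersOf = map[string][]string{
	"A": {"B"},
	"B": {"A", "C", "D"},
	"C": {"B", "F"},
	"D": {"B", "E"},
	"E": {"D"},
	"F": {"C"},
}

// discover is the crawl from steps 1-3: starting from one seed, keep
// asking each newly discovered peer for its own peer list until every
// trail has gone cold.
func discover(seed string) []string {
	visited := map[string]bool{}
	queue := []string{seed}
	var found []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if visited[cur] {
			continue // this trail has gone cold
		}
		visited[cur] = true
		found = append(found, cur)
		// In the real proposal this request would double as a check-in,
		// announcing the newcomer to cur.
		queue = append(queue, peersOf[cur]...)
	}
	return found
}

func main() {
	// A newcomer seeded with any single reachable node finds everyone.
	fmt.Println(discover("E")) // [E D B A C F]
}
```
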

@hsanjuan hsanjuan added status/in-progress In progress and removed status/ready Ready to be worked labels Feb 2, 2017
hsanjuan added a commit that referenced this issue Feb 2, 2017
Join() takes a peer's multiaddress. It then sends a "PeerAdd()"
RPC request with that peer and the local multiaddress by which
the remote peer is reached. Then the remote peer uses the PeerAdd()
functionality to a) Tell the rest of the cluster about the new member
and b) send us the list of cluster members.

Join() has no REST API support, but it is provided instead via
"ipfs-cluster-service --join". A "--leave" flag has been included too,
by which a peer triggers "PeerRemove()" on itself when shutting down.

This allows a node to join an existing cluster and to leave it at the
end of its life.

License: MIT
Signed-off-by: Hector Sanjuan <hector@protocol.ai>
@hsanjuan hsanjuan self-assigned this Feb 2, 2017
@hsanjuan hsanjuan added need/review Needs a review and removed status/in-progress In progress labels Feb 2, 2017
@hsanjuan
Collaborator Author

hsanjuan commented Feb 2, 2017

@20zinnm thanks for the comments. For the moment I'm not supporting the case where every node bootstraps from a different node at the same time.

> C returns a list of peers it is immediately connected to, and B goes to each of those peers individually and requests a list of peers

This is a graph search, but in cluster, C should already know all of its nodes, and if it doesn't yet, it should tell B and everyone else about any modifications (trails may warm up again). Since it's not just about having a graph, but about having everyone connected to everyone else, this grows very complex.

An easier solution is to have nodes join running clusters, where they can receive the full list of peers from whoever they connect to directly, and where it is possible to push the new update to all peers. After joining, the full peer list is saved to the configuration for future use on startup. This also re-uses most of the PeerAdd() code path, as B is basically requesting the bootstrap node to PeerAdd(B).
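
A toy model of that flow (hypothetical names; the real PeerAdd() runs over RPC against the cluster components):

```go
package main

import "fmt"

// cluster is a toy model of a running cluster where every member keeps
// the full, shared peer list.
type cluster struct {
	peers []string
}

// peerAdd registers the newcomer with the whole cluster (the update is
// "pushed" to everyone, since the list is shared) and returns the full
// peer list for the newcomer to save in its configuration.
func (c *cluster) peerAdd(newcomer string) []string {
	c.peers = append(c.peers, newcomer)
	return c.peers
}

func main() {
	c := &cluster{peers: []string{"A", "B", "C"}}
	// D joining is basically D asking its bootstrap node to PeerAdd(D):
	saved := c.peerAdd("D")
	fmt.Println(saved) // [A B C D], persisted for future startups
}
```
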

@meyerzinn

meyerzinn commented Feb 2, 2017

@hsanjuan Why would it need to announce modifications if the seeker needs to contact all nodes anyway? In order to establish connections, it needs to contact the nodes anyway, so it could announce its own "joining" at the same time.

Also, the running-cluster requirement means that the ability of a node to join is dependent on (a) how up-to-date the cluster is, and (b) whether or not there is already a cluster. You also make it impossible to create a new cluster automatically without pre-filling peer lists, so my solution is the same as yours in the abstract, except the peers are discovered at runtime.

@hsanjuan
Collaborator Author

hsanjuan commented Feb 3, 2017

OK @20zinnm, I'm going to give the approach you propose a try. It is not much more complicated and the result is better.

@meyerzinn

The only difference in practice is that mine only mandates the seeding of one node, which could be done by a Kubernetes service or whatnot. It requires no persistence and only one entry point. If you would like, I could put together a distributed demo cluster (obviously not IPFS, just a simple demo cluster).

@hsanjuan hsanjuan added the status/in-progress In progress label Feb 3, 2017
hsanjuan added a commit that referenced this issue Feb 3, 2017
This reworks PeerAdd() and adds a Join() operation.

Join(multiaddress) uses the given bootstrap multiaddress to register
itself with it and fetch its list of cluster peers. Then it proceeds
to register itself with every peer and fetch their lists of cluster peers.
If a fetched cluster peer is not yet known to us, we register ourselves
with it as well, and so on.

PeerAdd() is now "simply" registering the new node and sending a
Join() request to it. The new node then will fetch the list
of peers and register itself with everyone like explained above.

License: MIT
Signed-off-by: Hector Sanjuan <hector@protocol.ai>
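
The Join() loop this commit describes could be sketched like this (Go, with toy in-process types standing in for the real RPC plumbing):

```go
package main

import "fmt"

// node is a toy stand-in for a cluster peer.
type node struct {
	id    string
	known map[string]*node
}

// register tells remote about n and returns remote's current peer list.
func (n *node) register(remote *node) []*node {
	remote.known[n.id] = n
	out := []*node{remote}
	for _, p := range remote.known {
		out = append(out, p)
	}
	return out
}

// join registers with the bootstrap peer, fetches its peers, and keeps
// registering with any peer not yet known, and so on, until no unknown
// peers remain.
func (n *node) join(bootstrap *node) {
	pending := []*node{bootstrap}
	for len(pending) > 0 {
		remote := pending[0]
		pending = pending[1:]
		if remote.id == n.id {
			continue
		}
		if _, ok := n.known[remote.id]; ok {
			continue // already registered with this peer
		}
		n.known[remote.id] = remote
		pending = append(pending, n.register(remote)...)
	}
}

func main() {
	a := &node{id: "A", known: map[string]*node{}}
	b := &node{id: "B", known: map[string]*node{"A": a}}
	a.known["B"] = b

	c := &node{id: "C", known: map[string]*node{}}
	c.join(a) // C bootstraps from A and also registers with B
	fmt.Println(len(a.known), len(b.known), len(c.known)) // 2 2 2
}
```
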
@hsanjuan hsanjuan added need/review Needs a review and removed need/review Needs a review labels Feb 3, 2017
@hsanjuan
Collaborator Author

hsanjuan commented Feb 3, 2017

@20zinnm I know, my original solution would also work with one seeding node which everyone uses. The new one should theoretically work even when using several seeding nodes and everyone connects at the same time. For this, the "newcomer checks in with every node" approach that you propose is mentally simpler to me than adapting my original approach.

I have a more or less working solution but I want to give it another round.

@hsanjuan hsanjuan removed the status/in-progress In progress label Feb 3, 2017
@hsanjuan hsanjuan added status/in-progress In progress and removed need/review Needs a review labels Feb 6, 2017
hsanjuan added a commit that referenced this issue Feb 7, 2017
This is the third implementation attempt. This time, rather than
broadcasting PeerAdd/Join requests to the whole cluster, we use the
consensus log to broadcast new peers joining.

This makes it easier to recover from errors and to know exactly who
is a member of the cluster and who is not. The consensus is, after all,
meant to agree on things, and the list of cluster peers is something
everyone has to agree on.

Raft itself uses a special log operation to maintain the peer set.

The tests are almost unchanged from the previous attempts, so coverage
should be the same, except it doesn't seem possible to bootstrap a bunch
of nodes at the same time using different bootstrap nodes. It works when
they all use the same one. I'm not sure this worked before either, but
the code is simpler than recursively contacting peers, and it scales
better for larger clusters.

Nodes have to be careful about joining clusters while keeping the state
from a different cluster (disjoint logs). This may cause problems with
Raft.

License: MIT
Signed-off-by: Hector Sanjuan <hector@protocol.ai>
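
The gist of the log-based approach, as a simplified sketch (illustrative types only; real Raft membership changes are handled by the Raft library itself):

```go
package main

import "fmt"

// logOp is a simplified stand-in for an entry in the replicated
// consensus log.
type logOp struct {
	op   string // "peerAdd" or "peerRm"
	peer string
}

// member applies committed entries in order, so every member derives
// the same peer set — the thing everyone has to agree on.
type member struct {
	peers map[string]bool
}

func (m *member) apply(e logOp) {
	switch e.op {
	case "peerAdd":
		m.peers[e.peer] = true
	case "peerRm":
		delete(m.peers, e.peer)
	}
}

func main() {
	// A joining peer results in a single committed log entry instead
	// of a broadcast to every member.
	log := []logOp{{"peerAdd", "A"}, {"peerAdd", "B"}, {"peerAdd", "C"}, {"peerRm", "B"}}

	a := &member{peers: map[string]bool{}}
	b := &member{peers: map[string]bool{}}
	for _, e := range log { // every replica replays the same log...
		a.apply(e)
		b.apply(e)
	}
	fmt.Println(a.peers) // map[A:true C:true] — ...and agrees on membership
}
```
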
@hsanjuan hsanjuan added need/review Needs a review and removed status/in-progress In progress labels Feb 7, 2017
@hsanjuan
Collaborator Author

hsanjuan commented Feb 7, 2017

@20zinnm I went a third way in the end, committing new members to the distributed log rather than broadcasting and recursively contacting peers. I realized it is very easy to mess up the consensus when someone does not cooperate perfectly, or when peers have disjoint logs, different leadership terms, etc. Despite locking sections of the code, I wasn't very happy with the results and the added complexity.

With the latest approach, several nodes should be able to bootstrap from the same node at the same time.

@meyerzinn

@hsanjuan What do you mean by distributed log? What implementation is the log using that wouldn't suffer the same faults?

I think it's fine to have everyone join through the same node, but it's not scalable in my opinion. My check-in method means you can bootstrap a thousand nodes by giving each a pointer to the one before it, without having to spam a single member for a list.
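
A toy illustration of that daisy-chain seeding (hypothetical; it shows only the pointer structure, with no networking):

```go
package main

import "fmt"

// Each node i is seeded only with node i-1, so no single member gets
// spammed with a thousand join requests.
func main() {
	const n = 1000
	bootstrapOf := make([]int, n)
	for i := 1; i < n; i++ {
		bootstrapOf[i] = i - 1 // each node points to the one before it
	}
	// Any node can still reach node 0 (and, by crawling, the whole
	// cluster) by following its bootstrap pointer chain.
	hops := 0
	for cur := n - 1; cur != 0; cur = bootstrapOf[cur] {
		hops++
	}
	fmt.Println(hops) // 999 hops from the last node back to the first
}
```
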

hsanjuan added a commit that referenced this issue Feb 8, 2017
Fix #24: Auto-join and auto-leave operations for Cluster
@hsanjuan hsanjuan removed the need/review Needs a review label Feb 8, 2017