Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipfs-cluster - tool to coordinate between nodes #58

Open
jbenet opened this Issue Sep 27, 2015 · 25 comments

Comments

Projects
None yet
@jbenet
Copy link
Member

commented Sep 27, 2015

It is clear we will need a tool / protocol on top of IPFS to coordinate IPFS nodes together. This issue will track design goals, constraints, proposals, and the progress.

What to Coordinate

Things worth coordinating between IPFS nodes:

  • collaborative pin sets -- back up large pin sets together, to achieve redundancy and capacity constraints (including RAID-style modes).
  • authentication graphs -- trust models, like PKIs or hierarchical auth of control.
  • bitswap guilds -- the ability to band together into efficient data trade networks
  • application servers -- afford redundancy guarantees to hosted protocols / APIs

and more (expand this!)

Consensus

Many of these require consensus, and thus we'll likely bundle a simple (read: FAST!) consensus protocol with ipfs-cluster. this could be RAFT (etcd) or Paxos, and does not require byzantine consensus. Though having byzantine consensus would be useful for massive untrusted clusters-- though this approaches Filecoin and is a very different use case altogether.

cluster == a virtualized ipfs node

One goal is to represent a virtualized IPFS node sharded across other nodes. This makes for a very nice modular architecture where one can plug ipfs nodes into clusters, and clusters into larger clusters (hierarchies). This makes cluster a bit harder to design, but much, much more useful. Backing up of massive data (like all of archive.org or all of wikimedia, or all scientific data ever produced) would thus become orders of magnitude simpler to reason about.

The general idea here is to make ipfs-cluster provide an API that matches the standard ipfs node API, (i.e. with an identity, being able to be connected to, and providing the ipfs core methods).

@davidar

This comment has been minimized.

Copy link
Member

commented Sep 30, 2015

👍

CC: @kyledrake

@sroerick

This comment has been minimized.

Copy link

commented Oct 7, 2015

this is great stuff.
ping @DevonJames

@kyledrake

This comment has been minimized.

Copy link

commented Oct 8, 2015

As a sort of "hack" solution people can use right away, this gist allows you to make a quick service for pinning all data that's on a different IPFS node using a simple ncat one liner: https://gist.github.com/kyledrake/e8e2a583741b3bb8237e

This could be tossed into a cronjob and would basically replicate a single node as a mirror.

I see there being two different uses here:

The first is IPFS nodes replicating eachother, as in one IPFS node pins all the content provided by a different IPFS node. Ideally there is a mechanism in the protocol to inform the node of new data on the node it's replicating.

The second use case is helping to back up a chunk of something crazy large, like the Internet Archive. There's no way a hobbyist could back up the entire thing, the best they could do is agree to host some of that content. If you wanted to do this without pinning, you would be agreeing to share some amount of information that you didn't previously agree to pin. It would be more like "I'll help pin 2TB of data for you of whatever you want". This introduces all sorts of questions, like how do we make sure that the data is being evenly federated across multiple nodes.

There are some big questions and I'm not sure there are clean and obvious solutions to them.

In terms of priority, I would generally put IPNS at a much higher priority than this, because I feel that whatever solution we came up with would sit on top of IPNS in some capacity. It seems to make more sense for the design of things to agree to replicate an IPNS pubkeyhash rather than a bunch of random IPFS hashes. That pubkeyhash could point to all of the data that you want to archive, and then it would be easier to break up that to figure out what you want to replicate. Otherwise you're right back to the location addressing problem (as you are with my ncat solution).

So, maybe this flow:

  • A data archive publishes an IPNS pubkeyhash signing the IPFS hash you want to federate.
  • The archiving node "subscribes" to the IPNS pubkeyhash, and then all IPFS data there is pinned.
  • The archiving node is informed when the IPNS pubkeyhash changes, which tells the archive node to store the updated information. Maybe it's even possible to unpin data that's no longer referenced as an optional feature if you want it to.
@kyledrake

This comment has been minimized.

Copy link

commented Oct 8, 2015

That would be a good start for replication anyways. Then the case of federating a portion of that data rather than the whole thing could be considered. There's also some performance questions there (the Internet Archive IPNS pubkeyhash would be pointing to an enormous IPFS unixfs object that changes quite often!).

@whyrusleeping

This comment has been minimized.

Copy link
Member

commented Oct 8, 2015

I think we can reuse kademlias distance metrics to help distribute the content to some degree. Although, reassignments may make sense as the number of peers in a cluster grows. As long as peers in the cluster agree to abide by the rules, it should work out pretty well, and it synergizes (I actually get to use this word for real?) well with normal lookups. By the time you find the provider entries, you will likely have found the people actually providing.

@kyledrake

This comment has been minimized.

Copy link

commented Oct 8, 2015

@whyrusleeping I'm not super familiar with the tech behind this, but I think it's reasonable to assume abiding by the rules by nodes, since the usage in this case is intrinsically philanthropic and I'm not sure why someone would want to participate and also mess with the distribution at the same time. I also wouldn't expect any guarantees on the degree and evenness of federation from the originating nodes.

@davidar

This comment has been minimized.

Copy link
Member

commented Oct 8, 2015

@kyledrake In terms of how to break things between nodes, a strategy similar to what IA.BAK is doing could work:

  • master node publishes a list of hashes to replicate
  • each slave node reads this list, orders hashes by the number of peers serving them, and pins some number of (a random subset of) the least redundant hashes, up to whatever storage limit
@jbenet

This comment has been minimized.

Copy link
Member Author

commented Oct 8, 2015

@kyledrake

The first is IPFS nodes replicating eachother, as in one IPFS node pins all the content provided by a different IPFS node. Ideally there is a mechanism in the protocol to inform the node of new data on the node it's replicating.

yep. mirroring mode. (though can actually be the same as the next case, if local_disk > pinset_size)

The second use case is helping to back up a chunk of something crazy large, like the Internet Archive. There's no way a hobbyist could back up the entire thing, the best they could do is agree to host some of that content. If you wanted to do this without pinning, you would be agreeing to share some amount of information that you didn't previously agree to pin. It would be more like "I'll help pin 2TB of data for you of whatever you want".

yep! this is what i mean by "collaborative pin sets -- back up large pin sets together, to achieve redundancy and capacity constraints (including RAID-style modes)." above.

This introduces all sorts of questions, like how do we make sure that the data is being evenly federated across multiple nodes.

Accounting, and historical consensus. i mean it to be auditable.

In terms of priority, I would generally put IPNS at a much higher priority than this,

agreed. This issue is here now because people keep asking about this (i wanted something to point them to), and to maybe inspire someone to take a stab at it.

because I feel that whatever solution we came up with would sit on top of IPNS in some capacity. It seems to make more sense for the design of things to agree to replicate an IPNS pubkeyhash rather than a bunch of random IPFS hashes.

It wouldn't be "a bunch of random IPFS hashes", it would always be a single IPFS head, which would point to the rest. And, one IPNS name can only point to one IPFS hash at a time anyway. In practice will want to use IPNS for this, yes, but to point to the accounting/allocation index instead, not directly to the data. (the metadata will point to/include the pinset, which points to/includes the data)., Such an allocation index could be an object like this:

parent: <parent-hash>
pinset:  <pinset-hash>
members: <list-of-cluster-members-hash>
allocations: <allocation-log-hash>

and the IPNS name could point to it.

The allocation/accounting index mentioned here does not need to be exhaustive (i.e. include every hash) instead can work like the pinset, as it is possible to write a precise allocation of all objects to all cluster members as a compact expression (trivial example is sharding with mod, though we would want something more clever here).

So, maybe this flow:

  • A data archive publishes an IPNS pubkeyhash signing the IPFS hash you want to federate.
  • The archiving node "subscribes" to the IPNS pubkeyhash, and then all IPFS data there is pinned.
  • The archiving node is informed when the IPNS pubkeyhash changes, which tells the archive node to store the updated information.

Yep! something of this sort. +1 to the idea of signing the allocations. btw, the allocations could be done automatically by the cluster-leader (a program who manages the replication, and is likely elected by consensus), or by a cluster-administrator (a program or person who created the cluster and may want to express manual allocations -- instead of getting automatic balancing -- according to some external user policy.

Maybe it's even possible to unpin data that's no longer referenced as an optional feature if you want it to.

this will already happen in dev0.4.0. better gc.

@jbenet

This comment has been minimized.

Copy link
Member Author

commented Oct 8, 2015


@whyrusleeping

I think we can reuse kademlias distance metrics to help distribute the content to some degree. Although, reassignments may make sense as the number of peers in a cluster grows. As long as peers in the cluster agree to abide by the rules, it should work out pretty well, and it synergizes (I actually get to use this word for real?) well with normal lookups. By the time you find the provider entries, you will likely have found the people actually providing.

👎 For ipfs-cluster i want explicit tracking of every single copy. I want to keep exact allocation logs for which node is storing what (this doesn't mean a big log, can use precise AND short expressions), this helps to know if things fail, and how to rebalance. i'd like to represent a strong auditable contract (i.e. if someone loses a copy, you know who it was, and that is actionable data to an organization). ipfs-cluster is meant to also address the needs of orgs and groups of orgs to collaboratively back up critical data.

@kyledrake

@whyrusleeping I'm not super familiar with the tech behind this, but I think it's reasonable to assume abiding by the rules by nodes, since the usage in this case is intrinsically philanthropic and I'm not sure why someone would want to participate and also mess with the distribution at the same time. I also wouldn't expect any guarantees on the degree and evenness of federation from the originating nodes.

there could always be attackers, but yeah. lots of the use will be trusted.

but, regardless, for ipfs-cluster i would like to get some concrete non-byzantine scenarios working first (way easier), and only then attempt to reduce trust in the designs. (that said, i do want all the comm messages signed, so that node's signed "ACK" in a consensus round would mean binding agreement to replicate content according to the agreed-upon allocation. (i.e. you can see who failed to keep the promise, important in cross-org backing up of stuff).

@ion1

This comment has been minimized.

Copy link

commented Oct 8, 2015

@kyledrake

the Internet Archive IPNS pubkeyhash would be pointing to an enormous IPFS unixfs object that changes quite often!

There's an issue about that: ipfs/ipfs#96

If you, say, just took the huge directory object as data and applied chunking by a rolling hash to it, you could have rather efficient updates.

@xelra

This comment has been minimized.

Copy link

commented Nov 13, 2015

What would be nice, would be a smart replication model. For example new data and data that is frequently accessed is replicated to more nodes, while old and infrequently accessed data is less replicated.

@cinterloper

This comment has been minimized.

Copy link

commented Jan 2, 2016

would it be possible to have per-object replication policies?

@jbenet

This comment has been minimized.

Copy link
Member Author

commented Jan 13, 2016

It could-- though would get trickier. We could have things like the pins
that mark subdags with a RAID type.
On Sat, Jan 2, 2016 at 15:19 Grant Haywood notifications@github.com wrote:

would it be possible to have per-object replication policies?


Reply to this email directly or view it on GitHub
#58 (comment).

@Kubuxu

This comment has been minimized.

Copy link
Member

commented Feb 3, 2016

I can see a way to achieve consensus in the ipfs-cluster using just Conflict-free replicated data types if all nodes in the cluster can be trusted not to lie (although someone from the outside could subscribe to the list and pin things). This gives for example option of rebalancing cluster in case of prolonged split and guarantees that information about pin added by one node operator will sooner or later (at first possible time) propagate through the network.

Operation

  • Observe-Remove Set (in case of add-remove conflict add wins) of allowed cluster members
  • ORSet of Hashes to distribute
  • ORMap of Hash->PNCounter (PNCounter is counter that gives each node possibility to say how much data it has, result is the sum) of hashes and their availability on each node, (values from 0 to 100, 100 fully pinned).

Those two structures allow cluster to operate conflict free and to balance the data, request one that is not distributed and so on.

Big data management
When node decides to pin only fragment of requested pin it would publish that it has this part pinned and publish percentage of initial file that it pinned. Other nodes while wanting to also pin parts of the file will look if part is not already pinned in the network and decide for parts that have minimal coverage. This gives as possibility of storing bit data in the cluster.

@xelra

This comment has been minimized.

Copy link

commented Feb 5, 2016

I think that there should be traffic and diskspace constraints that are set by the node itself.

E.g., in a file on the node there should be 2 parameters saved that limit the traffic and the diskspace that are contributed to the cluster at max.

A cluster operator should be able to see those restraints and the underlying replication server (auto-pin process) should take that into account.

@Kubuxu Why would a node decide to pin parts of a file? Isn't the whole point of ipfs-cluster that the node gets instructed to pin a certain file or part of it?!

@hsanjuan

This comment has been minimized.

Copy link

commented May 12, 2016

Hi, last week I put some ideas together on this topic: https://github.com/hsanjuan/ipfsclusterspec/blob/master/README.md and how a pure on-top-of-IPFS implementation might look like. Hopefully it can serve as a start point for further iterations. It aligns a lot with @jbenet proposals although I left the vIPFS nodes aside for the moment.

I used IPNS to publish messages because it is the only way of passing messages around nowadays, but I have heard work is being done to provide a message-passing/subpub solution, which would allow to not abuse IPNS for this, so obviously this would have immediate application for implementing RAFT etc.

@hermanjunge

This comment has been minimized.

Copy link

commented May 13, 2016

Hola. So, I started to work on an IPFS cluster using kubernetes for scheduling and Nginx as reverse proxy. How can I help to the development of ipfs-cluster?

@jbenet

This comment has been minimized.

Copy link
Member Author

commented Jun 30, 2016

(Externalizing some notes)

We want:

  • (Certain) Composition of ipfs-nodes into an ipfs-cluster
  • (Certain) an ipfs-cluster itself makes a virtual ipfs-node (exposes the api).
    • submits requests to consensus
    • externalizes result only after consensus commits or fails
    • pin sets are sharded across nodes (configurable).
    • node configuration is itself committed to the consensus log
    • the consensus log will keep the operations, eg operations on the "ipfs repo" of the virtual node.
  • (Certain) ipfs-cluster is ipfs implementation agnostic (can use any ipfs node, given an api endpoint).
  • (Certain) needs an authentication model
    • (can get by at first with shared private cluster key).
  • (Certain) Pluggable consensus protocol (libp2p protocol category)
    • we can use Raft as a crutch, but,
    • goal is for a BA (Byzantine Agreement) protocol. (Raft is not)
  • (Mostly Certain) Use RAID configurations for "pin-replication" cluster modes.
    • RAID modes are well understood and well documented. so they're the obvious default choice.
    • but we should consider other well understood data replication configuration systems.

Likely construction (notes for discussion):

  • ipfs-clusterd is a "per-ipfs-node" service that talks to a given ipfs node process.
  • ipfs-clusterd speaks with other instances of ipfs-clusterd with its own wire protocol
    • (using libp2p transports of course. either on its own process, or through the ipfs-node's network stack. debatable.)
    • (ipfs-clusterd COULD be mounted onto the ipfs-node)
  • ipfs-clusterd exposes the IPFS API (over HTTP, like ipfs node)
    • this means any ipfs-clusterd process can respond to requests, but commits them to consensus protocol before externalizing the result.

Some important principles here:

  • Correctness is paramount (much more than perf or scalability).
    • Correctness of distributed state here is absolutely critical.
    • Correctness of externalizing requests is absolutely critical (cannot say something is pinned until it's committed to consensus).
  • Scalability is critical
    • would be great to scale up to 1000 nodes in v1. (then work our way up orders of magnitude)
    • perf should scale sub-linear (most fast consensus protocols are constant in time (O(1)) and quadratic (O(n^2)) in messages).
    • messages not that important, given batching, and relay.
@jbenet

This comment has been minimized.

Copy link
Member Author

commented Jun 30, 2016

Design Illustrations (source)

@bronger

This comment has been minimized.

Copy link

commented Jul 1, 2016

My use case: Guarantee that a given set of hashs is shared only among a given set of nodes. Would a cluster be the right tool for that?

And: Can a node be a direct member of different clusters at the same time, or is only a hierarchy/onion-like structure possible?

@jbenet

This comment has been minimized.

Copy link
Member Author

commented Jul 1, 2016

ipfs-cluster now has its own repo https://github.com/ipfs/ipfs-cluster

See the first ipfs-cluster design meeting notes ipfs/ipfs-cluster#1

@jbenet

This comment has been minimized.

Copy link
Member Author

commented Jul 1, 2016

Design notes/discussions on ipfs-cluster should probaly happen in that repo now. it's graduated out of notes into a thing. I'll keep the issue open though cause the open/close thing makes it annoying for search.

@petrsnm

This comment has been minimized.

Copy link

commented Jul 7, 2016

Just found this issue after having received a helpful pointer from someone on IRC. I'm certain that you'd answer more questions if you changed the title of this issue to: "Storing data permanently on IPFS". More likely to show up in a google search. How permanent is data stored on IPFS? does show up in a google search, but it takes a lot of attention to find a link to this issue.

@jbenet

This comment has been minimized.

Copy link
Member Author

commented Jul 9, 2016

@petrsnm this is not a "how to" issue, check the FAQ for that. (issues like ipfs/faq#47 or ipfs/faq#93 which are appropriately named, have extensive explanations, and show up in google results). This is a development issue, not meant as an entry point.

@daviddias

This comment has been minimized.

Copy link
Member

commented Jul 15, 2016

cc @nicola

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.