
Probabilistic pinning for "replication everywhere" mode #1058

Open
hsanjuan opened this issue Apr 9, 2020 · 4 comments
Labels
kind/enhancement A net-new feature or improvement to an existing feature need/review Needs a review


@hsanjuan (Collaborator) commented Apr 9, 2020

This is an initial idea dump from things I've had in my head for a bit.

Background

With collaborative clusters, we have the possibility of creating very large clusters (with peers in the hundreds or thousands).

However, the current allocation system does not work very well for collaborative clusters made of untrusted peers:

  • We can select a specific number of allocations, but they are based on an untrusted-peer-provided metric. Peers can lie about their free space to always be at the top of the list and get the allocations.
  • Allocated peers may well lie about what they are or are not storing, pretending that things are pinned when they are not.
  • Any peer can join, and also disappear, and this triggers repinnings of the content (when enabled), which is an expensive operation for the cluster.

For these reasons, we make some recommendations:

  • Run fully public collaborative clusters with replication factor -1: by replicating to everyone, we stop caring about whether we trust specific peers or not.
  • If peers are believed to generally behave, set the replication factor sufficiently high and provide manual allocation for a subset of trusted peers that are known to behave. However, this is still vulnerable to peers misbehaving.

Cluster cannot ensure that the content is effectively pinned by its peers. Filecoin exists to address the problem of trust when storing something (see the level of complexity involved).

Proposed improvement

All in all, for public collaborative clusters we need to pin everywhere, and that means a cluster with many peers and lots of content will not distribute the content efficiently.

In order to address this I propose turning things around: rather than letting a trusted cluster peer pick specific allocations, let every peer probabilistically decide whether it will pin something or not.

Example:

  1. A trusted peer adds a pin with replication factor "everywhere".
  2. Every peer receiving the pin is configured to pin with probability 0.5, so each flips a coin and decides for itself whether to pin.
  3. Result: about 50% of the cluster peers end up pinning the item. Malicious peers cannot influence the decision of other peers to pin or not. The content is sufficiently replicated.

Implementation notes

We can assume the current approach has a pinning probability of 100%. So the question is how to allow reducing that number to something else.

This only applies to pins with replication factor -1. The current allocation system would still work (because it is useful for trusted-clusters and works well in that setup).

We can:

  • Let the probability be a configuration option, so that every incoming pin (with replication factor -1) is decided by it. Users are then able to configure their peer however they want. Remote configurations allow setting this for all users.
  • Alternatively: we can add the pinning probability as a Pin option carried with every pin (just like the replication factor). This allows pinning some things everywhere with 100% probability, while others not.
  • We can combine both of the above and have a default plus a per-pin option (essentially like the replication factor).
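The third option (default plus per-pin override) could resolve like this. The `PinOptions` struct and field name here are hypothetical stand-ins, not the actual ipfs-cluster API:

```go
package main

import "fmt"

// PinOptions is a hypothetical stand-in for per-pin options carried with
// each pin (the real Pin struct differs); PinProbability < 0 means "unset".
type PinOptions struct {
	PinProbability float64
}

// effectiveProbability combines the peer's configured default with an
// optional per-pin override, mirroring how replication factors work.
func effectiveProbability(peerDefault float64, opts PinOptions) float64 {
	if opts.PinProbability >= 0 {
		return opts.PinProbability
	}
	return peerDefault
}

func main() {
	fmt.Println(effectiveProbability(0.5, PinOptions{PinProbability: -1})) // falls back to the default: 0.5
	fmt.Println(effectiveProbability(0.5, PinOptions{PinProbability: 1}))  // per-pin "pin everywhere": 1
}
```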

The handling and probability calculation will be implemented in the pin tracker.
Questions:

  • Should a peer that has already pinned something keep it pinned, or flip the coin again?
  • We may want to pin with 100% probability on a small cluster (<5 peers), with 75% probability on a bigger cluster with 20 peers, and with 10% probability on a really large cluster. However, on a public cluster we cannot rely on how many peers we see, because they may be malicious. This means the pinning probability of an item cannot really be adjusted dynamically.
  • We should find a way to configure trusted peers to always pin regardless of the probability option. In this case the configuration value should not be an overridable default, but the other way around. Ideally, anything pinned gets stored for sure on the trusted peers while the rest of the peers in the cluster run the lottery.
@hsanjuan hsanjuan added kind/enhancement A net-new feature or improvement to an existing feature need/review Needs a review labels Apr 9, 2020
@RubenKelevra (Collaborator) commented May 21, 2020

This is a really good idea.

Checksum based decisions

We don't have to flip a coin. We could run a checksum over the CID and the PeerID to get a deterministic decision, for each CID, of whether it is going to end up on a specific node or not. Every node in the cluster can calculate the checksum and see if the value is lower or higher than the probability.

Algorithm and speed considerations

We need a cryptographically secure hash algorithm to make sure we spread the data evenly over all nodes. I would recommend blake2b because of its speed.

So if the probability is 0.10, every value above 10% of the 256-bit integer maximum (e.g. on a 256-bit hash sum) will not end up on this node. The decision could be a cheap sha256 sum plus bit shifting and XOR/AND - if we decide that 0.4% granularity is good enough (on 256-bit checksums). On larger datasets a 512-bit checksum would give 0.2% granularity - without any slow and error-prone floating-point arithmetic.

Writing efficiently to the cluster

The writing server can send each CID to the node with the highest value inside the probability range, and this way avoid sending data multiple times to the cluster. As a fallback, the node would have to store the CID locally for a configurable timeframe, like a day or so, to avoid writing a CID to a node which then goes offline or doesn't send any data back.

When the writing server determines itself as the node with the highest value, it will send the data to the node with the second highest value instead.

This distributes writes to all nodes equally, and all nodes will start to fetch from all other nodes the data to complete their cluster state again.
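The target selection described above could look like this. A sketch under the same sha256/64-bit assumptions as before; `pickWriteTarget` is a hypothetical name and probability is assumed to be below 1:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math"
)

// hashVal is the per-(CID, peer) checksum value used for the decision.
func hashVal(cid, peerID string) uint64 {
	sum := sha256.Sum256([]byte(cid + "/" + peerID))
	return binary.BigEndian.Uint64(sum[:8])
}

// pickWriteTarget picks, among the peers whose value falls inside the
// probability range (i.e. those that will pin the CID), the one with the
// highest value. The writer itself is skipped, so when the writer would be
// the top candidate it falls through to the next-highest peer.
func pickWriteTarget(cid, self string, peers []string, probability float64) (string, bool) {
	threshold := uint64(probability * float64(math.MaxUint64))
	best, found := "", false
	var bestVal uint64
	for _, p := range peers {
		if p == self {
			continue
		}
		if v := hashVal(cid, p); v < threshold && (!found || v > bestVal) {
			best, bestVal, found = p, v, true
		}
	}
	return best, found
}

func main() {
	peers := []string{"peerA", "peerB", "peerC", "peerD"}
	target, ok := pickWriteTarget("QmExampleCID", "peerA", peers, 0.9)
	fmt.Println(target, ok)
}
```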

Slow nodes

Ideally, we can send all data directly, CID by CID, to each node in the cluster. But this is unlikely to work: some nodes might be behind the same connection, or some nodes might just have a slower connection than the speed at which we can deliver the blocks.

We could mitigate this by simply skipping a node if there's still a CID in flight when we hit it again; this way we write to a random other node in the cluster when we hit the bandwidth limit of individual nodes. This can be repeated until we hit the bottom of the list, in which case we just start at the top again.

Slow nodes would simply respond that they don't have the data if other nodes try to fetch it from them. In that case the cluster would just try every node on the list in order to find the node holding the CID.

Detecting malfunctioning nodes

Since the writing server's target can be (somewhat) reliably determined by the rest of the cluster, bad nodes can be identified in a distributed fashion. If a Peer-ID fails over and over to respond with the data of new pins, the individual nodes could report this back to the trusted peers.

With the information from the writing servers, poorly performing nodes that don't send the data back to the network can be identified. This could, for example, be used to greylist them for a while or to notify the administrator of the cluster of the malfunction.

New nodes joining the cluster

A new node would receive the configuration from the cluster and all necessary metadata, and then calculate which part of the cluster pinset it should hold.

This allows a newly joined node to start holding parts of the cluster without any new writes being necessary - so clusters with long-term data can be spread across multiple new nodes.

The new node would calculate the hashes only until it finds a match with a cluster peer which is online, so it's not necessary to calculate all hashes per CID on large clusters to find a node with the data. If, according to the calculation, none of the nodes in the cluster holds the data, the new node can fall back to a DHT search.
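The lookup above could be sketched like this, reusing the same hash ranking as the writer (highest value first) and a stand-in `isOnline` callback for the liveness check:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math"
	"sort"
)

func hashVal(cid, peerID string) uint64 {
	sum := sha256.Sum256([]byte(cid + "/" + peerID))
	return binary.BigEndian.Uint64(sum[:8])
}

// findProvider ranks the peers whose value lies inside the probability
// range and probes them, highest value first, until an online one is
// found. A false result means falling back to a DHT search.
func findProvider(cid string, peers []string, probability float64, isOnline func(string) bool) (string, bool) {
	threshold := uint64(probability * float64(math.MaxUint64))
	var holders []string
	for _, p := range peers {
		if hashVal(cid, p) < threshold {
			holders = append(holders, p)
		}
	}
	sort.Slice(holders, func(i, j int) bool {
		return hashVal(cid, holders[i]) > hashVal(cid, holders[j])
	})
	for _, p := range holders {
		if isOnline(p) {
			return p, true
		}
	}
	return "", false // nobody reachable: fall back to the DHT
}

func main() {
	peers := []string{"peerA", "peerB", "peerC"}
	online := func(p string) bool { return p != "peerB" }
	fmt.Println(findProvider("QmExampleCID", peers, 0.9, online))
}
```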

Space considerations

Database

Since we only need to store the CIDs, and not a list of nodes for each CID, the size of the database shrinks considerably at large cluster sizes.

Node hard drive space

This distribution model spreads CIDs equally over all nodes, so the space requirements are roughly equal if the CIDs are equally large.

Local node space modifier

To avoid every node having to have a huge amount of storage to be part of the cluster, we could allow individual nodes to have a local modifier:

The node would need to report its local modifier (if it differs from 1), and a trusted peer would add this information to the cluster metadata. This way a node can report changing local modifiers, and every node in the cluster can calculate whether this node holds certain data or not.

The local modifier defaults to 1. This means the node uses the decision of the first round of the probability, like the 10% in the example: if the checksum is below 10%, the CID ends up on this node.

If the local modifier is 0.5, the 10% will be cut in half and everything with a checksum below 5% will be stored on this node.

If the local modifier is 2, the 10% will be doubled and the node will store 20% of the pinset.

If the local modifier is "inf" it will store all data of the pinset.

Algorithm quirks for lookup

Since the lookup in the first stage (with the checksum) yields a small subset of nodes (which all have a local modifier <= 1), we additionally need to check the nodes with a local modifier > 1 alongside, not after, the first-stage decision.

That's a bit of added complexity, but this way we can define nodes that hold the full cluster pinset via a local setting.

Node Quota

Additionally, we need a way for the nodes in the cluster to report an EDQUOT. A node which has exceeded its (locally set) disk-space quota can this way avoid being reported as "badly behaving" by other nodes.

Communication example

The communication could be handled like:

  • A trusted peer adds a pin; the pin cannot be stored fully because of the quota.
  • The node reports this issue back, and the trusted peer adds this information about the node to the cluster information.
    • This node is then placed on a "quota exceeded list" on all the other nodes of the cluster, so that it is not queried for anything (because it's probably failing) and is considered by the other nodes as a cluster node which holds no data.

The admin could then either increase the quota - the node would fetch all the missing data and report back that it can again provide all data to the cluster - or lower the local modifier to store less of the cluster's data. The node would then recover by dropping data and fetching the missing data, and report back with the lower local modifier and that the quota is no longer exceeded.

Mathematical reliability of data storage

I guess both approaches (flipping a coin as well as my deterministic approach) have a math issue:

Availability isn't guaranteed. There is only an ever higher chance that a block is stored on at least one node. But it's just like flipping a coin: after many, many flips you'll get odd results, like heads 5 times in a row :)

This can be addressed in three ways:

  • We recommend that there is at least one node which has "inf" as its local modifier, storing all data of the cluster, when using this mode of operation.
  • We add a kind of fallback: the writing trusted peer can detect when an oddly low number of peers is selected, one which doesn't really match the cluster size and probability. In that case a normally-zero counter would be raised to generate a different result for this pin. This would require an additional optional field on the metadata for this purpose.
  • Change the math: cut the CID space with the probability into different sectors which are distributed to specific nodes (Ceph uses this kind of approach). The issue here is that if the algorithm is flawed, Node-IDs can be forged so that everything ends up on the bad nodes. So the guarantee of data integrity would rely on a non-cryptographic algorithm, which is a bad idea IMHO.
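The "not guaranteed" part is easy to quantify: with independent per-peer decisions, the chance that a pin lands on zero of n peers is (1-p)^n. A tiny sketch:

```go
package main

import (
	"fmt"
	"math"
)

// pNoReplica is the chance that an independent per-peer pin decision with
// probability p selects zero of the n peers: (1-p)^n. It shrinks fast with
// cluster size but never reaches zero, which motivates the fallback node
// with an "inf" local modifier.
func pNoReplica(p float64, n int) float64 {
	return math.Pow(1-p, float64(n))
}

func main() {
	// With probability 0.5 and 10 peers, about 1 pin in 1024 gets no replica.
	fmt.Println(pNoReplica(0.5, 10))
	// With 100 peers the chance is astronomically small, but still not zero.
	fmt.Println(pNoReplica(0.5, 100))
}
```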

Hope my brainstorming was useful. It's a bit late here, so it's a bit unorganized - I hope everything is understandable.

@RubenKelevra (Collaborator) commented May 21, 2020

Guaranteed availability

We could periodically scan the pinset for pins which are below the minimum guaranteed availability. In that case a repin with a bumped counter would redistribute the pin to other peers via a new hash. Since this is expected to happen pretty much never, we could issue a notification to the admin of the cluster when it does.

Grow/shrink a cluster

We can grow a cluster very cheaply by updating the global default probability. The update would trigger a rescan of all items pinned for the cluster on each node; this way the nodes can either drop or fetch their portion of the cluster.

For shrinking (more data on each node) the nodes could use the old probability to calculate the priority list of nodes which should have the data they now need to hold, to avoid asking nodes which probably don't hold the data.

This operation is cheap since we only add one commit to the cluster and don't have to update each pin (when they all use the default probability). It is also cheap because it's fully distributed: each cluster member has all the data available to calculate the necessary steps on its own.

Scattering (sharding)

We can also do sharding this way... though I would rather call it scattering...

We can add an integer to each pin which defines the scattering pattern.

The scattering integer should be stored on each pin as an individual value, and not, like the probability, be modifiable via a global variable when not specified at pin time.

Changing the scattering is a major operation on a cluster, since all objects need to be reanalyzed. It's better to allow it only for repin operations.

scattering=0

Each pin of the cluster pinset is considered one entity and stored completely on the nodes whose checksum calculation selects it.

scattering=1

Each pin is pinned direct, and its references are scattered (recursively) across the cluster.

So if it's a folder (without subfolders and folder sharding), each file in it is stored with a different hash, instead of using the same hash for the folder and all its references.

scattering=-1

Each block of each pin in the cluster is stored with an individual hash, so each cluster member holds only parts of each file.

Useful for storing very few but extremely large files on a cluster.

scattering=-2

Each CID of a pin which has at most 1 level of references before reaching blocks is stored with its own hash.

All items above that level are stored with the hash of the pin in the pinset.

This stripes large files as bunches of consecutive chunks across the cluster, since we're using the indirect references to store files with a lot of blocks.

With inlining activated, folders with a lot of small files are saved as a whole (or, with directory sharding activated, as stripes with a bunch of files each).

scattering=-3

Each CID which has at most 2 levels of references before reaching blocks is stored with its own hash.

All items above that level are stored with the hash of the pin in the pinset.

This packs much more data into a single hash and might lead to somewhat large files being stored completely on each node.

Directories with small files are stored as one entity even without inlining. If sharding is enabled, directories with small files are striped across the cluster.

@RubenKelevra (Collaborator) commented May 24, 2022

@hsanjuan looking at this again, I think what I wrote up is a bit overcomplicated - too many ifs and options.

How about just adding the root of a pin as a direct pin and running a deterministic probability keyed on the Node-ID itself - i.e. blockwise (scattering=-1 above)?

So we still need a float for the fraction a node should store, like 0.1 for 10%.

The direct pin of the root CID allows finding the nodes storing the data pretty efficiently, as all nodes holding parts of it also hold the CID itself.

Bitswap could also be extended to share the "sharded pin" information, so ipfs nodes don't have to send (in this example) 90% of their requests just to get an "I don't have this block" answer.

If a sharded pin works similarly to a DHT (blocks close to the Node-ID are stored there), then the same "distance" algorithm could be used to determine which node holds which data.

So ipfs itself would need to be extended with a "sharded" pin type that carries a percentage value.

This would also allow people to pin stuff they like as a percentage, to help keep it online, without having to set up a cluster to organize themselves.

@RubenKelevra (Collaborator) commented May 24, 2022

TL;DR:

Add a --fraction switch to the pin operation in ipfs:

ipfs pin add --fraction=0.1 /ipfs/CID

And make it updatable:

ipfs pin update --fraction=0.2 /ipfs/CID

Make it also accept sizes and percent:

ipfs pin update --fraction-size-limit=1GB /ipfs/CID
ipfs pin update --fraction=10% /ipfs/CID

The fraction-size-limit would check the total size of the CID and determine the fraction internally - but show both values to the user.
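The size-limit-to-fraction conversion is a capped division; a sketch (function name is illustrative, and the example treats GB as GiB for round numbers):

```go
package main

import "fmt"

// fractionFromSizeLimit maps a byte limit to a pin fraction given the
// total size of the DAG behind a CID, capped at 1 when the limit covers
// the whole pin.
func fractionFromSizeLimit(limitBytes, totalBytes uint64) float64 {
	if totalBytes == 0 || limitBytes >= totalBytes {
		return 1
	}
	return float64(limitBytes) / float64(totalBytes)
}

func main() {
	// A 1 GiB limit on a 5 GiB pin works out to fraction 0.2.
	fmt.Println(fractionFromSizeLimit(1<<30, 5<<30))
	// A limit larger than the pin simply means the whole thing: fraction 1.
	fmt.Println(fractionFromSizeLimit(10<<30, 5<<30))
}
```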

The pin list would then look like this:

$ ipfs pin ls --type fraction
QmCID1 direct+fraction (fraction=0.2)
QmCID2 direct+fraction (fraction-size-limit=1GB - fraction=0.23)
