Probabilistic pinning for "replication everywhere" mode #1058
This is a really good idea.

### Checksum-based decisions

We don't have to flip a coin. We could run a checksum over the CID and the PeerID to get a deterministic decision for each CID on whether it ends up on a specific node or not. Every node in the cluster can calculate the checksum and see if the value is lower or higher than the probability.

### Algorithm and speed considerations

We need a cryptographically secure hash algorithm to make sure we spread the data evenly over all nodes. I would recommend blake2b because of its speed. So if the probability is 10% (the example used below), a CID ends up on a node whenever its checksum falls into the lowest 10% of the hash range.

### Writing efficiently to the cluster

The writing server can send each CID to the node with the highest checksum value inside the probability range, and this way avoid sending data multiple times to the cluster. As a fallback, the node would have to store the CID locally for a configurable timeframe, like a day or so, to avoid writing a CID to a node which then goes offline or doesn't send any data back. When the writing server determines itself as the node with the highest value, it sends the data to the node with the second highest value instead. This distributes writes equally across all nodes, and all nodes will start to fetch the data from all other nodes to complete their cluster state again.

### Slow nodes

Ideally, we can send all data directly, CID by CID, to each node in the cluster. But this won't always work: some nodes might be behind the same connection, or some nodes might simply have a slower connection than we can deliver the blocks at. We could mitigate this by skipping a node if there's still a CID in flight when we hit it again; this way we just write to a random other node in the cluster when we hit the bandwidth limit of individual nodes. This can be extended until we hit the bottom of the list, in which case we just start at the top again. Slow nodes would simply respond that they don't have the data if other nodes try to fetch it from them. In this case the cluster would just try every node on the list in order to find the node holding the CID.

### Detecting malfunctioning nodes

Since the writing server's target can be (somewhat) reliably determined by the rest of the cluster, bad nodes could be identified in a distributed fashion. If a Peer-ID fails to respond with the data of new pins over and over again, the individual nodes could report this back to the trusted peers. With the information from the writing servers, poorly performing nodes that don't send the data back to the network can be identified. This could, for example, be used to greylist them for a while or to notify the administrator of the cluster about the malfunction.

### New nodes joining the cluster

A new node would receive the configuration from the cluster and all necessary metadata, and then calculate which part of the pinset it should hold. This allows a newly joined node to start holding parts of the cluster without any new writes necessary - so clusters with long-term data can be spread across multiple new nodes. To fetch its share, the new node would calculate the hashes only until it finds a match with a cluster peer which is online, so it's not necessary to calculate all hashes per CID on large clusters to find a node with the data. If none of the nodes found by the calculation holds the data, the new node can fall back to the DHT search.

### Space considerations

**Database:** Since we only need to store the CIDs, and not a list of nodes for each CID, we would shrink the size of the database considerably at large cluster sizes.

**Node hard drive space:** This distribution model spreads CIDs equally over all nodes, so the space requirements are roughly equal if the CIDs are equally large.
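The checksum decision and the writer ranking above can be sketched in a few lines. This is a minimal illustration in Python (ipfs-cluster itself is Go); the function names are hypothetical, and the only assumptions taken from the text are blake2b over CID + PeerID and a comparison against the probability:

```python
import hashlib

HASH_BITS = 64  # use the first 8 bytes of the blake2b digest as an integer


def checksum(cid: str, peer_id: str) -> float:
    """Map (CID, PeerID) deterministically into [0, 1) with blake2b."""
    digest = hashlib.blake2b((cid + peer_id).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2 ** HASH_BITS


def decide_pin(cid: str, peer_id: str, probability: float) -> bool:
    """A peer stores the CID iff its checksum lands below the probability."""
    return checksum(cid, peer_id) < probability


def rank_allocations(cid, peers, probability):
    """Peers that store the CID, highest checksum first: the writer sends
    the block to the first entry (or, if that is the writer itself, to
    the second one)."""
    hits = [p for p in peers if decide_pin(cid, p, probability)]
    return sorted(hits, key=lambda p: checksum(cid, p), reverse=True)
```

Because the decision is a pure function of (CID, PeerID, probability), every node can recompute anyone's allocation without coordination, which is what makes the lookup and malfunction-detection ideas above possible.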
### Local node space modifier

To avoid every node needing a huge amount of storage to be part of the cluster, we could allow individual nodes to have a local modifier. The node would report the local modifier (if it's different than 1) and a trusted peer would add this information to the cluster metadata. This way a node can report changing local modifiers, and every node in the cluster can calculate whether this node holds certain data or not.

The local modifier is 1 by default, which means the node uses the plain first-round probability decision, like the 10% in the example: if the checksum is below 10%, the CID ends up on this node. If the local modifier is 0.5, the 10% is cut in half and everything with a checksum below 5% is stored on this node. If the local modifier is 2, the 10% is doubled and the node stores 20% of the pinset. If the local modifier is "inf", the node stores all data of the pinset.

### Algorithm quirks for lookup

Since the lookup in the first stage (with the checksum) results in a small subset of nodes (all of which have a local modifier <= 1), we additionally need to check the nodes with a local modifier > 1, not just the first-stage decision. That's a bit of added complexity, but this way we can define nodes that hold the full cluster pinset by a local setting.

### Node quota

Additionally, we need a way for the nodes in the cluster to report an exceeded quota.

### Communication example

The communication could be handled like:
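The local-modifier arithmetic described above is a small extension of the checksum rule. A hypothetical sketch (the function name is mine; `math.inf` stands in for the "inf" modifier from the text):

```python
import hashlib
import math


def checksum(cid: str, peer_id: str) -> float:
    """Deterministic blake2b checksum of (CID, PeerID) in [0, 1)."""
    digest = hashlib.blake2b((cid + peer_id).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2 ** 64


def decide_with_modifier(cid: str, peer_id: str,
                         probability: float, modifier: float = 1.0) -> bool:
    """Scale this node's share of the pinset by its local modifier:
    0.5 halves the threshold, 2 doubles it, math.inf stores everything."""
    if math.isinf(modifier):
        return True
    return checksum(cid, peer_id) < min(probability * modifier, 1.0)
```

Since the threshold only grows or shrinks, the set of CIDs held at modifier 0.5 is a subset of the set held at 1, which in turn is a subset of the set held at 2 - so raising a modifier never forces a node to drop data it already has.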
The admin could then either increase the quota - the node would fetch all the missing data and report back that it can provide all its data to the cluster again - or lower the local modifier to store less data of the cluster. The node would then recover by dropping data and fetching the missing data, and report back with the lower local modifier and that the quota is no longer exceeded.

### Mathematical reliability of data storage

I guess both approaches (flipping a coin as well as my deterministic approach) have math issues: the availability isn't guaranteed. There's just a higher and higher chance that a block is stored on at least one node. But it's just like flipping a coin - after many, many flips you'll get odd results, like 5 times heads :) This can be addressed in three ways:
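The coin-flip concern above is easy to quantify: if `n` peers each decide independently with probability `p`, the chance that nobody stores a given CID is `(1 - p) ** n`, and the expected number of copies is `n * p`. A quick illustration (the numbers are examples, not from the issue):

```python
def miss_probability(p: float, n: int) -> float:
    """Chance that none of n peers stores a given CID when each
    pins it independently with probability p."""
    return (1 - p) ** n


def expected_copies(p: float, n: int) -> float:
    """Average number of peers holding each CID."""
    return p * n
```

For example, 100 peers at a 10% probability give about 10 copies per CID on average, and a per-CID miss chance of roughly 2.7e-5 - small, but not zero, which is why some repair mechanism is still needed.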
Hope my brainstorming was useful. It's a bit late here, so it's a bit unorganized - hope everything is understandable.
### Guaranteed availability

We could periodically scan the pinset for pins which are below the minimum guaranteed availability. In this case a repin with a bumped counter would redistribute the pin to other peers with a new hash. Since this is expected to pretty much never happen, we could issue a notification to the admin of the cluster when it does.

### Grow/shrink a cluster

We can grow a cluster very cheaply by updating the global default probability. The update would trigger a rescan of all items pinned in the cluster on each node. This way the nodes can either drop or fetch a portion of the cluster. For shrinking (more data on each node), the nodes could use the old probability to calculate the priority list of nodes which should have the data they now need to hold, to avoid asking nodes which probably don't hold the data. This operation is cheap since we only add one commit to the cluster and don't have to update each pin (as long as they all use the default probability). It is also cheap because it's fully distributed - each cluster member has all the data available to calculate the necessary steps on its own.

### Scattering (sharding)

We can also do sharding this way - though I would rather call it scattering. We can add an integer to each pin which defines the scattering pattern. The scattering integer should be stored on each pin as an individual value, and not - like the probability - be modifiable via a global variable when not specified while pinning. Changing the scattering is a major operation on a cluster, since all objects need to be reanalyzed. It's better to allow this only for repin operations.
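The grow/shrink rescan described above falls out of the deterministic checksum: after a global probability update, each peer can locally diff its old share against its new one. A hypothetical sketch (the `rescan` name is mine):

```python
import hashlib


def checksum(cid: str, peer_id: str) -> float:
    """Deterministic blake2b checksum of (CID, PeerID) in [0, 1)."""
    digest = hashlib.blake2b((cid + peer_id).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2 ** 64


def rescan(pinset, peer_id, old_p, new_p):
    """After a global probability update, each peer recomputes its own
    share locally and returns (to_fetch, to_drop)."""
    to_fetch, to_drop = [], []
    for cid in pinset:
        held = checksum(cid, peer_id) < old_p
        keep = checksum(cid, peer_id) < new_p
        if keep and not held:
            to_fetch.append(cid)
        elif held and not keep:
            to_drop.append(cid)
    return to_fetch, to_drop
```

Growing the cluster (lowering per-node probability) only produces drops, and shrinking it only produces fetches, so the two directions are exact mirrors of each other.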
@hsanjuan looking again at this, I think what I wrote up is a bit overcomplicated - too many ifs and options.

How about just adding the root of a pin as a direct pin and running the deterministic probability per block, keyed by the Node-ID itself - so blockwise (or scattering=-1 in the terms above). We would still need a float for what percentage of the data a node should store.

The direct pin of the root CID allows finding the nodes storing the data pretty efficiently, as all nodes holding parts of it are also holding the CID itself. Bitswap could also be extended to share the "sharded pin" information, so ipfs nodes don't have to send (in this example) 90% of their requests just to get an "I don't have this block" answer. If a sharded pin works similarly to a DHT (blocks close to the Node-ID are stored there), then the same "distance" algorithm could be used to determine which node holds which data.

So ipfs itself would need to be extended with a "sharded" pin type that carries a percentage value. This would allow people to also pin stuff they like as a percentage, to help keep it online, without having to organize themselves in a cluster.
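One way to read the DHT analogy above: hash block CIDs and Node-IDs into the same keyspace and keep the blocks whose XOR distance to the node is smallest. Because XOR with a fixed address preserves the uniform distribution of the hashes, a plain threshold on the normalized distance keeps approximately the requested fraction. A hypothetical sketch (the 64-bit keyspace and names are my own):

```python
import hashlib

BITS = 64


def addr(s: str) -> int:
    """Hash a CID or Node-ID into a shared 64-bit keyspace."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")


def holds_block(block_cid: str, node_id: str, fraction: float) -> bool:
    """Kademlia-style rule: a node stores the fraction of blocks whose
    XOR distance to its own address is smallest. Normalizing the
    distance to [0, 1) turns 'closest fraction' into a threshold test."""
    distance = addr(block_cid) ^ addr(node_id)
    return distance / 2 ** BITS < fraction
```

Other nodes (and an extended Bitswap) could run the same test to predict which peers are worth asking for a given block, instead of querying everyone.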
TL;DR: add a fraction switch to the pin operation in ipfs:

    ipfs pin add --fraction=0.1 /ipfs/CID

And make it updatable:

    ipfs pin update --fraction=0.2 /ipfs/CID

Make it also accept sizes and percent:

    ipfs pin update --fraction-size-limit=1GB /ipfs/CID
    ipfs pin update --fraction=10% /ipfs/CID

The pin list would then look like this:

    $ ipfs pin ls --type fraction
    QmCID1 direct+fraction (fraction=0.2)
    QmCID2 direct+fraction (fraction-size-limit=1GB - fraction=0.23)
hsanjuan commented Apr 9, 2020
This is an initial idea dump from things I've had in my head for a bit.
Background
With collaborative clusters, we have the possibility of creating very large clusters (with peers in the hundred/thousands).
However, the current allocation system does not work very well for collaborative clusters made of untrusted peers:
For these reasons, we make some recommendations:
Cluster cannot ensure that the content is effectively pinned by peers. Filecoin addresses the problem of trust when storing something (see the level of complexity involved).
Proposed improvement
All in all, for public collaborative clusters we need to pin everywhere, and that means that a cluster with many peers and lots of content will not do an efficient distribution of the content.
In order to address this I propose turning things around and, rather than letting a trusted cluster peer pick specific allocations, letting every peer probabilistic-ally decide if they will be pinning something or not.
Example:
Implementation notes
We can assume the current approach has a pinning-probability of 100%. So the question is how to allow reducing that number to something else.
This only applies to pins with replication factor -1. The current allocation system would still work (because it is useful for trusted-clusters and works well in that setup).
We can:
The handling and probability calculation will be implemented in the pin tracker.
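A rough sketch of how the pin tracker could gate this, assuming (as the notes above say) that only replication-factor -1 pins are affected and everything else keeps the current allocator behavior. The function name and parameters are illustrative:

```python
import hashlib


def checksum(cid: str, peer_id: str) -> float:
    """Deterministic blake2b checksum of (CID, PeerID) in [0, 1)."""
    digest = hashlib.blake2b((cid + peer_id).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2 ** 64


def should_pin_locally(cid: str, peer_id: str,
                       replication_factor: int, probability: float = 1.0) -> bool:
    """Pin-tracker decision: 'pin everywhere' entries (rf == -1) with a
    probability below 1 are decided probabilistically; all other pins
    keep the current behavior (probability 1.0 == today's semantics)."""
    if replication_factor != -1 or probability >= 1.0:
        return True
    return checksum(cid, peer_id) < probability
```

With `probability=1.0` as the default, existing clusters behave exactly as before, and the feature only activates when a lower value is configured.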
Questions: