
[2021 Theme Proposal] probabilistic tiering #80

@RubenKelevra

Description

Note, this is part of the 2021 IPFS project planning process - feel free to add other potential 2021 themes for the IPFS project by opening a new issue or discuss this proposed theme in the comments, especially other example workstreams that could fit under this theme for 2021. Please also review others’ proposed themes and leave feedback here!

Theme description

Please describe the objective of your proposed theme, what problem it solves, and what executing on it would mean for the IPFS Project.

Hypothesis

Currently, we have a binary state of data availability: a node either has the data or it does not. If a node has the data, it can provide it, and every node searching for that data assumes that every provider has exactly the same (free) network bandwidth, disk speed, processing power, and so on.

While this makes sense to a degree, since we just issue more requests to nodes that deliver the data faster, this assumption has its limits.

Mobile devices, for example, might run on batteries, which makes requesting data from them a poor choice and shows where distributing requests in a round-robin fashion doesn't make sense.

Putting a tiering number on the data a node provides (stored in the DHT) makes it possible to select a server instead of a tablet or a Raspberry Pi as a source. To avoid a situation where only the servers receive requests, it makes sense to take a probabilistic approach to data requests:

The higher the tiering number, the higher the probability that a request will start there. This prevents small nodes from being overloaded, because they would otherwise receive the same number of requests as a large server.
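A minimal sketch of this weighted selection, in language-agnostic Python (the provider records, tiering values, and node names are illustrative assumptions, not part of any existing IPFS API):

```python
import random

# Hypothetical provider records as a DHT lookup might return them;
# "tier" is the proposed tiering number (higher = more capable node).
providers = [
    {"id": "server", "tier": 80},
    {"id": "tablet", "tier": 15},
    {"id": "raspi",  "tier": 5},
]

def pick_provider(providers, rng=random):
    """Select a provider with probability proportional to its tier,
    so capable nodes absorb most requests while small nodes still
    participate occasionally instead of being hit round-robin."""
    weights = [p["tier"] for p in providers]
    return rng.choices(providers, weights=weights, k=1)[0]

# Over many requests, the split roughly follows the tier weights.
rng = random.Random(42)
counts = {"server": 0, "tablet": 0, "raspi": 0}
for _ in range(10_000):
    counts[pick_provider(providers, rng)["id"]] += 1
print(counts)
```

With the weights above, the server ends up handling roughly 80% of requests while the Raspberry Pi still serves a small share, which is exactly the "probabilistic" part of the proposal.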

While the tiering number saved in the DHT makes sense as an average (for example, on each provide the node saves the mean 95th-percentile load over the providing period), short overloads might make it necessary to add an informational message that can update the tiering number on each node currently requesting data from it.

An example of how local data requests could still be preferred is to combine the tiering number with the measured latency (in ms), so that nearby nodes win over equally capable nodes far away.
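One possible reading of this (an assumption; the proposal does not pin down the exact formula) is to weight each provider by `tier / latency_ms`, so a nearby low-tier node can outrank a distant high-tier server:

```python
import random

# Hypothetical providers: a Raspberry Pi on the LAN vs. a big server
# across the internet. Tiers and latencies are made-up illustration.
providers = [
    {"id": "lan-raspi",     "tier": 5,  "latency_ms": 2},
    {"id": "remote-server", "tier": 80, "latency_ms": 120},
]

def latency_weight(p):
    # One candidate scoring function: capability per millisecond of
    # latency. Nearby nodes get a boost even with a modest tier.
    return p["tier"] / p["latency_ms"]

rng = random.Random(7)
weights = [latency_weight(p) for p in providers]
pick = rng.choices(providers, weights=weights, k=1)[0]

# lan-raspi scores 5/2 = 2.5, remote-server 80/120 ≈ 0.667,
# so the local node is favored most of the time.
print({p["id"]: round(latency_weight(p), 3) for p in providers})
```

The exact combination (multiply, divide, add a floor) would need tuning; the point is only that latency can fold into the same probabilistic weighting.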

Vision statement

In the long run we probably want to create tiered data stores, which would make it possible to mark different data with different tiering numbers based on the storage media they are saved on.

This would enable us to use cheap cold storage, such as Blu-ray discs or magnetic tape with an automated loading mechanism, to store extremely large quantities of data over long periods with few accesses, since most content won't be accessed at all as long as it is also stored somewhere else in the network.

Why focus this year

I've heard there's work underway to get the Internet Archive onto the network, and mobile clients are also a topic to tackle soon. Both would benefit from a way to differentiate between different tiers of nodes.

Example workstreams

  • Implement a measurement of free memory, processing speed, and bandwidth to estimate how fast a node can provide data after startup.
  • Gather how fast we deliver data over a long period to estimate the free bandwidth.
  • Update the value in the DHT on each reprovide.
  • Implement informational updates in the protocol that give nodes a more recent estimate, such as a rolling five-minute 95th-percentile load.
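The last workstream could be sketched as follows; the window size, sampling interval, and load values are assumptions for illustration only:

```python
from collections import deque

class RollingLoad:
    """Keep load samples for a rolling window (default: 5 minutes at
    one sample per second) and report the 95th percentile, which a
    node could push out as its short-term tiering update."""

    def __init__(self, window_s=300, interval_s=1):
        # deque with maxlen drops the oldest sample automatically,
        # giving a rolling window without manual bookkeeping.
        self.samples = deque(maxlen=window_s // interval_s)

    def record(self, load):
        self.samples.append(load)

    def p95(self):
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
        return ordered[idx]

tracker = RollingLoad()
for load in [0.1] * 95 + [0.9] * 5:   # mostly idle, brief spike
    tracker.record(load)
print(tracker.p95())  # → 0.9
```

Because the percentile looks at the whole recent window, a brief spike dominates the reported value, which is the intended behavior: requesters back off from a node that was just overloaded.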
