
Provide fewer nodes #6155

Open
eingenito opened this issue Mar 28, 2019 · 4 comments
Labels
topic/meta Topic meta topic/provider Topic provider

Comments


eingenito commented Mar 28, 2019

Problem

Providing is an expensive operation and even with significant DHT speedups is likely to remain so. The growth of the IPFS network has made this problem more acute. It would be beneficial if IPFS instances could choose to provide only a subset of all nodes without dramatically impacting the availability and retrieval speed of the data.

Notes

The Go-IPFS team began discussing this problem at our October 2018 London meetup.

  • @Kubuxu devised an algorithm and associated visualization that demonstrated the selection of good nodes for providing.
  • A requirement for all partial providing strategies is associated changes in bitswap (or whatever) that allow it to walk 'up' the dag from a node, searching for providers of ancestor nodes (since the current node may never have been provided).
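The ancestor walk described above could look something like the following sketch. The names (`findProvidedAncestor`, the plain-string CIDs, the in-memory maps) are hypothetical stand-ins for what would really be go-cid values and DHT lookups:

```go
package main

import "fmt"

// findProvidedAncestor walks 'up' a DAG from a node that has no provider
// records, returning the nearest ancestor that was provided. All names
// here are illustrative; real code would query the DHT per step.
func findProvidedAncestor(node string, parent map[string]string, provided map[string]bool) (string, bool) {
	cur := node
	for {
		if provided[cur] {
			return cur, true
		}
		p, ok := parent[cur]
		if !ok {
			return "", false // reached a root without finding a provider record
		}
		cur = p
	}
}

func main() {
	parent := map[string]string{"leaf": "mid", "mid": "root"}
	provided := map[string]bool{"root": true}
	anc, ok := findProvidedAncestor("leaf", parent, provided)
	fmt.Println(anc, ok) // root true
}
```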
@eingenito
Copy link
Contributor Author

eingenito commented Mar 28, 2019

Some conversation paraphrased from IRC:

@steb: I think the higher priority work will be the reverse link graph. We're going to run into problems with provider strategies unless we have some way to traverse back up a dag and find providers for parents.
@eingenito: I wonder if the reverse link graph needs to be persisted or if it can be transient and owned by a bitswap session? Note: I think that there are other features enabled by a persistent database of per-node metadata and this might be a good place to start.

@eingenito: I think Kuba's algorithm was maximally efficient but required analysis of an entire dag to select optimal nodes for providing? Is that right?
@Kubuxu: no, it was nowhere near maximum efficiency, but it required minimal information about the DAG. It just walked a DAG and made decisions depending on the current node (and a single number about the past state of the tree as we walked it).

@eingenito: Also, given how expensive providing is always likely to be, I wonder if it isn't more useful to assign each node a priority with respect to providing and then just provide in priority order, knowing that many will simply never get provided.


Edit (Kubuxu): s/@kuba/@Kubuxu/ (someone else has the handle kuba on GH.)


Kubuxu commented Mar 28, 2019

The core idea behind probabilistic providing is based on the fact that when someone has a given node, there is a good chance they also have its children. When we are missing a node, we can walk up the tree of nodes and try to resolve providers as we backtrack.

The essential part is having information about backlinks from the current node. This can be implemented in a number of ways:

  • backlink/graph DB (hard/costly)
  • tracking backlinks as we recurse a tree
  • storing root and path to the current node as we recurse a tree
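The second and third options can be combined in a traversal that simply carries the path from the root as it recurses, which is exactly the transient backlink information a bitswap session would need. A minimal sketch, with a hypothetical in-memory `Node` type standing in for IPLD nodes:

```go
package main

import "fmt"

// Node is a minimal in-memory DAG node (hypothetical; real nodes are
// IPLD nodes addressed by CID).
type Node struct {
	CID   string
	Links []*Node
}

// walk recurses the DAG, passing each visit callback the path of
// ancestor CIDs from the root down to (but not including) the current
// node. The path is copied per level so callbacks may retain it.
func walk(n *Node, path []string, visit func(n *Node, path []string)) {
	visit(n, path)
	child := append(append([]string(nil), path...), n.CID)
	for _, l := range n.Links {
		walk(l, child, visit)
	}
}

func main() {
	leaf := &Node{CID: "leaf"}
	root := &Node{CID: "root", Links: []*Node{{CID: "mid", Links: []*Node{leaf}}}}
	walk(root, nil, func(n *Node, path []string) {
		fmt.Println(n.CID, path)
	})
}
```

Tracking the path this way costs nothing to persist, at the price of only knowing backlinks for content the session itself walked.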

The probabilistic providing strategy has to optimize which nodes to provide depending on a number of factors:

  • A: target ratio of nodes we want to provide
  • B: maximum and average number of backlinks walked up to find a single provider
  • C: average number of backlinks walked up to find M of N providers

The reason I propose a probabilistic instead of a deterministic providing strategy is that in the former C is much smaller than B; this isn't true in the latter (while maintaining the same A).

I've also discovered a way to describe basic probabilistic providing strategies as Markov chains; I used it in my simulations (since Markov chains reduce to matrix multiplications). I will explain it if there is interest.

@eingenito

@Kubuxu thanks for the synopsis, it's very helpful. I have an observation, already mentioned above, but I want to restate it. Providing is so expensive, and the load on IPFS nodes so variable that I think it might always be important to model providing (and reproviding) as a best effort service. A probabilistic scheme for choosing good blocks to provide might be best used as a source of weights or priorities for all blocks so the re/provider system can appropriately order actual writing to the DHT. That way we can optimize providing under changing conditions and backfill lower priority provider records when conditions allow. Does that make sense?


Kubuxu commented Mar 30, 2019

It makes a lot of sense. Another thing is, we still have a lot of headway to improve our DHT.
