Provider Strategy Discussion #6221

Open
michaelavila opened this issue Apr 15, 2019 · 23 comments
Labels
need/community-input, topic/provider

Comments

@michaelavila
Contributor

michaelavila commented Apr 15, 2019

@Stebalien please add the people you think should be a part of this discussion.

We are introducing a provider system into go-ipfs (#6141) that replaces the mechanism in bitswap that provides every new block it comes into contact with. The new provider system is intended to give us more control over which blocks are provided during different operations.

The important questions for this group are:

1. What kinds of provide strategies will need to be supported?

e.g.

  • All
  • Roots
  • Whole/recursive (all of a DAG when it is added or pinned)
  • Pin roots
  • Pin recursive
  • Probabilistic (TBD)
  • Nothing
  • Some other strategy ...

@Stebalien mentioned an approach in #5870 (comment) that should be considered here as well.
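
To make this concrete, here's a rough sketch of how a strategy could be modeled: a function that picks which CIDs to announce for a given root. The names (Strategy, RootsOnly, Nothing) are invented for illustration and are not go-ipfs code.

```go
package strategies

import (
	"context"

	"github.com/ipfs/go-cid"
)

// Strategy selects the CIDs to announce for a freshly added or pinned root.
type Strategy func(ctx context.Context, root cid.Cid) <-chan cid.Cid

// RootsOnly announces just the root block.
func RootsOnly(ctx context.Context, root cid.Cid) <-chan cid.Cid {
	out := make(chan cid.Cid, 1)
	out <- root
	close(out)
	return out
}

// Nothing announces no blocks at all; content routing stays enabled,
// we simply never put anything on the provide queue.
func Nothing(ctx context.Context, root cid.Cid) <-chan cid.Cid {
	out := make(chan cid.Cid)
	close(out)
	return out
}
```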

2. What do we have to support initially?

Given the concerns I hear around providing, it seems like the following would help right away and could be merged on its own:

  • the ability to provide everything
  • the ability to disable providing without disabling content routing
  • the ability to provide only roots during bulk provide operations (e.g. add and pin)

But would that be enough?


Additionally, the reprovider strategies are being removed; instead, a reprovide will work over all blocks that have been provided. Is that OK?

michaelavila added the need/community-input label Apr 15, 2019

@hannahhoward
Contributor

Currently, various "get"-like operations trigger a provide (e.g. ipfs get, ipfs ls). Providing everything would be fine, but this also presents a good opportunity to use a more limited strategy, like just roots.

@hannahhoward
Contributor

^^ I think the above behavior is just a side effect of bitswap providing every single block that goes through its system.

@obo20

obo20 commented Apr 15, 2019

@Stebalien had mentioned to me there's an issue with providing roots where for example, if a node downloads certain parts of a tree and then gets interrupted, it won't automatically know how to "walk back up" the tree to get the root so it can download the rest of the hashes it needs to.

That might be something useful to consider when designing this.

@michaelavila
Contributor Author

@obo20 thanks for joining the discussion! You're correct, it's a thing we have to consider and there has been some discussion about how best to approach that problem. I'm curious what some of the needs and/or pain points around providing are for Pinata, do you have any?

@michaelavila
Contributor Author

@hannahhoward thanks! That's a great thing to point out.

@obo20

obo20 commented Apr 16, 2019

The biggest pain points we've had involve content discovery times for both old content as well as recently added content. For lack of a better explanation, it seems like the provider just can't keep up with the amount of content it needs to announce.

@michaelavila
Contributor Author

@obo20 thanks again for responding.

The biggest pain points we've had involve content discovery times for both old content ...

Understood. Here we are taking measures to reduce the number of things we provide in the first place. But like you mention, that's only part of the solution. There are efforts parallel to this for improving content discovery times.

... as well as recently added content.

Although it does not solve all of the problems you're running into, a small portion of the new provider system is being released with 0.4.20 (#6223) that will announce the root node from ipfs add and ipfs pin add commands. Hopefully this will make recently added content discoverable more quickly for you.


@obo20 I have two questions if you don't mind: I know Pinata is primarily dealing with pinned content, but how much non pinned content are you dealing with? Do you have a sense of the number of things you're trying to provide each day?


@Stebalien @eingenito do either of you know of an issue (or anything) tracking work being done in libp2p on content discovery performance?

@obo20

obo20 commented Apr 16, 2019

@michaelavila No problem. Glad to help in any way I can. We're quite excited about a lot of the improvements coming to IPFS in 0.4.20. We'll be keeping an eye on things to see how performance improves.

I have two questions if you don't mind: I know Pinata is primarily dealing with pinned content, but how much non pinned content are you dealing with?

Currently our provider strategy is set to pinned. So we should only be announcing our content.

Do you have a sense of the number of things you're trying to provide each day?

Currently we have roughly 7000 root nodes that are being announced each day, but this number is increasing steadily.

@michaelavila
Contributor Author

michaelavila commented Apr 16, 2019

@obo20 thank you. Out of curiosity, does Pinata have non pinned content in ipfs? If so, is it a lot?

@eingenito
Contributor

eingenito commented Apr 17, 2019

@michaelavila and I just had a discussion and, following a conversation with @Stebalien it seems like there's a requirement that we preserve the current 0.4.20 'root block first' providing behavior no matter what. Would it be useful to merge an experimental provider system that offers the following control of re/providing:

  • Re/Provide none
  • Provide all blocks with root blocks always provided first; reprovide all blocks without prioritization

That's really it. Users who choose to can re/provide nothing without turning off content routing, and everyone else will re/provide everything, but root blocks will always take strict precedence on initial provide. If we can't provide at a rate that exhausts our list of root blocks we'll never even get to other blocks. Prioritizing (or restricting re/providing to) pinned roots or subtrees would be added later.
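
Roughly, the "roots take strict precedence" behavior could look like the sketch below (made-up names, not the actual go-ipfs queue): two FIFO queues, where the worker only drains the block queue when no roots are waiting.

```go
package providerqueue

import "github.com/ipfs/go-cid"

// PriorityQueue holds CIDs waiting to be provided.
type PriorityQueue struct {
	roots  []cid.Cid // announced first, in arrival order
	blocks []cid.Cid // announced only when no roots are waiting
}

// PushRoot enqueues a root CID with priority.
func (q *PriorityQueue) PushRoot(c cid.Cid) { q.roots = append(q.roots, c) }

// PushBlock enqueues a non-root CID.
func (q *PriorityQueue) PushBlock(c cid.Cid) { q.blocks = append(q.blocks, c) }

// Next returns the next CID to provide; roots always win.
func (q *PriorityQueue) Next() (cid.Cid, bool) {
	if len(q.roots) > 0 {
		c := q.roots[0]
		q.roots = q.roots[1:]
		return c, true
	}
	if len(q.blocks) > 0 {
		c := q.blocks[0]
		q.blocks = q.blocks[1:]
		return c, true
	}
	return cid.Undef, false
}
```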

Presumably gateway could make use of the 'none' behavior. @hsanjuan could cluster make use of the 'all' strategy knowing that roots would always be provided first? Or more generally what does cluster do wrt re/providing?

Subsequent merges could add (potentially):

@Stebalien, @magik6k, @hsanjuan, @scout - any comments?

@eingenito
Contributor

@obo20 I had a couple of questions/comments: you said you're using the 'pinned' reprovider strategy. If most (all?) of your content is pinned that's roughly equivalent to using 'all'. Would the 'roots' strategy be appropriate for your use case which just reprovides pin roots?

@obo20

obo20 commented Apr 17, 2019

@obo20 thank you. Out of curiosity, does Pinata have non pinned content in ipfs? If so, is it a lot?

All of our content is pinned. We don't store any non-pinned content.

@obo20 I had a couple of questions/comments: you said you're using the 'pinned' reprovider strategy. If most (all?) of your content is pinned that's roughly equivalent to using 'all'. Would the 'roots' strategy be appropriate for your use case which just reprovides pin roots?

@eingenito We considered this; however, the issue I described in my first comment could cause problems. For reference, this is the comment I'm referring to:

@Stebalien had mentioned to me there's an issue with providing roots where for example, if a node downloads certain parts of a tree and then gets interrupted, it won't automatically know how to "walk back up" the tree to get the root so it can download the rest of the hashes it needs to.

If this issue gets fixed, then we could absolutely use the "roots" strategy.

@eingenito
Contributor

eingenito commented Apr 17, 2019

@obo20 - thanks for your answers. The ultimate goal of these refactors is exactly what you're talking about: providing a subset of nodes (and @Kubuxu has done some work to characterize nodes as being particularly awesome to provide: #6155), and then enabling bitswap (or a derivative) to walk back up a DAG that is stalled in transfer, looking for providers along the way, until it can restart.
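
To sketch that "walk back up" idea (invented helper names, not bitswap code): when a transfer stalls, the downloader already knows the chain of ancestors it traversed to reach the stalled block, so it could query content routing for each ancestor in turn, ending at the root, which is the CID most likely to have been announced.

```go
package walkup

import (
	"context"

	"github.com/ipfs/go-cid"
)

// hasProvidersFn stands in for a content-routing lookup (e.g. a DHT query).
type hasProvidersFn func(ctx context.Context, c cid.Cid) (bool, error)

// firstProvidedAncestor walks from the stalled block back toward the root
// (ancestors[0] is the immediate parent, the last entry is the root) and
// returns the first ancestor that has providers, so the fetch can restart
// from there.
func firstProvidedAncestor(ctx context.Context, ancestors []cid.Cid, hasProviders hasProvidersFn) (cid.Cid, bool, error) {
	for _, c := range ancestors {
		ok, err := hasProviders(ctx, c)
		if err != nil {
			return cid.Undef, false, err
		}
		if ok {
			return c, true, nil
		}
	}
	return cid.Undef, false, nil
}
```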

@michaelavila
Contributor Author

@obo20 another question for you (thanks again): any idea how often Pinata is getting find-provider requests for non-root nodes (i.e. how often are you dealing with the interrupt situation you point out)?

@obo20

obo20 commented Apr 18, 2019

As far as I’m aware, we’re not running into that interrupt situation right now as we’re running “pinned” as the provider strategy instead of “roots”.

@Stebalien had simply warned me that it may be an issue if we switched to “roots” as a provider strategy.

@hsanjuan
Contributor

hsanjuan commented Apr 18, 2019

Presumably gateway could make use of the 'none' behavior. @hsanjuan could cluster make use of the 'all' strategy knowing that roots would always be provided first? Or more generally what does cluster do wrt re/providing?

For cluster, what matters is that IPFS peers can pin (that is, find and retrieve) things that are pinned somewhere else. Therefore the "all", "roots", and "pinned" strategies would all work, and we don't have any special requirements that I can think of right now.

The sharding feature (when it lands) might require the "all" strategy in order to reconstruct DAGs split across multiple peers, but we can just make this a requirement for people wanting to shard.

@lanzafame ^^^ double-check this sounds right?

@Stebalien
Member

@michaelavila

@Stebalien @eingenito do either of you know of an issue (or anything) tracking work being done in libp2p on content discovery performance?

There are a few PRs in flight to improve query perf. I also talked with @momack2 earlier today about having someone on the go-ipfs team work with the libp2p team on this kind of stuff. Unfortunately, much of this is still up in the air.


@obo20

@michaelavila No problem. Glad to help in any way I can. We're quite excited about a lot of the improvements coming to IPFS in 0.4.20. We'll be keeping an eye on things to see how performance improves.

So you don't get too excited: 0.4.20 brings some improvements but isn't likely to significantly improve content routing in your case. It may improve content routing for new content on initial add, but that's about it.


@eingenito

Would it be useful to merge an experimental provider system that offers the following control of re/providing:

  • Re/Provide none
  • Provide all blocks with root blocks always provided first; reprovide all blocks without prioritization

Discussed out-of-band but, for the record, yes.

following a conversation with @Stebalien it seems like there's a requirement that we preserve the current 0.4.20 'root block first' providing behavior no matter what.

For context, the issue is that we just massively reduced the provider parallelism in bitswap. Unfortunately, that means it'll take longer to fully provide large files after adding them to go-ipfs. The current 'root block first' providing behavior isn't affected by this reduced provider parallelism.

Users who choose to can re/provide nothing without turning off content routing, and everyone else will re/provide everything, but root blocks will always take strict precedence on initial provide. If we can't provide at a rate that exhausts our list of root blocks we'll never even get to other blocks. Prioritizing (or restricting re/providing to) pinned roots or subtrees would be added later.

To make sure we're on the same page, if the user chooses a provider strategy that doesn't provide the first block, we shouldn't provide it even on initial provide.

Subsequent merges could add (potentially):

Those all sound like good ideas. Implementing them with our current datastore may be tricky but this could be good motivation to finally adopt a database.


@obo20

(@eingenito, wrt our conversation about the "pinned" provider strategy)

How volatile is pinned data? Specifically, could you approximate (don't spend any time on this) the ratio of content unpinned between GC cycles to the total number of pins you have?

I'm asking because it'll be somewhat tricky to re-implement the "pinned" provider strategy in the new provider subsystem exactly as is. Specifically, we'd start providing pinned content when pinned but we wouldn't stop until we've GCed the content (even if something unpins it in the meantime). We don't have to do it this way but it's simpler to implement.

So, I'm wondering how long you generally have "stale" data around.
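
To illustrate the simpler implementation described above (invented types, not the actual provider subsystem): CIDs enter the provide set when pinned and only leave it when a GC pass removes the blocks, so unpinned-but-not-yet-GCed content keeps being reprovided in the meantime.

```go
package pinnedprovider

import (
	"sync"

	"github.com/ipfs/go-cid"
)

// PinnedTracker is the set of CIDs we keep reproviding.
type PinnedTracker struct {
	mu  sync.Mutex
	set map[cid.Cid]struct{}
}

func NewPinnedTracker() *PinnedTracker {
	return &PinnedTracker{set: make(map[cid.Cid]struct{})}
}

// OnPin starts providing c.
func (t *PinnedTracker) OnPin(c cid.Cid) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.set[c] = struct{}{}
}

// OnGC drops CIDs whose blocks were actually collected. Note there is no
// OnUnpin: that gap is exactly the "stale data" window being asked about.
func (t *PinnedTracker) OnGC(removed []cid.Cid) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, c := range removed {
		delete(t.set, c)
	}
}
```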

@obo20

obo20 commented Apr 19, 2019

How volatile is pinned data? Specifically, could you approximate (don't spend any time on this) the ratio of content unpinned between GC cycles to the total number of pins you have?

Our garbage collection runs every 24 hours. At quick glance I'd guess that roughly 10-20% of our repo consists of unpinned data when a typical collection starts.

@raulk
Member

raulk commented Apr 19, 2019

I also talked with @momack2 earlier today about having someone on the go-ipfs team work with the libp2p team on this kind of stuff.

Hit the nail on the head with this proposal. I’ll connect with @momack2 to get the ball rolling.

@jdannenberg

@Stebalien had mentioned to me there's an issue with providing roots where for example, if a node downloads certain parts of a tree and then gets interrupted, it won't automatically know how to "walk back up" the tree to get the root so it can download the rest of the hashes it needs to.

That might be something useful to consider when designing this.

Is this still true? For my use case, providing only the root CIDs of very large trees of recursive pins (multiple terabytes) would be enough, in the sense that the root CID is the only "entrypoint" (e.g. giving the root CID to a pinning service). But I do not want to limit functionality / stability / resilience.

@Winterhuman
Contributor

@bigCrash Setting Reprovider.Strategy to roots (https://github.com/ipfs/kubo/blob/master/docs/config.md#reproviderstrategy) would make your node advertise only root CIDs.
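
For reference, the relevant excerpt of the node's config file (as documented at the link above) would look roughly like this:

```json
{
  "Reprovider": {
    "Strategy": "roots"
  }
}
```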

@jdannenberg

jdannenberg commented Sep 26, 2022

@Winterhuman I understand. But what exactly are the drawbacks? As someone without in-depth knowledge, advertising only root CIDs seems to be enough for, e.g., the use case of a pinning service pinning this root CID and all of its recursive descendants (so the root CID is always the entrypoint).

However, the comment I cited states that there might be problems fetching that tree if ipfs gets interrupted. My question is: is that still the case? Or am I fundamentally misunderstanding something here?
