initial proposal for graphsync support in ipfs #78

willscott · 2021-03-12T04:09:17Z

Tagging @hannahhoward to tell me where i'm wrong: in particular if estimation of scope is off, and if i identified reasonable place where this plugs in & challenges.
Tagging @aschmahmann on if there are objections to graphsync being incorporated in this way we should know about

aschmahmann · 2021-03-12T06:36:43Z

proposals/ipfs-graphsync.md

+## Project definition
+#### Brief plan of attack
+
+[go-merkledag](https://github.com/ipfs/go-merkledag) currently takes a [blockservice](https://github.com/ipfs/go-merkledag/blob/bf51443272bb98cff071eb44ed9ce6c940e82f1f/merkledag.go#L32).  It should also be able to be created from a graphsync provider, or both. We can initially enable graphsync as an opt-in option, and need to design a transition (e.g. do we have merkledag make requests to both graphsync and to blockservice when present? do we make this code somewhat peer-aware, so we know based on provider which one to use?)


TLDR: In trying to make GraphSync usable in go-ipfs having an interface that handles both bitswap and graphsync backed DAGServices is the easy part. Content discovery, spreading requests across peers, prioritizing peers + protocols, etc. are the hard parts.

IIUC the main areas GraphSync is lacking in order to be a drop in replacement for Bitswap are:

We have no code (or AFAIK proposal yet) for distributing data across multiple peers using GraphSync

Graphsync is not on its own suitable for peer discovery in the way Bitswap is (e.g. it may be less efficient due to being unable to ask for HAVEs)

From what I see this limits our options, and some remaining ones are:

Make GraphSync able to be a drop-in replacement:

Extract sessions code from Bitswap/create a way for sessions to be managed across Bitswap and GraphSync

Make sessions smart enough to split queries across peers with GraphSync, or something simpler like trying to use GraphSync only if we fail at using Bitswap for long enough

Use Bitswap (or some other protocol, potentially including GraphSync) for discovery of content from peers we're already connected to

Have options like ipfs cat <cid> [--from=<multiaddr>] that will use GraphSync for direct point-to-point transfers

There are actually proposals that enable local swarm content discovery in graphsync similar to bitswap. See: ipld/specs#355 (and potentially more useful cause it would be full selector discovery). However note that actually implementing this in a way that is performant is a non-trivial task and may involve building the beginnings of a dag store to record what content we actually have locally not just at the block level but at the DAG level.

More importantly, this proposal allows an additional type of request that sits between bitswap's WANT-HAVE and WANT-BLOCK request -- since Graphsync deals with requests that span multiple CIDs, another potentially useful piece of information a graphsync peer can provide is the list of CIDs that will satisfy a selector request, without the blocks attached to those CIDs. This list of CIDs will have to be untrusted, but it makes for a potentially fruitful handoff to Bitswap:

graphsync tells you what you are fetching, and you don't really need to split graphsync requests given you're just getting a list of CIDs (which is small)

Bitswap handles the actual block fetching. (Note: This is a non-trivial implementation cause the CID list will be untrusted, meaning fetched data will need to go in a temporary store till a local selector traversal can verify it, but it's still feels doable to me.)

However: @aschmahmann is totally correct that all of this is difficult to make much use of if we don't provide access to bitswap's session peer list, and ideally pull out session peer tracking from bitswap and make it a more general service. That's a significant though IMHO necessary and timely refactor.

Moreover, getting bitswap to support putting blocks in a temporary store before they go into a permanent store on a per request basis is another significant refactor.

Basically, to do all this, there will be refactoring of bitswap.

I am generally not in favor of moving the protocol selector up to the user a.l.a. ipfs cat <cid> [--from=<multiaddr>] cause it strikes me as something the user should never have to think about.

hannahhoward

I'm going to suggest that we have a conversation on goals here, cause there seems to be a bit of a mixing of ideas about what this proposal is, and what it can unlock.

As I see it, there are two separate legible projects here:

IPFS nodes learn to efficiently request data by mixing graphsync and bitswap requests, to improve overall transfer speeds. This is something I have in my head a well defined roadmap and approach for. I've put some components in the comments, but can write it up seperately.
Filecoin miners can transfer data for storage deals by requesting it from IPFS. What this requires needs some thought, and is not as clear to me, but I don't think this proposal as described moves us any closer to that goal per se. I can see two approaches here:

The miner spins up a seperate full IPFS instance to request data. If they do this, there's no need to add graphsync -- just use what IPFS already uses, which is bitswap
The miner process itself attempts to talk to IPFS with its existing protocols. If this is the case, graphsync itself is not sufficient -- IPFS would also need to understand data transfer requests.

As I see it, we're mixing a well defined but perhaps lower impact project (IPFS request mixing & data-transfer speeds) with a high priority initiative (integrating IPFS + Filecoin) that it won't actually solve. I think it's good for the ecosystem generally to have good transfer speeds, and probably having software that can intelligently mix graphsync and bitswap will be useful in the long term to both IPFS and Filecoin. But there's no direct step here that makes Filecoin and IPFS nodes MORE able to talk to each other than they already do.

hannahhoward · 2021-03-15T20:02:18Z

proposals/ipfs-graphsync.md

+## Project definition
+#### Brief plan of attack
+
+[go-merkledag](https://github.com/ipfs/go-merkledag) currently takes a [blockservice](https://github.com/ipfs/go-merkledag/blob/bf51443272bb98cff071eb44ed9ce6c940e82f1f/merkledag.go#L32).  It should also be able to be created from a graphsync provider, or both. We can initially enable graphsync as an opt-in option, and need to design a transition (e.g. do we have merkledag make requests to both graphsync and to blockservice when present? do we make this code somewhat peer-aware, so we know based on provider which one to use?)


go-merkledag is IMHO a legacy service for retrieving data. go-fetcher is the replacement -- the top level interface is selectors specifically so we can do this kind of request mixing. I would prefer we do this work inside go-fetcher, and slowly deprecate go-merkledag for this sort of this.

Also, go-merkeldag's DAGService APIs are block level, so it's not even clear how it could talk to graphsync at all.

In short, we should replace the reference to go-merkledag here with go-fetcher.

hannahhoward · 2021-03-15T20:18:01Z

proposals/ipfs-graphsync.md

+## Project definition
+#### Brief plan of attack
+
+[go-merkledag](https://github.com/ipfs/go-merkledag) currently takes a [blockservice](https://github.com/ipfs/go-merkledag/blob/bf51443272bb98cff071eb44ed9ce6c940e82f1f/merkledag.go#L32).  It should also be able to be created from a graphsync provider, or both. We can initially enable graphsync as an opt-in option, and need to design a transition (e.g. do we have merkledag make requests to both graphsync and to blockservice when present? do we make this code somewhat peer-aware, so we know based on provider which one to use?)


There are actually proposals that enable local swarm content discovery in graphsync similar to bitswap. See: ipld/specs#355 (and potentially more useful cause it would be full selector discovery). However note that actually implementing this in a way that is performant is a non-trivial task and may involve building the beginnings of a dag store to record what content we actually have locally not just at the block level but at the DAG level.

More importantly, this proposal allows an additional type of request that sits between bitswap's WANT-HAVE and WANT-BLOCK request -- since Graphsync deals with requests that span multiple CIDs, another potentially useful piece of information a graphsync peer can provide is the list of CIDs that will satisfy a selector request, without the blocks attached to those CIDs. This list of CIDs will have to be untrusted, but it makes for a potentially fruitful handoff to Bitswap:

graphsync tells you what you are fetching, and you don't really need to split graphsync requests given you're just getting a list of CIDs (which is small)

Bitswap handles the actual block fetching. (Note: This is a non-trivial implementation cause the CID list will be untrusted, meaning fetched data will need to go in a temporary store till a local selector traversal can verify it, but it's still feels doable to me.)

However: @aschmahmann is totally correct that all of this is difficult to make much use of if we don't provide access to bitswap's session peer list, and ideally pull out session peer tracking from bitswap and make it a more general service. That's a significant though IMHO necessary and timely refactor.

Moreover, getting bitswap to support putting blocks in a temporary store before they go into a permanent store on a per request basis is another significant refactor.

Basically, to do all this, there will be refactoring of bitswap.

I am generally not in favor of moving the protocol selector up to the user a.l.a. ipfs cat <cid> [--from=<multiaddr>] cause it strikes me as something the user should never have to think about.

hannahhoward · 2021-03-15T20:22:22Z

proposals/ipfs-graphsync.md

+
+#### Alternatives
+
+* we live with bitswap


I hate to trashtalk my own library, but I'm not sure we can't just live with bitswap. I'm not clear ultimately that bitswap is a huge bottleneck. I really would like to explore what the heck is going on in the BeyondBitswap testbed that's resulting in such huge diffs between full-stack IPFS transfer and Bitswap only transfer before we embark on a huge optimization project with go-graphsync. While all the refactors I mentioned above to bitswap, along with request mixing with graphsync, represent my personal dream project in terms of my own interests and competencies, I'm not sure if it's actually our highest priority here. Is data transfer speed truly a big pain point for IPFS right now? I'm not sure. What's the product team say?

hannahhoward · 2021-03-15T20:23:50Z

proposals/ipfs-graphsync.md

+
+#### Future opportunities
+
+moves us a step closer to ipfs and filecoin nodes being able to exchange data


IPFS nodes already have graphsync as an experimental feature for responding to requests -- maybe we just turn it on?

hannahhoward · 2021-03-15T21:22:43Z

proposals/ipfs-graphsync.md

+
+#### Impact
+
+High. we invision cases where we want IPFS and filecoin nodes to be able to communicate


If this is the main goal, I'm not sure this proposal is right.

I'm trying to think through this use case: "a client builds a car of data to store, creates a filecoin deal for it, and the miner fetches the data from an IPFS client with graphsync"

Can you explain in more detail what this means? So I have all the data, I make a car file, and then I make a filecoin deal for it -- why don't I just send to the miner with data transfer and graphsync?

What's the use case for the miner fetching with graphsync via an IPFS client? Moreover, why would a miner run a full instance of IPFS when IPFS already supports retrieving via graphsync as an experimental feature -- it seems to me all of the software then is just built into the filecoin mining software-- they don't run a whole IPFS node.

Again, this is a fine project, but as specified the impact is to IPFS data transfer speeds. it really doesn't unlock much on its own for filecoin miners getting data from IPFS.

also, Filecoin doesn't speak Graphsync: It speaks data transfer on top of graphsync. Just supporting graphsync as a protocol is not enough on its own to unlock Filecoin transfer. IPFS needs to have an integrated data transfer instance to do what I think you want to do here. But again, I am not sure I understand the use case.

BigLep · 2021-03-24T23:55:02Z

Lots of great comments and points raised by @hannahhoward . When time permits, I assume you'll respond/incorporate @willscott .

willscott · 2021-03-25T00:09:45Z

Yep. This definitely needs a lot more clarity. I think the product asks are still evolving quite a bit as well:

do we want a go-IPFS node to initiate a connection to a lotus node, or vice versa?
are we okay speaking bitswap initially?
what sort of selector support do we need and where does that come into play (e.g. threaded up from an API command being made by a user to the go-ipfs node?)

Decisions on that path (cc @pooja ) will inform the next pass at this project scope.

mikeal · 2021-03-25T00:18:23Z

Can we set this PR to draft until you have time to refine it?

willscott · 2021-03-25T00:57:47Z

I don't see a button to do that, but no objection.

rvagg · 2021-03-25T05:41:44Z

@willscott over just under the reviewer list at the top right ⬈ Still in progress? Convert to Draft.

initial proposal for graphsync support in ipfs

3b4aa35

github-actions bot requested a review from jacobheun March 12, 2021 04:09

aschmahmann reviewed Mar 12, 2021

View reviewed changes

mikeal added this to Needs Review in Project Pipeline Mar 15, 2021

hannahhoward reviewed Mar 15, 2021

View reviewed changes

willscott marked this pull request as draft March 25, 2021 14:22

mikeal removed this from Needs Review in Project Pipeline Mar 31, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

initial proposal for graphsync support in ipfs #78

initial proposal for graphsync support in ipfs #78

willscott commented Mar 12, 2021

aschmahmann Mar 12, 2021

hannahhoward Mar 15, 2021

hannahhoward left a comment •

edited

hannahhoward Mar 15, 2021

hannahhoward Mar 15, 2021

hannahhoward Mar 15, 2021

hannahhoward Mar 15, 2021

hannahhoward Mar 15, 2021

hannahhoward Mar 15, 2021

BigLep commented Mar 24, 2021

willscott commented Mar 25, 2021

mikeal commented Mar 25, 2021 •

edited

willscott commented Mar 25, 2021 •

edited

rvagg commented Mar 25, 2021


		#### Future opportunities

		moves us a step closer to ipfs and filecoin nodes being able to exchange data


		#### Impact

		High. we invision cases where we want IPFS and filecoin nodes to be able to communicate

initial proposal for graphsync support in ipfs #78

Are you sure you want to change the base?

initial proposal for graphsync support in ipfs #78

Conversation

willscott commented Mar 12, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hannahhoward left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BigLep commented Mar 24, 2021

willscott commented Mar 25, 2021

mikeal commented Mar 25, 2021 • edited

willscott commented Mar 25, 2021 • edited

rvagg commented Mar 25, 2021

hannahhoward left a comment •

edited

mikeal commented Mar 25, 2021 •

edited

willscott commented Mar 25, 2021 •

edited