
Kubo doesn't connect to peers with requested CID, can't finish transfer #9649

Open · 3 tasks done
Rashkae2 opened this issue Feb 15, 2023 · 4 comments
Labels: effort/days (Estimated to take multiple days, but less than a week), kind/feature (A new feature), P2 (Medium: Good to have, but can wait until someone steps up)


Checklist

  • My issue is specific & actionable.
  • I am not suggesting a protocol enhancement.
  • I have searched on the issue tracker for my issue.

Description

Assumptions (at my own risk): when a node has a large amount of data (over 30 GB), reproviding all CIDs to the DHT is not really feasible without using the Accelerated DHT. The Accelerated DHT works well, but brings its own issues that might make it unsuitable in some scenarios.

Alternatives to help make data available are changing the Reprovider strategy to Roots (which is what I tested with), or using a script to publish your root CIDs periodically.
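For reference, the Roots strategy is selected in the Kubo config; a minimal fragment (the interval shown here is just the default, included for context):

"Reprovider": {
  "Interval": "12h",
  "Strategy": "roots"
}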

It is easy to create a situation where an IPFS client (trying to download) is unable to find the peer with the data, even though the requested CID is on the DHT. (To replicate, I can start a transfer, then stop and restart both daemons so they are no longer peered.)

On the requesting node, run ipfs get x.

The node did not need to connect to the hosting node, because it was able to find x somewhere else: either in its own cache, or from another peer in the swarm. However, it is unable to find a subsequent CID, and no data transfer happens.

Run ipfs bitswap stat, and you can see a different CID in the wantlist.

Sometimes the transfer will eventually start. Very often, it will not. (If the get request includes a path, e.g. CID/SomeDirectory, it will almost always time out.)

I can fix this myself, however. If I run ipfs dht findprovs x, it will list the peer IDs that have published that CID to the DHT, including the node where I'm hosting the data. I can then run ipfs swarm connect /p2p/<peerid> and the data transfer will immediately start.

My proposal is that IPFS should be doing this by itself. If the client is unable to find the wanted CID among its peers and has to query the DHT, it should always also search for, and connect to, peers providing the original CID that was in the get command (as well as the wantlist).

Rashkae2 added the kind/feature (A new feature) label on Feb 15, 2023
Jorropo (Contributor) commented Feb 15, 2023

@Rashkae2 that's an awesome analysis; you are completely correct.
I already know about this bug (GitHub search sucks and I was not able to find the issue).
Sadly, fixing this in bitswap is either a huge or a really ugly refactor, because bitswap doesn't think about the links between blocks. The simplest thing would be to add some "Root" field to session CIDs, to which we would fall back if the DHT search for the leaf CIDs yields no result. But that's ugly, because we are now tying sessions to some graph, instead of keeping them a loose "those blocks are related" metric.

My new RAPIDE client has an internal representation that knows about the DAG, and thus it will be trivial to implement backtracking of searches when I add support for routers (currently RAPIDE is in alpha and lacks many critical features like bitswap or DHT support).

Rashkae2 (Author) commented Feb 15, 2023

Does it need to be fixed in bitswap, though? If that's too ugly, maybe it should just happen automatically as soon as a request is input (i.e., a thread unrelated to bitswap automatically searches for and connects to peers who are advertising the CID). In theory, I can just write my own wrapper for the ipfs get operation, but my real goal would be for this to work from the HTTP gateway.
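As a rough sketch of that wrapper idea (not Kubo code; it just shells out to the same ipfs CLI commands as the workaround above, and the program name is made up):

// getwithroot: hypothetical wrapper around `ipfs get` that first connects
// to the DHT providers of the root CID, so bitswap can fetch the leaves.
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"log"
	"os"
	"os/exec"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: getwithroot <root-cid>")
	}
	root := os.Args[1]

	// 1. Ask the DHT who provides the root CID (one peer ID per line).
	out, err := exec.Command("ipfs", "dht", "findprovs", root).Output()
	if err != nil {
		log.Fatalf("findprovs failed: %v", err)
	}

	// 2. Explicitly connect to each provider before fetching.
	scanner := bufio.NewScanner(bytes.NewReader(out))
	for scanner.Scan() {
		peerID := scanner.Text()
		if peerID == "" {
			continue
		}
		// Failing to reach one provider is fine; keep trying the rest.
		if err := exec.Command("ipfs", "swarm", "connect", "/p2p/"+peerID).Run(); err != nil {
			fmt.Fprintf(os.Stderr, "could not connect to %s: %v\n", peerID, err)
		}
	}

	// 3. With the providers already in the swarm, the transfer starts
	// immediately instead of stalling on leaf-CID provider lookups.
	get := exec.Command("ipfs", "get", root)
	get.Stdout = os.Stdout
	get.Stderr = os.Stderr
	if err := get.Run(); err != nil {
		log.Fatalf("ipfs get failed: %v", err)
	}
}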

Jorropo (Contributor) commented Feb 15, 2023

@Rashkae2 duplicating the code that does content discovery is also bad.

Given that I'm working on a replacement for the go-bitswap client, I don't really want to spend more time on go-bitswap.

If you want to send a pull request, I'll review it. Here is the way I would go about it (if you find a better way, feel free to try):

// 1. https://pkg.go.dev/github.com/ipfs/go-libipfs/bitswap/client/internal/session#Session
// add a new private root cid.Cid field to the bitswap session

// 2. https://pkg.go.dev/github.com/ipfs/go-libipfs@v0.5.0/bitswap/client/internal/providerquerymanager
// find a way for the provider query manager to access the root object from the session, and if the first search (the one searching the leaves) times out, and if the root field is set (cid.Cid.Defined allows the check), restart a new search on the root

// 3. https://pkg.go.dev/github.com/ipfs/go-libipfs/bitswap/client#Client
// Add a new method which accepts a cid as a root (internally this returns the same Session object that NewSession returns, but with the private root field set)
func (*Client) NewSessionWithRoot(ctx context.Context, root cid.Cid) exchange.Fetcher

// 4. https://pkg.go.dev/github.com/ipfs/go-ipfs-exchange-interface#Interface
// Add the same NewSessionWithRoot to the SessionExchange interface
type SessionExchange interface {
	Interface
	NewSession(context.Context) Fetcher
	NewSessionWithRoot(ctx context.Context, root cid.Cid) Fetcher
}

// 5. https://pkg.go.dev/github.com/ipfs/go-blockservice#NewSession
// Implement NewSessionWithRoot on the blockservice (it doesn't need to retain the root, just forward it to the exchange (here bitswap)).
func NewSessionWithRoot(ctx context.Context, bs BlockService, root cid.Cid) Fetcher

// 6. Edit the gateway code to use NewSessionWithRoot instead of NewSession where it makes sense.
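
To make step 2 concrete, a rough sketch of the fallback logic; Session, providerFinder, and root below are illustrative stand-ins, not the real go-libipfs internals:

package session

import (
	"context"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p/core/peer"
)

// Session is a stand-in for the real bitswap session; only the pieces
// relevant to the proposed fallback are shown.
type Session struct {
	// root is cid.Undef unless the session was created via the
	// proposed NewSessionWithRoot.
	root           cid.Cid
	providerFinder func(context.Context, cid.Cid) []peer.ID
}

// findProvidersWithRootFallback searches for providers of the leaf block
// first, as today, and only retries with the root CID when the leaf
// search comes up empty and a root was supplied.
func (s *Session) findProvidersWithRootFallback(ctx context.Context, leaf cid.Cid) []peer.ID {
	if provs := s.providerFinder(ctx, leaf); len(provs) > 0 {
		return provs
	}
	// Whoever advertises the root very likely holds the leaves too.
	if s.root.Defined() {
		return s.providerFinder(ctx, s.root)
	}
	return nil
}

The key point is the ordering: the root lookup only runs after the leaf lookup has come up empty, so the common case pays no extra DHT traffic.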

Rashkae2 (Author) commented Feb 15, 2023

I'm afraid you've gone above my pay grade. All I can say is that I've been periodically testing IPFS since go-ipfs 0.10, and I always end up at this same place before chucking the VMs. Drilling down to exactly where the process breaks down is the best I can offer.

aschmahmann added the P2 (Medium: Good to have, but can wait until someone steps up) and effort/days (Estimated to take multiple days, but less than a week) labels on May 8, 2023