
ipfs get doesn't automatically restart the download of a CID #7301

Open
vrde opened this issue May 10, 2020 · 15 comments

@vrde

vrde commented May 10, 2020

Version information:

go-ipfs version: 0.5.1
Repo version: 9
System version: amd64/linux
Golang version: go1.13.10

Description:

ipfs get gets stuck if the peer serving the content drops off the network and reconnects.

In this example node_0 is my laptop (behind NAT), node_1 is my VPS with a public IP.

  • node_0: add a file and get the CID.
  • node_1: run ipfs get <CID>.
  • ...wait for the download to start...
  • node_0: internet goes down.
  • node_0: internet goes up again.
  • node_1 download is now stuck.

Note that node_0 is the only peer that can serve the CID.
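
For reference, the steps map roughly to the commands below. The file name and network-interface name are placeholders, and any other way of cutting node_0's connectivity should reproduce it just as well:

# node_0 (laptop behind NAT): add a file and note the CID it prints
ipfs add ./example-file

# node_1 (VPS): start the download
ipfs get <CID>

# node_0, mid-transfer: drop and restore connectivity
# ("wlan0" is a placeholder interface name)
sudo ip link set wlan0 down
sleep 30
sudo ip link set wlan0 up

# node_1: the download is now stuck and never resumes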

@vrde vrde added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels May 10, 2020
@rpodgorny

Maybe similar to, or a dupe of, #7211?

@Stebalien
Member

How long is the download stuck? Is it possible to dial node_0 from a new node? If you restart node_1, does it work? If you explicitly re-connect node_1 to node_0, does it work?

I'm wondering if node_1 has stale addresses (ports, in this case) for node_0 and keeps trying the wrong ones.

@vrde
Author

vrde commented May 22, 2020

How long is the download stuck?

I don't remember exactly how long, but it was a long time, maybe 10–20 minutes. I can give it another shot, of course.

Is it possible to dial node_0 from a new node?

Which command should I use for that?

If you restart node_1, does it work?

I remember I was able to get it to work again by stopping the command (Ctrl+C) and running it again.

If you explicitly re-connect node_1 to node_0 does it work?

Which command should I use for that?

@Stebalien
Member

Is it possible to dial node_0 from a new node?
Which command should I use for that?

ipfs swarm connect /p2p/$PEER_ID_OF_OTHER_NODE

I was trying to determine whether your node was dialable. However, if stopping and restarting the command fixes the problem, it probably is dialable and this isn't the problem.
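
For completeness: if dialing by peer ID alone fails because no addresses for the peer are known, ipfs swarm connect also accepts a full multiaddr (the IP and port below are placeholders):

ipfs swarm connect /ip4/203.0.113.5/tcp/4001/p2p/$PEER_ID_OF_OTHER_NODE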

I remember I was able to get it work again by stopping the command (ctrl+c) and running it again.

That points to an issue in bitswap itself. Could you run:

On node_1:

> ipfs bitswap wantlist
> ipfs swarm peers --streams | grep -A5 $node_0

And the following on node_0:

> ipfs bitswap wantlist -p $node_1
> ipfs swarm peers --streams | grep -A5 $node_1

That will tell me:

  1. What node_1 is asking for.
  2. What node_0 thinks node_1 wants.
  3. Whether the peers are connected and what streams (requests) they have open.
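
If it's easier to share, the same commands can be captured to files in one pass (the peer-ID variables and file names are just examples):

# on node_1, with $NODE_0 set to node_0's peer ID
ipfs bitswap wantlist | tee node1-wantlist.txt
ipfs swarm peers --streams | grep -A5 "$NODE_0" | tee node1-streams.txt

# on node_0, with $NODE_1 set to node_1's peer ID
ipfs bitswap wantlist -p "$NODE_1" | tee node0-wantlist-for-node1.txt
ipfs swarm peers --streams | grep -A5 "$NODE_1" | tee node0-streams.txt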

@rpodgorny

I'm not the original poster, but I'm also hit by this. My situation is a little different: node_1 is on a flaky Wi-Fi connection. But the outcome is the same...

node_0: QmYLhBZsWL2MQuNXFfn6Wr5snTTq2ys121nQ9FQoZnMLYa
node_1: Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML

on node_1:

ipfs swarm peers --streams | grep -A5 QmYLhBZsWL2MQuNXFfn6Wr5snTTq2ys121nQ9FQoZnMLYa
/ip6/2a01:9420:9:301:e11e:2ca4:ea5c:5065/tcp/4001/p2p/QmYLhBZsWL2MQuNXFfn6Wr5snTTq2ys121nQ9FQoZnMLYa
  /ipfs/bitswap/1.2.0
  /ipfs/lan/kad/1.0.0
ipfs bitswap wantlist
QmYhK5uZSrrxrVcpdZyaNTwNBUeXszy4DgWTDENypghMmm
QmcWQMg3MqRx1j5QXrq4JfyZ53NGMGt9M79meH7xUMdrke
QmdMojSFbivBaKVeGhdxvjAzhdWDAiDFq6fNKMH2XFQgWA
Qmf97jbWtmn2ZRAU6aXovYQZiMdCP5A9YFm9SH1igZM94E

on node_0:

ipfs swarm peers --streams | grep -A5 Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML
/ip6/2002:2e24:2741:250::2/tcp/4001/p2p/Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML
  /ipfs/bitswap/1.2.0
ipfs bitswap wantlist -p Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML
<NOTHING>

Sometimes I see this on node_0:

ipfs swarm peers --streams | grep -A5 Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML
/ip6/2002:2e24:2741:250::2/tcp/4001/p2p/Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML
  /ipfs/bitswap/1.2.0
  /ipfs/bitswap/1.2.0
  /ipfs/bitswap/1.2.0
  /ipfs/bitswap/1.2.0
  /ipfs/bitswap/1.2.0

...for me the situation is so bad that I have been unable to get anything from "bitswap wantlist -p" on node_0, even after restarting "ipfs get" and even restarting "ipfs daemon" on node_1 :-(

@rpodgorny

Also:

ipfs version --all
go-ipfs version: 0.5.1-8431e2e87
Repo version: 9
System version: amd64/linux
Golang version: go1.14.2

@rpodgorny

Is there a way to force a "wantlist resend" on node_1 (to node_0)?

@Stebalien
Member

is there a way to force a "wantlist resend" on node_1 (to node_0)?

We send a full wantlist every 30s. However, we don't send every wantlist to every peer, only to peers that we think have the content. But when the download stalls, we should broadcast to all connected peers, so something fishy is going on here.
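
A rough way to watch for this from the outside (just a diagnostic loop, not official tooling) is to poll node_0's view of node_1's wants at the same ~30s cadence and see whether the rebroadcast ever lands ($NODE_1_PEER_ID is a placeholder):

while true; do
  date
  ipfs bitswap wantlist -p "$NODE_1_PEER_ID"
  sleep 30
done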

/ipfs/bitswap/1.2.0
/ipfs/bitswap/1.2.0
/ipfs/bitswap/1.2.0
/ipfs/bitswap/1.2.0
/ipfs/bitswap/1.2.0

That can happen when sending/receiving blocks.


Could you try the new RC (ipfs-update install v0.6.0-rc1)? We had a potential runaway timeout that could maybe cause this issue. I think this is a new bug, but it would be good to confirm that.

@dirkmc thoughts?

@rpodgorny

Same behaviour with 0.6.0-rc1 (compiled from git): I can see some streams on both nodes, but the wantlist does not seem to be getting transferred. :-(

@rpodgorny

Also, this may be somehow connected to a "provide" feature/bug. Currently on node_1's wantlist: QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU

node_0:

ipfs cat QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU | md5sum
bbfe0cc17f6a01caf8434c858e0caa34  -

...so the block is there. But:

on both node_0 and node_1:

ipfs dht findprovs QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU
<NOTHING>

after almost two minutes.

So it's almost as if node_0 is not properly announcing its data; combined with "we don't send every wantlist to every peer, only to peers that we think have the content", this may lead to the download getting stuck.

@rpodgorny

rpodgorny commented May 27, 2020

After a forceful:

ipfs dht provide QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU

on node_0 I now correctly see:

ipfs dht findprovs QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU
QmYLhBZsWL2MQuNXFfn6Wr5snTTq2ys121nQ9FQoZnMLYa

on both node_0 and node_1. But even though there's still a connection between the nodes, QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU stays in node_1's wantlist and never appears in node_0's wantlist listing for node_1. :-(

...even after running the findprovs. I'd expect that manually finding the provider on node_1 (which returns node_0, and given that we have a connection to that node) would immediately resolve the situation.

@Stebalien
Member

so it's almost like node_0 is not properly announcing its data and combined with "we don't send every wantlist to every peer, only to peers that we think have the content" this may lead to this download-being-stuck situation.

Sorry, we don't send the wantlist to all peers until we time out. Then, we broadcast to all peers. That's why this is really strange.

@rpodgorny

Is there anything I can do to further investigate the issue?

@dirkmc
Contributor

dirkmc commented May 28, 2020

@rpodgorny & @vrde, thanks for taking the time to follow up on this issue; it really helps us a lot to receive these bug reports.

I've been trying to reproduce this locally with some unit tests inside the go-bitswap project, but I haven't been able to so far. It may help us understand where the problem is if you capture some debug logging while performing the ipfs get, on both the requesting node and the responding node.

To output debug-level logging in the daemon window for a particular subsystem, run the ipfs log level command from another window, e.g.:
ipfs log level bs:sess debug

The subsystems that may output useful information (a snippet that turns them all on at once follows this list):

  • bs:sess
    Bitswap sessions: The session keeps track of all the outgoing requests for a particular file. The debug output should tell us when the session broadcasts a request (meaning it can't find a peer that has the blocks it wants) and when the session receives blocks that it had asked for, e.g.:
    2020-05-28T13:00:53.812-0400 DEBUG bs:sess session/session.go:218 Bitswap <- block {"local": "13bZk1tR4", "from": "13a5CQiao", "cid": "QmRFfdpSr7hvPYUNzPcKgvtQok95XGGnUKopTBy7LPa5FN", "session": 1}
  • bs:sprmgr
    Bitswap session peer manager: Tells us when a peer is added to / removed from the session
  • engine
    The Bitswap "engine": The component of Bitswap that receives incoming requests for blocks. It should tell us when someone asks us for blocks, and whether we have those blocks.
  • bitswap
    General bitswap log output: This is quite verbose, so if you turn it on you may need to grep through it for the things you are looking for. The most useful lines here are probably the ones from messagequeue.go. They tell us the want messages that get sent from the requesting node, e.g.:
    2020-05-28T13:00:53.811-0400 DEBUG bitswap messagequeue/messagequeue.go:638 sent message {"type": "WANT_BLOCK", "cid": "QmNTPrxjHgastHFGJa3eJ8q6dqVaM4tJ3bFuVGhCz1E114", "local": "13bZk1tR4", "to": "13a5CQiao"}
  • bitswap_network
    Bitswap networking operations: Very verbose
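
To turn all of the listed subsystems on at once, a small shell loop over the names works (same ipfs log level command as above):

for sub in bs:sess bs:sprmgr engine bitswap bitswap_network; do
  ipfs log level "$sub" debug
done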

@RubenKelevra
Contributor

@rpodgorny wrote:

Maybe similar to, or a dupe of, #7211?

I don't think so. #7211 is more about the loss of DHT functionality after a disconnect, where it isn't properly restored/detected.

I'm currently checking it against master, and it seems to be gone. :)

That's why I haven't updated that ticket in some time.

@Stebalien Stebalien added effort/days Estimated to take multiple days, but less than a week exp/expert Having worked on the specific codebase is important need/analysis Needs further analysis before proceeding P1 High: Likely tackled by core team if no one steps up topic/bitswap Topic bitswap and removed need/triage Needs initial labeling and prioritization labels Mar 31, 2021