
IPFS pin hangs despite available providers for content #4394

Closed
hsanjuan opened this issue Nov 15, 2017 · 8 comments · Fixed by #4407
Labels
kind/bug A bug in existing code (including security flaws) topic/bitswap Topic bitswap

Comments

hsanjuan (Contributor) commented Nov 15, 2017

Version information: go-ipfs version: 0.4.13-rc1-5d23fd7 (docker:master)

Type: Bug

Severity: Medium

Description:

We have discovered that some of our kubernetes-ipfs/ipfs-cluster tests fail because of hanging ipfs pin add operations.

The problem affects at least 0.4.11, 0.4.12 and 5d23fd7 (master).

Whenever the problem is reproduced, the output of ipfs diag cmds -v looks like this:

bash-4.4# ipfs diag cmds -v
ID   Command    Arguments                                         Options                                       Active  StartTime        RunTime
459  pin/add    [QmZX5cM1TPz2sR8QsZbkVjEQ7DTk955rjdzpdk4eBxbaiW]  [enc=json,]                                   true    Nov 15 21:21:12  12m28.191261281s
573  pin/ls     [QmWwpqhTrDKb7QuSZo7ymeYEyX1qmqneu3VrUgkBvzwZhF]  [enc=json,type=recursive,]                    true    Nov 15 21:26:07  7m33.954139041s
672  repo/stat  []                                                [enc=json,]                                   false   Nov 15 21:33:23  19.66459ms
673  id         []                                                [enc=json,]                                   false   Nov 15 21:33:30  357.269µs
674  repo/stat  []                                                [enc=json,]                                   false   Nov 15 21:33:38  5.842197ms
675  diag/cmds  []                                                [encoding=json,stream-channels=true,v=true,]  true    Nov 15 21:33:41  427.623µs

Note that a pin ls --recursive request issued a few seconds after the hanging pin add also seems blocked, yet a manual pin ls request returns fine while in this situation. The given hash has providers, as found by ipfs dht findprovs (at least 2).

Aborting go-ipfs produces the following stack trace: https://ipfs.io/ipfs/QmeAZo8H5QiRNWBxw89YE8WMzJnNc46cpYrzFkMVTSqVeS

The tests which usually show this behaviour do roughly the following:

  • add 100 bytes of data to one of the ipfs daemons in the cluster
  • tell cluster to pin it everywhere (this triggers a pin add ipfs request)
  • check that the item has been pinned everywhere

The result is that the CID is not pinned, but stuck "pinning" on some ipfs daemon, despite considerable delays. The problem shows up randomly and is not 100% reproducible. The cluster size is 5, and the containers run under minikube (locally), so bad network conditions are unlikely.

Bitswap issue? Pin locking issue? Any pointers are welcome...

@hsanjuan hsanjuan added the kind/bug A bug in existing code (including security flaws) label Nov 15, 2017
Stebalien (Member) commented Nov 16, 2017

When in doubt, blame bitswap.

@Stebalien Stebalien added the topic/bitswap Topic bitswap label Nov 16, 2017
leerspace (Contributor) commented Nov 16, 2017

I thought this was just me (though I'm not using ipfs-cluster). Glad it isn't!

leerspace (Contributor) commented Nov 19, 2017

I'm not 100% sure if this is the same thing, since in this case I encountered it during a get rather than a pin add, but here is the stack info I was able to collect. If this shouldn't be part of this issue, feel free to let me know.

ipfs.stacks

Command      Active  StartTime        RunTime
get          true    Nov 18 18:42:56  3m17.0082352s
swarm/peers  false   Nov 18 18:46:04  4.0078ms
diag/cmds    false   Nov 18 18:46:06  0s
swarm/peers  false   Nov 18 18:46:07  4.0042ms
swarm/peers  false   Nov 18 18:46:10  4.0053ms
diag/cmds    true    Nov 18 18:46:13  496.1µs

Stebalien added a commit that referenced this issue Nov 21, 2017
This deadlock would happen when calling SessionsForBlock (holding
bitswap.sessLk) while the session's main loop was trying to deregister the
session (taking bitswap.sessLk).

I've also defensively added selects on contexts for two other channel writes
just in case.

fixes #4394

...well, it fixes *a* deadlock showing up in that issue, there may be more.

License: MIT
Signed-off-by: Steven Allen <steven@stebalien.com>
Stebalien (Member) commented Nov 21, 2017

I can confirm, this is (well, looks like) a bitswap deadlock on the sessLk lock.

hsanjuan (Contributor, Author) commented Nov 21, 2017

Thanks @Stebalien, happy to give it a go after it's merged (or before, however you prefer).

Stebalien (Member) commented Nov 22, 2017

@hsanjuan It's been merged.

@Stebalien Stebalien reopened this Nov 22, 2017
Stebalien (Member) commented Nov 22, 2017

I'm going to leave this open until you confirm that it has been fixed.

hsanjuan (Contributor, Author) commented Nov 22, 2017

Hey @Stebalien, thanks! I have run the tests a couple of times and things have improved significantly; the only remaining failure is unrelated, I think. I believe the issues have been fixed.
