Stall when fetching locally available content #6386

Open
Stebalien opened this issue May 29, 2019 · 11 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@Stebalien
Member

Stebalien commented May 29, 2019

Version information:

ipfs version 0.4.21-rc3

Description:

  • ipfs pin verify $HASH works.
  • curl $LOCAL_GATEWAY/ipfs/$HASH stalls.
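For anyone trying to reproduce this, here is a rough sketch of the gateway check with an explicit time limit, so a stall shows up as a curl timeout rather than a hung terminal. The default gateway address, the 30 second limit, and the helper names `gateway_url` and `check_gateway` are assumptions for illustration, not part of the original report.

```shell
#!/usr/bin/env bash
# Sketch only: the default gateway address and the 30s limit are assumptions.
LOCAL_GATEWAY="${LOCAL_GATEWAY:-http://127.0.0.1:8080}"

# Build the gateway URL for a hash (a trailing slash on LOCAL_GATEWAY is dropped).
gateway_url() {
  printf '%s/ipfs/%s\n' "${LOCAL_GATEWAY%/}" "$1"
}

# Fetch the hash through the gateway and discard the body.
# curl exits with status 28 when --max-time is exceeded.
check_gateway() {
  curl --silent --show-error --fail \
       --max-time "${2:-30}" \
       --output /dev/null \
       "$(gateway_url "$1")"
}

# Usage (HASH must be set by the caller):
#   check_gateway "$HASH" 30 || echo "gateway fetch failed or timed out (curl exit $?)"
```

An exit status of 28 would point at the stall described here, while other non-zero statuses suggest a different failure, such as a connection error or an HTTP error response.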

Additionally, this dataset includes ~50e6 files, sharded up into multiple sub-directories.

See logs.zip for the stack traces and a CPU profile taken while reproducing the issue.

It looks like there's some kind of live-lock (and maybe a deadlock?) in go-bitswap.

Note: Unfortunately, the stack traces are truncated due to Go's 64MiB pprof stack trace limit.


Originally reported by @obo20.

Stebalien added the kind/bug label on May 29, 2019
@magik6k
Member

magik6k commented May 30, 2019

Mutex profile might be helpful here:

  • curl -X POST -v 'localhost:5001/debug/pprof-mutex/?fraction=10'
  • curl http://localhost:5001/debug/pprof/mutex > ipfs.mutex
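The two commands above can be wrapped in a small script so the profile is captured the same way each time. This is only a sketch: the endpoints are the ones quoted above, while the `API` default, the 60 second wait between enabling sampling and downloading the profile, and the function names are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: API default, wait time, and function names are assumptions.
API="${API:-http://localhost:5001}"

# URL that sets the mutex profile sampling fraction.
mutex_fraction_url() {
  printf '%s/debug/pprof-mutex/?fraction=%s\n' "${API%/}" "$1"
}

# URL that serves the collected mutex profile.
mutex_profile_url() {
  printf '%s/debug/pprof/mutex\n' "${API%/}"
}

# capture_mutex_profile [fraction] [wait_secs] [output_file]
capture_mutex_profile() {
  local fraction="${1:-10}" wait_secs="${2:-60}" out="${3:-ipfs.mutex}"
  # Turn on mutex contention sampling.
  curl -sS -X POST "$(mutex_fraction_url "$fraction")" || return 1
  # Give the node time to hit the contention while the stall is reproduced.
  sleep "$wait_secs"
  # Download the profile.
  curl -sS -o "$out" "$(mutex_profile_url)"
}

# Usage:
#   capture_mutex_profile 10 60 ipfs.mutex
```

Enabling sampling before reproducing the stall matters, since the profile only covers contention that happens after the fraction is set. The resulting file can usually be opened with `go tool pprof`.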

@obo20

obo20 commented May 31, 2019

Here's a mutex profile as requested, along with another CPU profile and set of stack traces.

FYI: the problem resolved itself a few minutes after I captured these, so I can't be 100% sure they will help. Hopefully they do.

ipfsLogs2.zip

@obo20

obo20 commented May 31, 2019

Here's a new set of logs, captured while retrieving a different but similar dataset that triggered the same problem:

This hash was still stalling at the time of posting.

ipfsLogs3.zip

@obo20

obo20 commented May 31, 2019

It should be noted that restarting the node in question fixes this issue instantly (for a while).

@obo20

obo20 commented Jun 4, 2019

Some more information: when this bug is encountered, the node can lag or stall. We've had to resort to resetting our nodes when this happens.

@obo20

obo20 commented Jun 17, 2019

@magik6k @hannahhoward
Here's another much longer stall that one of our nodes experienced with this bug:
ipfsLogs4.zip

@magik6k
Member

magik6k commented Jun 17, 2019

This might be related to the issue in #6442. Mind updating to master to see if it got fixed?

@obo20

obo20 commented Jun 17, 2019

@magik6k this doesn't appear to have helped, unfortunately.

@obo20

obo20 commented Jun 17, 2019

Further information that may or may not be of any help:

The root directory hash seems to load fine, but once I click into one of the subdirectories, the child directories inside it won't load.

To better explain, the data looks like: Root Directory -> child directories -> grandchild directories -> files.

The gateway gets stuck after we enter one of the child directories and attempt to open one of the grandchild directories.
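To pin down which level hangs, each level can be requested in turn with a time limit. The sketch below assumes a local gateway address, a 30 second limit, and placeholder directory names; `level_paths` and `probe_levels` are hypothetical helpers, not existing tools.

```shell
#!/usr/bin/env bash
# Sketch only: gateway address, timeout, and directory names are assumptions.
LOCAL_GATEWAY="${LOCAL_GATEWAY:-http://127.0.0.1:8080}"

# Print the cumulative path for each level, one per line:
#   level_paths ROOT child grandchild
# prints ROOT, ROOT/child, ROOT/child/grandchild.
level_paths() {
  local path="$1" seg
  shift
  printf '%s\n' "$path"
  for seg in "$@"; do
    path="$path/$seg"
    printf '%s\n' "$path"
  done
}

# Request each level through the gateway and report which ones respond in time.
probe_levels() {
  local limit="${PROBE_TIMEOUT:-30}" p
  level_paths "$@" | while IFS= read -r p; do
    if curl -sS --fail --max-time "$limit" -o /dev/null "${LOCAL_GATEWAY%/}/ipfs/$p/"; then
      echo "ok: $p"
    else
      echo "stalled or failed: $p"
    fi
  done
}

# Usage (directory names are placeholders):
#   probe_levels "$HASH" some-child some-grandchild
```

With the behaviour described here, the root and child levels would be expected to report `ok` and the grandchild level to time out.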

@Stebalien
Member Author

Update: We believed this may have been caused by a useless task consuming all the CPU time in bitswap, but that doesn't appear to be the case. This can be reproduced on master as of the beginning of July.

@Stebalien
Member Author

@obo20 have you seen this since?

4 participants