Recursive add of large directory fails at 100% (with nocopy and fscache) #5815

Closed
dokterbob opened this issue Dec 3, 2018 · 22 comments
Labels: kind/bug (A bug in existing code, including security flaws), kind/stale, need/author-input (Needs input from the original author)

@dokterbob (Contributor)

Version information:

go-ipfs version: 0.4.18-
Repo version: 7
System version: amd64/linux
Golang version: go1.11.1

Type:

Bug

Description:

Adding a large resource (specifically the ipfs-search.com index, 390 GB) fails at 100%: the command simply blocks and never prints the overall hash for the resource. Getting to 100% takes an acceptable amount of time, after which nothing happens for at least 12 hours.

Example output:

$ ipfs add -p -w --nocopy --fscache -r ipfs-search-backup
[...]
added QmfTjAf3keCtZkKVizGGPokkWAkm3GQoLbP7iLBRJK4Y2e ipfs-search-backup/indices/53neN9SkQWublctO6iu8AQ
 390.22 GiB / 390.22 GiB [=====================================================================================================================================================] 100.00%
^C
Error: context canceled
@Stebalien (Member)

This is probably due to provider records (i.e., the process of telling the network that you have the content), unfortunately. We currently need to make one DHT request per block, which means ~1e6 DHT requests.

Tracked by: #5774
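As a rough sanity check on that number (my arithmetic, not from the original thread): at go-ipfs's default chunk size of 256 KiB, 390 GiB comes out to about 1.6 million blocks, i.e. on the order of the ~1e6 requests mentioned above:

$ echo $(( 390 * 1024 * 1024 / 256 ))
1597440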


For now, you should be able to use the --local flag to not send out provider records. Alternatively, you can use ipfs add without starting the daemon (that'll do the same thing).

Unfortunately, that does mean you won't tell the network that you have the data.
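Concretely, both workarounds look something like the following (a sketch reusing the invocation from the report; --local is the ipfs add flag Stebalien refers to):

$ ipfs add -p -w --nocopy --fscache --local -r ipfs-search-backup

or, with the daemon not running:

$ ipfs add -p -w --nocopy --fscache -r ipfs-search-backup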

@Stebalien added the kind/bug label Dec 5, 2018
@Stebalien (Member)

Actually, it may not be that. Can you post a heap and goroutine profile when this gets stuck? That is, run:

wget http://localhost:5001/debug/pprof/heap
wget http://localhost:5001/debug/pprof/goroutine?debug=2
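(Not part of the original exchange: the goroutine dump fetched with ?debug=2 is plain text and can be read directly, while the downloaded heap profile can be summarized with pprof, e.g.

$ go tool pprof -top heap

though depending on the Go version you may also need to pass the ipfs binary as the first argument.)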

@dokterbob (Contributor, Author)

@Stebalien Here you go. https://gateway.ipfs.io/ipfs/QmS79kLK2sxYVCtYNAJqwH1pZNePS8AQBNfr9AhRKkrq9a

Note that the add progresses fine until exactly 100% is reached.

I will try again with --local as well.

@dokterbob (Contributor, Author)

@Stebalien Most surprising result with --local:

 401.85 GiB / 401.85 GiB [================================================================================================================================================] 100.00%Error: merkledag: not found

@Stebalien (Member)

So, it does look like provider records are backing up. However:

> Most surprising result with --local:

That's not good. Can you run ipfs filestore verify (it may take a while)? If that doesn't show any errors, can you run ipfs repo verify (it will take even longer)?

Also, could you try ipfs add --local --pin=false?

It looks like you're missing a block that you should have.
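Put together, the suggested sequence is roughly (a sketch, carrying over the directory name from earlier in the thread):

$ ipfs filestore verify
$ ipfs repo verify
$ ipfs add -r -w --nocopy --fscache --local --pin=false ipfs-search-backup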

@dokterbob (Contributor, Author)

I've cleaned the filestore and all pins and run both verify commands, to no avail. :/

However, very much to my surprise, the resource does seem to be pinned!

(After another stall at 100%. Note that this run was without --fscache, as it was an empty repo; this is also why I'm certain that this is, in fact, the correct hash.)

$ ipfs pin ls -t recursive
QmV9b1jxgCaTVNcSHnz2Fv2C3TddC41BuQFNQezT74HsbU recursive
$ ipfs ls /ipfs/QmV9b1jxgCaTVNcSHnz2Fv2C3TddC41BuQFNQezT74HsbU
QmXvvgfYCePbh6bAWNLRrNoPMP2sFzYKR24Vn2RbbNznRj 18005369229 ipfs-search-backup/

Lastly, note that, weirdly enough, the file sizes reported by the gateway are only a fraction of the real size of the data (18 GB reported vs. 400 GB original).

I have not yet tried to download the resource as, with current IPFS performance, that would take several days. But you're very much invited to try.
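(One way to cross-check the reported size, my suggestion rather than something from the thread: ipfs object stat prints a DAG's CumulativeSize alongside its link and block sizes, so

$ ipfs object stat QmXvvgfYCePbh6bAWNLRrNoPMP2sFzYKR24Vn2RbbNznRj

would show whether the ~18 GB figure comes from the DAG itself or only from the gateway's rendering.)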

@dokterbob (Contributor, Author)

(I'm currently giving it another run with --local --pin=false.)

@dokterbob (Contributor, Author)

$ ipfs add -r -w --nocopy --local --pin=false --fscache
 411.12 GiB / 411.12 GiB [=============================================] 100.00%Error: merkledag: not found

:/

@Stebalien (Member)

Ok, this is definitely a bug in filestore.

What's the shape of the data? That is: a small directory of large files, or a large directory of small files?

@dokterbob (Contributor, Author)

It's an Elasticsearch snapshot: a couple of levels deep (~4), with lots of smaller files (bytes) and larger files (megabytes).

Example data: Qmc3RxfyZTPf7omWN1XxDkaZhp93ukfLSY14CTC8n1v5Hv (created using ipfs-pack, which somehow does seem to work)
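(For comparison, hedged, and based on the ipfs-pack README rather than anything in this thread: the equivalent ipfs-pack run happens inside the directory and writes a PackManifest file:

$ cd ipfs-search-backup
$ ipfs-pack make
)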

@Stebalien (Member)

I've just tested a large directory of small files with filestore, so I'm pretty sure it's not that. I've also tested filestore on a 200 MiB file, so it's not that either.

@dokterbob have you tried running this without nocopy? I'm wondering if you have filesystem corruption.
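That is, the same add minus the filestore flags (a sketch):

$ ipfs add -p -w -r ipfs-search-backup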

@dokterbob (Contributor, Author)

dokterbob commented Dec 10, 2018 via email

@Stebalien (Member)

> I've had this problem on two different machines, one of which runs ZFS, so there is very little chance of filesystem corruption (but I am running fsck on one of them anyway).

Also check the permissions, make sure the daemon can read all the files. Are you running the daemon as a different user?

(but it's probably a bug)

@dokterbob (Contributor, Author)

dokterbob commented Dec 11, 2018 via email

@Stebalien (Member)

Those incorrect sizes are also pretty worrying.

@dokterbob (Contributor, Author)

Yep.

Without --nocopy it's working fine, so it's definitely filestore.

@Stebalien (Member)

Do you have garbage collection enabled?

@dokterbob (Contributor, Author)

Yep. It's kind of necessary, as we pull about 1 TB a day through our server.

@dokterbob (Contributor, Author)

I could test it later this week on my home server with GC disabled (if it is enabled at all).
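For reference (an assumption about the setup, not something stated in the thread): go-ipfs only runs GC automatically when the daemon is started with --enable-gc, so testing with GC disabled amounts to starting the daemon without that flag:

$ ipfs daemon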

@Stebalien (Member)

Thanks!

@michaelavila added the need/author-input label Jun 5, 2019
@Stebalien mentioned this issue Apr 14, 2020
@github-actions (bot)

Oops, seems like we needed more information for this issue. Please comment with more details, or this issue will be closed in 7 days.

@github-actions (bot)

This issue was closed because it is missing author input.
