Adding file to repo with many files is slow #3545

Open

vertis opened this issue Dec 29, 2016 · 12 comments
Labels
topic/commands:add · topic/perf

Comments


vertis commented Dec 29, 2016

Version information:

go-ipfs version: 0.4.4-
Repo version: 4
System version: amd64/linux
Golang version: go1.7

Type: Bug

Priority: P1

Description:

Adding a file to an existing repo after many files have been imported is very slow.

  • This is local, without the daemon running (though that doesn't seem to matter).
# existing repo
root@localhost ~# ipfs repo stat
NumObjects 	 189555
RepoSize 	 43563229872
RepoPath 	 /data/ipfs
Version 	 fs-repo@4
root@localhost ~# dd if=/dev/random of=test.bin count=1024
0+1024 records in
113+1 records out
58024 bytes (58 kB) copied, 0.0343963 s, 1.7 MB/s
root@localhost ~# time -p ipfs add test.bin
added QmZikmEEAK923N6n1gpADRjWn7gSn3AyUGdSJXXqJmhk8W test.bin
real 11.49
user 0.16
sys 0.16

# new repo
root@localhost ~# ipfs repo stat
NumObjects 	 83
RepoSize 	 14358417
RepoPath 	 /data/ipfs2
Version 	 fs-repo@4
root@localhost ~# time -p ipfs add test.bin
added QmZikmEEAK923N6n1gpADRjWn7gSn3AyUGdSJXXqJmhk8W test.bin
real 0.26
user 0.01
sys 0.01

I had originally thought it might just be the time spent writing the file out, but that doesn't appear to be the case: even a tiny file incurs the ~10 second penalty.

Is this expected behaviour? Because it effectively rules out using IPFS for anything large like rubygems (which is what I was trying to import).

This isn't a particularly beefy machine, but it doesn't seem to be CPU-bound or IO-bound (I'm no expert on the latter, but zpool iostat isn't showing high values).
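
A quick way to separate a fixed per-add overhead from per-byte cost is a minimal check along these lines (the file names and sizes here are illustrative, not from the run above):

dd if=/dev/urandom of=tiny.bin bs=1 count=1
dd if=/dev/urandom of=big.bin bs=1M count=50
time -p ipfs add tiny.bin
time -p ipfs add big.bin

If both adds take roughly the same wall time against the large repo, the penalty is per-add rather than per-byte.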

Author

vertis commented Dec 29, 2016

Running this locally on my MacBook (to confirm it wasn't something weird with my ZFS drives): it handles about 50k objects and then goes from roaring along at 0.05-0.06 seconds per file to 1-3 seconds per file.

NumObjects 	 53533
RepoSize 	 3234337159
RepoPath 	 /Volumes/file/temp_ipfs
Version 	 fs-repo@4

Edit: corrected the number of objects downwards
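
A sketch of the kind of loop that surfaces this per-file degradation (the file size, iteration count, and output format are illustrative):

for i in $(seq 1 100000); do
  dd if=/dev/urandom of=f.bin bs=64k count=1 2>/dev/null
  echo -n "$i "
  { time -p ipfs add -q f.bin >/dev/null; } 2>&1 | awk '/real/ {print $2}'
done

Each iteration adds a fresh random file, so the repo grows steadily; plotting the printed times against the iteration number should show where the 0.05 s adds turn into multi-second adds.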


ghost commented Dec 29, 2016

I think I've been seeing similar behaviour while adding the cdn.media.ccc.de mirror. I haven't measured it, but the repo was initially nearly empty, then the 3.8TB add got slower and slower over time, and now additional adds are just as slow.


ghost commented Dec 30, 2016

I just started another add, but this time with only-hash set (ipfs add -n -r cdn.media.ccc.de), to narrow this down to the datastore. My hope is that the add's duration stays constant.
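
In other words, something like the following comparison (the directory is the one from the comment above; -n/--only-hash is a standard ipfs add flag):

time -p ipfs add -n -r cdn.media.ccc.de   # only-hash: chunk and hash, no datastore writes
time -p ipfs add -r cdn.media.ccc.de      # full add: writes every block to the datastore

If the only-hash pass stays fast while the full add keeps degrading, the datastore is the bottleneck.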

Author

vertis commented Dec 30, 2016 via email

@ghost added the topic/commands:add and topic/perf labels on Dec 30, 2016

jdgcs commented Dec 30, 2016

Similar issue: slow handling of a large number of files.

./ipfs add became very slow when handling about 45K files (~300 GB); it took 3+ seconds after the progress bar finished.

% ./ipfs repo stat
NumObjects 9996740
RepoSize 381559053765
RepoPath /home/liu/.ipfs
Version fs-repo@4

Author

vertis commented Dec 30, 2016

@jdgcs damn, I'd say it's the same thing... I even looked to see if there was already an issue and couldn't find anything.

Contributor

mateon1 commented Dec 30, 2016

I see that you are both using the 0.4.4 release. Could you try the 0.4.5 prerelease?
Performance has improved in 0.4.5, and I do not get this slowdown on a Windows machine with an NTFS filesystem on an HDD, with a 41 GB repo containing 280318 objects.

$ time ipfs add random.bin
 1024 B / 1024 B [=========================================================================================] 100.00% 0s
added QmUXjD2mkCwCdqoN4cvcToBunKMvv7zG98GWUwer2xzvSH random.bin

real    0m0.350s
user    0m0.000s
sys     0m0.030s

repo stat:

$ ipfs repo stat
NumObjects       280318
RepoSize         41058762331
RepoPath         D:\ipfs
Version          fs-repo@4

Author

vertis commented Dec 30, 2016

@mateon1 Thanks for the reminder. I did actually try 0.4.5-dev yesterday after I reported this. It did still have the problem:

# ipfs version
ipfs version 0.4.5-dev
# dd if=/dev/random of=random.bin count=1024
1024+0 records in
1024+0 records out
524288 bytes transferred in 0.035843 secs (14627292 bytes/sec)

# du -sh random.bin 
512K	random.bin

# time -p ipfs add random.bin
added QmWhjMhVXWmEcUCWisJTEjgcQAXwxfmeUBAYxvNGKi1tUb random.bin
real         3.18
user         0.15
sys          0.12

# ipfs repo stat
NumObjects 	 59950
RepoSize 	 3312096091
RepoPath 	 /Volumes/WorldBuilder/temp_ipfs
Version 	 fs-repo@4

This is now running locally on my MacBook Pro, on a USB3 external HD (HFS+).

Interestingly, the delay comes after the progress bar completes; even the debug messages just seem to sit there.


ghost commented Jan 2, 2017

> I just started another add, but this time with only-hash set (ipfs add -n -r cdn.media.ccc.de), to narrow this down to the datastore. My hope is that the add's duration stays constant.

It took about 9h15m in the end, so it definitely is storage-related.

Contributor

rht commented Feb 1, 2017

I benchmarked this on 0.4.5-rc1.

code to reproduce: https://github.com/rht/sfpi-benchmark/tree/master/add_growing_repo (./benchmark.sh)

The slowdown is more than 2x by the time the repo size hits ~5 GB. I don't have enough storage space to quantify the ~TB scale, though.
The spikes are possibly due to IO flushes (NoSync is used).

@FortisFortuna

I experience this as well.

@Stebalien
Member

@FortisFortuna try using the experimental badger datastore. You can create a node configured to use it by initializing it with --profile=badgerds.

I'm guessing this is due to the default datastore creating tons of tiny files (which filesystems generally don't like). Badgerds doesn't create one file per block so it should perform better (also, it'll use a hell of a lot less space).
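
A minimal sketch of that setup, assuming a go-ipfs build new enough to support init profiles (the repo path here is illustrative):

export IPFS_PATH=/data/ipfs-badger   # fresh repo, separate from the existing flatfs one
ipfs init --profile=badgerds
time -p ipfs add test.bin

Badger keeps many blocks in a small number of large files instead of one file per block, which sidesteps the per-file filesystem overhead described above.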
