Datastore benchmarks #4870
Comments
I think the time to complete some common operations would be interesting for repos with varying sizes, pin counts, and varying amounts of unpinned content:
If memory utilization comparisons can also be made I think it would be useful. I've seen some really high memory usage from the IPFS daemon in certain cases when working with large 300+GB repos which I attributed to badger, but I haven't gone back to test with flatfs to see if it actually was badger or just having a large repo (or something else). |
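As a rough illustration of how a memory comparison could be scripted, here is a minimal Go sketch. It is an assumption-laden stand-in: it only measures the live Go heap of the current process via `runtime.ReadMemStats`, not the resident memory of a running IPFS daemon, and `heapGrowth`, `fillCache`, and `demoHeapGrowth` are hypothetical helpers invented for this example. A real comparison would swap `fillCache` for opening and exercising the datastore under test.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapGrowth runs workload and reports how many bytes of live heap it left
// behind. The workload returns whatever it wants kept alive, so the second
// measurement still sees that memory as reachable.
func heapGrowth(workload func() interface{}) int64 {
	var before, after runtime.MemStats

	runtime.GC() // drop garbage so the baseline only counts live objects
	runtime.ReadMemStats(&before)

	retained := workload()

	runtime.GC()
	runtime.ReadMemStats(&after)
	runtime.KeepAlive(retained) // keep the result reachable until measured

	return int64(after.HeapAlloc) - int64(before.HeapAlloc)
}

// fillCache is a hypothetical stand-in for a datastore's in-memory state:
// it holds `entries` values of `size` bytes each.
func fillCache(entries, size int) map[int][]byte {
	cache := make(map[int][]byte, entries)
	for i := 0; i < entries; i++ {
		cache[i] = make([]byte, size)
	}
	return cache
}

// demoHeapGrowth measures the heap retained by a 1024 x 4 KiB cache.
func demoHeapGrowth() int64 {
	return heapGrowth(func() interface{} { return fillCache(1024, 4096) })
}

func main() {
	fmt.Printf("live heap grew by about %d KiB\n", demoHeapGrowth()/1024)
}
```

Heap numbers from the Go runtime exclude mmap-ed files, so for an mmap-heavy store this would likely understate the process footprint.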
@schomatis see also https://github.com/Kubuxu/go-ds-bench It is a bit old but should work after a few fixes. |
See dgraph-io/badger#446 for a discussion of search key performance in the IPFS case. |
@leerspace : Btw, Badger's memory usage can be reduced via options, for example by mmap-ing the LSM tree instead of loading it into RAM, or by keeping the value log on disk instead of mmap-ing it. |
Hello |
@schomatis I created a PR with a slowpoke test: https://github.com/schomatis/datastore_benchmarks/pull/1 Quick summary Badger: Slowpoke looks a little slower than badger, but not dramatically |
@recoilme how does slowpoke scale in the few-TB to PB range? Btw, I'd put this discussion in a separate issue as it's kind of off-topic here. |
Ok, @magik6k, let me link to it please. In general, slowpoke may be a little slower than badger on synthetic benchmarks, but it scales better on big databases. Slowpoke is a proxy to the filesystem, like flatfs + indexes + memory management |
@recoilme please also test "repo gc" when, say, 99% of the repo is not pinned. Badger seems to be an order of magnitude (at least) slower than flatfs (i.e. slowpoke), but this needs verification. |
@kevina Could you provide a simple example of a GC operation that would take an order of magnitude more than |
@schomatis on an empty repo do a
This could take a very long time; you may need to use #4979. Then do an "ipfs repo gc". Note: If you use I have not done the former test yet. I am guessing you should see the same problem with any repo that contains lots of (over 100k) small objects with very few of the objects pinned. Also see #4908. |
@schomatis here is a script to reproduce the problem
When TMP pointed to "/tmp", which uses 'tmpfs', the 'repo gc' was fine. When TMP pointed to a non-memory file system, it was very slow. I'll try to let it complete and report the results. |
Okay. I gave up and killed "repo gc" when badgerds is used. Here are the results: So Badgerds is at least 30 times slower. |
Great work @kevina! Thanks for the test script. I'll try to reproduce it on my end and see if I can pinpoint the bottleneck on the Badger implementation. The GC is supposed to be slower in Badger due to the added complexity of checking for the deletion marks in the value log, but an order of magnitude slower (or more) would be too much to ask the user to bear during this transition. |
Let me see if I can run this. I'm doing a whole bunch of improvements related to GC, and versioning. I feel those should fix this up nicely. What version of Badger are you guys on? Update: When I run the script by @kevina above, it fails with
My |
Hi @manishrjain, thanks for stepping in here. The master branch of IPFS is using v1.3.0. I'm not sure what version @kevina used for the tests, but feel free to assume we're using the latest version in Badger's master branch; Badger is still not the default datastore, so we have some latitude here. To use the latest version of Badger inside IPFS you can run the following commands:
Sorry for the convoluted commands, the current tool to work with development packages in |
@schomatis I am using master at |
@manishrjain The
|
Great, so that should indeed be Badger's last stable version |
I pulled in Badger master head, not sure if it makes any difference here. I set Another thing I noticed is that each "removed Q.." is run as a separate transaction, serially. If deletions can be batched up in one transaction, that'd speed up things a lot. Though, looking at the code, it might be easier to run On my laptop, with $TMP set to $HOME/test/tmp:
|
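To illustrate the batching point above, here is a minimal Go sketch. It assumes a hypothetical in-memory `kvStore` (this is not Badger's API) whose `commits` counter stands in for the per-transaction cost; `deleteOneByOne`, `deleteBatched`, and `makeKeys` are names made up for this example.

```go
package main

import "fmt"

// kvStore is a hypothetical stand-in for a transactional datastore.
// It only counts commits so the difference in transaction count is visible.
type kvStore struct {
	data    map[string][]byte
	commits int // transactions committed; each would imply its own overhead
}

func newKVStore(keys []string) *kvStore {
	s := &kvStore{data: make(map[string][]byte, len(keys))}
	for _, k := range keys {
		s.data[k] = []byte("block")
	}
	return s
}

// deleteOneByOne commits one transaction per key, serially.
func (s *kvStore) deleteOneByOne(keys []string) {
	for _, k := range keys {
		delete(s.data, k)
		s.commits++
	}
}

// deleteBatched groups up to batchSize deletions into each transaction.
func (s *kvStore) deleteBatched(keys []string, batchSize int) {
	if batchSize < 1 {
		batchSize = 1 // avoid an endless loop on a non-positive batch size
	}
	for start := 0; start < len(keys); start += batchSize {
		end := start + batchSize
		if end > len(keys) {
			end = len(keys)
		}
		for _, k := range keys[start:end] {
			delete(s.data, k)
		}
		s.commits++
	}
}

// makeKeys generates n placeholder keys.
func makeKeys(n int) []string {
	keys := make([]string, n)
	for i := range keys {
		keys[i] = fmt.Sprintf("Qm%06d", i)
	}
	return keys
}

func main() {
	keys := makeKeys(1000)

	serial := newKVStore(keys)
	serial.deleteOneByOne(keys)

	batched := newKVStore(keys)
	batched.deleteBatched(keys, 100)

	fmt.Println(serial.commits, batched.commits) // prints "1000 10"
}
```

Both variants remove the same keys; only the number of commits differs, which is where the expected speedup would come from if commit cost dominates.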
I can confirm that setting Thanks a lot @manishrjain!! This is a big win, please let me know if I can be of help with the ongoing GC improvements of dgraph-io/badger#454. |
@schomatis the idea behind adding a small file with very small chunks was to approximate how a very large shared directory will be stored. |
@kevina I see, my comment above is not about the file size (or chunk size) itself, but rather that it is necessary to surpass |
Sorry, just my 5 cents about nosync: Setting SyncWrites to false == corrupted database. Async writes to file == may lead to corrupted data too. Slowpoke doesn't have a "nosync" option but has batch write (sets method) with an fsync at the end. It works like a transaction. It also has DeleteFile and Close methods. It may store unpinned/pinned items separately, so you can close data that is not needed (freeing its keys from memory) and delete all files with unneeded data quickly and safely. |
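The "batch write with fsync at the end" idea can be sketched in Go using only the standard library. This is an illustration under assumptions, not slowpoke's or Badger's implementation: `writeRecords` and `demoSyncCounts` are hypothetical helpers, and the file names are arbitrary.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeRecords appends records to path and returns how many fsync calls it
// made. With syncEach it syncs after every record; otherwise it syncs once
// after the whole batch, so the batch becomes durable at a single point.
func writeRecords(path string, records [][]byte, syncEach bool) (int, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
	if err != nil {
		return 0, err
	}
	defer f.Close() // close error ignored for brevity in this sketch

	syncs := 0
	for _, r := range records {
		if _, err := f.Write(r); err != nil {
			return syncs, err
		}
		if syncEach {
			if err := f.Sync(); err != nil {
				return syncs, err
			}
			syncs++
		}
	}
	if !syncEach {
		if err := f.Sync(); err != nil {
			return syncs, err
		}
		syncs++
	}
	return syncs, nil
}

// demoSyncCounts writes the same 100 records both ways in a temp directory
// and returns the fsync count for each strategy.
func demoSyncCounts() (perWrite int, batched int, err error) {
	dir, err := os.MkdirTemp("", "syncdemo")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	records := make([][]byte, 100)
	for i := range records {
		records[i] = []byte(fmt.Sprintf("record-%03d\n", i))
	}
	perWrite, err = writeRecords(filepath.Join(dir, "per-write.log"), records, true)
	if err != nil {
		return 0, 0, err
	}
	batched, err = writeRecords(filepath.Join(dir, "batched.log"), records, false)
	if err != nil {
		return 0, 0, err
	}
	return perWrite, batched, nil
}

func main() {
	perWrite, batched, err := demoSyncCounts()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println(perWrite, batched) // prints "100 1"
}
```

Skipping the sync entirely is what trades durability for speed; syncing once per batch keeps a durability point while paying the fsync cost only once.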
Thanks for the offer, @schomatis . I could definitely use your help on various issues. That PR you referenced is part of a v2.0 launch. Want to chat over email? Mine is my first name at dgraph.io. |
@recoilme That's a good point on syncing I/O (I'll have that in mind for the Badger tests), also thanks for submitting the PR with the slowpoke benchmarks. Please note that this issue concerns the Badger benchmarks as a datastore for IPFS, if you would like to add slowpoke as an experimental datastore for IPFS I'm all for it but please open a new issue to discuss that in detail. Regarding the benchmark results, I would like to point out that although performance is the main motivation to transitioning to Badger there are also other aspects to choosing a DB that in my opinion are also important to consider, such as how many developers are actively working on it, how many users in the community are using/testing it, who else has adopted it as its default DB, what documentation does the project have, what's the adaptive capacity of the system (to the IPFS use case). I think all of those (and I'm missing many more) are important aspects to consider besides the fact that a benchmark might suggest that one DB outperforms another by ~5% in some particular store/search scenario. |
@schomatis Thanks for the detailed answer. It seems to me that badger is an excellent choice. But I would be happy to add slowpoke as an experimental datastore for research. I just want to solve some problems specific to IPFS for my storage because it's interesting. I'll implement the datastore interface after the vacation and open an issue for discussion if you like. |
@recoilme Great! |
Is anyone actively working on this? |
Hey @ajbouh, I am (sort of, I've been distracted with other issues), but any help is more than welcome :) If you were just asking to use Badger as your datastore you can enable it as an experimental feature. |
I'm trying to make IPFS performance benchmarks real partially in service of:
And partly to help support some of the awesome efforts already underway like: |
IPFS + TensorFlow 😍 So, these were just some informal benchmarks to get an idea of the already known benefits of using Badger over a flat architecture, and as they stand they are just an incomplete skeleton. As mentioned by @Stebalien in #4279 (comment) the priority right now is on making Badger as stable as possible, so benchmarking performance is not really at the top of the list at the moment. That being said, if you want to continue with this work feel free to write me and we could coordinate a plan forward so I could guide you through it. |
Covered by #6523. |
Considering that the default datastore will be transitioning from flatfs to badger soon (#4279), it would be useful to have an approximate idea of the performance gains (which are considerable) and also the losses (e.g., #4298).
I'll be working in a separate repository (it can be integrated here later) to develop some profiling statistics about it; this will also help in having a better understanding of Badger's internals (at least that's my expectation).
I'm open to (and in need of) suggestions for use cases to test (random reads, rewriting/deleting the same value several times, GC'ing while reading/writing, etc).
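As a starting point for two of the use cases listed above (random reads and rewriting/deleting the same value several times), a workload harness could look roughly like the Go sketch below. Everything here is an assumption for illustration: `Store` is a hypothetical minimal interface (not the actual go-datastore interface), `mapStore` is an in-memory placeholder for the backend under test, and the key names and 1 KiB value size are arbitrary.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Store is a hypothetical minimal key-value interface for the harness.
type Store interface {
	Put(key string, value []byte) error
	Get(key string) ([]byte, error)
	Delete(key string) error
}

// mapStore is an in-memory placeholder for the datastore being measured.
type mapStore struct{ m map[string][]byte }

func newMapStore() *mapStore { return &mapStore{m: make(map[string][]byte)} }

func (s *mapStore) Put(k string, v []byte) error { s.m[k] = v; return nil }

func (s *mapStore) Get(k string) ([]byte, error) {
	v, ok := s.m[k]
	if !ok {
		return nil, fmt.Errorf("key %q not found", k)
	}
	return v, nil
}

func (s *mapStore) Delete(k string) error { delete(s.m, k); return nil }

// randomReads performs n reads of uniformly chosen existing keys.
func randomReads(s Store, keys []string, n int, rng *rand.Rand) error {
	for i := 0; i < n; i++ {
		if _, err := s.Get(keys[rng.Intn(len(keys))]); err != nil {
			return err
		}
	}
	return nil
}

// rewriteAndDelete writes and then deletes the same key `rounds` times.
func rewriteAndDelete(s Store, key string, rounds int) error {
	value := make([]byte, 1024)
	for i := 0; i < rounds; i++ {
		if err := s.Put(key, value); err != nil {
			return err
		}
		if err := s.Delete(key); err != nil {
			return err
		}
	}
	return nil
}

// runWorkloads populates the store with nKeys entries, then times each workload.
func runWorkloads(s Store, nKeys int) (map[string]time.Duration, error) {
	if nKeys <= 0 {
		return nil, fmt.Errorf("nKeys must be positive, got %d", nKeys)
	}
	keys := make([]string, nKeys)
	for i := range keys {
		keys[i] = fmt.Sprintf("key-%d", i)
		if err := s.Put(keys[i], make([]byte, 1024)); err != nil {
			return nil, err
		}
	}
	rng := rand.New(rand.NewSource(1)) // fixed seed for repeatable access order
	results := make(map[string]time.Duration)

	start := time.Now()
	if err := randomReads(s, keys, 10*nKeys, rng); err != nil {
		return nil, err
	}
	results["random-reads"] = time.Since(start)

	start = time.Now()
	if err := rewriteAndDelete(s, "scratch", nKeys); err != nil {
		return nil, err
	}
	results["rewrite-delete"] = time.Since(start)
	return results, nil
}

func main() {
	results, err := runWorkloads(newMapStore(), 10000)
	if err != nil {
		fmt.Println("benchmark failed:", err)
		return
	}
	for name, d := range results {
		fmt.Printf("%-15s %v\n", name, d)
	}
}
```

Keeping the workloads behind a small interface means the same functions could be pointed at flatfs-backed and Badger-backed implementations for a side-by-side comparison.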