
Sharded directory fetching is unusably slow #4908

Closed
ajbouh opened this issue Apr 2, 2018 · 53 comments
Labels
P1 High: Likely tackled by core team if no one steps up · topic/perf Performance

Comments

@ajbouh

ajbouh commented Apr 2, 2018

Version information:

0.4.15-dev

Type:

Bug/performance issue

Description:

More context is available over in tesserai/iptf#2

I'm trying to get reasonable performance for just listing the names of entries in a sharded directory that's not yet cached locally. This operation currently takes hours. With @Stebalien's help I've been able to determine that it's only requesting one hash at a time (as indicated by ipfs bitswap wantlist).

It seems like IPFS should be requesting more than one block at a time in this scenario. I'm creating a separate issue to track this specific performance problem apart from the others.
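A rough way to observe this, assuming a stock go-ipfs daemon and substituting the directory CID of interest, is to start the listing in one shell and poll the wantlist in another:

ipfs ls --resolve-type=false <dir-CID> &
watch -n 1 ipfs bitswap wantlist

If the wantlist rarely holds more than one or two hashes at a time, fetching is effectively serial.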

child of #5487

@kevina kevina self-assigned this Apr 25, 2018
@kevina
Contributor

kevina commented Apr 25, 2018

This seems like an easy enough fix, so I will look into it. If someone beats me to it, please remove my assignment.

@Stebalien
Member

@kevina have fun 😄. Unfortunately, it's actually a bit frustrating. Parallelizing fetching all the children of a single node is simple; however, many of the nodes deep in sharded directory trees only have a few children, so the speedup is a bit depressing.

At the end of the day, it becomes a memory/parallelism + throughput/latency tradeoff.
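A minimal sketch of the "parallelize fetching all the children of a single node" approach, assuming a hypothetical fetchBlock function and child CID list rather than go-ipfs's actual internals, using errgroup to bound concurrency:

package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// fetchBlock stands in for a real network fetch (e.g. via bitswap); hypothetical.
func fetchBlock(ctx context.Context, cid string) ([]byte, error) {
	return []byte("block:" + cid), nil
}

// fetchChildren requests all children of one shard node, keeping at most
// `limit` fetches in flight at a time.
func fetchChildren(ctx context.Context, childCIDs []string, limit int) ([][]byte, error) {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(limit) // cap concurrent fetches

	out := make([][]byte, len(childCIDs))
	for i, c := range childCIDs {
		i, c := i, c // capture loop variables
		g.Go(func() error {
			b, err := fetchBlock(ctx, c)
			if err != nil {
				return err
			}
			out[i] = b
			return nil
		})
	}
	return out, g.Wait()
}

func main() {
	blocks, err := fetchChildren(context.Background(), []string{"Qm1", "Qm2", "Qm3"}, 16)
	fmt.Println(len(blocks), err)
}

The limitation described above is visible here: if childCIDs only has 5-10 entries, the concurrency limit barely matters.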

@ajbouh
Author

ajbouh commented Apr 25, 2018 via email

@Stebalien
Member

@ajbouh due to sharding, we have at most 256 children at each level. Fetching 256 at a time is great; however, many of the deeper (partially filled) nodes in the tree end up with only 5-10 children.

@ajbouh
Author

ajbouh commented Apr 26, 2018 via email

@Stebalien
Member

@ajbouh in practice, more like 4x. Definitely an improvement but we can do much better.

@kevina
Contributor

kevina commented Apr 26, 2018

Yeah, we need to be reading the blocks as they come in from the network and then fetching any other needed blocks in parallel. This should be possible, but I have not looked into the code yet. However, we would need to limit the number of requests in flight somehow.
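A rough sketch of that shape, with hypothetical node and fetch types standing in for the real go-ipfs machinery: a bounded number of fetches stays in flight, and every block that arrives immediately enqueues its children instead of waiting for the rest of its level.

package main

import (
	"fmt"
	"sync"
)

// node stands in for a decoded shard block; hypothetical.
type node struct {
	cid      string
	children []string
}

// fetch stands in for a bitswap fetch plus decode; hypothetical.
func fetch(cid string) node {
	return node{cid: cid} // toy: every node is a leaf here
}

// walk lists the DAG under root, keeping at most `limit` fetches in flight.
// Children are requested as soon as their parent block arrives, rather than
// level by level.
func walk(root string, limit int) []string {
	var (
		mu      sync.Mutex
		seen    []string
		wg      sync.WaitGroup
		sem     = make(chan struct{}, limit) // bounds in-flight fetches
		enqueue func(cid string)
	)

	enqueue = func(cid string) {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // acquire a fetch slot
			n := fetch(cid)
			<-sem // release the slot before fanning out

			mu.Lock()
			seen = append(seen, n.cid)
			mu.Unlock()

			for _, c := range n.children {
				enqueue(c)
			}
		}()
	}

	enqueue(root)
	wg.Wait()
	return seen
}

func main() {
	fmt.Println(walk("QmRoot", 32))
}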

@Stebalien do you have some good test hashes?

@Stebalien
Member

@kevina I just created a large directory with tiny files locally and tested with iptb. I find that's generally the best way to make a reproducible test.
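Something along these lines gives a reproducible sharded directory to test against (flag and config names as I recall them for go-ipfs 0.4.x, where HAMT sharding sat behind an experimental switch; adjust counts as needed):

ipfs config --json Experimental.ShardingEnabled true   # enable HAMT-sharded directories
mkdir testdir
for i in $(seq 1 100000); do echo "$i" > testdir/file-$i; done
ipfs add -r -Q testdir                                  # prints the directory CID

The resulting CID can then be fetched from a second node spun up with iptb, as in the test below.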

@kevina
Contributor

kevina commented Apr 26, 2018

@Stebalien where did the 4x number come from?

@Stebalien
Member

@kevina most of the shards had only a few entries, and we'd wait until we'd downloaded all of them before moving on. This gave us a sawtooth pattern where we were often only downloading a few stragglers.

@kevina
Contributor

kevina commented Apr 29, 2018

@ajbouh #4979 should help significantly

@ajbouh
Author

ajbouh commented Apr 29, 2018

@kevina excellent! Have you tried to ls the ImageNet CID with this change?

@kevina
Contributor

kevina commented Apr 29, 2018 via email

@ajbouh
Author

ajbouh commented Apr 29, 2018

Is someone tracking the optimization work needed to get this ls operation to work in a reasonable amount of time? I'm not talking about doing a get... Just an ls...?

@kevina
Contributor

kevina commented Apr 29, 2018

@ajbouh it just finished; it completed in around 30 minutes, which is not great but better. There are around 1,281,167 entries consisting of around 112,220 blocks. That's a lot of blocks to retrieve, so I am not sure how much better we can do. The PR retrieves the blocks in batch sizes of up to 320 (see the code for the reason behind this number), and it seemed to be taxing the resources on my machine, so I am not sure how much larger I want to make this number.
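For scale, taking the numbers above at face value: 1,281,167 entries spread across ~112,220 blocks is roughly 11 entries per block, so even a perfectly pipelined walk still has to pull on the order of 10^5 small blocks over the network just to list the names.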

@ajbouh
Author

ajbouh commented Apr 30, 2018

@kevina I'm not sure we're talking about the same CID here. I'm talking about one with ~10^6 entries? Is it easy to determine how many bytes are required to represent the sharded directory?

It seems we should expect it to go as fast as an ipfs get of a file of the same size, no?

@kevina
Contributor

kevina commented Apr 30, 2018

I am testing:

ipfs ls --resolve-type=false QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1

My initial numbers were wrong, so I updated the count.

It is not the size that is important but the number of blocks that need to be retrieved. With HAMT sharding of a directory object, the block size is likely to be smaller than with normal sharding of a file, which is broken up into equal-size segments (I forget the exact number, but I think it's around 43k).

@ajbouh
Author

ajbouh commented Apr 30, 2018

I see, so perhaps sharded directories just aren't designed for this use case and we should be thinking about using something else?

We need to be able to quickly enumerate all entries so we can decide which to fetch next. Perhaps a single manifest file with a known name is the easiest way to accomplish this?
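As an illustration of the manifest idea (a hypothetical layout, not an existing IPFS convention), the listing itself could be captured once and published as a single file next to the data:

ipfs ls --resolve-type=false <dataset-CID> | awk '{print $NF}' > manifest.txt
ipfs add -Q manifest.txt   # consumers fetch this one chunked file instead of walking the shards

A sequentially chunked manifest of the same total size should need far fewer blocks and round trips than walking ~10^5 small shard nodes.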

@kevina
Contributor

kevina commented Apr 30, 2018

@ajbouh perhaps; however, the number of blocks required is also really high.

@whyrusleeping @Stebalien thoughts?

@whyrusleeping
Member

Investigating...

@whyrusleeping
Member

whyrusleeping commented May 1, 2018

@kevina's code looks reasonable. We probably want to combine that with bitswap sessions and a higher bitswap activeWants count. Once concurrency of fetching is no longer the issue, there are other optimizations to look at, namely requester-side batching of blocks that we receive. Right now every block we get through bitswap gets put to the datastore individually; batching those together could yield some significant improvements.
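A sketch of the requester-side batching idea, using a hypothetical BatchingStore interface rather than go-ipfs's real datastore API: blocks received from bitswap are accumulated and written in one batch of up to N blocks instead of one put per block.

package main

import "fmt"

// Block and BatchingStore are stand-ins for the real block and datastore types.
type Block struct {
	CID  string
	Data []byte
}

type BatchingStore interface {
	PutMany(blocks []Block) error // one write transaction for many blocks
}

// drain writes incoming blocks to the store in batches of up to batchSize.
func drain(incoming <-chan Block, store BatchingStore, batchSize int) error {
	batch := make([]Block, 0, batchSize)
	flush := func() error {
		if len(batch) == 0 {
			return nil
		}
		err := store.PutMany(batch)
		batch = batch[:0]
		return err
	}

	for b := range incoming {
		batch = append(batch, b)
		if len(batch) >= batchSize {
			if err := flush(); err != nil {
				return err
			}
		}
	}
	return flush() // write any stragglers
}

// printStore is a toy implementation for demonstration only.
type printStore struct{}

func (printStore) PutMany(blocks []Block) error {
	fmt.Println("wrote", len(blocks), "blocks in one batch")
	return nil
}

func main() {
	ch := make(chan Block, 8)
	go func() {
		for i := 0; i < 20; i++ {
			ch <- Block{CID: fmt.Sprint("Qm", i)}
		}
		close(ch)
	}()
	_ = drain(ch, printStore{}, 8)
}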

In any case, @ajbouh do you need the entire list of names for your operation? Listing 10 million directory entries is going to be slow (on the order of tens of seconds) unless we work some fancy caching magic. Maybe there's a better way we can query this information?

@ajbouh
Author

ajbouh commented May 1, 2018

Yes, I need to stream through all entries in a directory, batching, sampling and shuffling them in a consistent and user-specifiable manner.

@whyrusleeping what are you thinking the primary bottleneck is?

If we're talking about 1M entries that each need about 100 bytes, that's only a 100MB total download. This seems like something that we should be able to do in 10 seconds or less on a fast connection. If it's already on the local disk it should be even faster.

What am I missing here?

@Stebalien
Member

@kevina were you using iptb on a separate network when you tested that (i.e., would bitswap sessions have affected it)?

@kevina
Contributor

kevina commented May 1, 2018

@Stebalien I was not even using iptb, just testing it from my computer.

@Stebalien
Member

@kevina could you run a quick test with iptb? That'll tell us how much bitswap sessions would help and how much, e.g., network latency/bandwidth affect it.

@kevina
Contributor

kevina commented May 2, 2018

@Stebalien can you be a little more specific on what different combinations you want to test?

@kevina
Contributor

kevina commented May 2, 2018

@Stebalien okay, I tested commit 3f79eab in PR #4979 as I think you wanted. I started an iptb testbed with just 2 nodes, connected them, and then ran:

time ./iptb run 0 ipfs ls --resolve-type=false QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 > /dev/null

It took 14m40s.

The second ipfs node in the cluster already has the hash and all the parent shards.

@whyrusleeping
Member

@kevina how long does it take to run that when you already have all the blocks?

Also, are these nodes using badger or flatfs?

@whyrusleeping
Member

@ajbouh Any other nicely measurable perf requirements you can think of are definitely appreciated, but I think this is enough to go on.

Things to note: it may be easiest to make a separate ipfs fast-ls command, or to add a streaming option to ipfs ls. It currently blocks until it has all the entries and then prints them out.
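The streaming shape described here, sketched with hypothetical types (the real plumbing in core/commands/ls.go is more involved): each entry goes out on a channel as soon as its link is read, instead of being accumulated into one big slice and printed at the end.

package main

import "fmt"

// Link is a stand-in for a directory entry; hypothetical.
type Link struct {
	Name string
	CID  string
}

// listLinks stands in for enumerating a (sharded) directory's links; hypothetical.
func listLinks(cid string) []Link {
	return []Link{{"a.txt", "Qm1"}, {"b.txt", "Qm2"}}
}

// streamLs emits entries as they are discovered instead of returning one big slice.
func streamLs(cid string) <-chan Link {
	out := make(chan Link)
	go func() {
		defer close(out)
		for _, l := range listLinks(cid) {
			out <- l // the caller can start printing immediately
		}
	}()
	return out
}

func main() {
	for l := range streamLs("QmDir") {
		fmt.Println(l.Name, l.CID)
	}
}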

@ajbouh
Author

ajbouh commented Jul 17, 2018

Yeah, I think I'm using the streaming API under the hood.

For context: this is part of a larger goal to train a state-of-the-art machine learning model from your laptop with Google's TPUs.

I'd much rather use IPFS for this, as using cloud storage makes working with open source folks very difficult. It also makes working from your laptop much harder.

TPUs cost approximately $1 for 10 minutes of use, so getting the overhead of data loading/fetching down to just a few seconds is absolutely critical. For clarity, cloud storage has essentially zero up-front overhead for already-hosted datasets.

Looking forward to getting this figured out!

@whyrusleeping
Member

@ajbouh that's really cool! Let's get this train moving then :)

Yeah, I think I'm using the streaming API under the hood.

Unless you're running custom ipfs code, I don't think you're getting what you think you are. In ls here: https://github.com/ipfs/go-ipfs/blob/master/core/commands/ls.go#L170 it collects all the results up and then outputs them all at once. I threw together a quick PoC of a fully streaming ls command here: https://github.com/ipfs/go-ipfs/compare/hack/fastls?expand=1 We should think about how to integrate that properly.

@ajbouh
Author

ajbouh commented Jul 17, 2018

Correction, not using the streaming API just yet, but we are using custom code.

tesserai/iptf#3 (comment)

That said, TensorFlow's own directory listing logic is not streaming, so some creativity will be required on my part for some operations: https://github.com/tensorflow/tensorflow/blob/e7f158858479400f17a1b6351e9827e3aa83e7ff/tensorflow/core/platform/file_system.h#L116

Agreed on getting the train moving!

Based on other threads, it seems like badger isn't a short term option. Who has the baton for this right now?

@eingenito
Contributor

@ajbouh @hannahhoward has picked this back up. See the linked issue for more details.

@hannahhoward
Contributor

@ajbouh I am not sure if we've cut a new release since ipfs/go-unixfs#19 was merged, but I'd be curious to hear how this affects your performance.

@ajbouh
Author

ajbouh commented Oct 17, 2018

Thanks for the ping, @hannahhoward

I am also curious about the performance but have not tried a recent build myself. Have you tried the operations I referenced in tesserai/iptf#2?

They were with the CID QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1

@Stebalien
Member

@hannahhoward we haven't.

@eingenito
Contributor

eingenito commented Feb 11, 2019

I believe this has been addressed in 0.4.18. Please reopen if needed.

@ghost ghost removed the status/in-progress In progress label Feb 11, 2019
@ajbouh
Author

ajbouh commented Feb 11, 2019

Hi @eingenito! Have you tried the operations I referenced in tesserai/iptf#2?

They were with the CID QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1

@Stebalien
Member

The first part was addressed in 0.4.18, but the second part is the sessions improvements that'll land in 0.4.19. Hopefully that'll be sufficient.

@Stebalien Stebalien reopened this Feb 11, 2019
@ajbouh
Author

ajbouh commented Feb 11, 2019

Will the session stuff address the duplicated blocks issue?

@Stebalien
Member

Significantly, but it's still far from perfect.

@kevina kevina removed their assignment Feb 11, 2019
@ajbouh
Author

ajbouh commented Feb 11, 2019

@b5 ^

@Stebalien
Member

This is probably as fast as it's going to get for the foreseeable future.

@ajbouh
Author

ajbouh commented May 7, 2019

Hi @Stebalien, do you mind quantifying how fast things are now?

@momack2 momack2 added this to Done in ipfs/go-ipfs May 9, 2019