
Would it make sense to use hivemind for distributed training/generation? #77

Open
0xdevalias opened this issue Nov 6, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@0xdevalias

Splitting this out from the unrelated issue:

    Not sure if the implementation/etc would be compatible, but here's another distributed StableDiffusion training project a friend recently linked me to:

Originally posted by @0xdevalias in #12 (comment)

Basically, I stumbled across the hivemind lib and thought that it could be a useful addition to AI-Horde. I'm not 100% sure how the current distributed process is implemented, but from a quick skim it looked like perhaps you had rolled your own.

Not sure if it's something you already considered and decided against, but wanted to bring it to your attention in case you hadn't seen it before.
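For context on what adopting it might look like, here is a minimal sketch based on hivemind's public quickstart, with a toy model and placeholder values (the run_id, batch sizes, and peer list are illustrative, not anything Horde-specific):

```python
import torch
import hivemind

# Toy model and a regular PyTorch optimizer as stand-ins.
model = torch.nn.Linear(16, 1)
base_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Join the peer-to-peer DHT used for peer discovery and averaging;
# real usage would pass the multiaddrs of known peers here.
dht = hivemind.DHT(initial_peers=[], start=True)

# hivemind.Optimizer wraps the local optimizer and averages with other peers
# once the whole swarm has collectively processed `target_batch_size` samples.
opt = hivemind.Optimizer(
    dht=dht,
    run_id="horde-hivemind-demo",   # hypothetical experiment name
    optimizer=base_opt,
    batch_size_per_step=8,
    target_batch_size=1024,
)
```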

@0xdevalias 0xdevalias changed the title Would it make sense to use hivemind for distributed training? Would it make sense to use hivemind for distributed training/generation? Nov 6, 2022
@db0
Member

db0 commented Nov 6, 2022

That would take quite a bit of work to onboard to the horde, but it's a promising idea. The problem is that the horde is asynchronous, so the latency could turn out to be prohibitive. Still, I would be willing to consider it, especially if someone sends a PR.

@ndahlquist
Contributor

I think the horde is primarily used for inference, not training. Do any jobs actually do training, or is that planned for the future? If not, it seems like this may provide limited benefit.

@0xdevalias
Author

0xdevalias commented Nov 26, 2022

Speaking of using hivemind for distributed training/etc., I just stumbled across the following announcement:

SD Training Labs is going to conduct the first global public distributed training on November 27th

  • Distributed training information provided to me:
    • An attempt to combine the compute power of 40+ peers worldwide to train a finetune of Stable Diffusion with Hivemind
    • This is an experimental test that is not guaranteed to work
    • This is a peer-to-peer network.
      • You can use a VPN to connect
      • Run inside an isolated container if possible
      • Developer will try to add code to prevent malicious scripting, but nothing is guaranteed
    • Current concerns with training like this:
      • Concern 1 - Poisoning: A node can connect and use a malicious dataset, thereby affecting the averaged gradients. Similar to a blockchain network, a single node will only have a small effect on the averaged weights; the larger the number of malicious nodes connected, the more influence they have over the averaged weights. At the moment we are implementing super basic (and vague) Discord account verification.
      • Concern 2 - RCE: Pickle exploits should not be possible but haven't been tested.
      • Concern 3 - IP leak & firewall issues: Due to the structure of hivemind, IPs will be seen by other peers. You can avoid this by setting client-only mode, but you will limit the network's reach. It should be possible to use IPFS to avoid firewall and NAT issues, but that doesn't work at the moment.
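For Concerns 2 and 3 above, a hedged sketch of the relevant knobs as I understand them (the multiaddr and checkpoint filename are placeholders): hivemind's DHT can be joined in client-only mode so the peer's own address isn't advertised, and recent PyTorch can refuse arbitrary pickled objects when loading a checkpoint.

```python
import torch
import hivemind

# Concern 3: join as a client-only peer. The node dials out but accepts no
# inbound connections, so its address is not advertised to the rest of the
# swarm (at the cost of not extending the network's reach).
dht = hivemind.DHT(
    initial_peers=["/ip4/203.0.113.1/tcp/31337/p2p/Qm..."],  # placeholder multiaddr
    client_mode=True,
    start=True,
)

# Concern 2: when loading a checkpoint received from another peer,
# weights_only=True (PyTorch >= 1.13) rejects arbitrary pickled objects,
# which blocks the usual pickle-based RCE vector.
state_dict = torch.load("peer_checkpoint.pt", weights_only=True)  # placeholder file
```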

Doing some further googling/etc, it seems that the 'SD Training Labs' discord is:

And things are being coordinated in the #distributed-training channel, which has a few pinned messages about the training, and links to the following repos:

It looks like the chavinlo/distributed-diffusion repo is based on this one:


A couple of snippets from skimming that Discord channel:

Could you tell me what are the minimum hardware requirements to participate?

at the moment, any GPU with 20.5 GB of VRAM, so an RTX 3090

yeah, you can connect and disconnect at any time
It basically works like this:
When a training session starts there is one peer which the other peers are going to connect to; this one usually has two ports opened, one for TCP and the other for UDP connections (TCP works most of the time while UDP doesn't)

Then the rest of the peers connect to the first peer. They can either choose to open their ports too, so more people can connect to them, extending the network reach and reducing global latency, or choose to just be a client, meaning that no other peers can connect to them.

Then all of the peers train individually (in a federated manner) on the provided dataset (served by a dataset server). Once a certain number of iterations has been reached, all peers stop training and start exchanging data with one another; this usually takes 3 minutes in very ideal conditions but can take up to 15 or 20.

If a peer joins while this is happening, or has outdated weights, it will have to wait and download the weights again.
If a peer exits while this is happening, or before it shares its locally-trained weights, the network loses some potential learning, and if the dataset that was assigned to that peer isn't reported back (after a 30-minute timeout) it will be reassigned to another peer later.

Once all the peers have synchronized they resume training and repeat the process until they reach the set number of iterations again.
One potential concern is the security of the network, since basically anyone can connect and send garbage data. I was thinking of adding basic Discord account auth for now; I have read some PRs containing network security features but I am not sure
I am also testing right now the effects of compression during sync
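In hivemind terms, the train-then-synchronize cycle described above is roughly what the collective optimizer handles for you: peers step locally, and once the swarm has accumulated the target number of samples, the step blocks while parameters are exchanged and averaged. A self-contained sketch with dummy data (all values are placeholders, not the distributed-diffusion trainer's actual configuration):

```python
import torch
import hivemind

model = torch.nn.Linear(16, 1)
dht = hivemind.DHT(initial_peers=[], start=True)
opt = hivemind.Optimizer(
    dht=dht,
    run_id="sync-cycle-demo",                   # hypothetical run name
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
    batch_size_per_step=8,
    target_batch_size=256,
)

for step in range(100):
    x = torch.randn(8, 16)                      # stand-in for a batch from the
    loss = model(x).pow(2).mean()               # dataset shard this peer was assigned
    loss.backward()
    opt.step()    # accumulates this peer's gradients; once the swarm has collectively
                  # processed target_batch_size samples, this call joins an averaging
                  # round before training resumes, mirroring the pause described above
    opt.zero_grad()
```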

I will probably just use the old unoptimized codebase and bolt hivemind onto it
it's so complicated to port the diffusers thing into lightning

I will try to "port" it to lightning and see if it works, because there's another repo (naifu) that is also doing training with diffusers, very similar to the current trainer, but it does some weird things in the background
they got hivemind working I think
but I'm not sure because they don't even use the DHT (they have the modules though)

okay, and is this just for the group project, or also to offer GPU time to individual artists?

I was also planning to do a distributed Dreambooth, like the horde, for everyone, so yeah

All-Ki pushed a commit to All-Ki/AI-Horde that referenced this issue Mar 7, 2023
…ra-Org#77)

* A number of improvements to the main loop, improving performance:

- job.exception() blocks until the job is done. Because of this, the
  main loop would always wait until all jobs were finished before
  executing the next iteration

- Because of this, the job > 180s code path wasn't reachable
  This path still had some bugs that were fixed

- Added a queue in front of the running jobs. This way, we can already
  retrieve the next job while the previous one is still running, hiding
  the job-pop latency

- log timestamps with microsecond precision + added debug logging for
  performance tuning

GPUs are actually not great at running multiple workloads at once; on a
3090 I see a ~20% total throughput drop as soon as the second job starts.
With these optimisations, it should be possible to run the worker with
max_threads = 1 for optimal performance.

Even for higher thread counts, the job.exception block prevented these
threads from getting the highest utilisation of the GPU,
so even in that case performance should be significantly better.

* small bugfix

* Don't use the queue if queue_size = 0, make 0 the default

Use these defaults until we can prove that
queue_size = 1 && max_threads = 1 is faster
than queue_size = 0 and max_threads = 2

* stylefix

* removed torch_gc

* removed torch_gc

Co-authored-by: Divided by Zer0 <mail@dbzer0.com>
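The queueing change in that commit is easier to see as code. Below is a rough sketch of the pattern it describes, not the worker's actual implementation: keep up to queue_size jobs pre-fetched so the job-pop latency overlaps with running work, and poll futures with done() rather than blocking on exception(). The pop_job and run_job functions are hypothetical stand-ins for the bridge's API call and generation work.

```python
import time
import queue
from concurrent.futures import ThreadPoolExecutor

def pop_job():
    """Hypothetical stand-in for requesting the next job from the horde."""
    return object()

def run_job(job):
    """Hypothetical stand-in for running one generation job on the GPU."""
    time.sleep(1)

def main_loop(queue_size: int = 1, max_threads: int = 1) -> None:
    pending: queue.Queue = queue.Queue(maxsize=max(queue_size, 1))
    running: list = []
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        while True:
            # Pre-fetch the next job while the GPU is busy, hiding pop latency.
            if queue_size > 0 and not pending.full():
                pending.put(pop_job())
            # Reap finished jobs without blocking (no job.exception() wait).
            running = [f for f in running if not f.done()]
            # Start new work as soon as a slot frees up.
            while len(running) < max_threads:
                job = pending.get() if not pending.empty() else pop_job()
                running.append(pool.submit(run_job, job))
            time.sleep(0.01)
```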
@db0 db0 added the enhancement New feature or request label Mar 7, 2023