Reproducing DALL-E using DeepSpeed #137

mehdidc opened this issue Mar 29, 2021 · 32 comments

@mehdidc (Contributor) commented Mar 29, 2021

Hi @lucidrains, Hi @robvanvolt,

@JeniaJitsev initially started a discussion in @robvanvolt's Discord channel.
Just a brief recap:
We (@JeniaJitsev, @janEbert, and myself) are part of a research group in Germany, Helmholtz AI,
which belongs to the Helmholtz Association. We are interested in reproducing DALL-E. We can offer
you access to A100 GPUs (from https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html) for reproducing the model, ideally using DeepSpeed for distributed training.
What are your thoughts? Would you be interested?

@lucidrains (Owner)

@mehdidc Hi Mehdi! I am actually busy with protein folding replication (Alphafold2), but I think @robvanvolt and @afiaka87 would definitely love to make use of the resources :) Thank you!

@afiaka87 (Contributor)

> @mehdidc Hi Mehdi! I am actually busy with protein folding replication (Alphafold2), but I think @robvanvolt and @afiaka87 would definitely love to make use of the resources :) Thank you!

@lucidrains Just for context, I deferred them to you due to my inability to answer questions regarding multi-GPU compute.

@mehdidc Seems we're all a bit busy at the moment. I will do my best to help you with this if you can file issues for us, but I've decided to be fairly hands-off in the Discord chat for personal reasons.

@afiaka87 (Contributor) commented Mar 29, 2021

@lucidrains I know you're busy but a quick yes or no will suffice-

Does the codebase in its current form make use of multiple GPUs?

@afiaka87 (Contributor) commented Mar 29, 2021

@mehdidc Just to be clear - we are quite interested.

I'll be making this a high priority but can only help so much due to my lack of machine learning knowledge. I'm assuming robvanvolt feels similarly, but they are also dealing with quite a surge in traffic on the newly created Discord.

If you have a bit of patience though, we'll both be able to help you out along the way.

@lucidrains (Owner)

ohhh right, so the current script does not do multi-GPU, but it should be pretty easy to get multi-GPU working with the newest deepspeed (or pytorch lightning). I'll see what I can do tomorrow
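
For anyone following along, here is a minimal sketch of what wrapping the training loop in DeepSpeed could look like. It is illustrative only: the model, dataloader and config values are placeholders (not the actual dalle-pytorch training script), and older DeepSpeed releases take the config as a JSON file path rather than a dict.

```python
# Launch with the DeepSpeed launcher, e.g.: deepspeed train.py
# Illustrative sketch only, not the real dalle-pytorch trainer.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import deepspeed

# Stand-in for the DALL-E model; in practice this would be dalle_pytorch.DALLE.
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 256))

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},  # ZeRO: partition optimizer states across GPUs
    "optimizer": {"type": "Adam", "params": {"lr": 3e-4}},
}

# deepspeed.initialize returns an engine that takes care of data parallelism,
# gradient accumulation and mixed precision.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,  # recent DeepSpeed; older versions use config_params / a JSON path
)

# Dummy data just to make the loop self-contained.
loader = DataLoader(TensorDataset(torch.randn(1024, 256), torch.randn(1024, 256)), batch_size=8)

for inputs, targets in loader:
    inputs = inputs.to(model_engine.device).half()
    targets = targets.to(model_engine.device).half()
    loss = nn.functional.mse_loss(model_engine(inputs), targets)
    model_engine.backward(loss)  # replaces loss.backward()
    model_engine.step()          # replaces optimizer.step() + zero_grad()
```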

@JeniaJitsev

Hey folks! It will be no trouble to arrange access to compute resources on the order of 2 compute nodes with 4x GPUs each, provided that together we look into multi-GPU execution, preferably using DeepSpeed (which seems to me the most straightforward way with transformers right now), though we are open to other suggestions. @lucidrains I can imagine that, starting from that, we could also work together on AlphaFold2, at least with regard to its Transformer component. So we may be able to turn this into a generic collaboration on distributed training of various useful architectures on multi-node, multi-GPU setups.

Please let us know what you think.

@afiaka87 (Contributor) commented Mar 29, 2021

> Hey folks! It will be no trouble to arrange access to compute resources on the order of 2 compute nodes with 4x GPUs each, provided that together we look into multi-GPU execution, preferably using DeepSpeed (which seems to me the most straightforward way with transformers right now), though we are open to other suggestions. @lucidrains I can imagine that, starting from that, we could also work together on AlphaFold2, at least with regard to its Transformer component. So we may be able to turn this into a generic collaboration on distributed training of various useful architectures on multi-node, multi-GPU setups.
>
> Please let us know what you think.

My first concern is with regard to deepspeed. I've not yet been able to get it working (independently) with the sparse attention that we use. Is this something you've dealt with?

I believe lucidrains has gotten it working, because there's an install script in the repo and code for it. But as it stands we don't have a pleasant Docker-deploy type scenario (and those scripts don't seem to work on my configs even if I switch to the correct cudatoolkit, etc.).

Furthermore - I'm not certain that Microsoft's DeepSpeed actually supports the A100 GPU yet. For now it seems your best bet is to deploy using a V100 or a Titan RTX. I've filed an issue about this here. Give it a thumbs up and maybe they'll have a look? Not likely though. That's not to say that it won't work - but it may require severe tinkering.

@JeniaJitsev

Good question - so first we have to clarify whether the sparse attention transformer and DeepSpeed go along together. We ourselves haven't tried it - in fact, we have only run DeepSpeed in a very simple multi-GPU scenario, data parallel mode, for standard CIFAR-10 supervised training on ResNet-50, so quite boring.

How about this: I will provide you links with instructions on how to register at our supercomputing facilities and will grant you access to some compute resources. We can then try together to make this one particular test work: DeepSpeed with the sparse attention transformer.

Timewise there is no hurry. In fact we are unfortunately also quite busy and until the end of May will only have sparse time )) to work hands-on with you. From June on it looks better. But we can manage the first steps, so that you have your environment on the compute nodes and the libraries in place, etc. One note - on supercomputing nodes it is not really possible to flexibly switch low-level things like NVIDIA drivers or to switch between many different CUDA versions, should that become necessary.

@JeniaJitsev

> Furthermore - I'm not certain that Microsoft's DeepSpeed actually supports the A100 GPU yet. For now it seems your best bet is to deploy using a V100 or a Titan RTX. I've filed an issue about this here. Give it a thumbs up and maybe they'll have a look? Not likely though. That's not to say that it won't work - but it may require severe tinkering.

It is not a problem to start with V100, we have nodes with those as well.

@afiaka87 (Contributor) commented Mar 29, 2021

Hm - well if you're not in desperate need of the actual sparse attention then, as far as I'm concerned, turn it off the moment it gives you problems, ha.

And yeah, I believe the V100s would be a better starting point to just get the code running at least. Do any of you have local dev environments with GPUs you can use as well, without needing to explicitly include them in your budget?

@JeniaJitsev

We do have a machine with 4x V100 without budget limitation, with the drawback that it is not accessible from outside. I think it would be better to get onto a machine where we can all indeed work together. Let's try to get a model training running on a compute node where we all have access. Once we have it tuned, we can commit longer training runs to the local machine for further testing.

@JeniaJitsev commented Mar 29, 2021

@lucidrains @afiaka87 Let's do a step like this: please drop me a short email at j.jitsev@fz-juelich.de, and I will send you instructions so that you can already register for access and I can add you both to the compute project. We do this step and see from there how to organize ourselves.

@JeniaJitsev

With regard to sparse attention: another colleague of ours, Alex Strube (@surak), opened an issue at DeepSpeed - judging from that discussion, it should be fine to go with V100 and CUDA 11: microsoft/DeepSpeed#790

@janEbert (Contributor)

@lucidrains If it's fine with you, I'd take the learning experience with DeepSpeed and try to get it running on some V100s tomorrow. Please tell me if you'd rather do it yourself, otherwise I'm definitely up to relieve you from that.

@lucidrains (Owner)

@janEbert would be pleased for you to take the helm!

@janEbert (Contributor)

Thanks for the trust. ;)

@robvanvolt (Contributor) commented Mar 29, 2021

Awesome! This got a little traction fast! :D I'm currently trying to get DeepSpeed with sparse attention running on an RTX 3090 (if it succeeds, it should then work on the A100 as well).

@afiaka87 is right, I'm rather new to ML (just a programmer for a little more than a decade), so I wouldn't be of much help in the deeper parts of ML outside of a little code optimization / preprocessing and translating of captions / organizing stuff (that was the reason for the Discord: a more organized crew and less "chat" here in the GitHub issues).

@afiaka87 (Contributor) commented Mar 29, 2021

> @lucidrains If it's fine with you, I'd take the learning experience with DeepSpeed and try to get it running on some V100s tomorrow. Please tell me if you'd rather do it yourself, otherwise I'm definitely up to relieve you from that.

@janEbert Thanks a ton for taking this up! Your prior experience means you're likely to figure that out a bit faster than I could have.
I'm still happy to help - I don't intend to get in your way, of course.
@JeniaJitsev I agree that a team environment is going to work well here! Thank you very much. Mostly I'll just be following janEbert's progress, but I am also interested in access and may be able to log in and fix things occasionally going forward. I'll send you an email now and we can discuss it there.

@janEbert (Contributor)

Ah right, you also mentioned you wanted to do it, sorry! I'll see how far I can get tomorrow and stay in touch with you on the Discord, is that okay?

@afiaka87 (Contributor) commented Mar 29, 2021

> Ah right, you also mentioned you wanted to do it, sorry! I'll see how far I can get tomorrow and stay in touch with you on the Discord, is that okay?

Please do! I'll be highly available to help if you need anything.

@janEbert (Contributor)

That's great to know, thank you!

@afiaka87 (Contributor)

@janEbert @JeniaJitsev This system is indeed complex. Could I borrow one of you for a quick tutorial on deploying dalle-pytorch with a proper dataset? I believe that would speed up things a bit for me.

@JeniaJitsev

> @janEbert @JeniaJitsev This system is indeed complex. Could I borrow one of you for a quick tutorial on deploying dalle-pytorch with a proper dataset? I believe that would speed up things a bit for me.

You can also borrow @mehdidc, who originated this issue; he will also be eager to help, I guess ))

@lucidrains (Owner)

Great work so far! I just want to throw it out there that PyTorch Lightning has DeepSpeed (and wandb) integration: https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html?highlight=deepspeed

Perhaps by using it we can get the best of both worlds and keep things significantly less complex than they would otherwise be?
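
For reference, a rough sketch of what the Lightning route could look like. Illustrative only: exact plugin names and Trainer arguments differ between Lightning versions, and the return_loss-style forward call is an assumption borrowed from dalle-pytorch's usual interface.

```python
# Illustrative sketch of the PyTorch Lightning + DeepSpeed + wandb route.
import torch
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger


class LitDALLE(pl.LightningModule):
    """Thin LightningModule wrapper around a DALL-E-style model."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def training_step(self, batch, batch_idx):
        text, images = batch
        # Assumes a dalle-pytorch-style forward that returns the loss directly.
        loss = self.model(text, images, return_loss=True)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=3e-4)


trainer = pl.Trainer(
    gpus=4,
    precision=16,
    plugins="deepspeed",  # Lightning's DeepSpeed integration (name varies by version)
    logger=WandbLogger(project="dalle-pytorch"),
)
# trainer.fit(LitDALLE(dalle), train_dataloader)  # dalle / dataloader defined elsewhere
```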

@janEbert (Contributor)

Thanks for the suggestion, I didn't know about that! To the software engineer in me it's valuable to have direct access to the API I'm using. For what it's worth, if I understood the documentation correctly, training is now set up so we can do anything PyTorch Lightning can (using ZeRO or ZeRO-Offload).
However, we may even use DeepSpeed's pipeline parallelism (which Lightning does not support yet) if we wrap our models.

I definitely see the value in clean (and, even more importantly, battle-tested) research code.
For now, we could clean up the training code by wrapping even the non-distributed models, so we don't need to handle the distributed and non-distributed update-step code in different ways. This may cause other issues for users, though. It's a hard call. :/
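
For reference, ZeRO and ZeRO-Offload are switched on entirely through the DeepSpeed config; a sketch of the relevant section (values are illustrative placeholders, not our actual settings):

```python
# Illustrative DeepSpeed config fragment (as a Python dict) enabling ZeRO stage 2
# with optimizer-state offload to CPU memory.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                              # partition optimizer states + gradients
        "offload_optimizer": {"device": "cpu"},  # ZeRO-Offload; older releases used "cpu_offload": true
        "overlap_comm": True,                    # overlap gradient reduction with the backward pass
        "contiguous_gradients": True,
    },
}
```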

@janEbert (Contributor)

Well, as expected, I was completely wrong... :D
I didn't manage to get ZeRO to work with the current setup, for example. Seems more changes have to be made.

@lucidrains (Owner)

@janEbert no worries, give it another try :) I believe in you

@JeniaJitsev commented Apr 2, 2021

@janEbert, @mehdidc & everyone:

It seems the EleutherAI folks are working on training and then releasing a publicly available large GPT version (a 175B one), and they likewise use a code base that employs DeepSpeed. It looks to me like it could be helpful for the DeepSpeed experiments we are conducting. They also have their own fork of DeepSpeed adapted to this end.

  • GPT‑NeoX is an implementation of 3D-parallel GPT‑3-like models on distributed GPUs, based upon DeepSpeed and Megatron-LM. It is designed to be able to train models in the hundreds of billions of parameters or larger (https://www.eleuther.ai/projects/gpt-neox/)
  • As of 2021-03-31, the codebase is fairly stable. DeepSpeed, 3D-parallelism and ZeRO are all working properly. [seems ZeRO stage 1 only is working, ZeRO 3 is in progress]
  • [still no surprise on the sparse attention situation here] DeepSpeed's sparse attention kernels are supported, but don't work with CUDA 11.0+, and require a specific hardware setup (V100s/RTX2080s). Add "sparsity": "all" to your config to use sparse attention on all layers, or "sparsity": "interspersed" to use it every other layer (see the illustrative config sketch after this list).
  • https://github.com/EleutherAI/gpt-neox
  • https://github.com/EleutherAI/DeeperSpeed
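
For our own DeepSpeed experiments, the corresponding knob sits in DeepSpeed's config under a sparse_attention block (the GPT-NeoX "sparsity" key quoted above appears to be their wrapper around it). A sketch following the examples in the DeepSpeed docs, purely illustrative:

```python
# Illustrative DeepSpeed config fragment for its block-sparse attention kernels.
# Keys follow the DeepSpeed documentation; values would need tuning for DALL-E.
ds_config_sparse = {
    "sparse_attention": {
        "mode": "fixed",                   # fixed sparsity pattern (as in Sparse Transformers)
        "block": 16,                       # block size of the block-sparse layout
        "different_layout_per_head": True,
        "num_local_blocks": 4,
        "num_global_blocks": 1,
        "attention": "unidirectional",     # causal attention for autoregressive training
        "horizontal_global_attention": False,
        "num_different_global_patterns": 4,
    }
}
```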

@janEbert (Contributor) commented Apr 2, 2021

Well, we are lucky enough to use the DeepSpeed library itself, so we have stage 2 working already! I can't test stage 3 as I don't have access to a recent enough version of DeepSpeed, but from my assumptions this really should work out of the box with the current code.
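
If it helps once a newer DeepSpeed lands on the machines, switching to stage 3 should mostly be a config change; a hedged sketch (key names from the DeepSpeed docs, values illustrative):

```python
# Illustrative ZeRO stage 3 fragment: additionally partitions the model parameters themselves.
ds_config_zero3 = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                            # partition optimizer states, gradients and parameters
        "stage3_max_live_parameters": 1e9,     # upper bound on params materialized at any one time
        "stage3_prefetch_bucket_size": 5e8,
    },
}
```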

@JeniaJitsev

> Well, we are lucky enough to use the DeepSpeed library itself, so we have stage 2 working already! I can't test stage 3 as I don't have access to a recent enough version of DeepSpeed, but from my assumptions this really should work out of the box with the current code.

Okay, if it works out of the box with DeepSpeed, all the better. Fewer libraries, less trouble ))

@lucidrains (Owner)

> Well, we are lucky enough to use the DeepSpeed library itself, so we have stage 2 working already! I can't test stage 3 as I don't have access to a recent enough version of DeepSpeed, but from my assumptions this really should work out of the box with the current code.

@janEbert amazing job! 💯 💯 🙏

@janEbert (Contributor) commented Apr 3, 2021

Just to confirm: Stage 3 works, our sysadmin sacrificed valuable holiday time to quickly upgrade DeepSpeed!
And thanks! 😘
