I have an EC2 instance with 4 GPUs, each with 16 GB of memory.
By default, Hugging Face uses data parallelism and tries to fit a full copy of the model on each GPU. Of course, 16 GB is not enough, and I ran into a CUDA out-of-memory error.
I tried to use tensor parallelism and pipeline parallelism through Megatron-LM. For example, with a pipeline-parallel degree of 4, Megatron-LM is supposed to split the model across the 4 GPUs, giving 64 GB of GPU memory in total. However, I still ran into a CUDA out-of-memory error.
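For reference, this is roughly the kind of setup I mean; a minimal sketch using Accelerate's `MegatronLMPlugin`, where the argument names (`tp_degree`, `pp_degree`, `num_micro_batches`) are taken from the Accelerate docs and may need adjusting for your versions:

```python
# Minimal sketch: pipeline parallelism through Accelerate's Megatron-LM plugin.
# Argument names follow the Accelerate docs; they may differ across versions.
from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin

megatron_plugin = MegatronLMPlugin(
    tp_degree=1,          # tensor-parallel degree (no tensor splitting here)
    pp_degree=4,          # pipeline-parallel degree: shard layers across the 4 GPUs
    num_micro_batches=4,  # micro-batches to keep all pipeline stages busy
)
accelerator = Accelerator(megatron_lm_plugin=megatron_plugin)
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```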
I also tried to use ZeRO Stage 2 and Stage 3 through DeepSpeed. However, I still ran into a CUDA out-of-memory error.
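(As I understand it, ZeRO Stage 2 still keeps a full copy of the parameters on every GPU; only Stage 3 shards the weights themselves.) A minimal sketch of the Stage 3 setup via Accelerate's `DeepSpeedPlugin`, again with argument names taken from the Accelerate docs:

```python
# Minimal sketch: ZeRO Stage 3 with CPU offload through Accelerate's DeepSpeed plugin.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    zero_stage=3,                    # Stage 3 shards parameters as well as
                                     # optimizer states and gradients
    offload_optimizer_device="cpu",  # move optimizer states to host RAM
    offload_param_device="cpu",      # move parameters to host RAM
)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```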
In both cases, I used Hugging Face Accelerate to configure Megatron-LM or DeepSpeed.
I wonder whether the diffusers repo works out of the box with Megatron-LM and DeepSpeed, or whether I need to make code changes to get tensor/pipeline parallelism working.