Training text_to_image on multi-GPU with Megatron-LM or DeepSpeed does not work #3484

Description

@zipliucc

I have an EC2 instance with 4 GPUs, each with 16 GB of memory.

By default, Hugging Face Accelerate uses data parallelism and tries to fit a full copy of the model on each GPU. Of course, 16 GB is not enough, and I ran into a CUDA out-of-memory error.
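For context, here is a minimal sketch of what I understand the default data-parallel path to do (simplified, not the actual `train_text_to_image.py` script; the checkpoint name is just an example):

```python
import torch
from accelerate import Accelerator
from diffusers import UNet2DConditionModel

accelerator = Accelerator()  # defaults to DistributedDataParallel across the 4 GPUs

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

# prepare() wraps the model in DDP, so each of the 4 processes keeps a full
# replica of the model plus its gradients and optimizer states on its own GPU.
unet, optimizer = accelerator.prepare(unet, optimizer)
```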

I tried tensor parallelism and pipeline parallelism through Megatron-LM. For example, with a pipeline-parallel degree of 4, Megatron-LM should split the model across the 4 GPUs, giving 64 GB of GPU memory in total. However, I still ran into a CUDA out-of-memory error.
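I set this up through `accelerate config`, but in code it would correspond to something like the sketch below (my understanding of the `MegatronLMPlugin` options; the exact values are only illustrative):

```python
from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin

# Pipeline parallelism of degree 4: the model should be split into 4 stages,
# one per GPU, instead of being fully replicated on every GPU.
megatron_plugin = MegatronLMPlugin(
    tp_degree=1,          # no tensor parallelism
    pp_degree=4,          # 4 pipeline stages
    num_micro_batches=4,  # micro-batches to keep the pipeline stages busy
)
accelerator = Accelerator(megatron_lm_plugin=megatron_plugin)
```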

I also tried ZeRO Stage 2 and Stage 3 through DeepSpeed, but I still ran into a CUDA out-of-memory error.
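This was also configured via `accelerate config`; in code it corresponds roughly to the following (the offload settings shown here are only illustrative):

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# ZeRO Stage 3: optimizer states, gradients, and parameters are partitioned
# across the 4 GPUs instead of being replicated on each one.
ds_plugin = DeepSpeedPlugin(
    zero_stage=3,
    gradient_accumulation_steps=4,
    offload_optimizer_device="cpu",  # optional CPU offload of optimizer states
    offload_param_device="cpu",      # optional CPU offload of parameters
)
accelerator = Accelerator(deepspeed_plugin=ds_plugin, mixed_precision="fp16")
```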

In both cases, I used Hugging Face Accelerate to configure Megatron-LM or DeepSpeed.

I wonder whether the diffusers repo works out of the box with Megatron-LM and DeepSpeed, or whether I need to make code changes to get it working with tensor/pipeline parallelism?

Thank you very much,
