Support MoE for pipeline models #5338

Merged
11 commits merged from moe/pipe into microsoft:master on Apr 8, 2024

Conversation

mosheisland
Contributor

This PR enhances DeepSpeed to support MoE for pipeline models (e.g. GPTModelPipe from Megatron-DeepSpeed).
Main changes:

  • Enhance expert group creation for pipeline models (both flavors: DP/PP/EP and DP/TP/PP/EP).
  • Fix MoE save/load checkpoint for PipelineModule-based models (see the sketch after this list).
  • Display MoE loss for PipelineModule-based models.
  • Support gradients reduce for BF16_Optimizer with PipelineModule.
    Note that the same commit also fixes a gradients-reduction error when using Megatron-DeepSpeed GPTModelPipe with BF16_Optimizer for a dense (no MoE) model.
  • When using no-drop tokens, all-reduce the capacity (op=max) over the expert parallel group instead of the world group.

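For the checkpoint bullet above, the following is a minimal, purely illustrative sketch of the idea (the function name and file layout are hypothetical, not DeepSpeed's actual checkpoint format): with expert parallelism, each expert-parallel rank owns a different shard of the expert weights, so expert state has to be saved per EP rank, while the non-expert layers of a pipeline stage only need to be written once.

```python
import os
import torch

def save_moe_pipeline_checkpoint(model, save_dir, pp_rank, ep_rank, dp_rank):
    """Hypothetical sketch of MoE-aware checkpointing for one pipeline stage.

    Expert parameters differ across expert-parallel (EP) ranks, so every EP
    rank must write its own shard; non-expert layers are replicated, so one
    rank per pipeline stage is enough to write them.
    """
    os.makedirs(save_dir, exist_ok=True)
    state = model.state_dict()

    expert_state = {k: v for k, v in state.items() if "expert" in k}
    dense_state = {k: v for k, v in state.items() if "expert" not in k}

    # Each EP rank saves the experts it owns, keyed by stage and EP rank.
    torch.save(expert_state,
               os.path.join(save_dir, f"expert_pp{pp_rank:02d}_ep{ep_rank:02d}.pt"))

    # Dense layers only need to be written once per pipeline stage.
    if ep_rank == 0 and dp_rank == 0:
        torch.save(dense_state, os.path.join(save_dir, f"layer_pp{pp_rank:02d}.pt"))
```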
@mosheisland
Contributor Author

Please note that Megatron-DeepSpeed PR#373 (https://github.com/microsoft/Megatron-DeepSpeed/pull/373) depends on this PR.
Also, Megatron-DeepSpeed PR#373 includes multiple verification runs to test MoE functionality for pipeline models and to confirm there are no regressions in other configurations (dense models, MoE for non-pipeline models).

@tohtana tohtana self-requested a review April 3, 2024 16:25
Signed-off-by: Moshe Island <misland@habana.ai>
Currently MoE uses Megatron-DeepSpeed APIs to get tensor-parallel info (rank,
world_size, group).

In order to enable MoE for PipelineModule, modify it to use backward-compatible
methods that can access either the Megatron, DeepSpeed Topology, or old Megatron
APIs.

Since MoE is not part of the deepspeed runtime, move the backward-compatible
methods to deepspeed.utils and modify imports as required.

Signed-off-by: Moshe Island <misland@habana.ai>
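The commit above describes backward-compatible accessors for tensor-parallel info; a minimal sketch of such an accessor is shown below (the function and attribute names are illustrative assumptions rather than the exact helpers this PR moves into deepspeed.utils): try the current Megatron mpu API first, then a topology-style accessor, then the legacy Megatron API.

```python
def bwc_tensor_model_parallel_world_size(mpu=None):
    """Illustrative fallback chain for the tensor-parallel world size.

    Prefer the current Megatron mpu API, then a topology-style accessor,
    then the legacy Megatron API; default to 1 when no mpu is given.
    """
    if mpu is None:
        return 1
    if hasattr(mpu, "get_tensor_model_parallel_world_size"):
        # Current Megatron / Megatron-DeepSpeed API.
        return mpu.get_tensor_model_parallel_world_size()
    if hasattr(mpu, "get_slice_parallel_world_size"):
        # Topology-style accessor (assumed name, for illustration).
        return mpu.get_slice_parallel_world_size()
    # Old Megatron API, where "model parallel" meant tensor parallel.
    return mpu.get_model_parallel_world_size()
```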
Signed-off-by: Moshe Island <misland@habana.ai>
Currently, only "total_loss" is displayed.
If the model has additional losses (e.g. MoE loss), display them as well.
Similar to "total_loss", additional losses are displayed for the full batch
after being mean-reduced across DP ranks.

Signed-off-by: Moshe Island <misland@habana.ai>
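A rough sketch of the loss-display idea from the commit above (the helper and argument names are made up for illustration; the real PipelineEngine code differs): accumulate any additional losses per micro-batch, average them over the batch, then mean-reduce across the data-parallel group before logging them next to total_loss.

```python
import torch
import torch.distributed as dist

def reduce_additional_losses(agg_additional_losses, num_micro_batches, dp_group):
    """Illustrative: mean-reduce per-batch extra losses (e.g. MoE aux loss)
    across data-parallel ranks so they can be logged next to total_loss."""
    reduced = {}
    for name, accumulated in agg_additional_losses.items():
        # Mean over the micro-batches of this global batch.
        loss = accumulated.clone().detach() / num_micro_batches
        # Mean over the data-parallel ranks.
        dist.all_reduce(loss, op=dist.ReduceOp.SUM, group=dp_group)
        loss /= dist.get_world_size(group=dp_group)
        reduced[name] = loss
    return reduced
```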
Signed-off-by: Moshe Island <misland@habana.ai>
Currently, when using no-drop tokens, we calculate the capacity locally and
then all-reduce (op=max) over the world group.

This fails when using pipeline parallelism (with micro-batches), since workers
in different stages are handling different model layers (or, during warmup,
first-stage workers are processing while last-stage workers are idle).

Fix it by running the all-reduce over the expert parallel group.

Signed-off-by: Moshe Island <misland@habana.ai>
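The core of that fix can be sketched as follows (the helper name and group handle are illustrative): the max-reduction of the locally computed capacity is restricted to the expert parallel group, so ranks on other pipeline stages are never expected to participate.

```python
import torch
import torch.distributed as dist

def sync_nodrop_capacity(local_capacity: torch.Tensor, expert_group) -> torch.Tensor:
    """Illustrative: agree on a common no-drop capacity among the ranks that
    actually share the experts (the expert parallel group)."""
    # Reducing over the world group instead can hang under pipeline
    # parallelism, because workers on other stages never reach this point.
    dist.all_reduce(local_capacity, op=dist.ReduceOp.MAX, group=expert_group)
    return local_capacity
```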
This commit enhances expert group creation for both modes:
- DP + PP + EP
- DP + TP + PP + EP

Signed-off-by: Moshe Island <misland@habana.ai>
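For illustration only, a simplified version of DP/TP/PP/EP expert-group creation is sketched below; it assumes a (pp, dp, tp) rank layout with tp varying fastest, whereas the real DeepSpeed code derives the layout from its process topology.

```python
import torch.distributed as dist

def create_expert_parallel_groups(world_size, tp_size, pp_size, ep_size):
    """Simplified illustration of DP/TP/PP/EP expert-group creation.

    Assumes rank = pp * dp_size * tp_size + dp * tp_size + tp; the actual
    DeepSpeed implementation uses its ProcessTopology instead.
    """
    dp_size = world_size // (tp_size * pp_size)
    assert dp_size % ep_size == 0, "EP size must divide the DP size"

    expert_groups = []
    for pp in range(pp_size):
        for tp in range(tp_size):
            # All DP ranks that share this (pp, tp) coordinate.
            dp_ranks = [pp * dp_size * tp_size + dp * tp_size + tp
                        for dp in range(dp_size)]
            # Split them into expert-parallel groups of ep_size ranks each.
            for start in range(0, dp_size, ep_size):
                ranks = dp_ranks[start:start + ep_size]
                expert_groups.append(dist.new_group(ranks=ranks))
    return expert_groups
```

For example, with world_size=8, tp_size=1, pp_size=2, ep_size=2 this yields four expert-parallel groups of two ranks each, two per pipeline stage.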
When using MoE with MoE-TP disabled, use the pipeline parallel group to max or
sum MoE gradients.

This also fixes the behavior for the following configuration:
no pipeline, TP enabled, MoE-TP disabled.

Signed-off-by: Moshe Island <misland@habana.ai>
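Purely as an illustration of the mechanism involved (the attribute and key names are assumptions, and the actual BF16_Optimizer changes are more involved), MoE gradient reduction has to select a per-parameter process group instead of always reducing over the data-parallel group:

```python
def moe_grad_reduce_group(param, groups):
    """Illustrative selection of the process group used to reduce a
    parameter's gradient; names here are assumptions, not DeepSpeed API."""
    if getattr(param, "allreduce", True):
        # Dense parameter: reduce over the regular data-parallel group.
        return groups["data_parallel"]
    # MoE expert parameter: reduce over the group that actually replicates
    # this expert (expert data-parallel; with MoE-TP disabled the commit
    # above additionally involves the pipeline parallel group).
    return groups["expert_data_parallel"][param.group_name]
```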
@mosheisland mosheisland force-pushed the moe/pipe branch 2 times, most recently from 7a5e888 to 0f9d2b5 Compare April 4, 2024 12:27
@tohtana
Contributor

tohtana left a comment
This is incredibly great work. Thank you for the amazing contribution!

Review thread on deepspeed/runtime/pipe/engine.py (outdated, resolved)
@tohtana
Contributor

tohtana commented Apr 4, 2024

I left a comment about a very small part but have already approved this PR.

Signed-off-by: Moshe Island <misland@habana.ai>
@mosheisland
Contributor Author

> I left a comment about a very small part but have already approved this PR.

Done

@tohtana tohtana enabled auto-merge April 7, 2024 06:13
@tohtana tohtana added this pull request to the merge queue Apr 7, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 7, 2024
@tohtana tohtana added this pull request to the merge queue Apr 7, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 7, 2024
@loadams loadams added this pull request to the merge queue Apr 8, 2024
Merged via the queue into microsoft:master with commit 08e0733 Apr 8, 2024
12 checks passed
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024