[BUG] ZeRO optimizer with MoE Expert Parallelism #5618

Open
Jack47 opened this issue Jun 5, 2024 · 1 comment
Assignees: jomayeri
Labels: bug (Something isn't working), training

Comments

Jack47 commented Jun 5, 2024

Describe the bug
As in PR #5259, the ZeRO optimizer also needs to be fixed in two places (see the sketch below):

  1. The partition logic for expert parameters. (screenshot omitted)
  2. average_tensor, used for gradient reduction in ZeRO stage 2. (screenshot omitted)
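For context, here is a minimal sketch (not DeepSpeed's actual code) of the behavior those two spots need, written with plain torch.distributed. The helper names and group handles are hypothetical; the point is only that expert parameters and their gradients belong to the expert data-parallel group, so that results under ep=4 match ep=1.

```python
# Illustrative only -- not DeepSpeed code. `group` is assumed to be the
# expert data-parallel process group for expert params/grads, and the full
# data-parallel group for everything else.
import torch.distributed as dist

def partition_params(params, group):
    # Toy per-parameter round-robin split across the ranks of `group`
    # (real ZeRO partitions flattened buffers; this is only for illustration).
    # Expert params should be split over the expert data-parallel group,
    # not the full data-parallel group.
    rank = dist.get_rank(group=group)
    size = dist.get_world_size(group=group)
    return [p for i, p in enumerate(params) if i % size == rank]

def average_gradient(grad, group):
    # Conceptually what average_tensor does for a single gradient: sum across
    # the group, then divide by that group's size. For expert gradients the
    # group must be the expert data-parallel group; a mismatched group/divisor
    # leaves expert grads off by a factor of ep (4x under ep=4, per this report).
    dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)
    grad.div_(dist.get_world_size(group=group))
    return grad
```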

To Reproduce
Steps to reproduce the behavior:

Use expert parallelism with ep=4 and the AdamW optimizer to train an LLM containing MoE layers (a sketch of such a setup follows).
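For concreteness, a hedged sketch of such a setup. The model, sizes, and MoE arguments below are placeholders rather than details from the report; only ZeRO stage 2, AdamW, and ep_size=4 are the parts that matter.

```python
# Hypothetical repro skeleton: ZeRO stage 2 + AdamW + MoE with ep_size=4.
# All sizes and the expert module are placeholders.
import torch.nn as nn
import deepspeed
from deepspeed.moe.layer import MoE

ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {"stage": 2},                      # ZeRO-2
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

hidden = 1024
expert = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))

class TinyMoEModel(nn.Module):
    def __init__(self):
        super().__init__()
        # num_experts and k are placeholders; ep_size=4 is the setting from the report.
        self.moe = MoE(hidden_size=hidden, expert=expert, num_experts=8, ep_size=4, k=1)

    def forward(self, x):
        out, _aux_loss, _exp_counts = self.moe(x)
        return out

model = TinyMoEModel()
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
# Training with this engine should then exercise the expert-gradient averaging path.
```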

Expected behavior
Expert gradients should be equal under ep=4 and ep=1, but currently they come out 4x larger under ep=4 than under ep=1 (a check for this is sketched below).
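A hedged way to observe this (not from the report): run the same batch with ep=1 and ep=4 and compare expert-parameter gradient norms. The name-based filter below is an assumption about how expert parameters are named, and under ZeRO the gradients may already live in the engine's internal buffers after backward, so treat this only as a sketch.

```python
# Hypothetical check: expert-gradient norms should match between an ep=1 run
# and an ep=4 run on the same data and seed; per this report they come out
# roughly 4x larger under ep=4. Filtering on "expert" in the parameter name
# is an assumption about the model, not a DeepSpeed API.
def expert_grad_norms(model):
    return {
        name: param.grad.norm().item()
        for name, param in model.named_parameters()
        if param.grad is not None and "expert" in name
    }

# Usage, before the optimizer step:
#   norms_ep1 = expert_grad_norms(model_ep1)
#   norms_ep4 = expert_grad_norms(model_ep4)
# Corresponding entries should be equal, not differ by a factor of ~4.
```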

Jack47 added the bug (Something isn't working) and training labels on Jun 5, 2024
jomayeri (Contributor) commented Jun 7, 2024

@Jack47 Can you make a PR for this? Thanks!

jomayeri self-assigned this on Jun 7, 2024
Jack47 changed the title from "[BUG]" to "[BUG] ZeRO optimizer with MoE Expert Parallelism" on Jun 7, 2024