[BUG] ZeRO optimizer with MoE Expert Parallelism #5618

Open
Jack47 opened this issue Jun 5, 2024 · 1 comment
Assignees: jomayeri
Labels: bug (Something isn't working), training

Comments

Jack47 commented Jun 5, 2024

Describe the bug
As in PR #5259, the ZeRO optimizer also needs to be fixed in two places (see the sketch below):

  1. The partition logic for expert parameters. (screenshot omitted)
  2. average_tensor, used for gradient reduction in ZeRO stage 2. (screenshot omitted)
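For context, here is a minimal sketch (not DeepSpeed's actual code) of the behavior those two spots need, written with plain torch.distributed. The helper names and group handles are hypothetical; the point is only that expert parameters and their gradients belong to the expert data-parallel group, so that results under ep=4 match ep=1.

```python
# Illustrative only -- not DeepSpeed code. `group` is assumed to be the
# expert data-parallel process group for expert params/grads, and the full
# data-parallel group for everything else.
import torch.distributed as dist

def partition_params(params, group):
    # Toy per-parameter round-robin split across the ranks of `group`
    # (real ZeRO partitions flattened buffers; this is only for illustration).
    # Expert params should be split over the expert data-parallel group,
    # not the full data-parallel group.
    rank = dist.get_rank(group=group)
    size = dist.get_world_size(group=group)
    return [p for i, p in enumerate(params) if i % size == rank]

def average_gradient(grad, group):
    # Conceptually what average_tensor does for a single gradient: sum across
    # the group, then divide by that group's size. For expert gradients the
    # group must be the expert data-parallel group; a mismatched group/divisor
    # leaves expert grads off by a factor of ep (4x under ep=4, per this report).
    dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)
    grad.div_(dist.get_world_size(group=group))
    return grad
```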

To Reproduce
Steps to reproduce the behavior:

Use expert parallelism with ep=4 and the AdamW optimizer to train an LLM containing MoE layers (a sketch of such a setup follows).
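For concreteness, a hedged sketch of such a setup. The model, sizes, and MoE arguments below are placeholders rather than details from the report; only ZeRO stage 2, AdamW, and ep_size=4 are the parts that matter.

```python
# Hypothetical repro skeleton: ZeRO stage 2 + AdamW + MoE with ep_size=4.
# All sizes and the expert module are placeholders.
import torch.nn as nn
import deepspeed
from deepspeed.moe.layer import MoE

ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {"stage": 2},                      # ZeRO-2
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

hidden = 1024
expert = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))

class TinyMoEModel(nn.Module):
    def __init__(self):
        super().__init__()
        # num_experts and k are placeholders; ep_size=4 is the setting from the report.
        self.moe = MoE(hidden_size=hidden, expert=expert, num_experts=8, ep_size=4, k=1)

    def forward(self, x):
        out, _aux_loss, _exp_counts = self.moe(x)
        return out

model = TinyMoEModel()
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
# Training with this engine should then exercise the expert-gradient averaging path.
```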

Expected behavior
Expert gradients should be equal under ep=4 and ep=1, but currently they come out 4x larger under ep=4 than under ep=1 (a check for this is sketched below).
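A hedged way to observe this (not from the report): run the same batch with ep=1 and ep=4 and compare expert-parameter gradient norms. The name-based filter below is an assumption about how expert parameters are named, and under ZeRO the gradients may already live in the engine's internal buffers after backward, so treat this only as a sketch.

```python
# Hypothetical check: expert-gradient norms should match between an ep=1 run
# and an ep=4 run on the same data and seed; per this report they come out
# roughly 4x larger under ep=4. Filtering on "expert" in the parameter name
# is an assumption about the model, not a DeepSpeed API.
def expert_grad_norms(model):
    return {
        name: param.grad.norm().item()
        for name, param in model.named_parameters()
        if param.grad is not None and "expert" in name
    }

# Usage, before the optimizer step:
#   norms_ep1 = expert_grad_norms(model_ep1)
#   norms_ep4 = expert_grad_norms(model_ep4)
# Corresponding entries should be equal, not differ by a factor of ~4.
```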

Jack47 added the bug (Something isn't working) and training labels on Jun 5, 2024
jomayeri (Contributor) commented Jun 7, 2024

@Jack47 Can you make a PR for this? Thanks!

jomayeri self-assigned this on Jun 7, 2024
Jack47 changed the title from "[BUG]" to "[BUG] ZeRO optimizer with MoE Expert Parallelism" on Jun 7, 2024