[BUG]: Colossalai-OpenMoE example failed to converge #5163

Open
marsggbo opened this issue Dec 5, 2023 · 2 comments
Labels: bug (Something isn't working)

Comments


marsggbo commented Dec 5, 2023

🐛 Describe the bug

Setup

I am currently running the Colossalai/examples/language/openmoe project with the following experimental setup:

  • dataset: load_dataset("yizhongw/self_instruct", "super_natural_instructions"); I also tried "wikitext-2" (a loading sketch follows this list)
  • model: openmoe-base
  • batch size: 2
  • 2 GPUs
  • parallel strategy:
    • pp_size=1
    • dp_size=1
    • ep_size=2
    • extra_dp_size=2
    • zero_stage=2
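
As referenced in the dataset item above, here is a minimal loading sketch; only the load_dataset() call and its arguments come from the setup, while the "train" split name and the print lines are assumptions added for inspection:

    # Minimal sketch: load the dataset from the setup above and peek at one example.
    # Only load_dataset() and its arguments come from the setup; the "train" split
    # name and the print lines are assumptions.
    from datasets import load_dataset

    ds = load_dataset("yizhongw/self_instruct", "super_natural_instructions")
    print(ds)              # available splits and their sizes
    print(ds["train"][0])  # one raw example, before any tokenization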

Issue 1: loss value is very large and cannot converge

During training, convergence seems unachievable: the training loss stays exceptionally high (above $10^{10}$) even after 10 epochs. The log below shows the three final loss terms: aux_loss, z_loss, and the ZCrossEntropy (ce) loss. While the first two look normal, the ce loss is far larger than expected.

[screenshot: training log showing aux_loss, z_loss, and the very large ce loss]

Furthermore, the default setup of the openmoe project uses bf16, so I suspect the issue might stem from an overflow problem. I therefore tried fp16 (fp32 does not seem to be supported in zero mode) but encountered the same problem.
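
To test the overflow hypothesis, a debugging sketch like the one below (my own code, not part of the openmoe example) can be dropped in around the loss computation; it logs the logit range and recomputes the cross entropy in fp32 for comparison:

    # Debugging sketch (not part of the openmoe example): inspect the logit range
    # and recompute the cross entropy in fp32 to see whether the huge ce value is
    # a bf16/fp16 overflow artifact.
    import torch
    import torch.nn.functional as F

    def debug_ce(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100):
        # logits: (batch, seq_len, vocab) in bf16/fp16; labels: (batch, seq_len)
        print("logit max/min:", logits.max().item(), logits.min().item())
        ce_fp32 = F.cross_entropy(
            logits.float().reshape(-1, logits.size(-1)),
            labels.reshape(-1),
            ignore_index=ignore_index,
        )
        print("cross entropy recomputed in fp32:", ce_fp32.item())
        return ce_fp32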

Issue 2: when checkpoint loading is disabled, the loss value becomes nan

Additionally, when I comment out the two lines below, the loss value becomes nan. What datasets were the provided pretrained weights (Hugging Face hpcaitech/openmoe-base) trained on?

    if not test_mode:
        load_ckpt(repo_name, model, booster)

[screenshot: training log with the loss becoming nan]
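
A small diagnostic along these lines (my own sketch, not part of the example; `model` is assumed to be the openmoe-base model built by the script) might help narrow down whether the nan originates from the fresh initialization or from an early forward pass:

    # Diagnostic sketch (my own, not from the example): check the freshly
    # initialized parameters and the forward pass for non-finite values when
    # load_ckpt() is skipped.
    import torch

    def report_nonfinite_params(model: torch.nn.Module):
        for name, p in model.named_parameters():
            if not torch.isfinite(p).all():
                print(f"non-finite values in parameter: {name}")

    def nan_hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print(f"non-finite output from {module.__class__.__name__}")

    # usage, assuming `model` is the openmoe-base model built by the script:
    # report_nonfinite_params(model)
    # for m in model.modules():
    #     m.register_forward_hook(nan_hook)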

Are there specific aspects I should pay attention to or common pitfalls that might cause model training failures when working with openmoe? Your insights or suggestions on resolving this issue or optimizing the project would be immensely helpful. Thank you for your assistance!

Environment

  • torch 2.0.1+cu118
  • python 3.8.12
@flybird11111
Contributor

Thank you for your feedback; we will address this issue as soon as possible.

@noob-ctrl

I also encountered this problem
