[BUG]: Colossalai-OpenMoE example failed to converge #5163

Open
marsggbo opened this issue Dec 5, 2023 · 2 comments
Labels: bug (Something isn't working)

Comments


marsggbo commented Dec 5, 2023

🐛 Describe the bug

Setup

I am currently running the Colossalai/examples/language/openmoe project with the following experimental setup:

  • dataset: load_dataset("yizhongw/self_instruct", "super_natural_instructions"); I also tried "wikitext-2" (a loading sketch follows this list)
  • model: openmoe-base
  • batch size: 2
  • 2 GPUs
  • parallel strategy:
    • pp_size=1
    • dp_size=1
    • ep_size=2
    • extra_dp_size=2
    • zero_stage=2
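
As referenced in the dataset item above, here is a minimal loading sketch; only the load_dataset() call and its arguments come from the setup, while the "train" split name and the print lines are assumptions added for inspection:

    # Minimal sketch: load the dataset from the setup above and peek at one example.
    # Only load_dataset() and its arguments come from the setup; the "train" split
    # name and the print lines are assumptions.
    from datasets import load_dataset

    ds = load_dataset("yizhongw/self_instruct", "super_natural_instructions")
    print(ds)              # available splits and their sizes
    print(ds["train"][0])  # one raw example, before any tokenization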

Issue 1: loss value is very large and cannot converge

During training, convergence seems unachievable: the training loss stays exceptionally high (above $10^{10}$) even after 10 epochs. The log below shows the three final loss terms: aux_loss, z_loss, and the ZCrossEntropy (ce) loss. While the first two look normal, the ce loss is far larger than expected.

[screenshot: training log showing aux_loss, z_loss, and the very large ce loss]

Furthermore, the default setup of the openmoe project uses bf16, so I suspect the issue might stem from an overflow problem. I therefore tried fp16 (fp32 does not seem to be supported in zero mode) but encountered the same problem.
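
To test the overflow hypothesis, a debugging sketch like the one below (my own code, not part of the openmoe example) can be dropped in around the loss computation; it logs the logit range and recomputes the cross entropy in fp32 for comparison:

    # Debugging sketch (not part of the openmoe example): inspect the logit range
    # and recompute the cross entropy in fp32 to see whether the huge ce value is
    # a bf16/fp16 overflow artifact.
    import torch
    import torch.nn.functional as F

    def debug_ce(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100):
        # logits: (batch, seq_len, vocab) in bf16/fp16; labels: (batch, seq_len)
        print("logit max/min:", logits.max().item(), logits.min().item())
        ce_fp32 = F.cross_entropy(
            logits.float().reshape(-1, logits.size(-1)),
            labels.reshape(-1),
            ignore_index=ignore_index,
        )
        print("cross entropy recomputed in fp32:", ce_fp32.item())
        return ce_fp32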

Issue 2: when checkpoint loading is disabled, the loss value becomes nan

Additionally, when I comment out the two lines below, the loss value becomes nan. What datasets were the provided pretrained weights (Hugging Face hpcaitech/openmoe-base) trained on?

    if not test_mode:
        load_ckpt(repo_name, model, booster)

[screenshot: training log with the loss becoming nan]
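
A small diagnostic along these lines (my own sketch, not part of the example; `model` is assumed to be the openmoe-base model built by the script) might help narrow down whether the nan originates from the fresh initialization or from an early forward pass:

    # Diagnostic sketch (my own, not from the example): check the freshly
    # initialized parameters and the forward pass for non-finite values when
    # load_ckpt() is skipped.
    import torch

    def report_nonfinite_params(model: torch.nn.Module):
        for name, p in model.named_parameters():
            if not torch.isfinite(p).all():
                print(f"non-finite values in parameter: {name}")

    def nan_hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print(f"non-finite output from {module.__class__.__name__}")

    # usage, assuming `model` is the openmoe-base model built by the script:
    # report_nonfinite_params(model)
    # for m in model.modules():
    #     m.register_forward_hook(nan_hook)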

Are there specific aspects I should pay attention to or common pitfalls that might cause model training failures when working with openmoe? Your insights or suggestions on resolving this issue or optimizing the project would be immensely helpful. Thank you for your assistance!

Environment

  • torch 2.0.1+cu118
  • python 3.8.12
@flybird11111
Contributor

Thank you for your feedback; we will address this issue as soon as possible.

@noob-ctrl

I also encountered this problem
