🐛 Describe the bug
I am running the `Colossalai/examples/language/openmoe` project with the following experimental setup:
- dataset: `load_dataset("yizhongw/self_instruct/data/finetuning/self_instruct_221203", "super_natural_instructions")`
- model: openmoe-8b
- GPUs: 8 × A100
- epochs: 3
- batch size: 4
- lr: 0.00001
- zero_stage: 1
- precision: bf16
- boost plugin: ep_zero
- extra_dp_size: 2
- max_length: 2048
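As a first diagnostic under a setup like this, it can help to flag non-finite or runaway loss values as early as possible instead of letting training run for full epochs. The helper below is a minimal sketch (the function name and threshold are hypothetical, not part of the openmoe example); a loss of 2.3e10 would be flagged on the very first step it appears:

```python
import math

def loss_looks_healthy(step, loss, max_reasonable=1e4):
    """Return True if the loss value looks plausible for an LM.

    `max_reasonable` is an arbitrary cutoff chosen for illustration;
    typical language-model cross-entropy losses are in the single digits,
    so anything near 2.3e10 indicates divergence or a numerical problem.
    """
    if not math.isfinite(loss):
        print(f"step {step}: loss is not finite ({loss})")
        return False
    if loss > max_reasonable:
        print(f"step {step}: loss {loss:.3e} exceeds {max_reasonable:.0e}")
        return False
    return True
```

Calling this inside the training loop and stopping on the first `False` makes it much cheaper to bisect whether the loss is huge from step 0 (suggesting a data/precision issue) or explodes after some steps (suggesting divergence).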
### Issue: loss value is extremely large and does not converge
Training fails to converge: the loss stays extremely high (above 2.3e10) even after running 3 full epochs. The log excerpt below shows the loss values:
Furthermore, the openmoe project's default lr is 0.00001. Suspecting the learning rate was causing the loss to diverge, I lowered it from 0.00001 to 0.0000001, but the same problem persisted.
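For intuition on why lowering the learning rate is the natural first thing to try here, the toy example below (pure Python, not the openmoe code) runs gradient descent on f(w) = w², where any step size above 1.0 makes the iterates grow geometrically, mimicking an exploding loss, while a smaller step converges. That the reporter's loss stays huge even at lr = 1e-7 suggests the cause is something other than this simple mechanism:

```python
# Toy illustration of learning-rate-driven divergence on f(w) = w**2.
# The update is w <- w - lr * 2w = (1 - 2*lr) * w, so the iterates
# diverge when |1 - 2*lr| > 1 and converge when |1 - 2*lr| < 1.

def run_gd(lr, steps=50, w0=1.0):
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # df/dw for f(w) = w**2
        w = w - lr * grad
    return w * w            # final loss f(w)

diverged_loss = run_gd(lr=1.1)    # |1 - 2.2| = 1.2 > 1, loss blows up
converged_loss = run_gd(lr=0.1)   # |1 - 0.2| = 0.8 < 1, loss shrinks
```

Since shrinking lr by two orders of magnitude did not change the behavior, the divergence mechanism above is unlikely to be the whole story; a data-preprocessing bug, a bad checkpoint load, or a bf16 overflow in the loss computation would all produce a huge loss that is insensitive to lr.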
Environment
- torch 1.13.1
- python 3.8.17
- cuda 11.7.17