
[BUG]: Colossalai-OpenMoE-8b : loss value is very large and cannot converge #5212

Open
hangchen426926 opened this issue Dec 28, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@hangchen426926
🐛 Describe the bug

I am currently running the Colossalai/examples/language/openmoe project with the following experimental setup:

datasets: load_dataset("yizhongw/self_instruct/data/finetuning/self_instruct_221203", "super_natural_instructions")
model: openmoe-8b
num A100 GPUs: 8
epochs: 3
batch size: 4
lr: 0.00001
zero_stage: 1
precision: bf16
boost plugin: ep_zero
extra_dp_size: 2
max_length: 2048
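
For reproduction, the hyperparameters above can be collected in one place. This is a minimal illustrative sketch, not the example script's actual CLI flags; the dict keys and the per-device interpretation of `batch size` are assumptions.

```python
# Hypothetical summary of the reported run configuration.
config = {
    "model": "openmoe-8b",
    "num_gpus": 8,          # A100 x 8
    "epochs": 3,
    "batch_size": 4,        # assumed to be per-device micro batch
    "lr": 1e-5,
    "zero_stage": 1,
    "precision": "bf16",
    "plugin": "ep_zero",
    "extra_dp_size": 2,
    "max_length": 2048,
}

# If batch_size is per device, the effective global batch per step would be:
global_batch = config["batch_size"] * config["num_gpus"]
print(global_batch)  # 32
```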

### Issue: loss value is very large and does not converge
During training, convergence seems unachievable: the training loss stays exceptionally high (above 2.3E+10) even after running 3 epochs. The log below shows the loss values:
[screenshot: training loss log]

Furthermore, in the default setup of the openmoe example, the lr is 0.00001. Suspecting the learning rate might be the cause, I attempted to adjust the lr parameter (from 0.00001 to 0.0000001), but still encountered the same problem.
[screenshots: training logs after lowering lr]

Environment

torch: 1.13.1
python: 3.8.17
cuda: 11.7.17

@hangchen426926 hangchen426926 added the bug Something isn't working label Dec 28, 2023
@Orion-Zheng
Contributor

Thank you for your valuable feedback! 😃 We are working on this bug and will get back to you later.

@noob-ctrl

I also encountered this problem

@noob-ctrl

@Orion-Zheng Has this bug been solved?

3 participants