🐛 Describe the bug
I am running the `Colossalai/examples/language/openmoe` project with the following experimental setup:
- dataset: `load_dataset("yizhongw/self_instruct/data/finetuning/self_instruct_221203", "super_natural_instructions")`
- model: openmoe-8b
- GPUs: 8 × A100
- epochs: 3
- batch size: 4
- lr: 0.00001
- zero_stage: 1
- precision: bf16
- boost plugin: ep_zero
- extra_dp_size: 2
- max_length: 2048
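As a first diagnostic under a setup like this, it can help to flag non-finite or runaway loss values as early as possible instead of letting training run for full epochs. The helper below is a minimal sketch (the function name and threshold are hypothetical, not part of the openmoe example); a loss of 2.3e10 would be flagged on the very first step it appears:

```python
import math

def loss_looks_healthy(step, loss, max_reasonable=1e4):
    """Return True if the loss value looks plausible for an LM.

    `max_reasonable` is an arbitrary cutoff chosen for illustration;
    typical language-model cross-entropy losses are in the single digits,
    so anything near 2.3e10 indicates divergence or a numerical problem.
    """
    if not math.isfinite(loss):
        print(f"step {step}: loss is not finite ({loss})")
        return False
    if loss > max_reasonable:
        print(f"step {step}: loss {loss:.3e} exceeds {max_reasonable:.0e}")
        return False
    return True
```

Calling this inside the training loop and stopping on the first `False` makes it much cheaper to bisect whether the loss is huge from step 0 (suggesting a data/precision issue) or explodes after some steps (suggesting divergence).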
### Issue: loss value is extremely large and does not converge
Training fails to converge: the loss stays extremely high (above 2.3e10) even after running 3 full epochs. The log excerpt below shows the loss values:
Furthermore, the openmoe project's default lr is 0.00001. Suspecting the learning rate was causing the loss to diverge, I lowered it from 0.00001 to 0.0000001, but the same problem persisted.
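For intuition on why lowering the learning rate is the natural first thing to try here, the toy example below (pure Python, not the openmoe code) runs gradient descent on f(w) = w², where any step size above 1.0 makes the iterates grow geometrically, mimicking an exploding loss, while a smaller step converges. That the reporter's loss stays huge even at lr = 1e-7 suggests the cause is something other than this simple mechanism:

```python
# Toy illustration of learning-rate-driven divergence on f(w) = w**2.
# The update is w <- w - lr * 2w = (1 - 2*lr) * w, so the iterates
# diverge when |1 - 2*lr| > 1 and converge when |1 - 2*lr| < 1.

def run_gd(lr, steps=50, w0=1.0):
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # df/dw for f(w) = w**2
        w = w - lr * grad
    return w * w            # final loss f(w)

diverged_loss = run_gd(lr=1.1)    # |1 - 2.2| = 1.2 > 1, loss blows up
converged_loss = run_gd(lr=0.1)   # |1 - 0.2| = 0.8 < 1, loss shrinks
```

Since shrinking lr by two orders of magnitude did not change the behavior, the divergence mechanism above is unlikely to be the whole story; a data-preprocessing bug, a bad checkpoint load, or a bf16 overflow in the loss computation would all produce a huge loss that is insensitive to lr.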
Environment
- torch 1.13.1
- python 3.8.17
- cuda 11.7.17