
[BUG] Loss does not decrease in the ALBERT example (max_step=125000) #447

Closed · elricwan opened this issue Jan 18, 2022 · 5 comments
Labels: bug (Something isn't working)

elricwan (Contributor) commented Jan 18, 2022

Describe the bug
I ran the ALBERT example with WikiText data, using a single peer and the default settings (target_batch_size=4096, train_batch_size=4, max_step=125000, lr=0.00176). The loss did not decrease during training: it started at about 11 and was still about 11 at the end.

Jan 15 10:30:14.734 [INFO] Step #1 loss = 11.04938
Jan 15 10:32:14.842 [INFO] Step #2 loss = 11.05589
Jan 15 10:34:14.975 [INFO] Step #3 loss = 11.06803
Jan 15 10:36:15.093 [INFO] Step #4 loss = 11.06271
Jan 15 10:38:15.228 [INFO] Step #5 loss = 11.06433
Jan 15 10:40:15.337 [INFO] Step #6 loss = 11.05447
Jan 15 10:41:45.401 [INFO] Step #7 loss = 11.06115
Jan 15 10:43:45.541 [INFO] Step #8 loss = 11.06025
..........
Jan 15 18:09:13.117 [INFO] Step #238 loss = 11.05597
Jan 15 18:11:13.233 [INFO] Step #239 loss = 11.06724
Jan 15 18:13:13.369 [INFO] Step #240 loss = 11.06289
Jan 15 18:15:13.494 [INFO] Step #241 loss = 11.05922
Jan 15 18:16:43.577 [INFO] Step #242 loss = 11.05226
Jan 15 18:18:43.691 [INFO] Step #243 loss = 11.05418
Jan 15 18:20:43.843 [INFO] Step #244 loss = 11.05638

To Reproduce
Run the scripts from the albert example.
For the monitor, I run:

python run_training_monitor.py \
  --experiment_prefix albert_experiment \
  --wandb_project albert_wandb

For the trainer, I run:

IP=/ip4/192.168.0.188/tcp/45731/p2p/QmSRerwCPUSreHhwMuTLHoVHqTfWuT8J57w3sXFZtU8ECo

WANDB_DISABLED=true CUDA_VISIBLE_DEVICES=0 python run_trainer.py \
  --experiment_prefix albert_experiment \
  --initial_peers $IP \
  --logging_first_step \
  --output_dir ./outputs \
  --overwrite_output_dir \
  --logging_dir ./logs \
  --dataset_path="/home/protago/Xiangpeng/hivemind/examples/albert/data/albert_tokenized_wikitext" \
  --per_device_train_batch_size 4 \
  --learning_rate 0.00176 \
  --num_train_epochs=5 \
  --save_steps=60000

Environment
If the script doesn't work, please report pytorch and numpy versions manually. We also encourage you to include any additional information that you believe can help us solve the issue.

elricwan added the bug label on Jan 18, 2022
justheuristic (Member) commented Jan 19, 2022

Hi (and thanks for reporting the issue!),

I reproduced the error on my side -- the loss indeed does not decrease in your setup.
I then restarted with two training peers, and the loss decreased normally:
[screenshot: loss curve decreasing with two peers]

Can you please check whether it works on your side as well?

To avoid waiting for a long time, please reduce the batch size and warmup: --target_batch_size 256 --warmup_steps 500.
Otherwise, the learning rate warmup alone would take 3125 steps at batch size 4096, which only makes sense when you have 10-50 peers.

If you only have one GPU, just reduce the batch size until two trainers fit on it (please tell us if you run into problems with that; we'll figure something out).
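For illustration, here is a rough sketch of launching two trainer processes on a single GPU with the reduced settings above. It assumes run_trainer.py accepts --target_batch_size and --warmup_steps alongside the flags used earlier in this thread; the per-process output/log directories, the dataset path, and the halved per-device batch size are placeholder choices, not verified settings.

# Hypothetical sketch, not a verified command line: two trainers sharing one GPU
# with the reduced target batch size and warmup suggested above. Check the flags
# against `python run_trainer.py --help` before running.
IP=/ip4/192.168.0.188/tcp/45731/p2p/QmSRerwCPUSreHhwMuTLHoVHqTfWuT8J57w3sXFZtU8ECo

for i in 1 2; do
  WANDB_DISABLED=true CUDA_VISIBLE_DEVICES=0 python run_trainer.py \
    --experiment_prefix albert_experiment \
    --initial_peers $IP \
    --output_dir ./outputs_$i \
    --overwrite_output_dir \
    --logging_dir ./logs_$i \
    --dataset_path ./data/albert_tokenized_wikitext \
    --per_device_train_batch_size 2 \
    --target_batch_size 256 \
    --warmup_steps 500 \
    --learning_rate 0.00176 &
done
wait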

justheuristic (Member) commented Jan 19, 2022

I investigated what went wrong when training with only one trainer. Currently, hivemind.Optimizer is hard-wired to use the averaged gradients -- as in "averaged with peers".

If you are the only peer, gradients are never averaged, so the optimizer runs with zero gradients all the time.
This change should fix the problem in your specific case: 4ffd9ca
It seems I introduced that bug myself in #440. It only affects the GitHub version of hivemind (i.e. not the PyPI version).
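For intuition, here is a minimal sketch in plain PyTorch (purely illustrative, not hivemind code) of the failure mode described above: if the optimizer steps on gradients that end up as zeros instead of peer-averaged values, the parameters never move and the loss stays flat, much like the log at the top of this issue.

# Illustration only: stepping on zeroed-out gradients leaves the model unchanged,
# mimicking "averaged gradients" that were never actually averaged (single peer).
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

for step in range(3):
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                 # local gradients are computed...
    for p in model.parameters():
        p.grad.zero_()              # ...but the buffer the optimizer reads stays at zero
    opt.step()                      # a step on zero gradients changes nothing
    opt.zero_grad()
    print(f"step {step}: loss = {loss.item():.5f}")  # prints the same loss every time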

Thank you again for the report! We'll get the fix to master as soon as possible.

If possible, I'd appreciate it if you could verify that it works with 2 peers (and/or with the fix above) whenever you have time.

finger92 (Contributor) commented Jan 19, 2022

I tried it as well: with two peers the loss decreases, but with one peer it does not.

elricwan (Author) commented:

Yes, it works with two peers. Thank you for the quick response!

justheuristic (Member) commented:

Closing the issue. Feel free to add more if you encounter further problems.
