
Did you increase the decoupled weight decay simultaneously when decreasing the learning rate? #2

Closed
xiangning-chen opened this issue Feb 15, 2023 · 4 comments


@xiangning-chen
Contributor

Thanks for implementing and testing our Lion optimizer!
Just wondering, did you also enlarge the decoupled weight decay to maintain the regularization strength?
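
For context, with decoupled weight decay each step shrinks the weights by an amount proportional to lr * weight_decay, so keeping that product constant is what preserves the regularization strength. A minimal sketch with purely illustrative numbers (not values from the paper or this repo):

```python
# Illustrative numbers only -- not hyperparameters prescribed by the paper.
adamw_lr, adamw_wd = 3e-4, 0.1             # hypothetical AdamW baseline
lion_lr = adamw_lr / 3                     # Lion is typically run with a smaller learning rate

# Decoupled weight decay shrinks weights by roughly lr * wd per step,
# so scale wd up by the same factor the learning rate went down.
lion_wd = adamw_wd * (adamw_lr / lion_lr)  # 0.1 * 3 = 0.3

assert abs(adamw_lr * adamw_wd - lion_lr * lion_wd) < 1e-12
```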

best,
--xiangning

@lucidrains
Owner

@xiangning-chen Hi Xiangning! Thank you for this interesting paper

So far I have only been testing with weight decay turned off. There are a lot of networks that are still trained with just plain Adam, and I wanted to see how Lion fares against Adam alone.

@lucidrains
Owner

@xiangning-chen but yes, I have noted the section in the paper where you said the weight decay needs to be higher

Let me add that to the readme to increase the chances people train it correctly
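
As a hedged sketch of what that readme guidance might look like, assuming the Lion class in this repo exposes an AdamW-style constructor with lr and weight_decay arguments (the numbers are illustrative, not recommendations from the paper):

```python
import torch
from torch import nn
from lion_pytorch import Lion  # optimizer implemented in this repo

model = nn.Linear(10, 2)

# If a baseline were tuned with AdamW(lr=3e-4, weight_decay=0.1), a Lion run
# with a 3x smaller learning rate would raise the decoupled weight decay by
# the same factor, keeping lr * weight_decay roughly constant.
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=0.3)

loss = model(torch.randn(8, 10)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```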

@xiangning-chen
Contributor Author

xiangning-chen commented Feb 15, 2023

Thanks for the update!
Yeah, disabling weight decay for both optimizers is a pretty meaningful and fair comparison, thank you!

@lucidrains
Owner

@xiangning-chen ok, good luck! I hope this technique holds up to scrutiny!
