Add AdaFactor optimizer from fairseq #6722
Conversation
… MLM -- reduced memory consumption compared to ADAM.
Hey @sshleifer -- here is the belated PR for AdaFactor. Please let me know how to edit this properly, and what tests or examples we should add. Thanks!
This is gonna be awesome!
Want to add a test similar to test_adamw here?
Also, I can take over whenever!
We will integrate into examples/ in a separate PR, I think.
Thanks @sshleifer -- let me try to make those changes. Agree that I should be able to add a single test -- appreciate the link -- and you can add examples in a separate PR. If I don't get this figured out soon, I'm happy for you to make the changes yourself :-)
…ransformers into add_fairseq_adafactor
Hey @sshleifer -- I think I got a test working finally. We can squash the commits. Still not sure what I need to clean up for the code standards/linter. Please advise, thanks!
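For context, a rough sketch of what a convergence test in the style of test_adamw might look like, assuming the Adafactor class added in this PR; the hyperparameters and tolerance here are illustrative and may differ from the merged test:

```python
# Rough sketch of an Adafactor convergence test in the style of test_adamw;
# values shown are illustrative assumptions, not the exact merged test.
import torch

from transformers import Adafactor


def test_adafactor_converges():
    w = torch.tensor([0.1, -0.2, -0.1], requires_grad=True)
    target = torch.tensor([0.4, 0.2, -0.5])
    criterion = torch.nn.MSELoss()
    # A fixed learning rate (relative_step=False) keeps the test deterministic.
    optimizer = Adafactor(
        params=[w],
        lr=1e-2,
        relative_step=False,
        scale_parameter=False,
        warmup_init=False,
    )
    for _ in range(1000):
        loss = criterion(w, target)
        loss.backward()
        optimizer.step()
        w.grad.detach_()  # avoid carrying graph history between iterations
        w.grad.zero_()
    assert torch.allclose(w, target, atol=1e-2)
```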
For local style checking, you need:

```bash
sty () {
    make style
    flake8 examples templates tests src utils
}
```

and then run `sty`.
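As a rough guide, here is what the helper boils down to; `black` and `isort` being behind `make style` is an assumption based on the later commit titles in this PR, and the exact flags may differ by repo version:

```bash
# Approximate expansion of the sty helper above; flags omitted, targets as listed.
black examples templates tests src utils    # auto-format code
isort examples templates tests src utils    # sort imports
flake8 examples templates tests src utils   # report remaining lint issues (does not auto-fix)
```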
Also, squashing happens automatically at merge time, so don't worry about that.
Hmm. Is there a way for
If you also run the flake8 command, it should just fix it.
I think I fixed the formatting, as requested. Took a sec to figure that all out...
src/transformers/optimization.py (Outdated)

# Alternatively, relative_step with warmup_init can also be used.
# Training without LR warmup or clip threshold is not recommended. Additional optimizer operations
# like gradient clipping should not be used.
(nit)
This "second docstring" breaks style convention. I am OK with leaving it here because it is very useful, but I would prefer to consolidate it with the class docstring below.
Gotcha. It's up to you -- happy to move it, or you can consolidate the docstring in a future PR.
Let me try to make the change and see if you like it.
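For reference, a minimal sketch of how the recommendations quoted above from optimization.py (relative step sizes with warmup_init, no external LR warmup, no gradient clipping) translate into constructing the optimizer; the model choice and keyword values are illustrative assumptions:

```python
# Minimal sketch, assuming the Adafactor class introduced in this PR.
from transformers import Adafactor, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# lr=None with relative_step=True lets Adafactor derive its own step size,
# and warmup_init=True handles warmup internally -- so no external LR
# scheduler or gradient clipping is layered on top.
optimizer = Adafactor(
    model.parameters(),
    lr=None,
    relative_step=True,
    warmup_init=True,
    scale_parameter=True,
)
```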
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
@sshleifer -- any idea what happened with the
Yes they did, sorry about that. I did some cleanup on this branch.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #6722      +/-   ##
==========================================
- Coverage   78.96%   78.94%   -0.03%
==========================================
  Files         157      157
  Lines       28486    28571      +85
==========================================
+ Hits        22495    22555      +60
- Misses       5991     6016      +25
```

Continue to review full report at Codecov.
Awesome. Thanks @sshleifer. I'll start working more on the other, less mature PRs we discussed. And please ping me if/when you write tests or examples for this -- happy to contribute to that as well if you need.
Great, thanks a lot! Cool test as well.
I've added Adafactor to the docs and slightly changed the style of the docstrings in #6765.
Thanks! I'll add a
* AdaFactor optimizer ported from fairseq. Tested for T5 finetuning and MLM -- reduced memory consumption compared to ADAM.
* update PR fixes, add basic test
* bug -- incorrect params in test
* bugfix -- import Adafactor into test
* bugfix -- removed accidental T5 include
* resetting T5 to master
* bugfix -- include Adafactor in __init__
* longer loop for adafactor test
* remove double error class declare
* lint
* black
* isort
* Update src/transformers/optimization.py
  Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
* single docstring
* Cleanup docstring

Co-authored-by: Nikolai Y <nikolai.yakovenko@point72.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
This reverts commit 006deb9.
Tested for T5 finetuning and MLM -- reduced memory consumption compared to ADAM.
Fixes #1256