Is this specific to transformers? #11
Comments
I've tested it on a two-layer diagonal MLP for classification that exhibits grokking, and Grokfast mitigates the grokking there.
Thank you very much for trying out our code. I would gently note that the code provided here is essentially a proof-of-concept demonstration of accelerating grokking in some previously known scenarios, so the filter design may be suboptimal in other scenarios. As mentioned in our paper, Transformers, MLPs, and LSTMs that exhibit grokking can benefit from Grokfast if a well-designed low-pass filter is applied. However, as you may have already noticed, MLPs and LSTMs need strong weight-norm regularization to see such a benefit. In other words, weight norms (well investigated in the Omnigrok paper) and low-pass-filtered gradients seem to have a synergistic effect when used together, and this effect should be investigated further. This is a very early stage of designing a good optimizer for models that grok, so I believe there are better filter designs for different types of tasks and models. The MA/EMA filters shown here are only a proof of concept, and further research is needed to find good filter designs beyond these simplest ones. Again, thanks for the valuable report.
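For readers unfamiliar with the idea of low-pass filtering gradients, here is a minimal sketch of an EMA filter in plain Python. It is an illustration under stated assumptions, not the repository's implementation: the function name `ema_filter_step`, the dict-of-lists gradient layout, the zero initialization of the filter state, and the default values of `alpha` and `lamb` are all chosen here for the example. The weight-norm regularization mentioned above is separate and is not shown.

```python
def ema_filter_step(grads, ema_state=None, alpha=0.98, lamb=2.0):
    """Apply one step of an EMA low-pass filter to a set of gradients.

    grads:     dict mapping a parameter name to a list of floats (current gradient).
    ema_state: dict with the same layout holding the running average,
               or None on the first call (assumed to start at zero).
    alpha:     EMA momentum; closer to 1 means a lower cutoff frequency.
    lamb:      how strongly the slow (low-frequency) component is amplified.

    Returns (filtered_grads, new_ema_state). Inputs are not modified.
    """
    if ema_state is None:
        # Assumption for this sketch: start the running average at zero.
        ema_state = {name: [0.0] * len(g) for name, g in grads.items()}

    new_state = {}
    filtered = {}
    for name, g in grads.items():
        # Running average: keeps slowly varying components, damps fast ones.
        ema = [alpha * e + (1.0 - alpha) * gi
               for e, gi in zip(ema_state[name], g)]
        new_state[name] = ema
        # Add the amplified slow component back onto the raw gradient.
        filtered[name] = [gi + lamb * e for gi, e in zip(g, ema)]
    return filtered, new_state
```

In a training loop, one would compute gradients, call the function once per step while carrying `new_ema_state` forward, and hand the filtered gradients to the optimizer. A gradient that keeps the same sign across steps is amplified by up to a factor of `1 + lamb`, while one that flips sign every step is left almost unchanged.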
I think the original article first discovered the grokking effect in transformers.
I have been experimenting with a seq2seq model for language translation, and I am not seeing any behavior that would indicate a state transition on the validation data.