
Is this specific to transformers? #11

Closed
phalexo opened this issue Jul 7, 2024 · 2 comments


phalexo commented Jul 7, 2024

I think the original paper first discovered the grokking effect in transformers.

I have been experimenting with a seq2seq model for language translation, and I am not seeing any behavior that would indicate a phase transition on the validation data.


Zhi0467 commented Jul 9, 2024

I've tested it on a two-layer diagonal MLP for classification that exhibits grokking, and Grokfast does mitigate the grokking there.

ironjr (Owner) commented Jul 10, 2024

Thank you very much for trying out our code. I would like to gently note that the code provided here is primarily a proof-of-concept demonstration of accelerating grokking in previously known scenarios, so the filter design may be suboptimal in other types of scenarios.

As mentioned in our paper, Transformers, MLPs, and LSTMs undergoing the grokking phenomenon can benefit from Grokfast if a well-designed low-pass filter is applied. However, as you may have already noticed, MLPs and LSTMs require strong weight norm regularization to obtain this benefit. So weight norms (well investigated in the Omnigrok paper) and low-pass-filtered gradients appear to act synergistically when used together; this effect deserves further investigation.
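For concreteness, here is a minimal PyTorch sketch of an EMA-style gradient low-pass filter paired with strong weight decay, in the spirit of what is described above. The function name `ema_gradfilter`, the toy model, and the hyperparameter values (`alpha`, `lamb`, the weight decay) are illustrative assumptions, not this repository's exact API:

```python
import torch

def ema_gradfilter(model, ema_grads, alpha=0.98, lamb=2.0):
    # Hypothetical helper (not this repo's exact API): low-pass filter each
    # parameter's gradient with an exponential moving average, then add the
    # slow component back, scaled by `lamb`.
    # Call after loss.backward() and before optimizer.step().
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.detach()
        if name not in ema_grads:
            ema_grads[name] = g.clone()
        else:
            ema_grads[name].mul_(alpha).add_(g, alpha=1 - alpha)
        # Amplify the low-frequency (slow-varying) gradient component.
        p.grad.add_(ema_grads[name], alpha=lamb)
    return ema_grads

# Toy setup so the sketch runs end-to-end; the weight decay term plays the
# role of the weight norm regularization discussed above.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
criterion = torch.nn.CrossEntropyLoss()
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
ema_grads = {}
for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    ema_grads = ema_gradfilter(model, ema_grads)
    optimizer.step()
```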

This is a very early stage in designing a good optimizer for models under grokking, so I believe better filter designs must exist for different types of tasks and models. In other words, the MA/EMA filters shown here are only a proof of concept, and finding better filter designs (beyond the simplest MA/EMA filters) calls for further research. Again, thanks for the valuable report.
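For comparison, a sketch of the simple windowed moving-average (MA) variant mentioned above; again, the name `ma_gradfilter`, the window size, and `lamb` are illustrative placeholders rather than this repository's exact implementation:

```python
from collections import deque
import torch

def ma_gradfilter(model, grad_window, window_size=100, lamb=5.0):
    # Hypothetical helper: keep the last `window_size` raw gradients per
    # parameter and add their mean (the low-pass component) back, scaled
    # by `lamb`. Call after loss.backward(), before optimizer.step().
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        window = grad_window.setdefault(name, deque(maxlen=window_size))
        window.append(p.grad.detach().clone())
        avg = torch.stack(list(window)).mean(dim=0)
        p.grad.add_(avg, alpha=lamb)
    return grad_window
```

Note that the EMA variant above needs only one buffer per parameter, while this MA variant stores a whole window of gradients, which is one practical reason to look beyond the simplest MA filter.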

ironjr closed this as completed Jul 10, 2024