Finetuning a Pretrained Model Using MuP #31

Somewhat of a naive question, but say we have pretrained a model and now want to finetune it on a downstream task. Is there any reason we shouldn't replace the MuP layers with the equivalent torch layers? I have to imagine that we don't need to use MuP here, but want to make sure that this doesn't break anything if we replace them.

Comments
If it's already pretrained, you can replace torch layers with muP layers to allow you to use muP optimizers (that can scale per-layer lr with shape info), as long as you make sure to keep the model forward pass invariant when you switch out layers.
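For concreteness, here is a minimal sketch of what that swap could look like. The model, shapes, and weight rescaling are made up for illustration, and the details of the `mup` API it leans on (`MuReadout` dividing its input by `width_mult()`, `set_base_shapes(..., rescale_params=False)` leaving pretrained weights untouched, `MuAdam` reading per-parameter infshapes) are assumptions to verify against the `mup` README/source, not something stated in this thread:

```python
# Hypothetical sketch: swap a pretrained torch readout for MuReadout so a muP
# optimizer can apply shape-aware per-layer learning rates during fine-tuning.
import torch
import torch.nn as nn
from mup import MuReadout, MuAdam, set_base_shapes

class TinyModel(nn.Module):
    def __init__(self, d_model=256, vocab=1000):
        super().__init__()
        self.body = nn.Linear(d_model, d_model)
        self.head = nn.Linear(d_model, vocab)   # pretrained torch readout

    def forward(self, x):
        return self.head(torch.relu(self.body(x)))

pretrained = TinyModel(d_model=256)              # imagine weights loaded from a checkpoint
x = torch.randn(4, 256)
ref_out = pretrained(x)                          # reference forward pass to preserve

# Replace the torch readout with a MuReadout of the same shape.
mu_model = TinyModel(d_model=256)
mu_model.load_state_dict(pretrained.state_dict())
mu_model.head = MuReadout(256, 1000)

# Register infshapes; a narrower "base" model defines which dimensions count
# as width.  rescale_params=False because pretrained weights are copied in next.
base = TinyModel(d_model=64)
base.head = MuReadout(64, 1000)
set_base_shapes(mu_model, base, rescale_params=False)

# Copy the pretrained head weights.  Assuming MuReadout divides its input by
# width_mult() (with output_mult defaulting to 1), scaling the weight by
# width_mult() keeps the forward pass numerically unchanged.
with torch.no_grad():
    m = mu_model.head.width_mult()               # under these assumptions, 256 / 64 = 4
    mu_model.head.weight.copy_(pretrained.head.weight * m)
    mu_model.head.bias.copy_(pretrained.head.bias)

assert torch.allclose(mu_model(x), ref_out, atol=1e-5)

# Fine-tune with a muP optimizer that scales per-layer lr using the infshapes.
opt = MuAdam(mu_model.parameters(), lr=1e-4)
```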
Sorry, I was not clear! If we pretrained using MuP, we should replace the Readout layers with normal torch layers when fine-tuning, correct?
I think this is up to you. It's possible that replacing muP layers with torch layers can make it easier to apply established hyperparameters for fine-tuning. On the other hand, the muP layers themselves can open up better hyperparameter choices for fine-tuning as well.
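If you do want the torch-layer route, a rough sketch of folding a muP readout back into a plain `nn.Linear` is below. The helper name is invented, and it assumes `MuReadout` computes `linear(output_mult * x / width_mult())`, so check the scaling against the `mup` source before relying on it:

```python
# Hypothetical sketch: fold a pretrained MuReadout back into an equivalent
# nn.Linear so standard fine-tuning recipes and optimizers apply unchanged.
import torch
import torch.nn as nn
from mup import MuReadout

def mureadout_to_linear(readout: MuReadout) -> nn.Linear:
    """Build an nn.Linear whose forward matches the MuReadout's.

    Assumes MuReadout computes linear(output_mult * x / width_mult()), so the
    equivalent plain Linear uses weight * output_mult / width_mult().  Requires
    that set_base_shapes was already called, so width_mult() is defined.
    """
    scale = readout.output_mult / readout.width_mult()
    linear = nn.Linear(readout.in_features, readout.out_features,
                       bias=readout.bias is not None)
    with torch.no_grad():
        linear.weight.copy_(readout.weight * scale)
        if readout.bias is not None:
            linear.bias.copy_(readout.bias)
    return linear

# e.g. model.head = mureadout_to_linear(model.head), then fine-tune with torch.optim.AdamW.
```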