
Should base=None be used in set_base_shapes for model used for tuning? #25

callumm-graphcore opened this issue Nov 3, 2022 · 2 comments

@callumm-graphcore

Hello! First of all, thank you for doing such great work and making it so accessible. I'm looking at using mup for a project but I'm a bit confused about how to set the base shapes for the smaller model used for hyperparameter tuning.

Let's say I want to train an MLP with hidden dimension 1024, and I want to muTransfer the best learning rate from an MLP with hidden dimension 128. My top-level code might look like this:

import mup

# `MLP`, `full_training_loop`, and `learning_rates` are assumed to be defined elsewhere.
best_loss = float('inf')
best_lr = 0.

# Hyperparameter sweep with hidden dimension 128
for lr in learning_rates:

    small_mlp = MLP(hidden_dim=128)

    # use `base=None` in `set_base_shapes`
    small_mlp = mup.set_base_shapes(small_mlp, base=None)

    final_loss = full_training_loop(small_mlp, lr=lr)

    if final_loss < best_loss:
        best_loss = final_loss
        best_lr = lr

# Transfer optimal LR to large model

base_mlp = MLP(hidden_dim=128)
big_mlp = MLP(hidden_dim=1024)

big_mlp = mup.set_base_shapes(big_mlp, base=base_mlp)

ultimate_loss = full_training_loop(big_mlp, lr=best_lr)

or like this:

best_loss = float('inf')
best_lr = 0.

for lr in learning_rates:

    small_mlp = MLP(hidden_dim=128)

    # use a base model in `set_base_shapes`
    smaller_mlp = MLP(hidden_dim=32)
    small_mlp = mup.set_base_shapes(small_mlp, base=smaller_mlp)

    final_loss = full_training_loop(small_mlp, lr=lr)

    if final_loss < best_loss:
        best_loss = final_loss
        best_lr = lr

# Transfer optimal LR to large model

base_mlp = MLP(hidden_dim=128)
big_mlp = MLP(hidden_dim=1024)

big_mlp = mup.set_base_shapes(big_mlp, base=base_mlp)

ultimate_loss = full_training_loop(big_mlp, lr=best_lr)

Could you please clarify which of these would be correct? Thank you very much for your time!

@thegregyang
Contributor

Thanks for the kind words!

You should do the 2nd thing. base=None essentially means not using muP.
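
For concreteness, here is a minimal sketch contrasting the two calls, reusing the MLP class assumed in the question; the comments paraphrase the point above rather than the library internals:

import mup

# With an explicit base model, parameter shapes are interpreted relative to the
# base model's widths, which is what makes the tuned learning rate transferable
# under muP.
small_mlp = mup.set_base_shapes(MLP(hidden_dim=128), base=MLP(hidden_dim=32))

# With base=None, the base shapes default to the model's own shapes, so all the
# width ratios are 1 and training behaves like the standard parametrization,
# i.e. muP is effectively switched off.
plain_mlp = mup.set_base_shapes(MLP(hidden_dim=128), base=None)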

@callumm-graphcore
Author

Great, thanks Greg!
