
Should base=None be used in set_base_shapes for model used for tuning? #25

callumm-graphcore opened this issue Nov 3, 2022 · 2 comments

@callumm-graphcore

Hello! First of all, thank you for doing such great work and making it so accessible. I'm looking at using mup for a project but I'm a bit confused about how to set the base shapes for the smaller model used for hyperparameter tuning.

Let's say I want to train an MLP with hidden dimension 1024, and I want to muTransfer the best learning rate from an MLP with hidden dimension 128. My top-level code might look like this:

import mup

# `MLP`, `full_training_loop`, and `learning_rates` are assumed to be defined elsewhere.
best_loss = float('inf')
best_lr = 0.

# Hyperparameter sweep with hidden dimension 128
for lr in learning_rates:

    small_mlp = MLP(hidden_dim=128)

    # use `base=None` in `set_base_shapes`
    small_mlp = mup.set_base_shapes(small_mlp, base=None)

    final_loss = full_training_loop(small_mlp, lr=lr)

    if final_loss < best_loss:
        best_loss = final_loss
        best_lr = lr

# Transfer optimal LR to large model

base_mlp = MLP(hidden_dim=128)
big_mlp = MLP(hidden_dim=1024)

big_mlp = mup.set_base_shapes(big_mlp, base=base_mlp)

ultimate_loss = full_training_loop(big_mlp, lr=best_lr)

or like this:

best_loss = float('inf')
best_lr = 0.

for lr in learning_rates:

    small_mlp = MLP(hidden_dim=128)

    # use a base model in `set_base_shapes`
    smaller_mlp = MLP(hidden_dim=32)
    small_mlp = mup.set_base_shapes(small_mlp, base=smaller_mlp)

    final_loss = full_training_loop(small_mlp, lr=lr)

    if final_loss < best_loss:
        best_loss = final_loss
        best_lr = lr

# Transfer optimal LR to large model

base_mlp = MLP(hidden_dim=128)
big_mlp = MLP(hidden_dim=1024)

big_mlp = mup.set_base_shapes(big_mlp, base=base_mlp)

ultimate_loss = full_training_loop(big_mlp, lr=best_lr)

Could you please clarify which of these would be correct? Thank you very much for your time!

@thegregyang
Contributor

Thanks for the kind words!

You should do the 2nd thing. base=None essentially means not using muP.
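
For concreteness, here is a minimal sketch contrasting the two calls, reusing the MLP class assumed in the question; the comments paraphrase the point above rather than the library internals:

import mup

# With an explicit base model, parameter shapes are interpreted relative to the
# base model's widths, which is what makes the tuned learning rate transferable
# under muP.
small_mlp = mup.set_base_shapes(MLP(hidden_dim=128), base=MLP(hidden_dim=32))

# With base=None, the base shapes default to the model's own shapes, so all the
# width ratios are 1 and training behaves like the standard parametrization,
# i.e. muP is effectively switched off.
plain_mlp = mup.set_base_shapes(MLP(hidden_dim=128), base=None)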

@callumm-graphcore
Author

Great, thanks Greg!
