Learning rate scaling for distributed training? #8

Open
RahulBhalley opened this issue Feb 18, 2023 · 4 comments

Comments

@RahulBhalley

Hi @lucidrains, thanks for this implementation.

I wonder whether you're using distributed training for your experiments. If so, do you scale your learning rate by the number of processes (GPUs), as suggested in Accelerate's docs, on top of the downscaling already recommended for the Lion optimizer (even if you aren't actually using Accelerate)?

If you don't scale learning rate, do you recommend doing so?
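For context, this is the kind of linear scaling I have in mind. A minimal sketch only: the model, `base_lr`, and weight decay are placeholders, and `lion_pytorch.Lion` is used the way this repo's README shows.

```python
import torch
from accelerate import Accelerator
from lion_pytorch import Lion

accelerator = Accelerator()

model = torch.nn.Linear(512, 512)  # placeholder model
base_lr = 3e-5                     # placeholder single-process Lion LR (already smaller than a typical AdamW LR)

# linear scaling rule: multiply the single-process LR by the number of processes
scaled_lr = base_lr * accelerator.num_processes

optimizer = Lion(model.parameters(), lr=scaled_lr, weight_decay=1e-2)
model, optimizer = accelerator.prepare(model, optimizer)
```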

@RahulBhalley
Author

By the way, Lion seems to take less time per training step to update the parameters than Adam or AdamW.
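For reference, here is roughly the per-parameter work one Lion step does. This is a hand-rolled sketch of the update from the Lion paper, not this repo's implementation, and the hyperparameter values are placeholders. It shows why each step is cheaper than Adam's: a single momentum buffer and a sign, with no second-moment estimate or bias correction.

```python
import torch

def lion_step(param, grad, exp_avg, lr=1e-4, betas=(0.9, 0.99), wd=1e-2):
    beta1, beta2 = betas
    # decoupled weight decay
    param.data.mul_(1 - lr * wd)
    # update direction is the sign of an interpolation of momentum and gradient
    update = exp_avg.mul(beta1).add(grad, alpha=1 - beta1).sign_()
    param.data.add_(update, alpha=-lr)
    # momentum: the only optimizer state Lion keeps (Adam keeps two buffers)
    exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)

# toy usage with random tensors
p, g, m = torch.randn(10), torch.randn(10), torch.zeros(10)
lion_step(p, g, m)
```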

@simasima121

I have run a few experiments and got unexpected results.

So far, it seems as if Lion doesn't follow the traditional LR scaling rule.

With a per-GPU batch size of 64 across multiple GPUs, it doesn't matter how much we scale the LR; training only converges about 10-20% faster.

For example, at iteration 100 the training loss on 1 GPU (batch size 64) is 0.001; on 4 GPUs (effective batch size 256) with a 4x larger LR, the iteration-100 loss is only around 0.0009.

I have tried keeping the LR the same and making it 2x larger, 4x larger, and much larger, but it doesn't help.

With Adam, the usual scaling laws do apply.

I would appreciate any ideas here.

Thanks
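To make the LR settings concrete, these are the two common scaling heuristics I tried. The numbers below are illustrative placeholders, not the actual LR from the runs above.

```python
# Illustrative only: linear vs. square-root LR scaling when growing the effective batch.
base_lr, base_bs = 1e-4, 64            # placeholder single-GPU settings
world_size = 4
effective_bs = base_bs * world_size    # 256

linear_lr = base_lr * (effective_bs / base_bs)         # 4x larger (classic linear rule)
sqrt_lr = base_lr * (effective_bs / base_bs) ** 0.5    # 2x larger (square-root rule)

print(f"linear: {linear_lr:.1e}, sqrt: {sqrt_lr:.1e}")
```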

@RahulBhalley
Author


@simasima121 That's interesting, thanks! I wonder whether you end up getting better results with Lion.
