Instability when resuming trains #13
Comments
Hi, thanks for testing the Lion optimizer.
No worries! It's a really cool idea, so it will be nice if it can consistently improve on Adam! It is too early to say in my own experiments. For Lion I am using the defaults (0.9, 0.99). For AdamW I have had successful runs with the defaults (0.9, 0.999) and also with (0.9, 0.99). Still tuning, so not sure what the optimal values are.
Sounds good. Another betas setting (0.95, 0.98) can help with the training stability.
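For reference, a minimal sketch of how these betas settings would be passed, assuming the lion-pytorch package and a plain torch.optim.AdamW baseline; the model here is just a placeholder module and the exact lr / weight-decay values follow the rough 1/10th LR, 10x weight-decay rule of thumb mentioned in this thread:

```python
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(512, 512)  # placeholder module for illustration

# Lion with the (0.95, 0.98) betas suggested above for stability
lion_opt = Lion(model.parameters(), lr=1e-4, betas=(0.95, 0.98), weight_decay=1e-2)

# AdamW baseline with its defaults, or (0.9, 0.99) as also tried above
adamw_opt = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-3)
```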
@xiangning-chen have you run into this stability issue yourself?
Not on diffusion models; when I encountered instability I just lowered the learning rate. On language modeling, I found that lowering beta2 improves stability for both AdamW and Lion.
@angusturner have you tried the suggestions?
Betas (0.95, 0.98) considerably reduced my instabilities, thanks for the tip @xiangning-chen!
@clementpoiret nice! 🙏
@clementpoiret Thanks for the update! For fine-tuning, are you referring to using Lion to fine-tune an AdamW trained model? |
@xiangning-chen, Yup, it's a pretrained EfficientNet from timm. I replaced the classifier with my own MLP. |
Did you load the AdamW optimizer state from pre-training?
Not at all, I just loaded the weights from timm. (Please note that the strange double descent at the start is normal, I also have it using other optimizers.) |
@xiangning-chen do you have any experiments showing that loading adam momentum into lion for fine tuning is better? happy to add that feature, provided it isn't just a hunch you have |
oh, actually, loading adam optimizer state works fine as is, ok no worries |
Oh, I meant that when using Adam for both pre-training and fine-tuning, loading the 1st and 2nd moments is helpful. I never tried loading the Adam momentum into Lion, as their EMA parameters are different.
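For context, a hedged sketch of what loading the pre-training optimizer state looks like in plain PyTorch when the same optimizer class is kept for fine-tuning; the checkpoint file name and keys are illustrative, not this repo's convention:

```python
import torch

model = torch.nn.Linear(32, 1)                              # placeholder fine-tuning model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # same optimizer class as pre-training

# Illustrative checkpoint layout; the file name and keys are assumptions.
checkpoint = torch.load("pretrain_checkpoint.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])

# Only meaningful when pre-training and fine-tuning use the same optimizer
# class (e.g. AdamW -> AdamW). Lion keeps a single interpolated EMA per
# parameter, so AdamW's exp_avg / exp_avg_sq buffers do not map onto it directly.
optimizer.load_state_dict(checkpoint["optimizer"])
```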
I got the same problem that loss explodes immediately when resuming checkpoints when |
I think this is because of #24 (comment) |
To your point @mitchellnw, from triton:
Looks like that is what happens: multiple kernel launches update the model weights incorrectly a few times in a row. UPD: removing autotune and setting a fixed
@ipoletaev ah yes, Mitchell filled me in on this issue through email this morning. Do you want to see if 6ab873a resolves it? The more permanent solution would be to submit a PR to Triton to clone all tensors prior to auto-tuning.
IMO it makes sense to just remove autotune and keep it simple: let the user specify the block size they need.
@ipoletaev yea true, i'll do that if this hack doesn't do the trick |
@ipoletaev actually, you are right, this battle isn't worth fighting |
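To make the autotune problem concrete, here is a generic illustrative Triton kernel, not the repository's actual Lion kernel: under @triton.autotune each benchmarked config re-runs the kernel, so an in-place parameter store gets applied once per config, whereas launching with a user-specified BLOCK_SIZE runs it exactly once.

```python
# Generic sketch, not lion-pytorch's real kernel: an in-place SGD-style update
# launched with a fixed BLOCK_SIZE instead of @triton.autotune.
import triton
import triton.language as tl

@triton.jit
def inplace_update_kernel(p_ptr, g_ptr, n_elements, lr, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    p = tl.load(p_ptr + offsets, mask=mask)
    g = tl.load(g_ptr + offsets, mask=mask)
    # this in-place store is what would get repeated under autotuning
    tl.store(p_ptr + offsets, p - lr * g, mask=mask)

def update_(p, g, lr, block_size=1024):
    # p and g are CUDA tensors of the same shape; block_size is user-chosen
    n = p.numel()
    grid = (triton.cdiv(n, block_size),)
    inplace_update_kernel[grid](p, g, n, lr, BLOCK_SIZE=block_size)
```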
Hi, I have been testing this out on some diffusion models I am training. Convergence seems decent (somewhat faster than AdamW, using 1/10th the learning rate and 10x the weight decay).

However, I recently paused a few experiments and tried to resume, and the loss explodes immediately. I do not face this issue when resuming AdamW trains.

I have also found it necessary to use an LR warm-up period in my trains (even with the 1/10th learning rate), which again is not required with AdamW. I'll try to do a bit more digging to see if I can track down the source of the instability; however, for resuming experiments, surely if I load the optimizer state correctly things should resume as expected?

My only thought is whether something could be going wrong with the saving of the EMA / moving-average statistics? If I get a chance to dig into this more I'll let you know what I find. (Possibly I am doing something wrong.)
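For comparison, a hedged sketch of the kind of linear LR warm-up described above, using a generic LambdaLR schedule; the model, the stand-in dataloader, and warmup_steps are placeholders, not the poster's actual setup:

```python
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(32, 1)  # placeholder model for illustration
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=1e-1)

warmup_steps = 1000  # illustrative value
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),  # linear warm-up, then constant
)

for batch in (torch.randn(8, 32) for _ in range(10)):  # stand-in dataloader
    loss = model(batch).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```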