Instability when resuming trains #13
Comments
Hi, thanks for testing the Lion optimizer.
No worries! It's a really cool idea, so it will be nice if it can consistently improve on Adam! It is too early to say in my own experiments. For Lion I am using the defaults (0.9, 0.99). For AdamW I have had successful runs with the defaults (0.9, 0.999) and also with (0.9, 0.99). Still tuning, so not sure what the optimal values are.
Sounds good. Another betas setting (0.95, 0.98) can help with the training stability.
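For reference, a minimal sketch of how these betas settings would be passed, assuming the lion-pytorch package and a plain torch.optim.AdamW baseline; the model here is just a placeholder module and the exact lr / weight-decay values follow the rough 1/10th LR, 10x weight-decay rule of thumb mentioned in this thread:

```python
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(512, 512)  # placeholder module for illustration

# Lion with the (0.95, 0.98) betas suggested above for stability
lion_opt = Lion(model.parameters(), lr=1e-4, betas=(0.95, 0.98), weight_decay=1e-2)

# AdamW baseline with its defaults, or (0.9, 0.99) as also tried above
adamw_opt = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-3)
```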
@xiangning-chen have you run into this stability issue yourself?
Not on diffusion models; when I encountered instability I just lowered the learning rate. On language modeling, I found that lowering beta2 improves stability for both AdamW and Lion.
@angusturner have you tried the suggestions?
Betas (0.95, 0.98) considerably reduced my instabilities, thanks for the tip @xiangning-chen!
@clementpoiret nice! 🙏
@clementpoiret Thanks for the update! For fine-tuning, are you referring to using Lion to fine-tune an AdamW trained model? |
@xiangning-chen, Yup, it's a pretrained EfficientNet from timm. I replaced the classifier with my own MLP. |
Did you load the AdamW optimizer state from pre-training?
Not at all, I just loaded the weights from timm. (Please note that the strange double descent at the start is normal, I also have it using other optimizers.) |
@xiangning-chen do you have any experiments showing that loading adam momentum into lion for fine tuning is better? happy to add that feature, provided it isn't just a hunch you have |
oh, actually, loading adam optimizer state works fine as is, ok no worries |
Oh, I meant that when using Adam for both pre-training and fine-tuning, loading the 1st and 2nd moments is helpful. I never tried loading the Adam momentum into Lion, as their EMA parameters are different.
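For context, a hedged sketch of what loading the pre-training optimizer state looks like in plain PyTorch when the same optimizer class is kept for fine-tuning; the checkpoint file name and keys are illustrative, not this repo's convention:

```python
import torch

model = torch.nn.Linear(32, 1)                              # placeholder fine-tuning model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # same optimizer class as pre-training

# Illustrative checkpoint layout; the file name and keys are assumptions.
checkpoint = torch.load("pretrain_checkpoint.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])

# Only meaningful when pre-training and fine-tuning use the same optimizer
# class (e.g. AdamW -> AdamW). Lion keeps a single interpolated EMA per
# parameter, so AdamW's exp_avg / exp_avg_sq buffers do not map onto it directly.
optimizer.load_state_dict(checkpoint["optimizer"])
```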
I got the same problem that loss explodes immediately when resuming checkpoints when |
I think this is because of #24 (comment) |
To your point @mitchellnw, from triton:
Looks like that is what happens: multiple kernel launches update the model weights incorrectly a few times in a row. UPD: removing autotune and setting a fixed
@ipoletaev ah yes, Mitchell filled me in on this issue through email this morning. Do you want to see if 6ab873a resolves it? The more permanent solution would be to submit a PR to Triton to clone all tensors prior to auto-tuning.
IMO it makes sense to just remove autotune and keep it simple: let the user specify the block size they need.
@ipoletaev yea true, i'll do that if this hack doesn't do the trick |
@ipoletaev actually, you are right, this battle isn't worth fighting |
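To make the autotune problem concrete, here is a generic illustrative Triton kernel, not the repository's actual Lion kernel: under @triton.autotune each benchmarked config re-runs the kernel, so an in-place parameter store gets applied once per config, whereas launching with a user-specified BLOCK_SIZE runs it exactly once.

```python
# Generic sketch, not lion-pytorch's real kernel: an in-place SGD-style update
# launched with a fixed BLOCK_SIZE instead of @triton.autotune.
import triton
import triton.language as tl

@triton.jit
def inplace_update_kernel(p_ptr, g_ptr, n_elements, lr, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    p = tl.load(p_ptr + offsets, mask=mask)
    g = tl.load(g_ptr + offsets, mask=mask)
    # this in-place store is what would get repeated under autotuning
    tl.store(p_ptr + offsets, p - lr * g, mask=mask)

def update_(p, g, lr, block_size=1024):
    # p and g are CUDA tensors of the same shape; block_size is user-chosen
    n = p.numel()
    grid = (triton.cdiv(n, block_size),)
    inplace_update_kernel[grid](p, g, n, lr, BLOCK_SIZE=block_size)
```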
Hi, I have been testing this out on some diffusion models I am training. Convergence seems decent (somewhat faster than AdamW, using 1/10th the learning rate and 10x the weight decay).

However, I recently paused a few experiments and tried to resume, and the loss explodes immediately. I do not face this issue when resuming AdamW trains.

I have also found it necessary to use an LR warm-up period in my trains (even with the 1/10th learning rate), which again is not required with AdamW. I'll try to do a bit more digging to see if I can track down the source of the instability; however, for resuming experiments, surely if I load the optimizer state correctly things should resume as expected?

My only thought is whether something could be going wrong with the saving of the EMA / moving-average statistics? If I get a chance to dig into this more I'll let you know what I find. (Possibly I am doing something wrong.)
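For comparison, a hedged sketch of the kind of linear LR warm-up described above, using a generic LambdaLR schedule; the model, the stand-in dataloader, and warmup_steps are placeholders, not the poster's actual setup:

```python
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(32, 1)  # placeholder model for illustration
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=1e-1)

warmup_steps = 1000  # illustrative value
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),  # linear warm-up, then constant
)

for batch in (torch.randn(8, 32) for _ in range(10)):  # stand-in dataloader
    loss = model(batch).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```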