
"adamw" optimizer + weight decay = poor generations #170

Open
afiaka87 opened this issue Apr 6, 2021 · 12 comments
Open

"adamw" optimizer + weight decay = poor generations #170

afiaka87 opened this issue Apr 6, 2021 · 12 comments

Comments

@afiaka87
Contributor

afiaka87 commented Apr 6, 2021

#139 (comment)

It appears that AdamW does work better, but the weight decay is producing strange generations.

I'm getting the same strange "brown" generations even though the loss continues to go down, albeit at a pretty slow rate. And if you're training with --fp16, it's tough to tell the generations are poor until after training, since you can't log images through wandb in that mode.

@afiaka87 afiaka87 closed this as completed Apr 6, 2021
@afiaka87 afiaka87 reopened this Apr 7, 2021
@afiaka87
Contributor Author

afiaka87 commented Apr 7, 2021

Is a good temporary fix just to set the weight_decay parameter to zero? @kobiso said as much, but I assumed that effectively turns it into a plain old Adam optimizer? Out of my depth here.
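
If I understand right, the decoupled weight-decay term is the only difference between the two update rules, so with weight_decay=0 they should take identical steps. A minimal sketch (reusing the dalle / LEARNING_RATE names from later in this thread, so purely illustrative):

from torch.optim import Adam, AdamW

# with weight_decay=0 the decoupled decay term vanishes, so this...
opt = AdamW(dalle.parameters(), lr=LEARNING_RATE, weight_decay=0.0)

# ...should take the same steps as plain Adam
opt = Adam(dalle.parameters(), lr=LEARNING_RATE)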

@afiaka87 afiaka87 closed this as completed Apr 7, 2021
@afiaka87 afiaka87 reopened this Apr 7, 2021
@afiaka87
Contributor Author

afiaka87 commented Apr 7, 2021

@lucidrains Noticed the adamw removal. Should I keep this open since it's from the paper?

@kobiso
Contributor

kobiso commented Apr 7, 2021

Yep, let's keep it open since it's from the paper :)

@robvanvolt
Contributor

The default weight_decay is 0.0 anyway, isn't it?

@kobiso
Contributor

kobiso commented Apr 8, 2021

@robvanvolt The default weight_decay is 0, but the DALL-E paper used 4.5e-2.

@afiaka87
Contributor Author

@kobiso @lucidrains @robvanvolt

So - I'm not having this problem anymore. I'm not sure exactly when we fixed it, but I can no longer reproduce this issue.

These are two of a bunch of good samples I'm getting while training on a t-shirt dataset.

I tried to follow the paper (with regard to the optimizer).

from torch.optim import AdamW
opt = AdamW(dalle.parameters(), lr=LEARNING_RATE, betas=(0.9, 0.96), weight_decay=4.5e-2, amsgrad=True)

I've also found 3.7e-4 to be a decent learning rate; that's what I used here.

Due to experimentation and the sunk cost fallacy, this network has the attention types:

attn_types=('full', 'axial_row', 'axial_col', 'full')

[two sample generations from the t-shirt run]

@afiaka87
Contributor Author

#220

@lucidrains this has been steadily improving my results. I say we put it back in.

@afiaka87
Contributor Author

afiaka87 commented Apr 29, 2021

Okay, AdamW with the OpenAI defaults is merged back in:

#220

@afiaka87 afiaka87 reopened this May 1, 2021
@afiaka87
Contributor Author

afiaka87 commented May 1, 2021

Hm - so I realize now that the problem is actually that the optimizer and scheduler state isn't stored with the model checkpoint for resuming. If you have both AdamW and LR decay turned on and try to resume, the scheduler starts over with a learning rate tuned for the beginning of training, causing the bad generations.
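
Something like this (a hypothetical sketch, not the repo's actual checkpoint format - the names are illustrative) would store all three states so a resume picks up where the schedule left off:

import torch

# save model, optimizer, and scheduler state together in one checkpoint
torch.save({
    'weights': dalle.state_dict(),
    'opt': opt.state_dict(),
    'scheduler': scheduler.state_dict(),
}, 'dalle.pt')

# on resume, restore all three so the LR schedule continues instead of restarting
ckpt = torch.load('dalle.pt')
dalle.load_state_dict(ckpt['weights'])
opt.load_state_dict(ckpt['opt'])
scheduler.load_state_dict(ckpt['scheduler'])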

@janEbert is that in your deepspeed fix branch?

@janEbert
Contributor

janEbert commented May 3, 2021

Yeah, DeepSpeed by default loads (and saves) the optimizer and LR scheduler states. So the DeepSpeed checkpoints do not have this problem with the default settings.

The default non-DeepSpeed checkpoints are not suited for resuming, only for inference!
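
For anyone following along, a minimal sketch of the calls involved (model_engine being the engine returned by deepspeed.initialize; the directory and tag names are illustrative):

# saving writes model, optimizer, and LR scheduler state under the given tag
model_engine.save_checkpoint('checkpoints', tag='latest')

# loading restores all of them with the default settings
# (load_optimizer_states and load_lr_scheduler_states default to True)
load_path, client_state = model_engine.load_checkpoint('checkpoints', tag='latest')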

@afiaka87
Contributor Author

afiaka87 commented May 3, 2021

I'm advising we remove AdamW from the main codebase again - I was flat-out wrong about it working, unfortunately.

Here's a PR which does so: #227

[screenshot of poor generations, 2021-05-03]

Okay! I actually have no clue what's causing this, because it happened without any resume involved.

This is such a subtle thing to catch because it requires you to run a few epochs to see it happening. Here's a full run where I did not use resume and the problem still occurs.

https://wandb.ai/afiaka87/starting_over/reports/Weight-Decay-Bug--Vmlldzo2NTgxMjA?accessToken=rigmz991xq7blj8fuesbwtrnmyi86nsjranwmphzj79unjx8ilu4akjow2pqd86i

@shizhediao

Any updates on this?
Can I use AdamW with weight decay?
I'm getting similar brown generations.
