
"adamw" optimizer + weight decay = poor generations #170

Open
afiaka87 opened this issue Apr 6, 2021 · 12 comments
Open

"adamw" optimizer + weight decay = poor generations #170

afiaka87 opened this issue Apr 6, 2021 · 12 comments

Comments

@afiaka87
Contributor

afiaka87 commented Apr 6, 2021

#139 (comment)

It appears that AdamW does work better, but the weight decay is producing strange generations.

I'm getting the same strange "brown" generations even though the loss continues to go down, albeit at a pretty slow rate. And if you're training with --fp16, it's tough to tell the generations are poor until after training, since you can't log images through wandb in that mode.

@afiaka87 afiaka87 closed this as completed Apr 6, 2021
@afiaka87 afiaka87 reopened this Apr 7, 2021
@afiaka87
Contributor Author

afiaka87 commented Apr 7, 2021

Is a good temporary fix just to set the weight_decay parameter to zero? @kobiso said as much, but I assumed that effectively turns it into a plain old Adam optimizer? Out of my depth here.
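
If I understand right, the decoupled weight-decay term is the only difference between the two update rules, so with weight_decay=0 they should take identical steps. A minimal sketch (reusing the dalle / LEARNING_RATE names from later in this thread, so purely illustrative):

from torch.optim import Adam, AdamW

# with weight_decay=0 the decoupled decay term vanishes, so this...
opt = AdamW(dalle.parameters(), lr=LEARNING_RATE, weight_decay=0.0)

# ...should take the same steps as plain Adam
opt = Adam(dalle.parameters(), lr=LEARNING_RATE)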

@afiaka87 afiaka87 closed this as completed Apr 7, 2021
@afiaka87 afiaka87 reopened this Apr 7, 2021
@afiaka87
Contributor Author

afiaka87 commented Apr 7, 2021

@lucidrains Noticed the adamw removal. Should I keep this open since it's from the paper?

@kobiso
Contributor

kobiso commented Apr 7, 2021

Yep, let's keep it open since it's from the paper :)

@robvanvolt
Contributor

The default weight_decay is 0.0 anyway, isn't it?

@kobiso
Contributor

kobiso commented Apr 8, 2021

@robvanvolt The default weight_decay is 0, but the DALL-E paper used 4.5e-2.

@afiaka87
Contributor Author

@kobiso @lucidrains @robvanvolt

So - I'm not having this problem anymore. I'm not sure exactly when we fixed it, but I can no longer reproduce this issue.

These are two of a bunch of good samples I'm getting while training on a t-shirt dataset.

I tried to follow the paper (with regard to the optimizer).

from torch.optim import AdamW
opt = AdamW(dalle.parameters(), lr=LEARNING_RATE, betas=(0.9, 0.96), weight_decay=4.5e-2, amsgrad=True)

I've also found 3.7e-4 to be a decent learning rate; that's what I used here.

Due to experimentation and the sunk cost fallacy, this network has the attention types:

attn_types=('full', 'axial_row', 'axial_col', 'full')

[two sample generations from the t-shirt run]

@afiaka87
Contributor Author

#220

@lucidrains this has been steadily improving my results. I say we put it back in.

@afiaka87
Contributor Author

afiaka87 commented Apr 29, 2021

Okay, AdamW with the OpenAI defaults is merged back in:

#220

@afiaka87 afiaka87 reopened this May 1, 2021
@afiaka87
Contributor Author

afiaka87 commented May 1, 2021

Hm - so I realize now that the problem is actually that the optimizer and scheduler state isn't stored with the model checkpoint for resuming. If you have both AdamW and LR decay turned on and try to resume, the scheduler starts over with a learning rate tuned for the beginning of training, causing the bad generations.
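
Something like this (a hypothetical sketch, not the repo's actual checkpoint format - the names are illustrative) would store all three states so a resume picks up where the schedule left off:

import torch

# save model, optimizer, and scheduler state together in one checkpoint
torch.save({
    'weights': dalle.state_dict(),
    'opt': opt.state_dict(),
    'scheduler': scheduler.state_dict(),
}, 'dalle.pt')

# on resume, restore all three so the LR schedule continues instead of restarting
ckpt = torch.load('dalle.pt')
dalle.load_state_dict(ckpt['weights'])
opt.load_state_dict(ckpt['opt'])
scheduler.load_state_dict(ckpt['scheduler'])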

@janEbert is that in your deepspeed fix branch?

@janEbert
Contributor

janEbert commented May 3, 2021

Yeah, DeepSpeed by default loads (and saves) the optimizer and LR scheduler states. So the DeepSpeed checkpoints do not have this problem with the default settings.

The default non-DeepSpeed checkpoints are not suited for resuming, only for inference!
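
For anyone following along, a minimal sketch of the calls involved (model_engine being the engine returned by deepspeed.initialize; the directory and tag names are illustrative):

# saving writes model, optimizer, and LR scheduler state under the given tag
model_engine.save_checkpoint('checkpoints', tag='latest')

# loading restores all of them with the default settings
# (load_optimizer_states and load_lr_scheduler_states default to True)
load_path, client_state = model_engine.load_checkpoint('checkpoints', tag='latest')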

@afiaka87
Contributor Author

afiaka87 commented May 3, 2021

I'm advising we remove AdamW from the main codebase again - I was flat-out wrong about it working, unfortunately.

Here's a PR which does so: #227

[screenshot of poor generations, 2021-05-03]

Okay! I actually have no clue what's causing this, because it happened without any resume involved.

This is such a subtle thing to catch because it requires you to run a few epochs to see it happening. Here's a full run where I did not use resume and the problem still occurs.

https://wandb.ai/afiaka87/starting_over/reports/Weight-Decay-Bug--Vmlldzo2NTgxMjA?accessToken=rigmz991xq7blj8fuesbwtrnmyi86nsjranwmphzj79unjx8ilu4akjow2pqd86i

@shizhediao

Any updates on this?
Can I use AdamW with weight decay?
I'm getting similar brown generations.
