"adamw" optimizer + weight decay = poor generations #170
Is a good temporary solution to this to just set the
@lucidrains Noticed the adamw removal. Should I keep this open since it's from the paper?
Yep, let's keep it open since it's from the paper :)
The default weight_decay is 0.0 anyway, isn't it?
@robvanvolt The default weight_decay is 0, but the DALL-E paper used 4.5e-2.
@kobiso @lucidrains @robvanvolt So, I'm not having this problem anymore. I'm not sure exactly when we fixed it, but I can no longer reproduce this issue. These are two of a bunch of good samples I'm getting training on a t-shirts dataset.

I tried to follow the paper with regard to the optimizer: opt = AdamW(dalle.parameters(), lr=LEARNING_RATE, betas=(0.9, 0.96), weight_decay=4.5e-2, amsgrad=True). I've also found a decent learning rate to be 3.7e-4; that's what I used here.

Due to experimentation and the sunk cost fallacy, this network has the attention types attn_types=('full', 'axial_row', 'axial_col', 'full').
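The optimizer settings quoted above can be sketched as a small runnable snippet. This is a minimal sketch, not the project's actual training script: the `Linear` model stands in for `dalle`, and `LEARNING_RATE` is just the value reported in the comment.

```python
import torch
from torch.optim import AdamW

model = torch.nn.Linear(8, 8)   # stand-in for the real dalle model
LEARNING_RATE = 3.7e-4          # learning rate reported to work well in this thread

# Optimizer configured as in the comment above, following the DALL-E paper:
# betas=(0.9, 0.96), weight_decay=4.5e-2, amsgrad on.
opt = AdamW(
    model.parameters(),
    lr=LEARNING_RATE,
    betas=(0.9, 0.96),
    weight_decay=4.5e-2,
    amsgrad=True,
)
print(opt.defaults["lr"], opt.defaults["weight_decay"])
```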
@lucidrains this has been steadily improving my results. I say we put it back in.
Okay, AdamW with the OpenAI defaults is merged back in:
Hm, so I realize now that the problem is actually that the state of the optimizer and scheduler is not stored alongside the model for resuming. If you have both AdamW and LR decay turned on and try to resume, the scheduler will restart with a learning rate tuned for the beginning of training, causing the bad generations. @janEbert is that handled in your deepspeed fix branch?
Yeah, DeepSpeed by default loads (and saves) the optimizer and LR scheduler states, so the DeepSpeed checkpoints do not have this problem with the default settings. The default non-DeepSpeed checkpoints are not suited for resuming, only for inference!
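The resume fix being described can be sketched in plain PyTorch: save the optimizer and scheduler `state_dict`s alongside the model, and restore all three on resume so the learning-rate schedule continues where it left off instead of restarting. This is a hedged sketch, assuming an `ExponentialLR` schedule as a stand-in for the project's actual LR decay; the checkpoint keys are illustrative.

```python
import io
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import ExponentialLR

model = torch.nn.Linear(4, 4)   # stand-in for the real model
opt = AdamW(model.parameters(), lr=3.7e-4, weight_decay=4.5e-2)
sched = ExponentialLR(opt, gamma=0.99)

# Advance the schedule a few steps, as a real training run would.
for _ in range(5):
    opt.step()
    sched.step()

# Checkpoint all three states, not just the model weights.
ckpt = {
    "model": model.state_dict(),
    "optimizer": opt.state_dict(),
    "scheduler": sched.state_dict(),
}
buf = io.BytesIO()              # in-memory file, standing in for a .pt on disk
torch.save(ckpt, buf)

# On resume, restore everything so the LR keeps decaying from where it was,
# rather than jumping back to the initial learning rate.
buf.seek(0)
ckpt = torch.load(buf)
model.load_state_dict(ckpt["model"])
opt.load_state_dict(ckpt["optimizer"])
sched.load_state_dict(ckpt["scheduler"])
print(sched.get_last_lr())
```

Without the `optimizer`/`scheduler` entries, a resumed run would rebuild the scheduler at its initial learning rate, which is exactly the failure mode described above.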
I'm advising we get rid of AdamW from the main codebase again; I was flat wrong about it working, unfortunately. Here's a PR which does so: #227
This is such a subtle thing to catch because it requires you to run a few epochs to see it happening. Here's a full run where I did not use resume and the problem still occurred.
Any updates on this? |
#139 (comment)
It appears as though AdamW does work better, but the weight decay is creating strange generations.
Getting the same strange "brown" generations even though the loss continues to go down. It decreases at a pretty slow rate, and if you're working with --fp16 it's tough to know the generations are poor until after training, due to the inability to submit images through wandb.