EMA Bug #87
Comments
btw, how do you usually debug/test when adding new functions or starting a new repo? I find my efficiency is quite low (I either run from the command line and wait for an error, or copy the code into a Jupyter notebook and test again and again...)
@CiaoHe i've come full circle and just use a simple |
@lucidrains lol. But when moving to a cluster to train, things get out of control sometimes (I hate bugs)
🪰 🪱 🐞
Hi Phil,
This morning I tried to run the decoder training part. I decided to use DecoderTrainer but found an issue with the EMA update. After sampling with decoder_trainer, the next training forward pass throws a RuntimeError:
DALLE2-pytorch/dalle2_pytorch/train.py
Lines 108 to 118 in 6f76652
I checked up_weight.device (online model) and old_weight.device (EMA model) and found that the online model is on cuda:0 but the EMA model is on cpu. It's really weird. I debugged for a long time, and I think it might be caused by the DecoderTrainer.sample() process. When swapping between the EMA and online models, there seems to be a device-related problem.
DALLE2-pytorch/dalle2_pytorch/train.py
Lines 298 to 308 in 6021945
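To make the mismatch concrete, here is a minimal sketch of a device-safe EMA update in plain PyTorch. It is not the repo's code or its fix: `device_of`, `sync_device`, and `ema_update` are hypothetical helper names, and the `beta` value is an arbitrary assumption. The idea is only that the EMA copy gets moved back to the online model's device before the weights are mixed.

```python
import copy

import torch
from torch import nn


def device_of(module: nn.Module) -> torch.device:
    """Device of the module's first parameter (hypothetical helper)."""
    return next(module.parameters()).device


def sync_device(ema_model: nn.Module, online_model: nn.Module) -> None:
    """Move the EMA copy onto the online model's device if they differ."""
    target = device_of(online_model)
    if device_of(ema_model) != target:
        ema_model.to(target)  # nn.Module.to moves parameters in place


def ema_update(ema_model: nn.Module, online_model: nn.Module, beta: float = 0.99) -> None:
    """In-place update: ema = beta * ema + (1 - beta) * online."""
    with torch.no_grad():
        for ema_p, online_p in zip(ema_model.parameters(), online_model.parameters()):
            # Guard against a leftover mismatch on a per-tensor basis as well.
            new_w = online_p.detach().to(ema_p.device)
            ema_p.lerp_(new_w, 1.0 - beta)


# Reproduce the reported state: online model on the GPU (if one exists),
# EMA copy left behind on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
online = nn.Linear(4, 2)
ema = copy.deepcopy(online)
online.to(device)

sync_device(ema, online)  # without this, mixing cuda and cpu tensors can fail
ema_update(ema, online)
```

Calling something like `sync_device` right after sampling (or at the start of every EMA update) would keep the two copies together, assuming the swap in `sample()` really is what leaves the EMA model on the CPU.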
Hope to hear your solution
Enjoy!