ValueError at 79 epoch #82
Hi @someitalian123. Could you try decreasing the learning rate? I recently lowered it in the default template configs as well, after other users reported instability in some cases. You can see the recommended values here.
I copied and pasted the values you linked and have made it to epoch 80. I will have to run it a bit longer to see if it happens again. While it's running, though, the console still says the learning rate is 1.60e-03 instead of the 1e-3 I changed it to.

I'm wondering if the change actually took effect or if something else is going on. Do I need to delete the current contents of the experiments folder and restart training, or is it fine to continue after changing the learning rate in the config file? As for increasing the batch size, I wasn't sure whether a larger batch size or a larger patch size would be better. Right now, while training is running, 7.1/8.0 GB of my dedicated VRAM is in use. I would likely need to decrease the patch size to 32 if I wanted to increase the batch size, right?
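As a rough rule of thumb (an approximation, not an exact model of any particular network), activation memory scales roughly linearly with batch size and quadratically with patch size, which is why halving the patch size frees enough VRAM for a much larger batch:

```python
def relative_vram(batch_size: int, patch_size: int) -> int:
    """Rough relative activation-memory estimate:
    linear in batch size, quadratic in patch size."""
    return batch_size * patch_size ** 2

# Dropping the patch size from 64 to 32 frees ~4x the activation memory,
# so the batch size can grow ~4x at roughly the same VRAM usage:
print(relative_vram(8, 64))   # 32768
print(relative_vram(32, 32))  # 32768
```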
Yes, changing the learning rate doesn't take effect right now if you resume from the state dict (such as when using
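The reason resuming ignores the config change is that the optimizer's saved state dict carries the old learning rate. A minimal PyTorch sketch of a common workaround (overriding `param_groups` manually after resuming; the model and values here are illustrative, not from this project's code):

```python
import torch

# Toy model and optimizer standing in for the real training setup.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1.6e-3)

# Simulate resuming: the saved optimizer state carries the old lr.
state = optimizer.state_dict()
optimizer.load_state_dict(state)

# Override the restored learning rate manually after resuming:
for group in optimizer.param_groups:
    group["lr"] = 1e-3

print(optimizer.param_groups[0]["lr"])  # 0.001
```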
Yes. Normally you should start with a larger batch and a patch_size of 32. Then later in training (>40-60k iterations) you can start to decrease the batch size and increase patch_size. This strategy is called "curriculum learning" in the ML literature.
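A sketch of what such a schedule could look like as code (the thresholds and sizes below are illustrative assumptions, not values from this project's configs):

```python
def curriculum(iteration: int) -> dict:
    """Hypothetical curriculum schedule: large batch with small patches
    early on, shifting toward larger patches past ~40-60k iterations."""
    if iteration < 40_000:
        return {"batch_size": 32, "patch_size": 32}
    elif iteration < 60_000:
        return {"batch_size": 16, "patch_size": 48}
    else:
        return {"batch_size": 8, "patch_size": 64}

print(curriculum(10_000))  # {'batch_size': 32, 'patch_size': 32}
print(curriculum(70_000))  # {'batch_size': 8, 'patch_size': 64}
```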
I'm receiving this error, which prevents me from continuing. It occurs during the 79th epoch.
I am new to AI upscaling, so I'm not sure how to troubleshoot it. Bfloat16 is already enabled. The GPU I'm using is an RTX 3070. Below is my config for the model.