ValueError at 79 epoch #82

Closed
someitalian123 opened this issue Jul 28, 2024 · 3 comments
Labels
question Further information is requested

Comments

@someitalian123

I'm receiving this error, which prevents me from continuing. It occurs during the 79th epoch.

Traceback (most recent call last):
  File "D:\neosr\train.py", line 350, in <module>
    train_pipeline(str(root_path))
  File "D:\neosr\train.py", line 256, in train_pipeline
    model.optimize_parameters(current_iter)  # type: ignore[reportFunctionMemberAccess,attr-defined]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\neosr\neosr\models\image.py", line 649, in optimize_parameters
    self.closure(current_iter)
  File "D:\neosr\neosr\models\image.py", line 633, in closure
    raise ValueError(msg)
ValueError:

                  NaN found, aborting training. Make sure you're using a proper learning rate.
                  If you have AMP enabled, try using bfloat16. For more information:
                  https://github.com/muslll/neosr/wiki/Configuration-Walkthrough

I am new to AI upscaling, so I'm not sure how to troubleshoot it. bfloat16 is already enabled. The GPU I'm using is an RTX 3070. Below is my config for the model.

name = "An4x"
model_type = "image"
scale = 4
use_amp = true
bfloat16 = true
fast_matmul = true
#compile = true
#manual_seed = 1024

[datasets.train]
type = "paired"
dataroot_gt = 'D:\neosr\datasets\gt\'
dataroot_lq = 'D:\neosr\datasets\lq\'
patch_size = 64
batch_size = 8
#accumulate = 1
#augmentation = [ "none", "mixup", "cutmix", "resizemix" ] # [ "cutblur" ]
#aug_prob = [ 0.5, 0.1, 0.1, 0.1 ] # [ 0.7 ]

#[datasets.val]
#name = "val"
#type = "paired"
#dataroot_gt = 'D:\datasets\val\gt\'
#dataroot_lq = 'D:\datasets\val\lq\'
#[val]
#val_freq = 1000
#tile = 200
#[val.metrics.psnr]
#type = "calculate_psnr"
#[val.metrics.ssim]
#type = "calculate_ssim"
#[val.metrics.dists]
#type = "calculate_dists"
#better = "lower"

[path]
#pretrain_network_g = 'experiments\pretrain_g.pth'
#pretrain_network_d = 'experiments\pretrain_d.pth'

[network_g]
type = "esrgan"

[network_d]
type = "ea2fpn"

[train]
ema = 0.999
#sam = "fsam"
#sam_init = 1000
#eco = true
#eco_init = 15000
#wavelet_guided = true
#wavelet_init = 80000
#match_lq_colors = true

[train.optim_g]
type = "adan_sf"
lr = 1.6e-3
betas = [ 0.98, 0.92, 0.987 ]
weight_decay = 0.02
schedule_free = true
warmup_steps = 1600

[train.optim_d]
type = "adan_sf"
lr = 8e-4
betas = [ 0.98, 0.92, 0.995 ]
weight_decay = 0.02
schedule_free = true
#warmup_steps = 600

#  losses
[train.mssim_opt]
type = "mssim_loss"
loss_weight = 1.0

[train.perceptual_opt]
type = "vgg_perceptual_loss"
loss_weight = 1.0
criterion = "huber"
#patchloss = true
#ipk = true
#patch_weight = 1.0

[train.gan_opt]
type = "gan_loss"
gan_type = "bce"
loss_weight = 0.3

[train.color_opt]
type = "color_loss"
loss_weight = 1.0
criterion = "huber"

[train.luma_opt]
type = "luma_loss"
loss_weight = 1.0
criterion = "huber"

#[train.dists_opt]
#type = "dists_loss"
#loss_weight = 0.5

#[train.ldl_opt]
#type = "ldl_loss"
#loss_weight = 1.0
#criterion = "huber"

#[train.ff_opt]
#type = "ff_loss"
#loss_weight = 1.0

#[train.gw_opt]
#type = "gw_loss"
#loss_weight = 1.0
#criterion = "chc"

[logger]
total_iter = 1000000
save_checkpoint_freq = 1000
use_tb_logger = true
#save_tb_img = true
#print_freq = 100

@neosr-project added the question label on Jul 28, 2024
@neosr-project
Owner

Hi @someitalian123. Could you try decreasing the learning rate? I recently decreased it in the default template configs as well, after some other users reported instability in certain cases. You can see the recommended values here.
Another suggestion: at the beginning of training, use a larger batch size, particularly if you're training from scratch (no pretrained model). Training without a pretrain is highly discouraged, by the way.
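For illustration only, a minimal sketch of those two changes against the config above (the linked page has the actual recommended values; the numbers here are just examples):

[datasets.train]
type = "paired"
batch_size = 16     # example only: something larger than 8, as far as VRAM allows

[train.optim_g]
type = "adan_sf"
lr = 1e-3           # example only: lowered from the 1.6e-3 in the original config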

@someitalian123
Author

Could you try decreasing the learning rate? [...] Another suggestion: at the beginning of training, use a larger batch size, particularly if you're training from scratch (no pretrained model).

I just copied and pasted the values that you linked, and I have made it to epoch 80. I will have to run it a bit longer to see if it happens again. While it's running, though, the console still says that the learning rate is 1.60e-03 instead of the 1e-3 that I changed it to.

28-07-2024 02:00 PM | INFO: [ epoch: 80 ] [ iter: 73,900 ] [ performance: 1.065 it/s ] [ lr: 1.60e-03 ] [ eta: 7 days, 3:11:22 ] [ l_g_mssim: 9.9570e-02 ] [ l_g_percep: 5.0608e+00 ] [ l_g_color: 3.1832e-04 ] [ l_g_luma: 4.0132e-03 ] [ l_g_gan: 1.8426e+00 ] [ l_g_total: 7.0073e+00 ] [ l_d_real: 1.3882e-05 ] [ out_d_real: 1.2296e+01 ] [ l_d_fake: 2.2195e-03 ] [ out_d_fake: -6.1791e+00 ] [ l_d_total: 1.1167e-03 ]

I'm wondering if the change actually took effect or if something else is going on. Do I need to delete the current contents of the experiments folder and start the training over? Or is it fine to continue after changing the learning rate in the config file?

As far as increasing the batch size is concerned, I wasn't sure if it would be better to have a larger batch size or a larger patch size. Right now, while training is running, 7.1/8.0 GB of my dedicated VRAM is in use. I would likely need to decrease the patch size to 32 if I wanted to increase the batch size, right?

@neosr-project
Owner

While it's running, though, the console still says that the learning rate is 1.60e-03 instead of the 1e-3

Yes, changing the learning rate doesn't work right now if you resume from the state dict (such as when using --auto_resume). If you want to change it, you must use the current model as a pretrain and start training again.
However, if it's not giving you NaN, you should just proceed. The learning rate of 1.6e-3 should work perfectly well, since the schedule-free optimizer blends the parameters (it adapts). If you do want more stable training, though, decreasing it to 1.2e-3 or 1e-3 should do the trick.
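Sketched against the config above, that would mean something like the following (the checkpoint paths are placeholders; point them at whichever model files were last saved under your experiments folder):

[path]
pretrain_network_g = 'D:\neosr\experiments\An4x\models\net_g_latest.pth'   # placeholder: your latest saved generator
pretrain_network_d = 'D:\neosr\experiments\An4x\models\net_d_latest.pth'   # placeholder: optional, latest discriminator

[train.optim_g]
type = "adan_sf"
lr = 1.2e-3   # or 1e-3, per the suggestion above

and then starting a fresh run (not resuming with --auto_resume), so the new learning rate actually takes effect.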

I would likely need to decrease the patch size to 32 if I wanted to increase the batch size, right?

Yes. Normally you should start with a larger batch and a patch_size of 32. Then later in training (>40-60k iterations) you can start to decrease the batch size and increase patch_size. This strategy is called "curriculum learning" in the ML literature.
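A rough sketch of that progression (the iteration threshold and the batch/patch numbers are only examples, not exact prescriptions):

# early phase: training from scratch
[datasets.train]
patch_size = 32
batch_size = 16     # example value

# later phase: after roughly 40-60k iterations, restarting from the current model as a pretrain
[datasets.train]
patch_size = 64
batch_size = 8

The two [datasets.train] blocks are alternatives for different stages of training, not meant to appear in the same config file.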
