ValueError at 79 epoch #82

Closed
someitalian123 opened this issue Jul 28, 2024 · 3 comments
Labels
question Further information is requested

Comments

@someitalian123

I'm receiving this error, which prevents me from continuing. It occurs during the 79th epoch.

Traceback (most recent call last):
  File "D:\neosr\train.py", line 350, in <module>
    train_pipeline(str(root_path))
  File "D:\neosr\train.py", line 256, in train_pipeline
    model.optimize_parameters(current_iter)  # type: ignore[reportFunctionMemberAccess,attr-defined]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\neosr\neosr\models\image.py", line 649, in optimize_parameters
    self.closure(current_iter)
  File "D:\neosr\neosr\models\image.py", line 633, in closure
    raise ValueError(msg)
ValueError:

                  NaN found, aborting training. Make sure you're using a proper learning rate.
                  If you have AMP enabled, try using bfloat16. For more information:
                  https://github.com/muslll/neosr/wiki/Configuration-Walkthrough

I am new to AI upscaling, so I'm not sure how to troubleshoot it. bfloat16 is already enabled. The GPU I'm using is an RTX 3070. Below is my config for the model.

name = "An4x"
model_type = "image"
scale = 4
use_amp = true
bfloat16 = true
fast_matmul = true
#compile = true
#manual_seed = 1024

[datasets.train]
type = "paired"
dataroot_gt = 'D:\neosr\datasets\gt\'
dataroot_lq = 'D:\neosr\datasets\lq\'
patch_size = 64
batch_size = 8
#accumulate = 1
#augmentation = [ "none", "mixup", "cutmix", "resizemix" ] # [ "cutblur" ]
#aug_prob = [ 0.5, 0.1, 0.1, 0.1 ] # [ 0.7 ]

#[datasets.val]
#name = "val"
#type = "paired"
#dataroot_gt = 'D:\datasets\val\gt\'
#dataroot_lq = 'D:\datasets\val\lq\'
#[val]
#val_freq = 1000
#tile = 200
#[val.metrics.psnr]
#type = "calculate_psnr"
#[val.metrics.ssim]
#type = "calculate_ssim"
#[val.metrics.dists]
#type = "calculate_dists"
#better = "lower"

[path]
#pretrain_network_g = 'experiments\pretrain_g.pth'
#pretrain_network_d = 'experiments\pretrain_d.pth'

[network_g]
type = "esrgan"

[network_d]
type = "ea2fpn"

[train]
ema = 0.999
#sam = "fsam"
#sam_init = 1000
#eco = true
#eco_init = 15000
#wavelet_guided = true
#wavelet_init = 80000
#match_lq_colors = true

[train.optim_g]
type = "adan_sf"
lr = 1.6e-3
betas = [ 0.98, 0.92, 0.987 ]
weight_decay = 0.02
schedule_free = true
warmup_steps = 1600

[train.optim_d]
type = "adan_sf"
lr = 8e-4
betas = [ 0.98, 0.92, 0.995 ]
weight_decay = 0.02
schedule_free = true
#warmup_steps = 600

#  losses
[train.mssim_opt]
type = "mssim_loss"
loss_weight = 1.0

[train.perceptual_opt]
type = "vgg_perceptual_loss"
loss_weight = 1.0
criterion = "huber"
#patchloss = true
#ipk = true
#patch_weight = 1.0

[train.gan_opt]
type = "gan_loss"
gan_type = "bce"
loss_weight = 0.3

[train.color_opt]
type = "color_loss"
loss_weight = 1.0
criterion = "huber"

[train.luma_opt]
type = "luma_loss"
loss_weight = 1.0
criterion = "huber"

#[train.dists_opt]
#type = "dists_loss"
#loss_weight = 0.5

#[train.ldl_opt]
#type = "ldl_loss"
#loss_weight = 1.0
#criterion = "huber"

#[train.ff_opt]
#type = "ff_loss"
#loss_weight = 1.0

#[train.gw_opt]
#type = "gw_loss"
#loss_weight = 1.0
#criterion = "chc"

[logger]
total_iter = 1000000
save_checkpoint_freq = 1000
use_tb_logger = true
#save_tb_img = true
#print_freq = 100

@neosr-project added the question label on Jul 28, 2024
@neosr-project
Owner

Hi @someitalian123. Could you try decreasing the learning rate? I recently decreased it in the default template configs as well, after some other users reported instability in certain cases. You can see the recommended values here.
Another suggestion: at the beginning of training, use a larger batch size, particularly if you're training from scratch (no pretrained model). Training without a pretrain is highly discouraged, by the way.
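For illustration only, a minimal sketch of those two changes against the config above (the linked page has the actual recommended values; the numbers here are just examples):

[datasets.train]
type = "paired"
batch_size = 16     # example only: something larger than 8, as far as VRAM allows

[train.optim_g]
type = "adan_sf"
lr = 1e-3           # example only: lowered from the 1.6e-3 in the original config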

@someitalian123
Author

Could you try decreasing the learning rate? [...] Another suggestion: at the beginning of training, use a larger batch size, particularly if you're training from scratch (no pretrained model).

I just copied and pasted the values that you linked, and I have made it to epoch 80. I will have to run it a bit longer to see if it happens again. While it's running, though, the console still says that the learning rate is 1.60e-03 instead of the 1e-3 that I changed it to.

28-07-2024 02:00 PM | INFO: [ epoch: 80 ] [ iter: 73,900 ] [ performance: 1.065 it/s ] [ lr: 1.60e-03 ] [ eta: 7 days, 3:11:22 ] [ l_g_mssim: 9.9570e-02 ] [ l_g_percep: 5.0608e+00 ] [ l_g_color: 3.1832e-04 ] [ l_g_luma: 4.0132e-03 ] [ l_g_gan: 1.8426e+00 ] [ l_g_total: 7.0073e+00 ] [ l_d_real: 1.3882e-05 ] [ out_d_real: 1.2296e+01 ] [ l_d_fake: 2.2195e-03 ] [ out_d_fake: -6.1791e+00 ] [ l_d_total: 1.1167e-03 ]

I'm wondering if the change actually took effect or if something else is going on. Do I need to delete the current contents of the experiments folder and start the training over? Or is it fine to continue after changing the learning rate in the config file?

As far as increasing the batch size is concerned, I wasn't sure if it would be better to have a larger batch size or a larger patch size. Right now, while training is running, 7.1/8.0 GB of my dedicated VRAM is in use. I would likely need to decrease the patch size to 32 if I wanted to increase the batch size, right?

@neosr-project
Owner

While it's running, though, the console still says that the learning rate is 1.60e-03 instead of the 1e-3

Yes, changing the learning rate doesn't work right now if you resume from the state dict (such as when using --auto_resume). If you want to change it, you must use the current model as a pretrain and start training again.
However, if it's not giving you NaN, you should just proceed. The learning rate of 1.6e-3 should work perfectly well, since the schedule-free optimizer blends the parameters (it adapts). If you do want more stable training, though, decreasing it to 1.2e-3 or 1e-3 should do the trick.
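Sketched against the config above, that would mean something like the following (the checkpoint paths are placeholders; point them at whichever model files were last saved under your experiments folder):

[path]
pretrain_network_g = 'D:\neosr\experiments\An4x\models\net_g_latest.pth'   # placeholder: your latest saved generator
pretrain_network_d = 'D:\neosr\experiments\An4x\models\net_d_latest.pth'   # placeholder: optional, latest discriminator

[train.optim_g]
type = "adan_sf"
lr = 1.2e-3   # or 1e-3, per the suggestion above

and then starting a fresh run (not resuming with --auto_resume), so the new learning rate actually takes effect.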

I would likely need to decrease the patch size to 32 if I wanted to increase the batch size, right?

Yes. Normally you should start with a larger batch and a patch_size of 32. Then later in training (>40-60k iterations) you can start to decrease the batch size and increase patch_size. This strategy is called "curriculum learning" in the ML literature.
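A rough sketch of that progression (the iteration threshold and the batch/patch numbers are only examples, not exact prescriptions):

# early phase: training from scratch
[datasets.train]
patch_size = 32
batch_size = 16     # example value

# later phase: after roughly 40-60k iterations, restarting from the current model as a pretrain
[datasets.train]
patch_size = 64
batch_size = 8

The two [datasets.train] blocks are alternatives for different stages of training, not meant to appear in the same config file.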
