Exception: found a loss that is not finite #573

Open
XiaodongGuan opened this issue Sep 24, 2022 · 2 comments

Comments

@XiaodongGuan

Hello, thanks for the excellent work! Your answer would be deeply appreciated.

I'm training a shufflenetv2k30 for human pose estimation with a customised dataset and customised data augmentation. I have visualised and validated the training samples; however, during training I ran into this problem:
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/train.py", line 202, in
main()
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/train.py", line 198, in main
trainer.loop(train_loader, val_loader, start_epoch=start_epoch)
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/trainer.py", line 158, in loop
self.train(train_scenes, epoch)
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/trainer.py", line 294, in train
loss, head_losses = self.train_batch(data, target, apply_gradients)
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/trainer.py", line 183, in train_batch
loss, head_losses = self.loss(outputs, targets)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/multi_head.py", line 29, in forward
flat_head_losses = [ll
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/multi_head.py", line 31, in
for ll in l(f, t)]
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/composite.py", line 339, in forward
raise Exception('found a loss that is not finite: {}, prev: {}'
Exception: found a loss that is not finite: [tensor(-187.4948, device='cuda:0', grad_fn=), tensor(inf, device='cuda:0', grad_fn=), tensor(0.2073, device='cuda:0', grad_fn=)], prev: [-169.70677185058594, 1248.675048828125, 0.15428416430950165]

It happened again:
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/composite.py", line 339, in forward
raise Exception('found a loss that is not finite: {}, prev: {}'
Exception: found a loss that is not finite: [tensor(-181.7841, device='cuda:0', grad_fn=), tensor(inf, device='cuda:0', grad_fn=), tensor(0.1630, device='cuda:0', grad_fn=)], prev: [-187.84259033203125, 1307.96533203125, 0.12597596645355225]

It happens at a low frequency, roughly once per 250k images. I printed out the related values when it happened again:
torch.sum(l_confidence_bg): tensor(499.5447, device='cuda:0', grad_fn=)
torch.sum(l_confidence): tensor(-16766.3672, device='cuda:0', grad_fn=)
torch.sum(l_reg): tensor(inf, device='cuda:0', grad_fn=)
torch.sum(l_scale): tensor(7.2607, device='cuda:0', grad_fn=)
batch_size: 64
x_regs: tensor(-68.8318, device='cuda:0', grad_fn=)
t_regs: tensor(139.7582, device='cuda:0')
t_sigma_min: tensor(190.9250, device='cuda:0')
t_scales_reg: tensor(17457.9258, device='cuda:0')
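
For illustration, here is a minimal debugging sketch of the kind of check behind the printout above (the tensors are hypothetical stand-ins, and the names follow the printout rather than the openpifpaf API):

```python
import torch

# Hypothetical stand-ins for the loss components printed above;
# only l_reg contains a non-finite entry.
components = {
    'l_confidence_bg': torch.full((64,), 7.8),
    'l_confidence': torch.full((64,), -262.0),
    'l_reg': torch.tensor([1.0, float('inf'), 2.0]),
    'l_scale': torch.full((64,), 0.11),
}
for name, value in components.items():
    total = torch.sum(value)
    print(f'torch.sum({name}): {total.item():.4f}  finite: {torch.isfinite(total).item()}')
```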

In composite.py and components.py under openpifpaf/network/losses, I noticed that the line x[above_max] = self.max_value + torch.log(1 - self.max_value + x[above_max]) does not actually clamp the value when it is inf, because the log of inf is still inf. There is also no corresponding remedy for the occurrence of inf values in the RegressionLoss class.
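
To illustrate the point about the soft clamp, here is a minimal standalone sketch (not the openpifpaf source; max_value=5.0 is just an example) showing that the logarithmic compression leaves an inf input untouched:

```python
import torch

# Soft clamp as described above: values above max_value are compressed
# logarithmically instead of being hard-clamped.
def soft_clamp(x: torch.Tensor, max_value: float = 5.0) -> torch.Tensor:
    x = x.clone()
    above_max = x > max_value
    x[above_max] = max_value + torch.log(1 - max_value + x[above_max])
    return x

print(soft_clamp(torch.tensor([1.0, 10.0, float('inf')])))
# tensor([1.0000, 6.7918, inf]) -- a finite outlier is compressed, but inf stays inf
```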
I have confirmed that the input images don't contain inf values or nan values.
Could you please suggest what the possible reasons might be?

@bstandaert

I have exactly the same issue.

It happens when I try to fine-tune a model (using --checkpoint) that was not previously trained on my custom dataset.
Surprisingly, I don't have this issue when training from scratch (using --basenet) on my custom dataset.

@bstandaert

I managed to fine-tune a checkpoint by decreasing the learning rate.
I now use --lr=0.0001 with the SGD optimizer and everything seems to work.
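
For reference, the same learning-rate choice expressed directly in PyTorch (a sketch with a placeholder model, not the actual openpifpaf training setup, which takes the value from the --lr flag):

```python
import torch

# Placeholder model standing in for the pose network; only the optimizer
# setting mirrors the comment above.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
```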
