Exception: found a loss that is not finite #573

Open
XiaodongGuan opened this issue Sep 24, 2022 · 2 comments

Comments

@XiaodongGuan

Hello, thanks for the excellent work! Your answer would be deeply appreciated.

I'm training a shufflenetv2k30 for human pose estimation with a customised dataset and customised data augmentation. I have visualised and validated the training samples; however, during training I ran into this problem:
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/train.py", line 202, in
main()
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/train.py", line 198, in main
trainer.loop(train_loader, val_loader, start_epoch=start_epoch)
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/trainer.py", line 158, in loop
self.train(train_scenes, epoch)
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/trainer.py", line 294, in train
loss, head_losses = self.train_batch(data, target, apply_gradients)
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/trainer.py", line 183, in train_batch
loss, head_losses = self.loss(outputs, targets)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/multi_head.py", line 29, in forward
flat_head_losses = [ll
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/multi_head.py", line 31, in
for ll in l(f, t)]
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/composite.py", line 339, in forward
raise Exception('found a loss that is not finite: {}, prev: {}'
Exception: found a loss that is not finite: [tensor(-187.4948, device='cuda:0', grad_fn=), tensor(inf, device='cuda:0', grad_fn=), tensor(0.2073, device='cuda:0', grad_fn=)], prev: [-169.70677185058594, 1248.675048828125, 0.15428416430950165]

It happened again:
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/composite.py", line 339, in forward
raise Exception('found a loss that is not finite: {}, prev: {}'
Exception: found a loss that is not finite: [tensor(-181.7841, device='cuda:0', grad_fn=), tensor(inf, device='cuda:0', grad_fn=), tensor(0.1630, device='cuda:0', grad_fn=)], prev: [-187.84259033203125, 1307.96533203125, 0.12597596645355225]

It happens at a low frequency, roughly once per 250k images. I printed out the related values when it happened again:
torch.sum(l_confidence_bg): tensor(499.5447, device='cuda:0', grad_fn=)
torch.sum(l_confidence): tensor(-16766.3672, device='cuda:0', grad_fn=)
torch.sum(l_reg): tensor(inf, device='cuda:0', grad_fn=)
torch.sum(l_scale): tensor(7.2607, device='cuda:0', grad_fn=)
batch_size: 64
x_regs: tensor(-68.8318, device='cuda:0', grad_fn=)
t_regs: tensor(139.7582, device='cuda:0')
t_sigma_min: tensor(190.9250, device='cuda:0')
t_scales_reg: tensor(17457.9258, device='cuda:0')
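
For illustration, here is a minimal debugging sketch of the kind of check behind the printout above (the tensors are hypothetical stand-ins, and the names follow the printout rather than the openpifpaf API):

```python
import torch

# Hypothetical stand-ins for the loss components printed above;
# only l_reg contains a non-finite entry.
components = {
    'l_confidence_bg': torch.full((64,), 7.8),
    'l_confidence': torch.full((64,), -262.0),
    'l_reg': torch.tensor([1.0, float('inf'), 2.0]),
    'l_scale': torch.full((64,), 0.11),
}
for name, value in components.items():
    total = torch.sum(value)
    print(f'torch.sum({name}): {total.item():.4f}  finite: {torch.isfinite(total).item()}')
```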

In composite.py and components.py under openpifpaf/network/losses, I noticed that the line x[above_max] = self.max_value + torch.log(1 - self.max_value + x[above_max]) does not actually clamp the value when it is inf, because the log of inf is still inf. There is also no corresponding remedy for the occurrence of inf values in the RegressionLoss class.
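
To illustrate the point about the soft clamp, here is a minimal standalone sketch (not the openpifpaf source; max_value=5.0 is just an example) showing that the logarithmic compression leaves an inf input untouched:

```python
import torch

# Soft clamp as described above: values above max_value are compressed
# logarithmically instead of being hard-clamped.
def soft_clamp(x: torch.Tensor, max_value: float = 5.0) -> torch.Tensor:
    x = x.clone()
    above_max = x > max_value
    x[above_max] = max_value + torch.log(1 - max_value + x[above_max])
    return x

print(soft_clamp(torch.tensor([1.0, 10.0, float('inf')])))
# tensor([1.0000, 6.7918, inf]) -- a finite outlier is compressed, but inf stays inf
```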
I have confirmed that the input images don't contain inf values or nan values.
Could you please suggest what the possible reasons might be?

@bstandaert

I have exactly the same issue.

It happens when I try to fine-tune a model (using --checkpoint) that was not previously trained on my custom dataset.
Surprisingly, I don't have this issue when training from scratch (using --basenet) on my custom dataset.

@bstandaert

I managed to fine-tune a checkpoint by decreasing the learning rate.
I now use --lr=0.0001 with the SGD optimizer and everything seems to work.
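
For reference, the same learning-rate choice expressed directly in PyTorch (a sketch with a placeholder model, not the actual openpifpaf training setup, which takes the value from the --lr flag):

```python
import torch

# Placeholder model standing in for the pose network; only the optimizer
# setting mirrors the comment above.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
```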
