Hello, thanks for the excellent work! Your answer would be deeply appreciated.
I'm training a shufflenetv2k30 for human pose estimation with a customised dataset and customised data augmentation. I have visualised and validated the training samples; however, during training I ran into this problem:
```
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/train.py", line 202, in
    main()
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/train.py", line 198, in main
    trainer.loop(train_loader, val_loader, start_epoch=start_epoch)
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/trainer.py", line 158, in loop
    self.train(train_scenes, epoch)
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/trainer.py", line 294, in train
    loss, head_losses = self.train_batch(data, target, apply_gradients)
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/trainer.py", line 183, in train_batch
    loss, head_losses = self.loss(outputs, targets)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/multi_head.py", line 29, in forward
    flat_head_losses = [ll
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/multi_head.py", line 31, in
    for ll in l(f, t)]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/composite.py", line 339, in forward
    raise Exception('found a loss that is not finite: {}, prev: {}'
Exception: found a loss that is not finite: [tensor(-187.4948, device='cuda:0', grad_fn=), tensor(inf, device='cuda:0', grad_fn=), tensor(0.2073, device='cuda:0', grad_fn=)], prev: [-169.70677185058594, 1248.675048828125, 0.15428416430950165]
```

again:

```
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/composite.py", line 339, in forward
    raise Exception('found a loss that is not finite: {}, prev: {}'
Exception: found a loss that is not finite: [tensor(-181.7841, device='cuda:0', grad_fn=), tensor(inf, device='cuda:0', grad_fn=), tensor(0.1630, device='cuda:0', grad_fn=)], prev: [-187.84259033203125, 1307.96533203125, 0.12597596645355225]
```
It happens at a low frequency, roughly once per 250k images. I printed out the related values when it happened again:
```
torch.sum(l_confidence_bg): tensor(499.5447, device='cuda:0', grad_fn=)
torch.sum(l_confidence): tensor(-16766.3672, device='cuda:0', grad_fn=)
torch.sum(l_reg): tensor(inf, device='cuda:0', grad_fn=)
torch.sum(l_scale): tensor(7.2607, device='cuda:0', grad_fn=)
batch_size: 64
x_regs: tensor(-68.8318, device='cuda:0', grad_fn=)
t_regs: tensor(139.7582, device='cuda:0')
t_sigma_min: tensor(190.9250, device='cuda:0')
t_scales_reg: tensor(17457.9258, device='cuda:0')
```
In composite.py and components.py under network/losses, I noticed the soft clamp `x[above_max] = self.max_value + torch.log(1 - self.max_value + x[above_max])`. This formula does not actually bound the value when it is inf, since `log(inf)` is still inf. There is also no corresponding remedy for the occurrence of inf values in the `RegressionLoss` class.
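To illustrate the point: the logarithmic soft clamp compresses large finite values but passes inf straight through. Below is a minimal sketch of that behaviour, plus a guarded variant (my own illustration, not the library's code) that replaces non-finite entries with a hard ceiling first:

```python
import torch

def soft_clamp(x, max_value=5.0):
    # Soft clamp as in the snippet above: values above max_value are
    # compressed logarithmically. torch.log(inf) is still inf, so an
    # inf input survives the "clamp" unchanged.
    x = x.clone()
    above_max = x > max_value
    x[above_max] = max_value + torch.log(1 - max_value + x[above_max])
    return x

def soft_clamp_safe(x, max_value=5.0, hard_ceiling=1e4):
    # Guarded variant (illustrative): map nan/inf to a finite ceiling
    # before applying the soft clamp, so the result is always finite.
    x = torch.nan_to_num(x, nan=hard_ceiling,
                         posinf=hard_ceiling, neginf=-hard_ceiling)
    return soft_clamp(x, max_value)

x = torch.tensor([1.0, 10.0, float('inf')])
print(soft_clamp(x))       # the inf entry is still inf
print(soft_clamp_safe(x))  # all entries are finite
```

`torch.nan_to_num` requires PyTorch 1.8 or newer; on older versions the same effect can be had with boolean masking.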
I have confirmed that the input images don't contain inf values or nan values.
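Checking the images alone may not be enough: the augmentation pipeline can still produce non-finite regression targets even when the images are clean. A small sanity-check helper I would add around the data loader (names are my own, hypothetical):

```python
import torch

def assert_finite(name, t):
    # Fail early with a named message instead of deep inside the loss.
    if not torch.isfinite(t).all():
        bad = int((~torch.isfinite(t)).sum())
        raise ValueError(f'{name}: {bad} non-finite entries')

# Example usage on a batch: check the targets as well as the images.
images = torch.randn(2, 3, 64, 64)
assert_finite('images', images)
```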
Could you please suggest what the possible reasons might be?
It happens when I am trying to fine-tune a model (using --checkpoint) that has not been trained on my custom dataset.
Surprisingly, I don't have this issue when I am training from scratch (using --basenet) with my custom dataset.
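Since the spike is rare and only shows up when fine-tuning, one pragmatic workaround is to skip the offending batch instead of aborting. A sketch of such a guard in a generic PyTorch training step (this is not openpifpaf's trainer API, just an illustration):

```python
import torch

def train_batch(model, loss_fn, optimizer, data, target, max_grad_norm=1.0):
    # Illustrative guard: drop the update when the batch loss is
    # non-finite, and clip gradients to tame rare fine-tuning spikes.
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    if not torch.isfinite(loss):
        return None  # skip this batch
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return float(loss)

# Minimal usage with a toy model:
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
out = train_batch(model, torch.nn.functional.mse_loss, opt,
                  torch.randn(8, 4), torch.randn(8, 1))
```

Skipping batches hides the symptom rather than fixing the root cause, so it is best combined with logging which samples triggered the skip.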