Getting Nan loss while training #136

Open
ashutoshIITK opened this issue Apr 22, 2018 · 10 comments

@ashutoshIITK

I have a dataset containing 846 images, but when I start training the log says there are 1692 images. I have the dataset in PASCAL_VOC format; the JPEGImages folder contains 846 images.
During training, I am getting loss: nan. Can you please let me know the reason for this?
Preparing training data... done
before filtering, there are 1692 images...
after filtering, there are 1692 images...
1692 roidb entries
Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
[session 1][epoch 1][iter 0] loss: 6.7142, lr: 1.00e-03
    fg/bg=(2/126), time cost: 238.602555
    rpn_cls: 0.7190, rpn_box: 1.7119, rcnn_cls: 4.2830, rcnn_box 0.0003
[session 1][epoch 1][iter 100] loss: nan, lr: 1.00e-03
    fg/bg=(13/115), time cost: 40.301977
    rpn_cls: 0.5280, rpn_box: nan, rcnn_cls: 0.7082, rcnn_box 0.0000
[session 1][epoch 1][iter 200] loss: nan, lr: 1.00e-03
    fg/bg=(32/96), time cost: 40.584164
    rpn_cls: 0.3966, rpn_box: nan, rcnn_cls: 1.0526, rcnn_box 0.0000
[session 1][epoch 1][iter 300] loss: nan, lr: 1.00e-03
    fg/bg=(8/120), time cost: 41.294393
    rpn_cls: 0.4398, rpn_box: nan, rcnn_cls: 0.6331, rcnn_box 0.0000
[session 1][epoch 1][iter 400] loss: nan, lr: 1.00e-03
    fg/bg=(32/96), time cost: 42.057193
    rpn_cls: 0.2161, rpn_box: nan, rcnn_cls: 0.9535, rcnn_box 0.0000
[session 1][epoch 1][iter 500] loss: nan, lr: 1.00e-03
    fg/bg=(32/96), time cost: 41.014715
    rpn_cls: 0.1673, rpn_box: nan, rcnn_cls: 0.9406, rcnn_box 0.0000
[session 1][epoch 1][iter 600] loss: nan, lr: 1.00e-03
    fg/bg=(32/96), time cost: 42.453671
    rpn_cls: 0.1687, rpn_box: nan, rcnn_cls: 0.9308, rcnn_box 0.0000

@cui-shaowei

There is something wrong with your dataset.
1. In lib/datasets/pascal_voc.py, change
   x1 = float(bbox.find('xmin').text) - 1
   y1 = float(bbox.find('ymin').text) - 1
   to
   x1 = float(bbox.find('xmin').text)
   y1 = float(bbox.find('ymin').text)
   i.e. delete the "- 1".
2. Then remove your data cache (rm -rf $your data cache$).
Maybe the log(-1) is what leads to this error.
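For anyone unsure where this lives: below is a minimal, self-contained sketch of the annotation parsing with the "- 1" removed. In the actual repo the equivalent code sits inside the dataset class in lib/datasets/pascal_voc.py; the function name load_voc_boxes here is only for illustration.

```python
import xml.etree.ElementTree as ET

def load_voc_boxes(xml_path):
    """Return [x1, y1, x2, y2] boxes from one PASCAL VOC annotation file."""
    boxes = []
    root = ET.parse(xml_path).getroot()
    for obj in root.findall("object"):
        bbox = obj.find("bndbox")
        # Keep the raw coordinates: subtracting 1 turns an annotation with
        # xmin=0 into -1, which can poison the box targets and the loss.
        x1 = float(bbox.find("xmin").text)
        y1 = float(bbox.find("ymin").text)
        x2 = float(bbox.find("xmax").text)
        y2 = float(bbox.find("ymax").text)
        boxes.append([x1, y1, x2, y2])
    return boxes
```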

@super-wcg

@ashutoshIITK did you solve the problem?

@ashutoshIITK
Author

@super-wcg
Yes, I solved the problem of getting NaN loss.
It was due to errors in the bounding-box coordinates. The following cases were producing NaN loss:
1. Coordinates outside the image resolution -> NaN loss
2. xmin == xmax -> NaN loss
3. ymin == ymax -> NaN loss
4. A very small bounding box -> NaN loss

For the 4th case, we added the condition |xmax - xmin| >= 20 and, similarly, |ymax - ymin| >= 20 (see the sketch below).

I trained the model for 20 epochs after fixing all of this and didn't get the NaN loss error.
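For reference, here is a rough sanity-check sketch along those lines (names such as check_box, scan_annotations and MIN_SIZE are only illustrative, not from the repo). It flags every box that hits one of the four cases above so you can fix or drop it before training:

```python
import os
import xml.etree.ElementTree as ET

MIN_SIZE = 20  # minimum box width/height in pixels (case 4)

def check_box(x1, y1, x2, y2, img_w, img_h):
    """Return a list of problems for one bounding box (empty list = OK)."""
    problems = []
    if x1 < 0 or y1 < 0 or x2 > img_w or y2 > img_h:
        problems.append("coordinates outside the image")      # case 1
    if x2 <= x1:
        problems.append("xmax <= xmin")                        # case 2
    if y2 <= y1:
        problems.append("ymax <= ymin")                        # case 3
    if (x2 - x1) < MIN_SIZE or (y2 - y1) < MIN_SIZE:
        problems.append("box smaller than %dpx" % MIN_SIZE)    # case 4
    return problems

def scan_annotations(ann_dir):
    """Print every box in a PASCAL VOC Annotations/ folder that breaks a rule."""
    for name in sorted(os.listdir(ann_dir)):
        if not name.endswith(".xml"):
            continue
        root = ET.parse(os.path.join(ann_dir, name)).getroot()
        img_w = float(root.find("size/width").text)
        img_h = float(root.find("size/height").text)
        for obj in root.findall("object"):
            b = obj.find("bndbox")
            coords = [float(b.find(k).text) for k in ("xmin", "ymin", "xmax", "ymax")]
            for problem in check_box(*coords, img_w, img_h):
                print("%s: %s -> %s" % (name, coords, problem))

# Example:
# scan_annotations("VOCdevkit/VOC2007/Annotations")
```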

Thank you.

@JingXiaolun

@ashutoshIITK
My problem is the same as yours. I followed the instructions above to modify my code, but the NaN problem still exists. Can you describe your modifications specifically? I hope I can get your help. Thanks.

@ashutoshIITK
Author

@1csu
What's the size of your image?

@Rahul250192

Did anyone find a solution for this?
I have done almost everything but couldn't resolve it.

@rnjtsh

rnjtsh commented Nov 11, 2018

@ashutoshIITK Where should I put the condition for the 4th case?

I trained my model on my dataset (similar to PASCAL VOC) with batch sizes of 4 and 8, which worked fine. But reducing the batch size to 2 produces the NaN loss. Any idea why this happens?

@nico-zck

nico-zck commented Feb 21, 2019

There are two files, pascal_voc.py and pascal_voc_rgb.py. In the default case you should change pascal_voc.py rather than pascal_voc_rgb.py. The fix @swchui described works for me.

@EmilioOldenziel

I also found that it can happen when the learning rate is too high.

@armin-azh

> @ashutoshIITK Where should I put the condition for the 4th case?
>
> I trained my model on my dataset (similar to PASCAL VOC) with batch sizes of 4 and 8, which worked fine. But reducing the batch size to 2 produces the NaN loss. Any idea why this happens?

Exactly, I have the same problem.
