
RPN regression loss suddenly becomes NaN #193

ZHOUXINWEN opened this issue Jun 11, 2018 · 18 comments

@ZHOUXINWEN

When I use this code to train on a custom dataset (Pascal VOC format), the RPN loss always turns to NaN after a few dozen iterations.
I have already ruled out coordinates that exceed the image resolution, as well as boxes with xmin == xmax or ymin == ymax.
[session 1][epoch 1][iter 12/4500] loss: 1.1964, lr: 1.00e-03
fg/bg=(16/496), time cost: 0.503772
rpn_cls: 0.1663, rpn_box: 0.0488, rcnn_cls: 0.9381, rcnn_box 0.0433
[session 1][epoch 1][iter 13/4500] loss: 0.8909, lr: 1.00e-03
fg/bg=(12/500), time cost: 0.516370
rpn_cls: 0.1984, rpn_box: 0.0421, rcnn_cls: 0.6251, rcnn_box 0.0254
[session 1][epoch 1][iter 14/4500] loss: 1.1052, lr: 1.00e-03
fg/bg=(20/492), time cost: 0.490039
rpn_cls: 0.1901, rpn_box: 0.0351, rcnn_cls: 0.8329, rcnn_box 0.0469
[session 1][epoch 1][iter 15/4500] loss: nan, lr: 1.00e-03
fg/bg=(6/506), time cost: 0.530968
rpn_cls: 0.1404, rpn_box: nan, rcnn_cls: 0.2575, rcnn_box 0.0102

@ZHOUXINWEN changed the title from "RPN loss suddenly becomes NaN" to "RPN regression loss suddenly becomes NaN" on Jun 11, 2018
@rxqy

rxqy commented Jun 22, 2018

I met a similar problem here, training on KITTI pedestrian detection converted to VOC format.
The loss goes to NaN within about 20 iterations.
I'm wondering how to solve this.

@xiaomengyc

@ZHOUXINWEN @rxqy Have you solved this problem? I am stuck on the same problem. Thanks!

@ljtruong

ljtruong commented Jul 4, 2018

@xiaomengyc

This should help you. It's to do with your annotations, and possibly the "-1" applied when feeding in the annotations in the Pascal VOC dataset script.

#136 (comment)

@xiaomengyc

Thank you @Worulz. I found this solution and it fixed my problem. I am training this Faster R-CNN on a pedestrian dataset; once I adopted the restriction on the box sizes, the NaN problem disappeared.

@JingXiaolun

@xiaomengyc
How do you restrict the box sizes? Can you show the modification in code? I have been stuck on the same problem and need your help. Thanks a lot.

@xiaomengyc

You just need to change this line:

not_keep = (gt_boxes[:,0] == gt_boxes[:,2]) | (gt_boxes[:,1] == gt_boxes[:,3])

to:

not_keep = (gt_boxes[:,2] - gt_boxes[:,0]) < 10 and (gt_boxes[:,3] - gt_boxes[:,1]) < 10

where 10 is the minimum width and height in pixels.

@askerlee

askerlee commented Aug 16, 2018

@xiaomengyc From your description it seems the cause is that the RPN can't propose very small anchors? How about setting config.py:__C.ANCHOR_SCALES to smaller values, e.g. [1, 2, 3] (corresponding to 16, 32, and 48 pixels)?
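For reference, a minimal sketch of that change, assuming the EasyDict-based config.py layout used by this repo (attribute names other than ANCHOR_SCALES are from memory, so treat them as assumptions):

from easydict import EasyDict as edict

__C = edict()
# With a feature stride of 16, an anchor scale of 1 corresponds to a ~16 px anchor,
# so [1, 2, 3] gives ~16/32/48 px anchors instead of the default [8, 16, 32].
__C.ANCHOR_SCALES = [1, 2, 3]
__C.ANCHOR_RATIOS = [0.5, 1, 2]
__C.FEAT_STRIDE = [16]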

@xiaomengyc

@askerlee As I understand it, ANCHOR_SCALES should be set with respect to the scale of the ground-truth boxes.
Filtering out very small boxes keeps the model from producing proposals with an area of 0, which can cause the NaN problem.

According to my experiments, loading pre-trained weights, e.g. a Faster R-CNN trained on COCO, also avoids the NaN problem without filtering out small bboxes.
So, if there are no pre-trained weights, I assume we can train the model with the filter for several epochs and then remove the filter to continue training.

@askerlee

askerlee commented Aug 18, 2018

@xiaomengyc I've also met the same problem, but the culprits were ground-truth bboxes of around 50x10, so I applied your trick and filtered those bboxes out. I guess the NaN appears because the proposals are much bigger than the ground-truth bboxes, which incurs large rpn_box losses.

@xiaomengyc

@askerlee It might be.
In my case, I traced the code and found that the area of some proposals becomes zero, which makes a denominator somewhere zero.
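As an aside, here is a minimal, illustrative sketch (not the repo's actual code) of how a zero-area box can turn the regression loss into NaN: bbox-transform-style targets take a log of width/height ratios, so a zero width produces -inf, and multiplying that by a zero weight (as the target-weighting step effectively does for non-foreground anchors) yields NaN:

import torch

anchor_w = torch.tensor([16.0])
gt_w = torch.tensor([0.0])              # degenerate box: xmax == xmin

dw = torch.log(gt_w / anchor_w)         # tensor([-inf])
masked = dw * torch.tensor([0.0])       # 0 * inf = nan, the kind of NaN seen in rpn_box
print(dw, masked)                       # tensor([-inf]) tensor([nan])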

@Tianlock

@askerlee Hi, how did you solve your problem? I've hit the same issue: the ground-truth bboxes in my dataset are small, around 20*20, and I get no NaN loss in the first epoch but it turns NaN from the second epoch on. Can you tell me your solution? Thanks.

@ljtruong

@Tianlock Do you have both large and very small bounding boxes? You could always crop the image down to the areas that matter and then run the algorithm on the crops.

@askerlee

@Tianlock I fixed it by filtering out bboxes smaller than 20x20. You could set the filtering threshold to, say, 15x15 if 20x20 removes too many useful bboxes. You could also try reducing the learning rate at the same time.

@amirmgh1375

@Tianlock Great. Can you tell me how to filter the bboxes? Thanks.

@amirmgh1375

It's just about the dataset annotations.
You should modify the pascal_voc.py code as follows 👍

x1 = max(float(bbox.find('xmin').text), 0)
y1 = max(float(bbox.find('ymin').text), 0)
x2 = max(float(bbox.find('xmax').text), 0)
y2 = max(float(bbox.find('ymax').text), 0)

or

x1 = max(float(bbox.find('xmin').text) - 1, 0)
y1 = max(float(bbox.find('ymin').text) - 1, 0)
x2 = max(float(bbox.find('xmax').text) - 1, 0)
y2 = max(float(bbox.find('ymax').text) - 1, 0)
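For context, here is a sketch of where this clamp sits, assuming a _load_pascal_annotation-style parser as in pascal_voc.py (the function below is a simplified stand-in, not the repo's exact code):

import numpy as np
import xml.etree.ElementTree as ET

def load_boxes(xml_path):
    # Parse one VOC XML annotation and return an (N, 4) array of boxes.
    tree = ET.parse(xml_path)
    objs = tree.findall('object')
    boxes = np.zeros((len(objs), 4), dtype=np.float32)
    for ix, obj in enumerate(objs):
        bbox = obj.find('bndbox')
        # VOC coordinates are 1-based; subtracting 1 without the max(..., 0)
        # clamp can produce -1 for boxes that touch the image border, which
        # later corrupts the RPN targets and shows up as NaN losses.
        x1 = max(float(bbox.find('xmin').text) - 1, 0)
        y1 = max(float(bbox.find('ymin').text) - 1, 0)
        x2 = max(float(bbox.find('xmax').text) - 1, 0)
        y2 = max(float(bbox.find('ymax').text) - 1, 0)
        boxes[ix, :] = [x1, y1, x2, y2]
    return boxes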

@mshu1

mshu1 commented Aug 23, 2019

@xiaomengyc

not_keep = (gt_boxes[:,2] - gt_boxes[:,0]) < 10 and (gt_boxes[:,3] - gt_boxes[:,1]) < 10

Hi, I get a RuntimeError: bool value of Tensor with more than one value is ambiguous when I try to replace the line with your code. Any idea why this might happen? Thanks in advance!

@xiaomengyc

@mshu1 Try replacing the and operation with * and see if it works.
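For anyone else hitting that RuntimeError: Python's and tries to reduce each tensor to a single bool, which is ambiguous for multi-element tensors, whereas * (or &) combines the two masks elementwise. A small self-contained sketch of the filter with that fix (the keep/nonzero lines mirror the surrounding loader code from memory, and the toy gt_boxes values are made up):

import torch

# Toy gt_boxes with columns [x1, y1, x2, y2, class]; the second box is only 5 px wide and tall.
gt_boxes = torch.tensor([[10., 10., 100., 80., 1.],
                         [30., 30.,  35., 35., 2.]])

too_narrow = (gt_boxes[:, 2] - gt_boxes[:, 0]) < 10
too_short = (gt_boxes[:, 3] - gt_boxes[:, 1]) < 10
not_keep = too_narrow * too_short            # elementwise AND; `&` also works
keep = torch.nonzero(not_keep == 0).view(-1)

gt_boxes = gt_boxes[keep]                    # drops the degenerate 5x5 box
print(gt_boxes)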

@mshu1

mshu1 commented Aug 25, 2019

@xiaomengyc Yes, that works, thanks! Although I am still getting NaN loss, thanks anyway :D
