loss nan #39

liyangliu · 2018-01-19T02:30:32Z

Hello @jwyang, I downloaded your PyTorch Faster R-CNN yesterday, changed only the COCO 2014 dataset path to my local one, and trained with exactly the same settings as yours (large image scale, lr = 0.01, 2 images per GPU on 8 GPUs, res101, using the Caffe-pretrained models you provide), but got a NaN loss after a few iterations. Have you come across this problem? The loss is not NaN if I set class_agnostic=True. Could you please help me? Thanks.

jwyang · 2018-01-19T03:06:24Z

Hi @liyangliu, could you share your training command and training log?

I did not encounter this problem.

liyangliu · 2018-01-19T03:08:25Z

python trainval_net.py \
    --dataset coco \
    --net res101 \
    --save_dir=exps/baseline/models \
    --cuda \
    --mGPUs \
    --bs 16 \
    --nw 8 \
    --epochs 10 \
    --ls \
    --lr 0.01 \
    --lr_decay_step 4 \
    &> logs/baseline.log &

yxgeee · 2018-01-19T03:10:21Z

Hi, I ran into this problem several times before, but it went away when I ran training again without changing any settings.

jwyang · 2018-01-19T03:17:01Z

@liyangliu I think we might have slightly different initializations. If you encounter this again, one option is to clamp the gradient for res101 as well by commenting out this line.

@gyxoned, have you successfully trained the model and obtained performance similar to that reported in our tables?
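
For reference, here is a minimal sketch of what clamping the gradient by its global norm does, written in plain Python so it runs without a GPU. The function name `clip_by_global_norm`, the toy gradient values, and the threshold of 10.0 are illustrative assumptions, and whether the repository's own clipping helper behaves exactly like this is also an assumption. In PyTorch, `torch.nn.utils.clip_grad_norm_` applies norm-based clipping to a model's parameters in place.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale every gradient so the combined L2 norm is at most max_norm.

    `grads` is a list of per-layer gradient lists (hypothetical toy format).
    Returns the scaled gradients and the norm measured before scaling.
    """
    total_norm = math.sqrt(sum(g * g for layer in grads for g in layer))
    if not math.isfinite(total_norm):
        # Clipping cannot repair gradients that are already NaN or inf.
        raise ValueError("non-finite gradient norm: %r" % total_norm)
    # Leave small gradients untouched; shrink large ones proportionally.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in layer] for layer in grads], total_norm


# Toy gradients whose global norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=10.0)
norm_after = math.sqrt(sum(g * g for layer in clipped for g in layer))
print(norm_before, norm_after)
```

Clipping bounds the size of each update, which can keep an unlucky initialization from blowing the loss up to NaN, but it does not help once the gradients themselves already contain NaN.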

yxgeee · 2018-01-19T03:21:07Z

@jwyang Yes, I have trained resnet101 on COCO successfully, and the performance is similar to the reported results.

jwyang · 2018-01-19T03:22:48Z

@gyxoned sounds great!

shenshanlaoma · 2018-01-24T09:18:58Z

Modify these 4 lines: delete the "- 1".
x1 = float(bbox.find('xmin').text) - 1
y1 = float(bbox.find('ymin').text) - 1
x2 = float(bbox.find('xmax').text) - 1
y2 = float(bbox.find('ymax').text) - 1

faster-rcnn.pytorch/lib/datasets/pascal_voc.py

Line 234 in 42b92b1

x1 = float(bbox.find('xmin').text) - 1

According to http://caffecn.cn/?/question/1055 and https://stackoverflow.com/questions/38513739/warning-during-py-faster-rcnn-training-on-custom-datasets
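
One plausible reading of this suggestion: if a custom dataset's annotations are already 0-based, subtracting 1 turns a coordinate of 0 into -1, and if the loader then stores boxes in an unsigned 16-bit array (an assumption here), that -1 wraps around to 65535 and produces an invalid box. The sketch below is a hypothetical, standalone loader, not the repository's code; `load_boxes`, `one_based`, and the sample annotation are made up for illustration.

```python
import xml.etree.ElementTree as ET

UINT16_MAX = 65535


def load_boxes(xml_text, one_based=True):
    """Return [x1, y1, x2, y2] boxes from a VOC-style annotation string.

    Subtract 1 only when the annotations are 1-based (as in PASCAL VOC),
    and fail loudly instead of letting a bad box reach training.
    """
    offset = 1 if one_based else 0
    boxes = []
    for obj in ET.fromstring(xml_text).findall("object"):
        bbox = obj.find("bndbox")
        box = [int(float(bbox.find(tag).text)) - offset
               for tag in ("xmin", "ymin", "xmax", "ymax")]
        if min(box) < 0:
            raise ValueError("negative coordinate in %r; "
                             "annotations may be 0-based" % box)
        if box[2] < box[0] or box[3] < box[1]:
            raise ValueError("inverted box %r" % box)
        boxes.append(box)
    return boxes


# Hypothetical 0-based annotation whose box touches the left image edge.
ANNOTATION = """<annotation><object><bndbox>
<xmin>0</xmin><ymin>12</ymin><xmax>80</xmax><ymax>96</ymax>
</bndbox></object></annotation>"""

print(-1 % (UINT16_MAX + 1))  # value -1 takes in an unsigned 16-bit slot: 65535
print(load_boxes(ANNOTATION, one_based=False))  # [[0, 12, 80, 96]]
```

Calling `load_boxes(ANNOTATION, one_based=True)` on the same annotation raises a `ValueError`, which is the situation the "- 1" advice is meant to avoid. For datasets that really are 1-based, keeping the subtraction is the consistent choice.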

TomHeaven · 2018-11-20T03:34:22Z

In my case, clamping the gradient for res101 was the correct fix for the NaN loss, as @jwyang suggested:

@liyangliu I think we might have slightly different initializations. If you encounter this again, one option is to clamp the gradient for res101 as well by commenting out this line.

tangbohu · 2020-09-08T13:56:35Z

shenshanlaoma wrote:

Modify these 4 lines: delete the "- 1".
x1 = float(bbox.find('xmin').text) - 1
y1 = float(bbox.find('ymin').text) - 1
x2 = float(bbox.find('xmax').text) - 1
y2 = float(bbox.find('ymax').text) - 1

faster-rcnn.pytorch/lib/datasets/pascal_voc.py

Line 234 in 42b92b1

x1 = float(bbox.find('xmin').text) - 1

According to http://caffecn.cn/?/question/1055 and https://stackoverflow.com/questions/38513739/warning-during-py-faster-rcnn-training-on-custom-datasets

Should we also delete the "- 1" for the experiments on VOC?

jwyang closed this as completed Jan 19, 2018

WUT-xiaoming mentioned this issue Nov 15, 2019

ImportError: cannot import name '_mask' #410

Open
