
FPN training: divide by zero, RPNL1Loss explodes #146

Open
smorrel1 opened this issue Jan 15, 2018 · 8 comments

Comments

@smorrel1

Hi, please could you assist? I'm training FPN on COCO as per the instructions and get a large RPNL1Loss. It is coming down very slowly, and I suspect training may not work, or will at least be delayed a lot.

Any assistance appreciated! Thanks, Stephen
log-error.txt

@fighting-liu

fighting-liu commented Jan 16, 2018

The same situation occurs for me when I use FPN+ResNet101+dcn to train my own dataset, but the same data works fine with ResNet101+dcn.

The bad log looks as follows:

Epoch[0] Batch [100] Speed: 3.10 samples/sec Train-RPNAcc=0.714563, RPNLogLoss=0.677545, RPNL1Loss=0.119950, Proposal FG Fraction=0.008675, R-CNN FG Accuracy=0.034800, RCNNAcc=0.956340, RCNNLogLoss=1.054744, RCNNL1Loss=191189370617.151367,

Epoch[0] Batch [200] Speed: 3.12 samples/sec Train-RPNAcc=0.720455, RPNLogLoss=0.663638, RPNL1Loss=0.113055, Proposal FG Fraction=0.008540, R-CNN FG Accuracy=0.033646, RCNNAcc=0.954282, RCNNLogLoss=1.296537, RCNNL1Loss=257069246790015490457600.000000,

Epoch[0] Batch [300] Speed: 3.08 samples/sec Train-RPNAcc=0.721229, RPNLogLoss=0.648105, RPNL1Loss=0.111896, Proposal FG Fraction=0.008614, R-CNN FG Accuracy=0.038954, RCNNAcc=0.953531, RCNNLogLoss=nan, RCNNL1Loss=nan,

@LiangSiyuan21

I encountered the same problem as you did; have you solved it? @smorrel1

@Puzer

Puzer commented Feb 10, 2018

Did you use the default learning rate (0.01)?
If you use only one GPU for training, try setting lr = 0.00125.

@smorrel1
Author

smorrel1 commented Feb 10, 2018

@Puzer Thanks that solved it!
Yes, I used the default lr=0.01 with 2 GPUs (I now have 4, and 0.005 works). Maybe we should use lr = 0.00125 * number of GPUs?
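
That rule of thumb as a one-liner (a minimal sketch; scaled_lr is a hypothetical helper, not code from this repo):

def scaled_lr(num_gpus, base_lr_per_gpu=0.00125):
    # Linear scaling: 0.00125 per GPU, as suggested above.
    return base_lr_per_gpu * num_gpus

print(scaled_lr(1))  # 0.00125, the single-GPU value suggested by Puzer
print(scaled_lr(4))  # 0.005, what worked with 4 GPUs
print(scaled_lr(8))  # 0.01, matches the repo default if it was tuned for 8 GPUs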

@hedes1992

I ran into the same problem as you.

@Kongsea

Kongsea commented Jun 22, 2018

I changed the learning rate to 1e-5, but the error was still raised.

@maozezhong

maozezhong commented Jun 22, 2018

I solved the problem (at least it worked in my case) by changing the source code in two places:

  1. In lib/dataset/pascal_voc.py, around lines 175-178, comment out the - 1 as below:
x1 = float(bbox.find('xmin').text) #- 1 
y1 = float(bbox.find('ymin').text) #- 1
x2 = float(bbox.find('xmax').text) #- 1
y2 = float(bbox.find('ymax').text) #- 1
  2. In lib/dataset/imdb.py, around line 210, add the code below:
for b in range(len(boxes)):
    # guard against boxes whose x2 ended up smaller than x1 (e.g. after uint16 underflow)
    if boxes[b][2] < boxes[b][0]:
        boxes[b][0] = 0

The reason: the VOC loader assumes 1-based pixel indexes and subtracts 1 from every coordinate, so if your annotations are already 0-based and you do not convert them accordingly, 0 minus 1 underflows to 65535 (the boxes are stored as uint16), which makes the training loss go NaN. You can add print boxes before assert (boxes[:, 2] >= boxes[:, 0]).all() to see the wrong coordinates.

Hope it helps.
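
A quick way to catch such boxes before the assert fires (a minimal sketch, not code from this repo; find_bad_boxes is a hypothetical helper):

import numpy as np

def find_bad_boxes(boxes):
    # Flag boxes whose coordinates look underflowed (0 - 1 stored as uint16 becomes 65535)
    # or whose x2/y2 is smaller than x1/y1.
    boxes = np.asarray(boxes)
    bad = (boxes[:, 2] < boxes[:, 0]) | (boxes[:, 3] < boxes[:, 1]) | (boxes >= 60000).any(axis=1)
    if bad.any():
        print('suspicious boxes:')
        print(boxes[bad])
    return bad

# Example: an xmin of 0 wraps around to 65535 after the "- 1" offset.
find_bad_boxes(np.array([[65535, 10, 50, 60], [5, 5, 20, 20]], dtype=np.uint16))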

@HaydenFaulkner

HaydenFaulkner commented Sep 25, 2018

I found it to be a combination of the <1 box edges and the higher learning rate when using fewer GPUs:

  1. Make sure that when your dataset loaders build the boxes you have something like:
    boxes[ix, :] = [max(x1,1), max(y1,1), x2, y2]

    For COCO I also changed it to:

x1 = np.max((1, x))                                 # clamp left/top edges to >= 1
y1 = np.max((1, y))
x2 = np.min((width - 1, x1 + np.max((1, w - 1))))   # clamp right/bottom edges into the image
y2 = np.min((height - 1, y1 + np.max((1, h - 1))))
if obj['area'] > 0 and x2 > x1 and y2 > y1:         # drop degenerate boxes
  2. In imdb.py, where the flipped boxes are made, add some code before the assert (boxes[:, 2] >= boxes[:, 0]).all() as suggested above (see also the consolidated sketch after this list); I did:
boxes[:, 0] = roi_rec['width'] - oldx2  # - 1
boxes[:, 2] = roi_rec['width'] - oldx1  # - 1
boxes[boxes < 1] = 1  # used to ensure flipped boxes are also 1+ in coords
for b in range(len(boxes)):
    if boxes[b][2] <= boxes[b][0]:
        boxes[b][2] = boxes[b][0] + 1  # give degenerate boxes at least one pixel of width
assert (boxes[:, 2] > boxes[:, 0]).all()
  3. Use a learning rate of 0.00125 * num_gpus.
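
Put together, steps 1 and 2 amount to something like the following (a minimal sketch that assumes boxes is an array of [x1, y1, x2, y2] rows; sanitize_boxes is a hypothetical helper, not code from this repo):

import numpy as np

def sanitize_boxes(boxes, width, height):
    # Clamp coordinates into [1, width - 1] x [1, height - 1] and keep at least
    # one pixel of width/height per box, so the assert above cannot fail.
    boxes = np.array(boxes, dtype=np.float32)
    boxes[:, 0] = np.clip(boxes[:, 0], 1, width - 2)                 # x1
    boxes[:, 1] = np.clip(boxes[:, 1], 1, height - 2)                # y1
    boxes[:, 2] = np.clip(boxes[:, 2], boxes[:, 0] + 1, width - 1)   # x2 > x1
    boxes[:, 3] = np.clip(boxes[:, 3], boxes[:, 1] + 1, height - 1)  # y2 > y1
    assert (boxes[:, 2] > boxes[:, 0]).all() and (boxes[:, 3] > boxes[:, 1]).all()
    return boxes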
