
FPN training: divide by zero, RPNL1Loss explodes #146

Open
smorrel1 opened this issue Jan 15, 2018 · 8 comments

Comments

@smorrel1

Hi, please could you assist? I'm training FPN on COCO as per the instructions and get a large RPNL1Loss. It is coming down very slowly, and I suspect training may not work, or will at least be delayed a lot.

Any assistance appreciated! Thanks, Stephen
log-error.txt

@fighting-liu

fighting-liu commented Jan 16, 2018

The same situation occurs for me when I use FPN+ResNet101+dcn to train my own dataset, but the same data works fine with ResNet101+dcn.

The bad log looks as follows:

Epoch[0] Batch [100] Speed: 3.10 samples/sec Train-RPNAcc=0.714563, RPNLogLoss=0.677545, RPNL1Loss=0.119950, Proposal FG Fraction=0.008675, R-CNN FG Accuracy=0.034800, RCNNAcc=0.956340, RCNNLogLoss=1.054744, RCNNL1Loss=191189370617.151367,

Epoch[0] Batch [200] Speed: 3.12 samples/sec Train-RPNAcc=0.720455, RPNLogLoss=0.663638, RPNL1Loss=0.113055, Proposal FG Fraction=0.008540, R-CNN FG Accuracy=0.033646, RCNNAcc=0.954282, RCNNLogLoss=1.296537, RCNNL1Loss=257069246790015490457600.000000,

Epoch[0] Batch [300] Speed: 3.08 samples/sec Train-RPNAcc=0.721229, RPNLogLoss=0.648105, RPNL1Loss=0.111896, Proposal FG Fraction=0.008614, R-CNN FG Accuracy=0.038954, RCNNAcc=0.953531, RCNNLogLoss=nan, RCNNL1Loss=nan,

@LiangSiyuan21

I encountered the same problem as you did; have you solved it? @smorrel1

@Puzer

Puzer commented Feb 10, 2018

Did you use the default learning rate (0.01)?
If you use only one GPU for training, try setting lr = 0.00125.

@smorrel1
Author

smorrel1 commented Feb 10, 2018

@Puzer Thanks that solved it!
Yes, I used the default lr=0.01 with 2 GPUs (I now have 4, and 0.005 works). Maybe we should use lr = 0.00125 * number of GPUs?
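
That rule of thumb as a one-liner (a minimal sketch; scaled_lr is a hypothetical helper, not code from this repo):

def scaled_lr(num_gpus, base_lr_per_gpu=0.00125):
    # Linear scaling: 0.00125 per GPU, as suggested above.
    return base_lr_per_gpu * num_gpus

print(scaled_lr(1))  # 0.00125, the single-GPU value suggested by Puzer
print(scaled_lr(4))  # 0.005, what worked with 4 GPUs
print(scaled_lr(8))  # 0.01, matches the repo default if it was tuned for 8 GPUs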

@hedes1992

I ran into the same problem as you.

@Kongsea

Kongsea commented Jun 22, 2018

I changed the learning rate to 1e-5, but the error was still raised.

@maozezhong

maozezhong commented Jun 22, 2018

I solved the problem (at least it worked in my case) by changing the source code in two places:

  1. In lib/dataset/pascal_voc.py, around lines 175-178, comment out the - 1 as below:
x1 = float(bbox.find('xmin').text) #- 1 
y1 = float(bbox.find('ymin').text) #- 1
x2 = float(bbox.find('xmax').text) #- 1
y2 = float(bbox.find('ymax').text) #- 1
  2. In lib/dataset/imdb.py, around line 210, add the code below:
for b in range(len(boxes)):
    # guard against boxes whose x2 ended up smaller than x1 (e.g. after uint16 underflow)
    if boxes[b][2] < boxes[b][0]:
        boxes[b][0] = 0

The reason: the VOC loader assumes 1-based pixel indexes and subtracts 1 from every coordinate, so if your annotations are already 0-based and you do not convert them accordingly, 0 minus 1 underflows to 65535 (the boxes are stored as uint16), which makes the training loss go NaN. You can add print boxes before assert (boxes[:, 2] >= boxes[:, 0]).all() to see the wrong coordinates.

Hope it helps.
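
A quick way to catch such boxes before the assert fires (a minimal sketch, not code from this repo; find_bad_boxes is a hypothetical helper):

import numpy as np

def find_bad_boxes(boxes):
    # Flag boxes whose coordinates look underflowed (0 - 1 stored as uint16 becomes 65535)
    # or whose x2/y2 is smaller than x1/y1.
    boxes = np.asarray(boxes)
    bad = (boxes[:, 2] < boxes[:, 0]) | (boxes[:, 3] < boxes[:, 1]) | (boxes >= 60000).any(axis=1)
    if bad.any():
        print('suspicious boxes:')
        print(boxes[bad])
    return bad

# Example: an xmin of 0 wraps around to 65535 after the "- 1" offset.
find_bad_boxes(np.array([[65535, 10, 50, 60], [5, 5, 20, 20]], dtype=np.uint16))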

@HaydenFaulkner

HaydenFaulkner commented Sep 25, 2018

I found it to be a combination of the <1 box edges and the higher learning rate when using fewer GPUs:

  1. Make sure that when your dataset loaders build the boxes you have something like:
    boxes[ix, :] = [max(x1,1), max(y1,1), x2, y2]

    For COCO I also changed it to:

x1 = np.max((1, x))                                 # clamp left/top edges to >= 1
y1 = np.max((1, y))
x2 = np.min((width - 1, x1 + np.max((1, w - 1))))   # clamp right/bottom edges into the image
y2 = np.min((height - 1, y1 + np.max((1, h - 1))))
if obj['area'] > 0 and x2 > x1 and y2 > y1:         # drop degenerate boxes
  2. In imdb.py, where the flipped boxes are made, add some code before the assert (boxes[:, 2] >= boxes[:, 0]).all() as suggested above (see also the consolidated sketch after this list); I did:
boxes[:, 0] = roi_rec['width'] - oldx2  # - 1
boxes[:, 2] = roi_rec['width'] - oldx1  # - 1
boxes[boxes < 1] = 1  # used to ensure flipped boxes are also 1+ in coords
for b in range(len(boxes)):
    if boxes[b][2] <= boxes[b][0]:
        boxes[b][2] = boxes[b][0] + 1  # give degenerate boxes at least one pixel of width
assert (boxes[:, 2] > boxes[:, 0]).all()
  3. Use a learning rate of 0.00125 * num_gpus.
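
Put together, steps 1 and 2 amount to something like the following (a minimal sketch that assumes boxes is an array of [x1, y1, x2, y2] rows; sanitize_boxes is a hypothetical helper, not code from this repo):

import numpy as np

def sanitize_boxes(boxes, width, height):
    # Clamp coordinates into [1, width - 1] x [1, height - 1] and keep at least
    # one pixel of width/height per box, so the assert above cannot fail.
    boxes = np.array(boxes, dtype=np.float32)
    boxes[:, 0] = np.clip(boxes[:, 0], 1, width - 2)                 # x1
    boxes[:, 1] = np.clip(boxes[:, 1], 1, height - 2)                # y1
    boxes[:, 2] = np.clip(boxes[:, 2], boxes[:, 0] + 1, width - 1)   # x2 > x1
    boxes[:, 3] = np.clip(boxes[:, 3], boxes[:, 1] + 1, height - 1)  # y2 > y1
    assert (boxes[:, 2] > boxes[:, 0]).all() and (boxes[:, 3] > boxes[:, 1]).all()
    return boxes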
