approximate joint end-2-end training. #23

Hello @precedenceguo, do you have a plan to add the approximate joint end-to-end training code? I think we should first add a Python op for proposal_target.py, which would be something like ROIIter, and then change AnchorLoader a little.
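A minimal sketch of what such a Python op could look like, following the same mx.operator.CustomOp pattern that proposal.py in this repo already uses; the input/output layout and the pass-through placeholder are assumptions, not a final design:

```python
import mxnet as mx

class ProposalTargetOperator(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        rois = in_data[0]      # proposals produced by the RPN
        gt_boxes = in_data[1]  # ground-truth boxes for this image
        # A real op would sample fg/bg rois against gt_boxes and compute
        # regression targets; this placeholder just passes the rois through.
        self.assign(out_data[0], req[0], rois)
        self.assign(out_data[1], req[1], mx.nd.zeros(out_data[1].shape))

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        # no gradient flows back through the sampling step
        for i in range(len(in_grad)):
            self.assign(in_grad[i], req[i], 0)

@mx.operator.register('proposal_target')
class ProposalTargetProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(ProposalTargetProp, self).__init__(need_top_grad=False)

    def list_arguments(self):
        return ['rois', 'gt_boxes']

    def list_outputs(self):
        return ['rois_output', 'label']

    def infer_shape(self, in_shape):
        # shapes here just mirror the pass-through placeholder above
        rois_shape = in_shape[0]
        label_shape = (in_shape[0][0],)
        return in_shape, [rois_shape, label_shape]

    def create_operator(self, ctx, shapes, dtypes):
        return ProposalTargetOperator()
```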
Not right now. The steps you suggest seem right. Have fun!
OK, I will try this.
@tornadomeet Looking forward to your work. I gave it a try some days ago and found that the symbol variables are a terrible mess (rpn_bbox_target/bbox_target, etc.).
@neodooth Yes, for object detection the IO part is the most tedious, because there is no general-purpose detection IO in MXNet.
@neodooth Another reason is that data parallelism requires such a design to correctly match up data, labels, and outputs.
Hello @neodooth, I have written preliminary end-to-end training code here: https://github.com/tornadomeet/mx-rcnn/blob/master/train_end2end.py, but it produces NaN during training. I think the reason is that at the beginning of training the bbox_delta output of the RPN explodes; I will debug it further. Glad to have your help debugging the end-to-end training, thanks!
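One common safeguard against exactly this explosion, borrowed from py-faster-rcnn's BBOX_XFORM_CLIP rather than anything this repo necessarily does, is to cap dw/dh before the np.exp in bbox_transform_inv, so early unstable rpn_bbox_pred outputs cannot overflow to inf and then NaN. A sketch:

```python
import numpy as np

# cap log-space width/height deltas; exp(log(1000/16)) bounds box growth
BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)

def bbox_transform_inv_clipped(boxes, deltas):
    """Apply (dx, dy, dw, dh) deltas to boxes, clamping dw/dh to avoid
    overflow in np.exp during early training."""
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx = deltas[:, 0::4]
    dy = deltas[:, 1::4]
    dw = np.minimum(deltas[:, 2::4], BBOX_XFORM_CLIP)  # clamp before exp
    dh = np.minimum(deltas[:, 3::4], BBOX_XFORM_CLIP)

    pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
    pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]

    pred_boxes = np.zeros(deltas.shape, dtype=deltas.dtype)
    pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
    pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
    pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
    pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h
    return pred_boxes
```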
Please consider setting …
@precedenceguo Thanks, I set it to True yesterday and the NaN still exists; the output of rpn_bbox_pred will sometimes be NaN, even when I set …
@tornadomeet Great!
@neodooth I found potential bugs in proposal_target.py/proposal.py; I am debugging them now.
@tornadomeet I managed to draw a picture of the net and found that you wrote two layers with the same name "rpn_cls_prob": one is a SoftmaxOutput and the other a SoftmaxActivation. In rbg's implementation these are two separate layers (Softmax and SoftmaxWithLoss).
@neodooth Thanks! I fixed the names this morning, but I haven't tidied up the code yet, so no push. I kept both symbols, the SoftmaxOutput and the SoftmaxActivation.
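For reference, the fix amounts to giving the two branches distinct names; something along these lines, where the parameter values are illustrative and follow the usual mx-rcnn RPN setup:

```python
import mxnet as mx

rpn_cls_score_reshape = mx.symbol.Variable('rpn_cls_score_reshape')
rpn_label = mx.symbol.Variable('label')

# loss branch: SoftmaxOutput trains the RPN classifier, ignoring label -1
rpn_cls_prob = mx.symbol.SoftmaxOutput(data=rpn_cls_score_reshape, label=rpn_label,
                                       multi_output=True, use_ignore=True,
                                       ignore_label=-1, name='rpn_cls_prob')

# inference branch: SoftmaxActivation feeds scores to the proposal layer
rpn_cls_act = mx.symbol.SoftmaxActivation(data=rpn_cls_score_reshape,
                                          mode='channel', name='rpn_cls_act')
```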
@neodooth @precedenceguo When training joint end-to-end, at this line https://github.com/precedenceguo/mx-rcnn/blob/master/rcnn/rpn/proposal.py#L131, …
I changed RPN_MIN_SIZE to 5 in config.py. In my case some ImageNet images are extremely small, even smaller than the default min size (16).
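For context, the min-size filter this setting feeds looks roughly like the one in py-faster-rcnn's proposal layer: proposals smaller than RPN_MIN_SIZE (scaled by the image scale) are dropped, so very small images can end up with no proposals at all:

```python
import numpy as np

def _filter_boxes(boxes, min_size):
    """Keep only boxes whose width and height are both >= min_size."""
    ws = boxes[:, 2] - boxes[:, 0] + 1
    hs = boxes[:, 3] - boxes[:, 1] + 1
    keep = np.where((ws >= min_size) & (hs >= min_size))[0]
    return keep
```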
@neodooth Thanks, I have pushed the newest code I fixed today; you can check it. If the set of bg proposals is not empty, training continues, but at the beginning it is likely that the number of valid bg proposals is 0, so training terminates; you can restart training in that case. The log looks like this:

```
INFO:root:########## TRAIN FASTER-RCNN WITH APPROXIMATE JOINT END2END #############
voc_2007_train gt roidb loaded from /home/work/wuwei/project/github/mx-rcnn/data/cache/voc_2007_train_gt_roidb.pkl
prepare roidb
providing maximum shape [('data', (1, 3, 1000, 1000))] [('label', (1, 34596)), ('bbox_target', (1, 36, 62, 62)), ('bbox_inside_weight', (1, 36, 62, 62)), ('bbox_outside_weight', (1, 36, 62, 62)), ('gt_boxes', (256, 5))]
INFO:root:Epoch[0] Batch [20] Speed: 0.88 samples/sec Train-Accuracy=0.047805, LogLoss=3.000072, SmoothL1Loss=0.462790
INFO:root:Epoch[0] Batch [40] Speed: 0.90 samples/sec Train-Accuracy=0.150438, LogLoss=2.819557, SmoothL1Loss=0.445308
INFO:root:Epoch[0] Batch [60] Speed: 0.87 samples/sec Train-Accuracy=0.304495, LogLoss=2.630411, SmoothL1Loss=0.442107
INFO:root:Epoch[0] Batch [80] Speed: 0.87 samples/sec Train-Accuracy=0.433353, LogLoss=2.449683, SmoothL1Loss=0.436546
INFO:root:Epoch[0] Batch [100] Speed: 0.85 samples/sec Train-Accuracy=0.522316, LogLoss=2.273890, SmoothL1Loss=0.417847
INFO:root:Epoch[0] Batch [120] Speed: 0.88 samples/sec Train-Accuracy=0.582806, LogLoss=2.115118, SmoothL1Loss=0.402035
INFO:root:Epoch[0] Batch [140] Speed: 0.85 samples/sec Train-Accuracy=0.631067, LogLoss=1.964540, SmoothL1Loss=0.378700
INFO:root:Epoch[0] Batch [160] Speed: 0.88 samples/sec Train-Accuracy=0.664596, LogLoss=1.842124, SmoothL1Loss=0.368171
INFO:root:Epoch[0] Batch [180] Speed: 0.86 samples/sec Train-Accuracy=0.687068, LogLoss=1.747761, SmoothL1Loss=0.371310
INFO:root:Epoch[0] Batch [200] Speed: 0.88 samples/sec Train-Accuracy=0.707090, LogLoss=1.660281, SmoothL1Loss=0.366773
INFO:root:Epoch[0] Batch [220] Speed: 0.87 samples/sec Train-Accuracy=0.721790, LogLoss=1.590150, SmoothL1Loss=0.369936
INFO:root:Epoch[0] Batch [240] Speed: 0.86 samples/sec Train-Accuracy=0.737698, LogLoss=1.515215, SmoothL1Loss=0.360027
INFO:root:Epoch[0] Batch [260] Speed: 0.88 samples/sec Train-Accuracy=0.750659, LogLoss=1.452627, SmoothL1Loss=0.353906
INFO:root:Epoch[0] Batch [280] Speed: 0.84 samples/sec Train-Accuracy=0.762177, LogLoss=1.395514, SmoothL1Loss=0.346304
```

I'll continue debugging it tomorrow.
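One way to avoid the hard stop when no valid background proposal exists yet is to guard the sampling step in proposal_target.py. A sketch, assuming the usual fg_inds/bg_inds layout and that gt boxes are appended to the proposals, so the foreground set is never empty:

```python
import numpy as np
import numpy.random as npr

def sample_rois_safely(fg_inds, bg_inds, fg_rois_per_image, rois_per_image):
    """Sample fg/bg roi indices without crashing when bg_inds is empty
    (common at the very start of end-to-end training)."""
    fg_rois = int(min(fg_rois_per_image, fg_inds.size))
    fg_sel = npr.choice(fg_inds, size=fg_rois, replace=False)
    bg_rois = rois_per_image - fg_rois
    if bg_inds.size > 0:
        # sample with replacement if there are fewer bg proposals than needed
        bg_sel = npr.choice(bg_inds, size=bg_rois, replace=bg_inds.size < bg_rois)
    else:
        # no valid bg proposals yet: pad with fg indices instead of terminating
        bg_sel = npr.choice(fg_inds, size=bg_rois, replace=True)
    return np.append(fg_sel, bg_sel)
```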
I think the major problem lies in back-propagation and the losses, since if lr is set to 0 then everything is fine (except that the network does not learn), so the forward pass seems good. I also found that an lr of 0.00001 is OK (ResNet-101), which might be a clue; I remember a paper saying the network should be warmed up with a small lr. Edit: well, it's not actually OK after running more iterations...
Yes, I also think it needs warm-up with a smaller lr.
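If warm-up helps, one way to get it without touching the training loop is a small custom scheduler; this is a hypothetical helper, not part of mx-rcnn:

```python
import mxnet as mx

class WarmupScheduler(mx.lr_scheduler.LRScheduler):
    """Linearly ramp the learning rate from warmup_lr up to base_lr over
    the first warmup_steps updates, then hold it constant."""
    def __init__(self, base_lr, warmup_lr, warmup_steps):
        super(WarmupScheduler, self).__init__(base_lr)
        self.warmup_lr = warmup_lr
        self.warmup_steps = warmup_steps

    def __call__(self, num_update):
        if num_update < self.warmup_steps:
            frac = float(num_update) / self.warmup_steps
            return self.warmup_lr + frac * (self.base_lr - self.warmup_lr)
        return self.base_lr
```

With the Module API this would be hooked in through the optimizer settings, e.g. `optimizer_params={'learning_rate': 0.001, 'lr_scheduler': WarmupScheduler(0.001, 0.0001, 500)}` (the step counts are illustrative).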
The NaN also exists in py-faster-rcnn: rbgirshick/py-faster-rcnn#65
I added some BlockGrads to the net and tested 2 situations: …
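For anyone repeating this experiment: the isolation trick is to wrap the branch you want to rule out in mx.symbol.BlockGrad, which passes data through on the forward pass but sends zero gradient back. A tiny sketch, with placeholder symbol names:

```python
import mxnet as mx

# e.g. stop the RCNN losses from back-propagating into the RPN bbox branch,
# to tell which branch is producing the NaN gradients
rpn_bbox_pred = mx.symbol.Variable('rpn_bbox_pred')
rpn_bbox_frozen = mx.symbol.BlockGrad(rpn_bbox_pred)  # identity forward, zero backward
```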
@neodooth I think the reason there is no NaN during alternating RPN training is that it only uses … I think we can solve this problem thoroughly in 2 ways: …
@neodooth Do you have a QQ or email? If yes, we can discuss the problem there.
@tornadomeet Sent my QQ to your email.
I am not familiar with mxnet-frcnn, but in py-faster-rcnn the NaN during training comes from generating the bbox targets: in pascal_voc.py lines 208-211, you should pay attention to the -1.
@argman Thanks, it is the same in mx-rcnn: https://github.com/precedenceguo/mx-rcnn/blob/master/helper/dataset/pascal_voc.py#L126-L129. @precedenceguo @neodooth I have solved the NaN problem; I'll update it today.
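For reference, the lines in question subtract 1 to convert PASCAL VOC's 1-based pixel coordinates to 0-based; an annotation that already contains a 0 then goes negative and can corrupt the regression targets downstream. A hedged fix is to clamp at 0 (some people instead drop the -1 entirely for non-VOC data); the file name below is illustrative:

```python
import numpy as np
import xml.etree.ElementTree as ET

tree = ET.parse('Annotations/000001.xml')  # hypothetical VOC annotation file
for obj in tree.findall('object'):
    bbox = obj.find('bndbox')
    # VOC coordinates are 1-based; subtracting 1 makes them 0-based, but a
    # coordinate of 0 in the XML would become -1, so clamp at 0
    x1 = np.maximum(float(bbox.find('xmin').text) - 1, 0)
    y1 = np.maximum(float(bbox.find('ymin').text) - 1, 0)
    x2 = np.maximum(float(bbox.find('xmax').text) - 1, 0)
    y2 = np.maximum(float(bbox.find('ymax').text) - 1, 0)
```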
I have updated my code and the NaN no longer appears during end2end training. I'll continue training a 2007 model to check the accuracy. Thanks all, closing now.
Hey @tornadomeet, could you give some pointers as to how you solved the problem? I am trying to solve a similar problem in faster_rcnn_pytorch.