approximate joint end-2-end training. #23

Closed
tornadomeet opened this issue Aug 17, 2016 · 26 comments

@tornadomeet

tornadomeet commented Aug 17, 2016

Hello, @precedenceguo
Do you have a plan to add the approximate joint end-to-end training code? I think we should first add a Python op for proposal_target.py, something like ROIIter, and then change AnchorLoader a little.
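
For reference, a minimal sketch of how such a proposal_target op could be registered as an MXNet CustomOp (assumed structure with illustrative class/argument names, not this repo's actual code):

```python
import mxnet as mx
import numpy as np

class ProposalTargetOperator(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        rois = in_data[0].asnumpy()      # proposals coming from the RPN proposal op
        gt_boxes = in_data[1].asnumpy()  # ground-truth boxes with class labels
        # a real implementation would sample fg/bg rois against gt_boxes and
        # compute bbox regression targets here; this sketch just passes rois on
        self.assign(out_data[0], req[0], mx.nd.array(rois))

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        # roi sampling is not differentiable, so no gradient flows to the inputs
        for i in range(len(in_grad)):
            self.assign(in_grad[i], req[i], 0)

@mx.operator.register('proposal_target')
class ProposalTargetProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(ProposalTargetProp, self).__init__(need_top_grad=False)

    def list_arguments(self):
        return ['rois', 'gt_boxes']

    def list_outputs(self):
        return ['rois_output']

    def infer_shape(self, in_shape):
        # output shape matches the incoming rois in this pass-through sketch
        return in_shape, [in_shape[0]]

    def create_operator(self, ctx, shapes, dtypes):
        return ProposalTargetOperator()
```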

@ijkguo
Owner

ijkguo commented Aug 17, 2016

Not right now. The steps you suggest seem right. Have fun!

@tornadomeet
Author

OK, I will try this.

@zhangjiangqige

@tornadomeet Looking forward to your work. I gave it a try some days ago and found that the symbol variable names are a terrible mess (rpn_bbox_target/bbox_target, etc.).

@tornadomeet
Author

@neodooth Yes, for object detection the IO part is the most tedious, because MXNet has no general-purpose IO for detection.

@ijkguo
Owner

ijkguo commented Aug 17, 2016

@neodooth Another reason is that data parallelism requires such a design to correctly determine the correspondence between data, label, and output.

@tornadomeet
Author

Hello @neodooth, I have written preliminary end-to-end training code here: https://github.com/tornadomeet/mx-rcnn/blob/master/train_end2end.py, but it produces NaN during training. I think the reason is that at the beginning of training the bbox_delta output of the RPN explodes; I will debug it more. I'd be glad to have your help debugging the end-to-end training. Thanks~

@ijkguo
Owner

ijkguo commented Aug 18, 2016

Please consider setting config.TRAIN.BBOX_NORMALIZATION_PRECOMPUTED = True, since Ross did that :) and indeed we should normalize the targets.
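
For reference, the precomputed normalization amounts to roughly the sketch below; the mean/std values are the usual py-faster-rcnn-style defaults and are an assumption here, not taken from this repo's config:

```python
import numpy as np

# py-faster-rcnn-style precomputed statistics (assumed values; check config.py)
BBOX_MEANS = np.array([0.0, 0.0, 0.0, 0.0])
BBOX_STDS = np.array([0.1, 0.1, 0.2, 0.2])

def normalize_bbox_targets(bbox_targets):
    """Scale the (dx, dy, dw, dh) regression targets so the smooth-L1 loss
    sees values of a magnitude comparable to the classification loss."""
    return (bbox_targets - BBOX_MEANS) / BBOX_STDS

def unnormalize_bbox_pred(bbox_pred):
    """Undo the scaling on the network's predictions before decoding boxes."""
    return bbox_pred * BBOX_STDS + BBOX_MEANS
```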

@tornadomeet
Author

tornadomeet commented Aug 19, 2016

@precedenceguo Thanks, I set it to True yesterday, but the NaN still exists: the output of rpn_bbox_pred is sometimes NaN, even when I set lr=0.0 and wd=0.0.

@zhangjiangqige

@tornadomeet Great!
bbox_delta does explode, but if I set lr=0.0 it is OK; perhaps you hard-coded some learning rate in the Python scripts?

@tornadomeet
Author

@neodooth I found potential bugs in proposal_target.py/proposal.py; I am debugging them now.

@zhangjiangqige

zhangjiangqige commented Aug 19, 2016

@tornadomeet I managed to draw a picture of the net and found that you wrote two layers with the same name "rpn_cls_prob": one is a SoftmaxOutput and the other is a SoftmaxActivation. In rbg's implementation there are two separate layers (Softmax and SoftmaxWithLoss).

@tornadomeet
Author

tornadomeet commented Aug 19, 2016

@neodooth Thanks~ I fixed the names this morning, but I haven't tidied the code yet, so no push. I put both in the same symbol: a SoftmaxOutput and a SoftmaxActivation.
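
Roughly like this (a sketch with illustrative names and hyper-parameters, not the repo's exact symbol code):

```python
import mxnet as mx

rpn_cls_score = mx.sym.Variable('rpn_cls_score_reshape')
rpn_label = mx.sym.Variable('rpn_label')

# training branch: softmax with a loss attached
rpn_cls_prob = mx.sym.SoftmaxOutput(data=rpn_cls_score, label=rpn_label,
                                    multi_output=True, use_ignore=True,
                                    ignore_label=-1, name='rpn_cls_prob')

# inference branch feeding the proposal op: a plain softmax, different name
rpn_cls_act = mx.sym.SoftmaxActivation(data=rpn_cls_score, mode='channel',
                                       name='rpn_cls_act')
```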

@tornadomeet
Author

@neodooth @precedenceguo When training with the joint end-to-end setup, at this line https://github.com/precedenceguo/mx-rcnn/blob/master/rcnn/rpn/proposal.py#L131 `keep` is likely to be empty. How can this be fixed?

@zhangjiangqige

zhangjiangqige commented Aug 19, 2016

I changed RPN_MIN_SIZE to 5 in config.py. In my case, some ImageNet images are extremely small, even smaller than the default min size (16).
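
For context, a self-contained sketch of what that min-size filter does, plus one possible guard against an empty `keep` (the function name follows the py-faster-rcnn convention; the fallback is just one option, not necessarily the repo's fix):

```python
import numpy as np

def filter_boxes(boxes, min_size):
    """Return indices of boxes whose width and height are both >= min_size (pixels)."""
    ws = boxes[:, 2] - boxes[:, 0] + 1
    hs = boxes[:, 3] - boxes[:, 1] + 1
    return np.where((ws >= min_size) & (hs >= min_size))[0]

# toy proposals in (x1, y1, x2, y2); the second box is only 4 pixels wide
proposals = np.array([[10., 10., 60., 60.],
                      [5., 5., 8., 40.]], dtype=np.float32)

keep = filter_boxes(proposals, min_size=16)
if keep.size == 0:
    # guard: fall back to keeping everything rather than letting zero
    # proposals reach proposal_target
    keep = np.arange(proposals.shape[0])
proposals = proposals[keep]
```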

@tornadomeet
Author

tornadomeet commented Aug 19, 2016

@neodooth Thanks, I have pushed the newest code I fixed today; you can check that~

If the bg proposals are not empty, training continues, but it is quite likely that the number of valid bg proposals is 0 at the beginning, in which case training terminates and you have to restart it...

The log looks like this:

INFO:root:########## TRAIN FASTER-RCNN WITH APPROXIMATE JOINT END2END #############
voc_2007_train gt roidb loaded from /home/work/wuwei/project/github/mx-rcnn/data/cache/voc_2007_train_gt_roidb.pkl
prepare roidb
providing maximum shape [('data', (1, 3, 1000, 1000))] [('label', (1, 34596)), ('bbox_target', (1, 36, 62, 62)), ('bbox_inside_weight', (1, 36, 62, 62)), ('bbox_outside_weight', (1, 36, 62, 62)), ('gt_boxes', (256, 5))]
INFO:root:Epoch[0] Batch [20]   Speed: 0.88 samples/sec Train-Accuracy=0.047805,    LogLoss=3.000072,   SmoothL1Loss=0.462790
INFO:root:Epoch[0] Batch [40]   Speed: 0.90 samples/sec Train-Accuracy=0.150438,    LogLoss=2.819557,   SmoothL1Loss=0.445308
INFO:root:Epoch[0] Batch [60]   Speed: 0.87 samples/sec Train-Accuracy=0.304495,    LogLoss=2.630411,   SmoothL1Loss=0.442107
INFO:root:Epoch[0] Batch [80]   Speed: 0.87 samples/sec Train-Accuracy=0.433353,    LogLoss=2.449683,   SmoothL1Loss=0.436546
INFO:root:Epoch[0] Batch [100]  Speed: 0.85 samples/sec Train-Accuracy=0.522316,    LogLoss=2.273890,   SmoothL1Loss=0.417847
INFO:root:Epoch[0] Batch [120]  Speed: 0.88 samples/sec Train-Accuracy=0.582806,    LogLoss=2.115118,   SmoothL1Loss=0.402035
INFO:root:Epoch[0] Batch [140]  Speed: 0.85 samples/sec Train-Accuracy=0.631067,    LogLoss=1.964540,   SmoothL1Loss=0.378700
INFO:root:Epoch[0] Batch [160]  Speed: 0.88 samples/sec Train-Accuracy=0.664596,    LogLoss=1.842124,   SmoothL1Loss=0.368171
INFO:root:Epoch[0] Batch [180]  Speed: 0.86 samples/sec Train-Accuracy=0.687068,    LogLoss=1.747761,   SmoothL1Loss=0.371310
INFO:root:Epoch[0] Batch [200]  Speed: 0.88 samples/sec Train-Accuracy=0.707090,    LogLoss=1.660281,   SmoothL1Loss=0.366773
INFO:root:Epoch[0] Batch [220]  Speed: 0.87 samples/sec Train-Accuracy=0.721790,    LogLoss=1.590150,   SmoothL1Loss=0.369936
INFO:root:Epoch[0] Batch [240]  Speed: 0.86 samples/sec Train-Accuracy=0.737698,    LogLoss=1.515215,   SmoothL1Loss=0.360027
INFO:root:Epoch[0] Batch [260]  Speed: 0.88 samples/sec Train-Accuracy=0.750659,    LogLoss=1.452627,   SmoothL1Loss=0.353906
INFO:root:Epoch[0] Batch [280]  Speed: 0.84 samples/sec Train-Accuracy=0.762177,    LogLoss=1.395514,   SmoothL1Loss=0.346304

I'll continue debugging it tomorrow.

@zhangjiangqige

zhangjiangqige commented Aug 19, 2016

I think the major problem lies in back-propagation and the losses, since if lr is set to 0 everything is fine (except that the network is not learning). So the forward step seems good.

I also found that a lr of 0.00001 works (ResNet-101), which might be a clue. I remember a paper saying the network should be warmed up with a small lr.

edit: Well, it's not OK after all; it still failed after running more iterations...

@tornadomeet
Author

Yes, I also think it needs to warm up with a smaller lr.
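
A minimal sketch of what such a warm-up could look like as a custom MXNet lr_scheduler (the class and the example numbers are illustrative, not code from this repo):

```python
import mxnet as mx

class WarmupScheduler(mx.lr_scheduler.LRScheduler):
    """Ramp the learning rate linearly from warmup_lr to base_lr over the
    first warmup_steps updates, then hold it at base_lr."""
    def __init__(self, base_lr, warmup_lr, warmup_steps):
        super(WarmupScheduler, self).__init__()
        self.base_lr = base_lr
        self.warmup_lr = warmup_lr
        self.warmup_steps = warmup_steps

    def __call__(self, num_update):
        if num_update < self.warmup_steps:
            frac = float(num_update) / self.warmup_steps
            return self.warmup_lr + frac * (self.base_lr - self.warmup_lr)
        return self.base_lr

# e.g. pass lr_scheduler=WarmupScheduler(0.001, 0.00001, 500) to the optimizer
```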

@tornadomeet
Author

tornadomeet commented Aug 20, 2016

The NaN also exists in py-faster-rcnn: rbgirshick/py-faster-rcnn#65

@zhangjiangqige

I added some BlockGrads to the net and tested two situations (see the sketch below):

  1. back-propagating only through the RPN softmax is OK
  2. back-propagating only through the RPN bbox smooth-L1 loss leads to NaN

This is strange, since the only difference between situation 2 and the original alternative RPN training is that there are extra layers, and those layers don't contribute any gradients.
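
Roughly what those two tests look like in symbol form (the variables below are stand-ins for the actual loss symbols in the net):

```python
import mxnet as mx

# stand-ins for the two RPN loss symbols in the end-to-end net
rpn_cls_loss = mx.sym.Variable('rpn_cls_loss')
rpn_bbox_loss = mx.sym.Variable('rpn_bbox_loss')

# situation 1: only the softmax loss back-propagates
group_cls_only = mx.sym.Group([rpn_cls_loss, mx.sym.BlockGrad(rpn_bbox_loss)])

# situation 2: only the smooth-L1 loss back-propagates
group_bbox_only = mx.sym.Group([mx.sym.BlockGrad(rpn_cls_loss), rpn_bbox_loss])
```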

@tornadomeet
Author

tornadomeet commented Aug 22, 2016

@neodooth I think the reason there is no NaN during alternative RPN training is that it only uses rpn_bbox_pred and bbox_target to back-propagate the smooth-L1 loss, and never needs to decode actual boxes through the bbox regression explicitly (which is what can produce NaN).
The cause of the NaN I found is that during the forward pass of proposal and proposal_target, the number of valid proposal boxes becomes zero; this can happen at any stage of their forward computation.
I have added clipping of dw and dh (see the sketch below) and a warm-up lr_scheduler, but this only helps a little.

I think we can solve this problem thoroughly in two ways:

  • use train_rpn for pre-training for one or two epochs.
  • skip the training samples that would lead to NaN (where keep or bg_inds is empty) for now, but this may not be easy to implement in mxnet.
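
The dw/dh clipping mentioned above is roughly the following sketch; the log(1000/16) cap is the value used in later py-faster-rcnn-style code and is an assumption here, not this repo's constant:

```python
import numpy as np

def bbox_pred_clipped(boxes, deltas, max_ratio=np.log(1000.0 / 16)):
    """Apply (dx, dy, dw, dh) deltas to boxes, clipping dw/dh before exp()
    so one wild RPN prediction cannot blow the decoded box up to inf/NaN."""
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx, dy = deltas[:, 0], deltas[:, 1]
    dw = np.minimum(deltas[:, 2], max_ratio)  # clip before the exponential
    dh = np.minimum(deltas[:, 3], max_ratio)

    pred_ctr_x = dx * widths + ctr_x
    pred_ctr_y = dy * heights + ctr_y
    pred_w = np.exp(dw) * widths
    pred_h = np.exp(dh) * heights

    pred_boxes = np.zeros_like(deltas)
    pred_boxes[:, 0] = pred_ctr_x - 0.5 * pred_w
    pred_boxes[:, 1] = pred_ctr_y - 0.5 * pred_h
    pred_boxes[:, 2] = pred_ctr_x + 0.5 * pred_w
    pred_boxes[:, 3] = pred_ctr_y + 0.5 * pred_h
    return pred_boxes
```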

@tornadomeet
Author

@neodooth Do you have a QQ or email? If so, we can discuss the problem there.

@zhangjiangqige

@tornadomeet Sent my QQ to your email.

@argman

argman commented Aug 23, 2016

I am not familiar with mx-rcnn, but in py-faster-rcnn a NaN during training can come from how the bbox targets are generated: in pascal_voc.py lines 208-211, you should pay attention to the -1.
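
To spell that out: VOC annotations are 1-based, so the loader subtracts 1, and an xmin/ymin of 0 then becomes -1, which later yields invalid boxes (for example after flipping). A hedged sketch of the conversion with a clamp (illustrative names, not the exact code of either repo):

```python
import xml.etree.ElementTree as ET

def load_boxes(xml_path):
    """Load (x1, y1, x2, y2) boxes from a VOC-style annotation file."""
    tree = ET.parse(xml_path)
    boxes = []
    for obj in tree.findall('object'):
        bbox = obj.find('bndbox')
        # VOC coordinates are 1-based; subtract 1 to make them 0-based, but
        # clamp at 0 so a stray 0 in the annotation cannot turn into -1
        x1 = max(float(bbox.find('xmin').text) - 1, 0)
        y1 = max(float(bbox.find('ymin').text) - 1, 0)
        x2 = max(float(bbox.find('xmax').text) - 1, 0)
        y2 = max(float(bbox.find('ymax').text) - 1, 0)
        boxes.append([x1, y1, x2, y2])
    return boxes
```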

@tornadomeet
Author

@argman Thanks, it is the same in mx-rcnn: https://github.com/precedenceguo/mx-rcnn/blob/master/helper/dataset/pascal_voc.py#L126-L129

@precedenceguo @neodooth I have solved the NaN problem; I'll push the update today.

@tornadomeet
Author

tornadomeet commented Aug 23, 2016

I have updated my code, and the NaN no longer appears during end2end training. I'll continue training a VOC 2007 model to check the accuracy.

Thanks all~

Closing now.

@abhiML

abhiML commented Jun 27, 2017

Hey @tornadomeet, could you give some pointers as to how you solved the problem? I am trying to solve a similar problem in faster_rcnn_pytorch.
