approximate joint end-2-end training. #23

Closed
tornadomeet opened this issue Aug 17, 2016 · 26 comments

@tornadomeet

tornadomeet commented Aug 17, 2016

Hello, @precedenceguo
Do you have a plan to add the approximate joint end-to-end training code? I think we should first add a Python op for proposal_target.py, something like ROIIter, and then change AnchorLoader a little.
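
For reference, a minimal sketch of how such a proposal_target op could be registered as an MXNet CustomOp (assumed structure with illustrative class/argument names, not this repo's actual code):

```python
import mxnet as mx
import numpy as np

class ProposalTargetOperator(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        rois = in_data[0].asnumpy()      # proposals coming from the RPN proposal op
        gt_boxes = in_data[1].asnumpy()  # ground-truth boxes with class labels
        # a real implementation would sample fg/bg rois against gt_boxes and
        # compute bbox regression targets here; this sketch just passes rois on
        self.assign(out_data[0], req[0], mx.nd.array(rois))

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        # roi sampling is not differentiable, so no gradient flows to the inputs
        for i in range(len(in_grad)):
            self.assign(in_grad[i], req[i], 0)

@mx.operator.register('proposal_target')
class ProposalTargetProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(ProposalTargetProp, self).__init__(need_top_grad=False)

    def list_arguments(self):
        return ['rois', 'gt_boxes']

    def list_outputs(self):
        return ['rois_output']

    def infer_shape(self, in_shape):
        # output shape matches the incoming rois in this pass-through sketch
        return in_shape, [in_shape[0]]

    def create_operator(self, ctx, shapes, dtypes):
        return ProposalTargetOperator()
```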

@ijkguo
Owner

ijkguo commented Aug 17, 2016

Not right now. The steps you suggest seem right. Have fun!

@tornadomeet
Author

OK, I will try this.

@zhangjiangqige

@tornadomeet Looking forward to your work. I gave it a try some days ago and found that the symbol variable names are a terrible mess (rpn_bbox_target/bbox_target, etc.).

@tornadomeet
Author

@neodooth Yes, for object detection the IO part is the most tedious, because MXNet has no general-purpose IO for detection.

@ijkguo
Owner

ijkguo commented Aug 17, 2016

@neodooth Another reason is that data parallelism requires such a design to correctly determine the correspondence between data, label, and output.

@tornadomeet
Author

Hello @neodooth, I have written preliminary end-to-end training code here: https://github.com/tornadomeet/mx-rcnn/blob/master/train_end2end.py, but it produces NaN during training. I think the reason is that at the beginning of training the bbox_delta output of the RPN explodes; I will debug it more. I'd be glad to have your help debugging the end-to-end training. Thanks~

@ijkguo
Owner

ijkguo commented Aug 18, 2016

Please consider setting config.TRAIN.BBOX_NORMALIZATION_PRECOMPUTED = True, since Ross did that :) and indeed we should normalize the targets.
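
For reference, the precomputed normalization amounts to roughly the sketch below; the mean/std values are the usual py-faster-rcnn-style defaults and are an assumption here, not taken from this repo's config:

```python
import numpy as np

# py-faster-rcnn-style precomputed statistics (assumed values; check config.py)
BBOX_MEANS = np.array([0.0, 0.0, 0.0, 0.0])
BBOX_STDS = np.array([0.1, 0.1, 0.2, 0.2])

def normalize_bbox_targets(bbox_targets):
    """Scale the (dx, dy, dw, dh) regression targets so the smooth-L1 loss
    sees values of a magnitude comparable to the classification loss."""
    return (bbox_targets - BBOX_MEANS) / BBOX_STDS

def unnormalize_bbox_pred(bbox_pred):
    """Undo the scaling on the network's predictions before decoding boxes."""
    return bbox_pred * BBOX_STDS + BBOX_MEANS
```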

@tornadomeet
Author

tornadomeet commented Aug 19, 2016

@precedenceguo Thanks, I set it to True yesterday, but the NaN still exists: the output of rpn_bbox_pred is sometimes NaN, even when I set lr=0.0 and wd=0.0.

@zhangjiangqige

@tornadomeet Great!
bbox_delta does explode, but if I set lr=0.0 it is OK; perhaps you hard-coded some learning rate in the Python scripts?

@tornadomeet
Author

@neodooth I found potential bugs in proposal_target.py/proposal.py; I am debugging them now.

@zhangjiangqige

zhangjiangqige commented Aug 19, 2016

@tornadomeet I managed to draw a picture of the net and found that you wrote two layers with the same name "rpn_cls_prob": one is a SoftmaxOutput and the other is a SoftmaxActivation. In rbg's implementation there are two separate layers (Softmax and SoftmaxWithLoss).

@tornadomeet
Author

tornadomeet commented Aug 19, 2016

@neodooth Thanks~ I fixed the names this morning, but I haven't tidied the code yet, so no push. I put both in the same symbol: a SoftmaxOutput and a SoftmaxActivation.
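
Roughly like this (a sketch with illustrative names and hyper-parameters, not the repo's exact symbol code):

```python
import mxnet as mx

rpn_cls_score = mx.sym.Variable('rpn_cls_score_reshape')
rpn_label = mx.sym.Variable('rpn_label')

# training branch: softmax with a loss attached
rpn_cls_prob = mx.sym.SoftmaxOutput(data=rpn_cls_score, label=rpn_label,
                                    multi_output=True, use_ignore=True,
                                    ignore_label=-1, name='rpn_cls_prob')

# inference branch feeding the proposal op: a plain softmax, different name
rpn_cls_act = mx.sym.SoftmaxActivation(data=rpn_cls_score, mode='channel',
                                       name='rpn_cls_act')
```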

@tornadomeet
Author

@neodooth @precedenceguo When training with the joint end-to-end setup, at this line https://github.com/precedenceguo/mx-rcnn/blob/master/rcnn/rpn/proposal.py#L131 `keep` is likely to be empty. How can this be fixed?

@zhangjiangqige

zhangjiangqige commented Aug 19, 2016

I changed RPN_MIN_SIZE to 5 in config.py. In my case, some ImageNet images are extremely small, even smaller than the default min size (16).
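
For context, a self-contained sketch of what that min-size filter does, plus one possible guard against an empty `keep` (the function name follows the py-faster-rcnn convention; the fallback is just one option, not necessarily the repo's fix):

```python
import numpy as np

def filter_boxes(boxes, min_size):
    """Return indices of boxes whose width and height are both >= min_size (pixels)."""
    ws = boxes[:, 2] - boxes[:, 0] + 1
    hs = boxes[:, 3] - boxes[:, 1] + 1
    return np.where((ws >= min_size) & (hs >= min_size))[0]

# toy proposals in (x1, y1, x2, y2); the second box is only 4 pixels wide
proposals = np.array([[10., 10., 60., 60.],
                      [5., 5., 8., 40.]], dtype=np.float32)

keep = filter_boxes(proposals, min_size=16)
if keep.size == 0:
    # guard: fall back to keeping everything rather than letting zero
    # proposals reach proposal_target
    keep = np.arange(proposals.shape[0])
proposals = proposals[keep]
```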

@tornadomeet
Author

tornadomeet commented Aug 19, 2016

@neodooth Thanks, I have pushed the newest code I fixed today; you can check that~

If the bg proposals are not empty, training continues, but it is quite likely that the number of valid bg proposals is 0 at the beginning, in which case training terminates and you have to restart it...

The log looks like this:

INFO:root:########## TRAIN FASTER-RCNN WITH APPROXIMATE JOINT END2END #############
voc_2007_train gt roidb loaded from /home/work/wuwei/project/github/mx-rcnn/data/cache/voc_2007_train_gt_roidb.pkl
prepare roidb
providing maximum shape [('data', (1, 3, 1000, 1000))] [('label', (1, 34596)), ('bbox_target', (1, 36, 62, 62)), ('bbox_inside_weight', (1, 36, 62, 62)), ('bbox_outside_weight', (1, 36, 62, 62)), ('gt_boxes', (256, 5))]
INFO:root:Epoch[0] Batch [20]   Speed: 0.88 samples/sec Train-Accuracy=0.047805,    LogLoss=3.000072,   SmoothL1Loss=0.462790
INFO:root:Epoch[0] Batch [40]   Speed: 0.90 samples/sec Train-Accuracy=0.150438,    LogLoss=2.819557,   SmoothL1Loss=0.445308
INFO:root:Epoch[0] Batch [60]   Speed: 0.87 samples/sec Train-Accuracy=0.304495,    LogLoss=2.630411,   SmoothL1Loss=0.442107
INFO:root:Epoch[0] Batch [80]   Speed: 0.87 samples/sec Train-Accuracy=0.433353,    LogLoss=2.449683,   SmoothL1Loss=0.436546
INFO:root:Epoch[0] Batch [100]  Speed: 0.85 samples/sec Train-Accuracy=0.522316,    LogLoss=2.273890,   SmoothL1Loss=0.417847
INFO:root:Epoch[0] Batch [120]  Speed: 0.88 samples/sec Train-Accuracy=0.582806,    LogLoss=2.115118,   SmoothL1Loss=0.402035
INFO:root:Epoch[0] Batch [140]  Speed: 0.85 samples/sec Train-Accuracy=0.631067,    LogLoss=1.964540,   SmoothL1Loss=0.378700
INFO:root:Epoch[0] Batch [160]  Speed: 0.88 samples/sec Train-Accuracy=0.664596,    LogLoss=1.842124,   SmoothL1Loss=0.368171
INFO:root:Epoch[0] Batch [180]  Speed: 0.86 samples/sec Train-Accuracy=0.687068,    LogLoss=1.747761,   SmoothL1Loss=0.371310
INFO:root:Epoch[0] Batch [200]  Speed: 0.88 samples/sec Train-Accuracy=0.707090,    LogLoss=1.660281,   SmoothL1Loss=0.366773
INFO:root:Epoch[0] Batch [220]  Speed: 0.87 samples/sec Train-Accuracy=0.721790,    LogLoss=1.590150,   SmoothL1Loss=0.369936
INFO:root:Epoch[0] Batch [240]  Speed: 0.86 samples/sec Train-Accuracy=0.737698,    LogLoss=1.515215,   SmoothL1Loss=0.360027
INFO:root:Epoch[0] Batch [260]  Speed: 0.88 samples/sec Train-Accuracy=0.750659,    LogLoss=1.452627,   SmoothL1Loss=0.353906
INFO:root:Epoch[0] Batch [280]  Speed: 0.84 samples/sec Train-Accuracy=0.762177,    LogLoss=1.395514,   SmoothL1Loss=0.346304

I'll continue debugging it tomorrow.

@zhangjiangqige

zhangjiangqige commented Aug 19, 2016

I think the major problem lies in back-propagation and the losses, since if lr is set to 0 everything is fine (except that the network is not learning). So the forward step seems good.

I also found that a lr of 0.00001 works (ResNet-101), which might be a clue. I remember a paper saying the network should be warmed up with a small lr.

edit: Well, it's not OK after all; it still failed after running more iterations...

@tornadomeet
Author

Yes, I also think it needs to warm up with a smaller lr.
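
A minimal sketch of what such a warm-up could look like as a custom MXNet lr_scheduler (the class and the example numbers are illustrative, not code from this repo):

```python
import mxnet as mx

class WarmupScheduler(mx.lr_scheduler.LRScheduler):
    """Ramp the learning rate linearly from warmup_lr to base_lr over the
    first warmup_steps updates, then hold it at base_lr."""
    def __init__(self, base_lr, warmup_lr, warmup_steps):
        super(WarmupScheduler, self).__init__()
        self.base_lr = base_lr
        self.warmup_lr = warmup_lr
        self.warmup_steps = warmup_steps

    def __call__(self, num_update):
        if num_update < self.warmup_steps:
            frac = float(num_update) / self.warmup_steps
            return self.warmup_lr + frac * (self.base_lr - self.warmup_lr)
        return self.base_lr

# e.g. pass lr_scheduler=WarmupScheduler(0.001, 0.00001, 500) to the optimizer
```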

@tornadomeet
Author

tornadomeet commented Aug 20, 2016

The NaN also exists in py-faster-rcnn: rbgirshick/py-faster-rcnn#65

@zhangjiangqige

I added some BlockGrads to the net and tested two situations (see the sketch below):

  1. back-propagating only through the RPN softmax is OK
  2. back-propagating only through the RPN bbox smooth-L1 loss leads to NaN

This is strange, since the only difference between situation 2 and the original alternative RPN training is that there are extra layers, and those layers don't contribute any gradients.
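
Roughly what those two tests look like in symbol form (the variables below are stand-ins for the actual loss symbols in the net):

```python
import mxnet as mx

# stand-ins for the two RPN loss symbols in the end-to-end net
rpn_cls_loss = mx.sym.Variable('rpn_cls_loss')
rpn_bbox_loss = mx.sym.Variable('rpn_bbox_loss')

# situation 1: only the softmax loss back-propagates
group_cls_only = mx.sym.Group([rpn_cls_loss, mx.sym.BlockGrad(rpn_bbox_loss)])

# situation 2: only the smooth-L1 loss back-propagates
group_bbox_only = mx.sym.Group([mx.sym.BlockGrad(rpn_cls_loss), rpn_bbox_loss])
```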

@tornadomeet
Author

tornadomeet commented Aug 22, 2016

@neodooth I think the reason there is no NaN during alternative RPN training is that it only uses rpn_bbox_pred and bbox_target to back-propagate the smooth-L1 loss, and never needs to decode actual boxes through the bbox regression explicitly (which is what can produce NaN).
The cause of the NaN I found is that during the forward pass of proposal and proposal_target, the number of valid proposal boxes becomes zero; this can happen at any stage of their forward computation.
I have added clipping of dw and dh (see the sketch below) and a warm-up lr_scheduler, but this only helps a little.

I think we can solve this problem thoroughly in two ways:

  • use train_rpn for pre-training for one or two epochs.
  • skip the training samples that would lead to NaN (where keep or bg_inds is empty) for now, but this may not be easy to implement in mxnet.
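
The dw/dh clipping mentioned above is roughly the following sketch; the log(1000/16) cap is the value used in later py-faster-rcnn-style code and is an assumption here, not this repo's constant:

```python
import numpy as np

def bbox_pred_clipped(boxes, deltas, max_ratio=np.log(1000.0 / 16)):
    """Apply (dx, dy, dw, dh) deltas to boxes, clipping dw/dh before exp()
    so one wild RPN prediction cannot blow the decoded box up to inf/NaN."""
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx, dy = deltas[:, 0], deltas[:, 1]
    dw = np.minimum(deltas[:, 2], max_ratio)  # clip before the exponential
    dh = np.minimum(deltas[:, 3], max_ratio)

    pred_ctr_x = dx * widths + ctr_x
    pred_ctr_y = dy * heights + ctr_y
    pred_w = np.exp(dw) * widths
    pred_h = np.exp(dh) * heights

    pred_boxes = np.zeros_like(deltas)
    pred_boxes[:, 0] = pred_ctr_x - 0.5 * pred_w
    pred_boxes[:, 1] = pred_ctr_y - 0.5 * pred_h
    pred_boxes[:, 2] = pred_ctr_x + 0.5 * pred_w
    pred_boxes[:, 3] = pred_ctr_y + 0.5 * pred_h
    return pred_boxes
```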

@tornadomeet
Author

@neodooth Do you have a QQ or email? If so, we can discuss the problem there.

@zhangjiangqige

@tornadomeet Sent my QQ to your email.

@argman

argman commented Aug 23, 2016

I am not familiar with mx-rcnn, but in py-faster-rcnn a NaN during training can come from how the bbox targets are generated: in pascal_voc.py lines 208-211, you should pay attention to the -1.
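
To spell that out: VOC annotations are 1-based, so the loader subtracts 1, and an xmin/ymin of 0 then becomes -1, which later yields invalid boxes (for example after flipping). A hedged sketch of the conversion with a clamp (illustrative names, not the exact code of either repo):

```python
import xml.etree.ElementTree as ET

def load_boxes(xml_path):
    """Load (x1, y1, x2, y2) boxes from a VOC-style annotation file."""
    tree = ET.parse(xml_path)
    boxes = []
    for obj in tree.findall('object'):
        bbox = obj.find('bndbox')
        # VOC coordinates are 1-based; subtract 1 to make them 0-based, but
        # clamp at 0 so a stray 0 in the annotation cannot turn into -1
        x1 = max(float(bbox.find('xmin').text) - 1, 0)
        y1 = max(float(bbox.find('ymin').text) - 1, 0)
        x2 = max(float(bbox.find('xmax').text) - 1, 0)
        y2 = max(float(bbox.find('ymax').text) - 1, 0)
        boxes.append([x1, y1, x2, y2])
    return boxes
```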

@tornadomeet
Author

@argman Thanks, it is the same in mx-rcnn: https://github.com/precedenceguo/mx-rcnn/blob/master/helper/dataset/pascal_voc.py#L126-L129

@precedenceguo @neodooth I have solved the NaN problem; I'll push the update today.

@tornadomeet
Author

tornadomeet commented Aug 23, 2016

I have updated my code, and the NaN no longer appears during end2end training. I'll continue training a VOC 2007 model to check the accuracy.

Thanks all~

Closing now.

@abhiML

abhiML commented Jun 27, 2017

Hey @tornadomeet, could you give some pointers as to how you solved the problem? I am trying to solve a similar problem in faster_rcnn_pytorch.
