
Training with multiple GPUs is not faster #73

Closed

zdwong opened this issue Jul 6, 2017 · 3 comments
zdwong commented Jul 6, 2017

Thanks for your great work porting py-faster-rcnn from Caffe to MXNet. After installing and running mx-rcnn, I found that training on two GPUs is not nearly twice as fast as training on one GPU.
Platform: Ubuntu 16.04, GPU: Tesla M60, 8 GB

bash script/vgg_voc07.sh 0

INFO:root:Epoch[0] Batch [20] Speed: 2.04 samples/sec Train-RPNAcc=0.894159, RPNLogLoss=0.361955, RPNL1Loss=1.139758, RCNNAcc=0.712054, RCNNLogLoss=1.508607, RCNNL1Loss=2.551116,
INFO:root:Epoch[0] Batch [40] Speed: 1.89 samples/sec Train-RPNAcc=0.927401, RPNLogLoss=0.283141, RPNL1Loss=1.018088, RCNNAcc=0.743521, RCNNLogLoss=1.378231, RCNNL1Loss=2.585749,
INFO:root:Epoch[0] Batch [60] Speed: 1.99 samples/sec Train-RPNAcc=0.941726, RPNLogLoss=0.229789, RPNL1Loss=0.936680, RCNNAcc=0.758965, RCNNLogLoss=1.284314, RCNNL1Loss=2.618034,
INFO:root:Epoch[0] Batch [80] Speed: 2.08 samples/sec Train-RPNAcc=0.945939, RPNLogLoss=0.203962, RPNL1Loss=0.934596, RCNNAcc=0.763503, RCNNLogLoss=1.227046, RCNNL1Loss=2.619250,
INFO:root:Epoch[0] Batch [100] Speed: 1.89 samples/sec Train-RPNAcc=0.942644, RPNLogLoss=0.211725, RPNL1Loss=0.920782, RCNNAcc=0.769183, RCNNLogLoss=1.197012, RCNNL1Loss=2.589773,

bash script/vgg_voc07.sh 0,1
INFO:root:Epoch[0] Batch [40] Speed: 2.10 samples/sec Train-RPNAcc=0.934642, RPNLogLoss=0.237217, RPNL1Loss=1.014563, RCNNAcc=0.766673, RCNNLogLoss=1.192775, RCNNL1Loss=2.580673,
INFO:root:Epoch[0] Batch [60] Speed: 2.15 samples/sec Train-RPNAcc=0.942495, RPNLogLoss=0.202506, RPNL1Loss=0.930434, RCNNAcc=0.777600, RCNNLogLoss=1.104864, RCNNL1Loss=2.590131,
INFO:root:Epoch[0] Batch [80] Speed: 2.26 samples/sec Train-RPNAcc=0.948712, RPNLogLoss=0.180862, RPNL1Loss=0.889647, RCNNAcc=0.792101, RCNNLogLoss=1.011266, RCNNL1Loss=2.562042,
INFO:root:Epoch[0] Batch [100] Speed: 2.17 samples/sec Train-RPNAcc=0.955039, RPNLogLoss=0.160886, RPNL1Loss=0.852715, RCNNAcc=0.793162, RCNNLogLoss=0.972027, RCNNL1Loss=2.572651

I wondered whether this problem is caused by missing data parallelization, but you said this version has already implemented it. So why does this happen? Thanks for your reply.
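For reference, a rough comparison of the logged throughputs (a minimal sketch: it averages only the batches shown above and assumes the Speedometer's samples/sec already counts images across all GPUs, as MXNet's Speedometer callback normally does):

```python
# Rough speedup estimate from the Speedometer lines above.
one_gpu = [2.04, 1.89, 1.99, 2.08, 1.89]   # samples/sec, script/vgg_voc07.sh 0
two_gpu = [2.10, 2.15, 2.26, 2.17]         # samples/sec, script/vgg_voc07.sh 0,1

avg_1 = sum(one_gpu) / len(one_gpu)
avg_2 = sum(two_gpu) / len(two_gpu)
print("1 GPU: %.2f samples/sec" % avg_1)   # ~1.98
print("2 GPUs: %.2f samples/sec" % avg_2)  # ~2.17
print("speedup: %.2fx" % (avg_2 / avg_1))  # ~1.10x instead of the expected ~2x
```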


zdwong commented Jul 9, 2017

I checked it carefully and can confirm that multi-GPU training is generally faster than single-GPU training, although the speedup depends on the hardware and platform.

315386775 commented Feb 1, 2018

I also noticed this issue. But the README says:
3.8 img/s to 6 img/s for 2 GPUs

ijkguo (Owner) commented Jun 26, 2018

Most of the time the bottleneck is the custom proposal_target layer or data loading. Check dmlc/gluon-cv for a Gluon implementation.
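One way to check where the time actually goes is MXNet's built-in profiler. This is a minimal sketch, assuming the MXNet 1.x profiler API (the module layout differed in the 0.x releases current when this issue was opened), wrapped around a few training batches:

```python
import mxnet as mx

# Enable operator-level profiling so custom ops (e.g. the proposal_target
# layer) and data-loading waits show up in the per-operator statistics.
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='rcnn_profile.json')

mx.profiler.set_state('run')
# ... run a handful of training batches here ...
mx.profiler.set_state('stop')

# Print an aggregated summary; the heaviest entries indicate the bottleneck.
print(mx.profiler.dumps())
```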

ijkguo closed this as completed Jun 26, 2018