Why is the speed in mini-caffe much worse than Caffe with the same prototxt? #48

Closed
yonghenglh6 opened this issue Sep 8, 2017 · 22 comments

@yonghenglh6

yonghenglh6 commented Sep 8, 2017

I compared the ResNet from your run_test.cpp, but the performance is as shown below: the speed is worse than Caffe's.
Any ideas?
I'm using the newest Caffe with cuDNN 5.1.5.

Thanks

[screenshot: speed and memory comparison between mini-caffe and Caffe]

@luoyetx (Owner)

luoyetx commented Sep 10, 2017

Would you mind pasting your test code?

@yonghenglh6 (Author)

In mini-caffe, I just added a "for(int i=0;i<100;i++)" loop before "test.Forward();" in run_net.cpp and took the time from your output.
In Caffe, I used "caffe time", which is misleading for memory usage because it includes the backward pass.
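
Roughly like this (a sketch from memory, not the exact run_net.cpp code; the chrono timing wrapper is my own):

```cpp
// Sketch of the change in run_net.cpp ("net" comes from the surrounding code).
#include <chrono>
#include <cstdio>

net.Forward();  // warm-up pass so first-run allocations are excluded
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 100; i++) {
  net.Forward();
}
auto end = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(end - start).count();
std::printf("average forward time: %.3f ms\n", ms / 100);
```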

@yonghenglh6 (Author)

I got the memory usage from "nvidia-smi", watching it live.

@yonghenglh6 (Author)

Would you mind posting your own ResNet numbers, so I can check whether the bug is on my side? Thank you.

@luoyetx (Owner)

luoyetx commented Sep 11, 2017

I will test the network prototxt on a 1070 later, with more details on mini-caffe and official Caffe.

@yonghenglh6 (Author)

yonghenglh6 commented Sep 12, 2017

I used the code in the screenshot below to time every layer, but I cannot find the reason.
[screenshot: per-layer timing code]
And the result is:
[screenshot: per-layer timing results]

@yonghenglh6 (Author)

yonghenglh6 commented Sep 12, 2017

Your net->Forward(2, 3) gives me an error, so I can only use net->Forward(0, x) to measure the time from the beginning and subtract consecutive totals.
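
Something like this (a sketch; "num_layers" and the exact Forward(start, end) signature are assumptions on my side):

```cpp
// Sketch: time the layer range [0, k) for growing k and subtract
// consecutive totals to approximate per-layer cost.
#include <chrono>
#include <cstdio>
#include <vector>

std::vector<double> prefix_ms(num_layers + 1, 0.0);
for (int k = 1; k <= num_layers; ++k) {
  auto t0 = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < 100; ++i) {
    net->Forward(0, k);  // run layers 0..k-1
  }
  auto t1 = std::chrono::high_resolution_clock::now();
  prefix_ms[k] =
      std::chrono::duration<double, std::milli>(t1 - t0).count() / 100;
}
for (int k = 1; k <= num_layers; ++k) {
  std::printf("layer %d: %.3f ms\n", k - 1, prefix_ms[k] - prefix_ms[k - 1]);
}
```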

@yonghenglh6 (Author)

yonghenglh6 commented Sep 12, 2017

As the net gets longer, the performance gap between mini-caffe and Caffe grows larger; I tested this by constructing the net layer by layer.

@yonghenglh6 (Author)

I have updated the performance numbers above.

@yonghenglh6 (Author)

I checked cuDNN and confirmed it was running correctly by adding some debug output.

@luoyetx (Owner)

luoyetx commented Sep 12, 2017

Please refer to profile.md to check layer-wise performance. I am writing tools to benchmark the whole network.
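
Usage is roughly like this (a sketch; see profile.md for the exact API):

```cpp
// Sketch of whole-network profiling with the built-in Profiler.
#include "caffe/profiler.hpp"

caffe::Profiler* profiler = caffe::Profiler::Get();
profiler->TurnON();                     // start recording scopes
net->Forward();
profiler->TurnOFF();                    // stop recording
profiler->DumpProfile("profile.json");  // load the dump in chrome://tracing
```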

@yonghenglh6 (Author)

I tried the Profiler at the beginning, but the times shown in Chrome were not consistent with my test results. It's quite possible I was using it the wrong way, though.

@luoyetx (Owner)

luoyetx commented Sep 12, 2017

Pay attention to the Timer; it's not accurate. Use Profiler::Now() instead.
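
Something like this (a sketch; I'm assuming Now() returns microsecond ticks, check caffe/profiler.hpp):

```cpp
// Sketch: manual timing with Profiler::Now() instead of the Timer.
#include <cstdint>
#include <cstdio>
#include "caffe/profiler.hpp"

uint64_t t0 = caffe::Profiler::Get()->Now();
net->Forward();
uint64_t t1 = caffe::Profiler::Get()->Now();
std::printf("forward: %.3f ms\n", (t1 - t0) / 1000.0);  // microsecond ticks assumed
```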

@luoyetx (Owner)

luoyetx commented Sep 13, 2017

@yonghenglh6 You can try the benchmark branch. I modified the Profiler log so it now prints the layer name instead of the layer type. You can get the same layer-wise performance through profile.json.

@luoyetx (Owner)

luoyetx commented Sep 13, 2017

I find the performance is not stable on Windows; I will test on Linux later.

@yonghenglh6 (Author)

With your new benchmark tool, I found the BN layer is the main source of the difference: your BN costs twice as much time as conv. So I switched to testing GoogLeNet, which has no BN layer, and the result shows similar performance between Caffe and mini-caffe.
Here are the details. By the way, my Caffe uses the ATLAS lib, which I don't think matters here.
[screenshot: layer-wise timing details]

@luoyetx (Owner)

luoyetx commented Sep 14, 2017

There is an optimization for the BatchNorm layer in this commit. I will port it from official Caffe.

@yonghenglh6 (Author)

yonghenglh6 commented Sep 22, 2017

Every time you request memory from the pool, the blob is in the uninitialized state, so to_gpu() calls the gpu_memset function. That costs about 10% more time than original Caffe, where the blob is kept around and stays in the HEAD_AT_GPU state.
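
The path I mean, simplified from official Caffe's SyncedMemory (mini-caffe's version may differ in detail):

```cpp
// Simplified from Caffe's SyncedMemory::to_gpu() (CUDA path). With the
// memory pool, freshly requested blobs start UNINITIALIZED, so the
// memset runs on nearly every forward pass.
void SyncedMemory::to_gpu() {
  switch (head_) {
  case UNINITIALIZED:
    CUDA_CHECK(cudaMalloc(&gpu_ptr_, size_));  // or a pooled allocation
    caffe_gpu_memset(size_, 0, gpu_ptr_);      // <-- the extra ~10% cost
    head_ = HEAD_AT_GPU;
    break;
  default:
    break;  // HEAD_AT_CPU / HEAD_AT_GPU / SYNCED cases omitted
  }
}
```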

@luoyetx (Owner)

luoyetx commented Sep 22, 2017

The default behavior is the same in official Caffe here.

@yonghenglh6 (Author)

yonghenglh6 commented Sep 22, 2017

Yes, but official Caffe doesn't need to reallocate blobs every forward pass, so it doesn't call that function frequently. Mini-caffe keeps resetting blobs to the uninitialized state so it can reuse pooled memory, and then re-memsets the memory each time to_gpu() is called, which causes the performance problem.
It could be optimized by choosing the memset point in each layer carefully instead of doing it every time to_gpu() is called; see the sketch below.
In short, this may be a problem only for mini-caffe, even though the code is the same.
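
For example, a hypothetical way to sketch it (the zero_init flag is just an illustration, not tested):

```cpp
// Hypothetical sketch of the idea: zero-fill only when a caller actually
// needs zeroed memory, instead of memsetting every pooled allocation
// inside to_gpu().
void SyncedMemory::to_gpu(bool zero_init /* = false */) {
  switch (head_) {
  case UNINITIALIZED:
    CUDA_CHECK(cudaMalloc(&gpu_ptr_, size_));
    if (zero_init) {
      caffe_gpu_memset(size_, 0, gpu_ptr_);  // only when requested
    }
    head_ = HEAD_AT_GPU;
    break;
  default:
    break;  // other cases unchanged
  }
}
```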

@luoyetx (Owner)

luoyetx commented Sep 22, 2017

You mean this function? It is a problem that it gets called every time new memory is requested. I think we can remove the call, since dirty data in the memory seems to be no problem for later use. What do you think?

@yonghenglh6 (Author)

Yes.
