Inference performance of bvlc_alexnet is far slower on mkl-dnn #17

Closed
etaf opened this issue Dec 29, 2016 · 7 comments

etaf commented Dec 29, 2016

I found that the inference performance of bvlc_alexnet is far slower with mkl-dnn.
With Intel Caffe built with the mkl-dnn engine, I ran the caffe/examples/cpp_classification example and collected the elapsed time of the line net_->Forward().
It is about 835 ms.
But the result for Intel Caffe with MKL 2017 is 16 ms.
Then I added the following code to caffe/examples/cpp_classification/classification.cpp:

 // requires: #include <boost/date_time/posix_time/posix_time.hpp>
 boost::posix_time::ptime start_cpu_;
 boost::posix_time::ptime stop_cpu_;

 // first time
 start_cpu_ = boost::posix_time::microsec_clock::local_time();
 net_->Forward();
 stop_cpu_ = boost::posix_time::microsec_clock::local_time();
 double first_time = (stop_cpu_ - start_cpu_).total_milliseconds();

 // second time
 start_cpu_ = boost::posix_time::microsec_clock::local_time();
 net_->Forward();
 stop_cpu_ = boost::posix_time::microsec_clock::local_time();
 double second_time = (stop_cpu_ - start_cpu_).total_milliseconds();

The result is:
first time: 835 ms
second time: 15.32 ms


It's strange that there is a huge gap between the first and the second time.
With Intel Caffe and MKL 2017, the result is:
first time: 18 ms
second time: 16 ms

I collected each layer's forward time for the first inference.
I found that the time is spent in the first dropout layer, in:

caffe/src/caffe/mkldnn_memory.cpp => MKLDNNMemoryDescriptor<Dtype, is_diff>::on_to_cpu()
=> StreamHolder::Instance().current_stream()->wait();

The first dropout layer comes right after an fc (fully connected) layer.

I've tried other models; the results for bvlc_reference_caffenet, vgg_16, and vgg_19 are similar to bvlc_alexnet.
They all have a dropout layer right after an fc layer.
But bvlc_googlenet does not show a large gap between the first and second time.
The dropout layer of bvlc_googlenet does not come after an fc layer.
Is this a known issue?


Here is my configuration:
CPU: i7-4770 @ 3.40GHz x 8
Memory: 8 GB
OS: Ubuntu 14.04
Intel Caffe: latest
mkl-dnn: commit 47bda95
(the latest mkl-dnn does not work with the latest Intel Caffe, so I cannot reproduce this with the latest mkl-dnn)
Test image: caffe/examples/images/cat.jpg


@emfomenk

Hi @etaf,

By default, Intel Caffe uses a lazy stream for mkl-dnn execution, which means the actual execution may be postponed until stream::wait() is called. In your case all the primitives before the dropout are put into a single stream, and their execution happens in the first non-mkl-dnn layer (i.e. dropout). That is why you see:
MKLDNNMemoryDescriptor<Dtype, is_diff>::on_to_cpu() => StreamHolder::Instance().current_stream()->wait();
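
To illustrate: conceptually a lazy stream behaves like the following minimal sketch in plain C++ (the LazyStream and Primitive names are made-up illustrations, not the actual MKL-DNN or Intel Caffe API). submit() only queues work, and everything runs when wait() is called, so the whole backlog is attributed to whichever layer triggers the wait.

 #include <functional>
 #include <iostream>
 #include <vector>

 // A deferred computation standing in for an MKL-DNN primitive.
 using Primitive = std::function<void()>;

 // Conceptual lazy stream: submit() only queues work, wait() runs it.
 class LazyStream {
  public:
   void submit(Primitive p) { queue_.push_back(std::move(p)); }
   void wait() {
     for (auto &p : queue_) p();   // all queued primitives execute here
     queue_.clear();
   }
  private:
   std::vector<Primitive> queue_;
 };

 int main() {
   LazyStream stream;
   stream.submit([] { std::cout << "conv forward\n"; });  // queued, not run
   stream.submit([] { std::cout << "fc forward\n"; });    // queued, not run
   // The first non-mkl-dnn layer (e.g. dropout) needs the data on the CPU
   // and calls wait(): the whole backlog executes here, so the time shows
   // up in that layer.
   stream.wait();
   return 0;
 }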

MKL-DNN integration also has lazy primitive initialization -- primitive creation happens during the first run (i.e. during the first Forward() call of the corresponding layer). That is why the first run may take much more time than all subsequent ones. On the other hand, initialization of Intel MKL primitives happens in Layer::SetUp() -- hence the first and second runs are roughly the same.
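
As a rough sketch of that difference (hypothetical layer classes, not the actual Intel Caffe code):

 #include <memory>

 struct Primitive { Primitive() { /* expensive: descriptor setup, JIT, ... */ } };

 // MKL-DNN-style layer: the primitive is created lazily on the first Forward(),
 // so the first call is much slower than the following ones.
 class LazyInitLayer {
  public:
   void SetUp() { /* nothing expensive here */ }
   void Forward() {
     if (!prim_) prim_.reset(new Primitive());  // paid only on the first call
     // ... execute prim_ ...
   }
  private:
   std::unique_ptr<Primitive> prim_;
 };

 // Intel MKL (mkl2017)-style layer: the primitive is created in SetUp(),
 // so the first and second Forward() calls cost roughly the same.
 class EagerInitLayer {
  public:
   void SetUp() { prim_.reset(new Primitive()); }
   void Forward() { /* ... execute prim_ ... */ }
  private:
   std::unique_ptr<Primitive> prim_;
 };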


etaf commented Jan 6, 2017

Thanks for your answer.
I still have two questions:

  1. I've tried other models; the results for bvlc_reference_caffenet, vgg_16, and vgg_19 are similar to bvlc_alexnet.
    They all have a dropout layer right after an fc layer.
    But bvlc_googlenet does not show a large gap between the first and second time.
    The dropout layer of bvlc_googlenet does not come after an fc layer.

Is the gap determined by the position of the dropout layer (i.e. by its previous layer)? Why?

  2. Apart from the dropout layer, are there any other non-mkl-dnn layers?
    If so, does the mkl-dnn team plan to make those non-mkl-dnn layers into mkl-dnn layers?

Thanks.


emfomenk commented Jan 6, 2017

Hi @etaf,

  1. I've tried other models; the results for bvlc_reference_caffenet, vgg_16, and vgg_19 are similar to bvlc_alexnet.
    They all have a dropout layer right after an fc layer.
    But bvlc_googlenet does not show a large gap between the first and second time.
    The dropout layer of bvlc_googlenet does not come after an fc layer.

Is the gap determined by the position of the dropout layer (i.e. by its previous layer)? Why?

It is not a performance gap -- all the computations simply happen in the dropout layer (from Caffe's perspective). The overall run-time is exactly the same. If this is confusing, replace the lazy stream with an eager one: the behavior will be more intuitive in that case.

  2. Apart from the dropout layer, are there any other non-mkl-dnn layers?
    If so, does the mkl-dnn team plan to make those non-mkl-dnn layers into mkl-dnn layers?

For now mkl-dnn supports convolution, relu, lrn, pooling, inner-product, concat, split, and eltwise. All other layers fall back to the native Caffe implementation. For popular topologies like AlexNet, GoogleNet, ResNet, and VGG, all the most compute-intensive layers are covered. We don't have particular plans for new primitives right now... our main focus at the moment is to optimize the backward computations and to provide optimizations at least on the same level as Intel MKL does.
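
Purely as an illustration of what the fallback means (the function and list below are hypothetical, not Intel Caffe code): only layer types that have an mkl-dnn implementation are dispatched to it, everything else runs the native Caffe code.

 #include <set>
 #include <string>

 // Hypothetical list of layer types with an mkl-dnn implementation.
 static const std::set<std::string> kMkldnnLayerTypes = {
     "Convolution", "ReLU", "LRN", "Pooling",
     "InnerProduct", "Concat", "Split", "Eltwise"};

 // Anything not in the list (e.g. Dropout, Softmax) falls back to the
 // native Caffe implementation.
 bool uses_mkldnn(const std::string &layer_type) {
   return kMkldnnLayerTypes.count(layer_type) != 0;
 }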


etaf commented Jan 6, 2017

It is not a performance gap -- all the computations simply happen in the dropout layer (from Caffe's perspective). The overall run-time is exactly the same. If this is confusing, replace the lazy stream with an eager one: the behavior will be more intuitive in that case.

I've replaced the lazy stream with an eager one. The execution time is now spread among the different layers.

But I'm still confused about why the ratio between the first and the second run in AlexNet is more than 30x, while for GoogleNet it is less than 2x.

Thanks!


etaf commented Jan 9, 2017

Hi @emfomenk,
Any comments?


emfomenk commented Jan 9, 2017

I believe I've already answered here:

MKL-DNN integration also has lazy primitive initialization -- primitive creation happens during the first run (i.e. during the first Forward() call of the corresponding layer). That is why the first run may take much more time than all subsequent ones. On the other hand, initialization of Intel MKL primitives happens in Layer::SetUp() -- hence the first and second runs are roughly the same.

Compare the MKL integration vs. the MKL-DNN integration.
Setup time also depends on the primitives and their parameters used in the topology -- hence the first-run slowdown may differ between AlexNet and GoogleNet.
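
The practical implication for benchmarking (a sketch based on the timing code earlier in this thread) is to do one untimed warm-up Forward() so the one-time primitive creation falls outside the measured region:

 // warm-up run: lazy mkl-dnn primitive creation happens here, untimed
 net_->Forward();

 // steady-state measurement (same boost timing as above)
 boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
 net_->Forward();
 boost::posix_time::ptime stop = boost::posix_time::microsec_clock::local_time();
 double steady_ms = (stop - start).total_milliseconds();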

@rsdubtso

Closing, as no further questions have been posted.
