Inference performance of bvlc_alexnet is far slower on mkl-dnn #17

Closed
etaf opened this issue Dec 29, 2016 · 7 comments

etaf commented Dec 29, 2016

I found that the inference performance of bvlc_alexnet is far slower with mkl-dnn.
With Intel Caffe built with the mkl-dnn engine, I ran the caffe/examples/cpp_classification example and collected the elapsed time of the line net_->Forward().
It is about 835 ms.
But the result for Intel Caffe with MKL 2017 is 16 ms.
Then I added the following code to caffe/examples/cpp_classification/classification.cpp:

 // requires: #include <boost/date_time/posix_time/posix_time.hpp>
 boost::posix_time::ptime start_cpu_;
 boost::posix_time::ptime stop_cpu_;

 // first time
 start_cpu_ = boost::posix_time::microsec_clock::local_time();
 net_->Forward();
 stop_cpu_ = boost::posix_time::microsec_clock::local_time();
 double first_time = (stop_cpu_ - start_cpu_).total_milliseconds();

 // second time
 start_cpu_ = boost::posix_time::microsec_clock::local_time();
 net_->Forward();
 stop_cpu_ = boost::posix_time::microsec_clock::local_time();
 double second_time = (stop_cpu_ - start_cpu_).total_milliseconds();

The result is:
first time: 835 ms
second time: 15.32 ms


It's strange that there is a huge gap between the first and the second time.
With Intel Caffe and MKL 2017, the result is:
first time: 18 ms
second time: 16 ms

I collected each layer's forward time for the first inference.
I found that the time is spent in the first dropout layer, in:

caffe/src/caffe/mkldnn_memory.cpp => MKLDNNMemoryDescriptor<Dtype, is_diff>::on_to_cpu()
=> StreamHolder::Instance().current_stream()->wait();

The first dropout layer comes right after an fc (fully connected) layer.

I've tried other models; the results for bvlc_reference_caffenet, vgg_16, and vgg_19 are similar to bvlc_alexnet.
They all have a dropout layer right after an fc layer.
But bvlc_googlenet does not show a large gap between the first and second time.
The dropout layer of bvlc_googlenet does not come after an fc layer.
Is this a known issue?


Here is my configuration:
CPU: i7-4770 @ 3.40GHz x 8
Memory: 8 GB
OS: Ubuntu 14.04
Intel Caffe: latest
mkl-dnn: commit 47bda95
(the latest mkl-dnn does not work with the latest Intel Caffe, so I cannot reproduce this with the latest mkl-dnn)
Test image: caffe/examples/images/cat.jpg


@emfomenk

Hi @etaf,

By default, Intel Caffe uses a lazy stream for mkl-dnn execution, which means the actual execution may be postponed until stream::wait() is called. In your case all the primitives before the dropout are put into a single stream, and their execution happens in the first non-mkl-dnn layer (i.e. dropout). That is why you see:
MKLDNNMemoryDescriptor<Dtype, is_diff>::on_to_cpu() => StreamHolder::Instance().current_stream()->wait();
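
To illustrate: conceptually a lazy stream behaves like the following minimal sketch in plain C++ (the LazyStream and Primitive names are made-up illustrations, not the actual MKL-DNN or Intel Caffe API). submit() only queues work, and everything runs when wait() is called, so the whole backlog is attributed to whichever layer triggers the wait.

 #include <functional>
 #include <iostream>
 #include <vector>

 // A deferred computation standing in for an MKL-DNN primitive.
 using Primitive = std::function<void()>;

 // Conceptual lazy stream: submit() only queues work, wait() runs it.
 class LazyStream {
  public:
   void submit(Primitive p) { queue_.push_back(std::move(p)); }
   void wait() {
     for (auto &p : queue_) p();   // all queued primitives execute here
     queue_.clear();
   }
  private:
   std::vector<Primitive> queue_;
 };

 int main() {
   LazyStream stream;
   stream.submit([] { std::cout << "conv forward\n"; });  // queued, not run
   stream.submit([] { std::cout << "fc forward\n"; });    // queued, not run
   // The first non-mkl-dnn layer (e.g. dropout) needs the data on the CPU
   // and calls wait(): the whole backlog executes here, so the time shows
   // up in that layer.
   stream.wait();
   return 0;
 }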

MKL-DNN integration also has lazy primitive initialization -- primitive creation happens during the first run (i.e. during the first Forward() call of the corresponding layer). That is why the first run may take much more time than all subsequent ones. On the other hand, initialization of Intel MKL primitives happens in Layer::SetUp() -- hence the first and second runs are roughly the same.
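
As a rough sketch of that difference (hypothetical layer classes, not the actual Intel Caffe code):

 #include <memory>

 struct Primitive { Primitive() { /* expensive: descriptor setup, JIT, ... */ } };

 // MKL-DNN-style layer: the primitive is created lazily on the first Forward(),
 // so the first call is much slower than the following ones.
 class LazyInitLayer {
  public:
   void SetUp() { /* nothing expensive here */ }
   void Forward() {
     if (!prim_) prim_.reset(new Primitive());  // paid only on the first call
     // ... execute prim_ ...
   }
  private:
   std::unique_ptr<Primitive> prim_;
 };

 // Intel MKL (mkl2017)-style layer: the primitive is created in SetUp(),
 // so the first and second Forward() calls cost roughly the same.
 class EagerInitLayer {
  public:
   void SetUp() { prim_.reset(new Primitive()); }
   void Forward() { /* ... execute prim_ ... */ }
  private:
   std::unique_ptr<Primitive> prim_;
 };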


etaf commented Jan 6, 2017

Thanks for your answer.
I still have two questions:

  1. I've tried other models; the results for bvlc_reference_caffenet, vgg_16, and vgg_19 are similar to bvlc_alexnet.
    They all have a dropout layer right after an fc layer.
    But bvlc_googlenet does not show a large gap between the first and second time.
    The dropout layer of bvlc_googlenet does not come after an fc layer.

Is the gap determined by the position of the dropout layer (i.e. by its previous layer)? Why?

  2. Apart from the dropout layer, are there any other non-mkl-dnn layers?
    If so, does the mkl-dnn team plan to make those non-mkl-dnn layers into mkl-dnn layers?

Thanks.


emfomenk commented Jan 6, 2017

Hi @etaf,

  1. I've tried other models; the results for bvlc_reference_caffenet, vgg_16, and vgg_19 are similar to bvlc_alexnet.
    They all have a dropout layer right after an fc layer.
    But bvlc_googlenet does not show a large gap between the first and second time.
    The dropout layer of bvlc_googlenet does not come after an fc layer.

Is the gap determined by the position of the dropout layer (i.e. by its previous layer)? Why?

It is not a performance gap -- all the computations simply happen in the dropout layer (from Caffe's perspective). The overall run-time is exactly the same. If this is confusing, replace the lazy stream with an eager one: the behavior will be more intuitive in that case.

  2. Apart from the dropout layer, are there any other non-mkl-dnn layers?
    If so, does the mkl-dnn team plan to make those non-mkl-dnn layers into mkl-dnn layers?

For now mkl-dnn supports convolution, relu, lrn, pooling, inner-product, concat, split, and eltwise. All other layers fall back to the native Caffe implementation. For popular topologies like AlexNet, GoogleNet, ResNet, and VGG, all the most compute-intensive layers are covered. We don't have particular plans for new primitives right now... our main focus at the moment is to optimize the backward computations and to provide optimizations at least on the same level as Intel MKL does.
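
Purely as an illustration of what the fallback means (the function and list below are hypothetical, not Intel Caffe code): only layer types that have an mkl-dnn implementation are dispatched to it, everything else runs the native Caffe code.

 #include <set>
 #include <string>

 // Hypothetical list of layer types with an mkl-dnn implementation.
 static const std::set<std::string> kMkldnnLayerTypes = {
     "Convolution", "ReLU", "LRN", "Pooling",
     "InnerProduct", "Concat", "Split", "Eltwise"};

 // Anything not in the list (e.g. Dropout, Softmax) falls back to the
 // native Caffe implementation.
 bool uses_mkldnn(const std::string &layer_type) {
   return kMkldnnLayerTypes.count(layer_type) != 0;
 }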


etaf commented Jan 6, 2017

It is not a performance gap -- all the computations simply happen in the dropout layer (from Caffe's perspective). The overall run-time is exactly the same. If this is confusing, replace the lazy stream with an eager one: the behavior will be more intuitive in that case.

I've replaced the lazy stream with an eager one. The execution time is now spread among the different layers.

But I'm still confused about why the ratio between the first and the second run in AlexNet is more than 30x, while for GoogleNet it is less than 2x.

Thanks!


etaf commented Jan 9, 2017

Hi @emfomenk,
Any comments?


emfomenk commented Jan 9, 2017

I believe I've already answered here:

MKL-DNN integration also has lazy primitive initialization -- primitive creation happens during the first run (i.e. during the first Forward() call of the corresponding layer). That is why the first run may take much more time than all subsequent ones. On the other hand, initialization of Intel MKL primitives happens in Layer::SetUp() -- hence the first and second runs are roughly the same.

Compare the MKL integration vs. the MKL-DNN integration.
Setup time also depends on the primitives and their parameters used in the topology -- hence the first-run slowdown may differ between AlexNet and GoogleNet.
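
The practical implication for benchmarking (a sketch based on the timing code earlier in this thread) is to do one untimed warm-up Forward() so the one-time primitive creation falls outside the measured region:

 // warm-up run: lazy mkl-dnn primitive creation happens here, untimed
 net_->Forward();

 // steady-state measurement (same boost timing as above)
 boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
 net_->Forward();
 boost::posix_time::ptime stop = boost::posix_time::microsec_clock::local_time();
 double steady_ms = (stop - start).total_milliseconds();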

@rsdubtso

Closing, as no further questions have been posted.
