
NNPACK on Android costs more time in conv than OpenBLAS with a single thread #39

Closed
conansherry opened this issue Dec 17, 2016 · 11 comments

@conansherry

im2col + OpenBLAS sgemm costs about half the time of NNPACK in single-thread mode (multi-threading would interfere with other programs on a weak Android CPU).
Batch size = 1, using inference mode.
So I will continue to use OpenBLAS sgemm + im2col.
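For context, a minimal single-threaded sketch of the im2col + SGEMM path being compared here, assuming a Caffe-style im2col helper and OpenBLAS's standard CBLAS interface; the helper names and shapes are illustrative, not taken from the actual network:

```c
/* Minimal sketch of the im2col + SGEMM convolution path (single thread),
 * assuming OpenBLAS's standard CBLAS interface. Sizes are illustrative. */
#include <cblas.h>

/* Caffe-style im2col for a stride-1, zero-padded convolution.
 * col must hold in_ch*ksize*ksize*out_h*out_w floats. */
static void im2col(const float* data, int channels, int height, int width,
                   int ksize, int pad, float* col) {
    const int out_h = height + 2 * pad - ksize + 1;
    const int out_w = width + 2 * pad - ksize + 1;
    for (int c = 0; c < channels * ksize * ksize; ++c) {
        const int w_off = c % ksize;
        const int h_off = (c / ksize) % ksize;
        const int ch = c / (ksize * ksize);
        for (int y = 0; y < out_h; ++y) {
            for (int x = 0; x < out_w; ++x) {
                const int in_y = y + h_off - pad;
                const int in_x = x + w_off - pad;
                col[(c * out_h + y) * out_w + x] =
                    (in_y >= 0 && in_y < height && in_x >= 0 && in_x < width)
                        ? data[(ch * height + in_y) * width + in_x]
                        : 0.0f;
            }
        }
    }
}

/* output[out_ch x out_h*out_w] = weights[out_ch x in_ch*k*k] * col */
static void conv_im2col_sgemm(const float* input, const float* weights,
                              float* output, float* col_buf,
                              int in_ch, int out_ch, int h, int w,
                              int ksize, int pad) {
    const int out_h = h + 2 * pad - ksize + 1;
    const int out_w = w + 2 * pad - ksize + 1;
    im2col(input, in_ch, h, w, ksize, pad, col_buf);
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                out_ch, out_h * out_w, in_ch * ksize * ksize,
                1.0f, weights, in_ch * ksize * ksize,
                col_buf, out_h * out_w,
                0.0f, output, out_h * out_w);
}
```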

@Maratyszcza
Owner

Which convolution parameters and algorithm do you use?

@conansherry
Author

Default parameters: AUTO algorithm, BLOCK_BASED transform strategy.

@conansherry
Author

conansherry commented Dec 17, 2016

This is my prototxt.
Input data is 1 x 3 x 60 x 60 (I also tried another input image size, 640x480; it is also slower than OpenBLAS, costing about 2x the time).
My OpenBLAS is compiled with NDK r12b, including gfortran.

@austingg

@conansherry NNPACK only supports convolution with stride 1; when stride > 1, NNPACK also falls back to im2col + sgemm.
However, I wonder why it costs 2x the time compared to OpenBLAS.

@conansherry
Author

conansherry commented Dec 17, 2016

@austingg Oh, I see it in the source code, and you are right.
OpenBLAS with gfortran is the best BLAS library on Android according to my experiments, compared to pure-C OpenBLAS and Eigen.

@austingg

@conansherry Thanks for sharing your experiment results. I will do some further experiments on OpenBLAS with gfortran.

@conansherry
Author

conansherry commented Dec 18, 2016

@Maratyszcza @austingg Does NNPACK only support specific kernel sizes like 3x3 or 16x16? In my new test, kernel size 5 with stride 1 gives wrong results.

@conansherry
Author

conansherry commented Dec 18, 2016

@Maratyszcza @austingg Oh, I looked up the Caffe2 implementation. I use the tuple_based strategy and everything is OK. I also checked the source code in convolution-inference.c; the other mode is not implemented.
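A minimal sketch of the tuple_based call being discussed, assuming the 2016-era nnp_convolution_inference signature from nnpack.h (newer versions add workspace and activation arguments, so the parameter list may differ); the layer shape and channel counts are illustrative, not the actual prototxt:

```c
/* Sketch of a single-threaded NNPACK inference-mode convolution using the
 * tuple_based transform strategy. Assumes the 2016-era API; shapes are
 * illustrative (a 5x5, stride-1 layer on a 60x60 input). */
#include <nnpack.h>
#include <stddef.h>

enum nnp_status run_conv(const float* input, const float* kernel,
                         const float* bias, float* output) {
    enum nnp_status status = nnp_initialize();
    if (status != nnp_status_success) {
        return status;
    }

    const struct nnp_size input_size = { .width = 60, .height = 60 };
    const struct nnp_padding input_padding = { .top = 2, .right = 2,
                                               .bottom = 2, .left = 2 };
    const struct nnp_size kernel_size = { .width = 5, .height = 5 };
    const struct nnp_size output_subsampling = { .width = 1, .height = 1 };

    /* threadpool = NULL -> single-threaded, matching the experiments here. */
    return nnp_convolution_inference(
        nnp_convolution_algorithm_auto,
        nnp_convolution_transform_strategy_tuple_based,
        3,   /* input channels (illustrative)  */
        32,  /* output channels (illustrative) */
        input_size, input_padding, kernel_size, output_subsampling,
        input, kernel, bias, output,
        NULL /* threadpool */, NULL /* profile */);
}
```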

@conansherry
Author

conansherry commented Dec 18, 2016

@Maratyszcza @austingg
Here I will share some results from my experiments.
Android mobile phone: XIAOMI 5 Plus. All libraries run in single-thread mode.
The DL net contains 4 conv layers and two inner-product layers. All conv layers are stride 1 with kernel size equal to 5 or 3.
I ran the program 10 times for each experiment. Here are the results (times in ms), so I will continue to choose OpenBLAS in my library. Thank you for sharing the NNPACK source; it also does a good job on multi-core CPUs.

openblas with gfortran
time forward 11.841000
time forward 10.097000
time forward 10.139000
time forward 10.583000
time forward 10.498000
time forward 10.358000
time forward 10.501000
time forward 10.440000
time forward 10.524000
time forward 10.268000

NNPACK FFT16X16
time forward 32.105999
time forward 28.781000
time forward 29.034000
time forward 61.912998
time forward 31.129999
time forward 27.649000
time forward 27.438000
time forward 26.731001
time forward 31.448000
time forward 28.899000

NNPACK FFT8X8
time forward 21.823999
time forward 21.607000
time forward 13.321000
time forward 15.339000
time forward 33.285000
time forward 19.327000
time forward 20.174000
time forward 16.476999
time forward 15.926000
time forward 16.066000

NNPACK AUTO
time forward 19.642000
time forward 20.684000
time forward 17.167999
time forward 15.738000
time forward 15.673000
time forward 14.938000
time forward 14.289000
time forward 17.891001
time forward 17.363001
time forward 16.375000

NNPACK SGEMM
time forward 23.778999
time forward 22.764000
time forward 33.705002
time forward 34.299000
time forward 28.004000
time forward 30.851999
time forward 25.034000
time forward 25.563999
time forward 33.702999
time forward 23.247999

@austingg

@conansherry That's a pretty good result for a CNN application, costing only about 10 ms on mobile devices.

According to my research, gfortran is only related to LAPACK, and the conv layers only use GEMM. Have you ever done experiments without gfortran? Correct me if I am wrong.

@Maratyszcza
Owner

  1. The implicit GEMM algorithm is similar to Caffe's im2col+SGEMM, but it is optimized for a smaller memory footprint. This memory-footprint optimization can make it slower than im2col+SGEMM.
  2. For stride > 1 cases, only the implicit GEMM algorithm is supported in NNPACK.
  3. When the number of channels on the input to a convolution is small, the operation is similar to an outer product: it is intrinsically memory bound, and the fast algorithms in NNPACK do not help with performance.
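As a hedged illustration of points 2 and 3 above (this is not NNPACK's internal selection logic), a caller could pick the algorithm roughly like this; the channel threshold is an assumption and the enum names follow current nnpack.h:

```c
/* Illustrative caller-side heuristic based on the points above; not NNPACK's
 * actual selection logic, and the channel threshold is a guess. */
#include <nnpack.h>
#include <stddef.h>

enum nnp_convolution_algorithm choose_algorithm(size_t stride_h, size_t stride_w,
                                                size_t input_channels,
                                                size_t kernel_h, size_t kernel_w) {
    /* Point 2: only implicit GEMM supports stride > 1. */
    if (stride_h > 1 || stride_w > 1) {
        return nnp_convolution_algorithm_implicit_gemm;
    }
    /* Point 3: with very few input channels the op is memory bound and the
     * transform-based algorithms do not help; a small threshold is assumed. */
    if (input_channels < 8) {
        return nnp_convolution_algorithm_implicit_gemm;
    }
    /* 3x3 kernels can use the Winograd transform; larger kernels use FFT. */
    if (kernel_h == 3 && kernel_w == 3) {
        return nnp_convolution_algorithm_wt8x8;
    }
    return nnp_convolution_algorithm_ft8x8;
}
```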
