
NNPACK on Android costs more time in conv than OpenBLAS with a single thread #39

Closed
conansherry opened this issue Dec 17, 2016 · 11 comments

@conansherry

im2col + OpenBLAS sgemm costs about half the time of NNPACK in single-thread mode (multi-threading would interfere with other programs on a weak Android CPU).
Batch size = 1, using inference mode.
So I will continue to use OpenBLAS sgemm + im2col.
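For context, a minimal single-threaded sketch of the im2col + SGEMM path being compared here, assuming a Caffe-style im2col helper and OpenBLAS's standard CBLAS interface; the helper names and shapes are illustrative, not taken from the actual network:

```c
/* Minimal sketch of the im2col + SGEMM convolution path (single thread),
 * assuming OpenBLAS's standard CBLAS interface. Sizes are illustrative. */
#include <cblas.h>

/* Caffe-style im2col for a stride-1, zero-padded convolution.
 * col must hold in_ch*ksize*ksize*out_h*out_w floats. */
static void im2col(const float* data, int channels, int height, int width,
                   int ksize, int pad, float* col) {
    const int out_h = height + 2 * pad - ksize + 1;
    const int out_w = width + 2 * pad - ksize + 1;
    for (int c = 0; c < channels * ksize * ksize; ++c) {
        const int w_off = c % ksize;
        const int h_off = (c / ksize) % ksize;
        const int ch = c / (ksize * ksize);
        for (int y = 0; y < out_h; ++y) {
            for (int x = 0; x < out_w; ++x) {
                const int in_y = y + h_off - pad;
                const int in_x = x + w_off - pad;
                col[(c * out_h + y) * out_w + x] =
                    (in_y >= 0 && in_y < height && in_x >= 0 && in_x < width)
                        ? data[(ch * height + in_y) * width + in_x]
                        : 0.0f;
            }
        }
    }
}

/* output[out_ch x out_h*out_w] = weights[out_ch x in_ch*k*k] * col */
static void conv_im2col_sgemm(const float* input, const float* weights,
                              float* output, float* col_buf,
                              int in_ch, int out_ch, int h, int w,
                              int ksize, int pad) {
    const int out_h = h + 2 * pad - ksize + 1;
    const int out_w = w + 2 * pad - ksize + 1;
    im2col(input, in_ch, h, w, ksize, pad, col_buf);
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                out_ch, out_h * out_w, in_ch * ksize * ksize,
                1.0f, weights, in_ch * ksize * ksize,
                col_buf, out_h * out_w,
                0.0f, output, out_h * out_w);
}
```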

@Maratyszcza
Owner

Which convolution parameters and algorithm do you use?

@conansherry
Author

Default parameters: AUTO algorithm, BLOCK_BASED transform strategy.

@conansherry
Author

conansherry commented Dec 17, 2016

This is my prototxt.
Input data is 1 x 3 x 60 x 60 (I also tried another input image size, 640x480; it is also slower than OpenBLAS, costing about 2x the time).
My OpenBLAS is compiled with NDK r12b, including gfortran.

@austingg

@conansherry NNPACK only supports convolution with stride 1; when stride > 1, NNPACK also falls back to im2col + sgemm.
However, I wonder why it costs 2x the time compared to OpenBLAS.

@conansherry
Author

conansherry commented Dec 17, 2016

@austingg Oh, I see it in the source code, and you are right.
OpenBLAS with gfortran is the best BLAS library on Android according to my experiments, compared to pure-C OpenBLAS and Eigen.

@austingg

@conansherry Thanks for sharing your experiment results. I will do some further experiments on OpenBLAS with gfortran.

@conansherry
Author

conansherry commented Dec 18, 2016

@Maratyszcza @austingg Does NNPACK only support specific kernel sizes like 3x3 or 16x16? In my new test, kernel size 5 with stride 1 gives wrong results.

@conansherry
Author

conansherry commented Dec 18, 2016

@Maratyszcza @austingg Oh, I looked up the Caffe2 implementation. I use the tuple_based strategy and everything is OK. I also checked the source code in convolution-inference.c; the other mode is not implemented.
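A minimal sketch of the tuple_based call being discussed, assuming the 2016-era nnp_convolution_inference signature from nnpack.h (newer versions add workspace and activation arguments, so the parameter list may differ); the layer shape and channel counts are illustrative, not the actual prototxt:

```c
/* Sketch of a single-threaded NNPACK inference-mode convolution using the
 * tuple_based transform strategy. Assumes the 2016-era API; shapes are
 * illustrative (a 5x5, stride-1 layer on a 60x60 input). */
#include <nnpack.h>
#include <stddef.h>

enum nnp_status run_conv(const float* input, const float* kernel,
                         const float* bias, float* output) {
    enum nnp_status status = nnp_initialize();
    if (status != nnp_status_success) {
        return status;
    }

    const struct nnp_size input_size = { .width = 60, .height = 60 };
    const struct nnp_padding input_padding = { .top = 2, .right = 2,
                                               .bottom = 2, .left = 2 };
    const struct nnp_size kernel_size = { .width = 5, .height = 5 };
    const struct nnp_size output_subsampling = { .width = 1, .height = 1 };

    /* threadpool = NULL -> single-threaded, matching the experiments here. */
    return nnp_convolution_inference(
        nnp_convolution_algorithm_auto,
        nnp_convolution_transform_strategy_tuple_based,
        3,   /* input channels (illustrative)  */
        32,  /* output channels (illustrative) */
        input_size, input_padding, kernel_size, output_subsampling,
        input, kernel, bias, output,
        NULL /* threadpool */, NULL /* profile */);
}
```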

@conansherry
Author

conansherry commented Dec 18, 2016

@Maratyszcza @austingg
Here I will share some results from my experiments.
Android mobile phone: XIAOMI 5 Plus. All libraries run in single-thread mode.
The DL net contains 4 conv layers and two inner-product layers. All conv layers are stride 1 with kernel size equal to 5 or 3.
I ran the program 10 times for each experiment. Here are the results (times in ms), so I will continue to choose OpenBLAS in my library. Thank you for sharing the NNPACK source; it also does a good job on multi-core CPUs.

openblas with gfortran
time forward 11.841000
time forward 10.097000
time forward 10.139000
time forward 10.583000
time forward 10.498000
time forward 10.358000
time forward 10.501000
time forward 10.440000
time forward 10.524000
time forward 10.268000

NNPACK FFT16X16
time forward 32.105999
time forward 28.781000
time forward 29.034000
time forward 61.912998
time forward 31.129999
time forward 27.649000
time forward 27.438000
time forward 26.731001
time forward 31.448000
time forward 28.899000

NNPACK FFT8X8
time forward 21.823999
time forward 21.607000
time forward 13.321000
time forward 15.339000
time forward 33.285000
time forward 19.327000
time forward 20.174000
time forward 16.476999
time forward 15.926000
time forward 16.066000

NNPACK AUTO
time forward 19.642000
time forward 20.684000
time forward 17.167999
time forward 15.738000
time forward 15.673000
time forward 14.938000
time forward 14.289000
time forward 17.891001
time forward 17.363001
time forward 16.375000

NNPACK SGEMM
time forward 23.778999
time forward 22.764000
time forward 33.705002
time forward 34.299000
time forward 28.004000
time forward 30.851999
time forward 25.034000
time forward 25.563999
time forward 33.702999
time forward 23.247999

@austingg

@conansherry That's a pretty good result for a CNN application, costing only about 10 ms on mobile devices.

According to my research, gfortran is only related to LAPACK, and the conv layers only use GEMM. Have you ever done experiments without gfortran? Correct me if I am wrong.

@Maratyszcza
Owner

  1. The implicit GEMM algorithm is similar to Caffe's im2col+SGEMM, but it is optimized for a smaller memory footprint. This memory-footprint optimization can make it slower than im2col+SGEMM.
  2. For stride > 1 cases, only the implicit GEMM algorithm is supported in NNPACK.
  3. When the number of channels on the input to a convolution is small, the operation is similar to an outer product: it is intrinsically memory bound, and the fast algorithms in NNPACK do not help with performance.
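As a hedged illustration of points 2 and 3 above (this is not NNPACK's internal selection logic), a caller could pick the algorithm roughly like this; the channel threshold is an assumption and the enum names follow current nnpack.h:

```c
/* Illustrative caller-side heuristic based on the points above; not NNPACK's
 * actual selection logic, and the channel threshold is a guess. */
#include <nnpack.h>
#include <stddef.h>

enum nnp_convolution_algorithm choose_algorithm(size_t stride_h, size_t stride_w,
                                                size_t input_channels,
                                                size_t kernel_h, size_t kernel_w) {
    /* Point 2: only implicit GEMM supports stride > 1. */
    if (stride_h > 1 || stride_w > 1) {
        return nnp_convolution_algorithm_implicit_gemm;
    }
    /* Point 3: with very few input channels the op is memory bound and the
     * transform-based algorithms do not help; a small threshold is assumed. */
    if (input_channels < 8) {
        return nnp_convolution_algorithm_implicit_gemm;
    }
    /* 3x3 kernels can use the Winograd transform; larger kernels use FFT. */
    if (kernel_h == 3 && kernel_w == 3) {
        return nnp_convolution_algorithm_wt8x8;
    }
    return nnp_convolution_algorithm_ft8x8;
}
```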
