
Can we force MKLDNN to use cblas always instead of JIT kernels generated at runtime #415

Closed
avinashcpandey opened this issue Feb 21, 2019 · 23 comments

@avinashcpandey

Can we force MKL-DNN to always use CBLAS functions instead of JIT kernels generated at runtime?
I am using the external library OpenBLAS and want to use it for all GEMM-related work.

@emfomenk

Intel MKL-DNN doesn't give any guarantees about which GEMM will be used if both CBLAS and JIT GEMMs are available.
In the current implementation mkldnn_sgemm() will use regular CBLAS if it is available. However, the RNN primitive might explicitly use the jitted GEMM even if CBLAS is available.

Is there any particular reason why you want to avoid the jitted GEMM?
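
For illustration, the choice between an external CBLAS and the built-in GEMM is essentially a build-time switch (roughly what the USE_CBLAS build option, mentioned further down this thread, controls). The sketch below is hypothetical, not the actual MKL-DNN internals; the fallback here is just a naive reference loop:

// Hypothetical sketch of a gemm entry point choosing between an external
// CBLAS (OpenBLAS, Intel MKL, ...) and a built-in fallback.
// Column-major; only 'N'/'T' transposition is handled for brevity.
#ifdef USE_CBLAS
#include <cblas.h>
#endif

void sgemm_dispatch(char transa, char transb, int M, int N, int K,
                    float alpha, const float *A, int lda,
                    const float *B, int ldb, float beta, float *C, int ldc) {
#ifdef USE_CBLAS
    cblas_sgemm(CblasColMajor,
                transa == 'N' ? CblasNoTrans : CblasTrans,
                transb == 'N' ? CblasNoTrans : CblasTrans,
                M, N, K, alpha, A, lda, B, ldb, beta, C, ldc);
#else
    // Built-in fallback (a jitted or reference gemm in the real library);
    // here a naive reference loop for the non-transposed case only.
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < M; ++i) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k)
                acc += A[i + k * lda] * B[k + j * ldb];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
#endif
}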

@emfomenk emfomenk self-assigned this Feb 22, 2019
@avinashcpandey
Author

Hi,
I want to compare MKL-DNN with OpenBLAS against MKL-DNN with MKL BLAS, and for that I want all convolution operations to go through the regular GEMM path. In the end I will also evaluate MKL-DNN with the JIT-based convolutions.
I am doing this with the TensorFlow framework and the ResNet-50 model.
With profiling I see most of the time going into JIT-based kernels, which VTune reports as "outside any module".

I am looking for some way to force MKL-DNN not to use the JIT-based GEMM convolution.

@emfomenk

To disable JIT-based convolutions, use the trick described here.

For convolutions it is safe to assume that they will use the CBLAS sgemm for now. You can verify this by looking at the MKLDNN_VERBOSE output: if the jitted-GEMM-based convolution is used you will see gemm:jit; if CBLAS is used, gemm:blas.

To enable the VTune report, see this page.
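
As a quick illustration of the MKLDNN_VERBOSE switch (a minimal sketch; the simplest route is just to set the variable in the shell before launching the workload):

// Minimal sketch: turn on MKL-DNN verbose output for the current process.
// The variable must be set before the first primitive executes; running the
// application with MKLDNN_VERBOSE=1 from the shell has the same effect.
#include <cstdlib>

int main() {
    setenv("MKLDNN_VERBOSE", "1", /*overwrite=*/1);

    // ... create and execute MKL-DNN primitives here ...
    // Each execution then prints one line, e.g.
    //   mkldnn_verbose,exec,convolution,jit:avx2,...   -> jitted convolution
    //   mkldnn_verbose,exec,convolution,gemm:blas,...  -> CBLAS-backed convolution
    return 0;
}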

@avinashcpandey
Author

Thanks @emfomenk. I tried this with simplenet and it's working fine.
Now I want to try the same with TensorFlow. I know TensorFlow downloads the MKL-DNN sources, builds them, and links the library statically.
Is there a way to link it as a shared library in TensorFlow, so that I can modify the MKL-DNN code and relink it easily?

@emfomenk

Hi @avinashcpandey,

I don't think TF has a standard way of doing so. Some time back I worked around that by removing the Intel MKL-DNN sources from TF (hence leaving undefined (U) symbols) and then running the workloads with LD_PRELOAD=libmkldnn.so. That allowed me to make quick library replacements.
But I am not sure that still works.

There is also a way to tell bazel where to download external dependencies, and then (once they are downloaded for the first time) you can patch them and rebuild TF. That should be relatively fast. I think the option is --output_base=/path/to/dir.

@mgouicem
Contributor

Hi @avinashcpandey, you can use your own copy of the MKL-DNN sources with TensorFlow by modifying the tensorflow/workspace.bzl file. Modify the mkl_dnn rule as follows:

native.new_local_repository(
    name = "mkl_dnn",
    build_file = clean_dep("//third_party/mkl_dnn:mkldnn.BUILD"),
    path = "/path/to/your/local/copy",
)

Hope this helps.

@avinashcpandey
Author

Thanks @emfomenk and @mgouicem! It's working for me.

@avinashcpandey
Author

avinashcpandey commented Feb 25, 2019

I did some benchmarking between JIT-based and non-JIT (GEMM-based) convolution. For the GEMM-based path I used MKL BLAS calls, and I see a significant performance difference between the two.

I ran the ResNet-50 model in TensorFlow (python tf_cnn_benchmarks.py --device=cpu --model=resnet50 --data_format=NHWC --batch_size=256 --num_batches=1 --num_inter_threads=1 --num_intra_threads=16 --mkl=True --nodistortions --forward_only=True).
Observations:
  • Total convolution calls: 583
  • Convolution time with JIT: 31 sec
  • Convolution time without JIT (MKL BLAS): 59 sec, with 149248 BLAS sgemm calls in total
  • That is 256 sgemm calls for every convolution (same matrix sizes): 256*583 = 149248

I know the JIT kernel will be faster, but I did not expect such a difference even with MKL BLAS.
Why do I see such a huge number of sgemm calls (256*583) in the non-jitted version?

@vpirogov
Member

I would guess that TensorFlow uses its own implementation when you force convolutions to BLAS. It might just be parallelism over the batch size.

Did you look at what MKLDNN_VERBOSE prints out during these runs?

@avinashcpandey
Author

I built TF with MKL BLAS, so the BLAS calls go into MKL.

With jitted kernels
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:x fdst:nChw8c,alg:convolution_direct

Without jitted kernels
mkldnn_verbose,exec,convolution,gemm:blas,forward_training,fsrc:nchw fwei:oihw fbia:x fdst:nchw,alg:convolution_direct

Convolution time with JIT’ed kernels
5059.991 ms
Convolution time with non-JIT’ed kernels (1.7x slower than JIT’ed)
8994.4565 ms

ResNet-50
Convolution time with JIT’ed kernels
30910.3539 ms
Convolution time with non-JIT’ed kernels (1.9x slower than JIT’ed)
58822.9284 ms

The question is: why is the performance with MKL BLAS that much slower?

@kwiersch
Contributor

kwiersch commented Mar 1, 2019

@avinashcpandey GEMM-based convolution requires a so-called im2col transformation of the src data so that the convolution reduces to a GEMM operation. That adds some overhead, so the JIT kernel is indeed expected to be faster. In general, the GEMM-based implementation is intended for convolution shapes not yet supported by JIT (such as grouped convolutions with a number of channels per group that is not a multiple of the SIMD width).
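
To make the im2col idea concrete, here is a hedged sketch (single image, stride 1, no padding, plain float32; not the actual MKL-DNN code) of how the input is unrolled so that a single cblas_sgemm call computes the convolution:

// Convolution via im2col + GEMM: the input patches are copied into a
// (C*KH*KW) x (OH*OW) matrix, then one sgemm multiplies it by the weights.
#include <vector>
#include <cblas.h>

void conv_via_gemm(const float *src, const float *wei, float *dst,
                   int C, int H, int W, int OC, int KH, int KW) {
    const int OH = H - KH + 1, OW = W - KW + 1;

    // im2col: col is (C*KH*KW) x (OH*OW), row-major.
    std::vector<float> col((size_t)C * KH * KW * OH * OW);
    for (int c = 0; c < C; ++c)
        for (int kh = 0; kh < KH; ++kh)
            for (int kw = 0; kw < KW; ++kw)
                for (int oh = 0; oh < OH; ++oh)
                    for (int ow = 0; ow < OW; ++ow) {
                        size_t row = ((size_t)c * KH + kh) * KW + kw;
                        col[row * OH * OW + oh * OW + ow]
                                = src[(c * H + (oh + kh)) * W + (ow + kw)];
                    }

    // dst(OC x OH*OW) = wei(OC x C*KH*KW) * col(C*KH*KW x OH*OW)
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                OC, OH * OW, C * KH * KW,
                1.0f, wei, C * KH * KW, col.data(), OH * OW,
                0.0f, dst, OH * OW);
}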

@avinashcpandey
Author

Thanks @kwiersch, I get it now.
I observed one more thing: for AlexNet with batch size = 1 the number of JIT-based convolution calls is 55, and the number of GEMM calls is also 55.
However, with batch size = 256 the number of convolution calls stays at 55, but the number of GEMM calls increases to 55*246.

The GEMM calls I see happen because of the convolution sizes, which remain the same (33) in both cases.

I am trying to understand why, as the batch size increases, the number of JIT-based convolution calls stays the same while the number of GEMM-based calls increases.

Can you point me to a paper that explains this?

@kwiersch
Contributor

kwiersch commented Mar 4, 2019

@avinashcpandey can you explain how you are counting the calls to JIT-based convolution and GEMM-based convolution? Does it appear in the MKLDNN_VERBOSE output (if so, please share it!), or are you adding some hook to count the calls?

One possible explanation for the "extra" GEMM calls is the parallelization strategy for a large minibatch: as you can see at https://github.com/intel/mkl-dnn/blob/master/src/cpu/gemm_convolution.cpp#L63, work is split by minibatch, so there will be jcp.mb calls to SGEMM.

@avinashcpandey
Author

With MKL_VERBOSE and MKL_DUMP I get the number of JIT kernels for a particular run; at the same time I have a print statement (printing the matrix shapes) in the cblas interface for GEMM, which is linked to MKL, and at the end I grep for it.

This way I see the difference in the number of GEMM calls with and without the JIT interface.

Sample o/p
With jitted kernels
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:x fdst:nChw8c,alg:convolution_direct

Without jitted kernels
mkldnn_verbose,exec,convolution,gemm:blas,forward_training,fsrc:nchw fwei:oihw fbia:x fdst:nchw,alg:convolution_direct

@kwiersch
Contributor

kwiersch commented Mar 4, 2019

Okay. In that case, I don't think it is correct to say that the number of JIT calls stays the same while the number of GEMM calls increases. Instead, think of it as the number of JIT-based or GEMM-based convolution primitives remaining constant (55, as you said above). Within each GEMM-based convolution primitive, you are seeing the number of calls to SGEMM increase from 1 to 246 as the batch size is changed from 1 to 246. As I suspected, this is due to the way the GEMM-based implementation chooses its threading strategy: for small batch sizes, threading is done within the SGEMM routine (i.e. one call to a parallel SGEMM), while for large batch sizes, threading is done outside of the SGEMM routine (i.e. many calls to a sequential SGEMM). Hope that makes sense.
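
A hedged sketch of that decision (the per-image functions below are placeholders, not the library's actual code), just to show why the count of observed sgemm calls grows with the batch size:

#include <omp.h>

// Placeholder per-image convolutions; in the real code these wrap sgemm.
static void conv_one_image_parallel_sgemm(int /*n*/) { /* threaded BLAS sgemm */ }
static void conv_one_image_sequential_sgemm(int /*n*/) { /* 1-thread sgemm */ }

void gemm_conv_forward(int mb) {
    int nthr = omp_get_max_threads();
    if (mb < nthr) {
        // Small minibatch: few images, so let each SGEMM call use all threads.
        for (int n = 0; n < mb; ++n)
            conv_one_image_parallel_sgemm(n);
    } else {
        // Large minibatch: thread over images; each image issues its own
        // call to a sequential SGEMM, hence mb sgemm calls per primitive.
        #pragma omp parallel for
        for (int n = 0; n < mb; ++n)
            conv_one_image_sequential_sgemm(n);
    }
}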

@avinashcpandey
Author

OK, I got it.
One last thing: convolution_direct is a matrix-multiplication-based algorithm, right? And the same algorithm is implemented in both the JIT-based and GEMM-based versions?
I am curious about how the JIT-based convolution_direct kernels work.

@kwiersch
Contributor

kwiersch commented Mar 5, 2019

So, the convolution algorithm (alg) has three possible values: convolution_auto, convolution_direct, and convolution_wino. The first option chooses for you, the second is "direct" as in "do the calculations directly without any complex changes to the order of operations", and the third is Winograd, which you can read about here.
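
For reference, a rough sketch of how one of these algorithms is requested through the C++ API when creating a convolution primitive (written against the v1.x-style API; exact constructor signatures differ between MKL-DNN versions, so treat it as illustrative):

// Request a direct convolution explicitly; convolution_auto or
// convolution_winograd could be passed instead.
#include "mkldnn.hpp"
using namespace mkldnn;

int main() {
    engine eng(engine::kind::cpu, 0);

    // Let the library pick layouts; only the algorithm choice matters here.
    memory::desc src_md({1, 64, 56, 56}, memory::data_type::f32, memory::format_tag::any);
    memory::desc wei_md({64, 64, 3, 3},  memory::data_type::f32, memory::format_tag::any);
    memory::desc dst_md({1, 64, 56, 56}, memory::data_type::f32, memory::format_tag::any);

    auto conv_d = convolution_forward::desc(prop_kind::forward_inference,
            algorithm::convolution_direct, src_md, wei_md, dst_md,
            {1, 1} /*strides*/, {1, 1} /*padding_l*/, {1, 1} /*padding_r*/);
    auto conv_pd = convolution_forward::primitive_desc(conv_d, eng);
    (void)conv_pd;
    return 0;
}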

JIT-based convolution uses just-in-time compilation to generate a computational kernel during initialization. For more JIT implementation details, see the source code: the harness / driver and the kernel generator.

Not exactly sure what you mean by mat mul...

@avinashcpandey
Author

Thanks @kwiersch.
By mat mul I meant matrix multiplication: the convolution operation gets transformed into a matrix multiplication, which is then computed using a BLAS library or a JIT-based kernel.

@kwiersch
Contributor

kwiersch commented Mar 5, 2019

Okay, I think I understand. From that perspective, the "JIT" implementations are not really mat mul (at least not in the way that the "GEMM" implementations are). Instead of transforming the input data for a call to SGEMM, the JIT kernels contained in src/cpu/jit_{isa}_conv* basically apply the convolutional filter to the input image one row of pixels at a time (with some cache blocking and other techniques to boost performance).
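
A rough scalar sketch of what such a direct kernel computes (minus the vectorization, register blocking, and cache blocking that the JIT code adds); single image, stride 1, no padding:

// Reference "direct" convolution: apply the filter to the input directly,
// without any im2col transformation or call to an external GEMM.
void conv_direct_ref(const float *src, const float *wei, float *dst,
                     int C, int H, int W, int OC, int KH, int KW) {
    const int OH = H - KH + 1, OW = W - KW + 1;
    for (int oc = 0; oc < OC; ++oc)
        for (int oh = 0; oh < OH; ++oh)
            for (int ow = 0; ow < OW; ++ow) {
                float acc = 0.f;
                for (int c = 0; c < C; ++c)
                    for (int kh = 0; kh < KH; ++kh)
                        for (int kw = 0; kw < KW; ++kw)
                            acc += src[(c * H + oh + kh) * W + ow + kw]
                                 * wei[((oc * C + c) * KH + kh) * KW + kw];
                dst[(oc * OH + oh) * OW + ow] = acc;
            }
}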

@avinashcpandey
Author

OK, this is what I wanted to know, thanks.
I will look into the code you pointed to, src/cpu/jit_{isa}_conv*.
Apart from that, if this is based on some research paper, can you point me to it?

@vpirogov
Member

vpirogov commented Mar 5, 2019

The paper linked below explains the algorithms and optimizations in detail.

Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures

@avinashcpandey
Author

Thanks @vpirogov

@kruus
Contributor

kruus commented Jul 16, 2019

USE_CBLAS (or ref_gemm) + v1.0 --> wrong results

  • I'm attempting to move mkl-dnn v1.0 towards removing mkldnn jit gemm along the lines of this issue, and related issue Can the BLAS library link to OpenBLAS? #327
  • As a first step to removing some JIT functions, I'm trying to compile with USE_CBLAS. Without USE_CBLAS, all is fine, naturally.
  • I pointed my system cblas at Intel MKL using a Debian script
    • some tests create Intel MKL ERROR: messages (when *lda==0), test failures, wrong results and even a segfault.
    • I did not yet try other cblas libraries.
  1. USE_CBLAS should not invoke cblas_sgemm for packed formats. Ex. TestGEMM_packed_fp32
    • I guess extended_sgemm is being used for integer matrix formats as in Intel docs?
    • Possibly:
      1. USE_CBLAS should also check for a packed transa/transb value (a sketch follows below),
      2. and punt to some other integer reference impl (might ref_gemm<T> be compatible?)
    • Surprisingly, some packed tests pass with extended_sgemm invoking ref_gemm<float>, even though ref_gemm seems to not check for packed format
  2. The rnn code frequently creates gemm calls with *lda==0. Intel MKL cblas prints a loud error message every time this happens. Ex: cpu-rnn-inference-f32-cpp. The test passes.
    • This might be intentional: check_gemm_input explicitly allows M,N,K dims to be zero.

The USE_CBLAS call to cblas_sgemm is here.
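
A hedged sketch of the guard suggested in item 1 above; the 'P' check, the zero-dimension early-out, and the fallback_sgemm name are hypothetical illustrations, not the library's actual code:

#include <cctype>
#include <cblas.h>

// Hypothetical reference / packed-aware fallback; stubbed out here.
static void fallback_sgemm(const char *transa, const char *transb,
        const int *M, const int *N, const int *K, const float *alpha,
        const float *A, const int *lda, const float *B, const int *ldb,
        const float *beta, float *C, const int *ldc) { /* ... */ }

// Hypothetical wrapper around the USE_CBLAS branch: skip cblas_sgemm for
// packed matrices ('P') and for degenerate calls that would hand MKL an
// lda of 0, which triggers the "Parameter 9 was incorrect" message.
void sgemm_maybe_cblas(const char *transa, const char *transb,
        const int *M, const int *N, const int *K, const float *alpha,
        const float *A, const int *lda, const float *B, const int *ldb,
        const float *beta, float *C, const int *ldc) {
    const bool packed = std::toupper(*transa) == 'P' || std::toupper(*transb) == 'P';
    if (packed || *M == 0 || *N == 0 || *K == 0) {
        fallback_sgemm(transa, transb, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc);
        return;
    }
    cblas_sgemm(CblasColMajor,
            std::toupper(*transa) == 'N' ? CblasNoTrans : CblasTrans,
            std::toupper(*transb) == 'N' ? CblasNoTrans : CblasTrans,
            *M, *N, *K, *alpha, A, *lda, B, *ldb, *beta, C, *ldc);
}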

Here are examples of calls generating an ERROR message about lda

10: Test command: /local/kruus/mkl-dnn/build-jitd/examples/cpu-rnn-inference-f32-cpp
10: Test timeout computed to be: 9.99988e+06
10: Parameters:
10:  batch = 128
10:  feature size = 1024
10:  maximum source sequence length = 28
10:  maximum target sequence length = 28
10:  number of layers of the bidirectional encoder = 1
10:  number of layers of the unidirectional encoder = 7
10:  number of layers of the decoder = 8
10: cblas_sgemm(102,N,N;MNK=4096,128,1024;alpha=1.000000,A@ld=4112,B@ld=1040,beta=0.000000,C@ld=4112)
10: cblas_sgemm(102,P,N;MNK=4096,128,1024;alpha=1.000000,A@ld=0,B@ld=1040,beta=1.000000,C@ld=4112)
10:                                                                                                  
10: Intel MKL ERROR: Parameter 9 was incorrect on entry to cblas_sgemm.

(The cblas_sgemm lines are from a printf inserted before the call to cblas_sgemm.)
I also see that all cblas_sgemm calls in tests/gtests/test_rnn_forward can trigger the same Intel MKL ERROR: Parameter 9 was incorrect on entry to cblas_sgemm spam.

Without USE_CBLAS, and temporarily disabling gemm_driver with if (0 && mayiuse(sse41)), one can test how the reference impl behaves. It too sees calls with *lda==0 (quietly). The reference impl invokes ref_gemm<float>.

10: ref_gemm<float>(N,N;MNK=4096,128,1024;alpha=1.000000,A@ld=4112,B@ld=1040,beta=0.000000,C@ld=4112,bias)
10: ref_gemm<float>(P,N;MNK=4096,128,1024;alpha=1.000000,A@ld=0,B@ld=1040,beta=1.000000,C@ld=4112,bias)

I also see a segfault, after many subtests with wrong result messages:

45: [ RUN      ] TestGEMM_fp32/gemm_test.TestGEMM/21
45: cblas_sgemm(102,n,n;MNK=2000,2000,2000;alpha=1.000000,A@ld=2000,B@ld=2000,beta=0.000000,C@ld=2000)
45: /local/kruus/mkl-dnn/tests/gtests/test_gemm_common.hpp:445: Failure
45: The difference between e and 0.0 is 1.4646060466766357, which exceeds 1e-4, where
45: e evaluates to 1.4646060466766357,
45: 0.0 evaluates to 0, and
45: 1e-4 evaluates to 0.0001.
45: Row: 0 Col: 0
etc.
45: [  FAILED  ] TestGEMM_packed_fp32/gemm_test.TestGEMM/26, where GetParam() = 104-byte object <74-74 00-00 00-00 00-00 D0-07 00-00 00-00 00-00 88-13 00-00 00-00 00-00 D0-07 00-00 00-00 00-00 00-00 80-3F 00-00 00-40 D0-07 00-00 00-00 00-00 D0-07 00-00 00-00 00-00 88-13 00-00 00-00 00-00 00-00 00-00 01-01 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00> (498 ms)
45: [ RUN      ] TestGEMM_packed_fp32/gemm_test.TestGEMM/27
32/37 Test #45: test_gemm_f32 .................................***Exception: SegFault 23.80 sec
  • The ref_gemm<float> comparison run gives wrong results on a slightly different set of tests, e.g.:
    • 31 - test_inner_product_forward (Failed) (quite wrong result) with USE_CBLAS, but passed with ref_gemm
    • 45 - test_gemm_f32 (SEGFAULT) with USE_CBLAS and ref_gemm
    • 49 - test_rnn_forward (Failed) with ref_gemm but passed with USE_CBLAS
    • TestGEMM_fp32/* ok with ref_gemm, wrong results with cblas_sgemm
    • TestGEMV_fp32_CPU/* ok with ref_gemm, wrong result with cblas_sgemm
    • TestGEMM_packed_fp32/* ok with ref_gemm, wrong result with cblas_sgemm
    • TestGEMM_packed_fp32/* both wrong results (more frequently with cblas_sgemm)

Wrong results are often off by 0.8 to 3, far above the 1e-4 threshold.


Oh, using cblas with a v1.0 build (for the examples and tests) is a tiny bit difficult,
mostly because register_exe was not pushing LDFLAGS --> CMAKE_EXE_LINKER_FLAGS into the executable link step (now an EXTRA_SHARED_LIBS variable is used).
I eventually just added a new option MKLDNN_USE_CBLAS to options.cmake, added a FindCBLAS script from INRIA, and added a tiny cmake/cblas.cmake.
