Can we force MKLDNN to use cblas always instead of JIT kernels generated at runtime #415
Intel MKL-DNN doesn't give any guarantees about which GEMM is used if both the CBLAS and JIT GEMMs are available. Is there any particular reason why you want to avoid using the jitted GEMM?
Hi, I am looking for some way to force MKL-DNN not to use the JIT-based GEMM convolution.
To disable jit-based convolutions, use the trick described here. For convolutions it is safe to assume that they would use the CBLAS sgemm for now. You can verify by looking at the MKLDNN_VERBOSE output: if a jitted-gemm-based convolution is used, you will see it there. To enable a VTune report, see this page.
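To make the verification step concrete, here is a hedged sketch of checking the verbose output from the shell. `./my_app` is a placeholder for your application, and the exact implementation tags vary between MKL-DNN versions, so treat the grep patterns as assumptions rather than the library's documented format.

```shell
# Sketch: enable MKL-DNN verbose tracing and capture it (./my_app is a placeholder).
MKLDNN_VERBOSE=1 ./my_app > verbose.log 2>&1 || true

# Jitted convolution implementations carry a "jit" tag in the implementation
# field; the CBLAS/GEMM fallback carries a "gemm" tag (tags vary by version).
grep -c 'convolution.*jit'  verbose.log || true
grep -c 'convolution.*gemm' verbose.log || true
```

If the second grep finds matches and the first does not, the GEMM-based path is being taken for all convolutions.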
Thanks @emfomenk. I tried this with simplenet and it's working fine.
Hi @avinashcpandey, I don't think TF has a standard way of doing so. Some time back I worked around that by removing the Intel MKL-DNN sources from TF (hence leaving the U-symbols) and then running workloads with … There is also a way to tell bazel where to download external dependencies, and then (once downloaded for the first time) you can patch them and rebuild TF. That should be relatively fast. I think the option is …
Hi @avinashcpandey, you can use your own copy of the MKL-DNN sources with TensorFlow by modifying the … Hope this helps.
I did some benchmarking between JIT-based and non-JIT-based (GEMM-based) convolution. For the GEMM-based path I used MKL BLAS calls, and I see a large performance difference between the two. I ran the ResNet-50 model of TensorFlow: `python tf_cnn_benchmarks.py --device=cpu --model=resnet50 --data_format=NHWC --batch_size=256 --num_batches=1 --num_inter_threads=1 --num_intra_threads=16 --mkl=True --nodistortions --forward_only=True`. I know the JIT kernel will be faster, but I did not expect such a big difference even with MKL BLAS.
I would guess that TensorFlow uses its own implementation when you force convolutions to BLAS. Might be just parallelism over the batch size. Did you look at what MKLDNN_VERBOSE prints out during these runs?
I built TF with MKL BLAS, so the BLAS calls go into MKL. I measured the ResNet-50 convolution time with and without JIT'ed kernels. The question is: why is performance with MKL BLAS that slow?
@avinashcpandey GEMM-based convolution requires a so-called im2col transformation of the src data so that the convolution reduces to a GEMM operation. That adds some overhead, so the JIT kernel is indeed expected to be faster. In general, the GEMM-based implementation is intended for convolution shapes not yet supported by JIT (such as grouped convolutions with a number of channels per group that is not a multiple of the SIMD width).
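To illustrate what the im2col transformation does, here is a small NumPy sketch (no padding, stride 1; function names are mine, not MKL-DNN's). The extra buffer built by `im2col` is exactly the overhead mentioned above, after which the whole convolution is a single matrix multiply.

```python
import numpy as np

def im2col(src, kh, kw):
    """Unfold a (C, H, W) tensor into a (C*kh*kw, OH*OW) matrix
    (no padding, stride 1) so convolution becomes one GEMM."""
    c, h, w = src.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, oh * ow), dtype=src.dtype)
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                cols[row] = src[ci, i:i + oh, j:j + ow].reshape(-1)
                row += 1
    return cols

def conv_via_gemm(src, weights):
    """Convolution reduced to a single matrix multiply (the SGEMM call)."""
    oc, c, kh, kw = weights.shape
    cols = im2col(src, kh, kw)              # extra copy: the im2col overhead
    out = weights.reshape(oc, -1) @ cols    # this product is the GEMM
    oh = src.shape[1] - kh + 1
    ow = src.shape[2] - kw + 1
    return out.reshape(oc, oh, ow)
```

The memory traffic of building `cols` (each input pixel is duplicated up to kh*kw times) is why a direct JIT kernel that reads `src` in place can win.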
Thanks @kwiersch, I get it now. I see the GEMM calls happen because of the convolution sizes, which remain the same (33) in both cases. I am trying to understand why, as the batch size increases, the number of JIT-based convolution calls stays the same but the number of GEMM-based calls increases. Can you point me to some paper that explains this?
@avinashcpandey can you explain how you are counting calls to JIT-based convolution and GEMM-based convolution? Is it appearing in the output using MKLDNN_VERBOSE (if so, please share it!), or are you adding some hook to count the calls? One possible explanation for the "extra" GEMM calls is the parallelization strategy for large minibatches: as you can see at https://github.com/intel/mkl-dnn/blob/master/src/cpu/gemm_convolution.cpp#L63, work is split by minibatch, so there will be one SGEMM call per minibatch element.
With MKL_VERBOSE and MKL_DUMP I am getting the number of JIT kernels for a particular run. At the same time I have a print statement (printing a string with the matrix shape) in the CBLAS interface for GEMM, which is linked to MKL, and at the end I grep on that. This way I see the difference in the number of GEMM calls with and without the JIT interface.
Okay. In that case, I don't think it is correct to say that the number of JIT calls stays the same while the number of GEMM calls increases. Instead, think of it as the number of JIT-based or GEMM-based convolutional primitives remaining constant (55, as you said above). Within each GEMM-based convolutional primitive, you are seeing the number of calls to SGEMM increase from 1 to 246 as the batch size is changed from 1 to 246. As I suspected, this is due to the way the GEMM-based implementation chooses its threading strategy: for small batch sizes, threading is done within the SGEMM routine (i.e. 1 call to a parallel SGEMM), while for large batch sizes, threading is done outside of the SGEMM routine (i.e. many calls to a sequential SGEMM). Hope that makes sense.
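The threading strategy described above can be sketched in a few lines. This is a toy model of the observed behavior, not MKL-DNN's actual heuristic: the function name, the threshold, and the exact rule are all assumptions.

```python
def sgemm_calls(batch_size, num_threads):
    """Toy model of the GEMM-convolution threading strategy described above
    (threshold and rule are assumptions, not MKL-DNN's real code).

    Small batch: thread inside one parallel SGEMM call.
    Large batch: split work by minibatch, one sequential SGEMM per sample.
    """
    if batch_size < num_threads:
        return 1            # a single parallel SGEMM over the whole batch
    return batch_size       # batch_size sequential SGEMM calls
```

Under this model the call counter goes from 1 at batch size 1 to N at a large batch size N, matching the counts reported in this thread.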
Ok. I got you.
So, algo:convolution has three possible values: convolution_auto, convolution_direct, and convolution_wino. The first option chooses for you, the second is "direct" as in "do the calculations directly, without any complex changes to the order of operations", and the third is Winograd, which you can read about here. JIT-based convolution uses just-in-time compilation to generate a computational kernel during initialization. For more JIT implementation details, see the source code: harness / driver and kernel generator. Not exactly sure what you mean by mat mul...
Thanks @kwiersch.
Okay, I think I understand. From that perspective, the "JIT" implementations are not really mat mul (at least not in the way that the "GEMM" implementations are). Instead of transforming the input data for a call to SGEMM, the JIT kernels contained in src/cpu/jit_{isa}_conv* basically apply the convolutional filter to the input image one row of pixels at a time (with some cache blocking and other techniques to boost performance).
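To make the contrast with the GEMM path concrete, here is a toy direct convolution in NumPy. It only illustrates the data access pattern (no im2col buffer, no SGEMM call, filter taps applied across the output plane); the real JIT kernels are generated assembly with vectorization and cache blocking, which this sketch does not model.

```python
import numpy as np

def conv_direct(src, weights):
    """Toy direct convolution (no padding, stride 1): the filter is applied
    to the input in place, one filter tap at a time, with no im2col buffer
    and no call to SGEMM. Illustrative only -- not the actual JIT kernels."""
    oc, c, kh, kw = weights.shape
    _, h, w = src.shape
    oh, ow = h - kh + 1, w - kw + 1
    dst = np.zeros((oc, oh, ow), dtype=src.dtype)
    for o in range(oc):
        for i in range(kh):              # one filter row at a time
            for j in range(kw):
                # accumulate tap (i, j) over all input channels at once
                dst[o] += np.tensordot(weights[o, :, i, j],
                                       src[:, i:i + oh, j:j + ow], axes=1)
    return dst
```

Note that no intermediate matrix is ever materialized: each input element is read directly from `src`, which is the key difference from the im2col+GEMM approach.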
Ok, this is what I wanted to know. Thanks!
The paper linked below explains the algorithms and optimizations in detail: Anatomy Of High-Performance Deep Learning
Thanks @vpirogov |
USE_CBLAS (or ref_gemm) + v1.0 --> wrong results
The USE_CBLAS build (v1.0) calls into cblas_sgemm and produces wrong results. Here are examples of calls generating an ERROR message:
(The cblas_sgemm lines are from a printf inserted before each call.) Without USE_CBLAS, and temporarily disabling the reference gemm, the wrong results disappear.
I also see a segfault after many subtests with wrong-result messages.
The wrong results are often off by 0.8 to 3, way above the 1e-4 threshold. Oh, and using CBLAS with a v1.0 build (for examples and tests) is a tiny bit difficult.
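To show why errors of 0.8 to 3 are unambiguous failures rather than rounding noise, here is a small sketch of a relative-error check against the 1e-4 threshold mentioned above. The function and constant names are illustrative, not the test harness's actual code.

```python
def rel_err(got, expected):
    """Relative error with a floor on the denominator (a common test-harness
    pattern; illustration only, not the actual comparison code)."""
    return abs(got - expected) / max(abs(expected), 1.0)

THRESHOLD = 1e-4  # the threshold mentioned in the report

# A result off by 0.8 fails by roughly four orders of magnitude:
assert rel_err(1.8, 1.0) > THRESHOLD
# while ordinary float32 rounding noise passes comfortably:
assert rel_err(1.00005, 1.0) < THRESHOLD
```

Errors of this magnitude usually indicate a logic bug (wrong data, wrong layout, uninitialized memory), not accumulated floating-point error.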
Can we force MKLDNN to use cblas functions always instead of JIT kernels generated at runtime?
I am using the external library OpenBLAS and want to use it for all GEMM-related work.