Channel First Convolution Performance on OpenCL driver #1194
Also, oneDNN has good performance in the channels-last format: for example, it gives around 295.582 GFlops, ~75% of the theoretical GFlops.
I tried to use oneDNN in the dlprimitives project https://github.com/artyom-beilis/dlprimitives as an accelerator for convolutions, but I found that my not-so-Intel-optimized conv-GEMM implementation gives much higher performance: 144.7 GFlops versus oneDNN's 33.4 GFlops for the same setup. This makes oneDNN useless for me, since the most common setup I'm looking at (PyTorch, for example) is channel-first.
oneDNN master 1884226
oneDNN includes hardware-specific optimizations and may behave differently depending on the compiler and build environment.
Steps to reproduce
Compare runs of the same convolution benchmark in the nchw (channel-first) and nhwc (channels-last) formats; a minimal C++ sketch of such a comparison is given below.
Observed behavior: performance drops by an order of magnitude when using the channel-first format.
Note: channel-first is the most common format for DL frameworks (PyTorch, MXNet, Caffe) and is fully supported by TF (although it is not the default there).
Expected behavior: similar performance in the nchw and nhwc formats.
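Since the exact reproduction commands are truncated above, here is a minimal sketch of the kind of comparison described. It times a single f32 forward convolution through the oneDNN C++ API (current v3.x-style primitive descriptors, which differ slightly from the v2.x API that was current at the time of this report) on the GPU engine, once with nchw activations and once with nhwc. The layer shape and iteration count are illustrative, not taken from the original report.

```cpp
#include <chrono>
#include <iostream>
#include <unordered_map>

#include "dnnl.hpp"

using namespace dnnl;

// Time one f32 forward convolution with the given activation layout and
// report GFlops. The shape is an illustrative ResNet-like 3x3 convolution,
// not the exact configuration from the report; data is left uninitialized
// since this is only a timing sketch.
double bench_conv(engine &eng, stream &s, memory::format_tag act_tag) {
    const memory::dim N = 16, IC = 64, OC = 64, H = 56, W = 56, K = 3;

    memory::desc src_md({N, IC, H, W}, memory::data_type::f32, act_tag);
    memory::desc wei_md({OC, IC, K, K}, memory::data_type::f32,
            memory::format_tag::any);
    memory::desc dst_md({N, OC, H, W}, memory::data_type::f32, act_tag);

    convolution_forward::primitive_desc pd(eng, prop_kind::forward_inference,
            algorithm::convolution_direct, src_md, wei_md, dst_md,
            /*strides=*/{1, 1}, /*padding_l=*/{1, 1}, /*padding_r=*/{1, 1});

    memory src(pd.src_desc(), eng), wei(pd.weights_desc(), eng),
            dst(pd.dst_desc(), eng);
    convolution_forward conv(pd);
    std::unordered_map<int, memory> args{{DNNL_ARG_SRC, src},
            {DNNL_ARG_WEIGHTS, wei}, {DNNL_ARG_DST, dst}};

    conv.execute(s, args); // warm-up
    s.wait();

    const int iters = 50;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) conv.execute(s, args);
    s.wait();
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count() / iters;
    // 3x3, stride 1, same padding: output spatial size equals input size.
    double flops = 2.0 * N * OC * H * W * IC * K * K;
    return flops / sec / 1e9;
}

int main() {
    engine eng(engine::kind::gpu, 0);
    stream s(eng);
    std::cout << "nchw: " << bench_conv(eng, s, memory::format_tag::nchw)
              << " GFlops\n";
    std::cout << "nhwc: " << bench_conv(eng, s, memory::format_tag::nhwc)
              << " GFlops\n";
    return 0;
}
```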
Hi @artyom-beilis, thank you for reporting an issue.
I believe at some point
Despite the fact that plain layouts are stock and native in frameworks, most of them have oneDNN integration which would utilize blocked layouts from the beginning of model execution to the very end. This would be true for popular models with regular operations (primitives) supported by oneDNN.
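To illustrate that integration pattern, here is a minimal sketch (assuming the current oneDNN C++ API; the helper name is hypothetical): the framework describes primitive memories with format_tag::any, lets oneDNN pick a blocked layout, and reorders plain user data only where the layouts actually differ, i.e. at the boundaries of the oneDNN-covered region.

```cpp
#include "dnnl.hpp"

using namespace dnnl;

// Hypothetical helper: move user data (e.g. plain nchw) into whatever layout a
// primitive prefers, inserting a reorder only when the layouts actually differ.
// Inside the model the blocked layout is then passed from primitive to
// primitive, so reorders happen only at the boundaries.
memory to_primitive_layout(memory &user_mem, const memory::desc &prim_md,
        const engine &eng, stream &s) {
    if (user_mem.get_desc() == prim_md) return user_mem; // layouts already match
    memory prim_mem(prim_md, eng);
    reorder(user_mem, prim_mem).execute(s, user_mem, prim_mem);
    return prim_mem;
}
```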
I can't see the team changing this priority or spending much (or any) effort on optimizing plain layouts for activations unless there are serious changes in the industry. At this point we don't anticipate such changes.
If you find oneDNN useful in your project and plain activations are important for your case, I can see two options:
If so, is it possible today to use PyTorch and MXNet with oneDNN on a GPU such as the HD 530 or 630, or Intel's latest discrete GPUs?
I clearly understand this, but plain layouts are an industry standard and are common for the major high-performance GPU vendors like NVIDIA and AMD. So even if it isn't as efficient as blocked layouts, it is highly useful.
My dlprimitives project's goal is to create a platform-independent GPU deep learning toolkit and integrate it into existing frameworks using the industry-standard GPU compute API: OpenCL.
I already did an integration (to some extent) with PyTorch, up to the level where you can train all torchvision classification networks here and more. Of course, the major benefit is being truly open-source and platform-independent, without a dependency on proprietary or highly platform-specific tech like CUDA/cuDNN or ROCm/MIOpen.
I optimised the code for AMD and NVIDIA GPUs and to some extent for Intel. Although the most performance-demanding parts like convolutions aren't as good as NVIDIA's cuDNN or AMD's MIOpen, they are more than good enough to be highly useful.
The hardest and most critical part is optimising the GEMM, Conv-GEMM, and Conv-Winograd operations: the operations that are strongly compute-bound and require deep knowledge of, and experience with, a specific GPU architecture.
I obviously can't use cuDNN, and MIOpen largely limits the platform and GPU choice.
However, oneDNN is an excellent tool that is easy to integrate and comes with minimal dependencies (the Intel OpenCL driver), so it is a great candidate to improve performance on Intel GPUs. But the result was rather disappointing.
I can contribute my kernels/code to oneDNN, but they aren't what I expect them to be in terms of performance (utilising 36% of the flops instead of the 75% that oneDNN gives in the NHWC case).
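As a back-of-the-envelope check (assuming both percentages refer to the same theoretical peak of the device), the absolute and relative numbers quoted above are consistent:

$$\frac{144.7\ \text{GFlops}}{0.36} \approx 402\ \text{GFlops}, \qquad \frac{295.582\ \text{GFlops}}{0.75} \approx 394\ \text{GFlops},$$

i.e. both imply a theoretical peak of roughly 400 GFlops for this GPU.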
However, if oneDNN already provides OpenCL integration for existing frameworks like TF/PT/MXNet, it may be possible to go the other way around and make sure that oneDNN runs on AMD and NVIDIA GPUs using OpenCL in a platform-independent manner, by integrating dlprimitives or its kernels into oneDNN.
This is something I'll be more than glad to explore, under the assumption that:
Intel GPU support is not upstreamed to public releases (if it ever will be), but there are some extensions that might help you to check the oneDNN integration there: PyTorch and TensorFlow. MXNet doesn't have any special repo with Intel GPU support so far.
I think I see what you mean, but I'm not sure I completely agree with the statement. E.g. MXNet doesn't have NCHW-supported code paths for GPU (see the Layout argument at the very bottom). Thanks @TaoLv. I also believe this is the case for other frameworks, since engineers tend to agree that memory utilization with the nhwc format is way better than with nchw for GEMM-like operations.
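One way to see that point (an illustrative special case, not from the original thread): for a 1×1 convolution, nhwc activations of shape (N·H·W, IC) feed a GEMM directly, with the reduction dimension contiguous in memory,

$$Y_{(n,h,w),\,oc} = \sum_{ic} X_{(n,h,w),\,ic}\, W_{ic,\,oc},$$

whereas nchw needs strided access or an explicit repacking (im2col-style) to reach the same shape.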
oneDNN welcomes all contributions as long as they follow contribution guidelines.
oneDNN may benefit from pure OpenCL implementations, though it would be hard to compete with versions specialized for specific platforms, since performance may suffer without enabling performance-critical features. It may also turn out that enabling oneDNN on AMD is not a straightforward activity. The NVIDIA integration is done through the DPC++ programming model, not through the OpenCL one. This can also be an area for exploration.
I like the spirit and ambitions you have with your project. Wish you luck with it. And if you decide to contribute to oneDNN, we would be glad to review changes. Thank you.
@artyom-beilis, this is a valid performance gap. However, channel-first format optimizations are not a priority for the core development team, as these layouts do not perform well, in particular on newer hardware generations.
I agree that documentation should be clear about that.