
Channel First Convolution Performance on OpenCL driver #1194

Closed
artyom-beilis opened this issue Nov 21, 2021 · 8 comments
Labels
documentation A request to change/fix/improve the documentation

Comments

@artyom-beilis

artyom-beilis commented Nov 21, 2021

Summary

oneDNN has good performance in the channels-last (nhwc) format. For example,

./tests/benchdnn/benchdnn --engine=gpu:0 --mode=P --conv --stag=nhwc --cfg=f32 --dir=FWD_B mb64ic64ih56oc64oh56kh3ph1n

gives around 295.6 GFLOP/s, roughly 75% of the peak GFLOPS, on Intel(R) HD Graphics 530 (i5-6600). The same problem in the channel-first format performs poorly, giving only 33.4 GFLOP/s, which is slower by almost an order of magnitude.

I tried to use oneDNN in the dlprimitives project https://github.com/artyom-beilis/dlprimitives as an accelerator for convolutions, but I found that my not-so-Intel-optimized conv-GEMM implementation gives much higher performance: 144.7 GFLOP/s versus oneDNN's 33.4 GFLOP/s for the same setup. This makes oneDNN useless for me, since the most common setup I am targeting (PyTorch, for example) is channel first.

Version

oneDNN master 1884226

Environment

oneDNN includes hardware-specific optimizations and may behave
differently depending on the compiler and build environment. Include
the following information to help reproduce the issue:

  • CPU: Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz
  • OS version: Linux artik-pc 5.4.0-81-generic #91~18.04.1-Ubuntu SMP Fri Jul 23 13:36:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Compiler version: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
  • CMake version: 3.10.2
  • git hash: 1884226
  • OpenCL driver: compute-runtime Neo 21.34.20767

Steps to reproduce

Compare runs of

./tests/benchdnn/benchdnn --engine=gpu:0 --mode=P --conv --stag=nhwc --cfg=f32 --dir=FWD_B mb64ic64ih56oc64oh56kh3ph1n
./tests/benchdnn/benchdnn --engine=gpu:0 --mode=P --conv --stag=nchw --cfg=f32 --dir=FWD_B mb64ic64ih56oc64oh56kh3ph1n
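
For reference, a back-of-the-envelope sketch of the nominal operation count for this problem descriptor (using the usual convention of two ops per multiply-accumulate and ignoring the bias add; benchdnn's reported GFLOP/s is derived from a count like this):

    #include <cstdint>
    #include <cstdio>

    int main() {
        // mb64ic64ih56oc64oh56kh3ph1: stride 1 and pad 1, so ow == oh == 56 and kw == kh == 3.
        const std::int64_t mb = 64, ic = 64, oc = 64, oh = 56, ow = 56, kh = 3, kw = 3;
        // One multiply-accumulate per (output element, input channel, kernel tap),
        // counted as two floating-point operations.
        const double flop = 2.0 * mb * oc * oh * ow * ic * kh * kw;
        std::printf("%.2f GFLOP per forward pass\n", flop / 1e9); // ~14.80
        return 0;
    }

At the reported 295.6 GFLOP/s the nhwc run therefore takes roughly 50 ms per forward pass, while 33.4 GFLOP/s for nchw corresponds to roughly 440 ms.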

Observed behavior

Performance drops by almost an order of magnitude when using the channel-first format.

Note: channel first is the most common format for DL frameworks (PyTorch, MXNet, Caffe) and is also fully supported by TF (though not the default there).

Expected behavior

Similar performance in the nchw and nhwc formats.

@artyom-beilis added the sighting (Suspicious library behavior. Should be promoted to a bug when confirmed) label on Nov 21, 2021
@dzarukin
Contributor

Hi @artyom-beilis, thank you for reporting an issue.

I believe at some point the nhwc format was important for certain models and frameworks while the library was establishing its GPU code base for the Gen9 architecture. At this point we know that plain layouts can barely move us forward toward high performance on modern architectures, which is why in most cases you will see blocked-by-channels layouts for activations. That is where the team's priority lies when it comes to performance optimizations.

Despite the fact that plain layouts are the stock, native layouts in frameworks, most of them have a oneDNN integration which utilizes blocked layouts from the beginning of model execution to the very end. This holds for popular models with regular operations (primitives) supported by oneDNN.
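
For illustration, a minimal sketch of this integration pattern with the oneDNN v2.x C++ API (the separate op descriptor was removed in v3.x): the framework keeps plain nchw/oihw tensors, the convolution is created with format_tag::any so the implementation can pick its preferred (typically blocked) layout, and reorders are inserted only at the boundary. The sizes match the benchdnn descriptor above.

    #include <oneapi/dnnl/dnnl.hpp>
    using namespace dnnl;

    int main() {
        engine eng(engine::kind::gpu, 0);

        const memory::dims src_dims = {64, 64, 56, 56}; // mb64 ic64 ih56 iw56
        const memory::dims wei_dims = {64, 64, 3, 3};   // oc64 ic64 kh3 kw3
        const memory::dims dst_dims = {64, 64, 56, 56};
        const memory::dims bia_dims = {64};

        // Framework-side tensors stay in plain channel-first layouts.
        memory::desc user_src_md(src_dims, memory::data_type::f32, memory::format_tag::nchw);
        memory::desc user_wei_md(wei_dims, memory::data_type::f32, memory::format_tag::oihw);

        // format_tag::any lets the implementation choose its preferred (typically blocked) layout.
        memory::desc conv_src_md(src_dims, memory::data_type::f32, memory::format_tag::any);
        memory::desc conv_wei_md(wei_dims, memory::data_type::f32, memory::format_tag::any);
        memory::desc conv_dst_md(dst_dims, memory::data_type::f32, memory::format_tag::any);
        memory::desc conv_bia_md(bia_dims, memory::data_type::f32, memory::format_tag::x);

        convolution_forward::desc conv_d(prop_kind::forward_inference,
                algorithm::convolution_direct, conv_src_md, conv_wei_md,
                conv_bia_md, conv_dst_md, /*strides*/ {1, 1},
                /*padding_l*/ {1, 1}, /*padding_r*/ {1, 1});
        convolution_forward::primitive_desc conv_pd(conv_d, eng);

        if (conv_pd.src_desc() != user_src_md || conv_pd.weights_desc() != user_wei_md) {
            // A reorder primitive converts the user's nchw/oihw tensors into the layout
            // chosen by the convolution, done once; downstream primitives then consume
            // the blocked tensors directly.
        }
        return 0;
    }

When a framework integrates oneDNN end-to-end this way, the reorders happen only at the graph boundaries, which is why the plain-layout convolution path rarely matters there.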

I can't see the team changing this priority or spending much (or any) effort on optimizing plain layouts for activations unless there are serious changes in the industry, and at this point we don't anticipate any.

If you find oneDNN useful in your project and plain activations are important for your case, I can see two options:

  1. Contribute to oneDNN code base, optimizing cases of your importance, or
  2. Explain a bit more what your project is and what the impact of low performance of oneDNN convolution(s) is so that the team may consider investing in this direction.

Thank you.

@artyom-beilis
Author

most of them have a oneDNN integration which utilizes blocked layouts from the beginning of model execution to the very end. This holds for popular models with regular operations (primitives) supported by oneDNN.

If so, is it possible today to use PyTorch and MXNet with oneDNN on a GPU such as the HD 530 or 630, or on Intel's latest discrete GPUs?

  • If the answer is yes, I'd like to look into it, since it is very interesting.
  • If the answer is no, or only at some point in the future, that is a different story.

At this point we know that plain layouts can barely move us forward toward high performance on modern architectures

I clearly understand this, but plain layouts are the industry standard and are common for the major high-performance GPU vendors like NVIDIA and AMD. So even if a plain layout isn't as efficient as a blocked one, supporting it well is highly useful.

  1. Contribute to oneDNN code base, optimising cases of your importance, or
  2. Explain a bit more what your project is and what the impact of low performance of oneDNN convolution(s) is so that the team may consider investing in this direction.

The goal of my dlprimitives project is to create a platform-independent GPU deep-learning toolkit and to integrate it into existing frameworks using the industry-standard GPU compute API: OpenCL.

I have already done the integration (to some extent) with PyTorch, to the point where you can train all torchvision classification networks and more. Of course, the major benefit is being truly open-source and platform-independent, without depending on proprietary or highly platform-specific technology like CUDA/cuDNN or ROCm/MIOpen.

I optimised the code for AMD and NVIDIA GPUs, and to some extent for Intel. While the most performance-demanding parts, like convolutions, aren't as good as NVIDIA's cuDNN or AMD's MIOpen, they are more than good enough to be highly useful.

The hardest and most critical part is optimising the GEMM, Conv-GEMM, and Conv-Winograd operations: the operations that are strongly compute-bound and require deep knowledge of and experience with a specific GPU architecture.
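
For readers unfamiliar with the Conv-GEMM approach mentioned above, here is a minimal, illustrative im2col sketch in plain C++ (not dlprimitives' actual OpenCL kernels): it unfolds one NCHW image into a column matrix so that the convolution becomes a single (OC x IC*KH*KW) by (IC*KH*KW x OH*OW) GEMM.

    // Unfold one NCHW image into col, laid out as (C*KH*KW) x (OH*OW), so that
    // dst = weights(OC x C*KH*KW) * col gives the convolution output for this image.
    // The batch dimension is handled by looping over images or using a batched GEMM.
    void im2col(const float* src, float* col,
                int C, int H, int W, int KH, int KW,
                int PH, int PW, int SH, int SW, int OH, int OW) {
        for (int c = 0; c < C; ++c)
            for (int kh = 0; kh < KH; ++kh)
                for (int kw = 0; kw < KW; ++kw) {
                    const int row = (c * KH + kh) * KW + kw;
                    for (int oh = 0; oh < OH; ++oh)
                        for (int ow = 0; ow < OW; ++ow) {
                            // Map the output position back to the padded input position.
                            const int ih = oh * SH - PH + kh;
                            const int iw = ow * SW - PW + kw;
                            col[row * OH * OW + oh * OW + ow] =
                                (ih >= 0 && ih < H && iw >= 0 && iw < W)
                                    ? src[(c * H + ih) * W + iw] : 0.f;
                        }
                }
    }

After this transform essentially all of the arithmetic sits in the GEMM, which is why GEMM quality on a given GPU dominates convolution performance in this scheme.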

I obviously can't use cuDNN, and MIOpen largely limits the platform and GPU choice.

However, oneDNN is an excellent tool that is easy to integrate and comes with minimal dependencies beyond the Intel OpenCL driver, so it is a great candidate for improving performance on Intel GPUs. But the result was rather disappointing.

I can contribute my kernels/code to oneDNN, but they aren't where I expect them to be in terms of performance (utilising 36% of peak FLOPS instead of the 75% that oneDNN achieves in the NHWC case).

However, if oneDNN already provides OpenCL integration with existing frameworks like TF/PT/MXNet, it may be possible to go the other way around and make oneDNN run on AMD and NVIDIA GPUs over OpenCL in a platform-independent manner, by integrating dlprimitives or its kernels into oneDNN.

This is something I'll be more than glad to explore, under the assumption that:

  • non-Intel-specific OpenCL implementations are welcome in oneDNN, and
  • it indeed provides a useful integration with existing frameworks. If oneDNN is already integrated, I'd prefer not to waste my time re-implementing the "glue" between kernels and frameworks.

Thanks,

Artyom Beilis

@dzarukin
Contributor

dzarukin commented Dec 1, 2021

Hi @artyom-beilis,

If so, is it possible today to use PyTorch and MXNet with oneDNN on a GPU such as the HD 530 or 630, or on Intel's latest discrete GPUs?

Intel GPU support is not upstreamed to public releases (and may never be), but there are extensions that might help you check the oneDNN integration there: PyTorch and TensorFlow. MXNet doesn't have any special repo with Intel GPU support so far.

I clearly understand this, but plain layouts are the industry standard and are common for the major high-performance GPU vendors like NVIDIA and AMD.

I think I see what you mean, but I'm not sure I completely agree with the statement. For example, MXNet doesn't have NCHW-supported code for GPU (see the Layout argument at the very bottom). Thanks @TaoLv. I also believe this is the case for other frameworks, since engineers tend to agree that memory utilization with the nhwc format is much better than with nchw for GEMM-like operations.

non-Intel-specific OpenCL implementations are welcome in oneDNN

oneDNN welcomes all contributions as long as they follow contribution guidelines.

oneDNN may benefit from pure OpenCL implementations, though it would be hard for them to compete with OpenCL implementations specialized for particular platforms, since performance may suffer when performance-critical platform features aren't enabled. It may also turn out that enabling oneDNN on AMD is not a straightforward activity. The NVIDIA integration is done through the DPC++ programming model, not through the OpenCL one. This can also be an area for exploration.

I like the spirit and ambitions you have with your project. Wish you luck with it. And if you decide to contribute to oneDNN, we would be glad to review changes. Thank you.

@artyom-beilis
Author

Why have you closed it? Isn't it a valid issue?

Maybe it isn't urgent, but it is more than valid. At the very least, it should be stated very clearly in the docs that channels first isn't supported at a useful level of performance.

@artyom-beilis
Author

I wasted quite some time learning the API and trying to use it before I realized that oneDNN is worthless, in terms of performance, for the most common memory layout.

@vpirogov
Member

@artyom-beilis, this is a valid performance gap, though channel-first format optimizations are not a priority for the core development team, as these formats do not work well from a performance perspective, particularly on newer hardware generations.

I agree that the documentation should be clear about that.

@vpirogov reopened this on Dec 17, 2021
@vpirogov added the documentation (A request to change/fix/improve the documentation) label and removed the sighting (Suspicious library behavior. Should be promoted to a bug when confirmed) label on Dec 17, 2021
@vpirogov closed this as completed on Feb 1, 2023
@artyom-beilis
Author

Is it closed due to the documentation update, or because channel-first performance was fixed?

@vpirogov
Member

vpirogov commented Feb 2, 2023

The issue is closed with the documentation update. No plans to optimize for channel-first case on GPUs.
