Skip to content

Power8 VSX optimizations for core module #9763

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 10, 2017

Conversation

seiko2plus
Copy link
Contributor

@seiko2plus seiko2plus commented Oct 2, 2017

Power8 VSX optimizations for core module

Issue:
PowerPC Power8 VSX SIMD optimizations #7207

Architecture:
Power8 PPC64LE (64 bit PowerPC Little Endian mode) and above

What about Big-endian support?
Big-endian is basically dead on modern Power and I'm not sure about adding a support for it , maybe I should ! IDK.

Tested Compilers:
GCC (4.9.2, 5.4.1, 6.4.0, 7.2.0)
CLANG(4.0.1-6, 6.0.0-svn310776-1)

Missings:
Why there's no support for Half-precision?
there's no VSX instructions for Half-precision convert in Power8 ISA but Power9 has it and I'm planning to add it as feature.

@seiko2plus seiko2plus changed the title Add vsx core Power8 VSX optimizations for core module Oct 3, 2017
@barkovv
Copy link

barkovv commented Oct 3, 2017

@seiko2plus
It's funny but I've worked to this issue to and my code fails to compile with GCC or XLC (clang-4 is fine). Should we combine code to get compiler independent code?

https://github.com/barkovv/opencv/tree/vsx
My code passes "opencv_test_core" tests. Compiled with clang-4.

Anyway I would like to list the clang-specific problems here (if you have plans to adapt your code for clang):

  • clang-5 has regression with vec_xl (load from unaligned adress). This intrin maps to vec_ld for some reason. clang-4 is fine here.
  • inline asm with VSX arguments ("wa", "wd", "wf" arguments) is buggy (value of result register isn't stored in __vector ) (this bug exists either in clang-4 or clang-5)
  • all uint64 related intrins from <altivec.h> (I mean vector unsigned long long and vector signed long long) want "unsigned long long" and "signed long long" arguments while uint64 maps to "unsigned long". There is no silent conversion from uint64 to "unsigned long" so all calls are ambious.

Have you got floating point comparation errors in hal_intrin.float32x4 & hal_intrin.float64x2 tests? ( 1.0 != 0.99998....)

Nice work

@mshabunin
Copy link
Contributor

@seiko2plus , can you provide an instruction describing how to build OpenCV for powerpc in Ubuntu x86_64? It would be great to have a cmake toolchain file similar to those we have for ARM (https://github.com/opencv/opencv/tree/master/platforms/linux).

@seiko2plus
Copy link
Contributor Author

seiko2plus commented Oct 6, 2017

Sorry for the delayed response,
@barkovv During working on supports clang , I found these issues and many more
I almost covered all these issues and solutions in vsx_utils.hpp

Have you got floating point comparation errors in hal_intrin.float32x4 & hal_intrin.float64x2 tests? ( 1.0 != 0.99998....)

No but I think this error caused by vec_rsqrte, you should use vec_rsqrt instead for accuracy

It's funny but I've worked to this issue to and my code fails to compile with GCC or XLC (clang-4 is fine). Should we combine code to get compiler independent code?

I don't know what I'm supposed to say but I didn't mean to disappoint you.
I looked at your code, you did well too but it would be better to look at the vsx_utils.hpp and check how I worked around clang issues.

@mshabunin

can you provide an instruction describing how to build OpenCV for powerpc in Ubuntu x86_64?
It would be great to have a cmake toolchain file similar to those we have for ARM (https://github.com/opencv/opencv/tree/master/platforms/linux).

I'm using power8 vm provided by osuosl, I didn't try to build opencv for power8 in GNU/Linux x86_64 before, but I'm willing to give it a try and maybe I would be able to make cmake toolchain for powerpc

@seiko2plus seiko2plus force-pushed the addVsxCore branch 2 times, most recently from 3958e32 to 772fbb1 Compare October 9, 2017 10:24
@vpisarev
Copy link
Contributor

vpisarev commented Oct 9, 2017

@seiko2plus, thank you very much, good support for another architecture is a very valuable addition to the library! I have one (but relatively big) request regarding this patch. As I see, besides providing another implementation of universal intrinsics, there is also a bunch of other functions, implementing directly in VSX intrinsics. I wonder if you could convert the corresponding loops to universal intrinsics (and then in most cases SSE/NEON branches can be deleted)? The reason is simple. VSX branches will be much less tested than SSE/NEON, and we ourselves could not test or even build this code on our test machines. If you convert those branches to universal intrinsics then this code is much better tested in real life.

@seiko2plus
Copy link
Contributor Author

@vpisarev I agree with you, I removed VSX raw from the branch and going to replace it later with universal intrinsics in separate branches but that may limit VSX abilities and affect the desirable performance, so not everything is going to implement by universal intrinsics.

Recently I implemented a toolchain for powerpc using "IBM Advance Toolchain" because of these reasons that you already mentioned and I'm going to release it once I finish testing.

@vpisarev
Copy link
Contributor

great! thank you once again! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants