OE-27 "Wide Universal Intrinsics" discussion #11022

Open
vpisarev opened this issue Mar 6, 2018 · 6 comments

@vpisarev
Contributor

vpisarev commented Mar 6, 2018

This is the feature request for evolution proposal OE-27.

@vpisarev vpisarev added this to the 4.0 milestone Mar 6, 2018
@vpisarev
Contributor Author

vpisarev commented Mar 7, 2018

@seiko2plus, I've added a link to your #10708.

@seiko2plus
Contributor

@vpisarev, #10708 is going to map all universal intrinsics to AVX2. I also made some changes to make it compatible with OE-27.

@vpisarev vpisarev mentioned this issue Jul 12, 2018
6 tasks
vpisarev added a commit that referenced this issue Jul 16, 2018
* core:OE-27 prepare universal intrinsics to expand (#11022)

* core: Add universal intrinsics for AVX2

* updated implementation of wide univ. intrinsics; converted several OpenCV HAL functions: sqrt, invsqrt, magnitude, phase, exp to the wide universal intrinsics.

* converted log to universal intrinsics; cleaned up the code a bit; added v_lut_deinterleave intrinsics.

* core: Add universal intrinsics for AVX2

* fixed multiple compile errors

* fixed many more compile errors and hopefully some test failures

* fixed some more compile errors

* temporarily disabled IPP to debug exp & log; hopefully fixed Doxygen complaints

* fixed some more compile errors

* fixed v_store(short*, v_float16&) signatures

* trying to fix the test failures on Linux

* fixed some issues found by alalek

* restored IPP optimization after the patch with AVX wide intrinsics has been properly tested

@pemmanuelviel
Contributor

@vpisarev I plan to port my SSEx assembly and AVX-2 intrinsic implementations of some FLANN distances to the OpenCV repo.
Are the wide universal intrinsics fully functional, in particular the ones for CV_SIMD256? Is there any documentation other than the one for the "not wide" universal intrinsics? Thanks

@alalek
Member

alalek commented Jul 3, 2020

@pemmanuelviel There is a page about universal intrinsics in the OpenCV documentation.

@pemmanuelviel
Contributor

@alalek Thank you for the link. That is the documentation for the "not wide" universal intrinsics I was mentioning.
Since it gives no details on the 256-bit AVX registers, I would like to know whether there is any documentation on the "wide" universal intrinsics, and whether they are fully functional.
I would prefer to port directly to the 256-bit wide universal intrinsics equivalent to AVX-2 rather than to the equivalents of the SSEx intrinsics. Going from SSEx intrinsics to AVX-x intrinsics on Intel architectures is not only a matter of register size. Although both SSE-x and AVX-x intrinsics are written in the form
C = A Op B,
the assembly instructions for SSE-x mostly work with only two registers and take the form
A = A Op B
A single SSE-x intrinsic may therefore map to a sequence of several assembly instructions, which might be why the performance difference between SIMD assembly and intrinsic code is quite noticeable with SSE-x.

@terfendail
Contributor

Wide universal intrinsics are implemented for the AVX2 and AVX512 architectures and are already used in the core and imgproc modules.

Unfortunately, there is no separate documentation for wide universal intrinsics. However, they were implemented in accordance with OE-27.

There are just a few changes to the universal intrinsics idea:

  • vector type names no longer contain the length (e.g. v_uint8 instead of v_uint8x16)
  • an intrinsic's name should start with vx_ if the vector size cannot be deduced from its input values (e.g. vx_load instead of v_load, while v_fma retains the same name)
  • address evaluation, loop steps, etc. MUST use type::nlanes (e.g. v_uint8::nlanes) instead of an explicit vector length value

WUI always use the widest vector size available for the selected instruction set (i.e. if AVX512 support is enabled, the vector length for WUI will be 512 bits).
