Runtime nlanes for SVE enablement #20562
This patch aims to enable the Scalable Vector Extension (SVE) in OpenCV, which permits code to use a vector length defined at runtime. The current Neon implementation sets the number of lanes per vector register at compile time, since the Neon vector width is fixed. To enable SVE, the number of lanes per vector is now determined at runtime. The runtime value is used where possible; where a compile-time constant is required, a maximum number of lanes (max_nlanes) is used instead. There were no new unit test failures. However, the unit tests will fail if nlanes != max_nlanes; we should be able to resolve this once we add SVE intrinsics. This patch has been tested on both x86 and AArch64 machines.
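To make the nlanes / max_nlanes split concrete, here is a minimal sketch of the pattern described above. It is not code from this patch: `toy_float32` and `add_arrays` are hypothetical names, the "vector" operations are emulated with scalar loops, and the bound of 64 assumes the largest vector SVE allows (2048 bits) holding 32-bit floats.

```cpp
#include <cassert>

// Hypothetical stand-in for a scalable vector type; not an OpenCV type.
struct toy_float32 {
    // Compile-time upper bound: usable for array sizes and template arguments.
    // Assumes a 2048-bit maximum vector length: 2048 / 32 = 64 float lanes.
    static constexpr int max_nlanes = 64;
    // Lane count known only at runtime (a real SVE backend would query the hardware).
    static int nlanes;
};
int toy_float32::nlanes = 4;

// Element-wise add written against the runtime lane count.
void add_arrays(const float* a, const float* b, float* out, int length) {
    const int nlanes = toy_float32::nlanes;   // runtime value, so const rather than constexpr
    float lanes[toy_float32::max_nlanes];     // an array bound needs a compile-time constant
    assert(nlanes > 0 && nlanes <= toy_float32::max_nlanes);

    int i = 0;
    for (; i + nlanes <= length; i += nlanes) {
        // Emulated "vector" add followed by an emulated "vector" store.
        for (int k = 0; k < nlanes; ++k) lanes[k] = a[i + k] + b[i + k];
        for (int k = 0; k < nlanes; ++k) out[i + k] = lanes[k];
    }
    for (; i < length; ++i) out[i] = a[i] + b[i];  // scalar tail
}
```

The loop stride uses the runtime value, while the temporary buffer is sized by the compile-time bound; only the first `nlanes` entries of the buffer are ever touched.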
Thank you for the patch! I left several questions and comments.
The current patch doesn't provide any value on its own. It adds complexity but doesn't resolve any problem and doesn't provide any results. "Changes for the sake of changes", especially massive ones, should be avoided.
I believe the steps below are more realistic:
See also: https://github.com/opencv/opencv/wiki/CPU-optimizations-build-options
BTW, it makes sense to record as many technical decisions as possible in PRs/issues/commit messages (OpenCV is an open-source project, and technical information should be clear for future maintenance).
@alalek, @vpisarev, thank you for your feedback. I am back from my First thing first. There is an important reason why we would prefer to
With this change, the SIMD code for the loops like [1] See section 3.2.1 of SVE ACLE at @vpisarev - I think that the approach you suggested (the use of
As @alalek noticed, the change encompassed by 1, 2 and 3 doesn't However, it prepares the ground for SVE, where we would be able to do
@vpisarev, did I get your proposal right? On top of this, I think it is worth spending some time in discussing This approach is indeed one big change in terms of lines of code
I think that these preprocess conditions can be handled also by I extracted a runtime vs compile time version of one of its uses:
For the case of
I agree. However, for the cases you have listed, it seems we will be
I think these last three points are not the right approach. This
Totally agree. We had a couple of iteration via email with Vadim, |
@fpetrogalli Thank you for the information!
This definitely should be mitigated for smooth integration. Perhaps we need to look in the "Custom" HAL direction (see the documentation of the core / imgproc modules, and 3rdparty/carotene as an example). These optimization wrappers sit at a higher function/algorithm level and give much better flexibility for the underlying computational backend. As a drawback, their coverage is lower than that of OpenCV SIMD UI (new wrappers have to be defined when needed). BTW, there is also no restriction on writing an SVE HAL implementation in OpenCV SIMD UI style (to see which extra changes are necessary).
@alalek, thank you for your reply.
Apologies, I am not good at acronyms :) I get UI = User Interface, but I am not sure of the meaning of DX and UX.
We are not suggesting to drop away anything. All we are proposing is to replace the uses of
This is a mechanical change that:
This seems to create a disadvantage for SVE, though, and from the point of view of OpenCV it looks less appealing than the solution we are proposing. SVE2 is required by Armv9 (see https://developer.arm.com/architectures/cpu-architecture/a-profile), therefore we expect it to be a pervasive technology in Arm-based devices. A solution with less coverage could put OpenCV at a disadvantage against other computer vision solutions on these devices.
Sorry, I am not sure what you mean here with |
@alalek / @vpisarev - gentle ping. Could me and @Nicholas-Ho-arm proceed with the Kind regards, Francesco |
A prerequisite for an OpenCV SIMD UI backend is providing vector SIMD types (with several constructor signatures and methods). As far as I can see, SVE in its current form doesn't allow building a full-featured OpenCV SIMD UI backend. So,
The main point is that we should not start to modify/break/increase the complexity of existing OpenCV code until we have a clear understanding of how it works and how it could be used, and until we have some results (performance, development optimization guidelines). P.S. Please try to optimize BGR2GRAY conversion using SVE. It contains "interleaved" data, which is frequently used in Computer Vision. Another non-trivial processing function is resize of BGR data (3 channels). @vpisarev could provide his own suggestions on how to handle that.
@alalek, I am not sure this is the case. As far as I can tell, SVE has two incompatibilities with the current SIMD types handled by HAL:
However, I think that these are not issues:
I think that the example in #20640 shows the opposite. What you see there is fully HAL-compatible hand-written SVE code that could be easily ported to HAL intrinsics (the only caveat being the use of the predicate parameters, of type
I politely disagree that existing HAL code is incompatible with SVE. All it would require are mechanical changes that would not have any impact on performance.
Again, there are no real issues or limitations. It is just a matter of making some mechanical changes with no (expected) performance impact.
The example in #20640 is not BGR-specific, as we didn't have one ready at hand. We can come up with one if you really want to see how SVE handles interleaved data. However, I wonder if this is really necessary. SVE has 2/3/4-vector interleaved loads/stores (as NEON does), so I don't expect that the BGR-specific example will give us any extra information on the level of HAL-compatibility that SVE has. @alalek, please let me know if you want me to tackle this example anyway. (If that's the case, it would be great if you could point me at the specific SIMD code you want us to rewrite for SVE.)
All in all, I think that by extending HAL the way we need it to efficiently support SVE ( @alalek / @vpisarev / @asmorkalov Please let me know if you still have concerns, and thank you for your patience! |
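For readers following the interleaved-data point, a rough scalar sketch of what a 3-way de-interleaving load does in a BGR-to-gray loop with a runtime lane count may help. Everything in it is an assumption made for illustration: `bgr2gray_sketch` and `kMaxLanes` are hypothetical names, the weights are the common BT.601 luma coefficients in 14-bit fixed point, and the inner loops only emulate vector instructions. It is not SVE code and not OpenCV's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumed upper bound: a 2048-bit vector of 8-bit lanes.
constexpr int kMaxLanes = 256;

// Converts `pixels` interleaved B,G,R triples to gray, `lanes` pixels per step.
void bgr2gray_sketch(const std::uint8_t* bgr, std::uint8_t* gray, int pixels, int lanes) {
    assert(lanes > 0 && lanes <= kMaxLanes);
    std::uint8_t b[kMaxLanes], g[kMaxLanes], r[kMaxLanes];

    for (int i = 0; i < pixels; i += lanes) {
        // The tail is handled by shrinking the active lane count,
        // loosely mirroring what a predicate would do.
        const int active = std::min(lanes, pixels - i);

        // Emulated 3-way de-interleaving load: BGRBGR... -> B..., G..., R...
        for (int k = 0; k < active; ++k) {
            b[k] = bgr[3 * (i + k) + 0];
            g[k] = bgr[3 * (i + k) + 1];
            r[k] = bgr[3 * (i + k) + 2];
        }
        // Weighted sum per lane; the weights add up to 1 << 14.
        for (int k = 0; k < active; ++k) {
            const int y = b[k] * 1868 + g[k] * 9617 + r[k] * 4899 + (1 << 13);
            gray[i + k] = static_cast<std::uint8_t>(y >> 14);
        }
    }
}
```

Once the channels are separated, the arithmetic is the same per-lane work as in a single-channel kernel, whatever the lane count turns out to be.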
/* template<int i>
inline v_float64x2 v_broadcast_element(const v_float64x2& v)
{
    __m128i tmp = (__m128d) v.val;
    tmp = _mm_shuffle_epi32(tmp, _MM_SHUFFLE(2*i + 1, 2*i,
                                             2*i + 1, 2*i));
    __m128d tmp2 = (__m128i) tmp;
    return v_float64x2(tmp2);
} */
Remove me. :)
@@ -306,7 +306,7 @@ CV_ALWAYS_INLINE void absdiff_store(float out[], const v_float32& a, const v_flo
 template<typename T, typename VT>
 CV_ALWAYS_INLINE int absdiff_impl(const T in1[], const T in2[], T out[], int length)
 {
-    constexpr int nlanes = static_cast<int>(VT::nlanes);
+    const int nlanes = VT::nlanes;
Is this constexpr to const change needed? If not, please restore it. Please also revert all similar cases below.
It is needed if nlanes is determined at runtime.
Sure, but here we are not yet using a runtime nlanes, so it seems a bit premature to make this change.
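For context on the constexpr vs const question, a small sketch (hypothetical types, not OpenCV's) of why the same generic code can only keep `constexpr` while every backend exposes nlanes as a compile-time constant:

```cpp
#include <cassert>

// Hypothetical fixed-width backend: the lane count is a compile-time constant.
struct fixed_vec { static constexpr int nlanes = 4; };

// Hypothetical scalable backend: the lane count is only known at runtime.
struct scalable_vec { static int nlanes; };
int scalable_vec::nlanes = 8;

// Generic helper in the style of the loops touched by this PR.
template <typename VT>
int full_vector_steps(int length) {
    // `const` compiles for both backends. `constexpr` here would be rejected
    // for scalable_vec, because its nlanes is not a constant expression.
    const int nlanes = VT::nlanes;
    assert(nlanes > 0);
    return length / nlanes;
}
```

With only fixed-width backends in the tree, `constexpr` and `const` behave the same in such loops, which is the reason given for deferring the change.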
…ckend to constexpr
@@ -1097,7 +1097,7 @@ static void run_sepfilter3x3_any2short(DST out[], const SRC *in[], int width, in

 for (int l=0; l < length;)
 {
-    constexpr int nlanes = v_int16::nlanes;
+    constexpr int nlanes = v_uint16::nlanes;
Should be v_int16.
@@ -1284,7 +1284,7 @@ static void run_sepfilter3x3_char2short(short out[], const uchar *in[], int widt
 {
 for (int l=0; l < length;)
 {
-    constexpr int nlanes = v_int16::nlanes;
+    constexpr int nlanes = v_uint16::nlanes;
Should be v_int16.
@@ -1311,7 +1311,7 @@ static void run_sepfilter3x3_char2short(short out[], const uchar *in[], int widt

 for (int l=0; l < length;)
 {
-    constexpr int nlanes = v_int16::nlanes;
+    constexpr int nlanes = v_uint16::nlanes;
Should be v_int16.
@@ -1219,7 +1219,7 @@ template<template<typename T1, typename T2, typename Tvec> class OP>
 struct scalar_loader_n<sizeof(double), OP, double, double, v_float64>
 {
     typedef OP<double, double, v_float64> op;
-    enum {step = v_float64::nlanes};
+    enum {step=v_float64::nlanes};
Please restore the spaces:
-    enum {step=v_float64::nlanes};
+    enum {step = v_float64::nlanes};
@@ -1162,7 +1162,7 @@ struct scalar_loader_n<sizeof(float), OP, float, double, v_float32>
 {
     typedef OP<float, float, v_float32> op;
     typedef OP<double, double, v_float64> op64;
-    enum {step = v_float32::nlanes};
+    enum {step=v_float32::nlanes};
Please restore the spaces:
-    enum {step=v_float32::nlanes};
+    enum {step = v_float32::nlanes};
@@ -1212,13 +1212,13 @@ OPENCV_HAL_IMPL_NEON_SHIFT_OP(v_int64x2, s64, int64, s64)
 template<int n> inline _Tpvec v_rotate_right(const _Tpvec& a) \
 { return _Tpvec(vextq_##suffix(a.val, vdupq_n_##suffix(0), n)); } \
 template<int n> inline _Tpvec v_rotate_left(const _Tpvec& a) \
-{ return _Tpvec(vextq_##suffix(vdupq_n_##suffix(0), a.val, _Tpvec::nlanes - n)); } \
+{ return _Tpvec(vextq_##suffix(vdupq_n_##suffix(0), a.val, _Tpvec::max_nlanes - n)); } \
Why is this using max_nlanes? Shouldn't it be using nlanes?
I actually think that the intrinsics defined in all the intrin_*.hpp files should be using nlanes, not max_nlanes.
@vpisarev - am I missing something here?
nlanes should be fine; max_nlanes here is likely an artifact from testing.
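As a sanity check on why the extract offset should follow the actual lane count, here is a scalar emulation of the v_rotate_left pattern from the hunk above. `ext_sketch` and `rotate_left_sketch` are hypothetical helpers written for this comment; they only model the element order of an EXT-style extract and are not NEON or OpenCV code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates an EXT-style extract: take lo.size() consecutive elements from
// the concatenation lo:hi, starting at index `start`.
std::vector<int> ext_sketch(const std::vector<int>& lo, const std::vector<int>& hi,
                            std::size_t start) {
    assert(lo.size() == hi.size() && start <= lo.size());
    std::vector<int> out;
    out.reserve(lo.size());
    for (std::size_t i = 0; i < lo.size(); ++i) {
        const std::size_t j = start + i;
        out.push_back(j < lo.size() ? lo[j] : hi[j - lo.size()]);
    }
    return out;
}

// Same shape as the macro: extract from zeros:a at offset nlanes - n.
std::vector<int> rotate_left_sketch(const std::vector<int>& a, std::size_t n) {
    const std::size_t nlanes = a.size();  // the actual lane count, not an upper bound
    assert(n <= nlanes);
    return ext_sketch(std::vector<int>(nlanes, 0), a, nlanes - n);
}
```

With four lanes, `rotate_left_sketch({1, 2, 3, 4}, 1)` gives `{0, 1, 2, 3}`. The offset is only meaningful relative to the number of lanes actually present, which is why an upper bound like max_nlanes is the wrong quantity whenever the two values differ.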
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.