
Runtime nlanes for SVE enablement #20562

Closed
wants to merge 4 commits into from

Conversation


@Nicholas-Ho-arm Nicholas-Ho-arm commented Aug 17, 2021

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=linux, docs, ARMv8, ARMv7, Custom
build_image:Custom=javascript
buildworker:Custom=linux-4, linux-6

This patch aims to enable Scalable Vector Extension (SVE) in OpenCV,
which permits code to use a vector length defined at runtime. The
current Neon implementation sets the number of lanes per vector register
at compile time, as this is fixed. To enable SVE, the determination of
the number of lanes per vector is now calculated at runtime. The runtime
value is used where possible, and where a compile time constant is
required, a maximum number of lanes is set accordingly. There were no
new unit test failures. However, the unit tests would fail if nlanes !=
max_nlanes; we should be able to resolve this once we add SVE
intrinsics. This patch has been tested on both x86 and AArch64 machines.
@vpisarev (Contributor)

thank you for the patch! I put several questions, comments


alalek commented Aug 20, 2021

The current patch doesn't provide any value. It adds complexity but doesn't resolve any problem and doesn't provide any results. "Changes for changes' sake", especially massive ones, should be avoided.

nlanes is definitely not the last problem which needs to be resolved. And it definitely should not be the first problem. At least there are several #if conditions with CV_SIMD_WIDTH or constructs like #if CV_SIMD256 || CV_SIMD512.
Also runtime checks (replacing enums values by variables) are always slower than compile-time checks.
It is better to avoid adding extra "runtime" checks until we have any real performance results.

I believe the steps below are more realistic:

  • SVE SIMD backend should be implemented first. This would require independent "local" changes without massive touching of other parts of OpenCV library.
  • SVE backend should provide SIMD types/intrinsics for some CV_SVE_WIDTH macro value as input (similar to SSE / AVX2 / AVX512 approach but with single .hpp file).
  • dispatch modes should be extended to support SVE128 / SVE256 / SVE512 / SVE1024 / etc (CMake).
  • through compile-time defines code can be compiled several times (see .simd.hpp) for different vector widths.
  • it can be utilized through existed runtime dispatching code.
  • initial performance results can be measured at this point.

See also: https://github.com/opencv/opencv/wiki/CPU-optimizations-build-options


BTW, it makes sense to put as many technical decisions as possible in PRs/Issues/commit messages (just because OpenCV is an open-source project and technical information should be clear for future maintenance)

@fpetrogalli (Contributor)

@alalek, @vpisarev, thank you for your feedback. I am back from my
leave and I would like to chip in on the discussion.

First things first. There is an important reason why we would prefer to
make nlanes() a function call rather than an enum. Unfortunately,
it is not possible to use SVE C types (svfloat32_t,
svuint64_t, ...) as fields of structs and classes [1], so we won't
be able to use the method nlanes() for SVE. For the SVE header, we
would have to end up writing something along the lines of
https://godbolt.org/z/Tz9fYGsa6. Therefore, we cannot proceed
directly with our suggestion of ditching the nlanes enum in favor of
a runtime nlanes() invocation. However, we could come up with a
template function (equivalent to the member function) that could
return the number of lanes as follows:

template<typename vector_type>
int get_nlanes() {
#ifdef __ARM_FEATURE_SVE
    if (std::is_same<vector_type, svfloat32_t>::value) // likewise svint32_t, svuint32_t
      return svcntw();
    if (std::is_same<vector_type, svfloat64_t>::value) // likewise svint64_t, svuint64_t
      return svcntd();
#else
    // Use the enum when it is defined (SSE, AVX, ..., NEON).
    return vector_type::nlanes;
#endif
}

With this change, the SIMD code for the loops like for (int i = 0; i < n; i+=v_float32::nlanes) {body} will become for (int i = 0; i < n; i+=get_nlanes<v_float32>()) {body}.

[1] See section 3.2.1 of SVE ACLE at
https://developer.arm.com/documentation/100987/latest

@vpisarev - I think that the approach you suggested (the use of
max_nlanes) makes sense. Please correct me if I am wrong, but it seems
to me that what you are suggesting is the following:

  1. Change enum {nlanes = C} into enum {nlanes = C, max_nlanes = C'}, where C == C' for traditional fixed-length vector extensions
    (SSE, AVX, ..., NEON).
  2. Leave unchanged the code that does the loop increments: for (int i = 0; i < n; i+=v_float32::nlanes) {body}.
  3. Use max_nlanes in places where the size of the vectors is needed
    to set the memory that is needed to run the tests, for example in
    expressions like float data[v_float32::nlanes] (to be replaced by
    float data[v_float32::max_nlanes]).

As @alalek noticed, the change encompassed by 1, 2 and 3 doesn't
resolve any problem itself, and can be perceived as not providing any
value.

However, it prepares the ground for SVE, where we would be able to do
the following:

  1. Add SVE vector types, setting the enum to hold only enum {max_nlanes = C''} (no nlanes here), where C'' would be the
    maximum number of lanes allowed by the architecture (2048 bits
    divided by the number of bits in the scalar type).

  2. Modify the loop code to use increments conditional on whether we
    are using SVE or traditional SIMD (notice that this conditional
    compilation could be avoided with the get_nlanes<vtype>()
    template invocation):

    #if __ARM_FEATURE_SVE
    const int local_nlanes = svcntw();
    #else
    const int local_nlanes = v_float32::nlanes;
    #endif
    for (int i = 0; i < n; i+=local_nlanes) {body}
    
  3. max_nlanes will be used as in point 3 above, without changes. Because
    v_float32::max_nlanes will always be greater than or equal to
    svcntw(), the tests will not have any out-of-bounds memory issues,
    as enough memory will be allocated for their execution.

@vpisarev, did I get your proposal right?

On top of this, I think it is worth spending some time discussing
the original idea of converting enum {nlanes = COMPILE_TIME_CONSTANT} into a runtime invocation (whether static inline int nlanes() {return COMPILE_TIME_CONSTANT;} or the template
get_nlanes() suggested at the beginning of this reply), and the
concerns around generating runtime checks vs compile-time
checks.

This approach is indeed one big change in terms of lines of code
(admittedly, mostly a mechanical change, nothing fancy needs to be
introduced). However, as you can see from the two examples I mention
below in my in-line reply to @alalek, this approach doesn't seem to
introduce any runtime overhead for the architectures currently
supported by OpenCV (where the size of the vector types is known at
compile time).

The current patch doesn't provide any value. It adds complexity but
doesn't resolve any problem and doesn't provide any
results. "Changes for changes' sake", especially massive ones, should be
avoided.

nlanes is definitely not the last problem which needs to be
resolved. And it definitely should not be the first problem.

At least there are several #if conditions with CV_SIMD_WIDTH or
constructs like #if CV_SIMD256 || CV_SIMD512.

I think that these preprocessor conditions can also be handled by
runtime code, which will eventually end up producing the same
code generation (no runtime checks) for targets that have a known
vector length at compile time. For example, CV_SIMD256 is used in
~10 places:
https://github.com/opencv/opencv/search?q=CV_SIMD256&type=code

I extracted a runtime vs compile-time version of one of its uses:
https://godbolt.org/z/WGxndWfqc. I think we can safely say that using
a static method of a class (visible in the compilation unit) will have
the same effect as using the preprocessor macros: the compiler sees
the compile-time constant inside the method, inlines the constant, and
removes the code that is not specific to the value returned by the
method.

Also runtime checks (replacing enums values by variables) are always
slower than compile-time checks.

For the case of enum vs static method, the compiler seems to do the
right job when the body of the static method is visible in the compilation
unit (which is the case for HAL): https://godbolt.org/z/53xPbnaG9 In
this reduced example, you can see that the compiler is able to remove
the call to the static method and generate the same code without
runtime checks. Notice that this is achieved even by old compilers (you
can see gcc 4.9).

It is better to avoid adding extra "runtime" checks until we have
any real performance results.

I agree. However, for the cases you have listed, it seems we will be
avoiding runtime checks.

I believe the steps below are more realistic:

* SVE SIMD backend should be implemented first. This would require independent "local" changes without massive touching of other parts of OpenCV library.

* SVE backend should provide SIMD types/intrinsics for some `CV_SVE_WIDTH` macro value as input (similar to SSE / AVX2 / AVX512 approach but with single .hpp file).

* dispatch modes should be extended to support SVE128 / SVE256 / SVE512 / SVE1024 / etc (CMake).

* through compile-time defines code can be compiled several times (see `.simd.hpp`) for different vector widths.

* it can be utilized through existed runtime dispatching code.

I think these last three points are not the right approach. This
approach forces us to bloat the library and add extra indirection,
despite that not being needed architecturally.

* initial performance results can be measured at this point.

See also: https://github.com/opencv/opencv/wiki/CPU-optimizations-build-options

BTW, it makes sense to put as many technical decisions as possible in PRs/Issues/commit messages (just because OpenCV is an open-source project and technical information should be clear for future maintenance)

Totally agree. We had a couple of iterations via email with Vadim,
mostly to ensure Nick was being helped in my absence. We didn't have
any intention to keep the discussion private.


alalek commented Aug 24, 2021

@fpetrogalli Thank you for the information!

it is not possible to use SVE C types (svfloat32_t,
svuint64_t, ...) as fields of structs and classes [1], so we won't
be able to use the method nlanes() for SVE

This definitely should be mitigated for smooth integration.
OpenCV SIMD UI wrappers add extra methods or constructor variants for SIMD types for better DX/UX.
We can't just drop away all this stuff.


Perhaps, we need to take a look into "Custom" HAL direction (see documentation of core / imgproc modules and 3rdparty/carotene as example). These optimization wrappers are on higher function/algorithm level - provides much better flexibility for underlying computational backend. As a drawback its coverage is less than OpenCV SIMD UI (need to define extra new wrappers if needed).

BTW, There is also no restriction to write SVE HAL implementation in OpenCV SIMD UI style (to see which extra changes are necessary).


fpetrogalli commented Aug 25, 2021

@alalek, thank you for your reply.

@fpetrogalli Thank you for the information!

it is not possible to use SVE C types (svfloat32_t,
svuint64_t, ...) as fields of structs and classes [1], so we won't
be able to use the method nlanes() for SVE

This definitely should be mitigated for smooth integration.
OpenCV SIMD UI wrappers add extra methods or constructor variants for SIMD types for better DX/UX.

Apologies, I am not good at acronyms :) I get UI = User Interface, but I am not sure of the meaning of DX and UX.

We can't just drop away all this stuff.

We are not suggesting to drop anything. All we are proposing is to replace the uses of <vector_type>::nlanes, where <vector_type> is any of the structs/typedefs used by HAL to represent vectors - for example float32x4, v_int8, and so on - with an invocation of the get_nlanes function template proposed in my long reply, so that we will end up seeing get_nlanes<vector_type>() in the SIMD loops:

// old version
for (int i = 0; i <n; i+=v_float32::nlanes) {...}

// new version
for (int i = 0; i <n; i+=get_nlanes<v_float32>()) {...}

This is a mechanical change that:

  1. will not change performance, because no runtime checks will be added (as shown in the examples of my reply)
  2. will not require any change in the current implementation of HAL. The enum {nlanes = C} used in the headers for SSE, ..., NEON will stay there and will be used by get_nlanes.
  3. will allow adding the header file for SVE in VLA (Vector Length Agnostic) form, so that we will not have to compile the library for different VLs (Vector Lengths) and we will not have to come up with any runtime selection system for SVE.

Perhaps, we need to take a look into "Custom" HAL direction (see documentation of core / imgproc modules and 3rdparty/carotene as example). These optimization wrappers are on higher function/algorithm level - provides much better flexibility for underlying computational backend. As a drawback its coverage is less than OpenCV SIMD UI (need to define extra new wrappers if needed).

This seems to create a disadvantage for SVE, though, and it seems to me that this solution looks less appealing than the one we are proposing from OpenCV's point of view. SVE2 is required by Armv9 (see https://developer.arm.com/architectures/cpu-architecture/a-profile), therefore we expect it to be a pervasive technology in Arm-based devices. A solution with less coverage could be a disadvantage for OpenCV vs other computer vision solutions on these devices.

BTW, There is also no restriction to write SVE HAL implementation in OpenCV SIMD UI style (to see which extra changes are necessary).

Sorry, I am not sure what you mean here with write SVE HAL implementation in OpenCV SIMD UI style.

@fpetrogalli (Contributor)

@alalek / @vpisarev - would it help to see a (WIP) patch with the proposed changes (the template get_nlanes) to see the amount of mechanical changes required?

@fpetrogalli (Contributor)

@alalek / @vpisarev - gentle ping. Could me and @Nicholas-Ho-arm proceed with the get_nlanes proposal?

Kind regards,

Francesco


alalek commented Aug 31, 2021

The OpenCV SIMD UI backend's prerequisite is providing vector SIMD types (with several constructor signatures and methods).
SVE doesn't allow building them in an efficient way with current compiler support.

As I can see SVE doesn't allow to build full-featured OpenCV SIMD UI backend in its current form. So,

  • we need to build some subset of SIMD intrinsics compatible with SVE / RVV. This would require new API, defines/macros, get_nlanes implementation, other things. The most problematic part is adopting of already existed OpenCV code.
  • or we can provide HAL implementation for OpenCV. This code could help to collect optimization patterns and other issues/limitations related to SVE paradigm. As I write above, this can be implemented in SIMD UI way for easier integration (dedicated code, but with own SIMD UI-like backend for SVE and optimizations with get_nlanes similar to existed based on current OpenCV SIMD UI backends). This would be the first step before adoption of SIMD UI API and optimized OpenCV code.

Main point is that we should not start to modify/break/increase complexity of existed OpenCV code until we have clear understanding how it works, how it could be used and we have some results (performance, development optimization guidelines).

P.S. Please try to optimize BGR2GRAY conversion using SVE. It contains "interleaved" data which is frequently used in Computer Vision. Another non-trivial processing function is resize of BGR data (3 channels).

@vpisarev could provide own suggestions how to handle that.
@asmorkalov could you please share experience about similar technology from RISCV (RVV)?


fpetrogalli commented Sep 1, 2021

The OpenCV SIMD UI backend's prerequisite is providing vector SIMD types (with several constructor signatures and methods).
SVE doesn't allow building them in an efficient way with current compiler support.

@alalek, I am not sure this is the case. As far as I can tell, there are two incompatibilities between SVE and the current SIMD types handled by HAL:

  1. The use of the member field val, which carries the C intrinsics type used by the vector extension
  2. The use of constructors from scalars.

However, I think that these are not issues:

  1. val is used only in the header files of the port (intrin_sse.hpp, intrin_neon.hpp). It could be made a private member of the struct, without affecting any of the user SIMD code (UPDATE: sorry, this is not true, we can't make val private - however, it is still true that it is used only in the header files of HAL). In fact, I couldn't find any code in the modules that uses the field .val directly. Hence, for SVE, we could create an intrin_sve.hpp file where the SIMD types are typedef-ed to the SVE types, like in the example at https://godbolt.org/z/Tz9fYGsa6 (the same typedef could be used to remap v_float32x4 to the SVE type). The header file will simply not need to deal with .val.
  2. Constructors that take a set of scalars as input to build a SIMD vector are used in user code. But those could be easily replaced with LUT-like constructors. I am pretty sure that such LUT constructors will not have significant performance issues with respect to the use of the original constructors, as the example at https://godbolt.org/z/sohTbGE3a shows. The code generated for LUT constructors - when the indexes are known, which is the case for all the uses of scalars-to-vector constructors in the user code - is exactly the same as the code generated by the scalars-to-vector constructor. Moreover, OpenCV supports AVX512, so these LUT-like constructors are likely to be beneficial not just for SVE.

As I can see SVE doesn't allow to build full-featured OpenCV SIMD UI backend in its current form. So,

I think that the example in #20640 shows the opposite. What you see there is fully HAL-compatible hand-written SVE code that could be easily ported to HAL intrinsics (the only caveat being the use of the predicate parameters, of type svbool_t, which is anyway easily solvable in the header file by setting all predicates to all-lanes-active).

* we need to build some subset of SIMD intrinsics compatible with SVE / RVV. This would require new API, defines/macros, get_nlanes implementation, other things. The most problematic part is adopting of already existed OpenCV code.

I politely disagree with the claim that existing HAL code is incompatible with SVE. All it would require are mechanical changes that would not have any impact on performance.

* or we can provide HAL implementation for OpenCV. This code could help to collect optimization patterns and other issues/limitations related to SVE paradigm. As I write above, this can be implemented in SIMD UI way for easier integration (dedicated code, but with own SIMD UI-like backend for SVE and optimizations with get_nlanes similar to existed based on current OpenCV SIMD UI backends). This would be the first step before adoption of SIMD UI API and optimized OpenCV code.

Again, there are no real issues or limitations. It is just a matter of making some mechanical changes with no (expected) performance impact.

Main point is that we should not start to modify/break/increase complexity of existed OpenCV code until we have clear understanding how it works, how it could be used and we have some results (performance, development optimization guidelines).

P.S. Please try to optimize BGR2GRAY conversion using SVE. It contains "interleaved" data which is frequently used in Computer Vision. Another non-trivial processing function is resize of BGR data (3 channels).

The example in #20640 is not BGR-specific, as we didn't have one ready at hand. We can come up with it if you really want to see how SVE handles interleaved data. However, I wonder if this is really necessary. SVE has 2/3/4-vector interleaved loads/stores (as NEON does), so I don't expect that the BGR-specific example will give us any extra info on the level of HAL compatibility that SVE has. @alalek, please let me know if you want me to tackle this example anyway. (If that's the case, it would be great if you could point me at the specific SIMD code you want us to rewrite for SVE.)

@vpisarev could provide own suggestions how to handle that.
@asmorkalov could you please share experience about similar technology from RISCV (RVV)?

All in all, I think that extending HAL the way we need to efficiently support SVE (get_nlanes, LUT constructors, making the field val private) would be beneficial not just for SVE but also for RVV (vector length agnosticism) and AVX512 (look-up tables).

@alalek / @vpisarev / @asmorkalov Please let me know if you still have concerns, and thank you for your patience!

Comment on lines 3395 to 3404
/* template<int i>
inline v_float64x2 v_broadcast_element(const v_float64x2& v)
{
__m128i tmp = (__m128d) v.val;
tmp = _mm_shuffle_epi32(tmp, _MM_SHUFFLE(2*i + 1, 2*i,
2*i + 1, 2*i));
__m128d tmp2 = (__m128i) tmp;
return v_float64x2(tmp2);
} */

Remove me. :)

@@ -306,7 +306,7 @@ CV_ALWAYS_INLINE void absdiff_store(float out[], const v_float32& a, const v_flo
template<typename T, typename VT>
CV_ALWAYS_INLINE int absdiff_impl(const T in1[], const T in2[], T out[], int length)
{
constexpr int nlanes = static_cast<int>(VT::nlanes);
const int nlanes = VT::nlanes;

Is this constexpr to const change needed? If not, please restore it. Please also revert all similar cases below.


It is if nlanes is determined at runtime


Sure, but here we are not yet using a runtime nlanes, so it seems a bit premature to do this change.

@@ -1097,7 +1097,7 @@ static void run_sepfilter3x3_any2short(DST out[], const SRC *in[], int width, in

for (int l=0; l < length;)
{
constexpr int nlanes = v_int16::nlanes;
constexpr int nlanes = v_uint16::nlanes;

Should be v_int16.

@@ -1284,7 +1284,7 @@ static void run_sepfilter3x3_char2short(short out[], const uchar *in[], int widt
{
for (int l=0; l < length;)
{
constexpr int nlanes = v_int16::nlanes;
constexpr int nlanes = v_uint16::nlanes;

Should be v_int16.

@@ -1311,7 +1311,7 @@ static void run_sepfilter3x3_char2short(short out[], const uchar *in[], int widt

for (int l=0; l < length;)
{
constexpr int nlanes = v_int16::nlanes;
constexpr int nlanes = v_uint16::nlanes;

Should be v_int16.

@@ -1219,7 +1219,7 @@ template<template<typename T1, typename T2, typename Tvec> class OP>
struct scalar_loader_n<sizeof(double), OP, double, double, v_float64>
{
typedef OP<double, double, v_float64> op;
enum {step = v_float64::nlanes};
enum {step=v_float64::nlanes};

Please restore the spaces:

Suggested change
enum {step=v_float64::nlanes};
enum {step = v_float64::nlanes};

@@ -1162,7 +1162,7 @@ struct scalar_loader_n<sizeof(float), OP, float, double, v_float32>
{
typedef OP<float, float, v_float32> op;
typedef OP<double, double, v_float64> op64;
enum {step = v_float32::nlanes};
enum {step=v_float32::nlanes};

Please restore the spaces:

Suggested change
enum {step=v_float32::nlanes};
enum {step = v_float32::nlanes};

@@ -1212,13 +1212,13 @@ OPENCV_HAL_IMPL_NEON_SHIFT_OP(v_int64x2, s64, int64, s64)
template<int n> inline _Tpvec v_rotate_right(const _Tpvec& a) \
{ return _Tpvec(vextq_##suffix(a.val, vdupq_n_##suffix(0), n)); } \
template<int n> inline _Tpvec v_rotate_left(const _Tpvec& a) \
{ return _Tpvec(vextq_##suffix(vdupq_n_##suffix(0), a.val, _Tpvec::nlanes - n)); } \
{ return _Tpvec(vextq_##suffix(vdupq_n_##suffix(0), a.val, _Tpvec::max_nlanes - n)); } \

Why is this using max_nlanes? Shouldn't it be using nlanes?

I actually think that in all the intrin_*.hpp files the intrinsics defined in there should be using nlanes, not max_nlanes.

@vpisarev - am I missing something here?


nlanes should be fine, it's likely an artifact from testing


alalek commented Nov 25, 2021

#21127
