Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSoC] New universal intrinsic backend for RVV #22179

Merged
merged 12 commits into from Jul 19, 2022
Merged

Conversation

hanliutong
Copy link
Contributor

This is a patch of my GSoC project whose goal is to make the existing Universal Intrinsic compatible with variable-length backends. Thereby improving the performance of the RISC-V Vector (RVV) backend.

In this patch,

  1. We added a new rvv backend. Unlike the current implementation, we used the rvv type directly instead of the wrapper class, which will improve performance. Currently only the necessary functions are included, mainly v_setall, v_load, v_lut for compatibility with vx_XXX and v_min ,v_max for the MedianBlur example.
  2. To be compatible with other backends that use wrapper classes, we had to modify the existing universal intrinsic interfaces, including Vtraits and function compatibility layers. The documentation has not been updated yet, we will update it when changes of API are finalized.
  3. We modified some of the existing SIMD code to ensure compilation, mainly by adding the packaging of the CV_SIMD macro
  4. We enabled the new rvv backend implementation for the Imgproc module's MedianBlur as an example of how to reuse existing UI-optimized code blocks.

Run the opencv_test_imgproc, the test passed and result are promising: 7040 ms by new implentation vs. 42094 ms by current one.
$ qemu-riscv64 -cpu rv64,x-v=true ./bin/opencv_test_imgproc --gtest_filter="Imgproc_MedianBlur*"

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

#define OPENCV_HAL_IMPL_RVV_INIT_INTEGER(_Tpvec, _Tp, suffix1, suffix2, vl) \
inline v_##_Tpvec v_setzero_##suffix1() \
{ \
return vmv_v_x_##suffix2##m1(0, vl); \
Copy link
Contributor

@vpisarev vpisarev Jul 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setting all lanes of a register to 0's or to specific constant are very common operations that are usually implemented very efficiently. E.g. on assembly level the first operation is done in the same way for all data types:

XOR DST, DST, DST

Is the vector size really needed for those intrinsics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should not use static inline int vlanes() { return vsetvlmax_e8m1(); } because according to my experiment, it will generate redundant ‘setvl’ instructions, the compiler faithfully translates intrinsic to instruction without any optimization.

At fisrt version, we use static variables static int nlanes, which doesn't produce redundant instructions, but needs to be initialized in some cpp file, where we encountered link time errors.

And if we can use static inline variables, things become easy, we can initialized the static variables in the class like static inline int nlanes = vsetvlmax_e8m1(); . However we need c++ 17 for inline variables.

You can see my experiment here: https://godbolt.org/z/Kjsea8x7f

Copy link
Contributor

@vpisarev vpisarev Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanliutong,
we need to think about it. On the one hand, it's a good idea to use static inline variables. This way we will minimize the number of changes in SIMD loops that we have and, as you said, it will eliminate irrelevant calls to vsetvlmax_...().
On the other hand, OpenCV 4.x requires an earlier C++ version, C++ 11 actually. So we need to offer some alternative for C++ 11 users as well. Even users of modern versions of GCC and clang do not always set -std=c++17 in their projects.
Also, what still worries me is that we access nlanes/vlanes() in each single intrinsic implementation. See, for example, implementation of v_min/v_max intrinsics.

I wonder, if the following solution will work on RISC-V platform:

template<> struct VTraits<v_uint8> {
static inline int vlanes() { static int nlanes = vsetvlmax_e8m1(); return nlanes; }
};

I tried it briefly on my machine, M1 mac, where vsetvlmax_e8m1() was replaced with rand() and it works well. That is, the initialization runs once and then the value is just reused. Can you try it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would also be interesting to compare efficiency of the two approaches (always call vsetvlmax_...() or call it once and store in private static variable) on median blur and deep learning convolution loops

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to your latest comment, I have updated the source file, and yes, I also think using static variable in intrin_rvv_scalable.hpp file is the best solution

The experimental results are promising:

always call vsetvl static variable
median blur 7377 ms 7003 ms
convolution 3079 ms 2413 ms

@vpisarev
Copy link
Contributor

@hanliutong, I realized that our possible solution with CV_SIMD_SCALABLE_INIT() can only work if this macro is defined before intrin_rvv_scalable.hpp is included. Probably, the best way is to define

static const int __cv_rvv_u8_nlanes = vlsetvlmax_e8m1();

in the beginning of intrin_rvv_scalable.hpp. In this case CV_SIMD_SCALABLE_INIT() is not required.

@vpisarev
Copy link
Contributor

@alalek, I think, the patch is ready; the further work in this GSoC depends on it. Could you please merge it?

@alalek
Copy link
Member

alalek commented Jul 19, 2022

@alalek, I think, the patch is ready; the further work in this GSoC depends on it. Could you please merge it?

It is not ready.

Some checks were not successful
11 successful, 2 failing, and 1 skipped checks

@hanliutong
Copy link
Contributor Author

Hi @alalek, I have modified the code according to the error message, could you please help me rerun the workflow.

1 workflow awaiting approval
First-time contributors need a maintainer to approve running workflows.
1 pending check

I also found that there are some unnecessary warnings here about initializing other data types with int.
What do we need to do about this warnings? Is there a macro in OpenCV to suppress warnings

@alalek alalek merged commit 0ef8039 into opencv:4.x Jul 19, 2022
@alalek
Copy link
Member

alalek commented Jul 19, 2022

This patch break "Linux Debug" builds.
Please fix it ASAP.

@hanliutong hanliutong mentioned this pull request Jul 21, 2022
6 tasks
@hanliutong hanliutong deleted the new-rvv branch July 21, 2022 09:18
@alalek alalek mentioned this pull request Aug 21, 2022
@hanliutong hanliutong mentioned this pull request Feb 28, 2023
6 tasks
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
[GSoC] New universal intrinsic backend for RVV

* Add new rvv backend (partially implemented).

* Modify the framework of Universal Intrinsic.

* Add CV_SIMD macro guards to current UI code.

* Use vlanes() instead of nlanes.

* Modify the UI test.

* Enable the new RVV (scalable) backend.

* Remove whitespace.

* Rename and some others modify.

* Update intrin.hpp but still not work on AVX/SSE

* Update conditional compilation macros.

* Use static variable for vlanes.

* Use max_nlanes for array defining.
EXPECT_NE((size_t)0, (size_t)&data.u.d % CV_SIMD_WIDTH);
EXPECT_EQ((size_t)0, (size_t)&out.a.d % CV_SIMD_WIDTH);
EXPECT_NE((size_t)0, (size_t)&out.u.d % CV_SIMD_WIDTH);
EXPECT_EQ((size_t)0, (size_t)&data.a.d % VTraits<R>::max_nlanes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a wrong change. CV_SIMD_WIDTH was replaced with sizeof(typename VTraits<R>::lane_type) * VTraits<R>::vlanes() in other places but not these four lines. Details see #25196 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants