Carotene (ARM HAL) uses wrong rounding in some places #24163
The result on macOS is so weird, why it is all |
Correct result is "all zeros". Verification:
|
Windows MSVC 2022 is A = |
I worked with @IskXCr, and we looked at the initialization of this struct template:
template <> struct wAdd<u32>
{
typedef u32 type;
f32 alpha, beta, gamma;
float32x4_t valpha, vbeta, vgamma;
wAdd(f32 _alpha, f32 _beta, f32 _gamma):
alpha(_alpha), beta(_beta), gamma(_gamma)
{
valpha = vdupq_n_f32(_alpha);
vbeta = vdupq_n_f32(_beta);
vgamma = vdupq_n_f32(_gamma + 0.5);
}
void operator() (const VecTraits<u32>::vec128 & v_src0,
const VecTraits<u32>::vec128 & v_src1,
VecTraits<u32>::vec128 & v_dst) const
{
float32x4_t vs1 = vcvtq_f32_u32(v_src0);
float32x4_t vs2 = vcvtq_f32_u32(v_src1);
vs1 = vmlaq_f32(vgamma, vs1, valpha);
vs1 = vmlaq_f32(vs1, vs2, vbeta);
v_dst = vcvtq_u32_f32(vs1);
}
<..remaining part..>
}
That 0.5 is suspicious. It seems that we could simply delete it, but we think there must be some reason why it is there. |
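One likely reason for the 0.5: `vcvtq_u32_f32` truncates toward zero, so adding 0.5 first turns the conversion into "round half up" for non-negative values. That differs from round-to-nearest, ties-to-even only on exact ties such as 0.5. A minimal scalar sketch of the two behaviours follows; the helper names `round_half_up` and `round_ties_even` are hypothetical and not part of Carotene.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: models vcvtq_u32_f32(x + 0.5f) for non-negative x.
// The float-to-integer conversion truncates, so the added 0.5 makes it
// behave like "round half up".
inline std::uint32_t round_half_up(float x)
{
    return static_cast<std::uint32_t>(x + 0.5f);
}

// Hypothetical helper: round to nearest, ties to even. This relies on the
// default floating-point rounding mode being in effect.
inline std::uint32_t round_ties_even(float x)
{
    return static_cast<std::uint32_t>(std::nearbyint(x));
}
```

For the `(1+0)/2` case from the report, the input is exactly 0.5: `round_half_up(0.5f)` gives 1, while `round_ties_even(0.5f)` gives 0, which matches the "all zeros" result stated above as correct.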
@Haosonn and I investigated the call stack while debugging the given example. The problem can be traced back to these recursive template functions.
// 3rdparty/carotene/src/add_weighted.cpp
#define IMPL_ADDWEIGHTED(type) \
void addWeighted(const Size2D &size, \
const type * src0Base, ptrdiff_t src0Stride, \
const type * src1Base, ptrdiff_t src1Stride, \
type * dstBase, ptrdiff_t dstStride, \
f32 alpha, f32 beta, f32 gamma) \
{ \
internal::assertSupportedConfiguration(); \
wAdd<type> wgtAdd(alpha, \
beta, \
gamma); \
internal::vtransform(size, \
src0Base, src0Stride, \
src1Base, src1Stride, \
dstBase, dstStride, \
wgtAdd); \
}
The above code snippet is a macro that, when expanded, computes the weighted sum: When executing function
// 3rdparty/carotene/src/add_weighted.cpp
void operator() (const T * src0, const T * src1, T * dst) const
{
dst[0] = saturate_cast<T>(alpha*src0[0] + beta*src1[0] + gamma);
}
and the wider type
wAdd(f32 _alpha, f32 _beta, f32 _gamma):
alpha(_alpha), beta(_beta), gamma(_gamma)
{
valpha = vdupq_n_f32(_alpha);
vbeta = vdupq_n_f32(_beta);
vgamma = vdupq_n_f32(_gamma + 0.5);
}
This makes the results of the functions that use SIMD inconsistent with the results of the direct scalar calculation on the given type. |
@ZJUGuoShuai could you build OpenCV from the 4.x branch and check whether the issue is still relevant? |
Hello, I investigated a little more. opencv/3rdparty/carotene/src/add_weighted.cpp Lines 156 to 166 in 21fb10c
If ARMv7 does not need to be supported, we can replace the conversion instruction (this probably works only on ARMv8+). However, if ARMv7 must be supported, it is not easy, because the hardware provides no instruction for this.
kmtr@ubuntu:~/work/opencv_ram$ git diff -c
diff --git a/3rdparty/carotene/src/add_weighted.cpp b/3rdparty/carotene/src/add_weighted.cpp
index 6559b9fe53..09940a8524 100644
--- a/3rdparty/carotene/src/add_weighted.cpp
+++ b/3rdparty/carotene/src/add_weighted.cpp
@@ -150,7 +150,7 @@ template <> struct wAdd<u32>
{
valpha = vdupq_n_f32(_alpha);
vbeta = vdupq_n_f32(_beta);
- vgamma = vdupq_n_f32(_gamma + 0.5);
+ vgamma = vdupq_n_f32(_gamma);
}
void operator() (const VecTraits<u32>::vec128 & v_src0,
@@ -162,7 +162,7 @@ template <> struct wAdd<u32>
vs1 = vmlaq_f32(vgamma, vs1, valpha);
vs1 = vmlaq_f32(vs1, vs2, vbeta);
- v_dst = vcvtq_u32_f32(vs1);
+ v_dst = vcvtnq_u32_f32(vs1);
}
void operator() (const VecTraits<u32>::vec64 & v_src0, |
armv7 is still more than alive. Its support is required. |
Confirmed the issue with Jetson NANO (Armv8, linux). |
Looks like the approach with +0.5 and vdupq_n_f32 is widely used in Carotene. Need to revise it:
|
(1+0)/2
on different platforms
Thank you for the comment.
linaro@linaro-alip:~/work/opencv$ git --no-pager diff -c
diff --git a/3rdparty/carotene/src/add_weighted.cpp b/3rdparty/carotene/src/add_weighted.cpp
index 6559b9fe53..c56f95a4e3 100644
--- a/3rdparty/carotene/src/add_weighted.cpp
+++ b/3rdparty/carotene/src/add_weighted.cpp
@@ -150,7 +150,7 @@ template <> struct wAdd<u32>
{
valpha = vdupq_n_f32(_alpha);
vbeta = vdupq_n_f32(_beta);
- vgamma = vdupq_n_f32(_gamma + 0.5);
+ vgamma = vdupq_n_f32(_gamma);
}
void operator() (const VecTraits<u32>::vec128 & v_src0,
@@ -162,7 +162,7 @@ template <> struct wAdd<u32>
vs1 = vmlaq_f32(vgamma, vs1, valpha);
vs1 = vmlaq_f32(vs1, vs2, vbeta);
- v_dst = vcvtq_u32_f32(vs1);
+ v_dst = round_u32_f32(vs1);
}
void operator() (const VecTraits<u32>::vec64 & v_src0,
@@ -174,7 +174,7 @@ template <> struct wAdd<u32>
vs1 = vmla_f32(vget_low(vgamma), vs1, vget_low(valpha));
vs1 = vmla_f32(vs1, vs2, vget_low(vbeta));
- v_dst = vcvt_u32_f32(vs1);
+ v_dst = round_u32_f32(vs1);
}
void operator() (const u32 * src0, const u32 * src1, u32 * dst) const
diff --git a/3rdparty/carotene/src/vtransform.hpp b/3rdparty/carotene/src/vtransform.hpp
index 08841a2263..7ae38e95df 100644
--- a/3rdparty/carotene/src/vtransform.hpp
+++ b/3rdparty/carotene/src/vtransform.hpp
@@ -682,6 +682,31 @@ void vtransform(Size2D size,
}
}
+inline VecTraits<u32>::vec128 round_u32_f32(const float32x4_t val)
+{
+#if defined(__aarch64__) || defined(__aarch32__)
+ return vcvtnq_u32_f32(val);
+#else // armv7
+#if 1
+ static const float32x4_t f32_v0_5 = vdupq_n_f32(0.5);
+ static const uint32x4_t u32_v1_0 = vdupq_n_u32(1);
+
+ const uint32x4_t round = vcvtq_u32_f32( vaddq_f32(val, f32_v0_5 ) );
+ const uint32x4_t isOdd = vandq_u32( round, u32_v1_0 );
+ const uint32x4_t isFrac0_5 = vceqq_f32(vsubq_f32(vcvtq_f32_u32(round),val), f32_v0_5 );
+ return vsubq_u32( round, vandq_u32( isOdd, isFrac0_5 ) );
+#else
+ static const float32x4_t f32_v0_5 = vdupq_n_f32(0.5);
+ return vcvtq_u32_f32( vaddq_f32(val, f32_v0_5) );
+#endif
+#endif
+}
+
+inline VecTraits<u32>::vec64 round_u32_f32(const float32x2_t val)
+{
+ return vcvt_u32_f32(val);
+}
+
} }
#endif // CAROTENE_NEON
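The ARMv7 branch of `round_u32_f32` in the trial diff can be checked off-device with a scalar model of a single lane. The sketch below mirrors the vector steps one by one and compares them against `std::nearbyint`. The names `round_u32_f32_model` and `matches_nearbyint` are hypothetical, and the model only covers non-negative inputs.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical scalar model of one lane of the ARMv7 fallback
// (non-negative inputs only).
inline std::uint32_t round_u32_f32_model(float val)
{
    // vcvtq_u32_f32(vaddq_f32(val, 0.5)): add 0.5, then truncate.
    const std::uint32_t rounded = static_cast<std::uint32_t>(val + 0.5f);
    // vandq_u32(round, 1): 1 when the rounded value is odd.
    const std::uint32_t isOdd = rounded & 1u;
    // vceqq_f32(round - val, 0.5): all-ones mask when val was an exact tie.
    const std::uint32_t isTie =
        (static_cast<float>(rounded) - val == 0.5f) ? 0xFFFFFFFFu : 0u;
    // Pull odd results of exact ties back down to the even neighbour.
    return rounded - (isOdd & isTie);
}

// Hypothetical self-check: compare the model with std::nearbyint on a grid
// of quarter steps, all of which are exactly representable as float.
inline bool matches_nearbyint(int steps)
{
    for (int i = 0; i < steps; ++i)
    {
        const float v = 0.25f * static_cast<float>(i);
        if (round_u32_f32_model(v) != static_cast<std::uint32_t>(std::nearbyint(v)))
            return false;
    }
    return true;
}
```

Calling `matches_nearbyint(4000)` covers values from 0 up to just under 1000. Inputs at or above 2^23 are outside what this sketch checks, because `val + 0.5f` is no longer exact in single precision there.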
|
Hello, I'm sorry, I couldn't finish the fix this weekend. My trial code is here, but I will still fix, change, and refactor it. Also, fixing v_round() is likely to affect the performance of computationally intensive tasks such as DNNs. The following is a note for investigating this problem: I was getting wrong results in the reference/original code when the input was negative. This is OK: opencv/modules/core/include/opencv2/core/hal/intrin_neon.hpp Lines 1970 to 1977 in c552e9e
This is not OK: opencv/3rdparty/carotene/src/add_weighted.cpp Lines 106 to 122 in c552e9e
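To illustrate why negative inputs are a problem, here is a scalar sketch. It assumes the referenced code follows the same add-0.5-then-truncate pattern as the `wAdd<u32>` specialization quoted in this thread. The referenced lines are not reproduced here, so this is an assumption. The helper names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Assumed pattern (not copied from Carotene): add 0.5, then convert with
// truncation toward zero, which is how vcvtq_s32_f32 converts.
inline std::int32_t add_half_then_truncate(float x)
{
    return static_cast<std::int32_t>(x + 0.5f);
}

// Reference: round to nearest, ties to even, under the default
// floating-point rounding mode.
inline std::int32_t round_ties_even_s32(float x)
{
    return static_cast<std::int32_t>(std::lrint(x));
}
```

With this pattern even an exact negative integer comes out wrong: `add_half_then_truncate(-2.0f)` computes -1.5 and truncates it to -1, while `round_ties_even_s32(-2.0f)` gives -2. Non-negative inputs such as 2.0 are unaffected.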
|
I created a pull request to fix this problem. I believe there is no performance impact on A32/A64. [Fixed]
[Not fixed] I cannot fix it because I cannot verify it on a Tegra device.
[Not fixed] These are not simple rounding. I believe it is better to fix them only if an issue turns up.
|
Fix to convert float32 to int32/uint32 with rounding to nearest (ties to even). #24271
Fix #24163
### Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [x] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake
(carotene is BSD)
System Information
Detailed description
Even this (1+0)/2 gives different result on macOS and Linux:
Steps to reproduce
Write this code:
On macOS:
On Linux:
Issue submission checklist