Carotene (ARM HAL) uses wrong rounding in some places #24163

AtomicVar · 2023-08-16T12:43:32Z

System Information

OpenCV version: 4.8.0
Operating System: macOS 13.3.1 (ARM64) / Ubuntu 20.04.4 (x86_64)
Compiler: Apple clang version 14.0.3 / g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

Detailed description

Even this (1+0)/2 gives different result on macOS and Linux:

cv::Mat A = cv::Mat::ones(3, 3, CV_8UC1);
cv::Mat B = cv::Mat::zeros(3, 3, CV_8UC1);
cv::Mat mean = (A + B) / 2;
std::cout << "A = \n" << A << std::endl;
std::cout << "B = \n" << B << std::endl;
std::cout << "(A+B)/2 = \n" << mean << std::endl;

Steps to reproduce

Write this code:

cv::Mat A = cv::Mat::ones(3, 3, CV_8UC1);
cv::Mat B = cv::Mat::zeros(3, 3, CV_8UC1);
cv::Mat mean = (A + B) / 2;
std::cout << "A = \n" << A << std::endl;
std::cout << "B = \n" << B << std::endl;
std::cout << "(A+B)/2 = \n" << mean << std::endl;

On macOS:

A =
[  1,   1,   1;
   1,   1,   1;
   1,   1,   1]
B =
[  0,   0,   0;
   0,   0,   0;
   0,   0,   0]
(A+B)/2 =
[  1,   1,   1;
   1,   1,   1;
   1,   1,   0]

On Linux:

A =
[  1,   1,   1;
   1,   1,   1;
   1,   1,   1]
B =
[  0,   0,   0;
   0,   0,   0;
   0,   0,   0]
(A+B)/2 =
[  0,   0,   0;
   0,   0,   0;
   0,   0,   0]

Issue submission checklist

I report the issue, it's not a question
I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
I updated to the latest OpenCV version and the issue is still there
There is reproducer code and related data files (videos, images, onnx, etc)

The text was updated successfully, but these errors were encountered:

AtomicVar · 2023-08-16T12:45:25Z

The result on macOS is so weird, why it is all 1s except the last one?

opencv-alalek · 2023-08-16T21:01:34Z

Correct result is "all zeros".
Looks like these is some issue with ARM optimizations on OSX.

Verification: cmake <...> -DCMAKE_BUILD_TYPE=Debug -DCV_DISABLE_OPTIMIZATION=ON

A = 
[  1,   1,   1;
   1,   1,   1;
   1,   1,   1]
B = 
[  0,   0,   0;
   0,   0,   0;
   0,   0,   0]
(A+B)/2 = 
[  0,   0,   0;
   0,   0,   0;
   0,   0,   0]

LaurentBerger · 2023-08-17T07:03:31Z

Windows MSVC 2022 is

A =
[ 1, 1, 1;
1, 1, 1;
1, 1, 1]
B =
[ 0, 0, 0;
0, 0, 0;
0, 0, 0]
(A+B)/2 =
[ 0, 0, 0;
0, 0, 0;
0, 0, 0]

Haosonn · 2023-08-22T14:10:08Z

I worked with @IskXCr together, and found that the initilization of a struct template wAdd may cause the bug.
In ./3rdparty/carotene/src/add_weighted.cpp

template <> struct wAdd<u32>
{
    typedef u32 type;
    f32 alpha, beta, gamma;
    float32x4_t valpha, vbeta, vgamma;
    wAdd(f32 _alpha, f32 _beta, f32 _gamma):
        alpha(_alpha), beta(_beta), gamma(_gamma)
    {
        valpha = vdupq_n_f32(_alpha);
        vbeta = vdupq_n_f32(_beta);
        vgamma = vdupq_n_f32(_gamma + 0.5);
    }

    void operator() (const VecTraits<u32>::vec128 & v_src0,
                     const VecTraits<u32>::vec128 & v_src1,
                     VecTraits<u32>::vec128 & v_dst) const
    {
        float32x4_t vs1 = vcvtq_f32_u32(v_src0);
        float32x4_t vs2 = vcvtq_f32_u32(v_src1);

        vs1 = vmlaq_f32(vgamma, vs1, valpha);
        vs1 = vmlaq_f32(vs1, vs2, vbeta);
        v_dst = vcvtq_u32_f32(vs1);
    }
<..remaining part..>
}

That 0.5 is suspicious.
The first 8 entries are computed by an SIMD approach. Originally, gamma should be 0, but an extra 0.5 is added, causing an addition to v_dst, which, by definition of vmlaq_f32, can be expressed as v_dst = 0.5(vgamma) + 1(vs1) + 0.5(valpha) + 0(vs2) + 0.5(vbeta) = 1(1.5 rounded to 1!)(actually the operands are arrays of 4, we consider it as an float only for convenience).
However, the remaining entry in the matrix is calculated primitively so the result is 0.

It seems that we can simply delete that 0.5, but we think there must be some reasons why 0.5 stays there.

IskXCr · 2023-08-22T16:25:22Z

Me and @Haosonn investigated the call stack while debugging the given example: The problem can be traced back to these recursive template functions.

// 3rdparty/carotene/src/add_weighted.cpp
#define IMPL_ADDWEIGHTED(type)                                \
void addWeighted(const Size2D &size,                          \
                 const type * src0Base, ptrdiff_t src0Stride, \
                 const type * src1Base, ptrdiff_t src1Stride, \
                 type * dstBase, ptrdiff_t dstStride,         \
                 f32 alpha, f32 beta, f32 gamma)              \
{                                                             \
    internal::assertSupportedConfiguration();                 \
    wAdd<type> wgtAdd(alpha,                                  \
                      beta,                                   \
                      gamma);                                 \
    internal::vtransform(size,                                \
                         src0Base, src0Stride,                \
                         src1Base, src1Stride,                \
                         dstBase, dstStride,                  \
                         wgtAdd);                             \
}

The above code snippet is a macro that when expanded computes the weighted sum:

$$ \textbf{r} = \alpha * \textbf{a} + \beta * \textbf{b} + \gamma. $$

When executing function vtransform, the behavior differs between the one for the actual type

// 3rdparty/carotene/src/add_weighted.cpp
void operator() (const T * src0, const T * src1, T * dst) const
{
    dst[0] = saturate_cast<T>(alpha*src0[0] + beta*src1[0] + gamma);
}

and the wider type wAdd<u32>, which, by definition during its initialization, adds $0.5$ to the constant $\gamma$.

wAdd(f32 _alpha, f32 _beta, f32 _gamma):
        alpha(_alpha), beta(_beta), gamma(_gamma)
{
    valpha = vdupq_n_f32(_alpha);
    vbeta = vdupq_n_f32(_beta);
    vgamma = vdupq_n_f32(_gamma + 0.5);
}

This results in inconsistency between results from functions that utilize SIMD and results from direct calculation on the given type.
The same problem also occurs when using type CV_16UC1. Theoretically it should affect any computation that involves this particular template function.

asmorkalov · 2023-09-04T16:23:32Z

May be related to #24213 and fix #24215

asmorkalov · 2023-09-04T16:25:11Z

@ZJUGuoShuai could you build OpenCV from 4.x branch and check if the issue is still relevant.

Haosonn · 2023-09-05T08:44:07Z

May be related to #24213 and fix #24215

This issue still exists after the newest fix #24215

Kumataro · 2023-09-05T13:07:56Z

Hello, I investigate a little more.

opencv/3rdparty/carotene/src/add_weighted.cpp

Lines 156 to 166 in 21fb10c

    
           void operator() (const VecTraits<u32>::vec128 & v_src0, 
        
                            const VecTraits<u32>::vec128 & v_src1, 
        
                            VecTraits<u32>::vec128 & v_dst) const 
        
           { 
        
               float32x4_t vs1 = vcvtq_f32_u32(v_src0); 
        
               float32x4_t vs2 = vcvtq_f32_u32(v_src1); 
        
               vs1 = vmlaq_f32(vgamma, vs1, valpha); 
        
               vs1 = vmlaq_f32(vs1, vs2, vbeta); 
        
               v_dst = vcvtq_u32_f32(vs1); 
        
           }

vcvtq_u32_f32() is for armv7, aarch32 and aarch64. it rounds with FPCR (Floating point control register).We can select round to nearest mode, but cannot select which mode is used from ties to even or ties away from zero.
vcvtnq_u32_f32() is for aarch32 and aarch64. it rounds to nearest with ties to even

If arm7 is not need to care, we can replace instruction (It may works well only armv8+).

However if it supports armv7, it is not easy because the hardware will not assist/support it.

kmtr@ubuntu:~/work/opencv_ram$ git diff -c
diff --git a/3rdparty/carotene/src/add_weighted.cpp b/3rdparty/carotene/src/add_weighted.cpp
index 6559b9fe53..09940a8524 100644
--- a/3rdparty/carotene/src/add_weighted.cpp
+++ b/3rdparty/carotene/src/add_weighted.cpp
@@ -150,7 +150,7 @@ template <> struct wAdd<u32>
     {
         valpha = vdupq_n_f32(_alpha);
         vbeta = vdupq_n_f32(_beta);
-        vgamma = vdupq_n_f32(_gamma + 0.5);
+        vgamma = vdupq_n_f32(_gamma);
     }

     void operator() (const VecTraits<u32>::vec128 & v_src0,
@@ -162,7 +162,7 @@ template <> struct wAdd<u32>

         vs1 = vmlaq_f32(vgamma, vs1, valpha);
         vs1 = vmlaq_f32(vs1, vs2, vbeta);
-        v_dst = vcvtq_u32_f32(vs1);
+        v_dst = vcvtnq_u32_f32(vs1);
     }

     void operator() (const VecTraits<u32>::vec64 & v_src0,

asmorkalov · 2023-09-05T13:44:27Z

armv7 is still more than alive. Its support is required.

asmorkalov · 2023-09-07T10:46:23Z

Confirmed the issue with Jetson NANO (Armv8, linux).

asmorkalov · 2023-09-07T11:00:49Z

Looks like the approach with +0.5 and vdupq_n_f32 is widely used in Carotene. Need to revise it:

hal/tegra_hal.hpp:1442:        (dst_width + 0.5)/inv_scale_x + 0.5 >= src_width && (dst_height + 0.5)/inv_scale_y + 0.5 >= src_height && \
src/phase.cpp:196:            dst[j] = (u8)(s32)floor(a + 0.5f);
src/blur.cpp:161:            f32 val = (prevx + currx + nextx) * (1 / 9.f) + 0.5f;
src/resize.cpp:156:        ofs[x] = static_cast<u32>(floorf((x + 0.5f) * ratio));
src/resize.cpp:173:        size_t src_y = static_cast<size_t>(floorf((dst_y + 0.5f) * hr));
src/resize.cpp:203:                                           (dsize.width + 0.5) * wr >= ssize.width &&
src/resize.cpp:204:                                           (dsize.height + 0.5) * hr >= ssize.height && // Ensure source isn't too big
src/resize.cpp:1587:        buf2[row] = floorf(rweight * (1 << SHIFT_BITS) + 0.5f);
src/resize.cpp:1608:            cwi[k] = (short)floorf((col2[k] - c) * (1 << SHIFT_BITS) + 0.5f);
src/resize.cpp:1771:            cwi[k] = (s16)floorf((col2[k] - c) * (1 << SHIFT_BITS) + 0.5f);
src/resize.cpp:1875:                                           (dsize.width + 0.5) * wr + 0.5 >= ssize.width &&
src/resize.cpp:1876:                                           (dsize.height + 0.5) * hr + 0.5 >= ssize.height && // Ensure source isn't too big
src/resize.cpp:1911:                                           (dsize.width + 0.5) * wr + 0.5 >= ssize.width &&
src/resize.cpp:1912:                                           (dsize.height + 0.5) * hr + 0.5 >= ssize.height && // Ensure source isn't too big
src/canny.cpp:557:            const s32 TG22 = (s32)(0.4142135623730950488016887242097*(1<<CANNY_SHIFT) + 0.5);
src/colorconvert.cpp:1289:    h = ((h * s32((hrange << hsv_shift)/(6.f*diff) + 0.5)) + (1 << (hsv_shift-1))) >> hsv_shift;
src/add_weighted.cpp:109:        vgamma = vdupq_n_f32(_gamma + 0.5);
src/add_weighted.cpp:153:        vgamma = vdupq_n_f32(_gamma + 0.5);
src/add_weighted.cpp:197:        vgamma = vdupq_n_f32(_gamma + 0.5);
src/convert_scale.cpp:141:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:188:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:226:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:273:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:311:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:358:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:395:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:442:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:479:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:529:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:649:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:696:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:734:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:781:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:819:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:866:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:905:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:952:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:991:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1041:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1161:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1193:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1220:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1252:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1279:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1310:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1336:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1367:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1393:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1424:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1501:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1533:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1560:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1592:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1619:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1650:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1676:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1707:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1733:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1764:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1841:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1873:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1899:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1931:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1957:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1988:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2013:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2044:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2069:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2100:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2245:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2275:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2299:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2328:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2351:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2380:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2403:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2432:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,

Kumataro · 2023-09-07T21:27:21Z

Thank you for comment.
This is only trial, however I think this fix will slow down processing.
I'll continue to investigate ...

linaro@linaro-alip:~/work/opencv$ git --no-pager diff -c
diff --git a/3rdparty/carotene/src/add_weighted.cpp b/3rdparty/carotene/src/add_weighted.cpp
index 6559b9fe53..c56f95a4e3 100644
--- a/3rdparty/carotene/src/add_weighted.cpp
+++ b/3rdparty/carotene/src/add_weighted.cpp
@@ -150,7 +150,7 @@ template <> struct wAdd<u32>
     {
         valpha = vdupq_n_f32(_alpha);
         vbeta = vdupq_n_f32(_beta);
-        vgamma = vdupq_n_f32(_gamma + 0.5);
+        vgamma = vdupq_n_f32(_gamma);
     }

     void operator() (const VecTraits<u32>::vec128 & v_src0,
@@ -162,7 +162,7 @@ template <> struct wAdd<u32>

         vs1 = vmlaq_f32(vgamma, vs1, valpha);
         vs1 = vmlaq_f32(vs1, vs2, vbeta);
-        v_dst = vcvtq_u32_f32(vs1);
+        v_dst = round_u32_f32(vs1);
     }

     void operator() (const VecTraits<u32>::vec64 & v_src0,
@@ -174,7 +174,7 @@ template <> struct wAdd<u32>

         vs1 = vmla_f32(vget_low(vgamma), vs1, vget_low(valpha));
         vs1 = vmla_f32(vs1, vs2, vget_low(vbeta));
-        v_dst = vcvt_u32_f32(vs1);
+        v_dst = round_u32_f32(vs1);
     }

     void operator() (const u32 * src0, const u32 * src1, u32 * dst) const
diff --git a/3rdparty/carotene/src/vtransform.hpp b/3rdparty/carotene/src/vtransform.hpp
index 08841a2263..7ae38e95df 100644
--- a/3rdparty/carotene/src/vtransform.hpp
+++ b/3rdparty/carotene/src/vtransform.hpp
@@ -682,6 +682,31 @@ void vtransform(Size2D size,
     }
 }

+inline VecTraits<u32>::vec128 round_u32_f32(const float32x4_t val)
+{
+#if defined(__aarch64__) || defined(__aarch32__)
+    return vcvrnq_u32_f32(val);
+#else // armv7
+#if 1
+    static const float32x4_t f32_v0_5 = vdupq_n_f32(0.5);
+    static const uint32x4_t  u32_v1_0 = vdupq_n_u32(1);
+
+    const uint32x4_t round = vcvtq_u32_f32( vaddq_f32(val, f32_v0_5 ) );
+    const uint32x4_t isOdd = vandq_u32( round, u32_v1_0 );
+    const uint32x4_t isFrac0_5 = vceqq_f32(vsubq_f32(vcvtq_f32_u32(round),val), f32_v0_5 );
+    return vsubq_u32( round, vandq_u32( isOdd, isFrac0_5 ) );
+#else
+    static const float32x4_t f32_v0_5 = vdupq_n_f32(0.5);
+    return vcvtq_u32_f32( vaddq_f32(val, f32_v0_5) );
+#endif
+#endif
+}
+
+inline VecTraits<u32>::vec64 round_u32_f32(const float32x2_t val)
+{
+  return vcvt_u32_f32(val);
+}
+
 } }

 #endif // CAROTENE_NEON

val = 10.5
-> round = 11
-> isOdd = 1
-> isFrac0_5 = 0xffff
-> ret = round - (isOdd and isFrac0_5) = 11 - 1 = 10

val = 11.5
-> round = 12
-> isOdd = 0
-> isFrac0_5 = 0xffff
-> ret = round - (isOdd and isFrac0_5) = 12 - 0 = 12

val = 12.5
-> round = 13
-> isOdd = 1
-> isFrac0_5 = 0xffff
-> ret = round - (isOdd and isFrac0_5) = 13 - 1 = 12

Kumataro · 2023-09-10T13:46:50Z

Hello, I'm sorry. I couldn't finish fixing just this weekend.

My trial code is here, but I will fix/change/refactor it.
And I think v_round() in intrin_neon.hpp at my fix is too much heavy.
I will continue to investigate.

And fixing v_round() are likely to particularly affect the performance of computationally intensive tasks such as DNNs.
I felt that the default setting for ARMv7 should be the same speed priority as the current situation.
( For ARMv8 including aarch32/aarch64, complex instructions are not needed. So there are no effects )

Following is comment to this trouble to investigate.

I was getting wrong results in reference/original code when the input was negative.
v_gamma ( _gamma + 0.5) is not suitable for negative value ( e.g. -4.3).
I think bias sign must be same as value.

Here is OK.

opencv/modules/core/include/opencv2/core/hal/intrin_neon.hpp

Lines 1970 to 1977 in c552e9e

    
           inline v_int32x4 v_round(const v_float32x4& a) 
        
           { 
        
               static const int32x4_t v_sign = vdupq_n_s32(1 << 31), 
        
                   v_05 = vreinterpretq_s32_f32(vdupq_n_f32(0.5f)); 
        
               int32x4_t v_addition = vorrq_s32(v_05, vandq_s32(v_sign, vreinterpretq_s32_f32(a.val))); 
        
               return v_int32x4(vcvtq_s32_f32(vaddq_f32(a.val, vreinterpretq_f32_s32(v_addition)))); 
        
           }

Here is not OK.

opencv/3rdparty/carotene/src/add_weighted.cpp

Lines 106 to 122 in c552e9e

    
           { 
        
               valpha = vdupq_n_f32(_alpha); 
        
               vbeta = vdupq_n_f32(_beta); 
        
               vgamma = vdupq_n_f32(_gamma + 0.5); 
        
           } 
        
           void operator() (const VecTraits<s32>::vec128 & v_src0, 
        
                            const VecTraits<s32>::vec128 & v_src1, 
        
                            VecTraits<s32>::vec128 & v_dst) const 
        
           { 
        
               float32x4_t vs1 = vcvtq_f32_s32(v_src0); 
        
               float32x4_t vs2 = vcvtq_f32_s32(v_src1); 
        
               vs1 = vmlaq_f32(vgamma, vs1, valpha); 
        
               vs1 = vmlaq_f32(vs1, vs2, vbeta); 
        
               v_dst = vcvtq_s32_f32(vs1); 
        
           }

Kumataro · 2023-09-14T12:11:38Z

I create a pull request to fix this problem.

I believe there are no performance effetcs on A32/A64.
However on ARMv7, it slow slightly(or very).
ARMv7-legacy mode is implemented, it works "Round to nearest, ties away from zero".
(In this case, 0.5 rounds to 1)

[Fixed]

src/add_weighted.cpp
src/blur.cpp
src/colorconvert.cpp
src/div.cpp
src/phase.cpp
src/convert_scale.cpp

[Not fixed] I cannot fix it because I cannot verify on tegra device

hal/tegra_hal.hpp

[Not fixed] They are not simple rounding. I believe it is better that they will be fix if there are any issue.

src/resize.cpp
- float32x4_t v_dstf = vmulq_f32(vaddq_f32(v_index, v_05), v_ratio);
src/canny.cpp
- const s32 TG22 = (s32)(0.4142135623730950488016887242097*(1<<CANNY_SHIFT) + 0.5);

Fix to convert float32 to int32/uint32 with rounding to nearest (ties to even). #24271 Fix #24163 ### Pull Request Readiness Checklist See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request - [x] I agree to contribute to the project under Apache 2 License. - [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV - [x] The PR is proposed to the proper branch - [x] There is a reference to the original bug report and related work - [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable Patch to opencv_extra has the same branch name. - [x] The feature is well documented and sample code can be built with the project CMake (carotene is BSD)

AtomicVar added the bug label Aug 16, 2023

opencv-alalek added optimization platform: ios/osx platform: arm ARM boards related issues: RPi, NVIDIA TK/TX, etc labels Aug 16, 2023

asmorkalov assigned vpisarev Aug 18, 2023

Kumataro mentioned this issue Aug 30, 2023

python: support tuple src for cv::add()/subtract()/... #24074

Merged

5 tasks

asmorkalov added this to the 4.9.0 milestone Sep 7, 2023

asmorkalov added the confirmed There is stable reproducer / investigation complete label Sep 7, 2023

asmorkalov changed the title ~~OpenCV gives different result for (1+0)/2 on different platforms~~ Carotene (ARM HAL) uses wrong rounding in some places Sep 7, 2023

asmorkalov assigned asmorkalov and unassigned vpisarev Sep 7, 2023

Kumataro mentioned this issue Sep 14, 2023

fix to convert float32 to int32/uint32 with rounding to nearest (ties to even). #24271

Merged

6 tasks

asmorkalov mentioned this issue Oct 4, 2023

cvtColor with COLOR_YUV2RGB_NV21 get different result on android arm and linux x64 #20432

Closed

4 tasks

asmorkalov closed this as completed in #24271 Dec 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Carotene (ARM HAL) uses wrong rounding in some places #24163

Carotene (ARM HAL) uses wrong rounding in some places #24163

AtomicVar commented Aug 16, 2023 •

edited

AtomicVar commented Aug 16, 2023

opencv-alalek commented Aug 16, 2023

LaurentBerger commented Aug 17, 2023

Haosonn commented Aug 22, 2023 •

edited

IskXCr commented Aug 22, 2023 •

edited

asmorkalov commented Sep 4, 2023

asmorkalov commented Sep 4, 2023

Haosonn commented Sep 5, 2023

Kumataro commented Sep 5, 2023

asmorkalov commented Sep 5, 2023

asmorkalov commented Sep 7, 2023

asmorkalov commented Sep 7, 2023

Kumataro commented Sep 7, 2023 •

edited

Kumataro commented Sep 10, 2023 •

edited

Kumataro commented Sep 14, 2023

Carotene (ARM HAL) uses wrong rounding in some places #24163

Carotene (ARM HAL) uses wrong rounding in some places #24163

Comments

AtomicVar commented Aug 16, 2023 • edited

System Information

Detailed description

Steps to reproduce

Issue submission checklist

AtomicVar commented Aug 16, 2023

opencv-alalek commented Aug 16, 2023

LaurentBerger commented Aug 17, 2023

Haosonn commented Aug 22, 2023 • edited

IskXCr commented Aug 22, 2023 • edited

asmorkalov commented Sep 4, 2023

asmorkalov commented Sep 4, 2023

Haosonn commented Sep 5, 2023

Kumataro commented Sep 5, 2023

asmorkalov commented Sep 5, 2023

asmorkalov commented Sep 7, 2023

asmorkalov commented Sep 7, 2023

Kumataro commented Sep 7, 2023 • edited

Kumataro commented Sep 10, 2023 • edited

Kumataro commented Sep 14, 2023

AtomicVar commented Aug 16, 2023 •

edited

Haosonn commented Aug 22, 2023 •

edited

IskXCr commented Aug 22, 2023 •

edited

Kumataro commented Sep 7, 2023 •

edited

Kumataro commented Sep 10, 2023 •

edited