Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Carotene (ARM HAL) uses wrong rounding in some places #24163

Closed
4 tasks done
AtomicVar opened this issue Aug 16, 2023 · 15 comments · Fixed by #24271
Closed
4 tasks done

Carotene (ARM HAL) uses wrong rounding in some places #24163

AtomicVar opened this issue Aug 16, 2023 · 15 comments · Fixed by #24271
Assignees
Labels
bug confirmed There is stable reproducer / investigation complete optimization platform: arm ARM boards related issues: RPi, NVIDIA TK/TX, etc platform: ios/osx
Milestone

Comments

@AtomicVar
Copy link

AtomicVar commented Aug 16, 2023

System Information

  • OpenCV version: 4.8.0
  • Operating System: macOS 13.3.1 (ARM64) / Ubuntu 20.04.4 (x86_64)
  • Compiler: Apple clang version 14.0.3 / g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

Detailed description

Even this (1+0)/2 gives different result on macOS and Linux:

cv::Mat A = cv::Mat::ones(3, 3, CV_8UC1);
cv::Mat B = cv::Mat::zeros(3, 3, CV_8UC1);
cv::Mat mean = (A + B) / 2;
std::cout << "A = \n" << A << std::endl;
std::cout << "B = \n" << B << std::endl;
std::cout << "(A+B)/2 = \n" << mean << std::endl;

Steps to reproduce

Write this code:

cv::Mat A = cv::Mat::ones(3, 3, CV_8UC1);
cv::Mat B = cv::Mat::zeros(3, 3, CV_8UC1);
cv::Mat mean = (A + B) / 2;
std::cout << "A = \n" << A << std::endl;
std::cout << "B = \n" << B << std::endl;
std::cout << "(A+B)/2 = \n" << mean << std::endl;

On macOS:

A =
[  1,   1,   1;
   1,   1,   1;
   1,   1,   1]
B =
[  0,   0,   0;
   0,   0,   0;
   0,   0,   0]
(A+B)/2 =
[  1,   1,   1;
   1,   1,   1;
   1,   1,   0]

On Linux:

A =
[  1,   1,   1;
   1,   1,   1;
   1,   1,   1]
B =
[  0,   0,   0;
   0,   0,   0;
   0,   0,   0]
(A+B)/2 =
[  0,   0,   0;
   0,   0,   0;
   0,   0,   0]

Issue submission checklist

  • I report the issue, it's not a question
  • I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
  • I updated to the latest OpenCV version and the issue is still there
  • There is reproducer code and related data files (videos, images, onnx, etc)
@AtomicVar AtomicVar added the bug label Aug 16, 2023
@AtomicVar
Copy link
Author

The result on macOS is so weird, why it is all 1s except the last one?

@opencv-alalek opencv-alalek added optimization platform: ios/osx platform: arm ARM boards related issues: RPi, NVIDIA TK/TX, etc labels Aug 16, 2023
@opencv-alalek
Copy link
Contributor

Correct result is "all zeros".
Looks like these is some issue with ARM optimizations on OSX.

Verification: cmake <...> -DCMAKE_BUILD_TYPE=Debug -DCV_DISABLE_OPTIMIZATION=ON

A = 
[  1,   1,   1;
   1,   1,   1;
   1,   1,   1]
B = 
[  0,   0,   0;
   0,   0,   0;
   0,   0,   0]
(A+B)/2 = 
[  0,   0,   0;
   0,   0,   0;
   0,   0,   0]

@LaurentBerger
Copy link
Contributor

Windows MSVC 2022 is

A =
[ 1, 1, 1;
1, 1, 1;
1, 1, 1]
B =
[ 0, 0, 0;
0, 0, 0;
0, 0, 0]
(A+B)/2 =
[ 0, 0, 0;
0, 0, 0;
0, 0, 0]

@Haosonn
Copy link
Contributor

Haosonn commented Aug 22, 2023

I worked with @IskXCr together, and found that the initilization of a struct template wAdd may cause the bug.
In ./3rdparty/carotene/src/add_weighted.cpp

template <> struct wAdd<u32>
{
    typedef u32 type;
    f32 alpha, beta, gamma;
    float32x4_t valpha, vbeta, vgamma;
    wAdd(f32 _alpha, f32 _beta, f32 _gamma):
        alpha(_alpha), beta(_beta), gamma(_gamma)
    {
        valpha = vdupq_n_f32(_alpha);
        vbeta = vdupq_n_f32(_beta);
        vgamma = vdupq_n_f32(_gamma + 0.5);
    }

    void operator() (const VecTraits<u32>::vec128 & v_src0,
                     const VecTraits<u32>::vec128 & v_src1,
                     VecTraits<u32>::vec128 & v_dst) const
    {
        float32x4_t vs1 = vcvtq_f32_u32(v_src0);
        float32x4_t vs2 = vcvtq_f32_u32(v_src1);

        vs1 = vmlaq_f32(vgamma, vs1, valpha);
        vs1 = vmlaq_f32(vs1, vs2, vbeta);
        v_dst = vcvtq_u32_f32(vs1);
    }
<..remaining part..>
}

That 0.5 is suspicious.
The first 8 entries are computed by an SIMD approach. Originally, gamma should be 0, but an extra 0.5 is added, causing an addition to v_dst, which, by definition of vmlaq_f32, can be expressed as v_dst = 0.5(vgamma) + 1(vs1) + 0.5(valpha) + 0(vs2) + 0.5(vbeta) = 1(1.5 rounded to 1!)(actually the operands are arrays of 4, we consider it as an float only for convenience).
However, the remaining entry in the matrix is calculated primitively so the result is 0.

It seems that we can simply delete that 0.5, but we think there must be some reasons why 0.5 stays there.

@IskXCr
Copy link
Contributor

IskXCr commented Aug 22, 2023

Me and @Haosonn investigated the call stack while debugging the given example: The problem can be traced back to these recursive template functions.

// 3rdparty/carotene/src/add_weighted.cpp
#define IMPL_ADDWEIGHTED(type)                                \
void addWeighted(const Size2D &size,                          \
                 const type * src0Base, ptrdiff_t src0Stride, \
                 const type * src1Base, ptrdiff_t src1Stride, \
                 type * dstBase, ptrdiff_t dstStride,         \
                 f32 alpha, f32 beta, f32 gamma)              \
{                                                             \
    internal::assertSupportedConfiguration();                 \
    wAdd<type> wgtAdd(alpha,                                  \
                      beta,                                   \
                      gamma);                                 \
    internal::vtransform(size,                                \
                         src0Base, src0Stride,                \
                         src1Base, src1Stride,                \
                         dstBase, dstStride,                  \
                         wgtAdd);                             \
}

The above code snippet is a macro that when expanded computes the weighted sum:

$$ \textbf{r} = \alpha * \textbf{a} + \beta * \textbf{b} + \gamma. $$

When executing function vtransform, the behavior differs between the one for the actual type

// 3rdparty/carotene/src/add_weighted.cpp
void operator() (const T * src0, const T * src1, T * dst) const
{
    dst[0] = saturate_cast<T>(alpha*src0[0] + beta*src1[0] + gamma);
}

and the wider type wAdd<u32>, which, by definition during its initialization, adds $0.5$ to the constant $\gamma$.

wAdd(f32 _alpha, f32 _beta, f32 _gamma):
        alpha(_alpha), beta(_beta), gamma(_gamma)
{
    valpha = vdupq_n_f32(_alpha);
    vbeta = vdupq_n_f32(_beta);
    vgamma = vdupq_n_f32(_gamma + 0.5);
}

This results in inconsistency between results from functions that utilize SIMD and results from direct calculation on the given type.
The same problem also occurs when using type CV_16UC1. Theoretically it should affect any computation that involves this particular template function.

@asmorkalov
Copy link
Contributor

May be related to #24213 and fix #24215

@asmorkalov
Copy link
Contributor

@ZJUGuoShuai could you build OpenCV from 4.x branch and check if the issue is still relevant.

@Haosonn
Copy link
Contributor

Haosonn commented Sep 5, 2023

May be related to #24213 and fix #24215

This issue still exists after the newest fix #24215

@Kumataro
Copy link
Contributor

Kumataro commented Sep 5, 2023

Hello, I investigate a little more.

void operator() (const VecTraits<u32>::vec128 & v_src0,
const VecTraits<u32>::vec128 & v_src1,
VecTraits<u32>::vec128 & v_dst) const
{
float32x4_t vs1 = vcvtq_f32_u32(v_src0);
float32x4_t vs2 = vcvtq_f32_u32(v_src1);
vs1 = vmlaq_f32(vgamma, vs1, valpha);
vs1 = vmlaq_f32(vs1, vs2, vbeta);
v_dst = vcvtq_u32_f32(vs1);
}

  • vcvtq_u32_f32() is for armv7, aarch32 and aarch64. it rounds with FPCR (Floating point control register).We can select round to nearest mode, but cannot select which mode is used from ties to even or ties away from zero.
  • vcvtnq_u32_f32() is for aarch32 and aarch64. it rounds to nearest with ties to even

If arm7 is not need to care, we can replace instruction (It may works well only armv8+).

However if it supports armv7, it is not easy because the hardware will not assist/support it.

kmtr@ubuntu:~/work/opencv_ram$ git diff -c
diff --git a/3rdparty/carotene/src/add_weighted.cpp b/3rdparty/carotene/src/add_weighted.cpp
index 6559b9fe53..09940a8524 100644
--- a/3rdparty/carotene/src/add_weighted.cpp
+++ b/3rdparty/carotene/src/add_weighted.cpp
@@ -150,7 +150,7 @@ template <> struct wAdd<u32>
     {
         valpha = vdupq_n_f32(_alpha);
         vbeta = vdupq_n_f32(_beta);
-        vgamma = vdupq_n_f32(_gamma + 0.5);
+        vgamma = vdupq_n_f32(_gamma);
     }

     void operator() (const VecTraits<u32>::vec128 & v_src0,
@@ -162,7 +162,7 @@ template <> struct wAdd<u32>

         vs1 = vmlaq_f32(vgamma, vs1, valpha);
         vs1 = vmlaq_f32(vs1, vs2, vbeta);
-        v_dst = vcvtq_u32_f32(vs1);
+        v_dst = vcvtnq_u32_f32(vs1);
     }

     void operator() (const VecTraits<u32>::vec64 & v_src0,

@asmorkalov
Copy link
Contributor

armv7 is still more than alive. Its support is required.

@asmorkalov
Copy link
Contributor

Confirmed the issue with Jetson NANO (Armv8, linux).

@asmorkalov asmorkalov added this to the 4.9.0 milestone Sep 7, 2023
@asmorkalov asmorkalov added the confirmed There is stable reproducer / investigation complete label Sep 7, 2023
@asmorkalov
Copy link
Contributor

Looks like the approach with +0.5 and vdupq_n_f32 is widely used in Carotene. Need to revise it:

hal/tegra_hal.hpp:1442:        (dst_width + 0.5)/inv_scale_x + 0.5 >= src_width && (dst_height + 0.5)/inv_scale_y + 0.5 >= src_height && \
src/phase.cpp:196:            dst[j] = (u8)(s32)floor(a + 0.5f);
src/blur.cpp:161:            f32 val = (prevx + currx + nextx) * (1 / 9.f) + 0.5f;
src/resize.cpp:156:        ofs[x] = static_cast<u32>(floorf((x + 0.5f) * ratio));
src/resize.cpp:173:        size_t src_y = static_cast<size_t>(floorf((dst_y + 0.5f) * hr));
src/resize.cpp:203:                                           (dsize.width + 0.5) * wr >= ssize.width &&
src/resize.cpp:204:                                           (dsize.height + 0.5) * hr >= ssize.height && // Ensure source isn't too big
src/resize.cpp:1587:        buf2[row] = floorf(rweight * (1 << SHIFT_BITS) + 0.5f);
src/resize.cpp:1608:            cwi[k] = (short)floorf((col2[k] - c) * (1 << SHIFT_BITS) + 0.5f);
src/resize.cpp:1771:            cwi[k] = (s16)floorf((col2[k] - c) * (1 << SHIFT_BITS) + 0.5f);
src/resize.cpp:1875:                                           (dsize.width + 0.5) * wr + 0.5 >= ssize.width &&
src/resize.cpp:1876:                                           (dsize.height + 0.5) * hr + 0.5 >= ssize.height && // Ensure source isn't too big
src/resize.cpp:1911:                                           (dsize.width + 0.5) * wr + 0.5 >= ssize.width &&
src/resize.cpp:1912:                                           (dsize.height + 0.5) * hr + 0.5 >= ssize.height && // Ensure source isn't too big
src/canny.cpp:557:            const s32 TG22 = (s32)(0.4142135623730950488016887242097*(1<<CANNY_SHIFT) + 0.5);
src/colorconvert.cpp:1289:    h = ((h * s32((hrange << hsv_shift)/(6.f*diff) + 0.5)) + (1 << (hsv_shift-1))) >> hsv_shift;
src/add_weighted.cpp:109:        vgamma = vdupq_n_f32(_gamma + 0.5);
src/add_weighted.cpp:153:        vgamma = vdupq_n_f32(_gamma + 0.5);
src/add_weighted.cpp:197:        vgamma = vdupq_n_f32(_gamma + 0.5);
src/convert_scale.cpp:141:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:188:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:226:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:273:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:311:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:358:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:395:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:442:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:479:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:529:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:649:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:696:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:734:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:781:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:819:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:866:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:905:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:952:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:991:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1041:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1161:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1193:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1220:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1252:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1279:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1310:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1336:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1367:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1393:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1424:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1501:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1533:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1560:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1592:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1619:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1650:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1676:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1707:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1733:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1764:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1841:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1873:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1899:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1931:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1957:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:1988:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2013:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2044:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2069:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2100:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2245:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2275:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2299:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2328:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2351:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2380:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2403:    register float32x4_t vshift asm ("q1") = vdupq_n_f32((f32)beta + 0.5f);,
src/convert_scale.cpp:2432:    float32x4_t vshift = vdupq_n_f32((f32)beta + 0.5f);,

@asmorkalov asmorkalov changed the title OpenCV gives different result for (1+0)/2 on different platforms Carotene (ARM HAL) uses wrong rounding in some places Sep 7, 2023
@asmorkalov asmorkalov assigned asmorkalov and unassigned vpisarev Sep 7, 2023
@Kumataro
Copy link
Contributor

Kumataro commented Sep 7, 2023

Thank you for comment.
This is only trial, however I think this fix will slow down processing.
I'll continue to investigate ...

linaro@linaro-alip:~/work/opencv$ git --no-pager diff -c
diff --git a/3rdparty/carotene/src/add_weighted.cpp b/3rdparty/carotene/src/add_weighted.cpp
index 6559b9fe53..c56f95a4e3 100644
--- a/3rdparty/carotene/src/add_weighted.cpp
+++ b/3rdparty/carotene/src/add_weighted.cpp
@@ -150,7 +150,7 @@ template <> struct wAdd<u32>
     {
         valpha = vdupq_n_f32(_alpha);
         vbeta = vdupq_n_f32(_beta);
-        vgamma = vdupq_n_f32(_gamma + 0.5);
+        vgamma = vdupq_n_f32(_gamma);
     }

     void operator() (const VecTraits<u32>::vec128 & v_src0,
@@ -162,7 +162,7 @@ template <> struct wAdd<u32>

         vs1 = vmlaq_f32(vgamma, vs1, valpha);
         vs1 = vmlaq_f32(vs1, vs2, vbeta);
-        v_dst = vcvtq_u32_f32(vs1);
+        v_dst = round_u32_f32(vs1);
     }

     void operator() (const VecTraits<u32>::vec64 & v_src0,
@@ -174,7 +174,7 @@ template <> struct wAdd<u32>

         vs1 = vmla_f32(vget_low(vgamma), vs1, vget_low(valpha));
         vs1 = vmla_f32(vs1, vs2, vget_low(vbeta));
-        v_dst = vcvt_u32_f32(vs1);
+        v_dst = round_u32_f32(vs1);
     }

     void operator() (const u32 * src0, const u32 * src1, u32 * dst) const
diff --git a/3rdparty/carotene/src/vtransform.hpp b/3rdparty/carotene/src/vtransform.hpp
index 08841a2263..7ae38e95df 100644
--- a/3rdparty/carotene/src/vtransform.hpp
+++ b/3rdparty/carotene/src/vtransform.hpp
@@ -682,6 +682,31 @@ void vtransform(Size2D size,
     }
 }

+inline VecTraits<u32>::vec128 round_u32_f32(const float32x4_t val)
+{
+#if defined(__aarch64__) || defined(__aarch32__)
+    return vcvrnq_u32_f32(val);
+#else // armv7
+#if 1
+    static const float32x4_t f32_v0_5 = vdupq_n_f32(0.5);
+    static const uint32x4_t  u32_v1_0 = vdupq_n_u32(1);
+
+    const uint32x4_t round = vcvtq_u32_f32( vaddq_f32(val, f32_v0_5 ) );
+    const uint32x4_t isOdd = vandq_u32( round, u32_v1_0 );
+    const uint32x4_t isFrac0_5 = vceqq_f32(vsubq_f32(vcvtq_f32_u32(round),val), f32_v0_5 );
+    return vsubq_u32( round, vandq_u32( isOdd, isFrac0_5 ) );
+#else
+    static const float32x4_t f32_v0_5 = vdupq_n_f32(0.5);
+    return vcvtq_u32_f32( vaddq_f32(val, f32_v0_5) );
+#endif
+#endif
+}
+
+inline VecTraits<u32>::vec64 round_u32_f32(const float32x2_t val)
+{
+  return vcvt_u32_f32(val);
+}
+
 } }

 #endif // CAROTENE_NEON
val = 10.5
-> round = 11
-> isOdd = 1
-> isFrac0_5 = 0xffff
-> ret = round - (isOdd and isFrac0_5) = 11 - 1 = 10

val = 11.5
-> round = 12
-> isOdd = 0
-> isFrac0_5 = 0xffff
-> ret = round - (isOdd and isFrac0_5) = 12 - 0 = 12

val = 12.5
-> round = 13
-> isOdd = 1
-> isFrac0_5 = 0xffff
-> ret = round - (isOdd and isFrac0_5) = 13 - 1 = 12

@Kumataro
Copy link
Contributor

Kumataro commented Sep 10, 2023

Hello, I'm sorry. I couldn't finish fixing just this weekend.

My trial code is here, but I will fix/change/refactor it.
And I think v_round() in intrin_neon.hpp at my fix is too much heavy.
I will continue to investigate.

And fixing v_round() are likely to particularly affect the performance of computationally intensive tasks such as DNNs.
I felt that the default setting for ARMv7 should be the same speed priority as the current situation.
( For ARMv8 including aarch32/aarch64, complex instructions are not needed. So there are no effects )


Following is comment to this trouble to investigate.

I was getting wrong results in reference/original code when the input was negative.
v_gamma ( _gamma + 0.5) is not suitable for negative value ( e.g. -4.3).
I think bias sign must be same as value.

Here is OK.

inline v_int32x4 v_round(const v_float32x4& a)
{
static const int32x4_t v_sign = vdupq_n_s32(1 << 31),
v_05 = vreinterpretq_s32_f32(vdupq_n_f32(0.5f));
int32x4_t v_addition = vorrq_s32(v_05, vandq_s32(v_sign, vreinterpretq_s32_f32(a.val)));
return v_int32x4(vcvtq_s32_f32(vaddq_f32(a.val, vreinterpretq_f32_s32(v_addition))));
}

Here is not OK.

{
valpha = vdupq_n_f32(_alpha);
vbeta = vdupq_n_f32(_beta);
vgamma = vdupq_n_f32(_gamma + 0.5);
}
void operator() (const VecTraits<s32>::vec128 & v_src0,
const VecTraits<s32>::vec128 & v_src1,
VecTraits<s32>::vec128 & v_dst) const
{
float32x4_t vs1 = vcvtq_f32_s32(v_src0);
float32x4_t vs2 = vcvtq_f32_s32(v_src1);
vs1 = vmlaq_f32(vgamma, vs1, valpha);
vs1 = vmlaq_f32(vs1, vs2, vbeta);
v_dst = vcvtq_s32_f32(vs1);
}

@Kumataro
Copy link
Contributor

I create a pull request to fix this problem.

I believe there are no performance effetcs on A32/A64.
However on ARMv7, it slow slightly(or very).
ARMv7-legacy mode is implemented, it works "Round to nearest, ties away from zero".
(In this case, 0.5 rounds to 1)

[Fixed]

  • src/add_weighted.cpp
  • src/blur.cpp
  • src/colorconvert.cpp
  • src/div.cpp
  • src/phase.cpp
  • src/convert_scale.cpp

[Not fixed] I cannot fix it because I cannot verify on tegra device

  • hal/tegra_hal.hpp

[Not fixed] They are not simple rounding. I believe it is better that they will be fix if there are any issue.

  • src/resize.cpp
    • float32x4_t v_dstf = vmulq_f32(vaddq_f32(v_index, v_05), v_ratio);
  • src/canny.cpp
    • const s32 TG22 = (s32)(0.4142135623730950488016887242097*(1<<CANNY_SHIFT) + 0.5);

asmorkalov pushed a commit that referenced this issue Dec 25, 2023
Fix to convert float32 to int32/uint32 with rounding to nearest (ties to even). #24271

Fix #24163

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [x] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake

(carotene is BSD)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug confirmed There is stable reproducer / investigation complete optimization platform: arm ARM boards related issues: RPi, NVIDIA TK/TX, etc platform: ios/osx
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants