Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the accuracy of minps, maxps, rcp, rsqrt in NEON #24

Closed
syoyo opened this issue Mar 31, 2020 · 16 comments
Closed

Improve the accuracy of minps, maxps, rcp, rsqrt in NEON #24

syoyo opened this issue Mar 31, 2020 · 16 comments
Labels
enhancement New feature or request

Comments

@syoyo
Copy link
Contributor

syoyo commented Mar 31, 2020

ARM NEON's rcp estimate and rsqt estimate has less accuracy than corresponding SSE2 ops.

https://qiita.com/sanmanyannyan/items/62bb5ce6ada975a7106a

We need to increase the iteration of NewtonRaphson steps(2 or 3 times more iterations) to get the same level of the accuracy of rcp, rsqrt in SSE2 code path (estimate + one round of NewtonRaphson)

Currently we use use 2 iterations for NEON code path(vrcpsq_f32, vrsqrtsq_f32)

float32x4_t reciprocal = vrecpeq_f32(a);

Relates hole issue in #20

@syoyo syoyo added the enhancement New feature or request label Mar 31, 2020
@syoyo
Copy link
Contributor Author

syoyo commented Mar 31, 2020

@maikschulze Could you please try adding more NewtonRaphson iterations and check if it will solve hole/artifact issues in curve and displacement test scene?

@maikschulze
Copy link
Contributor

Hi @syoyo ,
thank you very for filing this bug. I haven't had the time yet to test the influence of different amounts of Newton-Raphson iterations yet. Generally, most holes in the curve have disappeared with 2 iterations in NEON and the formula by Intel -> 3 in total.

However, I could not get rid off one particular hole even with 8 iterations for this ray:

RTCIntersectContext context;
rtcInitIntersectContext(&context);

RTCRayHit rayhitdebug;
rayhitdebug.ray.org_x = -0.000000476837158;
rayhitdebug.ray.org_y = 4.99999952;
rayhitdebug.ray.org_z = -7.07106781;

rayhitdebug.ray.dir_x = 0.635353327;
rayhitdebug.ray.dir_y = -0.5503245;
rayhitdebug.ray.dir_z = 0.541727901;

rayhitdebug.ray.time = 0.0;
rayhitdebug.ray.tnear = 0.0;
rayhitdebug.ray.tfar = 100000.0;
rayhitdebug.ray.mask = -1;
rayhitdebug.ray.id = 0;
rayhitdebug.ray.flags = 0;

rtcIntersect1(g_scene, &context, &rayhitdebug);

This ray is a bit special because it hits the curve of the interpolation tutorial at (u=0.860379398, v =0). I've then started debugging this on my smartphone, but ran into weird seg faults again in the function intersect_bezier_recursive_jacobian. After carefully debugging the difference, I believe I've found one important difference that is independent of numerical precision.

The CylinderN::intersect function creates NaN values with sqrt(D) along with a mask that marks then as non-valid.

It seems that this block 'repairs' the NaN values:

/* clamp and correct u parameter */
      u_outer0 = clamp(u_outer0,vfloatx(0.0f),vfloatx(1.0f));
      u_outer1 = clamp(u_outer1,vfloatx(0.0f),vfloatx(1.0f));
      u_outer0 = lerp(u0,u1,(vfloatx(step)+u_outer0)*(1.0f/float(vfloatx::size)));
      u_outer1 = lerp(u0,u1,(vfloatx(step)+u_outer1)*(1.0f/float(vfloatx::size)));

on x64:

u_outer0_before = {-nan(ind), 1.61889660, 0.594353378, 0.295167416}
u_outer1_before = {-nan(ind), 1.50396919, 0.380308062, 1.70483279}

u_outer0_after  = {1.00000000, 1.00000000, 0.594353378, 0.295167416}
u_outer1_after  = {1.00000000, 1.00000000, 0.380308062, 1.00000000}

but not on aarch64:

u_outer0_before = {NaN, 1.61888874, 0.594353676, 0.295167357}
u_outer1_before = {NaN, 1.50397587, 0.380308151, 1.70483422}

u_outer0_after  ={NaN, 1, 0.594353676, 0.295167357}
u_outer1_after  = {NaN, 1, 0.380308151, 1}

I will try to solve this and continue from there. It's a bit tedious for my to debug this in Android Studio.

@syoyo
Copy link
Contributor Author

syoyo commented Mar 31, 2020

@maikschulze Good catch!

As you may know, min/max in SSE2 correctly(?) handles NaN

https://stackoverflow.com/questions/40196817/what-is-the-instruction-that-gives-branchless-fp-min-and-max-on-x86

whereas ARM NEON's vmin/vmax is not.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/CIHDEEBE.html

it returns NaN when any input of vmin/vmax is NaN.

Fortunately, from ARMv8(or AARCH64), there is vminnm and vmaxnm intrinsics

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/CIHFCJCF.html

which correctly handles NaN according to IEEE754-2008. So, replacing vmaxq_f32 and vminq_f32 with vmaxnmq_f32 and vminnmq_f32 respectively should solve the inconsistency of clamp() behavior.

It's a bit tedious for my to debug this in Android Studio.

If you can prepare minimal and reproducible code(e.g. like this https://github.com/lighttransport/embree-aarch64/tree/master/test/curve), I can run the code on AARCH64 linux(Jetson AGX).

@maikschulze
Copy link
Contributor

@syoyo , you're amazing. I quickly addedvminnmq_f32 and vmaxnmq_f32 into SSE2NEON.h and there's no more hole. Nan's are now properly replaced in the curve algorithm. I will test more and create a submit.

Thank you very much!

I hope these debugging sessions will get easier once I've create a public test collection. I have various ideas on making it precise and convenient for everyone given my learning experience.

@syoyo
Copy link
Contributor Author

syoyo commented Mar 31, 2020

@maikschulze 🎉 Yeah, PR is very appreciated.

@maikschulze
Copy link
Contributor

Hi, my PR will take a bit longer.
While the curve intersections now seem fixed with vminnmq_f32 and vmaxnmq_f32 , my hair intersection test crashes with an assert in bvh_traverser1.h
assert(c0 != BVH::emptyNode);
I will have to understand and fix this first.

@syoyo syoyo changed the title Improve the accuracy of rcp, rsqrt in NEON Improve the accuracy of minps, maxps, rcp, rsqrt in NEON Jul 6, 2020
@syoyo
Copy link
Contributor Author

syoyo commented Jul 6, 2020

@maikschulze I have implemented correct(as far as I've tested) implementation of _mm_min_ps and _mm_max_ps in aarch64-v3.11.0 branch: e4a2f68

It should give more consistent result with x86-64 result.

Could you try aarch64-v3.11.0 branch? It also have some improvements for curves(hairs) from original intel embree v3.11.0.

@maikschulze
Copy link
Contributor

maikschulze commented Jul 7, 2020

Hi @syoyo,

I have pulled the latest state 7eb0b58 and created test results for windows-x64 and android-arm64.

Unfortunately, the infamous hole is still present in the curve for arm64 (right), whereas x64 (left) has no such issue:
image

I have not debugged into the setup. The artifact looks similar to the discovered lack of 'NaN repair' in function CylinderN::intersect in my test scene (interpolation tutorial).

@syoyo
Copy link
Contributor Author

syoyo commented Jul 8, 2020

@maikschulze Thanks!

There is still a bug we need to hunt.

I have added test/numric test in recent aarch64-v3.11.0 branch https://github.com/lighttransport/embree-aarch64/tree/aarch64-v3.11.0/test/numeric

and confirmed at least NaN repair now works fine on aarch64.

void issue_24()
{
  typedef embree::vfloat4 vfloatx;

  vfloatx a(-std::numeric_limits<float>::signaling_NaN());

  vfloatx b = clamp(a, vfloatx(0.0f), vfloatx(1.0f));

  printf("a.x = %f\n", a[0]);
  printf("b.x = %f\n", b[0]); // expects 1.0

  vfloatx c(-std::numeric_limits<float>::quiet_NaN());

  vfloatx d = clamp(c, vfloatx(0.0f), vfloatx(1.0f));

  printf("c.x = %f\n", c[0]);
  printf("d.x = %f\n", d[0]); // expects 1.0
}
a.x = -nan
b.x = 1.000000
c.x = -nan
d.x = 1.000000

@syoyo
Copy link
Contributor Author

syoyo commented Jul 20, 2020

@maikschulze I think I have fixed the hole issue for curve(tube) geometry.

NEON implementation of _mm_div_ps and _mm_rsqrt_ps requires one(or two) more Newton-Raphson iteration than SSE2 to get the same level of the accuracy of SSE2.

Could you try aarch64-v3.11.0-rcp-improve branch? https://github.com/lighttransport/embree-aarch64/tree/aarch64-v3.11.0-rcp-improve

This branch uses 4 Newton-Raphson iteration for div and rsqrt.
(4 may be overkill and we may be able to reduce it to 3)

At least I could now confirm apparently no holes onto the curve(tube) geometry anymore for interpolation tutorial. 🎉

@maikschulze
Copy link
Contributor

Hi @syoyo , I've tested a69200d on Android aarch64 and cannot observe any improvement, unfortunately. The hole in the curve remains. I've then tried to increase the amount of iterations further in various related functions, but could not observe any change.

For sanity: is there any special compile flag that should be used? Can you confirm that you do not get such a hole in release mode?

@syoyo
Copy link
Contributor Author

syoyo commented Jul 21, 2020

@maikschulze Sorry, there was a missing commit. Please try this commit(commited to aarch64-v3.11.0-rcp-improvement branch): a435e0e

FYI, I am running interpolation on Jetson AGX.

Here is the cmake configuration and CXX flags. I am using clang-10.

$CMAKE_BIN \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DEMBREE_ARM=On \
  -DEMBREE_ADDRESS_SANITIZER=Off \
  -DCMAKE_INSTALL_PREFIX=$HOME/local/embree3 \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DEMBREE_ISPC_SUPPORT=Off \
  -DEMBREE_TASKING_SYSTEM=Internal \
  -DEMBREE_TUTORIALS=On \
  -DEMBREE_TUTORIALS_GLFW=On \
  -DEMBREE_MAX_ISA=SSE2 \
  -DEMBREE_RAY_PACKETS=Off \
  ..
...
/home/syoyo/local/clang+llvm-10.0.0-aarch64-linux-gnu/bin/clang++  -DEMBREE_TARGET_SSE2 -DTASKING_INTERNAL  -Wall -Wformat -Wformat-security -fPIE -fPIC -std=c++11 -fvisibility=hidden -fvisibility-inlines-hidden -fno-strict-aliasing -fno-tree-vectorize -D_FORTIFY_SOURCE=2  -g -DNDEBUG -O3     -o CMakeFiles/algorithms.dir/parallel_for.cpp.o -c /home/syoyo/work/embree-aarch64/common/algorithms/parallel_for.cpp
...

@maikschulze
Copy link
Contributor

Hi @syoyo , unfortunately the hole is still here. I've tested your latest code on Android arm64 and iOS arm64. The results there are identical and show the hole ;(

@syoyo
Copy link
Contributor Author

syoyo commented Jul 21, 2020

@maikschulze I see. Could you please open another issue with minimal reproducible scene data and code(or view configuration for interpolation)? The issue may be related to OS or compiler.

And I'll close this since I've already fixed(improved) min/max and rcp/rsqrt accuracy in aarch64-v3.11.0-rcp-improve branch.

@syoyo syoyo closed this as completed Jul 21, 2020
@maikschulze
Copy link
Contributor

@syoyo , I'm optimistic that I have identified the issue being caused by sqrt(0)-> +inf.

After a coffee break I will test further and create an appropriate issue with repro and pull request.

@syoyo
Copy link
Contributor Author

syoyo commented Jul 22, 2020

@maikschulze Super awesome! BTW, I have merged aarch64-v3.11.0-rcp-improve to master, so please use master branch when you create a pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants