Optimize IM_NORMALIZE2F_OVER_ZERO #4091

Status: Open · wants to merge 5 commits into master
Conversation

@wolfpld (Contributor) commented Apr 30, 2021

Currently Dear ImGui uses 1/sqrt() in IM_NORMALIZE2F_OVER_ZERO(). This produces the following instruction sequence in ImDrawList::AddPolyline():

[image: profiler disassembly of AddPolyline() showing vsqrtss and vdivss]

Notice that vsqrtss and vdivss both have high sample hit counts (visible in the dependent instructions). This is caused by the high latency of both instructions:

[image: sqrt instruction latency table]

This PR introduces a new function, ImRsqrt(), which has an alternate implementation when SSE1 is available. (Note that SSE2 support is mandated by the x64 architecture.) With these changes the following assembly is generated:

[image: profiler disassembly of AddPolyline() using vrsqrtss]

With the new code only one instruction is emitted, vrsqrtss, which produces an approximate result; the reduced precision shouldn't matter in this case. Notice that the relative cost of calling IM_NORMALIZE2F_OVER_ZERO() has dropped significantly. This is caused by the much lower latency of vrsqrtss across all uarchs:

[image: rsqrt instruction latency table]

ARM NEON has a similar instruction, FRSQRTE, which may or may not be beneficial to use.
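
A minimal sketch of the helper described above (the feature-guard macro and exact form here are illustrative and the merged code may differ; MSVC would need a check such as _M_X64 instead of __SSE__):

    #ifdef __SSE__
    #include <immintrin.h>   // _mm_rsqrt_ss() and friends
    // Approximate 1/sqrt(x) with a single (v)rsqrtss instruction.
    static inline float ImRsqrt(float x) { return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x))); }
    #else
    #include <math.h>        // sqrtf() for the fallback path
    static inline float ImRsqrt(float x) { return 1.0f / sqrtf(x); }
    #endif

    // IM_NORMALIZE2F_OVER_ZERO then normalizes through the helper:
    #define IM_NORMALIZE2F_OVER_ZERO(VX, VY) \
        do { float d2 = (VX)*(VX) + (VY)*(VY); \
             if (d2 > 0.0f) { float inv_len = ImRsqrt(d2); (VX) *= inv_len; (VY) *= inv_len; } } while (0)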

For context, ImDrawList::AddPolyline() can, under certain conditions, be the hottest function in Tracy.

I'm not sure if including immintrin.h is the right solution. It works for me, but it may be problematic on older compilers, in which case a more targeted header inclusion should be used.

@ocornut (Owner) commented May 6, 2021

Thanks @wolfpld, we'll try to merge this soon (currently in the middle of construction work).
No extra feedback needed for you, writing things down for myself:

I used this as another occasion to test the perf tools @rokups and I have been working on. Pardon the large tangent, but any occasion to test those tools is good for us, as our process of generating/using the data is still a little tedious (working on it; ideally, one command should be able to compare two branches across multiple builds and runs).

Average of 8 runs over a dozen tests:

[image: average of 8 runs over a dozen tests]

Individual runs:

[image: individual runs]

I noticed things are slower in Debug:

[image: Debug comparison]

Release X64

[image: Release X64 results]

Debug X64

[images: Debug X64 results]

That's fine for me, as I am going to take the occasion to enable some of the "reduce aggressive stack bound checks" macros around some key low-level functions.

Gets us more toward those: (green is Debug X64 with patch + some pragma to reduce stack bound checks on a few functions)

[image: Debug X64 with patch + reduced stack bound checks]
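
For illustration, the kind of per-function pragma referred to here might look like this on MSVC (a sketch; the actual macros used in the codebase may differ):

    // Disable MSVC runtime stack checks (/RTC) around a hot low-level function,
    // then restore the project defaults. Must appear at file scope.
    #ifdef _MSC_VER
    #pragma runtime_checks("", off)
    #endif
    static void HotLowLevelFunction(/* ... */) { /* hot loop here */ }
    #ifdef _MSC_VER
    #pragma runtime_checks("", restore)
    #endif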

@wolfpld (Contributor, Author) commented May 6, 2021

Pardon the large tangent, but any occasion to test those tools is good for us

I think these measurements are very interesting and appropriate to talk about here.

I noticed things are slower in Debug:

This is understandable, given that MSVC instruments the code for use in the debugger. In the assembly output you can see multiple instances of seemingly nonsensical behavior, such as:

        movss   dword ptr [rbp+114h], xmm0
        movss   xmm0, dword ptr [rbp+114h]

This sequence stores the xmm0 register on the stack, only to immediately load the value back into the same register. This enables the user to modify values held in variables when a breakpoint is hit. On the other hand, it also introduces FPU pipeline stalls.

(currently in the middle of construction work)

I have another similar optimization queued up: wolfpld/tracy@0bd6479
Would you like me to include this change in this PR (given that it is taking some time to merge), or do you want it in a second PR?

@ocornut (Owner) commented May 6, 2021 via email

Commit pushed to the PR branch: "This function calculates the result of 1/x. On SSE1-capable platforms this is performed as an approximation, using rcpps."
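
Based on that commit message, the helper presumably takes a form along these lines (a sketch, not the exact committed code):

    #ifdef __SSE__
    #include <immintrin.h>   // _mm_rcp_ss() and friends
    // Approximate 1/x with a single rcpss instruction.
    static inline float ImRecip(float x) { return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x))); }
    #else
    static inline float ImRecip(float x) { return 1.0f / x; }
    #endif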
@wolfpld (Contributor, Author) commented May 6, 2021

Ok, added new changes and rebased to current master.

ocornut pushed a commit that referenced this pull request May 6, 2021
@ocornut (Owner) commented May 6, 2021

I merged the first part as 4c9f0ce (with minor amends).

First part in red, second part (ImRecip) in green:

Release X64
[image: Release X64 results]

Debug X64
[image: Debug X64 results]

In Debug the overhead of the non-inlined function call sort of hinders it for now. I'll investigate using macros later.

@ocornut (Owner) commented May 6, 2021

I tried various variants (with/without macros) and I couldn't find a satisfying setup where the ImRecip change seems clearly worthwhile. It's a very minor gain in "Release" mode and often a more noticeable loss in "Debug" mode.

Did you measure meaningful changes when doing the ImRecip() change on your side?

@wolfpld (Contributor, Author) commented May 6, 2021

This is the 1.0f / x version:
[image: profiler view of the 1.0f / x version]

And here is the optimized one:
[image: profiler view of the optimized version]

This is in the context of time spent in AddPolyline(). The cause can be seen in the instruction latencies:

[images: instruction latency tables]

@wolfpld (Contributor, Author) commented May 6, 2021

BTW, I have measured that functions such as AddLine() spend a significant amount of time putting things into the arrays passed to AddPolyline(). I removed this cost with a simple workaround: wolfpld/tracy@fe22d5a, which elides a copy (with force inline, the ImVec2 data is stored only in its final location). This solution doesn't need a check for array capacity, because the array size is known, but it has slightly different semantics than AddLine() and co.:

  • This will draw a single line, instead of finishing an already started path with a line.
  • The user is expected to add (0.5, 0.5) to the coordinates passed to the function, so that pixels are properly centered. In my case this is preferable, as I do a lot of the following:
    const auto wpos = ImGui::GetWindowPos();
    ...
    AddLine( wpos + v1, wpos + v2 );

So I can just add (0.5, 0.5) to wpos during initialization.
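
A sketch of what such a copy-eliding helper can look like (names are illustrative and this is not the actual Tracy code; it mirrors the non-anti-aliased quad emission that AddPolyline() performs internally):

    #include "imgui.h"
    #include "imgui_internal.h"   // ImDrawListSharedData, for TexUvWhitePixel

    // Draw a single standalone line, writing the quad directly and eliding the
    // temporary point array AddPolyline() would consume. Coordinates are
    // expected to already include the (0.5f, 0.5f) pixel-center offset.
    static void DrawLineDirect(ImDrawList* draw, const ImVec2& p1, const ImVec2& p2,
                               ImU32 col, float thickness)
    {
        float dx = p2.x - p1.x, dy = p2.y - p1.y;
        float d2 = dx * dx + dy * dy;
        if (d2 > 0.0f) { float inv_len = ImRsqrt(d2); dx *= inv_len; dy *= inv_len; }
        dx *= thickness * 0.5f;   // scale the unit direction to half-thickness;
        dy *= thickness * 0.5f;   // (dy, -dx) is then the perpendicular offset
        const ImVec2 uv = draw->_Data->TexUvWhitePixel;
        draw->PrimReserve(6, 4);  // one quad: 6 indices, 4 vertices (array size is known, no capacity check)
        draw->PrimQuadUV(ImVec2(p1.x + dy, p1.y - dx), ImVec2(p2.x + dy, p2.y - dx),
                         ImVec2(p2.x - dy, p2.y + dx), ImVec2(p1.x - dy, p1.y + dx),
                         uv, uv, uv, uv, col);
    }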

@metarutaiga commented
You can use the fast inverse square root if SSE is not available.

rsqrtss is a lookup-table instruction: a low-accuracy (16-bit or 20-bit) inverse square root.
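
For the record, the bit trick being referenced looks like this (the function name is illustrative; one Newton-Raphson step refines the initial estimate):

    #include <string.h>   // memcpy()

    // "Fast inverse square root" (Quake III style), no SSE required.
    static inline float FastRsqrt(float x)
    {
        float half = 0.5f * x;
        unsigned int i;
        memcpy(&i, &x, sizeof(i));        // reinterpret the float bits as an integer
        i = 0x5F3759DF - (i >> 1);        // magic-constant initial estimate
        memcpy(&x, &i, sizeof(x));
        x = x * (1.5f - half * x * x);    // one Newton-Raphson step
        return x;
    }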

@rokups (Contributor) commented Jun 3, 2021

I took a peek at the ImRecip change, including using it as a macro instead of an inline function.

Release builds benefit a bit from it; the effect of using a macro instead of an inline function can be written off as statistically insignificant.
[image: Release build comparison]

Debug builds, however, still suffer more than they should. Using a macro does not help much.
[image: Debug build comparison]
This is probably because debug builds generate more instructions when SSE intrinsics are used. Tested on GCC x64 with -O0 (https://godbolt.org/z/Y1Exr413j).

#define ImRecip(x) (1.0f / x):

        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        movss   xmm0, DWORD PTR .LC0[rip]
        movss   DWORD PTR [rbp-4], xmm0
        movss   xmm1, DWORD PTR [rbp-4]
        movss   xmm0, DWORD PTR .LC1[rip]
        divss   xmm0, xmm1
        pxor    xmm2, xmm2
        cvtss2sd        xmm2, xmm0
        movq    rax, xmm2
        movq    xmm0, rax
        mov     edi, OFFSET FLAT:.LC2
        mov     eax, 1
        call    printf
        mov     eax, 0
        leave
        ret

#define ImRecip(x) _mm_cvtss_f32(_mm_rcp_ps(_mm_set_ss(x))):

        push    rbp
        mov     rbp, rsp
        sub     rsp, 48
        movss   xmm0, DWORD PTR .LC0[rip]
        movss   DWORD PTR [rbp-40], xmm0
        movss   xmm0, DWORD PTR [rbp-40]
        movss   DWORD PTR [rbp-36], xmm0
        mov     eax, DWORD PTR [rbp-36]
        movd    xmm0, eax
        movaps  XMMWORD PTR [rbp-32], xmm0
        rcpps   xmm0, XMMWORD PTR [rbp-32]
        nop
        movaps  XMMWORD PTR [rbp-16], xmm0
        movss   xmm0, DWORD PTR [rbp-16]
        pxor    xmm1, xmm1
        cvtss2sd        xmm1, xmm0
        movq    rax, xmm1
        movq    xmm0, rax
        mov     edi, OFFSET FLAT:.LC1
        mov     eax, 1
        call    printf
        mov     eax, 0
        leave
        ret

Conclusion: SSE for release builds and a macro with a simple 1.0f / x for debug builds would be the best of both worlds.
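
In code, that setup might look like the following (a sketch; the build-detection macros are illustrative):

    // Approximate reciprocal in optimized builds, exact division in debug builds.
    #if defined(__SSE__) && defined(NDEBUG)
    #include <immintrin.h>
    #define ImRecip(x)  _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)))
    #else
    #define ImRecip(x)  (1.0f / (x))
    #endif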

@wolfpld (Contributor, Author) commented Jun 3, 2021

Conclusion: SSE for release builds and a macro with a simple 1.0f / x for debug builds would be the best of both worlds.

The SSE version has lower precision than division, so the obtained results will differ between release and debug builds. This can potentially turn someone's debugging session into an exercise in frustration, should the difference not be known.

@sergeyn (Contributor) commented Mar 19, 2022

Just my 2 cents here: the relative error of rcpps is quite high, which will become a problem on high-resolution displays (like 5K ones). I'd suggest adding one Newton-Raphson iteration to increase accuracy.
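
A sketch of that suggestion: one Newton-Raphson step, y' = y * (2 - x*y), on top of the rcpss estimate roughly doubles the number of accurate bits (the function name is illustrative):

    #include <immintrin.h>

    // Reciprocal: refine the ~12-bit rcpss approximation with one Newton-Raphson iteration.
    static inline float ImRecipNR(float x)
    {
        __m128 vx = _mm_set_ss(x);
        __m128 y  = _mm_rcp_ss(vx);                                  // initial approximation
        __m128 t  = _mm_sub_ss(_mm_set_ss(2.0f), _mm_mul_ss(vx, y)); // (2 - x*y)
        return _mm_cvtss_f32(_mm_mul_ss(y, t));                      // y * (2 - x*y)
    }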
