Optimize IM_NORMALIZE2F_OVER_ZERO #4091

Status: Open · wants to merge 5 commits into master
Conversation

@wolfpld (Contributor) commented Apr 30, 2021

Currently Dear ImGui uses 1/sqrt() in IM_NORMALIZE2F_OVER_ZERO(). This produces the following instruction sequence in ImDrawList::AddPolyline():

[image: profiler disassembly of AddPolyline() showing vsqrtss and vdivss]

Notice that vsqrtss and vdivss both have high sample hit counts (visible in the dependent instructions). This is caused by the high latency of both instructions:

[image: sqrt instruction latency table]

This PR introduces a new function, ImRsqrt(), which has an alternate implementation when SSE1 is available. (Note that SSE2 support is mandated by the x64 architecture.) With these changes the following assembly is generated:

[image: profiler disassembly of AddPolyline() using vrsqrtss]

With the new code only one instruction is emitted, vrsqrtss, which produces an approximate result; the reduced precision shouldn't matter in this case. Notice that the relative cost of calling IM_NORMALIZE2F_OVER_ZERO() has dropped significantly. This is caused by the much lower latency of vrsqrtss across all uarchs:

[image: rsqrt instruction latency table]

ARM NEON has a similar instruction, FRSQRTE, which may or may not be beneficial to use.
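
A minimal sketch of the helper described above (the feature-guard macro and exact form here are illustrative and the merged code may differ; MSVC would need a check such as _M_X64 instead of __SSE__):

    #ifdef __SSE__
    #include <immintrin.h>   // _mm_rsqrt_ss() and friends
    // Approximate 1/sqrt(x) with a single (v)rsqrtss instruction.
    static inline float ImRsqrt(float x) { return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x))); }
    #else
    #include <math.h>        // sqrtf() for the fallback path
    static inline float ImRsqrt(float x) { return 1.0f / sqrtf(x); }
    #endif

    // IM_NORMALIZE2F_OVER_ZERO then normalizes through the helper:
    #define IM_NORMALIZE2F_OVER_ZERO(VX, VY) \
        do { float d2 = (VX)*(VX) + (VY)*(VY); \
             if (d2 > 0.0f) { float inv_len = ImRsqrt(d2); (VX) *= inv_len; (VY) *= inv_len; } } while (0)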

For context, ImDrawList::AddPolyline() can, under certain conditions, be the hottest function in Tracy.

I'm not sure if including immintrin.h is the right solution. It works for me, but it may be problematic on older compilers, in which case a more targeted header inclusion should be used.

@ocornut (Owner) commented May 6, 2021

Thanks @wolfpld, we'll try to merge this soon (currently in the middle of construction work).
No extra feedback needed for you, writing things down for myself:

I used this as another occasion to test the perf tools @rokups and I have been working on. Pardon the large tangent, but any occasion to test those tools is good for us, as our process of generating/using the data is still a little tedious (working on it; ideally, one command should be able to compare two branches across multiple builds and runs).

Average of 8 runs over a dozen tests:

[image: average of 8 runs over a dozen tests]

Individual runs:

[image: individual runs]

I noticed things are slower in Debug:

[image: Debug comparison]

Release X64

[image: Release X64 results]

Debug X64

[images: Debug X64 results]

That's fine for me, as I am going to take the occasion to enable some of the "reduce aggressive stack bound checks" macros around some key low-level functions.

Gets us more toward those: (green is Debug X64 with patch + some pragma to reduce stack bound checks on a few functions)

[image: Debug X64 with patch + reduced stack bound checks]
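
For illustration, the kind of per-function pragma referred to here might look like this on MSVC (a sketch; the actual macros used in the codebase may differ):

    // Disable MSVC runtime stack checks (/RTC) around a hot low-level function,
    // then restore the project defaults. Must appear at file scope.
    #ifdef _MSC_VER
    #pragma runtime_checks("", off)
    #endif
    static void HotLowLevelFunction(/* ... */) { /* hot loop here */ }
    #ifdef _MSC_VER
    #pragma runtime_checks("", restore)
    #endif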

@wolfpld (Contributor, Author) commented May 6, 2021

Pardon the large tangent, but any occasion to test those tools is good for us

I think these measurements are very interesting and appropriate to talk about here.

I noticed things are slower in Debug:

This is understandable, given that MSVC instruments the code for use in the debugger. In the assembly output you can see multiple instances of seemingly nonsensical behavior, such as:

        movss   dword ptr [rbp+114h], xmm0
        movss   xmm0, dword ptr [rbp+114h]

This sequence stores the xmm0 register on the stack, only to immediately load the value back into the same register. This enables the user to modify values held in variables when a breakpoint is hit. On the other hand, it also introduces FPU pipeline stalls.

(currently in the middle of construction work)

I have another similar optimization queued up: wolfpld/tracy@0bd6479
Would you like me to include this change in this PR (given that it is taking some time to merge), or do you want it in a second PR?

@ocornut (Owner) commented May 6, 2021 via email

Commit pushed to the PR branch: "This function calculates the result of 1/x. On SSE1-capable platforms this is performed as an approximation, using rcpps."
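
Based on that commit message, the helper presumably takes a form along these lines (a sketch, not the exact committed code):

    #ifdef __SSE__
    #include <immintrin.h>   // _mm_rcp_ss() and friends
    // Approximate 1/x with a single rcpss instruction.
    static inline float ImRecip(float x) { return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x))); }
    #else
    static inline float ImRecip(float x) { return 1.0f / x; }
    #endif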
@wolfpld (Contributor, Author) commented May 6, 2021

Ok, added new changes and rebased to current master.

ocornut pushed a commit that referenced this pull request May 6, 2021
@ocornut (Owner) commented May 6, 2021

I merged the first part as 4c9f0ce (with minor amends).

First part in red, second part (ImRecip) in green:

Release X64
[image: Release X64 results]

Debug X64
[image: Debug X64 results]

In Debug the overhead of the non-inlined function call sort of hinders it for now. I'll investigate using macros later.

@ocornut (Owner) commented May 6, 2021

I tried various variants (with/without macros) and I couldn't find a satisfying setup where the ImRecip change seems clearly worthwhile. It's a very minor gain in "Release" mode and often a more noticeable loss in "Debug" mode.

Did you measure meaningful changes when doing the ImRecip() change on your side?

@wolfpld (Contributor, Author) commented May 6, 2021

This is the 1.0f / x version:
[image: profiler view of the 1.0f / x version]

And here is the optimized one:
[image: profiler view of the optimized version]

This is in the context of time spent in AddPolyline(). The cause can be seen in the instruction latencies:

[images: instruction latency tables]

@wolfpld (Contributor, Author) commented May 6, 2021

BTW, I have measured that functions such as AddLine() spend a significant amount of time putting things into the arrays passed to AddPolyline(). I removed this cost with a simple workaround: wolfpld/tracy@fe22d5a, which elides a copy (with force inline, the ImVec2 data is stored only in its final location). This solution doesn't need a check for array capacity, because the array size is known, but it has slightly different semantics than AddLine() and co.:

  • This will draw a single line, instead of finishing an already started path with a line.
  • The user is expected to add (0.5, 0.5) to the coordinates passed to the function, so that pixels are properly centered. In my case this is preferable, as I do a lot of the following:
    const auto wpos = ImGui::GetWindowPos();
    ...
    AddLine( wpos + v1, wpos + v2 );

So I can just add (0.5, 0.5) to wpos during initialization.
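
A sketch of what such a copy-eliding helper can look like (names are illustrative and this is not the actual Tracy code; it mirrors the non-anti-aliased quad emission that AddPolyline() performs internally):

    #include "imgui.h"
    #include "imgui_internal.h"   // ImDrawListSharedData, for TexUvWhitePixel

    // Draw a single standalone line, writing the quad directly and eliding the
    // temporary point array AddPolyline() would consume. Coordinates are
    // expected to already include the (0.5f, 0.5f) pixel-center offset.
    static void DrawLineDirect(ImDrawList* draw, const ImVec2& p1, const ImVec2& p2,
                               ImU32 col, float thickness)
    {
        float dx = p2.x - p1.x, dy = p2.y - p1.y;
        float d2 = dx * dx + dy * dy;
        if (d2 > 0.0f) { float inv_len = ImRsqrt(d2); dx *= inv_len; dy *= inv_len; }
        dx *= thickness * 0.5f;   // scale the unit direction to half-thickness;
        dy *= thickness * 0.5f;   // (dy, -dx) is then the perpendicular offset
        const ImVec2 uv = draw->_Data->TexUvWhitePixel;
        draw->PrimReserve(6, 4);  // one quad: 6 indices, 4 vertices (array size is known, no capacity check)
        draw->PrimQuadUV(ImVec2(p1.x + dy, p1.y - dx), ImVec2(p2.x + dy, p2.y - dx),
                         ImVec2(p2.x - dy, p2.y + dx), ImVec2(p1.x - dy, p1.y + dx),
                         uv, uv, uv, uv, col);
    }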

@metarutaiga commented
You can use the fast inverse square root if SSE is not available.

rsqrtss is a lookup-table instruction: a low-accuracy (16-bit or 20-bit) inverse square root.
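
For the record, the bit trick being referenced looks like this (the function name is illustrative; one Newton-Raphson step refines the initial estimate):

    #include <string.h>   // memcpy()

    // "Fast inverse square root" (Quake III style), no SSE required.
    static inline float FastRsqrt(float x)
    {
        float half = 0.5f * x;
        unsigned int i;
        memcpy(&i, &x, sizeof(i));        // reinterpret the float bits as an integer
        i = 0x5F3759DF - (i >> 1);        // magic-constant initial estimate
        memcpy(&x, &i, sizeof(x));
        x = x * (1.5f - half * x * x);    // one Newton-Raphson step
        return x;
    }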

@rokups (Contributor) commented Jun 3, 2021

I took a peek at the ImRecip change, including using it as a macro instead of an inline function.

Release builds benefit a bit from it; the effect of using a macro instead of an inline function can be written off as statistically insignificant.
[image: Release build comparison]

Debug builds, however, still suffer more than they should. Using a macro does not help much.
[image: Debug build comparison]
This is probably because debug builds generate more instructions when SSE intrinsics are used. Tested on GCC x64 with -O0 (https://godbolt.org/z/Y1Exr413j).

#define ImRecip(x) (1.0f / x):

        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        movss   xmm0, DWORD PTR .LC0[rip]
        movss   DWORD PTR [rbp-4], xmm0
        movss   xmm1, DWORD PTR [rbp-4]
        movss   xmm0, DWORD PTR .LC1[rip]
        divss   xmm0, xmm1
        pxor    xmm2, xmm2
        cvtss2sd        xmm2, xmm0
        movq    rax, xmm2
        movq    xmm0, rax
        mov     edi, OFFSET FLAT:.LC2
        mov     eax, 1
        call    printf
        mov     eax, 0
        leave
        ret

#define ImRecip(x) _mm_cvtss_f32(_mm_rcp_ps(_mm_set_ss(x))):

        push    rbp
        mov     rbp, rsp
        sub     rsp, 48
        movss   xmm0, DWORD PTR .LC0[rip]
        movss   DWORD PTR [rbp-40], xmm0
        movss   xmm0, DWORD PTR [rbp-40]
        movss   DWORD PTR [rbp-36], xmm0
        mov     eax, DWORD PTR [rbp-36]
        movd    xmm0, eax
        movaps  XMMWORD PTR [rbp-32], xmm0
        rcpps   xmm0, XMMWORD PTR [rbp-32]
        nop
        movaps  XMMWORD PTR [rbp-16], xmm0
        movss   xmm0, DWORD PTR [rbp-16]
        pxor    xmm1, xmm1
        cvtss2sd        xmm1, xmm0
        movq    rax, xmm1
        movq    xmm0, rax
        mov     edi, OFFSET FLAT:.LC1
        mov     eax, 1
        call    printf
        mov     eax, 0
        leave
        ret

Conclusion: SSE for release builds and a macro with a simple 1.0f / x for debug builds would be the best of both worlds.
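
In code, that setup might look like the following (a sketch; the build-detection macros are illustrative):

    // Approximate reciprocal in optimized builds, exact division in debug builds.
    #if defined(__SSE__) && defined(NDEBUG)
    #include <immintrin.h>
    #define ImRecip(x)  _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)))
    #else
    #define ImRecip(x)  (1.0f / (x))
    #endif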

@wolfpld (Contributor, Author) commented Jun 3, 2021

Conclusion: SSE for release builds and a macro with a simple 1.0f / x for debug builds would be the best of both worlds.

The SSE version has lower precision than division, so the obtained results will differ between release and debug builds. This can potentially turn someone's debugging session into an exercise in frustration, should the difference not be known.

@sergeyn (Contributor) commented Mar 19, 2022

Just my 2 cents here: the relative error of rcpps is quite high, which will become a problem on high-resolution displays (like 5K ones). I'd suggest adding one Newton-Raphson iteration to increase accuracy.
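
A sketch of that suggestion: one Newton-Raphson step, y' = y * (2 - x*y), on top of the rcpss estimate roughly doubles the number of accurate bits (the function name is illustrative):

    #include <immintrin.h>

    // Reciprocal: refine the ~12-bit rcpss approximation with one Newton-Raphson iteration.
    static inline float ImRecipNR(float x)
    {
        __m128 vx = _mm_set_ss(x);
        __m128 y  = _mm_rcp_ss(vx);                                  // initial approximation
        __m128 t  = _mm_sub_ss(_mm_set_ss(2.0f), _mm_mul_ss(vx, y)); // (2 - x*y)
        return _mm_cvtss_f32(_mm_mul_ss(y, t));                      // y * (2 - x*y)
    }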
