
Optimization opportunity: ((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b)) #62790

Open
ryao opened this issue May 18, 2023 · 14 comments


ryao commented May 18, 2023

While trying to make a binary search function used in ZFS run faster, I noticed that Clang misses a high-level optimization opportunity: ((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b)).

Other variations of this include:

  • ((((a) > (b)) - ((a) < (b))) <= 0) -> ((a) <= (b))

  • ((((a) > (b)) - ((a) < (b))) == -1) -> ((a) < (b))

  • ((((a) > (b)) - ((a) < (b))) == 1) -> ((a) > (b))

  • ((((a) > (b)) - ((a) < (b))) == 0) -> ((a) == (b))

  • ((((a) > (b)) - ((a) < (b))) > 0) -> ((a) > (b))

  • ((((a) > (b)) - ((a) < (b))) >= 0) -> ((a) >= (b))

  • ((((a) >= (b)) - ((a) <= (b))) < 0) -> ((a) < (b))

  • ((((a) >= (b)) - ((a) <= (b))) <= 0) -> ((a) <= (b))

  • ((((a) >= (b)) - ((a) <= (b))) == -1) -> ((a) < (b))

  • ((((a) >= (b)) - ((a) <= (b))) == 1) -> ((a) > (b))

  • ((((a) >= (b)) - ((a) <= (b))) == 0) -> ((a) == (b))

  • ((((a) >= (b)) - ((a) <= (b))) > 0) -> ((a) > (b))

  • ((((a) >= (b)) - ((a) <= (b))) >= 0) -> ((a) >= (b))

(((a) > (b)) - ((a) < (b))) is a trick to generate -1, 0, or 1 when comparing two integers (signed or unsigned). When comparators using this trick are inlined into the caller, the above transformations become applicable.
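As a minimal C illustration of the idiom and the requested fold (the macro and function names here are mine, not ZFS's actual spelling):

    /* Three-way comparison idiom: evaluates to -1, 0, or 1. */
    #define CMP3(a, b) (((a) > (b)) - ((a) < (b)))

    /* Once a comparator built on CMP3 is inlined, a test such as
     * (CMP3(a, b) < 0) could be folded straight to (a < b). */
    int is_less(int a, int b)
    {
        return CMP3(a, b) < 0; /* equivalent to: return a < b; */
    }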

Getting back to binary search, the idea for making it faster was to eliminate a control hazard from the binary search function by making the loop body branchless. The following site explains the control hazard and shows how to avoid it:

https://en.algorithmica.org/hpc/data-structures/binary-search/#the-bottleneck

This site provides benchmarks that confirm that avoiding the control hazard makes binary search faster:

https://github.com/scandum/binary_search/tree/master
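For context, here is a rough sketch of the branchy versus branchless loop shapes being discussed (purely illustrative; the function names are mine, and this is not the exact ZFS or godbolt code):

    #include <stddef.h>

    /* Branchy lower-bound search: the data-dependent comparison becomes a
     * conditional branch, i.e. the control hazard described above. */
    size_t search_branchy(const int *arr, size_t n, int key)
    {
        size_t lo = 0, hi = n;

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            if (arr[mid] < key)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }

    /* Branchless variant: the same decision written as selects, which the
     * backend can lower to conditional moves instead of a branch. */
    size_t search_branchless(const int *arr, size_t n, int key)
    {
        size_t lo = 0, hi = n;

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            int below = arr[mid] < key;

            lo = below ? mid + 1 : lo;
            hi = below ? hi : mid;
        }
        return lo;
    }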

The comparators used in ZFS are expected to return -1, 0, or 1, so (((a) > (b)) - ((a) < (b))) is often used in them. Since I wanted to avoid control hazards in binary search, I wrote a macro that does C++-style inlining of comparators into the binary search function by generating a separate version for each comparator. Unfortunately, LLVM/Clang reintroduced the control hazard when all of this was put together. Here is a minimal version that avoids the macro so that it is easy to see which lines correspond to which instructions:

https://gcc.godbolt.org/z/KbKc6zqjT

On amd64, if I comment out line 35 and uncomment line 36, the control hazard vanishes, but the code generation is less than ideal. If I uncomment line 37 instead of line 36, Clang generates what I wanted in the first place, but I cannot use that form because I need the compiler to inline the existing comparison function.

If I compile for rv64gc, LLVM/Clang does not generate a control hazard for any of these versions, although it still generates different code for each: line 35 gives a 13-instruction loop, line 36 gives a 16-instruction loop, and line 37 gives a 12-instruction loop.

For comparison, GCC 13.1 generates a 15-instruction loop for amd64 without any control hazard for both the line 35 and line 36 versions, although it also misses this optimization opportunity.

If LLVM had an optimization pass that recognized the equivalence of the above expressions, then it would generate code corresponding to what it generated before I added the abstraction needed to make this function useful in production.

I will switch my code to use line 36 at the expense of code generation on RISC-V, but I would be happier if LLVM/Clang implemented this optimization.
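For reference, here is a hypothetical sketch of roughly what the code looks like once a -1/0/1 comparator is inlined into the branchless search (names are illustrative); the missed fold is that cmp_int(arr[mid], key) < 0 is simply arr[mid] < key:

    #include <stddef.h>

    /* ZFS-style comparator contract: returns -1, 0, or 1. */
    static inline int cmp_int(int a, int b)
    {
        return (a > b) - (a < b);
    }

    size_t search_inlined_cmp(const int *arr, size_t n, int key)
    {
        size_t lo = 0, hi = n;

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            /* The requested fold would reduce this condition to
             * (arr[mid] < key) after inlining. */
            int below = cmp_int(arr[mid], key) < 0;

            lo = below ? mid + 1 : lo;
            hi = below ? hi : mid;
        }
        return lo;
    }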


topperc commented May 18, 2023

Might be worth looking at -mllvm -x86-cmov-converter=false. That should stop the X86 backend from turning cmov instructions into control flow. Maybe its heuristics need some more adjusting.


ryao commented May 18, 2023

-mllvm -x86-cmov-converter=false causes LLVM/Clang to generate a beautiful 9-instruction version of the loop. Is there any way to specify that for a small region of code through a pragma?


ryao commented May 18, 2023

A bug has been filed against GCC regarding the high-level optimization opportunity, on the advice of the Gentoo toolchain team:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109901


ryao commented May 18, 2023

#39374 appears to be related.


ryao commented May 18, 2023

No amount of inserting __builtin_unpredictable makes a difference.


topperc commented May 18, 2023

No amount of inserting __builtin_unpredictable makes a difference.

Yeah that metadata doesn't make it past IR. @davidbolvansky was looking at fixing that I think.


ryao commented May 18, 2023

https://gcc.godbolt.org/z/KbKc6zqjT

Would I be correct to think that line 6 of the assembly output serves no purpose due to line 3?


RKSimon commented May 18, 2023

CC @RKSimon

@davidbolvansky

No amount of inserting __builtin_unpredictable makes a difference.

Yeah that metadata doesn't make it past IR. @davidbolvansky was looking at fixing that I think.

Yes, I need to rebase https://reviews.llvm.org/D118118 and check every callsite.


davidbolvansky commented May 18, 2023

Is there any way to specify that for a small region of code through a pragma?

A pragma is not a bad idea (many __builtin_unpredictable calls could be a bit inconvenient in user code), but for now, after D118118 lands, __builtin_unpredictable becomes stronger and should solve your issues. A new pragma is a good follow-up, I think.


ryao commented May 27, 2023

This was going to be a write-up explaining how not having this optimization slows us down. To my pleasant surprise, LLVM/Clang trunk appears to perform this optimization:

https://gcc.godbolt.org/z/P6Pbxxovv

That being said, there is room to do better.

Here is the assembly that LLVM trunk generates for the first case statement in the code that I linked:

        leal    512(%rcx), %eax
        cmpl    %edx, (%rdi,%rax,4)
        cmovgl  %ecx, %eax
        movl    %eax, %ecx

Here is the assembly generated by GCC 12.3:

        leal    512(%rdx), %ecx
        cmpl    %esi, (%rdi,%rcx,4)
        cmovle  %ecx, %edx

GCC's output is about 10% faster in micro-benchmarks, although micro-benchmarking this code is complicated by #62948, which must be worked around by manually patching the assembly output before any benchmarking can be done:

diff --git a/out.s b/out.s
index a336c59..e4cacbb 100644
--- a/out.s
+++ b/out.s
@@ -12,8 +12,8 @@ custom_binary_search_slow:              # @custom_binary_search_slow
        bsrl    %eax, %eax
        movl    %eax, %r8d
        xorl    $31, %r8d
-       movb    $30, %cl
-       subb    %r8b, %cl
+       movl    $30, %ecx
+       subl    %r8d, %ecx
        movl    $1, %r8d
        shll    %cl, %r8d
        movl    %esi, %r9d
@@ -191,8 +191,8 @@ custom_binary_search_fast:              # @custom_binary_search_fast
        bsrl    %eax, %eax
        movl    %eax, %r8d
        xorl    $31, %r8d
-       movb    $30, %cl
-       subb    %r8b, %cl
+       movl    $30, %ecx
+       subl    %r8d, %ecx
        movl    $1, %r8d
        shll    %cl, %r8d
        movl    %esi, %r9d

In any case, I modified the microbenchmark from scandum/binary_search to better suit a workload that I am micro-optimizing:

https://github.com/scandum/binary_search

Specifically, I wanted to test on small arrays while keeping cache effects from contaminating the results. One could easily add the search functions from my modified version into the original to get numbers for bigger array sizes.

The source code for the modified micro-benchmark is attached to the following GCC issue:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109901

Should I file another issue for this, or is leaving this note here fine? Does anyone know offhand which commit to trunk made this difference?


ryao commented May 27, 2023

Disregard my remarks that this is fixed. I just looked at the original example and that is not fixed in trunk.


davidbolvansky commented Jun 2, 2023

No amount of inserting __builtin_unpredictable makes a difference.

Yeah that metadata doesn't make it past IR. @davidbolvansky was looking at fixing that I think.

Yes, I need to rebase https://reviews.llvm.org/D118118 and check every callsite.

Landed. Now you can use __builtin_unpredictable to get CMOVs.

godbolt:
https://gcc.godbolt.org/z/hWx39qdeM
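For anyone finding this later, a rough usage sketch (not the code from the godbolt link above; the function name is illustrative) of annotating the hot comparison with __builtin_unpredictable:

    #include <stddef.h>

    size_t search_hinted(const int *arr, size_t n, int key)
    {
        size_t lo = 0, hi = n;

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            /* Hint that this comparison has no predictable pattern,
             * so the backend should prefer a cmov over a branch. */
            if (__builtin_unpredictable(arr[mid] < key))
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }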


ryao commented Jun 3, 2023

No amount of inserting __builtin_unpredictable makes a difference.

Yeah that metadata doesn't make it past IR. @davidbolvansky was looking at fixing that I think.

Yes, I need to rebase https://reviews.llvm.org/D118118 and check every callsite.

Landed. Now you can use __builtin_unpredictable to get CMOVs.

godbolt: https://gcc.godbolt.org/z/hWx39qdeM

Awesome. Thanks.
