clang can optimize _tzcnt_u32 a bit more #64477
Comments
The result should be
f(int, int):
        tzcnt     eax, DWORD PTR [4+esp]        #9.16
        ret                                     #9.16
test(int):
        tzcnt     eax, DWORD PTR [4+esp]        #9.16
        ret                                     #13.12
not
f(int, int):                            # @f(int, int)
        mov     eax, dword ptr [esp + 4]
        test    eax, eax
        je      .LBB0_2
        rep bsf eax, eax
        ret
.LBB0_2:
        mov     eax, 32
        ret |
clang requires -mbmi just like gcc |
Yep, this makes user code hard to write because of the differences between compilers. Currently MSVC and ICL behave exactly the same. GCC won't compile without that flag. My suggestion is that, without it, clang should emit
f(int, int):
        tzcnt     eax, DWORD PTR [4+esp]        #9.16
        ret                                     #9.16
test(int):
        tzcnt     eax, DWORD PTR [4+esp]        #9.16
        ret                                     #13.12
directly. This can run on every x86 system, so it makes sense. |
It will produce an incorrect result when the input is zero on a system that doesn't support tzcnt. |
That's what the user wants; otherwise the user would guard the call with either an #ifdef __BMI__ check or a runtime check. |
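The compile-time guard mentioned above could look roughly like the sketch below. The helper name `ctz32_or_32` and the self-check function are hypothetical, introduced only for illustration; the sketch assumes the contract "return 32 for a zero input", which is what tzcnt itself does, and it does not show the runtime-check alternative.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__BMI__)
#include <immintrin.h>   /* declares _tzcnt_u32 when BMI is enabled */
#endif

/* Hypothetical helper: trailing-zero count that returns 32 for x == 0. */
static inline uint32_t ctz32_or_32(uint32_t x)
{
#if defined(__BMI__)
    /* BMI is enabled at compile time, so tzcnt can be used directly. */
    return _tzcnt_u32(x);
#elif defined(__GNUC__)
    /* __builtin_ctz is undefined for 0, so handle that case explicitly. */
    return x ? (uint32_t)__builtin_ctz(x) : 32u;
#else
    /* Portable fallback: shift until the lowest set bit is found. */
    uint32_t n = 0;
    if (x == 0)
        return 32u;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
#endif
}

/* Quick self-check of the contract, including the zero case. */
static void ctz32_self_check(void)
{
    assert(ctz32_or_32(0u) == 32u);
    assert(ctz32_or_32(1u) == 0u);
    assert(ctz32_or_32(40u) == 3u);
}
```

With a guard like this, the explicit zero check is only paid for on builds where BMI is not enabled.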
As a result, the following code cannot be optimized to a single tzcnt instruction without that flag:
static inline uint32_t BitScanForward(
    uint32_t mask) ///< [in] Bitmask to scan
{
    assert(mask > 0);
    unsigned long out = 0;
#if (defined(_WIN32) && (defined(_M_IX64) || defined(_M_IX86) || defined(_M_X64))) || \
    defined(__x86_64__) || defined(__i386__)
    out = _tzcnt_u32(mask);
#elif defined(__GNUC__)
    out = __builtin_ctz(mask);
#else
    while ((mask & 1) == 0)
    {
        mask >>= 1;
        out++;
    }
#endif
    return out;
} |
Are you saying that if the user uses the tzcnt intrinsic without -mbmi, we should assume they don't care about 0? |
Why not prefer __builtin_ctz whenever possible instead of using an x86 intrinsic first? |
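A minimal sketch of that reordering, assuming the same "mask must be nonzero" precondition as the BitScanForward helper quoted earlier. The function name `bit_scan_forward` is hypothetical; `_BitScanForward` is the MSVC intrinsic from `<intrin.h>`.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_MSC_VER) && !defined(__clang__)
#include <intrin.h>   /* _BitScanForward */
#endif

/* Hypothetical variant that prefers the compiler builtin over the x86
 * intrinsic. Precondition: mask != 0. */
static inline uint32_t bit_scan_forward(uint32_t mask)
{
    assert(mask != 0);
#if defined(__GNUC__) || defined(__clang__)
    /* Undefined for 0, which the precondition rules out, so the compiler
     * is free to pick bsf or tzcnt without a zero guard. */
    return (uint32_t)__builtin_ctz(mask);
#elif defined(_MSC_VER)
    unsigned long out = 0;
    _BitScanForward(&out, mask);
    return (uint32_t)out;
#else
    /* Portable fallback: shift until the lowest set bit is found. */
    uint32_t out = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        out++;
    }
    return out;
#endif
}
```

Because the builtin carries no defined result for zero, this form does not ask the compiler to reproduce tzcnt's "32 for zero" behavior.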
Using tzcnt is in principle faster than bsf in many cases, since bsf always has a dependency on the output register (due to the zero input case), but tzcnt doesn't. In practice tzcnt had a false dependency anyway, at least until Skylake, where that was fixed (again IIRC, I recall that one of these bit manipulation instructions didn't have its false dependency fixed, but the rest did). https://news.ycombinator.com/item?id=18210808 The clang header also comments on that. |
Clang compiles __builtin_clz to |
That part is correct :) My complaint is that clang adds these extra instructions without that flag:
        mov     eax, dword ptr [esp + 4]
        test    eax, eax
        je      .LBB0_2
The |
@lygstate Have you considered:
int foo(unsigned i) {
    if (i == 0)
        __builtin_unreachable();
    return _tzcnt_u32(i);
} |
Brilliant, the following code works for ICC/MSVC/Clang/GCC (with -mbmi)!
#ifdef _MSC_VER
#include <intrin.h>
__forceinline void
unreachable() { __assume(0); }
#else
#include <x86intrin.h>
inline __attribute__((always_inline)) void
unreachable() {
#if defined(__INTEL_COMPILER)
    __assume(0);
#else
    __builtin_unreachable();
#endif
}
#endif
int f(int a)
{
    if (a == 0) {
        unreachable();
    }
    return _tzcnt_u32(a);
}
There is a fun fact that ICC doesn't support for GCC emit an extra redundant |
Can
int f(int a)
{
    return _tzcnt_u32(a);
}
be optimized to:
f(int):                                 # @f(int)
        mov     eax, 32
        rep bsf eax, dword ptr [esp + 4]
        ret
without that flag? It works, according to https://godbolt.org/z/qYKhPTzrK, through
|
Possibly, but we'd need to know for sure that every x86 CPU ever made implements that behavior, which Intel documents as undefined. This includes old vendors like VIA, Centaur, etc. Or we'd need some other feature we can check for, where we know every CPU with that feature implements this behavior. I think the Linux kernel uses this trick. Not sure if they know every possible CPU works. @nickdesaulniers do you know? |
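To make the assumption behind the `mov eax, 32` / `rep bsf` sequence explicit, here is a small C model of it. This is only a sketch: `bsf_model` encodes the ASSUMPTION that bsf leaves its destination unchanged when the source is zero, which is exactly the behavior Intel documents as undefined. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of bsf under the ASSUMPTION that a zero source leaves the
 * destination register unchanged. This is not guaranteed by Intel's
 * documentation; it is the behavior the trick relies on. */
static uint32_t bsf_model(uint32_t dest, uint32_t src)
{
    uint32_t index = 0;
    if (src == 0)
        return dest;            /* assumed: destination preserved */
    while ((src & 1u) == 0) {
        src >>= 1;
        index++;
    }
    return index;
}

/* Models "mov eax, 32; rep bsf eax, src" on a CPU that decodes
 * "rep bsf" as plain bsf (i.e. one without tzcnt support). */
static uint32_t tzcnt_via_bsf_model(uint32_t x)
{
    return bsf_model(32u, x);
}

/* Under the assumption above, the sequence matches tzcnt for all inputs. */
static void tzcnt_via_bsf_self_check(void)
{
    assert(tzcnt_via_bsf_model(0u) == 32u);
    assert(tzcnt_via_bsf_model(1u) == 0u);
    assert(tzcnt_via_bsf_model(48u) == 4u);
}
```

If a CPU did anything else with the destination on a zero source, the first assertion is the case that would no longer hold on real hardware.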
The godbolt compile demo for _tzcnt_u32 under MSVC/ICL/GCC/Clang, not optimized
The godbolt compile demo for _tzcnt_u32 under MSVC/ICL/GCC/Clang optimized
_tzcnt_u32 can be optimized to: