The _mm_movemask_epi8 intrinsic will sometimes call pmovmskb and will other times call movmskps #51909

Closed
llvmbot opened this issue Nov 20, 2021 · 4 comments
Labels: backend:X86, bugzilla (Issues migrated from bugzilla)

Comments

@llvmbot
Collaborator

llvmbot commented Nov 20, 2021

Bugzilla Link: 52567
Resolution: FIXED
Resolved on: Nov 19, 2021 22:26
Version: trunk
OS: Linux
Reporter: LLVM Bugzilla Contributor
CC: @topperc, @RKSimon, @phoebewang, @rotateright
Fixed by commit(s): a4373f6

Extended Description

I've found that _mm_movemask_epi8 changes its behavior depending on how its result is used in the surrounding expression.

This is reproducible with this code:

    1 #include <emmintrin.h> // SSE2 header: declares _mm_movemask_epi8 and the cast intrinsics
    2 int main()
    3 {
    4     const int mask = 0x00FF; // selects the low 8 bytes, i.e. the two low float lanes
    5     const auto recip = _mm_rcp_ps(_mm_set_ps(0.0f, 0.0f, 4.0f, 2.0f));
    6     const auto diff = _mm_sub_ps(_mm_set_ps(0.0f, 0.0f, 0.25f, 0.5f), recip);
    7     const auto abs = _mm_and_ps(diff, _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF))); // clear sign bits
    8     const auto compare = _mm_castps_si128(_mm_cmpgt_ps(abs, _mm_set1_ps(0.001f)));
    9     return (_mm_movemask_epi8(compare) & mask) == 0;
    10 }

In plain English, this code:

  1. Gets the approximate reciprocal of the vector (0.0f, 0.0f, 4.0f, 2.0f)
  2. Subtracts the result of 1 from (0.0f, 0.0f, 0.25f, 0.5f)
  3. Takes the absolute value of the result of 2
  4. Returns 0 if either of the last 2 values of the result of 3 (the two low lanes, selected by mask) is greater than 0.001f, 1 otherwise
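
For readers without an x86 toolchain at hand, the four steps can be modelled in plain scalar C++. This is only a sketch: it follows the operand order of the `_mm_sub_ps` call in the source, it assumes an exact `1/x` in place of `rcpps` (which is only an approximation), and the function names are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

// Scalar model of the four steps. Lane 0 holds the LAST argument of
// _mm_set_ps, so the arrays are written in lane order (low lane first).
// Assumption: an exact 1/x stands in for rcpps, which only approximates it.
inline std::array<bool, 4> lanes_exceeding_tolerance() {
    const std::array<float, 4> input    = {2.0f, 4.0f, 0.0f, 0.0f};
    const std::array<float, 4> expected = {0.5f, 0.25f, 0.0f, 0.0f};
    std::array<bool, 4> exceeds{};
    for (std::size_t i = 0; i < 4; ++i) {
        // Step 1: reciprocal, modelled as +infinity for a zero input.
        const float recip = input[i] == 0.0f
                                ? std::numeric_limits<float>::infinity()
                                : 1.0f / input[i];
        const float diff = expected[i] - recip;   // step 2
        const float magnitude = std::fabs(diff);  // step 3
        exceeds[i] = magnitude > 0.001f;          // step 4, per lane
    }
    return exceeds;
}

// Step 4 as the reproducer phrases it: the 0x00FF mask covers the low
// 8 bytes of the byte mask, which belong to lanes 0 and 1.
inline int reproducer_return_value() {
    const std::array<bool, 4> exceeds = lanes_exceeding_tolerance();
    return (exceeds[0] || exceeds[1]) ? 0 : 1;
}

// Hypothetical self-check of the values discussed in this report.
inline void check_walkthrough() {
    const std::array<bool, 4> exceeds = lanes_exceeding_tolerance();
    assert(!exceeds[0] && !exceeds[1]);  // 0.5 and 0.25 match their reciprocals
    assert(exceeds[2] && exceeds[3]);    // |0 - inf| is far above 0.001
    assert(reproducer_return_value() == 1);
}
```

Under these assumptions only the two upper lanes compare true, so a correct build of the reproducer is expected to return 1.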

Intel's Intrinsics Guide states that the _mm_movemask_epi8 intrinsic corresponds to the "pmovmskb r32, xmm" instruction, so here it should return 0xff00. However, because the value of mask (declared on line 4) is known at compile time, the call to _mm_movemask_epi8 on line 9 generates a movmskps instruction, which returns a different value. Instead of four 1 bits per true lane (one per byte, 0xff00 in total), we get only one bit per true lane, 0b1100.
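
The difference between the two instructions can be illustrated with a scalar model. The sketch below is not how either instruction is implemented; `Lanes`, `movemask_bytes`, `movemask_lanes` and `check_reported_values` are hypothetical names, and the input is the compare result this report expects (upper two lanes all ones).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Lanes = std::array<std::uint32_t, 4>;  // four 32-bit lanes, lane 0 first

// Scalar model of pmovmskb: one result bit per byte (16 bits in total),
// taken from the most significant bit of each byte.
inline int movemask_bytes(const Lanes& v) {
    int mask = 0;
    for (int byte = 0; byte < 16; ++byte) {
        const std::uint32_t lane = v[static_cast<std::size_t>(byte / 4)];
        const std::uint32_t top_bit = (lane >> (8 * (byte % 4) + 7)) & 1u;
        mask |= static_cast<int>(top_bit) << byte;
    }
    return mask;
}

// Scalar model of movmskps: one result bit per 32-bit lane (4 bits in
// total), taken from the sign bit of each lane.
inline int movemask_lanes(const Lanes& v) {
    int mask = 0;
    for (int lane = 0; lane < 4; ++lane) {
        mask |= static_cast<int>(v[static_cast<std::size_t>(lane)] >> 31) << lane;
    }
    return mask;
}

// Hypothetical self-check using the compare result from this report.
inline void check_reported_values() {
    const Lanes compare = {0u, 0u, 0xFFFFFFFFu, 0xFFFFFFFFu};
    assert(movemask_bytes(compare) == 0xFF00);
    assert(movemask_lanes(compare) == 0b1100);
    // The same 0x00FF mask gives opposite answers for the two encodings.
    assert((movemask_bytes(compare) & 0x00FF) == 0);
    assert((movemask_lanes(compare) & 0x00FF) != 0);
}
```

The last two assertions show why swapping one instruction for the other is only safe if the mask is rewritten to match the new bit layout.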

The assembly generated for this code is:

    rcpps xmm0, xmmword ptr [rip + .LCPI0_0]
    movaps xmm1, xmmword ptr [rip + .LCPI0_1] # xmm1 = [5.0E-1,2.5E-1,0.0E+0,0.0E+0]
    subps xmm1, xmm0
    andps xmm1, xmmword ptr [rip + .LCPI0_2]
    movaps xmm0, xmmword ptr [rip + .LCPI0_3] # xmm0 = [1.00000005E-3,1.00000005E-3,1.00000005E-3,1.00000005E-3]
    cmpltps xmm0, xmm1
    movmskps ecx, xmm0 # <--- this is the instruction from _mm_movemask_epi8
    xor eax, eax
    test ecx, ecx
    sete al
    ret

If we do anything to the mask that hides its value from the compiler, it emits pmovmskb instead. The simplest thing to do is to mark the mask variable as volatile, but moving it to a separate translation unit (TU) also works.

    rcpps xmm0, xmmword ptr [rip + .LCPI0_0]
    movaps xmm1, xmmword ptr [rip + .LCPI0_1] # xmm1 = [5.0E-1,2.5E-1,0.0E+0,0.0E+0]
    subps xmm1, xmm0
    andps xmm1, xmmword ptr [rip + .LCPI0_2]
    mov dword ptr [rsp - 4], 255
    movaps xmm0, xmmword ptr [rip + .LCPI0_3] # xmm0 = [1.00000005E-3,1.00000005E-3,1.00000005E-3,1.00000005E-3]
    cmpltps xmm0, xmm1
    pmovmskb ecx, xmm0 # <--- this is the instruction from _mm_movemask_epi8
    xor eax, eax
    and ecx, dword ptr [rsp - 4]
    sete al
    ret

The result is that the return value of _mm_movemask_epi8 is unpredictable.

The code can be seen here: https://godbolt.org/z/3Yc734rnj
The top code is the current behavior, the bottom code only differs in that mask is volatile.

I used git bisect and found that this behavior changed in git commit 0741b75.

@llvmbot
Collaborator Author

llvmbot commented Nov 20, 2021

assigned to @topperc

@topperc
Collaborator

topperc commented Nov 20, 2021

I agree there is something funny going on here. I'll take a look.

@topperc
Collaborator

topperc commented Nov 20, 2021

Does this look better?

    rcpps   .LCPI0_0(%rip), %xmm0
    movaps  .LCPI0_1(%rip), %xmm1           # xmm1 = [5.0E-1,2.5E-1,0.0E+0,0.0E+0]
    subps   %xmm0, %xmm1
    andps   .LCPI0_2(%rip), %xmm1
    movaps  .LCPI0_3(%rip), %xmm0           # xmm0 = [1.00000005E-3,1.00000005E-3,1.00000005E-3,1.00000005E-3]
    cmpltps %xmm1, %xmm0
    movmskps        %xmm0, %ecx
    xorl    %eax, %eax
    testb   $3, %cl
    sete    %al
    retq
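
One way to read the `testb $3, %cl` in the listing above: movmskps is still used, but the byte mask 0x00FF has been narrowed to the lane mask 0x3, so only the two low lanes are tested. The sketch below shows that narrowing in isolation. It is an illustration only, not the code from LLVM or from a4373f6; `byte_mask_to_lane_mask` and `check_issue_mask` are hypothetical names, and the equivalence assumes every lane is all ones or all zeros, as a packed compare result is.

```cpp
#include <cassert>

// Hypothetical helper: collapse a 16-bit byte mask (pmovmskb layout) into
// a 4-bit lane mask (movmskps layout). Lane i is selected if any of its
// four bytes is selected.
// Assumption: the tested vector has lanes that are all ones or all zeros,
// otherwise "any selected bit set" can differ between the two layouts.
constexpr int byte_mask_to_lane_mask(int byte_mask) {
    int lane_mask = 0;
    for (int lane = 0; lane < 4; ++lane) {
        if ((byte_mask >> (4 * lane)) & 0xF) {
            lane_mask |= 1 << lane;
        }
    }
    return lane_mask;
}

static_assert(byte_mask_to_lane_mask(0x00FF) == 0x3,
              "0x00FF covers the bytes of lanes 0 and 1");

// Hypothetical self-check with the mask values quoted in this report.
inline void check_issue_mask() {
    const int pmovmskb_result = 0xFF00;  // byte mask quoted in the report
    const int movmskps_result = 0b1100;  // lane mask quoted in the report
    const int byte_mask = 0x00FF;
    // With the narrowed mask both encodings agree that no selected bit is set.
    assert((pmovmskb_result & byte_mask) == 0);
    assert((movmskps_result & byte_mask_to_lane_mask(byte_mask)) == 0);
}
```

By contrast, the `test ecx, ecx` in the original output checks all four lane bits, which corresponds to a byte mask of 0xFFFF rather than 0x00FF.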

@topperc
Collaborator

topperc commented Nov 20, 2021

I disabled this transform for this case in a4373f6.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
This issue was closed.