
[X86][AVX] Recognise out of bounds AVX2 shift amounts #83840

Closed
RKSimon opened this issue Mar 4, 2024 · 24 comments · Fixed by #84426 or #86922
Assignees
Labels
backend:X86 good first issue https://github.com/llvm/llvm-project/contribute missed-optimization

Comments

@RKSimon
Collaborator

RKSimon commented Mar 4, 2024

Pulled out of #39822 which was a bit too general.

Unlike the general ISD SRA/SRL/SHL nodes, the AVX2 vector shift nodes X86ISD VSRAV/VSRLV/VSHLV handle out-of-bounds shift amounts:

  • VSRAV clamps the unsigned shift amount to (BITWIDTH-1)
  • VSRLV/VSHLV return zero for unsigned shift amounts greater than (BITWIDTH-1)

So when lowering vector shifts, we should be able to fold these shift-amount clamp patterns away and use the X86ISD node types.

e.g.

define <4 x i32> @ashr(<4 x i32> %sh, <4 x i32> %amt) {
  %elt.min.i = tail call <4 x i32> @llvm.umin.v4i32(<4 x i32> %amt, <4 x i32> <i32 31, i32 31, i32 31, i32 31>)
  %shr = ashr <4 x i32> %sh, %elt.min.i
  ret <4 x i32> %shr
}

->

ashr(int vector[4], unsigned int vector[4]):
        vpbroadcastd    xmm2, dword ptr [rip + .LCPI0_0] # xmm2 = [31,31,31,31]
        vpminud xmm1, xmm1, xmm2
        vpsravd xmm0, xmm0, xmm1
        ret

vs

ashr(int vector[4], unsigned int vector[4]):
        vpsravd xmm0, xmm0, xmm1
        ret

Logical shifts are trickier but also foldable:

define <4 x i32> @lshr(<4 x i32> %sh, <4 x i32> %amt) {
  %cmp.i = icmp ult <4 x i32> %amt, <i32 32, i32 32, i32 32, i32 32>
  %shr = lshr <4 x i32> %sh, %amt
  %0 = select <4 x i1> %cmp.i, <4 x i32> %shr, <4 x i32> zeroinitializer
  ret <4 x i32> %0
}

define <4 x i32> @lshr2(<4 x i32> %sh, <4 x i32> %amt) {
  %cmp.i = icmp ult <4 x i32> %amt, <i32 32, i32 32, i32 32, i32 32>
  %0 = select <4 x i1> %cmp.i, <4 x i32> %sh, <4 x i32> zeroinitializer
  %shr = lshr <4 x i32> %0, %amt
  ret <4 x i32> %shr
}
@RKSimon RKSimon added good first issue https://github.com/llvm/llvm-project/contribute backend:X86 missed-optimization labels Mar 4, 2024
@llvmbot
Collaborator

llvmbot commented Mar 4, 2024

@llvm/issue-subscribers-backend-x86


@llvmbot
Collaborator

llvmbot commented Mar 4, 2024

Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

  1. In the comments of the issue, request for it to be assigned to you.
  2. Fix the issue locally.
  3. Run the test suite locally. Remember that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
  4. Create a Git commit.
  5. Run git clang-format HEAD~1 to format your changes.
  6. Open a pull request to the upstream repository on GitHub. Detailed instructions can be found in GitHub's documentation.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.

@llvmbot
Collaborator

llvmbot commented Mar 4, 2024

@llvm/issue-subscribers-good-first-issue


@RKSimon
Collaborator Author

RKSimon commented Mar 4, 2024

@SahilPatidar
Contributor

@RKSimon,
I'm interested in working on this issue. Could you confirm whether all the described behavior of the AVX2 vector shift nodes (X86ISD VSRAV/VSRLV/VSHLV) and the handling of out-of-bounds shift amounts lives in the llvm/lib/Target/X86 directory?

@RKSimon
Collaborator Author

RKSimon commented Mar 4, 2024

I expect all of this to be handled as DAG combines inside X86ISelLowering.cpp

@SahilPatidar
Contributor

What command did you use for this?

@SahilPatidar
Contributor

@RKSimon,
I am working on it, so could you please assign it to me?

@RKSimon
Collaborator Author

RKSimon commented Mar 4, 2024

I suggest you start with the SRA(X,UMIN(Y,BW-1)) case in combineShiftRightArithmetic

You should be able to use supportedVectorVarShift to detect when the target supports replacing the ISD::SRA node with X86ISD::VSRAV

@SahilPatidar
Contributor

@RKSimon,
While reviewing the X86ISelLowering codebase, I encountered a scenario where the code reaches the SRA case and calls the combineShiftRightArithmetic function. However, when the following condition is met:

if (VT.isVector() || N1.getOpcode() != ISD::Constant ||
    N0.getOpcode() != ISD::SHL || !N0.hasOneUse() ||
    N0.getOperand(1).getOpcode() != ISD::Constant)
  return SDValue();

the code returns without performing any optimization.

My observation is that even though we are working with vectors, the code does not invoke the combineVectorShiftImm function in these cases.

I would like to understand why the code behaves this way and does not call the combineVectorShiftImm function when dealing with vectors. Could you please provide some insights into this behavior?

@RKSimon
Collaborator Author

RKSimon commented Mar 4, 2024

The code below that early-out is for a single fold (see the description comment) - just above it is the combineShiftToPMULH fold - you can put your code after combineShiftToPMULH.

@SahilPatidar
Contributor

@RKSimon
I have been working on a particular piece of code in LLVM and encountered an unexpected behavior that I'm seeking clarification on.

APInt ShiftAmt;
SDNode *NNode = N1.getNode();
if (NNode->getOpcode() == ISD::UMIN &&
    ISD::isConstantSplatVector(NNode->getOperand(1).getNode(), ShiftAmt) && 
    ShiftAmt == VT.getScalarSizeInBits() - 1) {
  SDValue SHAmtVal = NNode->getOperand(0);
  SDLoc DL(N);
  return DAG.getNode(N->getOpcode(), DL, N->getVTList(), N0, SHAmtVal);
}

When running this code, I observed the following unexpected output:

vpsravd %xmm1, %xmm0, %xmm0
retq

I would greatly appreciate it if you could help me understand why the order of the xmm operands changes in the output.

@RKSimon
Collaborator Author

RKSimon commented Mar 5, 2024

The order just looks like the difference between Intel and AT&T assembly syntax.

@SahilPatidar
Contributor

@RKSimon, I think some changes need to be made in select and logical shift operations.

@RKSimon
Collaborator Author

RKSimon commented Mar 6, 2024

I suggest you just focus on the SRA case first

@SahilPatidar
Contributor

@RKSimon, could the code I shared above be improved? If it's incorrect, please let me know.

@RKSimon
Collaborator Author

RKSimon commented Mar 6, 2024

That SRA code looks fine - I'd like to see it in a PR, and then there is the question of adding test coverage.

@SahilPatidar
Contributor

@RKSimon,
Do tests need to be changed in combine-sra.ll?

@RKSimon
Collaborator Author

RKSimon commented Mar 6, 2024

@RKSimon, Do tests need to be changed in combine-sra.ll?

Yes

SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue Mar 8, 2024
SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue Mar 11, 2024
SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue Mar 11, 2024
SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue Mar 12, 2024
SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue Mar 12, 2024
SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue Mar 15, 2024
@RKSimon RKSimon reopened this Mar 15, 2024
@RKSimon
Collaborator Author

RKSimon commented Mar 15, 2024

ISD::SRA handling landed at #84426 / 1865655

@SahilPatidar
Contributor

SahilPatidar commented Mar 18, 2024

@RKSimon, it seems SRL also needs to be addressed. Is that correct?

@RKSimon
Collaborator Author

RKSimon commented Mar 18, 2024

Logical shift left/right both need dealing with, but those will be combined from either VSELECT or AND nodes.

@SahilPatidar
Contributor

Logical shift left/right both need dealing with, but those will be combined from either VSELECT or AND nodes.

could you please provide an example for AND?

@RKSimon
Collaborator Author

RKSimon commented Mar 24, 2024

This was the kind of thing I had in mind: https://gcc.godbolt.org/z/rdh9j3ocW

SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue Mar 28, 2024
SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue Apr 8, 2024
SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue May 5, 2024
SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue Jun 12, 2024
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this issue Jul 14, 2024