[X86][AVX] Recognise out of bounds AVX2 shift amounts #83840
@llvm/issue-subscribers-backend-x86 Author: Simon Pilgrim (RKSimon)

Pulled out of #39822, which was a bit too general.

Unlike the general ISD SRA/SRL/SHL nodes, the AVX2 variable vector shift nodes X86ISD VSRAV/VSRLV/VSHLV handle out of bounds shift amounts: a per-element count >= the element bit width produces the sign splat for VSRAV and zero for VSRLV/VSHLV.

So when lowering vector shifts, we should be able to fold any shift amount clamp patterns and use the X86ISD node types. e.g.

```llvm
define <4 x i32> @ashr(<4 x i32> %sh, <4 x i32> %amt) {
  %elt.min.i = tail call <4 x i32> @llvm.umin.v4i32(<4 x i32> %amt, <4 x i32> <i32 31, i32 31, i32 31, i32 31>)
  %shr = ashr <4 x i32> %sh, %elt.min.i
  ret <4 x i32> %shr
}
```

currently compiles to:

```asm
ashr(int vector[4], unsigned int vector[4]):
  vpbroadcastd xmm2, dword ptr [rip + .LCPI0_0] # xmm2 = [31,31,31,31]
  vpminud xmm1, xmm1, xmm2
  vpsravd xmm0, xmm0, xmm1
  ret
```

vs the desired:

```asm
ashr(int vector[4], unsigned int vector[4]):
  vpsravd xmm0, xmm0, xmm1
  ret
```

Logical shifts are trickier but also foldable:

```llvm
define <4 x i32> @lshr(<4 x i32> %sh, <4 x i32> %amt) {
  %cmp.i = icmp ult <4 x i32> %amt, <i32 32, i32 32, i32 32, i32 32>
  %shr = lshr <4 x i32> %sh, %amt
  %0 = select <4 x i1> %cmp.i, <4 x i32> %shr, <4 x i32> zeroinitializer
  ret <4 x i32> %0
}

define <4 x i32> @lshr2(<4 x i32> %sh, <4 x i32> %amt) {
  %cmp.i = icmp ult <4 x i32> %amt, <i32 32, i32 32, i32 32, i32 32>
  %0 = select <4 x i1> %cmp.i, <4 x i32> %sh, <4 x i32> zeroinitializer
  %shr = lshr <4 x i32> %0, %amt
  ret <4 x i32> %shr
}
```
Hi! This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.
@llvm/issue-subscribers-good-first-issue Author: Simon Pilgrim (RKSimon)
@RKSimon,

I expect all of this to be handled as DAG combines inside X86ISelLowering.cpp.

What command did you use for this?
@RKSimon,

I suggest you start with the SRA(X, UMIN(Y, BW-1)) case in combineShiftRightArithmetic. You should be able to use supportedVectorVarShift to detect when the target supports replacing the ISD::SRA node with X86ISD::VSRAV.
@RKSimon, at

```cpp
if (VT.isVector() || N1.getOpcode() != ISD::Constant ||
    N0.getOpcode() != ISD::SHL || !N0.hasOneUse() ||
    N0.getOperand(1).getOpcode() != ISD::Constant)
  return SDValue();
```

the code returns without performing any optimization. My observation is that, even though we are working with vectors, combineVectorShiftImm is never invoked in these cases. Could you please provide some insight into why the code behaves this way?
The code below that early-out is for a single fold (see the description comment). Just above it is the combineShiftToPMULH fold; you can put your code after combineShiftToPMULH.
@RKSimon When running this code:

```cpp
APInt ShiftAmt;
SDNode *NNode = N1.getNode();
if (NNode->getOpcode() == ISD::UMIN &&
    ISD::isConstantSplatVector(NNode->getOperand(1).getNode(), ShiftAmt) &&
    ShiftAmt == VT.getScalarSizeInBits() - 1) {
  SDValue SHAmtVal = NNode->getOperand(0);
  SDLoc DL(N);
  return DAG.getNode(N->getOpcode(), DL, N->getVTList(), N0, SHAmtVal);
}
```

I observed the following unexpected output:

```asm
vpsravd %xmm1, %xmm0, %xmm0
retq
```

I would greatly appreciate it if you could help me understand why the position of the operands differs from the earlier output.
The order looks to just be the difference between Intel and AT&T asm syntax (AT&T reverses the operand order and prefixes registers with %).
@RKSimon, I think some changes also need to be made for the select and logical shift operations.

I suggest you just focus on the SRA case first.

@RKSimon, could the code I shared above be improved? If it's incorrect, please let me know.
That SRA code looks fine. I'd like to see it in a PR, and then there is the question of adding test coverage.
@RKSimon,

Yes

@RKSimon, it seems SRL also needs to be addressed. Is that correct?
Logical shift left/right both need dealing with, but those will be combined from either VSELECT or AND nodes.
Could you please provide an example for this?

This was the kind of thing I had in mind: https://gcc.godbolt.org/z/rdh9j3ocW
… logical shift nodes llvm#83840 (llvm#86922) Resolve llvm#83840