-
Notifications
You must be signed in to change notification settings - Fork 15.2k
[AArch64] Fold scalar-to-vector shuffles into DUP/FMOV #166962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -15654,6 +15654,56 @@ SDValue AArch64TargetLowering::LowerBUILD_VECTOR(SDValue Op, | |||||
| } | ||||||
| } | ||||||
|
|
||||||
| // 128-bit NEON integer vectors: | ||||||
| // If BUILD_VECTOR has low half == splat(lane 0) and high half == zero, | ||||||
| // build the low half and return SUBREG_TO_REG(0, Lo, dsub). | ||||||
| // This avoids INSERT_VECTOR_ELT chains and lets later passes assume the | ||||||
| // other lanes are zero. | ||||||
| if (VT.isFixedLengthVector() && VT.getSizeInBits() == 128) { | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Will this work if it is 64 bits vector? Is it valid to have the same for 64 bits vector? |
||||||
| EVT LaneVT = VT.getVectorElementType(); | ||||||
| if (LaneVT.isInteger()) { | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think, this should work on floating point types as well since we don't do any integer specific operations here.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does it needs to be always integer? What happens if this is a float type? It looks like nothing breaks if I remove this statement. |
||||||
| const unsigned HalfElts = NumElts >> 1; | ||||||
| SDValue FirstVal = Op.getOperand(0); | ||||||
|
|
||||||
| auto IsZero = [&](SDValue V) { return isNullConstant(V); }; | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You probably want your check here to be: |
||||||
|
|
||||||
| bool IsLoSplatHiZero = true; | ||||||
| for (unsigned i = 0; i < NumElts; ++i) { | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this for loop could be merged to the big one at the beginning. We don't need to add anything to check the lower half, we can just check if NumDifferentLanes == 2 and for the upper half we can introduce new variable to keep track if upper half is zero or undef. This will simplify guarding of this optimization to single if check. |
||||||
| SDValue Vi = Op.getOperand(i); | ||||||
| bool violates = (i < HalfElts) ? (Vi != FirstVal) : !IsZero(Vi); | ||||||
| if (violates) { | ||||||
| IsLoSplatHiZero = false; | ||||||
| break; | ||||||
| } | ||||||
| } | ||||||
|
|
||||||
| if (IsLoSplatHiZero) { | ||||||
| EVT HalfVT = VT.getHalfNumVectorElementsVT(*DAG.getContext()); | ||||||
| unsigned LaneBits = LaneVT.getSizeInBits(); | ||||||
|
|
||||||
| auto buildSubregToReg = [&](SDValue LoHalf) -> SDValue { | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am not sure this needs to be its own lambda function. |
||||||
| SDValue ZeroImm = DAG.getTargetConstant(0, DL, MVT::i32); | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we need targetConstatnts here ? I think getConstant should be sufficient. |
||||||
| SDValue SubIdx = DAG.getTargetConstant(AArch64::dsub, DL, MVT::i32); | ||||||
| SDNode *N = DAG.getMachineNode(TargetOpcode::SUBREG_TO_REG, DL, VT, | ||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hi - we should ideally not be introducing MachineNodes this early in the pipeline. It is like a layering violation. It looked like it would be possible to use CONCAT(Dup, Zeroes) and get the same result in the simpler tests. Some of the others didn't work so well though. We could also think of doing this for other BV's too, not just dups. CONCAT(BV, Zero) could potentially lower more efficiently than a lot of zero inserts.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah we have tried different nodes here, but we always ended up with redundant moves for some cases. That's why we went with SUBREG_TO_REG. We could create a new ISD node to use here to take care of layering concern here if you would like though. |
||||||
| {ZeroImm, LoHalf, SubIdx}); | ||||||
| return SDValue(N, 0); | ||||||
| }; | ||||||
|
|
||||||
| if (LaneBits == 64) { | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You do a lot of code repetition in this if statement. I think common parts should be pulled out of it and then it can be simplified
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You should also add quick comment to explain why you use different node for this case. |
||||||
| // v2i64 | ||||||
| SDValue First64 = DAG.getZExtOrTrunc(FirstVal, DL, MVT::i64); | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. THere is no need to cast here. I think you can safely assume that the first element is 64 bit already |
||||||
| SDValue Lo = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, HalfVT, First64); | ||||||
| return buildSubregToReg(Lo); | ||||||
| } else { | ||||||
| // v4i32/v8i16/v16i8 | ||||||
| SDValue FirstW = DAG.getZExtOrTrunc(FirstVal, DL, MVT::i32); | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why are you always casting to i32 here ? |
||||||
| SDValue DupLo = DAG.getNode(AArch64ISD::DUP, DL, HalfVT, FirstW); | ||||||
| return buildSubregToReg(DupLo); | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
|
|
||||||
| // Use DUP for non-constant splats. For f32 constant splats, reduce to | ||||||
| // i32 and try again. | ||||||
| if (usesOnlyOneValue) { | ||||||
|
|
||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5 | ||
| ; RUN: llc -mtriple=aarch64-linux-gnu -o - %s | FileCheck %s | ||
|
|
||
| define <2 x i64> @low_vector_splat_v2i64_from_i64(i64 %0){ | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add tests for the floating points as well. |
||
| ; CHECK-LABEL: low_vector_splat_v2i64_from_i64: | ||
| ; CHECK: // %bb.0: | ||
| ; CHECK-NEXT: fmov d0, x0 | ||
| ; CHECK-NEXT: ret | ||
| %2 = insertelement <1 x i64> poison, i64 %0, i64 0 | ||
| %3 = shufflevector <1 x i64> %2, <1 x i64> zeroinitializer, <2 x i32> <i32 0, i32 1> | ||
| ret <2 x i64> %3 | ||
| } | ||
|
|
||
| define <4 x i32> @low_vector_splat_v4i32_from_i32(i32 %0) { | ||
| ; CHECK-LABEL: low_vector_splat_v4i32_from_i32: | ||
| ; CHECK: // %bb.0: | ||
| ; CHECK-NEXT: dup v0.2s, w0 | ||
| ; CHECK-NEXT: ret | ||
| %2 = insertelement <2 x i32> poison, i32 %0, i64 0 | ||
| %3 = shufflevector <2 x i32> %2, <2 x i32> poison, <2 x i32> zeroinitializer | ||
| %4 = shufflevector <2 x i32> %3, <2 x i32> zeroinitializer, <4 x i32> <i32 0, i32 1, i32 2, i32 3> | ||
| ret <4 x i32> %4 | ||
| } | ||
|
|
||
| define <8 x i16> @low_vector_splat_v8i16_from_i16(i16 %0) { | ||
| ; CHECK-LABEL: low_vector_splat_v8i16_from_i16: | ||
| ; CHECK: // %bb.0: | ||
| ; CHECK-NEXT: dup v0.4h, w0 | ||
| ; CHECK-NEXT: ret | ||
| %2 = insertelement <4 x i16> poison, i16 %0, i64 0 | ||
| %3 = shufflevector <4 x i16> %2, <4 x i16> poison, <4 x i32> zeroinitializer | ||
| %4 = shufflevector <4 x i16> %3, <4 x i16> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7> | ||
| ret <8 x i16> %4 | ||
| } | ||
|
|
||
| define <16 x i8> @low_vector_splat_v16i8_from_i8(i8 %0) { | ||
| ; CHECK-LABEL: low_vector_splat_v16i8_from_i8: | ||
| ; CHECK: // %bb.0: | ||
| ; CHECK-NEXT: dup v0.8b, w0 | ||
| ; CHECK-NEXT: ret | ||
| %2 = insertelement <8 x i8> poison, i8 %0, i64 0 | ||
| %3 = shufflevector <8 x i8> %2, <8 x i8> poison, <8 x i32> zeroinitializer | ||
| %4 = shufflevector <8 x i8> %3, <8 x i8> zeroinitializer, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15> | ||
| ret <16 x i8> %4 | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is wrong. I think you want to say that if data is only contained in lower half of the register and the upper half is zeroed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think this comment should be less detailed and at high level explain that we can exploit the fact that 64-bit dup and fmov instructions zero the other 64 bits, to simplify generated code.