[AArch64][SVE2] Enable dynamic shuffle for fixed length types. #72490

Merged (7 commits) on Feb 21, 2024

dtemirbulatov
Contributor

When the SVE register size is unknown, or the minimum size is not equal to the maximum size, we can determine the actual SVE register size at runtime and adjust the shuffle mask at runtime.
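
To make the motivation concrete, here is a minimal scalar sketch of why a two-operand table lookup needs indices that depend on the real register length. It is not LLVM code: `tbl2Model`, `widen` and `shuffleViaTbl2` are hypothetical names, and the model assumes that a two-register lookup indexes the concatenation of both registers and returns zero for out-of-range indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of a two-register table lookup: indices select from the
// concatenation of reg1 and reg2; an out-of-range index yields zero.
std::vector<uint16_t> tbl2Model(const std::vector<uint16_t> &reg1,
                                const std::vector<uint16_t> &reg2,
                                const std::vector<uint16_t> &indices) {
  assert(reg1.size() == reg2.size() && reg1.size() == indices.size());
  const std::size_t lanes = reg1.size();
  std::vector<uint16_t> result(lanes, 0);
  for (std::size_t i = 0; i < lanes; ++i) {
    const std::size_t idx = indices[i];
    if (idx < lanes)
      result[i] = reg1[idx];
    else if (idx < 2 * lanes)
      result[i] = reg2[idx - lanes];
    // Otherwise the index is out of range and the lane stays zero.
  }
  return result;
}

// Hypothetical helper: put a fixed-length vector in the low lanes of a
// register with `lanes` lanes. The upper lanes are zero-filled here.
std::vector<uint16_t> widen(const std::vector<uint16_t> &fixed,
                            std::size_t lanes) {
  assert(fixed.size() <= lanes);
  std::vector<uint16_t> reg(lanes, 0);
  for (std::size_t i = 0; i < fixed.size(); ++i)
    reg[i] = fixed[i];
  return reg;
}

// Shuffle two fixed-length vectors through registers with `actualLanes`
// lanes. `offsetLanes` is the lane count used when computing indices into
// the second operand; the result is only right if it equals `actualLanes`.
std::vector<uint16_t> shuffleViaTbl2(const std::vector<uint16_t> &op1,
                                     const std::vector<uint16_t> &op2,
                                     const std::vector<int> &shuffleMask,
                                     std::size_t actualLanes,
                                     std::size_t offsetLanes) {
  const std::size_t n = op1.size();
  assert(op2.size() == n && shuffleMask.size() <= actualLanes);
  // Unused index lanes get an out-of-range value so they read as zero.
  std::vector<uint16_t> indices(actualLanes, 0xFFFF);
  for (std::size_t i = 0; i < shuffleMask.size(); ++i) {
    std::size_t idx =
        shuffleMask[i] < 0 ? 0 : static_cast<std::size_t>(shuffleMask[i]);
    if (idx >= n) // shufflevector numbering: values >= n refer to op2
      idx = (idx - n) + offsetLanes;
    indices[i] = static_cast<uint16_t>(idx);
  }
  std::vector<uint16_t> full =
      tbl2Model(widen(op1, actualLanes), widen(op2, actualLanes), indices);
  full.resize(n);
  return full;
}
```

With `op1 = {10, 11, 12, 13}`, `op2 = {20, 21, 22, 23}` and mask `{0, 5, 2, 7}`, the model returns `{10, 21, 12, 23}` whenever `offsetLanes` matches `actualLanes` (4 or 8 lanes). Using an offset of 4 on an 8-lane register reads the zero padding of the first register instead, which is the situation the runtime adjustment is meant to avoid.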

@llvmbot
Collaborator

llvmbot commented Nov 16, 2023

@llvm/pr-subscribers-backend-aarch64

Author: Dinar Temirbulatov (dtemirbulatov)

Changes

When the SVE register size is unknown, or the minimum size is not equal to the maximum size, we can determine the actual SVE register size at runtime and adjust the shuffle mask at runtime.


Patch is 28.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/72490.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+59-14)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle-tbl.ll (+389-24)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 9ff6d6f0f565edb..2423ef6f8962a53 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -26123,7 +26123,7 @@ static SDValue GenerateFixedLengthSVETBL(SDValue Op, SDValue Op1, SDValue Op2,
 
   // Ignore two operands if no SVE2 or all index numbers couldn't
   // be represented.
-  if (!IsSingleOp && (!Subtarget.hasSVE2() || MinSVESize != MaxSVESize))
+  if (!IsSingleOp && !Subtarget.hasSVE2())
     return SDValue();
 
   EVT VTOp1 = Op.getOperand(0).getValueType();
@@ -26131,18 +26131,40 @@ static SDValue GenerateFixedLengthSVETBL(SDValue Op, SDValue Op1, SDValue Op2,
   unsigned IndexLen = MinSVESize / BitsPerElt;
   unsigned ElementsPerVectorReg = VTOp1.getVectorNumElements();
   uint64_t MaxOffset = APInt(BitsPerElt, -1, false).getZExtValue();
+  EVT MaskEltType = EVT::getIntegerVT(*DAG.getContext(), BitsPerElt);
+  EVT MaskType = EVT::getVectorVT(*DAG.getContext(), MaskEltType, IndexLen);
+  bool MinMaxEqual = (MinSVESize == MaxSVESize);
   assert(ElementsPerVectorReg <= IndexLen && ShuffleMask.size() <= IndexLen &&
          "Incorrectly legalised shuffle operation");
 
   SmallVector<SDValue, 8> TBLMask;
+  // If MinSVESize is not equal to MaxSVESize then we need to know which
+  // TBL mask element needs adjustment.
+  SmallVector<SDValue, 8> MaskNormalized;
+
+  if (BitsPerElt == 8 && !MinMaxEqual && !IsSingleOp)
+    return SDValue();
+
   for (int Index : ShuffleMask) {
     // Handling poison index value.
     if (Index < 0)
       Index = 0;
     // If we refer to the second operand then we have to add elements
-    // number in hardware register minus number of elements in a type.
-    if ((unsigned)Index >= ElementsPerVectorReg)
-      Index += IndexLen - ElementsPerVectorReg;
+    // number in hardware register minus number of elements in a type in
+    // case if MinSVESize equals to MaxSVESize, otherwise just add normalized
+    // value and record this element in MaskNormalized to be adjusted in the
+    // runtime.
+    if ((unsigned)Index >= ElementsPerVectorReg) {
+      if (!MinMaxEqual) {
+        Index = Index - ElementsPerVectorReg;
+        MaskNormalized.push_back(DAG.getConstant(1, DL, MVT::i64));
+      } else {
+        Index += IndexLen - ElementsPerVectorReg;
+      }
+    } else {
+      if (!MinMaxEqual)
+        MaskNormalized.push_back(DAG.getConstant(0, DL, MVT::i64));
+    }
     // For 8-bit elements and 1024-bit SVE registers and MaxOffset equals
     // to 255, this might point to the last element of in the second operand
     // of the shufflevector, thus we are rejecting this transform.
@@ -26155,11 +26177,12 @@ static SDValue GenerateFixedLengthSVETBL(SDValue Op, SDValue Op1, SDValue Op2,
   // value where it would perform first lane duplication for out of
   // index elements. For i8 elements an out-of-range index could be a valid
   // for 2048-bit vector register size.
-  for (unsigned i = 0; i < IndexLen - ElementsPerVectorReg; ++i)
+  for (unsigned i = 0; i < IndexLen - ElementsPerVectorReg; ++i) {
     TBLMask.push_back(DAG.getConstant((int)MaxOffset, DL, MVT::i64));
+    if (!MinMaxEqual)
+      MaskNormalized.push_back(DAG.getConstant(0, DL, MVT::i64));
+  }
 
-  EVT MaskEltType = EVT::getIntegerVT(*DAG.getContext(), BitsPerElt);
-  EVT MaskType = EVT::getVectorVT(*DAG.getContext(), MaskEltType, IndexLen);
   EVT MaskContainerVT = getContainerForFixedLengthVector(DAG, MaskType);
   SDValue VecMask =
       DAG.getBuildVector(MaskType, DL, ArrayRef(TBLMask.data(), IndexLen));
@@ -26171,13 +26194,35 @@ static SDValue GenerateFixedLengthSVETBL(SDValue Op, SDValue Op1, SDValue Op2,
         DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, ContainerVT,
                     DAG.getConstant(Intrinsic::aarch64_sve_tbl, DL, MVT::i32),
                     Op1, SVEMask);
-  else if (Subtarget.hasSVE2())
-    Shuffle =
-        DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, ContainerVT,
-                    DAG.getConstant(Intrinsic::aarch64_sve_tbl2, DL, MVT::i32),
-                    Op1, Op2, SVEMask);
-  else
-    llvm_unreachable("Cannot lower shuffle without SVE2 TBL");
+  else if (Subtarget.hasSVE2()) {
+    if (!MinMaxEqual) {
+      SDValue VScale = DAG.getVScale(DL, MVT::i32, APInt(32, 1));
+      SDValue Mul =
+          DAG.getNode(ISD::MUL, DL, MVT::i32,
+                      DAG.getConstant(128 / BitsPerElt, DL, MVT::i32), VScale);
+      SDValue VecMask =
+          DAG.getBuildVector(MaskType, DL, ArrayRef(TBLMask.data(), IndexLen));
+      SDValue MulMask = DAG.getBuildVector(
+          MaskType, DL, ArrayRef(MaskNormalized.data(), IndexLen));
+      SDValue SplatPred = DAG.getNode(ISD::SPLAT_VECTOR, DL, MaskType, Mul);
+      SDValue MulMaskNormalized =
+          DAG.getNode(ISD::MUL, DL, MaskType, SplatPred, MulMask);
+      SDValue UpdatedVecMask =
+          DAG.getNode(ISD::ADD, DL, MaskType, VecMask, MulMaskNormalized);
+      EVT MaskContainerVT = getContainerForFixedLengthVector(DAG, MaskType);
+      SDValue SVEMask =
+          convertToScalableVector(DAG, MaskContainerVT, UpdatedVecMask);
+      Shuffle = DAG.getNode(
+          ISD::INTRINSIC_WO_CHAIN, DL, ContainerVT,
+          DAG.getConstant(Intrinsic::aarch64_sve_tbl2, DL, MVT::i32), Op1, Op2,
+          SVEMask);
+    } else {
+      Shuffle = DAG.getNode(
+          ISD::INTRINSIC_WO_CHAIN, DL, ContainerVT,
+          DAG.getConstant(Intrinsic::aarch64_sve_tbl2, DL, MVT::i32), Op1, Op2,
+          SVEMask);
+    }
+  }
   Shuffle = convertFromScalableVector(DAG, VT, Shuffle);
   return DAG.getNode(ISD::BITCAST, DL, Op.getValueType(), Shuffle);
 }
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle-tbl.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle-tbl.ll
index f646319ba5fccb3..3beccdf278ebd10 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle-tbl.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle-tbl.ll
@@ -1,6 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -mattr=+sve2 -force-streaming-compatible-sve -aarch64-sve-vector-bits-min=128 -aarch64-sve-vector-bits-max=128  < %s | FileCheck %s -check-prefixes=CHECK,SVE2_128
 ; RUN: llc -mattr=+sve2 -force-streaming-compatible-sve -aarch64-sve-vector-bits-min=128 < %s | FileCheck %s -check-prefixes=CHECK,SVE2_128_NOMAX
+; RUN: llc -mattr=+sve2 -force-streaming-compatible-sve < %s | FileCheck %s -check-prefixes=CHECK,SVE2_NOMIN_NOMAX
+; RUN: llc -mattr=+sve2 -force-streaming-compatible-sve -aarch64-sve-vector-bits-min=256 < %s | FileCheck %s -check-prefixes=CHECK,SVE2_MIN_256_NOMAX
 
 target triple = "aarch64-unknown-linux-gnu"
 
@@ -16,14 +18,43 @@ target triple = "aarch64-unknown-linux-gnu"
 ; SVE2_128-NEXT:        .byte   255                             // 0xff
 ; SVE2_128-NEXT:        .byte   255                             // 0xff
 define <8 x i8> @shuffle_index_indices_from_op1(ptr %a, ptr %b) {
-; CHECK-LABEL: shuffle_index_indices_from_op1:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    adrp x8, .LCPI0_0
-; CHECK-NEXT:    ldr d0, [x0]
-; CHECK-NEXT:    ldr q1, [x8, :lo12:.LCPI0_0]
-; CHECK-NEXT:    tbl z0.b, { z0.b }, z1.b
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
-; CHECK-NEXT:    ret
+; SVE2_128-LABEL: shuffle_index_indices_from_op1:
+; SVE2_128:       // %bb.0:
+; SVE2_128-NEXT:    adrp x8, .LCPI0_0
+; SVE2_128-NEXT:    ldr d0, [x0]
+; SVE2_128-NEXT:    ldr q1, [x8, :lo12:.LCPI0_0]
+; SVE2_128-NEXT:    tbl z0.b, { z0.b }, z1.b
+; SVE2_128-NEXT:    // kill: def $d0 killed $d0 killed $z0
+; SVE2_128-NEXT:    ret
+;
+; SVE2_128_NOMAX-LABEL: shuffle_index_indices_from_op1:
+; SVE2_128_NOMAX:       // %bb.0:
+; SVE2_128_NOMAX-NEXT:    adrp x8, .LCPI0_0
+; SVE2_128_NOMAX-NEXT:    ldr d0, [x0]
+; SVE2_128_NOMAX-NEXT:    ldr q1, [x8, :lo12:.LCPI0_0]
+; SVE2_128_NOMAX-NEXT:    tbl z0.b, { z0.b }, z1.b
+; SVE2_128_NOMAX-NEXT:    // kill: def $d0 killed $d0 killed $z0
+; SVE2_128_NOMAX-NEXT:    ret
+;
+; SVE2_NOMIN_NOMAX-LABEL: shuffle_index_indices_from_op1:
+; SVE2_NOMIN_NOMAX:       // %bb.0:
+; SVE2_NOMIN_NOMAX-NEXT:    adrp x8, .LCPI0_0
+; SVE2_NOMIN_NOMAX-NEXT:    ldr d0, [x0]
+; SVE2_NOMIN_NOMAX-NEXT:    ldr q1, [x8, :lo12:.LCPI0_0]
+; SVE2_NOMIN_NOMAX-NEXT:    tbl z0.b, { z0.b }, z1.b
+; SVE2_NOMIN_NOMAX-NEXT:    // kill: def $d0 killed $d0 killed $z0
+; SVE2_NOMIN_NOMAX-NEXT:    ret
+;
+; SVE2_MIN_256_NOMAX-LABEL: shuffle_index_indices_from_op1:
+; SVE2_MIN_256_NOMAX:       // %bb.0:
+; SVE2_MIN_256_NOMAX-NEXT:    ptrue p0.b, vl32
+; SVE2_MIN_256_NOMAX-NEXT:    adrp x8, .LCPI0_0
+; SVE2_MIN_256_NOMAX-NEXT:    add x8, x8, :lo12:.LCPI0_0
+; SVE2_MIN_256_NOMAX-NEXT:    ldr d0, [x0]
+; SVE2_MIN_256_NOMAX-NEXT:    ld1b { z1.b }, p0/z, [x8]
+; SVE2_MIN_256_NOMAX-NEXT:    tbl z0.b, { z0.b }, z1.b
+; SVE2_MIN_256_NOMAX-NEXT:    // kill: def $d0 killed $d0 killed $z0
+; SVE2_MIN_256_NOMAX-NEXT:    ret
   %op1 = load <8 x i8>, ptr %a
   %op2 = load <8 x i8>, ptr %b
   %1 = shufflevector <8 x i8> %op1, <8 x i8> %op2, <8 x i32> <i32 0, i32 7, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
@@ -42,14 +73,43 @@ define <8 x i8> @shuffle_index_indices_from_op1(ptr %a, ptr %b) {
 ; SVE2_128-NEXT:        .byte   255                             // 0xff
 ; SVE2_128-NEXT:        .byte   255                             // 0xff
 define <8 x i8> @shuffle_index_indices_from_op2(ptr %a, ptr %b) {
-; CHECK-LABEL: shuffle_index_indices_from_op2:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    adrp x8, .LCPI1_0
-; CHECK-NEXT:    ldr d0, [x1]
-; CHECK-NEXT:    ldr q1, [x8, :lo12:.LCPI1_0]
-; CHECK-NEXT:    tbl z0.b, { z0.b }, z1.b
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
-; CHECK-NEXT:    ret
+; SVE2_128-LABEL: shuffle_index_indices_from_op2:
+; SVE2_128:       // %bb.0:
+; SVE2_128-NEXT:    adrp x8, .LCPI1_0
+; SVE2_128-NEXT:    ldr d0, [x1]
+; SVE2_128-NEXT:    ldr q1, [x8, :lo12:.LCPI1_0]
+; SVE2_128-NEXT:    tbl z0.b, { z0.b }, z1.b
+; SVE2_128-NEXT:    // kill: def $d0 killed $d0 killed $z0
+; SVE2_128-NEXT:    ret
+;
+; SVE2_128_NOMAX-LABEL: shuffle_index_indices_from_op2:
+; SVE2_128_NOMAX:       // %bb.0:
+; SVE2_128_NOMAX-NEXT:    adrp x8, .LCPI1_0
+; SVE2_128_NOMAX-NEXT:    ldr d0, [x1]
+; SVE2_128_NOMAX-NEXT:    ldr q1, [x8, :lo12:.LCPI1_0]
+; SVE2_128_NOMAX-NEXT:    tbl z0.b, { z0.b }, z1.b
+; SVE2_128_NOMAX-NEXT:    // kill: def $d0 killed $d0 killed $z0
+; SVE2_128_NOMAX-NEXT:    ret
+;
+; SVE2_NOMIN_NOMAX-LABEL: shuffle_index_indices_from_op2:
+; SVE2_NOMIN_NOMAX:       // %bb.0:
+; SVE2_NOMIN_NOMAX-NEXT:    adrp x8, .LCPI1_0
+; SVE2_NOMIN_NOMAX-NEXT:    ldr d0, [x1]
+; SVE2_NOMIN_NOMAX-NEXT:    ldr q1, [x8, :lo12:.LCPI1_0]
+; SVE2_NOMIN_NOMAX-NEXT:    tbl z0.b, { z0.b }, z1.b
+; SVE2_NOMIN_NOMAX-NEXT:    // kill: def $d0 killed $d0 killed $z0
+; SVE2_NOMIN_NOMAX-NEXT:    ret
+;
+; SVE2_MIN_256_NOMAX-LABEL: shuffle_index_indices_from_op2:
+; SVE2_MIN_256_NOMAX:       // %bb.0:
+; SVE2_MIN_256_NOMAX-NEXT:    ptrue p0.b, vl32
+; SVE2_MIN_256_NOMAX-NEXT:    adrp x8, .LCPI1_0
+; SVE2_MIN_256_NOMAX-NEXT:    add x8, x8, :lo12:.LCPI1_0
+; SVE2_MIN_256_NOMAX-NEXT:    ldr d0, [x1]
+; SVE2_MIN_256_NOMAX-NEXT:    ld1b { z1.b }, p0/z, [x8]
+; SVE2_MIN_256_NOMAX-NEXT:    tbl z0.b, { z0.b }, z1.b
+; SVE2_MIN_256_NOMAX-NEXT:    // kill: def $d0 killed $d0 killed $z0
+; SVE2_MIN_256_NOMAX-NEXT:    ret
   %op1 = load <8 x i8>, ptr %a
   %op2 = load <8 x i8>, ptr %b
   %1 = shufflevector <8 x i8> %op1, <8 x i8> %op2, <8 x i32> <i32 8, i32 9, i32 9, i32 11, i32 12, i32 15, i32 14, i32 15>
@@ -109,6 +169,70 @@ define <8 x i8> @shuffle_index_indices_from_both_ops(ptr %a, ptr %b) {
 ; SVE2_128_NOMAX-NEXT:    ldr d0, [sp, #8]
 ; SVE2_128_NOMAX-NEXT:    add sp, sp, #16
 ; SVE2_128_NOMAX-NEXT:    ret
+;
+; SVE2_NOMIN_NOMAX-LABEL: shuffle_index_indices_from_both_ops:
+; SVE2_NOMIN_NOMAX:       // %bb.0:
+; SVE2_NOMIN_NOMAX-NEXT:    sub sp, sp, #16
+; SVE2_NOMIN_NOMAX-NEXT:    .cfi_def_cfa_offset 16
+; SVE2_NOMIN_NOMAX-NEXT:    ldr d0, [x1]
+; SVE2_NOMIN_NOMAX-NEXT:    mov z1.b, z0.b[7]
+; SVE2_NOMIN_NOMAX-NEXT:    mov z2.b, z0.b[6]
+; SVE2_NOMIN_NOMAX-NEXT:    mov z3.b, z0.b[4]
+; SVE2_NOMIN_NOMAX-NEXT:    fmov w8, s1
+; SVE2_NOMIN_NOMAX-NEXT:    ldr d1, [x0]
+; SVE2_NOMIN_NOMAX-NEXT:    fmov w9, s2
+; SVE2_NOMIN_NOMAX-NEXT:    mov z2.b, z0.b[3]
+; SVE2_NOMIN_NOMAX-NEXT:    mov z1.b, z1.b[1]
+; SVE2_NOMIN_NOMAX-NEXT:    strb w8, [sp, #15]
+; SVE2_NOMIN_NOMAX-NEXT:    fmov w8, s3
+; SVE2_NOMIN_NOMAX-NEXT:    mov z3.b, z0.b[2]
+; SVE2_NOMIN_NOMAX-NEXT:    strb w9, [sp, #14]
+; SVE2_NOMIN_NOMAX-NEXT:    mov z0.b, z0.b[1]
+; SVE2_NOMIN_NOMAX-NEXT:    fmov w9, s2
+; SVE2_NOMIN_NOMAX-NEXT:    strb w8, [sp, #13]
+; SVE2_NOMIN_NOMAX-NEXT:    strb w8, [sp, #12]
+; SVE2_NOMIN_NOMAX-NEXT:    fmov w8, s3
+; SVE2_NOMIN_NOMAX-NEXT:    strb w9, [sp, #11]
+; SVE2_NOMIN_NOMAX-NEXT:    fmov w9, s0
+; SVE2_NOMIN_NOMAX-NEXT:    strb w8, [sp, #10]
+; SVE2_NOMIN_NOMAX-NEXT:    fmov w8, s1
+; SVE2_NOMIN_NOMAX-NEXT:    strb w9, [sp, #9]
+; SVE2_NOMIN_NOMAX-NEXT:    strb w8, [sp, #8]
+; SVE2_NOMIN_NOMAX-NEXT:    ldr d0, [sp, #8]
+; SVE2_NOMIN_NOMAX-NEXT:    add sp, sp, #16
+; SVE2_NOMIN_NOMAX-NEXT:    ret
+;
+; SVE2_MIN_256_NOMAX-LABEL: shuffle_index_indices_from_both_ops:
+; SVE2_MIN_256_NOMAX:       // %bb.0:
+; SVE2_MIN_256_NOMAX-NEXT:    sub sp, sp, #16
+; SVE2_MIN_256_NOMAX-NEXT:    .cfi_def_cfa_offset 16
+; SVE2_MIN_256_NOMAX-NEXT:    ldr d0, [x1]
+; SVE2_MIN_256_NOMAX-NEXT:    mov z1.b, z0.b[7]
+; SVE2_MIN_256_NOMAX-NEXT:    mov z2.b, z0.b[6]
+; SVE2_MIN_256_NOMAX-NEXT:    mov z3.b, z0.b[4]
+; SVE2_MIN_256_NOMAX-NEXT:    fmov w8, s1
+; SVE2_MIN_256_NOMAX-NEXT:    ldr d1, [x0]
+; SVE2_MIN_256_NOMAX-NEXT:    fmov w9, s2
+; SVE2_MIN_256_NOMAX-NEXT:    mov z2.b, z0.b[3]
+; SVE2_MIN_256_NOMAX-NEXT:    mov z1.b, z1.b[1]
+; SVE2_MIN_256_NOMAX-NEXT:    strb w8, [sp, #15]
+; SVE2_MIN_256_NOMAX-NEXT:    fmov w8, s3
+; SVE2_MIN_256_NOMAX-NEXT:    mov z3.b, z0.b[2]
+; SVE2_MIN_256_NOMAX-NEXT:    strb w9, [sp, #14]
+; SVE2_MIN_256_NOMAX-NEXT:    mov z0.b, z0.b[1]
+; SVE2_MIN_256_NOMAX-NEXT:    fmov w9, s2
+; SVE2_MIN_256_NOMAX-NEXT:    strb w8, [sp, #13]
+; SVE2_MIN_256_NOMAX-NEXT:    strb w8, [sp, #12]
+; SVE2_MIN_256_NOMAX-NEXT:    fmov w8, s3
+; SVE2_MIN_256_NOMAX-NEXT:    strb w9, [sp, #11]
+; SVE2_MIN_256_NOMAX-NEXT:    fmov w9, s0
+; SVE2_MIN_256_NOMAX-NEXT:    strb w8, [sp, #10]
+; SVE2_MIN_256_NOMAX-NEXT:    fmov w8, s1
+; SVE2_MIN_256_NOMAX-NEXT:    strb w9, [sp, #9]
+; SVE2_MIN_256_NOMAX-NEXT:    strb w8, [sp, #8]
+; SVE2_MIN_256_NOMAX-NEXT:    ldr d0, [sp, #8]
+; SVE2_MIN_256_NOMAX-NEXT:    add sp, sp, #16
+; SVE2_MIN_256_NOMAX-NEXT:    ret
   %op1 = load <8 x i8>, ptr %a
   %op2 = load <8 x i8>, ptr %b
   %1 = shufflevector <8 x i8> %op1, <8 x i8> %op2, <8 x i32> <i32 1, i32 9, i32 10, i32 11, i32 12, i32 12, i32 14, i32 15>
@@ -165,6 +289,64 @@ define <8 x i8> @shuffle_index_poison_value(ptr %a, ptr %b) {
 ; SVE2_128_NOMAX-NEXT:    ldr d0, [sp, #8]
 ; SVE2_128_NOMAX-NEXT:    add sp, sp, #16
 ; SVE2_128_NOMAX-NEXT:    ret
+;
+; SVE2_NOMIN_NOMAX-LABEL: shuffle_index_poison_value:
+; SVE2_NOMIN_NOMAX:       // %bb.0:
+; SVE2_NOMIN_NOMAX-NEXT:    sub sp, sp, #16
+; SVE2_NOMIN_NOMAX-NEXT:    .cfi_def_cfa_offset 16
+; SVE2_NOMIN_NOMAX-NEXT:    ldr d0, [x1]
+; SVE2_NOMIN_NOMAX-NEXT:    ldr d3, [x0]
+; SVE2_NOMIN_NOMAX-NEXT:    mov z1.b, z0.b[6]
+; SVE2_NOMIN_NOMAX-NEXT:    mov z2.b, z0.b[4]
+; SVE2_NOMIN_NOMAX-NEXT:    fmov w8, s1
+; SVE2_NOMIN_NOMAX-NEXT:    mov z1.b, z0.b[3]
+; SVE2_NOMIN_NOMAX-NEXT:    fmov w9, s2
+; SVE2_NOMIN_NOMAX-NEXT:    mov z2.b, z0.b[2]
+; SVE2_NOMIN_NOMAX-NEXT:    mov z0.b, z0.b[1]
+; SVE2_NOMIN_NOMAX-NEXT:    strb w8, [sp, #14]
+; SVE2_NOMIN_NOMAX-NEXT:    fmov w8, s1
+; SVE2_NOMIN_NOMAX-NEXT:    mov z1.b, z3.b[1]
+; SVE2_NOMIN_NOMAX-NEXT:    strb w9, [sp, #13]
+; SVE2_NOMIN_NOMAX-NEXT:    strb w9, [sp, #12]
+; SVE2_NOMIN_NOMAX-NEXT:    fmov w9, s2
+; SVE2_NOMIN_NOMAX-NEXT:    strb w8, [sp, #11]
+; SVE2_NOMIN_NOMAX-NEXT:    fmov w8, s0
+; SVE2_NOMIN_NOMAX-NEXT:    strb w9, [sp, #10]
+; SVE2_NOMIN_NOMAX-NEXT:    fmov w9, s1
+; SVE2_NOMIN_NOMAX-NEXT:    strb w8, [sp, #9]
+; SVE2_NOMIN_NOMAX-NEXT:    strb w9, [sp, #8]
+; SVE2_NOMIN_NOMAX-NEXT:    ldr d0, [sp, #8]
+; SVE2_NOMIN_NOMAX-NEXT:    add sp, sp, #16
+; SVE2_NOMIN_NOMAX-NEXT:    ret
+;
+; SVE2_MIN_256_NOMAX-LABEL: shuffle_index_poison_value:
+; SVE2_MIN_256_NOMAX:       // %bb.0:
+; SVE2_MIN_256_NOMAX-NEXT:    sub sp, sp, #16
+; SVE2_MIN_256_NOMAX-NEXT:    .cfi_def_cfa_offset 16
+; SVE2_MIN_256_NOMAX-NEXT:    ldr d0, [x1]
+; SVE2_MIN_256_NOMAX-NEXT:    ldr d3, [x0]
+; SVE2_MIN_256_NOMAX-NEXT:    mov z1.b, z0.b[6]
+; SVE2_MIN_256_NOMAX-NEXT:    mov z2.b, z0.b[4]
+; SVE2_MIN_256_NOMAX-NEXT:    fmov w8, s1
+; SVE2_MIN_256_NOMAX-NEXT:    mov z1.b, z0.b[3]
+; SVE2_MIN_256_NOMAX-NEXT:    fmov w9, s2
+; SVE2_MIN_256_NOMAX-NEXT:    mov z2.b, z0.b[2]
+; SVE2_MIN_256_NOMAX-NEXT:    mov z0.b, z0.b[1]
+; SVE2_MIN_256_NOMAX-NEXT:    strb w8, [sp, #14]
+; SVE2_MIN_256_NOMAX-NEXT:    fmov w8, s1
+; SVE2_MIN_256_NOMAX-NEXT:    mov z1.b, z3.b[1]
+; SVE2_MIN_256_NOMAX-NEXT:    strb w9, [sp, #13]
+; SVE2_MIN_256_NOMAX-NEXT:    strb w9, [sp, #12]
+; SVE2_MIN_256_NOMAX-NEXT:    fmov w9, s2
+; SVE2_MIN_256_NOMAX-NEXT:    strb w8, [sp, #11]
+; SVE2_MIN_256_NOMAX-NEXT:    fmov w8, s0
+; SVE2_MIN_256_NOMAX-NEXT:    strb w9, [sp, #10]
+; SVE2_MIN_256_NOMAX-NEXT:    fmov w9, s1
+; SVE2_MIN_256_NOMAX-NEXT:    strb w8, [sp, #9]
+; SVE2_MIN_256_NOMAX-NEXT:    strb w9, [sp, #8]
+; SVE2_MIN_256_NOMAX-NEXT:    ldr d0, [sp, #8]
+; SVE2_MIN_256_NOMAX-NEXT:    add sp, sp, #16
+; SVE2_MIN_256_NOMAX-NEXT:    ret
   %op1 = load <8 x i8>, ptr %a
   %op2 = load <8 x i8>, ptr %b
   %1 = shufflevector <8 x i8> %op1, <8 x i8> %op2, <8 x i32> <i32 1, i32 9, i32 10, i32 11, i32 12, i32 12, i32 14, i32 poison>
@@ -172,14 +354,43 @@ define <8 x i8> @shuffle_index_poison_value(ptr %a, ptr %b) {
 }
 
 define <8 x i8> @shuffle_op1_poison(ptr %a, ptr %b) {
-; CHECK-LABEL: shuffle_op1_poison:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    adrp x8, .LCPI4_0
-; CHECK-NEXT:    ldr d0, [x1]
-; CHECK-NEXT:    ldr q1, [x8, :lo12:.LCPI4_0]
-; CHECK-NEXT:    tbl z0.b, { z0.b }, z1.b
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $z0
-; CHECK-NEXT:    ret
+; SVE2_128-LABEL: shuffle_op1_poison:
+; SVE2_128:       // %bb.0:
+; SVE2_128-NEXT:    adrp x8, .LCPI4_0
+; SVE2_128-NEXT:    ldr d0, [x1]
+; SVE2_128-NEXT:    ldr q1, [x8, :lo12:.LCPI4_0]
+; SVE2_128-NEXT:    tbl z0.b, { z0.b }, z1.b
+; SVE2_128-NEXT:    // kill: def $d0 killed $d0 killed $z0
+; SVE2_128-NEXT:    ret
+;
+; SVE2_128_NOMAX-LABEL: shuffle_op1_poison:
+; SVE2_128_NOMAX:       // %bb.0:
+; SVE2_128_NOMAX-NEXT:    adrp x8, .LCPI4_0
+; SVE2_128_NOMAX-NEXT:    ldr d0, [x1]
+; SVE2_128_NOMAX-NEXT:    ldr q1, [x8, :lo12:.LCPI4_0]
+; SVE2_128_NOMAX-NEXT:    tbl z0.b, { z0.b }, z1.b
+; SVE2_128_NOMAX-NEXT:    // kill: def $d0 killed $d0 killed $z0
+; SVE2_128_NOMAX-NEXT:    ret
+;
+; SVE2_NOMIN_NOMAX-LABEL: shuffle_op1_poison:
+; SVE2_NOMIN_NOMAX:       // %bb.0:
+; SVE2_NOMIN_NOMAX-NEXT:    adrp x8, .LCPI4_0
+; SVE2_NOMIN_NOMAX-NEXT:    ldr d0, [x1]
+; SVE2_NOMIN_NOMAX-NEXT:    ldr q1, [x8, :lo12:.LCPI4_0]
+; SVE2_NOMIN_NOMAX-NEXT:    tbl z0.b, { z0.b }, z1.b
+; SVE2_NOMIN_NOMAX-NEXT:    // kill: def $d0 killed $d0 killed $z0
+; SVE2_NOMIN_NOMAX-NEXT:    ret
+;
+; SVE2_MIN_256_NOMAX-LABEL: shuffle_op1_poison:
+; SVE2_MIN_256_NOMAX:       // %bb.0:
+; SVE2_MIN_256_NOMAX-NEXT:    ptrue p0.b, vl32
+; SVE2_MIN_256_NOMAX-NEXT:    adrp x8, .LCPI4_0
+; SVE2_MIN_256_NOMAX-NEXT:    add x8, x8, :lo12:.LCPI4_0
+; SVE2_MIN_256_NOMAX-NEXT:    ldr d0, [x1]
+; SVE2_MIN_256_NOMAX-NEXT:    ld1b { z1.b }, p0/z, [x8]
+; SVE2_MIN_256_NOMAX-NEXT:    tbl z0.b, { z0.b }, z1.b
+; SVE2_MIN_256_NOMAX-NEXT:    // kill: def $d0 killed $d0 killed $z0
+; SVE2_MIN_256_NOMAX-NEXT:    ret
   %op2 = load <8 x i8>, ptr %b
   %1 = shufflevector <8 x i8> poison, <8 x i8> %op2, <8 x i32> <i32 1, i32 9, i32 10, i32 11, i32 12, i32 12, i32 14, i32 15>
   ret <8 x i8> %1
@@ -252,3 +463,157 @@ define <8 x i8> @shuffle_index_size_op1_maxhw(ptr %a, ptr %b) "target-features"=
   %1 = shufflevector <8 x i8> %op1, <8 x i8> %op2, <8 x i32> <i32 0, i32 7, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
   ret <8 x i8> %1
 }
+
+; SVE2_128: .LCPI7_0:
+; SVE2_128-NEXT:        .hword  1                               // 0x1
+; SVE2_128-NEXT: ...
[truncated]

@dtemirbulatov dtemirbulatov changed the title [AArch64][SME] Enable dynamic shuffle for fixed length types. WIP: [AArch64][SME] Enable dynamic shuffle for fixed length types. Nov 28, 2023
@dtemirbulatov dtemirbulatov changed the title WIP: [AArch64][SME] Enable dynamic shuffle for fixed length types. [AArch64][SME] Enable dynamic shuffle for fixed length types. Dec 11, 2023
Collaborator

@sdesmalen-arm sdesmalen-arm left a comment

Looks mostly fine to me, with some nits I'd like you to address before landing the patch.

Suggestion on the commit message: it says "SME" but should this be "SVE2"?

  assert(ElementsPerVectorReg <= IndexLen && ShuffleMask.size() <= IndexLen &&
         "Incorrectly legalised shuffle operation");

  SmallVector<SDValue, 8> TBLMask;
  // If MinSVESize is not equal to MaxSVESize then we need to know which
  // TBL mask element needs adjustment.
  SmallVector<SDValue, 8> MulByVLMask;
Collaborator

nit: Should this be named AddRuntimeVLMask (because it adds the runtime vector length to the indices)?

Contributor Author

Done.
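
The two vectors discussed in this thread can be modelled without any LLVM types. The sketch below follows the index handling visible in the diff hunk, under the assumption that each adjusted index is then appended to the TBL mask; the `MaxOffset` rejection check is cut off in the hunk and is not modelled. `SplitMask` and `buildSplitMask` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct SplitMask {
  std::vector<uint64_t> tblMask;      // indices known at compile time
  std::vector<uint64_t> addRuntimeVL; // 1 where the runtime lane count is added
};

// elementsPerVectorReg: element count of the fixed-length type.
// indexLen:             MinSVESize / BitsPerElt.
// maxOffset:            all-ones value of the element type, used as padding.
// minMaxEqual:          true when the register size is known exactly.
SplitMask buildSplitMask(const std::vector<int> &shuffleMask,
                         unsigned elementsPerVectorReg, unsigned indexLen,
                         uint64_t maxOffset, bool minMaxEqual) {
  assert(elementsPerVectorReg <= indexLen && shuffleMask.size() <= indexLen);
  SplitMask out;
  for (int index : shuffleMask) {
    if (index < 0) // poison index is treated as lane 0
      index = 0;
    uint64_t idx = static_cast<uint64_t>(index);
    if (idx >= elementsPerVectorReg) {
      if (minMaxEqual) {
        // Register length is known: skip the padding lanes of operand 1.
        idx += indexLen - elementsPerVectorReg;
      } else {
        // Register length is unknown: keep the lane number within
        // operand 2 and flag it so the lane count is added at runtime.
        idx -= elementsPerVectorReg;
        out.addRuntimeVL.push_back(1);
      }
    } else if (!minMaxEqual) {
      out.addRuntimeVL.push_back(0);
    }
    out.tblMask.push_back(idx);
  }
  // Pad the remaining index lanes with an out-of-range value.
  for (unsigned i = 0; i < indexLen - elementsPerVectorReg; ++i) {
    out.tblMask.push_back(maxOffset);
    if (!minMaxEqual)
      out.addRuntimeVL.push_back(0);
  }
  return out;
}
```

For a 4-element, 16-bit type with `indexLen = 8` and mask `{0, 5, poison, 7}`, the known-size case yields `{0, 9, 0, 11}` followed by four padding values and an empty flag vector. The unknown-size case yields `{0, 1, 0, 3}` plus padding, with flags `{0, 1, 0, 1, 0, 0, 0, 0}`.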

Comment on lines 26166 to 26169
    } else {
      if (!MinMaxEqual)
        MulByVLMask.push_back(DAG.getConstant(0, DL, MVT::i64));
    }
Collaborator

Suggested change
-    } else {
-      if (!MinMaxEqual)
-        MulByVLMask.push_back(DAG.getConstant(0, DL, MVT::i64));
-    }
+    } else if (!MinMaxEqual)
+      MulByVLMask.push_back(DAG.getConstant(0, DL, MVT::i64));

Contributor Author

Done.

Comment on lines 26160 to 26165
      if (!MinMaxEqual) {
        Index = Index - ElementsPerVectorReg;
        MulByVLMask.push_back(DAG.getConstant(1, DL, MVT::i64));
      } else {
        Index += IndexLen - ElementsPerVectorReg;
      }
Collaborator

Suggested change
-      if (!MinMaxEqual) {
-        Index = Index - ElementsPerVectorReg;
-        MulByVLMask.push_back(DAG.getConstant(1, DL, MVT::i64));
-      } else {
-        Index += IndexLen - ElementsPerVectorReg;
-      }
+      if (MinMaxEqual)
+        Index += IndexLen - ElementsPerVectorReg;
+      else {
+        Index = Index - ElementsPerVectorReg;
+        MulByVLMask.push_back(DAG.getConstant(1, DL, MVT::i64));
+      }

Contributor Author

Done.

Comment on lines 26201 to 26204
      SDValue VScale =
          (BitsPerElt == 64)
              ? DAG.getVScale(DL, MVT::i64, APInt(64, 128 / BitsPerElt))
              : DAG.getVScale(DL, MVT::i32, APInt(32, 128 / BitsPerElt));
Collaborator

Suggested change
-      SDValue VScale =
-          (BitsPerElt == 64)
-              ? DAG.getVScale(DL, MVT::i64, APInt(64, 128 / BitsPerElt))
-              : DAG.getVScale(DL, MVT::i32, APInt(32, 128 / BitsPerElt));
+      unsigned MinNumElts = AArch64::SVEBitsPerBlock / BitsPerElt;
+      SDValue VScale =
+          BitsPerElt == 64
+              ? DAG.getVScale(DL, MVT::i64, APInt(64, MinNumElts))
+              : DAG.getVScale(DL, MVT::i32, APInt(32, MinNumElts));

Contributor Author

Done.
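
The arithmetic behind `MinNumElts` can be checked in isolation. The sketch below assumes that the multiplied vscale value is meant to equal the number of lanes in one register, that is `(128 / BitsPerElt) * vscale`. The helper names are hypothetical, and `kSVEBitsPerBlock` simply mirrors the value 128 that `AArch64::SVEBitsPerBlock` replaces in the suggestion. The last helper illustrates one plausible reading of why the patch returns early for two-operand shuffles of 8-bit elements when the register size is not fixed.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kSVEBitsPerBlock = 128; // stand-in for AArch64::SVEBitsPerBlock

// Lanes in one SVE register for a given element width and vscale.
constexpr uint64_t lanesPerRegister(unsigned bitsPerElt, unsigned vscale) {
  return static_cast<uint64_t>(kSVEBitsPerBlock / bitsPerElt) * vscale;
}

// Largest index a two-register lookup can need: the last lane of the
// second register.
constexpr uint64_t maxTwoRegIndex(unsigned bitsPerElt, unsigned vscale) {
  return 2 * lanesPerRegister(bitsPerElt, vscale) - 1;
}

// True if every two-register index fits in an unsigned value that is
// bitsPerElt bits wide. The patch also uses the all-ones value as a
// padding sentinel; that extra restriction is not modelled here.
constexpr bool indicesRepresentable(unsigned bitsPerElt, unsigned vscale) {
  const uint64_t maxValue =
      bitsPerElt >= 64 ? UINT64_MAX : ((uint64_t{1} << bitsPerElt) - 1);
  return maxTwoRegIndex(bitsPerElt, vscale) <= maxValue;
}
```

For 32-bit elements this gives 4 lanes at vscale 1 and 16 lanes at vscale 4. For 8-bit elements at vscale 16 (a 2048-bit register) there are 256 lanes per register, so second-operand indices run up to 511 and no longer fit in a byte, whereas 16-bit elements at the same vscale are unaffected.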

Comment on lines 26155 to 26158
    // number in hardware register minus number of elements in a type in
    // case if MinSVESize equals to MaxSVESize, otherwise just add normalized
    // value and record this element in MulByVLMask to be adjusted in the
    // runtime.
Collaborator

I would phrase this comment as:

If the mask refers to elements in the second operand, then we have to offset the index by the number of elements in a vector. If this number is not known at compile time, we need to maintain a mask with 'VL' values to add at runtime.

Contributor Author

Done.
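
The "mask with 'VL' values to add at runtime" can be read as the element-wise update `UpdatedVecMask = VecMask + splat(VL) * AddRuntimeVLMask`, which is what the SPLAT_VECTOR, MUL and ADD nodes in the diff appear to compute. A minimal scalar sketch of that step follows; `applyRuntimeVL` is a hypothetical name, and `runtimeLanes` stands for the lane count that is only known at runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Element-wise model of: updated = vecMask + splat(runtimeLanes) * flags.
// flags[i] is 1 for indices that refer to the second operand, else 0.
std::vector<uint32_t> applyRuntimeVL(const std::vector<uint32_t> &vecMask,
                                     const std::vector<uint32_t> &flags,
                                     uint32_t runtimeLanes) {
  assert(vecMask.size() == flags.size());
  std::vector<uint32_t> updated(vecMask.size());
  for (std::size_t i = 0; i < vecMask.size(); ++i)
    updated[i] = vecMask[i] + runtimeLanes * flags[i];
  return updated;
}
```

With `vecMask = {0, 1, 0, 3}` and flags `{0, 1, 0, 1}`, a 4-lane register gives `{0, 5, 0, 7}` and an 8-lane register gives `{0, 9, 0, 11}`, so the same compile-time constants serve both register sizes.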

@dtemirbulatov dtemirbulatov changed the title [AArch64][SME] Enable dynamic shuffle for fixed length types. [AArch64][SVE2] Enable dynamic shuffle for fixed length types. Feb 13, 2024
Collaborator

@sdesmalen-arm sdesmalen-arm left a comment

Thanks for making the changes @dtemirbulatov! LGTM

When SVE register size is unknown or the minimal size is not equal to
the maximum size then we could determine the actual SVE register size in
the runtime and adjust shuffle mask in the runtime.
@dtemirbulatov dtemirbulatov merged commit 5a023f5 into llvm:main Feb 21, 2024
4 checks passed