
[PPC] generate stxvw4x/lxvw4x on P7 #87049

Open
wants to merge 2 commits into base: main

Conversation

chenzheng1030
Collaborator

My understanding is that we should also use stxvw4x/lxvw4x on P7, even for unaligned loads/stores.

The P8 ISA and the P7 ISA both indicate that these two instructions handle unaligned addresses correctly.
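
For illustration, here is a reduced version of the existing case in llvm/test/CodeGen/PowerPC/unal-vec-ldst.ll (the llc invocation is an assumption modeled on the other tests touched by this patch). On pwr7 the under-aligned load currently lowers to an lvsl/lvx/vperm realignment sequence; with this change it becomes a single lxvw4x, matching the updated CHECK line below.

; RUN: llc -verify-machineinstrs -mtriple=powerpc64-unknown-linux-gnu -mcpu=pwr7 < %s | FileCheck %s

define <4 x i32> @test_l_v4i32(ptr %p) {
; CHECK-LABEL: test_l_v4i32:
; CHECK:    lxvw4x 34, 0, 3
entry:
  ; 4-byte alignment, below the 16-byte ABI alignment for <4 x i32>.
  %r = load <4 x i32>, ptr %p, align 4
  ret <4 x i32> %r
}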

@llvmbot
Collaborator

llvmbot commented Mar 29, 2024

@llvm/pr-subscribers-backend-powerpc

Author: Chen Zheng (chenzheng1030)

Changes

My understanding is that we should also use stxvw4x/lxvw4x on P7, even for unaligned loads/stores.

The P8 ISA and the P7 ISA both indicate that these two instructions handle unaligned addresses correctly.


Patch is 23.37 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/87049.diff

10 Files Affected:

  • (modified) llvm/lib/Target/PowerPC/PPCISelLowering.cpp (+4-5)
  • (modified) llvm/test/CodeGen/PowerPC/aix-vec-arg-spills-mir.ll (+27-34)
  • (modified) llvm/test/CodeGen/PowerPC/aix-vec-arg-spills.ll (+49-55)
  • (modified) llvm/test/CodeGen/PowerPC/memcpy-vec.ll (+4-7)
  • (modified) llvm/test/CodeGen/PowerPC/unal-altivec-wint.ll (+3-3)
  • (modified) llvm/test/CodeGen/PowerPC/unal-altivec2.ll (+1-1)
  • (modified) llvm/test/CodeGen/PowerPC/unal-vec-ldst.ll (+12-48)
  • (modified) llvm/test/CodeGen/PowerPC/unal-vec-negarith.ll (+1-1)
  • (modified) llvm/test/CodeGen/PowerPC/unaligned.ll (+2-6)
  • (modified) llvm/test/CodeGen/PowerPC/vsx.ll (+3-15)
diff --git a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
index 7436b202fba0d9..289e0bc29c4a55 100644
--- a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
+++ b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
@@ -15965,8 +15965,8 @@ SDValue PPCTargetLowering::PerformDAGCombine(SDNode *N,
     Align ABIAlignment = DAG.getDataLayout().getABITypeAlign(Ty);
     if (LD->isUnindexed() && VT.isVector() &&
         ((Subtarget.hasAltivec() && ISD::isNON_EXTLoad(N) &&
-          // P8 and later hardware should just use LOAD.
-          !Subtarget.hasP8Vector() &&
+          // Hardware with VSX should just use LOAD.
+          !Subtarget.hasVSX() &&
           (VT == MVT::v16i8 || VT == MVT::v8i16 || VT == MVT::v4i32 ||
            VT == MVT::v4f32))) &&
         LD->getAlign() < ABIAlignment) {
@@ -17250,8 +17250,7 @@ bool PPCTargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
 EVT PPCTargetLowering::getOptimalMemOpType(
     const MemOp &Op, const AttributeList &FuncAttributes) const {
   if (getTargetMachine().getOptLevel() != CodeGenOptLevel::None) {
-    // We should use Altivec/VSX loads and stores when available. For unaligned
-    // addresses, unaligned VSX loads are only fast starting with the P8.
+    // We should use Altivec/VSX loads and stores when available.
     if (Subtarget.hasAltivec() && Op.size() >= 16) {
       if (Op.isMemset() && Subtarget.hasVSX()) {
         uint64_t TailSize = Op.size() % 16;
@@ -17263,7 +17262,7 @@ EVT PPCTargetLowering::getOptimalMemOpType(
         }
         return MVT::v4i32;
       }
-      if (Op.isAligned(Align(16)) || Subtarget.hasP8Vector())
+      if (Op.isAligned(Align(16)) || Subtarget.hasVSX())
         return MVT::v4i32;
     }
   }
diff --git a/llvm/test/CodeGen/PowerPC/aix-vec-arg-spills-mir.ll b/llvm/test/CodeGen/PowerPC/aix-vec-arg-spills-mir.ll
index 7c45958a1c2ff9..d927b9edb74d1d 100644
--- a/llvm/test/CodeGen/PowerPC/aix-vec-arg-spills-mir.ll
+++ b/llvm/test/CodeGen/PowerPC/aix-vec-arg-spills-mir.ll
@@ -1,4 +1,4 @@
-; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
+; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
 ; RUN: llc -verify-machineinstrs -mcpu=pwr7 -mattr=+altivec -vec-extabi \
 ; RUN:     -stop-after=machine-cp -mtriple powerpc-ibm-aix-xcoff < %s | \
 ; RUN:   FileCheck %s --check-prefix=MIR32
@@ -12,21 +12,16 @@
 @__const.caller.t = private unnamed_addr constant %struct.Test { double 0.000000e+00, double 1.000000e+00, double 2.000000e+00, double 3.000000e+00 }, align 8
 
 define double @caller() {
-
   ; MIR32-LABEL: name: caller
   ; MIR32: bb.0.entry:
-  ; MIR32-NEXT:   renamable $r3 = LI 0
-  ; MIR32-NEXT:   renamable $r4 = LIS 16392
-  ; MIR32-NEXT:   STW killed renamable $r4, 180, $r1 :: (store (s32) into unknown-address + 24)
-  ; MIR32-NEXT:   renamable $r4 = LIS 16384
-  ; MIR32-NEXT:   STW renamable $r3, 184, $r1 :: (store (s32) into unknown-address + 28)
-  ; MIR32-NEXT:   STW renamable $r3, 176, $r1 :: (store (s32) into unknown-address + 20)
-  ; MIR32-NEXT:   STW killed renamable $r4, 172, $r1 :: (store (s32) into unknown-address + 16)
-  ; MIR32-NEXT:   STW renamable $r3, 168, $r1 :: (store (s32) into unknown-address + 12)
-  ; MIR32-NEXT:   renamable $r4 = LIS 16368
-  ; MIR32-NEXT:   STW killed renamable $r4, 164, $r1 :: (store (s32) into unknown-address + 8)
-  ; MIR32-NEXT:   STW renamable $r3, 160, $r1 :: (store (s32) into unknown-address + 4)
-  ; MIR32-NEXT:   STW killed renamable $r3, 156, $r1 :: (store (s32))
+  ; MIR32-NEXT:   renamable $r3 = LWZtoc @__const.caller.t, $r2 :: (load (s32) from got)
+  ; MIR32-NEXT:   renamable $r4 = LI 16
+  ; MIR32-NEXT:   renamable $vsl0 = LXVW4X renamable $r3, killed renamable $r4 :: (load (s128) from unknown-address + 16, align 8)
+  ; MIR32-NEXT:   renamable $r4 = LI 172
+  ; MIR32-NEXT:   STXVW4X killed renamable $vsl0, $r1, killed renamable $r4 :: (store (s128) into unknown-address + 16, align 4)
+  ; MIR32-NEXT:   renamable $vsl0 = LXVW4X $zero, killed renamable $r3 :: (load (s128), align 8)
+  ; MIR32-NEXT:   renamable $r3 = LI 156
+  ; MIR32-NEXT:   STXVW4X killed renamable $vsl0, $r1, killed renamable $r3 :: (store (s128), align 4)
   ; MIR32-NEXT:   ADJCALLSTACKDOWN 188, 0, implicit-def dead $r1, implicit $r1
   ; MIR32-NEXT:   renamable $vsl0 = XXLXORz
   ; MIR32-NEXT:   renamable $r3 = LI 136
@@ -78,32 +73,30 @@ define double @caller() {
   ;
   ; MIR64-LABEL: name: caller
   ; MIR64: bb.0.entry:
-  ; MIR64-NEXT:   renamable $x3 = LI8 2049
-  ; MIR64-NEXT:   renamable $x4 = LI8 1
-  ; MIR64-NEXT:   renamable $x3 = RLDIC killed renamable $x3, 51, 1
-  ; MIR64-NEXT:   STD killed renamable $x3, 216, $x1 :: (store (s64) into unknown-address + 24, align 4)
-  ; MIR64-NEXT:   renamable $x3 = LI8 1023
-  ; MIR64-NEXT:   renamable $x4 = RLDIC killed renamable $x4, 62, 1
-  ; MIR64-NEXT:   STD killed renamable $x4, 208, $x1 :: (store (s64) into unknown-address + 16, align 4)
-  ; MIR64-NEXT:   renamable $x4 = LI8 0
-  ; MIR64-NEXT:   STD renamable $x4, 192, $x1 :: (store (s64), align 4)
-  ; MIR64-NEXT:   renamable $x3 = RLDIC killed renamable $x3, 52, 2
-  ; MIR64-NEXT:   STD killed renamable $x3, 200, $x1 :: (store (s64) into unknown-address + 8, align 4)
+  ; MIR64-NEXT:   renamable $x3 = LDtoc @__const.caller.t, $x2 :: (load (s64) from got)
+  ; MIR64-NEXT:   renamable $x4 = LI8 16
+  ; MIR64-NEXT:   renamable $vsl0 = LXVW4X renamable $x3, killed renamable $x4 :: (load (s128) from unknown-address + 16, align 8)
+  ; MIR64-NEXT:   renamable $x4 = LI8 208
+  ; MIR64-NEXT:   STXVW4X killed renamable $vsl0, $x1, killed renamable $x4 :: (store (s128) into unknown-address + 16, align 4)
+  ; MIR64-NEXT:   renamable $vsl0 = LXVW4X $zero8, killed renamable $x3 :: (load (s128), align 8)
+  ; MIR64-NEXT:   renamable $x3 = LI8 192
+  ; MIR64-NEXT:   STXVW4X killed renamable $vsl0, $x1, killed renamable $x3 :: (store (s128), align 4)
   ; MIR64-NEXT:   ADJCALLSTACKDOWN 224, 0, implicit-def dead $r1, implicit $r1
   ; MIR64-NEXT:   renamable $vsl0 = XXLXORz
   ; MIR64-NEXT:   renamable $x3 = LI8 160
+  ; MIR64-NEXT:   renamable $x4 = LI8 144
   ; MIR64-NEXT:   STXVW4X renamable $vsl0, $x1, killed renamable $x3 :: (store (s128), align 8)
-  ; MIR64-NEXT:   renamable $x3 = LI8 144
-  ; MIR64-NEXT:   STXVW4X renamable $vsl0, $x1, killed renamable $x3 :: (store (s128), align 8)
+  ; MIR64-NEXT:   STXVW4X renamable $vsl0, $x1, killed renamable $x4 :: (store (s128), align 8)
   ; MIR64-NEXT:   renamable $x3 = LI8 128
+  ; MIR64-NEXT:   renamable $x4 = LDtocCPT %const.0, $x2 :: (load (s64) from got)
   ; MIR64-NEXT:   STXVW4X killed renamable $vsl0, $x1, killed renamable $x3 :: (store (s128), align 8)
-  ; MIR64-NEXT:   renamable $x3 = LDtocCPT %const.0, $x2 :: (load (s64) from got)
-  ; MIR64-NEXT:   renamable $vsl0 = LXVD2X $zero8, killed renamable $x3 :: (load (s128) from constant-pool)
   ; MIR64-NEXT:   renamable $x3 = LI8 80
+  ; MIR64-NEXT:   renamable $vsl0 = LXVD2X $zero8, killed renamable $x4 :: (load (s128) from constant-pool)
+  ; MIR64-NEXT:   renamable $x4 = LI8 512
   ; MIR64-NEXT:   STXVD2X killed renamable $vsl0, $x1, killed renamable $x3 :: (store (s128))
-  ; MIR64-NEXT:   renamable $x3 = LI8 512
-  ; MIR64-NEXT:   STD killed renamable $x3, 184, $x1 :: (store (s64))
-  ; MIR64-NEXT:   STD killed renamable $x4, 176, $x1 :: (store (s64))
+  ; MIR64-NEXT:   renamable $x3 = LI8 0
+  ; MIR64-NEXT:   STD killed renamable $x4, 184, $x1 :: (store (s64))
+  ; MIR64-NEXT:   STD killed renamable $x3, 176, $x1 :: (store (s64))
   ; MIR64-NEXT:   $f1 = XXLXORdpz
   ; MIR64-NEXT:   $f2 = XXLXORdpz
   ; MIR64-NEXT:   $v2 = XXLXORz
@@ -112,8 +105,8 @@ define double @caller() {
   ; MIR64-NEXT:   $v5 = XXLXORz
   ; MIR64-NEXT:   $v6 = XXLXORz
   ; MIR64-NEXT:   $x3 = LI8 128
-  ; MIR64-NEXT:   $x4 = LI8 256
   ; MIR64-NEXT:   $v7 = XXLXORz
+  ; MIR64-NEXT:   $x4 = LI8 256
   ; MIR64-NEXT:   $v8 = XXLXORz
   ; MIR64-NEXT:   $v9 = XXLXORz
   ; MIR64-NEXT:   $v10 = XXLXORz
@@ -136,7 +129,7 @@ define double @caller() {
   ; MIR64-NEXT:   BLR8 implicit $lr8, implicit $rm, implicit $f1
   entry:
     %call = tail call double @callee(i32 signext 128, i32 signext 256, double 0.000000e+00, double 0.000000e+00, <2 x double> <double 0.000000e+00, double 0.000000e+00>, <2 x double> <double 0.000000e+00, double 0.000000e+00>, <2 x double> <double 0.000000e+00, double 0.000000e+00>, <2 x double> <double 0.000000e+00, double 0.000000e+00>, <2 x double> <double 0.000000e+00, double 0.000000e+00>, <2 x double> <double 0.000000e+00, double 0.000000e+00>, <2 x double> <double 0.000000e+00, double 0.000000e+00>, <2 x double> <double 0.000000e+00, double 0.000000e+00>, <2 x double> <double 0.000000e+00, double 0.000000e+00>, <2 x double> <double 0.000000e+00, double 0.000000e+00>, <2 x double> <double 0.000000e+00, double 0.000000e+00>, <2 x double> <double 0.000000e+00, double 0.000000e+00>, <2 x double> <double 2.400000e+01, double 2.500000e+01>, double 0.000000e+00, double 0.000000e+00, double 0.000000e+00, double 0.000000e+00, double 0.000000e+00, double 0.000000e+00, double 0.000000e+00, double 0.000000e+00, double 0.000000e+00, double 0.000000e+00, double 0.000000e+00, i32 signext 512, ptr nonnull byval(%struct.Test) align 4 @__const.caller.t)
-      ret double %call
+  ret double %call
 }
 
 declare double @callee(i32 signext, i32 signext, double, double, <2 x double>, <2 x double>, <2 x double>, <2 x double>, <2 x double>, <2 x double>, <2 x double>, <2 x double>, <2 x double>, <2 x double>, <2 x double>, <2 x double>, <2 x double>, double, double, double, double, double, double, double, double, double, double, double, i32 signext, ptr byval(%struct.Test) align 8)
diff --git a/llvm/test/CodeGen/PowerPC/aix-vec-arg-spills.ll b/llvm/test/CodeGen/PowerPC/aix-vec-arg-spills.ll
index 66f88b4e3d5ab3..91e7d4094fc344 100644
--- a/llvm/test/CodeGen/PowerPC/aix-vec-arg-spills.ll
+++ b/llvm/test/CodeGen/PowerPC/aix-vec-arg-spills.ll
@@ -15,56 +15,52 @@ define double @caller() {
 ; 32BIT:       # %bb.0: # %entry
 ; 32BIT-NEXT:    mflr 0
 ; 32BIT-NEXT:    stwu 1, -192(1)
-; 32BIT-NEXT:    lis 4, 16392
+; 32BIT-NEXT:    lwz 3, L..C0(2) # @__const.caller.t
+; 32BIT-NEXT:    li 4, 16
 ; 32BIT-NEXT:    stw 0, 200(1)
-; 32BIT-NEXT:    li 3, 0
-; 32BIT-NEXT:    xxlxor 0, 0, 0
 ; 32BIT-NEXT:    xxlxor 1, 1, 1
-; 32BIT-NEXT:    stw 4, 180(1)
-; 32BIT-NEXT:    lis 4, 16384
-; 32BIT-NEXT:    stw 3, 184(1)
-; 32BIT-NEXT:    stw 3, 176(1)
-; 32BIT-NEXT:    stw 4, 172(1)
-; 32BIT-NEXT:    lis 4, 16368
-; 32BIT-NEXT:    stw 3, 168(1)
-; 32BIT-NEXT:    stw 3, 160(1)
-; 32BIT-NEXT:    stw 4, 164(1)
-; 32BIT-NEXT:    stw 3, 156(1)
-; 32BIT-NEXT:    li 3, 136
-; 32BIT-NEXT:    li 4, 120
 ; 32BIT-NEXT:    xxlxor 2, 2, 2
-; 32BIT-NEXT:    stxvw4x 0, 1, 3
-; 32BIT-NEXT:    li 3, 104
-; 32BIT-NEXT:    stxvw4x 0, 1, 4
-; 32BIT-NEXT:    li 4, 88
-; 32BIT-NEXT:    stxvw4x 0, 1, 3
-; 32BIT-NEXT:    stxvw4x 0, 1, 4
-; 32BIT-NEXT:    lwz 4, L..C0(2) # %const.0
-; 32BIT-NEXT:    li 3, 72
-; 32BIT-NEXT:    stxvw4x 0, 1, 3
-; 32BIT-NEXT:    li 3, 48
+; 32BIT-NEXT:    lxvw4x 0, 3, 4
+; 32BIT-NEXT:    li 4, 172
 ; 32BIT-NEXT:    xxlxor 34, 34, 34
 ; 32BIT-NEXT:    xxlxor 35, 35, 35
-; 32BIT-NEXT:    lxvd2x 0, 0, 4
-; 32BIT-NEXT:    li 4, 512
+; 32BIT-NEXT:    stxvw4x 0, 1, 4
+; 32BIT-NEXT:    li 4, 120
 ; 32BIT-NEXT:    xxlxor 36, 36, 36
 ; 32BIT-NEXT:    xxlxor 37, 37, 37
 ; 32BIT-NEXT:    xxlxor 38, 38, 38
+; 32BIT-NEXT:    lxvw4x 0, 0, 3
+; 32BIT-NEXT:    li 3, 156
 ; 32BIT-NEXT:    xxlxor 39, 39, 39
+; 32BIT-NEXT:    stxvw4x 0, 1, 3
+; 32BIT-NEXT:    xxlxor 0, 0, 0
+; 32BIT-NEXT:    li 3, 136
 ; 32BIT-NEXT:    xxlxor 40, 40, 40
 ; 32BIT-NEXT:    xxlxor 41, 41, 41
+; 32BIT-NEXT:    stxvw4x 0, 1, 3
+; 32BIT-NEXT:    li 3, 104
+; 32BIT-NEXT:    stxvw4x 0, 1, 4
+; 32BIT-NEXT:    li 4, 88
 ; 32BIT-NEXT:    xxlxor 42, 42, 42
-; 32BIT-NEXT:    stxvd2x 0, 1, 3
-; 32BIT-NEXT:    stw 4, 152(1)
-; 32BIT-NEXT:    li 3, 128
-; 32BIT-NEXT:    li 4, 256
 ; 32BIT-NEXT:    xxlxor 43, 43, 43
 ; 32BIT-NEXT:    xxlxor 44, 44, 44
+; 32BIT-NEXT:    stxvw4x 0, 1, 3
+; 32BIT-NEXT:    stxvw4x 0, 1, 4
+; 32BIT-NEXT:    lwz 4, L..C1(2) # %const.0
+; 32BIT-NEXT:    li 3, 72
 ; 32BIT-NEXT:    xxlxor 45, 45, 45
+; 32BIT-NEXT:    stxvw4x 0, 1, 3
+; 32BIT-NEXT:    li 3, 48
 ; 32BIT-NEXT:    xxlxor 3, 3, 3
 ; 32BIT-NEXT:    xxlxor 4, 4, 4
+; 32BIT-NEXT:    lxvd2x 0, 0, 4
+; 32BIT-NEXT:    li 4, 512
 ; 32BIT-NEXT:    xxlxor 5, 5, 5
 ; 32BIT-NEXT:    xxlxor 6, 6, 6
+; 32BIT-NEXT:    stxvd2x 0, 1, 3
+; 32BIT-NEXT:    stw 4, 152(1)
+; 32BIT-NEXT:    li 3, 128
+; 32BIT-NEXT:    li 4, 256
 ; 32BIT-NEXT:    xxlxor 7, 7, 7
 ; 32BIT-NEXT:    xxlxor 8, 8, 8
 ; 32BIT-NEXT:    xxlxor 9, 9, 9
@@ -83,54 +79,52 @@ define double @caller() {
 ; 64BIT:       # %bb.0: # %entry
 ; 64BIT-NEXT:    mflr 0
 ; 64BIT-NEXT:    stdu 1, -224(1)
-; 64BIT-NEXT:    li 3, 2049
+; 64BIT-NEXT:    ld 3, L..C0(2) # @__const.caller.t
+; 64BIT-NEXT:    li 4, 16
 ; 64BIT-NEXT:    std 0, 240(1)
-; 64BIT-NEXT:    li 4, 1
-; 64BIT-NEXT:    xxlxor 0, 0, 0
 ; 64BIT-NEXT:    xxlxor 1, 1, 1
-; 64BIT-NEXT:    rldic 3, 3, 51, 1
-; 64BIT-NEXT:    rldic 4, 4, 62, 1
 ; 64BIT-NEXT:    xxlxor 2, 2, 2
+; 64BIT-NEXT:    lxvw4x 0, 3, 4
+; 64BIT-NEXT:    li 4, 208
 ; 64BIT-NEXT:    xxlxor 34, 34, 34
-; 64BIT-NEXT:    std 3, 216(1)
-; 64BIT-NEXT:    li 3, 1023
-; 64BIT-NEXT:    std 4, 208(1)
-; 64BIT-NEXT:    li 4, 0
 ; 64BIT-NEXT:    xxlxor 35, 35, 35
+; 64BIT-NEXT:    stxvw4x 0, 1, 4
+; 64BIT-NEXT:    li 4, 144
 ; 64BIT-NEXT:    xxlxor 36, 36, 36
-; 64BIT-NEXT:    rldic 3, 3, 52, 2
-; 64BIT-NEXT:    std 4, 192(1)
 ; 64BIT-NEXT:    xxlxor 37, 37, 37
 ; 64BIT-NEXT:    xxlxor 38, 38, 38
+; 64BIT-NEXT:    lxvw4x 0, 0, 3
+; 64BIT-NEXT:    li 3, 192
 ; 64BIT-NEXT:    xxlxor 39, 39, 39
-; 64BIT-NEXT:    std 3, 200(1)
+; 64BIT-NEXT:    stxvw4x 0, 1, 3
+; 64BIT-NEXT:    xxlxor 0, 0, 0
 ; 64BIT-NEXT:    li 3, 160
 ; 64BIT-NEXT:    xxlxor 40, 40, 40
-; 64BIT-NEXT:    stxvw4x 0, 1, 3
-; 64BIT-NEXT:    li 3, 144
 ; 64BIT-NEXT:    xxlxor 41, 41, 41
-; 64BIT-NEXT:    xxlxor 42, 42, 42
 ; 64BIT-NEXT:    stxvw4x 0, 1, 3
+; 64BIT-NEXT:    stxvw4x 0, 1, 4
+; 64BIT-NEXT:    ld 4, L..C1(2) # %const.0
 ; 64BIT-NEXT:    li 3, 128
+; 64BIT-NEXT:    xxlxor 42, 42, 42
 ; 64BIT-NEXT:    xxlxor 43, 43, 43
 ; 64BIT-NEXT:    stxvw4x 0, 1, 3
-; 64BIT-NEXT:    ld 3, L..C0(2) # %const.0
+; 64BIT-NEXT:    li 3, 80
 ; 64BIT-NEXT:    xxlxor 44, 44, 44
 ; 64BIT-NEXT:    xxlxor 45, 45, 45
-; 64BIT-NEXT:    lxvd2x 0, 0, 3
-; 64BIT-NEXT:    li 3, 80
+; 64BIT-NEXT:    lxvd2x 0, 0, 4
+; 64BIT-NEXT:    li 4, 512
 ; 64BIT-NEXT:    xxlxor 3, 3, 3
-; 64BIT-NEXT:    xxlxor 4, 4, 4
-; 64BIT-NEXT:    xxlxor 5, 5, 5
 ; 64BIT-NEXT:    stxvd2x 0, 1, 3
-; 64BIT-NEXT:    li 3, 512
-; 64BIT-NEXT:    std 4, 176(1)
+; 64BIT-NEXT:    li 3, 0
+; 64BIT-NEXT:    std 4, 184(1)
 ; 64BIT-NEXT:    li 4, 256
+; 64BIT-NEXT:    xxlxor 4, 4, 4
+; 64BIT-NEXT:    xxlxor 5, 5, 5
 ; 64BIT-NEXT:    xxlxor 6, 6, 6
+; 64BIT-NEXT:    std 3, 176(1)
+; 64BIT-NEXT:    li 3, 128
 ; 64BIT-NEXT:    xxlxor 7, 7, 7
 ; 64BIT-NEXT:    xxlxor 8, 8, 8
-; 64BIT-NEXT:    std 3, 184(1)
-; 64BIT-NEXT:    li 3, 128
 ; 64BIT-NEXT:    xxlxor 9, 9, 9
 ; 64BIT-NEXT:    xxlxor 10, 10, 10
 ; 64BIT-NEXT:    xxlxor 11, 11, 11
diff --git a/llvm/test/CodeGen/PowerPC/memcpy-vec.ll b/llvm/test/CodeGen/PowerPC/memcpy-vec.ll
index d636921eea3e51..34a7af4bc45916 100644
--- a/llvm/test/CodeGen/PowerPC/memcpy-vec.ll
+++ b/llvm/test/CodeGen/PowerPC/memcpy-vec.ll
@@ -10,12 +10,8 @@ entry:
   ret void
 
 ; PWR7-LABEL: @foo1
-; PWR7-NOT: bl memcpy
-; PWR7-DAG: li [[OFFSET:[0-9]+]], 16
-; PWR7-DAG: lxvd2x [[TMP0:[0-9]+]], 4, [[OFFSET]]
-; PWR7-DAG: stxvd2x [[TMP0]], 3, [[OFFSET]]
-; PWR7-DAG: lxvd2x [[TMP1:[0-9]+]], 0, 4
-; PWR7-DAG: stxvd2x [[TMP1]], 0, 3
+; PWR7: lxvw4x
+; PWR7: stxvw4x
 ; PWR7: blr
 
 ; PWR8-LABEL: @foo1
@@ -34,7 +30,8 @@ entry:
   ret void
 
 ; PWR7-LABEL: @foo2
-; PWR7: bl memcpy
+; PWR7: lxvw4x
+; PWR7: stxvw4x
 ; PWR7: blr
 
 ; PWR8-LABEL: @foo2
diff --git a/llvm/test/CodeGen/PowerPC/unal-altivec-wint.ll b/llvm/test/CodeGen/PowerPC/unal-altivec-wint.ll
index d6244cd828e5a6..a590f54b5a6765 100644
--- a/llvm/test/CodeGen/PowerPC/unal-altivec-wint.ll
+++ b/llvm/test/CodeGen/PowerPC/unal-altivec-wint.ll
@@ -17,7 +17,7 @@ entry:
 ; CHECK-LABEL: @test1
 ; CHECK: li [[REG:[0-9]+]], 16
 ; CHECK-NOT: li {{[0-9]+}}, 15
-; CHECK-DAG: lvx {{[0-9]+}}, 0, 3
+; CHECK-DAG: lxvw4x {{[0-9]+}}, 0, 3
 ; CHECK-DAG: lvx {{[0-9]+}}, 3, [[REG]]
 ; CHECK: blr
 }
@@ -36,8 +36,8 @@ entry:
 ; CHECK-LABEL: @test2
 ; CHECK: li [[REG:[0-9]+]], 16
 ; CHECK-NOT: li {{[0-9]+}}, 15
-; CHECK-DAG: lvx {{[0-9]+}}, 0, 3
-; CHECK-DAG: lvx {{[0-9]+}}, 3, [[REG]]
+; CHECK-DAG: stvx 2, 3, [[REG]]
+; CHECK-DAG: lxvw4x {{[0-9]+}}, 0, 3
 ; CHECK: blr
 }
 
diff --git a/llvm/test/CodeGen/PowerPC/unal-altivec2.ll b/llvm/test/CodeGen/PowerPC/unal-altivec2.ll
index fafcab8468eb4d..39a82fe0a0977c 100644
--- a/llvm/test/CodeGen/PowerPC/unal-altivec2.ll
+++ b/llvm/test/CodeGen/PowerPC/unal-altivec2.ll
@@ -1,4 +1,4 @@
-; RUN: llc -verify-machineinstrs -mtriple=powerpc64-unknown-linux-gnu -mcpu=pwr7 < %s | FileCheck %s
+; RUN: llc -verify-machineinstrs -mtriple=powerpc64-unknown-linux-gnu -mcpu=pwr6 < %s | FileCheck %s
 target datalayout = "E-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-f128:128:128-v128:128:128-n32:64"
 target triple = "powerpc64-unknown-linux-gnu"
 
diff --git a/llvm/test/CodeGen/PowerPC/unal-vec-ldst.ll b/llvm/test/CodeGen/PowerPC/unal-vec-ldst.ll
index b0ed395fc3a190..c2e20149cb9fee 100644
--- a/llvm/test/CodeGen/PowerPC/unal-vec-ldst.ll
+++ b/llvm/test/CodeGen/PowerPC/unal-vec-ldst.ll
@@ -5,11 +5,7 @@
 define <16 x i8> @test_l_v16i8(ptr %p) #0 {
 ; CHECK-LABEL: test_l_v16i8:
 ; CHECK:       # %bb.0: # %entry
-; CHECK-NEXT:    li 4, 15
-; CHECK-NEXT:    lvsl 3, 0, 3
-; CHECK-NEXT:    lvx 2, 3, 4
-; CHECK-NEXT:    lvx 4, 0, 3
-; CHECK-NEXT:    vperm 2, 4, 2, 3
+; CHECK-NEXT:    lxvw4x 34, 0, 3
 ; CHECK-NEXT:    blr
 entry:
   %r = load <16 x i8>, ptr %p, align 1
@@ -20,14 +16,9 @@ entry:
 define <32 x i8> @test_l_v32i8(ptr %p) #0 {
 ; CHECK-LABEL: test_l_v32i8:
 ; CHECK:       # %bb.0: # %entry
-; CHECK-NEXT:    li 4, 31
-; CHECK-NEXT:    lvsl 4, 0, 3
-; CHECK-NEXT:    lvx 2, 3, 4
 ; CHECK-NEXT:    li 4, 16
-; CHECK-NEXT:    lvx 5, 3, 4
-; CHECK-NEXT:    vperm 3, 5, 2, 4
-; CHECK-NEXT:    lvx 2, 0, 3
-; CHECK-NEXT:    vperm 2, 2, 5, 4
+; CHECK-NEXT:    lxvw4x 34, 0, 3
+; CHECK-NEXT:    lxvw4x 35, 3, 4
 ; CHECK-NEXT:    blr
 entry:
   %r = load <32 x i8>, ptr %p, align 1
@@ -38,11 +29,7 @@ entry:
 define <8 x i16> @test_l_v8i16(ptr %p) #0 {
 ; CHECK-LABEL: test_l_v8i16:
 ; CHECK:       # %bb.0: # %entry
-; CHECK-NEXT:    li 4, 15
-; CHECK-NEXT:    lvsl 3, 0, 3
-; CHECK-NEXT:    lvx 2, 3, 4
-; CHECK-NEXT:    lvx 4, 0, 3
-; CHECK-NEXT:    vperm 2, 4, 2, 3
+; CHECK-NEXT:    lxvw4x 34, 0, 3
 ; CHECK-NEXT:    blr
 entry:
   %r = load <8 x i16>, ptr %p, align 2
@@ -53,14 +40,9 @@ entry:
 define <16 x i16> @test_l_v16i16(ptr %p) #0 {
 ; CHECK-LABEL: test_l_v16i16:
 ; CHECK:       # %bb.0: # %entry
-; CHECK-NEXT:    li 4, 31
-; CHECK-NEXT:    lvsl 4, 0, 3
-; CHECK-NEXT:    lvx 2, 3, 4
 ; CHECK-NEXT:    li 4, 16
-; CHECK-NEXT:    lvx 5, 3, 4
-; CHECK-NEXT:    vperm 3, 5, 2, 4
-; CHECK-NEXT:    lvx 2, 0, 3
-; CHECK-NEXT:    vperm 2, 2, 5, 4
+; CHECK-NEXT:    lxvw4x 34, 0, 3
+; CHECK-NEXT:    lxvw4x 35, 3, 4
 ; CHECK-NEXT:    blr
 entry:
   %r = load <16 x i16>, ptr %p, align 2
@@ -71,11 +53,7 @@ entry:
 define <4 x i32> @test_l_v4i32(ptr %p) #0 {
 ; CHECK-LABEL: test_l_v4i32:
 ; CHECK:       # %bb.0: # %entry
-; CHECK-NEXT:    li 4, 15
-; CHECK-NEXT:    lvsl 3, 0, 3
-; CHECK-NEXT:    lvx 2, 3, 4
-; CHECK-NEXT:    lvx 4, 0, 3
-; CHECK-NEXT:    vperm 2, 4, 2, 3
+; CHECK-NEXT:    lxvw4x 34, 0, 3
 ; CHECK-NEXT:    blr
 entry:
   %r = load <4 x i32>, ptr %p, align 4
@@ -86,14 +64,9 @@ entry:
 define <8 x i32> @test_l_v8i32(ptr %p) #0 {
 ; CHECK-LABEL: test_l_v8i32:
 ; CHECK:       # %bb.0: # %entry
-; CHECK-NEXT:    li 4, 31
-; CHECK-NEXT:    lvsl 4, 0, 3
-; CHECK-NEXT:    lvx 2, 3, 4
 ; CHECK-NEXT:    li 4, 16
-; CHECK-NEXT:    lvx 5, 3, 4
-; CHECK-NEXT:    vperm 3, 5, 2, 4
-; CHECK-NEXT:    lvx 2, 0, 3
-; CHECK-NEXT:    vperm 2, 2, 5, 4
+; CHECK-NEXT:    lxvw4x 34, 0, 3
+; CHECK-NEXT:    lxvw4x 35, 3, 4
 ; CHECK-NEXT:    blr
 entry:
   %r = load <8 x i32>, ptr %p, align 4
@@ -128,11 +101,7 @@ entry:
 def...
[truncated]

@@ -17250,8 +17250,7 @@ bool PPCTargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
EVT PPCTargetLowering::getOptimalMemOpType(
const MemOp &Op, const AttributeList &FuncAttributes) const {
if (getTargetMachine().getOptLevel() != CodeGenOptLevel::None) {
// We should use Altivec/VSX loads and stores when available. For unaligned
// addresses, unaligned VSX loads are only fast starting with the P8.
Collaborator

Have you compared the performance with and without unaligned vector-scalar loads/stores on pwr7? The existing comment seems to indicate why this was not done prior to pwr8.

Collaborator

On pwr7, an lxvw4x with less than 8-byte alignment will be flushed and micro-coded, and one with less than 4-byte alignment will cause an alignment interrupt.

Collaborator Author

I checked on a PWR7 AIX server: with alignments such as 1 or 2, the AIX OS observes a misaligned load/store operation, but the lxvw4x still executes successfully.
Performance-wise, lxvw4x is the same as the current lxvd2x (see llvm/test/CodeGen/PowerPC/memcpy-vec.ll), and lxvw4x is better than a scalar fixed-point ld on AIX P7.

Tested on AIX servers, it works well in terms of both functionality and performance.
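
To make the getOptimalMemOpType change concrete, here is a minimal sketch of an under-aligned 32-byte copy (the function, globals, and alignments are illustrative assumptions, not taken from the patch). Per the updated checks in llvm/test/CodeGen/PowerPC/memcpy-vec.ll, such a copy should now be emitted as lxvw4x/stxvw4x pairs on pwr7 instead of falling back to a call to memcpy.

; RUN: llc -verify-machineinstrs -mtriple=powerpc64-unknown-linux-gnu -mcpu=pwr7 < %s | FileCheck %s

@src = global [8 x i32] zeroinitializer, align 4
@dst = global [8 x i32] zeroinitializer, align 4

define void @copy32() {
entry:
  ; 32-byte copy with only 4-byte alignment; before this patch pwr7 could not
  ; use the 16-byte vector memop type here.
  call void @llvm.memcpy.p0.p0.i64(ptr align 4 @dst, ptr align 4 @src, i64 32, i1 false)
  ret void
}

declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)

; CHECK-LABEL: copy32:
; CHECK: lxvw4x
; CHECK: stxvw4x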