86 changes: 73 additions & 13 deletions llvm/lib/Target/AArch64/AArch64RegisterInfo.cpp
@@ -1123,24 +1123,85 @@ unsigned AArch64RegisterInfo::getRegPressureLimit(const TargetRegisterClass *RC,
}
}

// FORM_TRANSPOSED_REG_TUPLE nodes are created to improve register allocation
// where a consecutive multi-vector tuple is constructed from the same indices
// of multiple strided loads. This may still result in unnecessary copies
// between the loads and the tuple. Here we try to return a hint to assign the
// contiguous ZPRMulReg starting at the same register as the first operand of
// the pseudo, which should be a subregister of the first strided load.
// We add regalloc hints for different cases:
// * Choosing a better destination operand for predicated SVE instructions
// where the inactive lanes are undef, by choosing a register that is not
// unique to the other operands of the instruction.
//
// For example, if the first strided load has been assigned $z16_z20_z24_z28
// and the operands of the pseudo are each accessing subregister zsub2, we
// should look through through Order to find a contiguous register which
// begins with $z24 (i.e. $z24_z25_z26_z27).
// * Improve register allocation for SME multi-vector instructions where we can
// benefit from the strided- and contiguous register multi-vector tuples.
//
// Here FORM_TRANSPOSED_REG_TUPLE nodes are created to improve register
// allocation where a consecutive multi-vector tuple is constructed from the
// same indices of multiple strided loads. This may still result in
// unnecessary copies between the loads and the tuple. Here we try to return a
// hint to assign the contiguous ZPRMulReg starting at the same register as
// the first operand of the pseudo, which should be a subregister of the first
// strided load.
//
// For example, if the first strided load has been assigned $z16_z20_z24_z28
// and the operands of the pseudo are each accessing subregister zsub2, we
// should look through Order to find a contiguous register which
// begins with $z24 (i.e. $z24_z25_z26_z27).
bool AArch64RegisterInfo::getRegAllocationHints(
Register VirtReg, ArrayRef<MCPhysReg> Order,
SmallVectorImpl<MCPhysReg> &Hints, const MachineFunction &MF,
const VirtRegMap *VRM, const LiveRegMatrix *Matrix) const {

auto &ST = MF.getSubtarget<AArch64Subtarget>();
const AArch64InstrInfo *TII =
MF.getSubtarget<AArch64Subtarget>().getInstrInfo();
const MachineRegisterInfo &MRI = MF.getRegInfo();

// For predicated SVE instructions where the inactive lanes are undef,
// pick a destination register that is not unique to avoid introducing
// a movprfx.
const TargetRegisterClass *RegRC = MRI.getRegClass(VirtReg);
if (AArch64::ZPRRegClass.hasSubClassEq(RegRC)) {
for (const MachineOperand &DefOp : MRI.def_operands(VirtReg)) {
const MachineInstr &Def = *DefOp.getParent();
if (DefOp.isImplicit() ||
(TII->get(Def.getOpcode()).TSFlags & AArch64::FalseLanesMask) !=
AArch64::FalseLanesUndef)
continue;

unsigned InstFlags =
TII->get(AArch64::getSVEPseudoMap(Def.getOpcode())).TSFlags;

for (MCPhysReg R : Order) {
auto AddHintIfSuitable = [&](MCPhysReg R, const MachineOperand &MO) {
// R is a suitable register hint if there exists an operand for the
// instruction that is not yet allocated a register or if R matches
// one of the other source operands.
if (!VRM->hasPhys(MO.getReg()) || VRM->getPhys(MO.getReg()) == R)
Hints.push_back(R);
};

switch (InstFlags & AArch64::DestructiveInstTypeMask) {
default:
break;
case AArch64::DestructiveTernaryCommWithRev:
AddHintIfSuitable(R, Def.getOperand(2));
AddHintIfSuitable(R, Def.getOperand(3));
AddHintIfSuitable(R, Def.getOperand(4));
Contributor:
Do you remember if there is any priority order for hints? E.g. will R, Def.getOperand(2) be considered first for assigning a phys reg?

Collaborator (author):
The code here adds hints using the priority order from ArrayRef<MCPhysReg> Order, so the order in which it calls AddHintIfSuitable (for each Def.getOperand(K)) does not matter.

In general, the priority order of hints does matter, as the register allocator will try the hints in the order specified.

break;
case AArch64::DestructiveBinaryComm:
case AArch64::DestructiveBinaryCommWithRev:
AddHintIfSuitable(R, Def.getOperand(2));
AddHintIfSuitable(R, Def.getOperand(3));
break;
case AArch64::DestructiveBinary:
case AArch64::DestructiveBinaryImm:
AddHintIfSuitable(R, Def.getOperand(2));
break;
}
}
}

if (Hints.size())
return TargetRegisterInfo::getRegAllocationHints(VirtReg, Order, Hints,
MF, VRM);
Comment on lines +1200 to +1202
Contributor:
Is there a reason to prefer adding the hints above before target-independent ones?

Contributor:
I'm really sorry if I'm missing something very obvious, but is the expectation that the hints added above should take precedence over the copy hints that the target-independent implementation adds?

Collaborator (author):
Yes, that's right. My understanding is that the registers are considered in the order as they are in Hints (i.e. highest priority for hint in element 0 -> lower priority for hint in element K). So here we add the most preferred registers first. TargetRegisterInfo::getRegAllocationHints may append more hints, which will have lower priority.
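
A quick standalone sketch of that ordering assumption (toy code, not LLVM's allocator; pickFromHints and the register numbers are made up for the illustration): hints earlier in the list win, and anything appended later, such as the copy hints added by the base TargetRegisterInfo implementation, only matters if the earlier hints are unavailable.

// Toy illustration only: walk the hint list front to back and take the
// first candidate that is still free; later entries act as fallbacks.
#include <cstdio>
#include <optional>
#include <set>
#include <vector>

using PhysReg = unsigned;

std::optional<PhysReg> pickFromHints(const std::vector<PhysReg> &Hints,
                                     const std::set<PhysReg> &Free) {
  for (PhysReg R : Hints)      // earlier hints have higher priority
    if (Free.count(R))
      return R;
  return std::nullopt;         // fall back to the normal allocation order
}

int main() {
  // Target hints (z24, z25) were pushed first, the copy hint (z4) last.
  std::vector<PhysReg> Hints = {24, 25, 4};
  std::set<PhysReg> Free = {4, 25}; // z24 is already taken
  if (auto R = pickFromHints(Hints, Free))
    std::printf("picked z%u\n", *R); // prints "picked z25"
  return 0;
}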

Contributor (@rj-jesus), Nov 10, 2025:

I believe you're right, which is why I expected copy hints to come first. A missed copy hint is likely to lead to a MOV down the line, whereas a missed MOVPRFX hint should only lead to the MOVPRFX itself (which should be cheaper). That would happen in the example below if MachineCP weren't able to rewrite $z0 with $z4.

For what it's worth, the patch does seem to increase the list of hints of affected pseudos considerably, including adding repeated ones (example):

selectOrSplit ZPR:%4 [80r,96r:0) 0@80r  weight:INF
hints: $z0 $z0 $z0 $z1 $z1 $z1 $z2 $z2 $z2 $z3 $z3 $z3 $z4 $z4 $z4 $z5 $z5 $z5 $z6 $z6 $z6 $z7 $z7 $z7 $z16 $z16 $z16 $z17 $z17 $z17 $z18 $z18 $z18 $z19 $z19 $z19 $z20 $z20 $z20 $z21 $z21 $z21 $z22 $z22 $z22 $z23 $z23 $z23 $z24 $z24 $z24 $z25 $z25 $z25 $z26 $z26 $z26 $z27 $z27 $z27 $z28 $z28 $z28 $z29 $z29 $z29 $z30 $z30 $z30 $z31 $z31 $z31 $z8 $z8 $z8 $z9 $z9 $z9 $z10 $z10 $z10 $z11 $z11 $z11 $z12 $z12 $z12 $z13 $z13 $z13 $z14 $z14 $z14 $z15 $z15 $z15 $z4
assigning %4 to $z0: B0 [80r,96r:0) 0@80r B0_HI [80r,96r:0) 0@80r H0_HI [80r,96r:0) 0@80r S0_HI [80r,96r:0) 0@80r D0_HI [80r,96r:0) 0@80r Q0_HI [80r,96r:0) 0@80r

Before the patch:

selectOrSplit ZPR:%4 [80r,96r:0) 0@80r  weight:INF
hints: $z4
assigning %4 to $z4: B4 [80r,96r:0) 0@80r B4_HI [80r,96r:0) 0@80r H4_HI [80r,96r:0) 0@80r S4_HI [80r,96r:0) 0@80r D4_HI [80r,96r:0) 0@80r Q4_HI [80r,96r:0) 0@80r

I'm not sure how this affects the register allocator (or compile time), but since it has already been merged, I suppose we can keep an eye out for any fallout. :)

Collaborator:

If nothing else it's probably worth trying to remove the duplicates.
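
On removing the duplicates, a minimal sketch of one option, assuming the AddHintIfSuitable lambda from the hunk above and llvm::is_contained from llvm/ADT/STLExtras.h; this is an illustration, not part of the merged patch:

// Sketch only: same suitability test as above, but skip registers that
// are already in Hints so each candidate is hinted at most once.
auto AddHintIfSuitable = [&](MCPhysReg R, const MachineOperand &MO) {
  if ((!VRM->hasPhys(MO.getReg()) || VRM->getPhys(MO.getReg()) == R) &&
      !llvm::is_contained(Hints, R))
    Hints.push_back(R);
};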

}

if (!ST.hasSME() || !ST.isStreaming())
return TargetRegisterInfo::getRegAllocationHints(VirtReg, Order, Hints, MF,
VRM);
@@ -1153,8 +1214,7 @@ bool AArch64RegisterInfo::getRegAllocationHints(
// FORM_TRANSPOSED_REG_TUPLE pseudo, we want to favour reducing copy
// instructions over reducing the number of clobbered callee-save registers,
// so we add the strided registers as a hint.
const MachineRegisterInfo &MRI = MF.getRegInfo();
unsigned RegID = MRI.getRegClass(VirtReg)->getID();
unsigned RegID = RegRC->getID();
if (RegID == AArch64::ZPR2StridedOrContiguousRegClassID ||
RegID == AArch64::ZPR4StridedOrContiguousRegClassID) {

7 changes: 3 additions & 4 deletions llvm/test/CodeGen/AArch64/aarch64-combine-add-sub-mul.ll
@@ -52,12 +52,11 @@ define <2 x i64> @test_mul_sub_2x64_2(<2 x i64> %a, <2 x i64> %b, <2 x i64> %c,
; CHECK-NEXT: ptrue p0.d, vl2
; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
; CHECK-NEXT: // kill: def $q3 killed $q3 def $z3
; CHECK-NEXT: // kill: def $q2 killed $q2 def $z2
; CHECK-NEXT: // kill: def $q3 killed $q3 def $z3
; CHECK-NEXT: sdiv z0.d, p0/m, z0.d, z1.d
; CHECK-NEXT: movprfx z1, z2
; CHECK-NEXT: mul z1.d, p0/m, z1.d, z3.d
; CHECK-NEXT: sub v0.2d, v1.2d, v0.2d
; CHECK-NEXT: mul z2.d, p0/m, z2.d, z3.d
; CHECK-NEXT: sub v0.2d, v2.2d, v0.2d
; CHECK-NEXT: ret
%div = sdiv <2 x i64> %a, %b
%mul = mul <2 x i64> %c, %d
@@ -14,13 +14,12 @@ define <vscale x 4 x double> @mull_add(<vscale x 4 x double> %a, <vscale x 4 x d
; CHECK-NEXT: ptrue p0.d
; CHECK-NEXT: fmul z7.d, z0.d, z1.d
; CHECK-NEXT: fmul z1.d, z6.d, z1.d
; CHECK-NEXT: movprfx z3, z7
; CHECK-NEXT: fmla z3.d, p0/m, z6.d, z2.d
; CHECK-NEXT: fmad z6.d, p0/m, z2.d, z7.d
; CHECK-NEXT: fnmsb z0.d, p0/m, z2.d, z1.d
; CHECK-NEXT: uzp2 z1.d, z4.d, z5.d
; CHECK-NEXT: uzp1 z2.d, z4.d, z5.d
; CHECK-NEXT: fadd z2.d, z2.d, z0.d
; CHECK-NEXT: fadd z1.d, z3.d, z1.d
; CHECK-NEXT: fadd z1.d, z6.d, z1.d
; CHECK-NEXT: zip1 z0.d, z2.d, z1.d
; CHECK-NEXT: zip2 z1.d, z2.d, z1.d
; CHECK-NEXT: ret
@@ -225,17 +224,14 @@ define <vscale x 4 x double> @mul_add_rot_mull(<vscale x 4 x double> %a, <vscale
; CHECK-NEXT: fmul z1.d, z25.d, z1.d
; CHECK-NEXT: fmul z3.d, z4.d, z24.d
; CHECK-NEXT: fmul z24.d, z5.d, z24.d
; CHECK-NEXT: movprfx z7, z26
; CHECK-NEXT: fmla z7.d, p0/m, z25.d, z2.d
; CHECK-NEXT: fmad z25.d, p0/m, z2.d, z26.d
; CHECK-NEXT: fnmsb z0.d, p0/m, z2.d, z1.d
; CHECK-NEXT: movprfx z1, z3
; CHECK-NEXT: fmla z1.d, p0/m, z6.d, z5.d
; CHECK-NEXT: movprfx z2, z24
; CHECK-NEXT: fnmls z2.d, p0/m, z4.d, z6.d
; CHECK-NEXT: fadd z2.d, z0.d, z2.d
; CHECK-NEXT: fadd z1.d, z7.d, z1.d
; CHECK-NEXT: zip1 z0.d, z2.d, z1.d
; CHECK-NEXT: zip2 z1.d, z2.d, z1.d
; CHECK-NEXT: fmla z3.d, p0/m, z6.d, z5.d
; CHECK-NEXT: fnmsb z4.d, p0/m, z6.d, z24.d
; CHECK-NEXT: fadd z1.d, z0.d, z4.d
; CHECK-NEXT: fadd z2.d, z25.d, z3.d
; CHECK-NEXT: zip1 z0.d, z1.d, z2.d
; CHECK-NEXT: zip2 z1.d, z1.d, z2.d
; CHECK-NEXT: ret
entry:
%strided.vec = tail call { <vscale x 2 x double>, <vscale x 2 x double> } @llvm.vector.deinterleave2.nxv4f64(<vscale x 4 x double> %a)
@@ -200,12 +200,10 @@ define <vscale x 4 x double> @mul_add_rot_mull(<vscale x 4 x double> %a, <vscale
; CHECK-NEXT: fmul z3.d, z2.d, z25.d
; CHECK-NEXT: fmul z25.d, z24.d, z25.d
; CHECK-NEXT: fmla z3.d, p0/m, z24.d, z0.d
; CHECK-NEXT: movprfx z24, z25
; CHECK-NEXT: fmla z24.d, p0/m, z26.d, z1.d
; CHECK-NEXT: movprfx z6, z24
; CHECK-NEXT: fmla z6.d, p0/m, z5.d, z4.d
; CHECK-NEXT: fmla z25.d, p0/m, z26.d, z1.d
; CHECK-NEXT: fmla z25.d, p0/m, z5.d, z4.d
; CHECK-NEXT: fmla z3.d, p0/m, z26.d, z4.d
; CHECK-NEXT: fnmsb z2.d, p0/m, z0.d, z6.d
; CHECK-NEXT: fnmsb z2.d, p0/m, z0.d, z25.d
; CHECK-NEXT: fmsb z1.d, p0/m, z5.d, z3.d
; CHECK-NEXT: zip1 z0.d, z2.d, z1.d
; CHECK-NEXT: zip2 z1.d, z2.d, z1.d
@@ -17,11 +17,10 @@ define <vscale x 4 x half> @complex_add_v4f16(<vscale x 4 x half> %a, <vscale x
; CHECK-NEXT: uunpklo z3.d, z3.s
; CHECK-NEXT: uunpklo z1.d, z1.s
; CHECK-NEXT: fsubr z0.h, p0/m, z0.h, z1.h
; CHECK-NEXT: movprfx z1, z3
; CHECK-NEXT: fadd z1.h, p0/m, z1.h, z2.h
; CHECK-NEXT: zip2 z2.d, z0.d, z1.d
; CHECK-NEXT: zip1 z0.d, z0.d, z1.d
; CHECK-NEXT: uzp1 z0.s, z0.s, z2.s
; CHECK-NEXT: fadd z2.h, p0/m, z2.h, z3.h
; CHECK-NEXT: zip2 z1.d, z0.d, z2.d
; CHECK-NEXT: zip1 z0.d, z0.d, z2.d
; CHECK-NEXT: uzp1 z0.s, z0.s, z1.s
; CHECK-NEXT: ret
entry:
%a.deinterleaved = tail call { <vscale x 2 x half>, <vscale x 2 x half> } @llvm.vector.deinterleave2.nxv4f16(<vscale x 4 x half> %a)
@@ -18,11 +18,10 @@ define <vscale x 4 x i16> @complex_mul_v4i16(<vscale x 4 x i16> %a, <vscale x 4
; CHECK-NEXT: uzp2 z1.d, z1.d, z3.d
; CHECK-NEXT: mul z5.d, z2.d, z0.d
; CHECK-NEXT: mul z2.d, z2.d, z4.d
; CHECK-NEXT: movprfx z3, z5
; CHECK-NEXT: mla z3.d, p0/m, z1.d, z4.d
; CHECK-NEXT: mad z4.d, p0/m, z1.d, z5.d
; CHECK-NEXT: msb z0.d, p0/m, z1.d, z2.d
; CHECK-NEXT: zip2 z1.d, z0.d, z3.d
; CHECK-NEXT: zip1 z0.d, z0.d, z3.d
; CHECK-NEXT: zip2 z1.d, z0.d, z4.d
; CHECK-NEXT: zip1 z0.d, z0.d, z4.d
; CHECK-NEXT: uzp1 z0.s, z0.s, z1.s
; CHECK-NEXT: ret
entry:
5 changes: 2 additions & 3 deletions llvm/test/CodeGen/AArch64/llvm-ir-to-intrinsic.ll
@@ -1148,11 +1148,10 @@ define <vscale x 4 x i64> @fshl_rot_illegal_i64(<vscale x 4 x i64> %a, <vscale x
; CHECK-NEXT: and z3.d, z3.d, #0x3f
; CHECK-NEXT: lslr z4.d, p0/m, z4.d, z0.d
; CHECK-NEXT: lsr z0.d, p0/m, z0.d, z2.d
; CHECK-NEXT: movprfx z2, z1
; CHECK-NEXT: lsl z2.d, p0/m, z2.d, z5.d
; CHECK-NEXT: lslr z5.d, p0/m, z5.d, z1.d
; CHECK-NEXT: lsr z1.d, p0/m, z1.d, z3.d
; CHECK-NEXT: orr z0.d, z4.d, z0.d
; CHECK-NEXT: orr z1.d, z2.d, z1.d
; CHECK-NEXT: orr z1.d, z5.d, z1.d
; CHECK-NEXT: ret
%fshl = call <vscale x 4 x i64> @llvm.fshl.nxv4i64(<vscale x 4 x i64> %a, <vscale x 4 x i64> %a, <vscale x 4 x i64> %b)
ret <vscale x 4 x i64> %fshl
60 changes: 24 additions & 36 deletions llvm/test/CodeGen/AArch64/sve-fixed-length-fp-arith.ll
@@ -55,10 +55,9 @@ define void @fadd_v32f16(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1h { z2.h }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1h { z3.h }, p0/z, [x1]
; VBITS_GE_256-NEXT: fadd z0.h, p0/m, z0.h, z1.h
; VBITS_GE_256-NEXT: movprfx z1, z2
; VBITS_GE_256-NEXT: fadd z1.h, p0/m, z1.h, z3.h
; VBITS_GE_256-NEXT: fadd z2.h, p0/m, z2.h, z3.h
; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x0, x8, lsl #1]
; VBITS_GE_256-NEXT: st1h { z1.h }, p0, [x0]
; VBITS_GE_256-NEXT: st1h { z2.h }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fadd_v32f16:
@@ -154,10 +153,9 @@ define void @fadd_v16f32(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1w { z2.s }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1w { z3.s }, p0/z, [x1]
; VBITS_GE_256-NEXT: fadd z0.s, p0/m, z0.s, z1.s
; VBITS_GE_256-NEXT: movprfx z1, z2
; VBITS_GE_256-NEXT: fadd z1.s, p0/m, z1.s, z3.s
; VBITS_GE_256-NEXT: fadd z2.s, p0/m, z2.s, z3.s
; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x0, x8, lsl #2]
; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x0]
; VBITS_GE_256-NEXT: st1w { z2.s }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fadd_v16f32:
@@ -253,10 +251,9 @@ define void @fadd_v8f64(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1d { z2.d }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1d { z3.d }, p0/z, [x1]
; VBITS_GE_256-NEXT: fadd z0.d, p0/m, z0.d, z1.d
; VBITS_GE_256-NEXT: movprfx z1, z2
; VBITS_GE_256-NEXT: fadd z1.d, p0/m, z1.d, z3.d
; VBITS_GE_256-NEXT: fadd z2.d, p0/m, z2.d, z3.d
; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x0, x8, lsl #3]
; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x0]
; VBITS_GE_256-NEXT: st1d { z2.d }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fadd_v8f64:
@@ -660,10 +657,9 @@ define void @fma_v32f16(ptr %a, ptr %b, ptr %c) #0 {
; VBITS_GE_256-NEXT: ld1h { z4.h }, p0/z, [x1]
; VBITS_GE_256-NEXT: ld1h { z5.h }, p0/z, [x2]
; VBITS_GE_256-NEXT: fmad z0.h, p0/m, z1.h, z2.h
; VBITS_GE_256-NEXT: movprfx z1, z5
; VBITS_GE_256-NEXT: fmla z1.h, p0/m, z3.h, z4.h
; VBITS_GE_256-NEXT: fmad z3.h, p0/m, z4.h, z5.h
; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x0, x8, lsl #1]
; VBITS_GE_256-NEXT: st1h { z1.h }, p0, [x0]
; VBITS_GE_256-NEXT: st1h { z3.h }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fma_v32f16:
@@ -771,10 +767,9 @@ define void @fma_v16f32(ptr %a, ptr %b, ptr %c) #0 {
; VBITS_GE_256-NEXT: ld1w { z4.s }, p0/z, [x1]
; VBITS_GE_256-NEXT: ld1w { z5.s }, p0/z, [x2]
; VBITS_GE_256-NEXT: fmad z0.s, p0/m, z1.s, z2.s
; VBITS_GE_256-NEXT: movprfx z1, z5
; VBITS_GE_256-NEXT: fmla z1.s, p0/m, z3.s, z4.s
; VBITS_GE_256-NEXT: fmad z3.s, p0/m, z4.s, z5.s
; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x0, x8, lsl #2]
; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x0]
; VBITS_GE_256-NEXT: st1w { z3.s }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fma_v16f32:
@@ -881,10 +876,9 @@ define void @fma_v8f64(ptr %a, ptr %b, ptr %c) #0 {
; VBITS_GE_256-NEXT: ld1d { z4.d }, p0/z, [x1]
; VBITS_GE_256-NEXT: ld1d { z5.d }, p0/z, [x2]
; VBITS_GE_256-NEXT: fmad z0.d, p0/m, z1.d, z2.d
; VBITS_GE_256-NEXT: movprfx z1, z5
; VBITS_GE_256-NEXT: fmla z1.d, p0/m, z3.d, z4.d
; VBITS_GE_256-NEXT: fmad z3.d, p0/m, z4.d, z5.d
; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x0, x8, lsl #3]
; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x0]
; VBITS_GE_256-NEXT: st1d { z3.d }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fma_v8f64:
@@ -990,10 +984,9 @@ define void @fmul_v32f16(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1h { z2.h }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1h { z3.h }, p0/z, [x1]
; VBITS_GE_256-NEXT: fmul z0.h, p0/m, z0.h, z1.h
; VBITS_GE_256-NEXT: movprfx z1, z2
; VBITS_GE_256-NEXT: fmul z1.h, p0/m, z1.h, z3.h
; VBITS_GE_256-NEXT: fmul z2.h, p0/m, z2.h, z3.h
; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x0, x8, lsl #1]
; VBITS_GE_256-NEXT: st1h { z1.h }, p0, [x0]
; VBITS_GE_256-NEXT: st1h { z2.h }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fmul_v32f16:
@@ -1089,10 +1082,9 @@ define void @fmul_v16f32(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1w { z2.s }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1w { z3.s }, p0/z, [x1]
; VBITS_GE_256-NEXT: fmul z0.s, p0/m, z0.s, z1.s
; VBITS_GE_256-NEXT: movprfx z1, z2
; VBITS_GE_256-NEXT: fmul z1.s, p0/m, z1.s, z3.s
; VBITS_GE_256-NEXT: fmul z2.s, p0/m, z2.s, z3.s
; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x0, x8, lsl #2]
; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x0]
; VBITS_GE_256-NEXT: st1w { z2.s }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fmul_v16f32:
@@ -1188,10 +1180,9 @@ define void @fmul_v8f64(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1d { z2.d }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1d { z3.d }, p0/z, [x1]
; VBITS_GE_256-NEXT: fmul z0.d, p0/m, z0.d, z1.d
; VBITS_GE_256-NEXT: movprfx z1, z2
; VBITS_GE_256-NEXT: fmul z1.d, p0/m, z1.d, z3.d
; VBITS_GE_256-NEXT: fmul z2.d, p0/m, z2.d, z3.d
; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x0, x8, lsl #3]
; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x0]
; VBITS_GE_256-NEXT: st1d { z2.d }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fmul_v8f64:
@@ -1827,10 +1818,9 @@ define void @fsub_v32f16(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1h { z2.h }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1h { z3.h }, p0/z, [x1]
; VBITS_GE_256-NEXT: fsub z0.h, p0/m, z0.h, z1.h
; VBITS_GE_256-NEXT: movprfx z1, z2
; VBITS_GE_256-NEXT: fsub z1.h, p0/m, z1.h, z3.h
; VBITS_GE_256-NEXT: fsub z2.h, p0/m, z2.h, z3.h
; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x0, x8, lsl #1]
; VBITS_GE_256-NEXT: st1h { z1.h }, p0, [x0]
; VBITS_GE_256-NEXT: st1h { z2.h }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fsub_v32f16:
@@ -1926,10 +1916,9 @@ define void @fsub_v16f32(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1w { z2.s }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1w { z3.s }, p0/z, [x1]
; VBITS_GE_256-NEXT: fsub z0.s, p0/m, z0.s, z1.s
; VBITS_GE_256-NEXT: movprfx z1, z2
; VBITS_GE_256-NEXT: fsub z1.s, p0/m, z1.s, z3.s
; VBITS_GE_256-NEXT: fsub z2.s, p0/m, z2.s, z3.s
; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x0, x8, lsl #2]
; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x0]
; VBITS_GE_256-NEXT: st1w { z2.s }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fsub_v16f32:
@@ -2025,10 +2014,9 @@ define void @fsub_v8f64(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1d { z2.d }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1d { z3.d }, p0/z, [x1]
; VBITS_GE_256-NEXT: fsub z0.d, p0/m, z0.d, z1.d
; VBITS_GE_256-NEXT: movprfx z1, z2
; VBITS_GE_256-NEXT: fsub z1.d, p0/m, z1.d, z3.d
; VBITS_GE_256-NEXT: fsub z2.d, p0/m, z2.d, z3.d
; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x0, x8, lsl #3]
; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x0]
; VBITS_GE_256-NEXT: st1d { z2.d }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fsub_v8f64: