Skip to content

Conversation

folkertdev
Copy link
Contributor

fixes #129432

Turns out that fixme was (at least) 12 years old.

There is a bunch of code related to fma, but most of it seems to be for older ppc versions? At least, with ppc5 I never end up there. We run into this missing optimization in https://github.com/rust-lang/stdarch, with this optimization we can use one fewer intrinsic and in general improve support for powerpc via using the cross-platform intrinsics.

the DAG here is e.g.

Optimized legalized selection DAG: %bb.0 'manual_vmaddfp:entry'
SelectionDAG has 23 nodes:
  t0: ch,glue = EntryToken
            t2: v4f32,ch = CopyFromReg t0, Register:v4f32 %0
            t4: v4f32,ch = CopyFromReg t0, Register:v4f32 %1
                  t6: v4f32,ch = CopyFromReg t0, Register:v4f32 %2
                t14: v4i32 = bitcast t6
              t17: v4i32 = xor t14, t25
            t18: v4f32 = bitcast t17
          t9: v4f32 = llvm.ppc.altivec.vmaddfp TargetConstant:i32<10609>, t2, t4, t18
        t19: v4i32 = bitcast t9
      t20: v4i32 = xor t19, t25
    t21: v4f32 = bitcast t20
  t12: ch,glue = CopyToReg t0, Register:v4f32 $v2, t21
  t25: v4i32 = llvm.ppc.altivec.vslw Constant:i32<10699>, t28, t28
    t27: v16i8 = BUILD_VECTOR Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>
  t28: v4i32 = bitcast t27
  t13: ch = PPCISD::RET_GLUE t12, Register:v4f32 $v2, t12:1

@llvmbot
Copy link
Member

llvmbot commented Aug 26, 2025

@llvm/pr-subscribers-backend-powerpc

Author: Folkert de Vries (folkertdev)

Changes

fixes #129432

Turns out that fixme was (at least) 12 years old.

There is a bunch of code related to fma, but most of it seems to be for older ppc versions? At least, with ppc5 I never end up there. We run into this missing optimization in https://github.com/rust-lang/stdarch, with this optimization we can use one fewer intrinsic and in general improve support for powerpc via using the cross-platform intrinsics.

the DAG here is e.g.

Optimized legalized selection DAG: %bb.0 'manual_vmaddfp:entry'
SelectionDAG has 23 nodes:
  t0: ch,glue = EntryToken
            t2: v4f32,ch = CopyFromReg t0, Register:v4f32 %0
            t4: v4f32,ch = CopyFromReg t0, Register:v4f32 %1
                  t6: v4f32,ch = CopyFromReg t0, Register:v4f32 %2
                t14: v4i32 = bitcast t6
              t17: v4i32 = xor t14, t25
            t18: v4f32 = bitcast t17
          t9: v4f32 = llvm.ppc.altivec.vmaddfp TargetConstant:i32&lt;10609&gt;, t2, t4, t18
        t19: v4i32 = bitcast t9
      t20: v4i32 = xor t19, t25
    t21: v4f32 = bitcast t20
  t12: ch,glue = CopyToReg t0, Register:v4f32 $v2, t21
  t25: v4i32 = llvm.ppc.altivec.vslw Constant:i32&lt;10699&gt;, t28, t28
    t27: v16i8 = BUILD_VECTOR Constant:i32&lt;255&gt;, Constant:i32&lt;255&gt;, Constant:i32&lt;255&gt;, Constant:i32&lt;255&gt;, Constant:i32&lt;255&gt;, Constant:i32&lt;255&gt;, Constant:i32&lt;255&gt;, Constant:i32&lt;255&gt;, Constant:i32&lt;255&gt;, Constant:i32&lt;255&gt;, Constant:i32&lt;255&gt;, Constant:i32&lt;255&gt;, Constant:i32&lt;255&gt;, Constant:i32&lt;255&gt;, Constant:i32&lt;255&gt;, Constant:i32&lt;255&gt;
  t28: v4i32 = bitcast t27
  t13: ch = PPCISD::RET_GLUE t12, Register:v4f32 $v2, t12:1

Full diff: https://github.com/llvm/llvm-project/pull/155465.diff

2 Files Affected:

  • (modified) llvm/lib/Target/PowerPC/PPCInstrAltivec.td (+16-3)
  • (added) llvm/test/CodeGen/PowerPC/vec-nmsub.ll (+36)
diff --git a/llvm/lib/Target/PowerPC/PPCInstrAltivec.td b/llvm/lib/Target/PowerPC/PPCInstrAltivec.td
index 79fe12e8e4b49..c2254b0ef178c 100644
--- a/llvm/lib/Target/PowerPC/PPCInstrAltivec.td
+++ b/llvm/lib/Target/PowerPC/PPCInstrAltivec.td
@@ -30,6 +30,11 @@
 // Altivec transformation functions and pattern fragments.
 //
 
+// fneg is not legal, and desugared as an xor.
+def desugared_fneg : PatFrag<(ops node:$x), (v4f32 (bitconvert (xor (bitconvert $x), 
+                             (int_ppc_altivec_vslw (bitconvert (v16i8 immAllOnesV)), 
+                             (bitconvert (v16i8 immAllOnesV))))))>; 
+
 def vpkuhum_shuffle : PatFrag<(ops node:$lhs, node:$rhs),
                               (vector_shuffle node:$lhs, node:$rhs), [{
   return PPC::isVPKUHUMShuffleMask(cast<ShuffleVectorSDNode>(N), 0, *CurDAG);
@@ -461,11 +466,12 @@ def VMADDFP : VAForm_1<46, (outs vrrc:$RT), (ins vrrc:$RA, vrrc:$RC, vrrc:$RB),
                        [(set v4f32:$RT,
                         (fma v4f32:$RA, v4f32:$RC, v4f32:$RB))]>;
 
-// FIXME: The fma+fneg pattern won't match because fneg is not legal.
+// fneg is not legal, hence we have to match on the desugared version. 
 def VNMSUBFP: VAForm_1<47, (outs vrrc:$RT), (ins vrrc:$RA, vrrc:$RC, vrrc:$RB),
                        "vnmsubfp $RT, $RA, $RC, $RB", IIC_VecFP,
-                       [(set v4f32:$RT, (fneg (fma v4f32:$RA, v4f32:$RC,
-                                                  (fneg v4f32:$RB))))]>;
+                       [(set v4f32:$RT, (desugared_fneg (fma v4f32:$RA, v4f32:$RC,
+                                                  (desugared_fneg v4f32:$RB))))]>;
+
 let hasSideEffects = 1 in {
   def VMHADDSHS  : VA1a_Int_Ty<32, "vmhaddshs", int_ppc_altivec_vmhaddshs, v8i16>;
   def VMHRADDSHS : VA1a_Int_Ty<33, "vmhraddshs", int_ppc_altivec_vmhraddshs,
@@ -886,6 +892,13 @@ def : Pat<(mul v8i16:$vA, v8i16:$vB), (VMLADDUHM $vA, $vB, (v8i16(V_SET0H)))>;
 // Add
 def : Pat<(add (mul v8i16:$vA, v8i16:$vB), v8i16:$vC), (VMLADDUHM $vA, $vB, $vC)>;
 
+
+// Fused negated multiply-subtract
+def : Pat<(v4f32 (desugared_fneg
+                    (int_ppc_altivec_vmaddfp v4f32:$RA, v4f32:$RC,
+                         (desugared_fneg v4f32:$RB)))),
+          (VNMSUBFP $RA, $RC, $RB)>;
+
 // Saturating adds/subtracts.
 def : Pat<(v16i8 (saddsat v16i8:$vA, v16i8:$vB)), (v16i8 (VADDSBS $vA, $vB))>;
 def : Pat<(v16i8 (uaddsat v16i8:$vA, v16i8:$vB)), (v16i8 (VADDUBS $vA, $vB))>;
diff --git a/llvm/test/CodeGen/PowerPC/vec-nmsub.ll b/llvm/test/CodeGen/PowerPC/vec-nmsub.ll
new file mode 100644
index 0000000000000..8f4ac972346b3
--- /dev/null
+++ b/llvm/test/CodeGen/PowerPC/vec-nmsub.ll
@@ -0,0 +1,36 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -verify-machineinstrs < %s -mcpu=pwr5 -mtriple=ppc32-- -mattr=+altivec | FileCheck %s
+
+define dso_local <4 x float> @intrinsic(<4 x float> noundef %a, <4 x float> noundef %b, <4 x float> noundef %c) local_unnamed_addr {
+; CHECK-LABEL: intrinsic:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vnmsubfp 2, 2, 3, 4
+; CHECK-NEXT:    blr
+entry:
+  %0 = tail call <4 x float> @llvm.ppc.altivec.vnmsubfp(<4 x float> %a, <4 x float> %b, <4 x float> %c)
+  ret <4 x float> %0
+}
+
+define <4 x float> @manual_llvm_fma(<4 x float> %a, <4 x float> %b, <4 x float> %c) unnamed_addr {
+; CHECK-LABEL: manual_llvm_fma:
+; CHECK:       # %bb.0: # %start
+; CHECK-NEXT:    vnmsubfp 2, 2, 3, 4
+; CHECK-NEXT:    blr
+start:
+  %0 = fneg <4 x float> %c
+  %1 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %a, <4 x float> %b, <4 x float> %0)
+  %2 = fneg <4 x float> %1
+  ret <4 x float> %2
+}
+
+define dso_local <4 x float> @manual_vmaddfp(<4 x float> noundef %a, <4 x float> noundef %b, <4 x float> noundef %c) local_unnamed_addr {
+; CHECK-LABEL: manual_vmaddfp:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    vnmsubfp 2, 2, 3, 4
+; CHECK-NEXT:    blr
+entry:
+  %fneg.i3 = fneg <4 x float> %c
+  %0 = tail call <4 x float> @llvm.ppc.altivec.vmaddfp(<4 x float> %a, <4 x float> %b, <4 x float> %fneg.i3)
+  %fneg.i = fneg <4 x float> %0
+  ret <4 x float> %fneg.i
+}

@folkertdev
Copy link
Contributor Author

@daltenty can you point the right reviewer at this PR?

@folkertdev
Copy link
Contributor Author

@amy-kwan any change you could review this or recommend someone who could?

(I've seen you do some stuff in the rust repo)

@nikic nikic requested review from amy-kwan and lei137 September 22, 2025 08:39
Copy link
Collaborator

@RolandF77 RolandF77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@folkertdev
Copy link
Contributor Author

I'm still not entirely sure of the LLVM review process, I'm guessing we're waiting for a second approval? In any case, I cannot merge this PR, someone else will have to do that.

@RolandF77 RolandF77 merged commit 3f3d522 into llvm:main Oct 6, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

powerpc: failure to optimize manual vec_nmsub implementation

3 participants