
Conversation

@paulwalker-arm
Collaborator

We're likely to get better code from custom legalisation, where we can remove the unpack instructions (plus SVE2p1 has BFMLSLB/T), but we get much of the benefit with these two small changes.

NOTE: LLVM has no support for FEAT_AFP in terms of feature detection or ACLE builtins, so the compiler works under the assumption the feature is not enabled.

The patch is also more aggressive about enabling bfloat fma construction, because doing so removes unnecessary rounding, which is generally preferable regardless of whether BFMLALB is used or not.
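
To illustrate the rounding point, here is a rough scalar model (not part of the patch; the helper names bf16_to_f32, f32_to_bf16, mul_then_add and fused are made up for this sketch). Without the fma combine, each bfloat operation is widened to f32, computed, and rounded back to bfloat, so a mul-then-add rounds twice; with the combine, the intermediate stays in f32 (which is what BFMLALB plus a final BFCVT gives per even-indexed lane) and the result is rounded once at the end. Since the product of two bf16 values is exact in f32, the only difference between the two forms is the extra intermediate rounding. NaN handling and FEAT_AFP-style alternate behaviour are ignored here.

#include <cmath>
#include <cstdint>
#include <cstring>

// Widen bf16 (held as its raw 16 bits) to f32: bf16 is the top half of f32.
static float bf16_to_f32(uint16_t b) {
  uint32_t u = uint32_t(b) << 16;
  float f;
  std::memcpy(&f, &u, sizeof f);
  return f;
}

// Narrow f32 to bf16 with round-to-nearest-even (NaN handling omitted).
static uint16_t f32_to_bf16(float f) {
  uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  u += 0x7FFF + ((u >> 16) & 1);
  return uint16_t(u >> 16);
}

// Without the combine: the multiply result is rounded to bf16 before the add.
uint16_t mul_then_add(uint16_t a, uint16_t b, uint16_t acc) {
  uint16_t m = f32_to_bf16(bf16_to_f32(a) * bf16_to_f32(b));
  return f32_to_bf16(bf16_to_f32(acc) + bf16_to_f32(m));
}

// With the combine: keep the product in f32 and round once at the end,
// roughly the behaviour of BFMLALB followed by BFCVT on an even lane.
uint16_t fused(uint16_t a, uint16_t b, uint16_t acc) {
  return f32_to_bf16(std::fma(bf16_to_f32(a), bf16_to_f32(b), bf16_to_f32(acc)));
}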

@llvmbot
Member

llvmbot commented Nov 10, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Paul Walker (paulwalker-arm)

Changes

We're likely to get better code from custom legalisation, where we can remove the unpack instructions (plus SVE2p1 has BFMLSLB/T), but we get much of the benefit with these two small changes.

NOTE: LLVM has no support for FEAT_AFP in terms of feature detection or ACLE builtins, so the compiler works under the assumption the feature is not enabled.

The patch is also more aggressive about enabling bfloat fma construction, because doing so removes unnecessary rounding, which is generally preferable regardless of whether BFMLALB is used or not.


Patch is 38.37 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/167340.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+1-1)
  • (modified) llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td (+4)
  • (modified) llvm/test/CodeGen/AArch64/sve-bf16-arith.ll (+11-17)
  • (modified) llvm/test/CodeGen/AArch64/sve-bf16-combines.ll (+599-127)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 60aa61e993b26..da6c65f2c1c7d 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -18570,7 +18570,7 @@ bool AArch64TargetLowering::isFMAFasterThanFMulAndFAdd(
   case MVT::f64:
     return true;
   case MVT::bf16:
-    return VT.isScalableVector() && Subtarget->hasSVEB16B16() &&
+    return VT.isScalableVector() && Subtarget->hasBF16() &&
            Subtarget->isNonStreamingSVEorSME2Available();
   default:
     break;
diff --git a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
index 3b268dcbca600..ce6de5c780cf3 100644
--- a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
@@ -2578,6 +2578,10 @@ let Predicates = [HasBF16, HasSVE_or_SME] in {
   defm BFMLALB_ZZZI : sve2_fp_mla_long_by_indexed_elem<0b100, "bfmlalb", nxv4f32, nxv8bf16, int_aarch64_sve_bfmlalb_lane_v2>;
   defm BFMLALT_ZZZI : sve2_fp_mla_long_by_indexed_elem<0b101, "bfmlalt", nxv4f32, nxv8bf16, int_aarch64_sve_bfmlalt_lane_v2>;
 
+  def : Pat<(nxv4f32 (AArch64fmla_p (SVEAllActive), nxv4f32:$acc, (nxv4f32 (AArch64fcvte_mt (SVEAllActive), nxv4bf16:$Zn, (undef))),
+                                                                  (nxv4f32 (AArch64fcvte_mt (SVEAllActive), nxv4bf16:$Zm, (undef))))),
+            (BFMLALB_ZZZ nxv4f32:$acc, ZPR:$Zn, ZPR:$Zm)>;
+
   defm BFCVT_ZPmZ   : sve_bfloat_convert<"bfcvt", int_aarch64_sve_fcvt_bf16f32_v2, AArch64fcvtr_mt>;
   defm BFCVTNT_ZPmZ : sve_bfloat_convert_top<"bfcvtnt", int_aarch64_sve_fcvtnt_bf16f32_v2>;
 } // End HasBF16, HasSVE_or_SME
diff --git a/llvm/test/CodeGen/AArch64/sve-bf16-arith.ll b/llvm/test/CodeGen/AArch64/sve-bf16-arith.ll
index 0580f5e0b019a..582e8456c05b3 100644
--- a/llvm/test/CodeGen/AArch64/sve-bf16-arith.ll
+++ b/llvm/test/CodeGen/AArch64/sve-bf16-arith.ll
@@ -466,12 +466,10 @@ define <vscale x 2 x bfloat> @fmla_nxv2bf16(<vscale x 2 x bfloat> %a, <vscale x
 define <vscale x 4 x bfloat> @fmla_nxv4bf16(<vscale x 4 x bfloat> %a, <vscale x 4 x bfloat> %b, <vscale x 4 x bfloat> %c) {
 ; NOB16B16-LABEL: fmla_nxv4bf16:
 ; NOB16B16:       // %bb.0:
-; NOB16B16-NEXT:    lsl z1.s, z1.s, #16
-; NOB16B16-NEXT:    lsl z0.s, z0.s, #16
 ; NOB16B16-NEXT:    lsl z2.s, z2.s, #16
 ; NOB16B16-NEXT:    ptrue p0.s
-; NOB16B16-NEXT:    fmad z0.s, p0/m, z1.s, z2.s
-; NOB16B16-NEXT:    bfcvt z0.h, p0/m, z0.s
+; NOB16B16-NEXT:    bfmlalb z2.s, z0.h, z1.h
+; NOB16B16-NEXT:    bfcvt z0.h, p0/m, z2.s
 ; NOB16B16-NEXT:    ret
 ;
 ; B16B16-LABEL: fmla_nxv4bf16:
@@ -486,24 +484,20 @@ define <vscale x 4 x bfloat> @fmla_nxv4bf16(<vscale x 4 x bfloat> %a, <vscale x
 define <vscale x 8 x bfloat> @fmla_nxv8bf16(<vscale x 8 x bfloat> %a, <vscale x 8 x bfloat> %b, <vscale x 8 x bfloat> %c) {
 ; NOB16B16-LABEL: fmla_nxv8bf16:
 ; NOB16B16:       // %bb.0:
-; NOB16B16-NEXT:    uunpkhi z3.s, z1.h
-; NOB16B16-NEXT:    uunpkhi z4.s, z0.h
-; NOB16B16-NEXT:    uunpkhi z5.s, z2.h
+; NOB16B16-NEXT:    uunpkhi z3.s, z2.h
+; NOB16B16-NEXT:    uunpklo z2.s, z2.h
+; NOB16B16-NEXT:    uunpkhi z4.s, z1.h
+; NOB16B16-NEXT:    uunpkhi z5.s, z0.h
 ; NOB16B16-NEXT:    uunpklo z1.s, z1.h
 ; NOB16B16-NEXT:    uunpklo z0.s, z0.h
-; NOB16B16-NEXT:    uunpklo z2.s, z2.h
 ; NOB16B16-NEXT:    ptrue p0.s
 ; NOB16B16-NEXT:    lsl z3.s, z3.s, #16
-; NOB16B16-NEXT:    lsl z4.s, z4.s, #16
-; NOB16B16-NEXT:    lsl z5.s, z5.s, #16
-; NOB16B16-NEXT:    lsl z1.s, z1.s, #16
-; NOB16B16-NEXT:    lsl z0.s, z0.s, #16
 ; NOB16B16-NEXT:    lsl z2.s, z2.s, #16
-; NOB16B16-NEXT:    fmad z3.s, p0/m, z4.s, z5.s
-; NOB16B16-NEXT:    fmad z0.s, p0/m, z1.s, z2.s
-; NOB16B16-NEXT:    bfcvt z1.h, p0/m, z3.s
-; NOB16B16-NEXT:    bfcvt z0.h, p0/m, z0.s
-; NOB16B16-NEXT:    uzp1 z0.h, z0.h, z1.h
+; NOB16B16-NEXT:    bfmlalb z3.s, z5.h, z4.h
+; NOB16B16-NEXT:    bfmlalb z2.s, z0.h, z1.h
+; NOB16B16-NEXT:    bfcvt z0.h, p0/m, z3.s
+; NOB16B16-NEXT:    bfcvt z1.h, p0/m, z2.s
+; NOB16B16-NEXT:    uzp1 z0.h, z1.h, z0.h
 ; NOB16B16-NEXT:    ret
 ;
 ; B16B16-LABEL: fmla_nxv8bf16:
diff --git a/llvm/test/CodeGen/AArch64/sve-bf16-combines.ll b/llvm/test/CodeGen/AArch64/sve-bf16-combines.ll
index 5c58eab391972..16e8feb0dc5bb 100644
--- a/llvm/test/CodeGen/AArch64/sve-bf16-combines.ll
+++ b/llvm/test/CodeGen/AArch64/sve-bf16-combines.ll
@@ -1,79 +1,175 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mattr=+sve,+bf16,+sve-b16b16 < %s | FileCheck %s
+; RUN: llc -mattr=+sve,+bf16             < %s | FileCheck %s --check-prefixes=SVE
+; RUN: llc -mattr=+sve,+bf16,+sve-b16b16 < %s | FileCheck %s --check-prefixes=SVE-B16B16
 
 target triple = "aarch64-unknown-linux-gnu"
 
 define <vscale x 8 x bfloat> @fmla_nxv8bf16(<vscale x 8 x bfloat> %acc, <vscale x 8 x bfloat> %m1, <vscale x 8 x bfloat> %m2) {
-; CHECK-LABEL: fmla_nxv8bf16:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.h
-; CHECK-NEXT:    bfmla z0.h, p0/m, z1.h, z2.h
-; CHECK-NEXT:    ret
+; SVE-LABEL: fmla_nxv8bf16:
+; SVE:       // %bb.0:
+; SVE-NEXT:    uunpkhi z3.s, z0.h
+; SVE-NEXT:    uunpklo z0.s, z0.h
+; SVE-NEXT:    uunpkhi z4.s, z2.h
+; SVE-NEXT:    uunpkhi z5.s, z1.h
+; SVE-NEXT:    uunpklo z2.s, z2.h
+; SVE-NEXT:    uunpklo z1.s, z1.h
+; SVE-NEXT:    ptrue p0.s
+; SVE-NEXT:    lsl z3.s, z3.s, #16
+; SVE-NEXT:    lsl z0.s, z0.s, #16
+; SVE-NEXT:    bfmlalb z3.s, z5.h, z4.h
+; SVE-NEXT:    bfmlalb z0.s, z1.h, z2.h
+; SVE-NEXT:    bfcvt z1.h, p0/m, z3.s
+; SVE-NEXT:    bfcvt z0.h, p0/m, z0.s
+; SVE-NEXT:    uzp1 z0.h, z0.h, z1.h
+; SVE-NEXT:    ret
+;
+; SVE-B16B16-LABEL: fmla_nxv8bf16:
+; SVE-B16B16:       // %bb.0:
+; SVE-B16B16-NEXT:    ptrue p0.h
+; SVE-B16B16-NEXT:    bfmla z0.h, p0/m, z1.h, z2.h
+; SVE-B16B16-NEXT:    ret
   %mul = fmul contract <vscale x 8 x bfloat> %m1, %m2
   %res = fadd contract <vscale x 8 x bfloat> %acc, %mul
   ret <vscale x 8 x bfloat> %res
 }
 
 define <vscale x 4 x bfloat> @fmla_nxv4bf16(<vscale x 4 x bfloat> %acc, <vscale x 4 x bfloat> %m1, <vscale x 4 x bfloat> %m2) {
-; CHECK-LABEL: fmla_nxv4bf16:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.s
-; CHECK-NEXT:    bfmla z0.h, p0/m, z1.h, z2.h
-; CHECK-NEXT:    ret
+; SVE-LABEL: fmla_nxv4bf16:
+; SVE:       // %bb.0:
+; SVE-NEXT:    lsl z0.s, z0.s, #16
+; SVE-NEXT:    ptrue p0.s
+; SVE-NEXT:    bfmlalb z0.s, z1.h, z2.h
+; SVE-NEXT:    bfcvt z0.h, p0/m, z0.s
+; SVE-NEXT:    ret
+;
+; SVE-B16B16-LABEL: fmla_nxv4bf16:
+; SVE-B16B16:       // %bb.0:
+; SVE-B16B16-NEXT:    ptrue p0.s
+; SVE-B16B16-NEXT:    bfmla z0.h, p0/m, z1.h, z2.h
+; SVE-B16B16-NEXT:    ret
   %mul = fmul contract <vscale x 4 x bfloat> %m1, %m2
   %res = fadd contract <vscale x 4 x bfloat> %acc, %mul
   ret <vscale x 4 x bfloat> %res
 }
 
 define <vscale x 2 x bfloat> @fmla_nxv2bf16(<vscale x 2 x bfloat> %acc, <vscale x 2 x bfloat> %m1, <vscale x 2 x bfloat> %m2) {
-; CHECK-LABEL: fmla_nxv2bf16:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.d
-; CHECK-NEXT:    bfmla z0.h, p0/m, z1.h, z2.h
-; CHECK-NEXT:    ret
+; SVE-LABEL: fmla_nxv2bf16:
+; SVE:       // %bb.0:
+; SVE-NEXT:    lsl z2.s, z2.s, #16
+; SVE-NEXT:    lsl z1.s, z1.s, #16
+; SVE-NEXT:    lsl z0.s, z0.s, #16
+; SVE-NEXT:    ptrue p0.d
+; SVE-NEXT:    fmla z0.s, p0/m, z1.s, z2.s
+; SVE-NEXT:    bfcvt z0.h, p0/m, z0.s
+; SVE-NEXT:    ret
+;
+; SVE-B16B16-LABEL: fmla_nxv2bf16:
+; SVE-B16B16:       // %bb.0:
+; SVE-B16B16-NEXT:    ptrue p0.d
+; SVE-B16B16-NEXT:    bfmla z0.h, p0/m, z1.h, z2.h
+; SVE-B16B16-NEXT:    ret
   %mul = fmul contract <vscale x 2 x bfloat> %m1, %m2
   %res = fadd contract <vscale x 2 x bfloat> %acc, %mul
   ret <vscale x 2 x bfloat> %res
 }
 
 define <vscale x 8 x bfloat> @fmls_nxv8bf16(<vscale x 8 x bfloat> %acc, <vscale x 8 x bfloat> %m1, <vscale x 8 x bfloat> %m2) {
-; CHECK-LABEL: fmls_nxv8bf16:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.h
-; CHECK-NEXT:    bfmls z0.h, p0/m, z1.h, z2.h
-; CHECK-NEXT:    ret
+; SVE-LABEL: fmls_nxv8bf16:
+; SVE:       // %bb.0:
+; SVE-NEXT:    ptrue p0.h
+; SVE-NEXT:    uunpkhi z3.s, z0.h
+; SVE-NEXT:    uunpklo z0.s, z0.h
+; SVE-NEXT:    uunpkhi z5.s, z2.h
+; SVE-NEXT:    uunpklo z2.s, z2.h
+; SVE-NEXT:    fneg z1.h, p0/m, z1.h
+; SVE-NEXT:    ptrue p0.s
+; SVE-NEXT:    lsl z3.s, z3.s, #16
+; SVE-NEXT:    lsl z0.s, z0.s, #16
+; SVE-NEXT:    uunpkhi z4.s, z1.h
+; SVE-NEXT:    uunpklo z1.s, z1.h
+; SVE-NEXT:    bfmlalb z3.s, z4.h, z5.h
+; SVE-NEXT:    bfmlalb z0.s, z1.h, z2.h
+; SVE-NEXT:    bfcvt z1.h, p0/m, z3.s
+; SVE-NEXT:    bfcvt z0.h, p0/m, z0.s
+; SVE-NEXT:    uzp1 z0.h, z0.h, z1.h
+; SVE-NEXT:    ret
+;
+; SVE-B16B16-LABEL: fmls_nxv8bf16:
+; SVE-B16B16:       // %bb.0:
+; SVE-B16B16-NEXT:    ptrue p0.h
+; SVE-B16B16-NEXT:    bfmls z0.h, p0/m, z1.h, z2.h
+; SVE-B16B16-NEXT:    ret
   %mul = fmul contract <vscale x 8 x bfloat> %m1, %m2
   %res = fsub contract <vscale x 8 x bfloat> %acc, %mul
   ret <vscale x 8 x bfloat> %res
 }
 
 define <vscale x 4 x bfloat> @fmls_nxv4bf16(<vscale x 4 x bfloat> %acc, <vscale x 4 x bfloat> %m1, <vscale x 4 x bfloat> %m2) {
-; CHECK-LABEL: fmls_nxv4bf16:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.s
-; CHECK-NEXT:    bfmls z0.h, p0/m, z1.h, z2.h
-; CHECK-NEXT:    ret
+; SVE-LABEL: fmls_nxv4bf16:
+; SVE:       // %bb.0:
+; SVE-NEXT:    ptrue p0.s
+; SVE-NEXT:    lsl z0.s, z0.s, #16
+; SVE-NEXT:    fneg z1.h, p0/m, z1.h
+; SVE-NEXT:    bfmlalb z0.s, z1.h, z2.h
+; SVE-NEXT:    bfcvt z0.h, p0/m, z0.s
+; SVE-NEXT:    ret
+;
+; SVE-B16B16-LABEL: fmls_nxv4bf16:
+; SVE-B16B16:       // %bb.0:
+; SVE-B16B16-NEXT:    ptrue p0.s
+; SVE-B16B16-NEXT:    bfmls z0.h, p0/m, z1.h, z2.h
+; SVE-B16B16-NEXT:    ret
   %mul = fmul contract <vscale x 4 x bfloat> %m1, %m2
   %res = fsub contract <vscale x 4 x bfloat> %acc, %mul
   ret <vscale x 4 x bfloat> %res
 }
 
 define <vscale x 2 x bfloat> @fmls_nxv2bf16(<vscale x 2 x bfloat> %acc, <vscale x 2 x bfloat> %m1, <vscale x 2 x bfloat> %m2) {
-; CHECK-LABEL: fmls_nxv2bf16:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.d
-; CHECK-NEXT:    bfmls z0.h, p0/m, z1.h, z2.h
-; CHECK-NEXT:    ret
+; SVE-LABEL: fmls_nxv2bf16:
+; SVE:       // %bb.0:
+; SVE-NEXT:    ptrue p0.d
+; SVE-NEXT:    lsl z2.s, z2.s, #16
+; SVE-NEXT:    lsl z0.s, z0.s, #16
+; SVE-NEXT:    fneg z1.h, p0/m, z1.h
+; SVE-NEXT:    lsl z1.s, z1.s, #16
+; SVE-NEXT:    fmla z0.s, p0/m, z1.s, z2.s
+; SVE-NEXT:    bfcvt z0.h, p0/m, z0.s
+; SVE-NEXT:    ret
+;
+; SVE-B16B16-LABEL: fmls_nxv2bf16:
+; SVE-B16B16:       // %bb.0:
+; SVE-B16B16-NEXT:    ptrue p0.d
+; SVE-B16B16-NEXT:    bfmls z0.h, p0/m, z1.h, z2.h
+; SVE-B16B16-NEXT:    ret
   %mul = fmul contract <vscale x 2 x bfloat> %m1, %m2
   %res = fsub contract <vscale x 2 x bfloat> %acc, %mul
   ret <vscale x 2 x bfloat> %res
 }
 
 define <vscale x 8 x bfloat> @fmla_sel_nxv8bf16(<vscale x 8 x i1> %pred, <vscale x 8 x bfloat> %acc, <vscale x 8 x bfloat> %m1, <vscale x 8 x bfloat> %m2) {
-; CHECK-LABEL: fmla_sel_nxv8bf16:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    bfmla z0.h, p0/m, z1.h, z2.h
-; CHECK-NEXT:    ret
+; SVE-LABEL: fmla_sel_nxv8bf16:
+; SVE:       // %bb.0:
+; SVE-NEXT:    uunpkhi z3.s, z0.h
+; SVE-NEXT:    uunpklo z4.s, z0.h
+; SVE-NEXT:    uunpkhi z5.s, z2.h
+; SVE-NEXT:    uunpkhi z6.s, z1.h
+; SVE-NEXT:    uunpklo z2.s, z2.h
+; SVE-NEXT:    uunpklo z1.s, z1.h
+; SVE-NEXT:    ptrue p1.s
+; SVE-NEXT:    lsl z3.s, z3.s, #16
+; SVE-NEXT:    lsl z4.s, z4.s, #16
+; SVE-NEXT:    bfmlalb z3.s, z6.h, z5.h
+; SVE-NEXT:    bfmlalb z4.s, z1.h, z2.h
+; SVE-NEXT:    bfcvt z1.h, p1/m, z3.s
+; SVE-NEXT:    bfcvt z2.h, p1/m, z4.s
+; SVE-NEXT:    uzp1 z1.h, z2.h, z1.h
+; SVE-NEXT:    mov z0.h, p0/m, z1.h
+; SVE-NEXT:    ret
+;
+; SVE-B16B16-LABEL: fmla_sel_nxv8bf16:
+; SVE-B16B16:       // %bb.0:
+; SVE-B16B16-NEXT:    bfmla z0.h, p0/m, z1.h, z2.h
+; SVE-B16B16-NEXT:    ret
   %mul = fmul contract <vscale x 8 x bfloat> %m1, %m2
   %add = fadd contract <vscale x 8 x bfloat> %acc, %mul
   %res = select <vscale x 8 x i1> %pred, <vscale x 8 x bfloat> %add, <vscale x 8 x bfloat> %acc
@@ -81,10 +177,17 @@ define <vscale x 8 x bfloat> @fmla_sel_nxv8bf16(<vscale x 8 x i1> %pred, <vscale
 }
 
 define <vscale x 4 x bfloat> @fmla_sel_nxv4bf16(<vscale x 4 x i1> %pred, <vscale x 4 x bfloat> %acc, <vscale x 4 x bfloat> %m1, <vscale x 4 x bfloat> %m2) {
-; CHECK-LABEL: fmla_sel_nxv4bf16:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    bfmla z0.h, p0/m, z1.h, z2.h
-; CHECK-NEXT:    ret
+; SVE-LABEL: fmla_sel_nxv4bf16:
+; SVE:       // %bb.0:
+; SVE-NEXT:    lsl z3.s, z0.s, #16
+; SVE-NEXT:    bfmlalb z3.s, z1.h, z2.h
+; SVE-NEXT:    bfcvt z0.h, p0/m, z3.s
+; SVE-NEXT:    ret
+;
+; SVE-B16B16-LABEL: fmla_sel_nxv4bf16:
+; SVE-B16B16:       // %bb.0:
+; SVE-B16B16-NEXT:    bfmla z0.h, p0/m, z1.h, z2.h
+; SVE-B16B16-NEXT:    ret
   %mul = fmul contract <vscale x 4 x bfloat> %m1, %m2
   %add = fadd contract <vscale x 4 x bfloat> %acc, %mul
   %res = select <vscale x 4 x i1> %pred, <vscale x 4 x bfloat> %add, <vscale x 4 x bfloat> %acc
@@ -92,10 +195,20 @@ define <vscale x 4 x bfloat> @fmla_sel_nxv4bf16(<vscale x 4 x i1> %pred, <vscale
 }
 
 define <vscale x 2 x bfloat> @fmla_sel_nxv2bf16(<vscale x 2 x i1> %pred, <vscale x 2 x bfloat> %acc, <vscale x 2 x bfloat> %m1, <vscale x 2 x bfloat> %m2) {
-; CHECK-LABEL: fmla_sel_nxv2bf16:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    bfmla z0.h, p0/m, z1.h, z2.h
-; CHECK-NEXT:    ret
+; SVE-LABEL: fmla_sel_nxv2bf16:
+; SVE:       // %bb.0:
+; SVE-NEXT:    lsl z2.s, z2.s, #16
+; SVE-NEXT:    lsl z1.s, z1.s, #16
+; SVE-NEXT:    lsl z3.s, z0.s, #16
+; SVE-NEXT:    ptrue p1.d
+; SVE-NEXT:    fmad z1.s, p1/m, z2.s, z3.s
+; SVE-NEXT:    bfcvt z0.h, p0/m, z1.s
+; SVE-NEXT:    ret
+;
+; SVE-B16B16-LABEL: fmla_sel_nxv2bf16:
+; SVE-B16B16:       // %bb.0:
+; SVE-B16B16-NEXT:    bfmla z0.h, p0/m, z1.h, z2.h
+; SVE-B16B16-NEXT:    ret
   %mul = fmul contract <vscale x 2 x bfloat> %m1, %m2
   %add = fadd contract <vscale x 2 x bfloat> %acc, %mul
   %res = select <vscale x 2 x i1> %pred, <vscale x 2 x bfloat> %add, <vscale x 2 x bfloat> %acc
@@ -103,10 +216,31 @@ define <vscale x 2 x bfloat> @fmla_sel_nxv2bf16(<vscale x 2 x i1> %pred, <vscale
 }
 
 define <vscale x 8 x bfloat> @fmls_sel_nxv8bf16(<vscale x 8 x i1> %pred, <vscale x 8 x bfloat> %acc, <vscale x 8 x bfloat> %m1, <vscale x 8 x bfloat> %m2) {
-; CHECK-LABEL: fmls_sel_nxv8bf16:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    bfmls z0.h, p0/m, z1.h, z2.h
-; CHECK-NEXT:    ret
+; SVE-LABEL: fmls_sel_nxv8bf16:
+; SVE:       // %bb.0:
+; SVE-NEXT:    ptrue p1.h
+; SVE-NEXT:    uunpkhi z3.s, z0.h
+; SVE-NEXT:    uunpklo z4.s, z0.h
+; SVE-NEXT:    uunpkhi z6.s, z2.h
+; SVE-NEXT:    uunpklo z2.s, z2.h
+; SVE-NEXT:    fneg z1.h, p1/m, z1.h
+; SVE-NEXT:    ptrue p1.s
+; SVE-NEXT:    lsl z3.s, z3.s, #16
+; SVE-NEXT:    lsl z4.s, z4.s, #16
+; SVE-NEXT:    uunpkhi z5.s, z1.h
+; SVE-NEXT:    uunpklo z1.s, z1.h
+; SVE-NEXT:    bfmlalb z3.s, z5.h, z6.h
+; SVE-NEXT:    bfmlalb z4.s, z1.h, z2.h
+; SVE-NEXT:    bfcvt z1.h, p1/m, z3.s
+; SVE-NEXT:    bfcvt z2.h, p1/m, z4.s
+; SVE-NEXT:    uzp1 z1.h, z2.h, z1.h
+; SVE-NEXT:    mov z0.h, p0/m, z1.h
+; SVE-NEXT:    ret
+;
+; SVE-B16B16-LABEL: fmls_sel_nxv8bf16:
+; SVE-B16B16:       // %bb.0:
+; SVE-B16B16-NEXT:    bfmls z0.h, p0/m, z1.h, z2.h
+; SVE-B16B16-NEXT:    ret
   %mul = fmul contract <vscale x 8 x bfloat> %m1, %m2
   %sub = fsub contract <vscale x 8 x bfloat> %acc, %mul
   %res = select <vscale x 8 x i1> %pred, <vscale x 8 x bfloat> %sub, <vscale x 8 x bfloat> %acc
@@ -114,10 +248,19 @@ define <vscale x 8 x bfloat> @fmls_sel_nxv8bf16(<vscale x 8 x i1> %pred, <vscale
 }
 
 define <vscale x 4 x bfloat> @fmls_sel_nxv4bf16(<vscale x 4 x i1> %pred, <vscale x 4 x bfloat> %acc, <vscale x 4 x bfloat> %m1, <vscale x 4 x bfloat> %m2) {
-; CHECK-LABEL: fmls_sel_nxv4bf16:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    bfmls z0.h, p0/m, z1.h, z2.h
-; CHECK-NEXT:    ret
+; SVE-LABEL: fmls_sel_nxv4bf16:
+; SVE:       // %bb.0:
+; SVE-NEXT:    ptrue p1.s
+; SVE-NEXT:    lsl z3.s, z0.s, #16
+; SVE-NEXT:    fneg z1.h, p1/m, z1.h
+; SVE-NEXT:    bfmlalb z3.s, z1.h, z2.h
+; SVE-NEXT:    bfcvt z0.h, p0/m, z3.s
+; SVE-NEXT:    ret
+;
+; SVE-B16B16-LABEL: fmls_sel_nxv4bf16:
+; SVE-B16B16:       // %bb.0:
+; SVE-B16B16-NEXT:    bfmls z0.h, p0/m, z1.h, z2.h
+; SVE-B16B16-NEXT:    ret
   %mul = fmul contract <vscale x 4 x bfloat> %m1, %m2
   %sub = fsub contract <vscale x 4 x bfloat> %acc, %mul
   %res = select <vscale x 4 x i1> %pred, <vscale x 4 x bfloat> %sub, <vscale x 4 x bfloat> %acc
@@ -125,10 +268,21 @@ define <vscale x 4 x bfloat> @fmls_sel_nxv4bf16(<vscale x 4 x i1> %pred, <vscale
 }
 
 define <vscale x 2 x bfloat> @fmls_sel_nxv2bf16(<vscale x 2 x i1> %pred, <vscale x 2 x bfloat> %acc, <vscale x 2 x bfloat> %m1, <vscale x 2 x bfloat> %m2) {
-; CHECK-LABEL: fmls_sel_nxv2bf16:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    bfmls z0.h, p0/m, z1.h, z2.h
-; CHECK-NEXT:    ret
+; SVE-LABEL: fmls_sel_nxv2bf16:
+; SVE:       // %bb.0:
+; SVE-NEXT:    ptrue p1.d
+; SVE-NEXT:    lsl z2.s, z2.s, #16
+; SVE-NEXT:    lsl z3.s, z0.s, #16
+; SVE-NEXT:    fneg z1.h, p1/m, z1.h
+; SVE-NEXT:    lsl z1.s, z1.s, #16
+; SVE-NEXT:    fmad z1.s, p1/m, z2.s, z3.s
+; SVE-NEXT:    bfcvt z0.h, p0/m, z1.s
+; SVE-NEXT:    ret
+;
+; SVE-B16B16-LABEL: fmls_sel_nxv2bf16:
+; SVE-B16B16:       // %bb.0:
+; SVE-B16B16-NEXT:    bfmls z0.h, p0/m, z1.h, z2.h
+; SVE-B16B16-NEXT:    ret
   %mul = fmul contract <vscale x 2 x bfloat> %m1, %m2
   %sub = fsub contract <vscale x 2 x bfloat> %acc, %mul
   %res = select <vscale x 2 x i1> %pred, <vscale x 2 x bfloat> %sub, <vscale x 2 x bfloat> %acc
@@ -136,33 +290,90 @@ define <vscale x 2 x bfloat> @fmls_sel_nxv2bf16(<vscale x 2 x i1> %pred, <vscale
 }
 
 define <vscale x 8 x bfloat> @fadd_sel_nxv8bf16(<vscale x 8 x bfloat> %a, <vscale x 8 x bfloat> %b, <vscale x 8 x i1> %mask) {
-; CHECK-LABEL: fadd_sel_nxv8bf16:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    bfadd z1.h, z0.h, z1.h
-; CHECK-NEXT:    mov z0.h, p0/m, z1.h
-; CHECK-NEXT:    ret
+; SVE-LABEL: fadd_sel_nxv8bf16:
+; SVE:       // %bb.0:
+; SVE-NEXT:    uunpkhi z2.s, z1.h
+; SVE-NEXT:    uunpkhi z3.s, z0.h
+; SVE-NEXT:    uunpklo z1.s, z1.h
+; SVE-NEXT:    uunpklo z4.s, z0.h
+; SVE-NEXT:    ptrue p1.s
+; SVE-NEXT:    lsl z2.s, z2.s, #16
+; SVE-NEXT:    lsl z3.s, z3.s, #16
+; SVE-NEXT:    lsl z1.s, z1.s, #16
+; SVE-NEXT:    lsl z4.s, z4.s, #16
+; SVE-NEXT:    fadd z2.s, z3.s, z2.s
+; SVE-NEXT:    fadd z1.s, z4.s, z1.s
+; SVE-NEXT:    bfcvt z2.h, p1/m, z2.s
+; SVE-NEXT:    bfcvt z1.h, p1/m, z1.s
+; SVE-NEXT:    uzp1 z1.h, z1.h, z2.h
+; SVE-NEXT:    mov z0.h, p0/m, z1.h
+; SVE-NEXT:    ret
+;
+; SVE-B16B16-LABEL: fadd_sel_nxv8bf16:
+; SVE-B16B16:       // %bb.0:
+; SVE-B16B16-NEXT:    bfadd z1.h, z0.h, z1.h
+; SVE-B16B16-NEXT:    mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT:    ret
   %sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %b, <vscale x 8 x bfloat> zeroinitializer
   %fadd = fadd nsz <vscale x 8 x bfloat> %a, %sel
   ret <vscale x 8 x bfloat> %fadd
 }
 
 define <vscale x 8 x bfloat> @fsub_sel_nxv8bf16(<vscale x 8 x bfloat> %a, <vscale x 8 x bfloat> %b, <vscale x 8 x i1> %mask) {
-; CHECK-LABEL: fsub_sel_nxv8bf16:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    bfsub z1.h, z0.h, z1.h
-; CHECK-NEXT:    mov z0.h, p0/m, z1.h
-; CHECK-NEXT:    ret
+; SVE-LABEL: fsub_sel_nxv8bf16:
+; SVE:       // %bb.0:
+; SVE-NEXT:    uunpkhi z2.s, z1.h
+; SVE-NEXT:    uunpkhi z3.s, z0.h
+; SVE-NEXT:    uunpklo z1.s, z1.h
+; SVE-NEXT:    uunpklo z4.s, z0.h
+; SVE-NEXT:    ptrue p1.s
+; SVE-NEXT:    lsl z2.s, z2.s, #16
+; SVE-NEXT:    lsl z3.s, z3.s, #16
+; SVE-NEXT:    lsl z1.s, z1.s, #16
+; SVE-NEXT:    lsl z4.s, z4.s, #16
+; SVE-NEXT:    fsub z2.s, z3.s, z2.s
+; SVE-NEXT:    fsub z1.s, z4.s, z1.s
+; SVE-NEXT:    bfcvt z2.h, p1/m, z2.s
+; SVE-NEXT:    bfcvt z1.h, p1/m, z1.s
+; SVE-NEXT:    uzp1 z1.h, z1.h, z2.h
+; SVE-NEXT:    mov z0.h, p0/m, z1.h
+; SVE-NEXT:    ret
+;
+; SVE-B16B16-LABEL: fsub_sel_nxv8bf16:
+; SVE-B16B16:       // %bb.0:
+; SVE-B16B16-NEXT:    bfsub z1.h, z0.h, z1.h
+; SVE-B16B16-NEXT:    mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT:    ret
   %sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %b, <vscale x 8 x bfloat> zeroinitializer
   %fsub = fsub <vscale x 8 x bfloat> %a, %sel
   ret <vscale x 8 x bfloat> %fsub
 }
 
 define <vscale x 8 x bfloat> @fadd_sel_negzero_nxv8bf16(<vscale x 8 x bfloat> %a, <vscale x 8 x bfloat> %b, <vscale x 8 x i1> %mask) {
...
[truncated]

@MacDue (Member) left a comment

LGTM!

@paulwalker-arm paulwalker-arm merged commit 46e9d63 into llvm:main Nov 12, 2025
10 checks passed
@paulwalker-arm paulwalker-arm deleted the sve-bfmlal branch November 12, 2025 11:46
git-crd pushed a commit to git-crd/crd-llvm-project that referenced this pull request Nov 13, 2025
…lvm#167340)

We're likely to get better code from custom legalisation, where we can
remove unpack instructions (plus SVE2p1 has BFMLSLB/T), but we get much
of the benefit with these two small changes.

NOTE: LLVM has no support for FEAT_AFP in terms of feature detection or
ACLE builtins, so the compiler works under the assumption the feature is
not enabled.

Patch is also more aggressive when enabling bfloat fma construction
because it removes unnecessary rounding which is generally preferable
regardless of whether BFMLALB is used or not.