Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable Custom Lowering for fabs.v8f16 on AVX #71730

Merged
merged 6 commits into from
Nov 16, 2023
Merged

Conversation

david-xl
Copy link
Contributor

@david-xl david-xl commented Nov 8, 2023

[X86]: Enable custom lowering for fabs.v8f16 on AVX

Currently, custom lowering of fabs.v8f16 requires AVX512FP16, which is too restrictive. For v8f16 fabs lowering, no instructions in AVX512FP16 are needed. Without the fix, horribly inefficient code is generated without AVX512FP16. Note instcombiner generates calls to intrinsics @llvm.fabs.v8f16 when simplifyping AND <8 x half> operations.

@llvmbot
Copy link
Collaborator

llvmbot commented Nov 8, 2023

@llvm/pr-subscribers-backend-x86

Author: David Li (david-xl)

Changes

[X86]: Enable custom lowering for fabs.v8f16 on AVX

Currently, custom lowering of fabs.v8f16 requires AVX512FP16, which is too restrictive. For v8f16 fabs lowering, no instructions in AVX512FP16 are needed. Without the fix, horribly inefficient code is generated without AVX512FP16. Note instcombiner generates calls to intrinsics @llvm.fabs.v8f16 when simplifyping AND <8 x half> operations.


Full diff: https://github.com/llvm/llvm-project/pull/71730.diff

2 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+3)
  • (modified) llvm/test/CodeGen/X86/vec_fabs.ll (+41)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 22fba5601ccfd38..b3b5a0c1b68ec82 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -2238,6 +2238,9 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
     }
   }
 
+  if (!Subtarget.useSoftFloat() && Subtarget.hasAVX())
+    setOperationAction(ISD::FABS, MVT::v8f16, Custom);
+
   if (!Subtarget.useSoftFloat() &&
       (Subtarget.hasAVXNECONVERT() || Subtarget.hasBF16())) {
     addRegisterClass(MVT::v8bf16, Subtarget.hasAVX512() ? &X86::VR128XRegClass
diff --git a/llvm/test/CodeGen/X86/vec_fabs.ll b/llvm/test/CodeGen/X86/vec_fabs.ll
index 982062d8907542a..08364449ab1a378 100644
--- a/llvm/test/CodeGen/X86/vec_fabs.ll
+++ b/llvm/test/CodeGen/X86/vec_fabs.ll
@@ -1,8 +1,10 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=+avx | FileCheck %s --check-prefix=X86 --check-prefix=X86-AVX
+; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefix=X86 --check-prefix=X86-AVX2
 ; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=+avx512vl | FileCheck %s --check-prefix=X86 --check-prefix=X86-AVX512VL
 ; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=+avx512dq,+avx512vl | FileCheck %s --check-prefix=X86 --check-prefix=X86-AVX512VLDQ
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx | FileCheck %s --check-prefix=X64 --check-prefix=X64-AVX
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefix=X64 --check-prefix=X64-AVX2
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vl | FileCheck %s --check-prefix=X64 --check-prefix=X64-AVX512VL
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512dq,+avx512vl | FileCheck %s --check-prefix=X64 --check-prefix=X64-AVX512VLDQ
 
@@ -111,6 +113,45 @@ define <4 x double> @fabs_v4f64(<4 x double> %p) {
 }
 declare <4 x double> @llvm.fabs.v4f64(<4 x double> %p)
 
+define <8 x half> @fabs_v8f16(ptr %p) {
+; X86-AVX-LABEL: fabs_v8f16:
+; X86-AVX:       # %bb.0:
+; X86-AVX-NEXT:    movl 4(%esp), [[ADDRREG:%.*]]
+; X86-AVX-NEXT:    vmovaps ([[ADDRREG]]), %xmm0
+; X86-AVX-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}, %xmm0, %xmm0
+; X86-AVX-NEXT:    retl
+
+; X86-AVX2-LABEL: fabs_v8f16:
+; X86-AVX2:       # %bb.0:
+; X86-AVX2-NEXT:    movl 4(%esp), [[REG:%.*]]
+; X86-AVX2-NEXT:    vpbroadcastw {{\.?LCPI[0-9]+_[0-9]+}}, %xmm0
+; X86-AVX2-NEXT:    vpand ([[REG]]), %xmm0, %xmm0
+; X86-AVX2-NEXT:    retl
+
+; X64-AVX512VL-LABEL: fabs_v8f16:
+; X64-AVX512VL:       # %bb.0:
+; X64-AVX512VL-NEXT:    vpbroadcastw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; X64-AVX512VL-NEXT:    vpand (%rdi), %xmm0, %xmm0
+; X64-AVX512VL-NEXT:    retq
+
+; X64-AVX-LABEL: fabs_v8f16:
+; X64-AVX:       # %bb.0:
+; X64-AVX-NEXT:    vmovaps (%rdi), %xmm0
+; X64-AVX-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; X64-AVX-NEXT:    retq
+
+; X64-AVX2-LABEL: fabs_v8f16:
+; X64-AVX2:       # %bb.0:
+; X64-AVX2-NEXT:    vpbroadcastw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; X64-AVX2-NEXT:    vpand (%rdi), %xmm0, %xmm0
+; X64-AVX2-NEXT:    retq
+
+  %v = load <8 x half>, ptr %p, align 16
+  %nnv = call <8 x half> @llvm.fabs.v8f16(<8 x half> %v)
+  ret <8 x half> %nnv
+}
+declare <8 x half> @llvm.fabs.v8f16(<8 x half> %p)
+
 define <8 x float> @fabs_v8f32(<8 x float> %p) {
 ; X86-AVX-LABEL: fabs_v8f32:
 ; X86-AVX:       # %bb.0:

@david-xl david-xl requested a review from topperc November 9, 2023 17:53
@david-xl
Copy link
Contributor Author

This fix improves Eigen pcos performance by ~65% on Haswell.

Copy link
Collaborator

@RKSimon RKSimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@david-xl - please can you rebase this?

@david-xl david-xl requested review from JDevlieghere, nikic and a team as code owners November 13, 2023 19:28
@RKSimon
Copy link
Collaborator

RKSimon commented Nov 13, 2023

This looks like the merge has gone wrong?

@david-xl
Copy link
Contributor Author

This looks like the merge has gone wrong?

Right -- I messed it up (new to the workflow). Will fix.

@nikic nikic removed request for a team and nikic November 13, 2023 20:35
@david-xl
Copy link
Contributor Author

This looks like the merge has gone wrong?

Fixed the bad merge. PTAL.

Copy link

github-actions bot commented Nov 15, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@david-xl
Copy link
Contributor Author

Addressed review comments. PTAL.

Copy link
Collaborator

@RKSimon RKSimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one minor

%v = load <16 x half>, ptr %p, align 32
%nnv = call <16 x half> @llvm.fabs.v16f16(<16 x half> %v)
ret <16 x half> %nnv
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just let the script generate the codegen, even if its really poor for pre-AVX512FP16

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I will have a followup patch to update the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. Looks like you have already updated and cleaned up the test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, we're still doing a really bad job with v16f16/v32f16 on AVX targets - please can you take a look?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take a look.

	modified:   llvm/lib/Target/X86/X86ISelLowering.cpp
	modified:   llvm/test/CodeGen/X86/vec_fabs.ll
	modified:   llvm/test/CodeGen/X86/vec_fabs.ll
	modified:   llvm/test/CodeGen/X86/vec_fabs.ll

	modified:   llvm/test/CodeGen/X86/vec_fabs.ll

	modified:   llvm/test/CodeGen/X86/vec_fabs.ll
 1. Move the code to the common place
 2. Add test coverage for v16f16 and v32f16 FABS lowering

	modified:   llvm/lib/Target/X86/X86ISelLowering.cpp
	modified:   llvm/test/CodeGen/X86/vec_fabs.ll

	modified:   llvm/lib/Target/X86/X86ISelLowering.cpp
	modified:   llvm/test/CodeGen/X86/vec_fabs.ll
@david-xl david-xl merged commit ac3779e into llvm:main Nov 16, 2023
2 of 3 checks passed
sr-tream pushed a commit to sr-tream/llvm-project that referenced this pull request Nov 20, 2023
[X86]: Enable custom lowering for fabs.v8f16 on AVX

Currently, custom lowering of fabs.v8f16 requires AVX512FP16, which is
too restrictive. For v8f16 fabs lowering, no instructions in AVX512FP16
are needed. Without the fix, horribly inefficient code is generated
without AVX512FP16. Note instcombiner generates calls to intrinsics
@llvm.fabs.v8f16 when simplifyping AND <8 x half> operations.
zahiraam pushed a commit to zahiraam/llvm-project that referenced this pull request Nov 20, 2023
[X86]: Enable custom lowering for fabs.v8f16 on AVX

Currently, custom lowering of fabs.v8f16 requires AVX512FP16, which is
too restrictive. For v8f16 fabs lowering, no instructions in AVX512FP16
are needed. Without the fix, horribly inefficient code is generated
without AVX512FP16. Note instcombiner generates calls to intrinsics
@llvm.fabs.v8f16 when simplifyping AND <8 x half> operations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants