[AArch64] add intrinsic to generate a bfi instruction #79672

Closed

Conversation

RamaMalladiAWS

BFI: Bit Field Insert copies any number of low order bits from a
register into the same number of adjacent bits at any position
in the destination register.

This PR generates the BFI instruction by implementing an intrinsic
function that can be invoked from the LLVM-IR.
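For reference, here is a sketch of the semantics the instruction (and hence the intrinsic) provides; the field position and width below are illustrative values chosen for this example, not something fixed by the patch:

; Illustrative semantics of bfi x0, x1, #23, #8: copy the low 8 bits of x1
; into bits [30:23] of x0, leaving all other bits of x0 unchanged, i.e.
;   x0 = (x0 & ~(0xff << 23)) | ((x1 & 0xff) << 23)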


Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository; in that case, you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot (Collaborator) commented Jan 27, 2024

@llvm/pr-subscribers-backend-aarch64

@llvm/pr-subscribers-llvm-ir

Author: Rama Malladi (RamaMalladiAWS)

Changes

BFI: Bit Field Insert copies any number of low order bits from a
register into the same number of adjacent bits at any position
in the destination register.

This PR generates the BFI instruction by implementing an intrinsic
function that can be invoked from the LLVM-IR.


Full diff: https://github.com/llvm/llvm-project/pull/79672.diff

4 Files Affected:

  • (modified) llvm/include/llvm/IR/IntrinsicsAArch64.td (+3)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp (+13)
  • (modified) llvm/lib/Target/AArch64/AArch64InstrInfo.td (+8)
  • (added) llvm/test/CodeGen/AArch64/bfi-64-intrinsic.ll (+25)
diff --git a/llvm/include/llvm/IR/IntrinsicsAArch64.td b/llvm/include/llvm/IR/IntrinsicsAArch64.td
index 921e5b95ae03e8..9eb5154c95138f 100644
--- a/llvm/include/llvm/IR/IntrinsicsAArch64.td
+++ b/llvm/include/llvm/IR/IntrinsicsAArch64.td
@@ -855,6 +855,9 @@ def int_aarch64_crc32x  : DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llv
     [IntrNoMem]>;
 def int_aarch64_crc32cx : DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i64_ty],
     [IntrNoMem]>;
+def int_aarch64_bfi : DefaultAttrsIntrinsic<
+    [llvm_anyint_ty], [llvm_anyint_ty, llvm_anyint_ty, llvm_anyint_ty, llvm_anyint_ty],
+    [IntrNoMem, ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<3>>]>;
 }
 
 //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp b/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
index 163ed520a8a677..1fe3f95d54d131 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
@@ -5230,6 +5230,19 @@ void AArch64DAGToDAGISel::Select(SDNode *Node) {
     switch (IntNo) {
     default:
       break;
+    case Intrinsic::aarch64_bfi: {
+      SDLoc DL(Node);
+      auto lsb = cast<ConstantSDNode>(Node->getOperand(3))->getZExtValue();
+      auto width = cast<ConstantSDNode>(Node->getOperand(4))->getZExtValue();
+      auto ImmR = (VT.getSizeInBits() - lsb) % VT.getSizeInBits();
+      auto ImmS = width - 1;
+      SDValue Ops[] = {Node->getOperand(1), Node->getOperand(2),
+                       CurDAG->getConstant(ImmR, DL, VT),
+                       CurDAG->getConstant(ImmS, DL, VT)};
+      unsigned Opc = (VT == MVT::i32) ? AArch64::BFMWri : AArch64::BFMXri;
+      CurDAG->SelectNodeTo(Node, Opc, VT, Ops);
+      return;
+    }
     case Intrinsic::aarch64_tagp:
       SelectTagP(Node);
       return;
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index 03baa7497615e3..afa911abad7982 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -2558,6 +2558,14 @@ def : Pat<(rotr GPR32:$Rn, (i64 imm0_31:$imm)),
 def : Pat<(rotr GPR64:$Rn, (i64 imm0_63:$imm)),
           (EXTRXrri GPR64:$Rn, GPR64:$Rn, imm0_63:$imm)>;
 
+def SDT_AArch64BFI_32bit : SDTypeProfile<1, 2, [SDTCisVT<0, i32>, SDTCisVT<1, i32>,
+                                                    SDTCisVT<2, i32>]>;
+def SDT_AArch64BFI_64bit : SDTypeProfile<1, 2, [SDTCisVT<0, i64>, SDTCisVT<1, i64>,
+                                                    SDTCisVT<2, i64>]>;
+
+def aarch64_bfiw  : SDNode<"AArch64::BFMWri",  SDT_AArch64BFI_32bit>;
+def aarch64_bfix  : SDNode<"AArch64::BFMXri",  SDT_AArch64BFI_64bit>;
+
 //===----------------------------------------------------------------------===//
 // Other bitfield immediate instructions.
 //===----------------------------------------------------------------------===//
diff --git a/llvm/test/CodeGen/AArch64/bfi-64-intrinsic.ll b/llvm/test/CodeGen/AArch64/bfi-64-intrinsic.ll
new file mode 100644
index 00000000000000..11ecde6b6fab20
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/bfi-64-intrinsic.ll
@@ -0,0 +1,25 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=aarch64-none-linux-gnu %s -o - 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-GI
+
+define i32 @f32(i32 %A, i32 %B) nounwind {
+; CHECK-LABEL: f32:
+; CHECK-GI:      // %bb.0:
+; CHECK-GI-NEXT: bfi w0, w1, #4, #2
+; CHECK-GI-NEXT: ret
+entry:
+  %tmp32 = call i32 @llvm.aarch64.bfi.i32(i32 %A, i32 %B, i32 4, i32 2)
+  ret i32 %tmp32
+}
+
+define i64 @f64(i64 %A, i64 %B) nounwind {
+; CHECK-LABEL: f64:
+; CHECK-GI:      // %bb.0:
+; CHECK-GI-NEXT: bfi x0, x1, #23, #8
+; CHECK-GI-NEXT: ret
+entry:
+  %tmp64 = call i64 @llvm.aarch64.bfi.i64(i64 %A, i64 %B, i64 23, i64 8)
+  ret i64 %tmp64
+}
+
+declare i32 @llvm.aarch64.bfi.i32(i32, i32, i32, i32)
+declare i64 @llvm.aarch64.bfi.i64(i64, i64, i64, i64)
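As a worked check of the immediate mapping in the ISel change above (the arithmetic is the editor's, not text from the patch): for the i64 test case

  %tmp64 = call i64 @llvm.aarch64.bfi.i64(i64 %A, i64 %B, i64 23, i64 8)

the selection code computes ImmR = (64 - 23) % 64 = 41 and ImmS = 8 - 1 = 7, so the node is selected as BFMXri with immediates #41 and #7, which the assembler prints under its alias as bfi x0, x1, #23, #8, matching the CHECK lines.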

@inclyc (Member) left a comment


I also wonder whether this intrinsic is necessary for optimization/codegen. Why not prefer inline assembly?

def SDT_AArch64BFI_64bit : SDTypeProfile<1, 2, [SDTCisVT<0, i64>, SDTCisVT<1, i64>,
SDTCisVT<2, i64>]>;

def aarch64_bfiw : SDNode<"AArch64::BFMWri", SDT_AArch64BFI_32bit>;
@inclyc (Member)

(question) Could you explain why you are creating a new SDNode? Is this necessary?

@inclyc changed the title from “intrinsic to generate a bfi instruction” to “[AArch64] add intrinsic to generate a bfi instruction” on Jan 28, 2024
@davemgreen (Collaborator)

Hello. Can you explain why this is needed, as opposed to using the equivalent shift/and/ors?

@RamaMalladiAWS (Author) commented Jan 28, 2024

> Hello. Can you explain why this is needed, as opposed to using the equivalent shift/and/ors?

Hi @davemgreen, one of our customers requested that such an intrinsic be made available so they could use it directly from their IR. The reasoning was to use a single instruction, bfi, instead of a combination of shift, and, and or.

@davemgreen (Collaborator)

OK. We would not usually add intrinsics like this without a strong motivating case that could not be optimized in some other way. It is better to use target-independent options when available, and inline assembly is available as a fallback if it is really needed. But I would recommend that they use normal and/or/shift operations and let us know about places where the compiler isn't optimizing them as well as it could.
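For completeness, a minimal sketch of the inline-assembly fallback mentioned above, written as LLVM IR; the field position, width, and names are the editor's illustrative choices, and the immediates must be literals in the asm string:

; Hedged sketch, not from this PR: bfi via inline asm. The "=r,r,0"
; constraints tie %dst to the output register so bfi can read-modify-write it.
define i64 @bfi_via_asm(i64 %dst, i64 %src) {
  %res = call i64 asm "bfi $0, $1, #23, #8", "=r,r,0"(i64 %src, i64 %dst)
  ret i64 %res
}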

@RamaMalladiAWS (Author) commented Jan 29, 2024

> OK. We would not usually add intrinsics like this without a strong motivating case that could not be optimized in some other way. It is better to use target-independent options when available, and inline assembly is available as a fallback if it is really needed. But I would recommend that they use normal and/or/shift operations and let us know about places where the compiler isn't optimizing them as well as it could.

I completely agree with that approach, @davemgreen. In this case, the IR sequence wasn't optimized to a bfi. I can try to get a test case and create an issue for generating the bfi instruction. Thanks again.
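For illustration, this is the general shape of the pattern in question, sketched by the editor (the exact IR that failed to optimize is not shown in this thread):

; Hypothetical example: insert the low 8 bits of %src at bit 23 of %dst
; using plain IR operations; the constants are illustrative choices.
define i64 @insert(i64 %dst, i64 %src) {
  %field   = and i64 %src, 255           ; keep the low 8 bits of %src
  %shifted = shl i64 %field, 23          ; move them to bit position 23
  %cleared = and i64 %dst, -2139095041   ; clear bits [30:23] of %dst: ~(255 << 23)
  %res     = or i64 %cleared, %shifted
  ret i64 %res
}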

@RKSimon (Collaborator) commented Jan 29, 2024

@RamaMalladiAWS Do you have examples of the IR that fails to lower to BFI? These things often turn out to be either a missing middle-end canonicalization or a case that could be added to the existing pattern matching in the back-end.

@RamaMalladiAWS (Author)

> @RamaMalladiAWS Do you have examples of the IR that fails to lower to BFI? These things often turn out to be either a missing middle-end canonicalization or a case that could be added to the existing pattern matching in the back-end.

Yes, @RKSimon, I will try to get some test-cases in the next couple of days and we can evaluate the issues if any. Thank you.

@dwpan commented Jan 30, 2024

> Hello. Can you explain why this is needed, as opposed to using the equivalent shift/and/ors?

In Verilog/SystemVerilog, the basic type is a bit or bit vector of arbitrary length, and inserting/extracting bits are common language features. Introducing corresponding intrinsics could help lower them gradually and bring more optimization opportunities in llc. Otherwise, many shift/and/or operations have to be emitted, and we then depend on pattern matching to recognize and optimize them.
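As a concrete, editor-sketched example of the lowering dwpan describes: a SystemVerilog part-select read such as y = x[30:23]; would typically come out of a frontend as shift-plus-mask IR, which the AArch64 backend can generally already match to a single ubfx:

; Hedged sketch: bit-field extraction written with plain IR operations.
define i64 @extract(i64 %x) {
  %shr = lshr i64 %x, 23     ; bring bits [30:23] down to the bottom
  %fld = and i64 %shr, 255   ; mask to the 8-bit field width
  ret i64 %fld
}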

@davemgreen (Collaborator)

I see. The issue is that the opposite is often true as well: if we add a target-specific intrinsic for this then, whilst we get a single instruction emitted, we don't see all the other optimizations that the compiler can and should be performing.

Things like constant folding, combining into other instructions, known-bits analysis, or any form of vectorization will all be blocked by the intrinsic. It can take quite some work to add all those features in (if they are possible at all), and without them the intrinsic can potentially lead to worse results. Plus it is more to maintain.

BFI isn't a trivial instruction to match, as it involves particular masks and shifts, so there might certainly be advantages to having an intrinsic. I would first like to see what the problems are with code generated from normal operations, though, if we can. If there are optimizations we can make based on the existing code, that would help in all cases (C, MLIR, Rust, etc.), not just frontends that produce the intrinsics.
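A small editor-added illustration of the constant-folding point: with plain operations, a fully-constant insert folds down to a single ret, while the same computation through the proposed intrinsic would stay opaque to generic folding unless folding support were added for it explicitly:

; Folds to `ret i64 2139095040` by ordinary constant folding.
define i64 @fold_ops() {
  %shifted = shl i64 255, 23
  %res = or i64 0, %shifted
  ret i64 %res
}

; Computes the same value, but is not folded without intrinsic-specific support.
define i64 @fold_intrinsic() {
  %res = call i64 @llvm.aarch64.bfi.i64(i64 0, i64 255, i64 23, i64 8)
  ret i64 %res
}
declare i64 @llvm.aarch64.bfi.i64(i64, i64, i64, i64)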

@RamaMalladiAWS (Author)

We decided to close this PR, as we found llc code-gen was already good for many of our test cases. Thank you for the reviews.
