[AArch64] add intrinsic to generate a bfi instruction #79672
BFI: Bit Field Insert copies any number of low order bits from a register into the same number of adjacent bits at any position in the destination register. This PR generates the BFI instruction by implementing an intrinsic function that can be invoked from the LLVM-IR.
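To make the description concrete, here is a minimal C++ sketch of the bit-level effect of BFI on a 32-bit register. `bfi32_model` is a hypothetical reference model written for illustration; it is not part of LLVM or of this PR, and it assumes the usual constraint that the inserted field fits inside the register.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reference model of BFI on a 32-bit register (not LLVM code).
// Copies the low `width` bits of `src` into `dst` starting at bit `lsb`,
// leaving every other bit of `dst` unchanged.
inline std::uint32_t bfi32_model(std::uint32_t dst, std::uint32_t src,
                                 unsigned lsb, unsigned width) {
    // Assumed constraint: 1 <= width and lsb + width <= 32.
    assert(width >= 1 && lsb < 32 && lsb + width <= 32);
    // Mask of `width` ones; width == 32 is handled separately to avoid
    // shifting a 32-bit value by 32, which is undefined in C++.
    const std::uint32_t low = (width == 32)
                                  ? ~std::uint32_t{0}
                                  : ((std::uint32_t{1} << width) - 1);
    const std::uint32_t mask = low << lsb;
    // Clear the destination field, then merge in the shifted source bits.
    return (dst & ~mask) | ((src << lsb) & mask);
}
```

For example, `bfi32_model(0x12345678, 0xAB, 8, 8)` replaces bits [15:8] of the destination and yields `0x1234AB78`.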
@llvm/pr-subscribers-backend-aarch64
@llvm/pr-subscribers-llvm-ir

Author: Rama Malladi (RamaMalladiAWS)

Full diff: https://github.com/llvm/llvm-project/pull/79672.diff

4 Files Affected:
diff --git a/llvm/include/llvm/IR/IntrinsicsAArch64.td b/llvm/include/llvm/IR/IntrinsicsAArch64.td
index 921e5b95ae03e8..9eb5154c95138f 100644
--- a/llvm/include/llvm/IR/IntrinsicsAArch64.td
+++ b/llvm/include/llvm/IR/IntrinsicsAArch64.td
@@ -855,6 +855,9 @@ def int_aarch64_crc32x : DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llv
[IntrNoMem]>;
def int_aarch64_crc32cx : DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i64_ty],
[IntrNoMem]>;
+def int_aarch64_bfi : DefaultAttrsIntrinsic<
+ [llvm_anyint_ty], [llvm_anyint_ty, llvm_anyint_ty, llvm_anyint_ty, llvm_anyint_ty],
+ [IntrNoMem, ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<3>>]>;
}
//===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp b/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
index 163ed520a8a677..1fe3f95d54d131 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
@@ -5230,6 +5230,19 @@ void AArch64DAGToDAGISel::Select(SDNode *Node) {
switch (IntNo) {
default:
break;
+ case Intrinsic::aarch64_bfi: {
+ SDLoc DL(Node);
+ auto lsb = cast<ConstantSDNode>(Node->getOperand(3))->getZExtValue();
+ auto width = cast<ConstantSDNode>(Node->getOperand(4))->getZExtValue();
+ auto ImmR = (VT.getSizeInBits() - lsb) % VT.getSizeInBits();
+ auto ImmS = width - 1;
+ SDValue Ops[] = {Node->getOperand(1), Node->getOperand(2),
+ CurDAG->getConstant(ImmR, DL, VT),
+ CurDAG->getConstant(ImmS, DL, VT)};
+ unsigned Opc = (VT == MVT::i32) ? AArch64::BFMWri : AArch64::BFMXri;
+ CurDAG->SelectNodeTo(Node, Opc, VT, Ops);
+ return;
+ }
case Intrinsic::aarch64_tagp:
SelectTagP(Node);
return;
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index 03baa7497615e3..afa911abad7982 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -2558,6 +2558,14 @@ def : Pat<(rotr GPR32:$Rn, (i64 imm0_31:$imm)),
def : Pat<(rotr GPR64:$Rn, (i64 imm0_63:$imm)),
(EXTRXrri GPR64:$Rn, GPR64:$Rn, imm0_63:$imm)>;
+def SDT_AArch64BFI_32bit : SDTypeProfile<1, 2, [SDTCisVT<0, i32>, SDTCisVT<1, i32>,
+ SDTCisVT<2, i32>]>;
+def SDT_AArch64BFI_64bit : SDTypeProfile<1, 2, [SDTCisVT<0, i64>, SDTCisVT<1, i64>,
+ SDTCisVT<2, i64>]>;
+
+def aarch64_bfiw : SDNode<"AArch64::BFMWri", SDT_AArch64BFI_32bit>;
+def aarch64_bfix : SDNode<"AArch64::BFMXri", SDT_AArch64BFI_64bit>;
+
//===----------------------------------------------------------------------===//
// Other bitfield immediate instructions.
//===----------------------------------------------------------------------===//
diff --git a/llvm/test/CodeGen/AArch64/bfi-64-intrinsic.ll b/llvm/test/CodeGen/AArch64/bfi-64-intrinsic.ll
new file mode 100644
index 00000000000000..11ecde6b6fab20
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/bfi-64-intrinsic.ll
@@ -0,0 +1,25 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=aarch64-none-linux-gnu %s -o - 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-GI
+
+define i32 @f32(i32 %A, i32 %B) nounwind {
+; CHECK-LABEL: f32:
+; CHECK-GI: // %bb.0:
+; CHECK-GI-NEXT: bfi w0, w1, #4, #2
+; CHECK-GI-NEXT: ret
+entry:
+ %tmp32 = call i32 @llvm.aarch64.bfi.i32(i32 %A, i32 %B, i32 4, i32 2)
+ ret i32 %tmp32
+}
+
+define i64 @f64(i64 %A, i64 %B) nounwind {
+; CHECK-LABEL: f64:
+; CHECK-GI: // %bb.0:
+; CHECK-GI-NEXT: bfi x0, x1, #23, #8
+; CHECK-GI-NEXT: ret
+entry:
+ %tmp64 = call i64 @llvm.aarch64.bfi.i64(i64 %A, i64 %B, i64 23, i64 8)
+ ret i64 %tmp64
+}
+
+declare i32 @llvm.aarch64.bfi.i32(i32, i32, i32, i32)
+declare i64 @llvm.aarch64.bfi.i64(i64, i64, i64, i64)
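The `ImmR`/`ImmS` arithmetic in the `AArch64ISelDAGToDAG.cpp` hunk above reflects that BFI is an alias of BFM: the rotate amount is the register size minus `lsb` (modulo the register size), and the second immediate is `width - 1`. A small sketch of that mapping follows; `BfmImms` and `bfi_to_bfm` are hypothetical names that do not appear in the patch.

```cpp
#include <cassert>

// Hypothetical helper mirroring the ImmR/ImmS computation in the diff.
struct BfmImms {
    unsigned immr;  // rotate amount used by BFM
    unsigned imms;  // index of the most significant source bit to copy
};

inline BfmImms bfi_to_bfm(unsigned reg_bits, unsigned lsb, unsigned width) {
    // Assumed constraints: a 32- or 64-bit register and a field that fits in it.
    assert(reg_bits == 32 || reg_bits == 64);
    assert(width >= 1 && lsb + width <= reg_bits);
    // lsb == 0 maps to immr == 0 thanks to the modulo.
    return BfmImms{(reg_bits - lsb) % reg_bits, width - 1};
}
```

For the two test cases in the diff, this gives `immr = 28, imms = 1` for `bfi w0, w1, #4, #2` and `immr = 41, imms = 7` for `bfi x0, x1, #23, #8`.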
Also I kindly wonder if this intrinsic is necessary for optimization/codegen. Why not prefer inline assembly?
def SDT_AArch64BFI_64bit : SDTypeProfile<1, 2, [SDTCisVT<0, i64>, SDTCisVT<1, i64>,
SDTCisVT<2, i64>]>;

def aarch64_bfiw : SDNode<"AArch64::BFMWri", SDT_AArch64BFI_32bit>;
(question) Could you explain why a new SDNode is being created? Is this necessary?
Hello. Can you explain why this is needed, as opposed to using the equivalent shift/and/ors?
Hi @davemgreen, one of our customers requested that such an intrinsic be made available so that they could use it in their IR directly. The reasoning was to use 1 instruction such
OK. We would not usually add intrinsics like this without a strong motivating case, that could not be optimized in some other way. It is better to use target independent options when available, and inline assembly is available as a fallback if it is really needed. But I would recommend that they use normal and/or/shift operations and let us know about places the compiler isn't optimizing them as well as it could be.
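As an illustration of the "normal and/or/shift operations" suggested here, the two cases from the PR's test file can be written in plain C++ as below. The function names are hypothetical. Whether a given compiler version selects a single `bfi` for these is not guaranteed and is worth checking in the generated assembly.

```cpp
#include <cassert>
#include <cstdint>

// Insert the low 2 bits of `b` at bit 4 of `a`.
// Constants match the f32 test case in the diff (lsb = 4, width = 2).
inline std::uint32_t insert_2_bits_at_4(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t mask = std::uint32_t{0x3} << 4;  // bits [5:4]
    return (a & ~mask) | ((b << 4) & mask);
}

// Insert the low 8 bits of `b` at bit 23 of `a`.
// Constants match the f64 test case in the diff (lsb = 23, width = 8).
inline std::uint64_t insert_8_bits_at_23(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t mask = std::uint64_t{0xFF} << 23;  // bits [30:23]
    return (a & ~mask) | ((b << 23) & mask);
}

// Small self-check that can be called from a test or from main().
inline void check_inserts() {
    assert(insert_2_bits_at_4(0xFFFFFFFFu, 0u) == 0xFFFFFFCFu);
    assert(insert_8_bits_at_23(0u, 0xFFu) == 0x7F800000u);
}
```

Code in this form stays visible to target-independent optimizations, so any case that fails to lower well can be reported as a concrete missed optimization.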
I completely agree with the approach @davemgreen. In this case, the IR sequence wasn't optimized to a
@RamaMalladiAWS Do you have examples of the IR that fails to lower to BFI? These things often turn out to be either a missing middle-end canonicalization or maybe a case that could be added to existing pattern matching in the back-end.
Yes, @RKSimon, I will try to get some test-cases in the next couple of days and we can evaluate the issues if any. Thank you.
In the Verilog/SystemVerilog languages, the basic type is a bit or a bit vector of arbitrary length, and inserting and extracting bits are common language features. Introducing corresponding intrinsics could help lower these operations gradually and bring more optimization opportunities in llc. Otherwise, they have to be translated into many shift/and/or operations, and we then depend on pattern matching to recognize and optimize them.
I see. The issue is that the opposite is often true as well - if we add a target specific intrinsic for this then, whilst we get a single instruction being emitted, we don't see all the other optimizations that the compiler can and should be performing. Things like constant folding, combining into other instructions, known-bits analysis or any form of vectorization will all be blocked by the intrinsic. It can take quite some work to add all those features in (if they are possible), and without them can potentially lead to worse results. Plus more things to maintain. BFI isn't a trivial instruction to match as it involves certain masks and shifts. There might certainly be advantages to having an intrinsic. I would like to try and see what the problems would be with generated code using normal operations first though, if we can. If there are optimizations we can make based on the existing code then that would help in all cases (C, MLIR, Rust, etc.), not just frontends that are producing the intrinsics.
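A rough sketch of the constant-folding and known-bits point, using C++ `constexpr` evaluation as a stand-in for what an optimizer can do with transparent operations. This is only an analogy, and `field_mask`, `insert_bits`, `extract_bits`, and `check_known_field` are hypothetical helpers, not LLVM APIs. They assume `width < 32`.

```cpp
#include <cassert>
#include <cstdint>

// Mask covering `width` bits starting at `lsb` (assumes width < 32).
constexpr std::uint32_t field_mask(unsigned lsb, unsigned width) {
    return ((std::uint32_t{1} << width) - 1) << lsb;
}

// Bit-field insert written with ordinary shifts and masks.
constexpr std::uint32_t insert_bits(std::uint32_t dst, std::uint32_t src,
                                    unsigned lsb, unsigned width) {
    return (dst & ~field_mask(lsb, width)) |
           ((src << lsb) & field_mask(lsb, width));
}

// Bit-field extract, the inverse read of the same field.
constexpr std::uint32_t extract_bits(std::uint32_t value, unsigned lsb,
                                     unsigned width) {
    return (value >> lsb) & ((std::uint32_t{1} << width) - 1);
}

// With constant inputs the whole expression folds to a constant.
static_assert(insert_bits(0u, 0x3u, 4, 2) == 0x30u, "folds at compile time");

// Reading back a field that was just written yields the written value,
// regardless of the other bits in `dst`.
inline void check_known_field(std::uint32_t dst) {
    assert(extract_bits(insert_bits(dst, 0x2u, 4, 2), 4, 2) == 0x2u);
}
```

With an opaque target intrinsic in place of `insert_bits`, neither simplification is available unless support for the intrinsic is added to each analysis separately.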
We decided to close this PR as we found LLC code-gen was good for many of our test-cases. Thank you for the reviews.