Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[x86] Optimize urem with a constant divisor to use multiply-by-reciprocal #92669

Closed
wants to merge 1 commit into from

Conversation

xgupta
Copy link
Contributor

@xgupta xgupta commented May 18, 2024

The testcase provided by Henning Thielemann implemented a simple random number generator based on linear concurrences.
This needs a division and LLVM chooses to call __umoddi3 which is very slow since denominator is a constant, this can be expanded into a multiply-by-reciprocal sequence.

Fix #6769

@llvmbot llvmbot added backend:X86 llvm:SelectionDAG SelectionDAGISel as well labels May 18, 2024
@llvmbot
Copy link
Collaborator

llvmbot commented May 18, 2024

@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-selectiondag

Author: Shivam Gupta (xgupta)

Changes

The testcase provided by Henning Thielemann implemented a simple random number generator based on linear concurrences.
This needs a division and LLVM chooses to call __umoddi3 which is very slow since denominator is a constant, this can be expanded into a multiply-by-reciprocal sequence.

Fix #6769


Full diff: https://github.com/llvm/llvm-project/pull/92669.diff

2 Files Affected:

  • (modified) llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (+26)
  • (added) llvm/test/CodeGen/X86/pr6769.ll (+20)
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 2b181cd3ab1db..e6a5370e0fdef 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -5019,6 +5019,32 @@ SDValue DAGCombiner::visitREM(SDNode *N) {
     }
   }
 
+  // Optimization for urem with a constant divisor
+  if (!isSigned && isa<ConstantSDNode>(N1)) {
+    uint64_t M = cast<ConstantSDNode>(N1)->getZExtValue();
+    uint64_t R = (1ULL << 63) / M + 1;
+
+    SDValue Reciprocal = DAG.getConstant(R, DL, MVT::i64);
+    SDValue N0Ext = DAG.getZExtOrTrunc(N0, DL, MVT::i64);
+
+    // Multiply by reciprocal
+    SDValue Mul = DAG.getNode(ISD::MUL, DL, MVT::i64, N0Ext, Reciprocal);
+
+    // Right shift by 63 to get the quotient
+    SDValue ShiftAmount = DAG.getConstant(63, DL, MVT::i64);
+    SDValue Quotient = DAG.getNode(ISD::SRL, DL, MVT::i64, Mul, ShiftAmount);
+
+    // Multiply quotient by M to get the product
+    SDValue Modulus = DAG.getConstant(M, DL, MVT::i64);
+    SDValue Product = DAG.getNode(ISD::MUL, DL, MVT::i64, Quotient, Modulus);
+
+    // Subtract product from the original dividend to get the remainder
+    SDValue Remainder = DAG.getNode(ISD::SUB, DL, MVT::i64, N0Ext, Product);
+
+    // Truncate the result to the original type
+    return DAG.getNode(ISD::TRUNCATE, DL, VT, Remainder);
+  }
+
   AttributeList Attr = DAG.getMachineFunction().getFunction().getAttributes();
 
   // If X/C can be simplified by the division-by-constant logic, lower
diff --git a/llvm/test/CodeGen/X86/pr6769.ll b/llvm/test/CodeGen/X86/pr6769.ll
new file mode 100644
index 0000000000000..328bbf0f594c7
--- /dev/null
+++ b/llvm/test/CodeGen/X86/pr6769.ll
@@ -0,0 +1,20 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s
+
+define i32 @_rnd(i32 %a0) {
+; CHECK-LABEL: _rnd:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    movl %edi, %ecx
+; CHECK-NEXT:    imull $40692, %edi, %eax # imm = 0x9EF4
+; CHECK-NEXT:    movabsq $174770829514140, %rdx # imm = 0x9EF40135D59C
+; CHECK-NEXT:    imulq %rcx, %rdx
+; CHECK-NEXT:    shrq $63, %rdx
+; CHECK-NEXT:    imull $2147483399, %edx, %ecx # imm = 0x7FFFFF07
+; CHECK-NEXT:    subl %ecx, %eax
+; CHECK-NEXT:    retq
+  %x = zext i32 %a0 to i64
+  %y = mul i64 40692, %x
+  %z = urem i64 %y, 2147483399
+  %r = trunc i64 %z to i32
+  ret i32 %r
+}

@xgupta xgupta requested review from RKSimon, arsenm and dtcxzyw May 18, 2024 19:20
Copy link
Collaborator

@efriedma-quic efriedma-quic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't waste reviewers' time.

@topperc
Copy link
Collaborator

topperc commented May 18, 2024

It only calls __umoddi3 when compiling for 32-bit x86. For x86-64 it already uses a multiply.

@xgupta
Copy link
Contributor Author

xgupta commented May 18, 2024

It only calls __umoddi3 when compiling for 32-bit x86. For x86-64 it already uses a multiply.

Thanks for the information, in that case these changes are not required since there are not many 32 bit systems.

@xgupta xgupta closed this May 18, 2024
@xgupta xgupta deleted the urem branch June 12, 2024 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 llvm:SelectionDAG SelectionDAGISel as well
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[x86] optimize trunc(div i64 ..) to i32 to use 'div' instruction or multiply sequence
4 participants