Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[X86] For minsize memset/memcpy, use byte or double-word accesses #87003

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

AreaZR
Copy link
Contributor

@AreaZR AreaZR commented Mar 28, 2024

repstosb and repstosd are the same size, but stosd is only done for 0 because the process of multiplying the constant so that it is copied across the bytes of the 32-bit number adds extra instructions that cause the size to increase. For 0, repstosb and repstosd are the same size, but stosd is only done for 0 because the process of multiplying the constant so that it is copied across the bytes of the 32-bit number adds extra instructions that cause the size to increase. For 0, we do not need to do that at all.

For memcpy, the same goes, and as a result the minsize check was moved ahead because a jmp to memcpy encoded takes more bytes than repmovsb.

@AreaZR AreaZR changed the title [X86] Use AlwaysInline to determine whether to emit code or bail when… [X86] Use AlwaysInline to determine whether to emit code or bail when inlining is determined to be unprofitable Mar 28, 2024
@llvmbot
Copy link
Collaborator

llvmbot commented Mar 28, 2024

@llvm/pr-subscribers-backend-x86

Author: AtariDreams (AtariDreams)

Changes

Assume true when we get to the getMemset code, as it ought to be profitable by then.

Just like how getMemcpy works.


Full diff: https://github.com/llvm/llvm-project/pull/87003.diff

2 Files Affected:

  • (modified) llvm/lib/Target/X86/X86SelectionDAGInfo.cpp (+5-3)
  • (modified) llvm/test/CodeGen/X86/memset-vs-memset-inline.ll (+4-131)
diff --git a/llvm/lib/Target/X86/X86SelectionDAGInfo.cpp b/llvm/lib/Target/X86/X86SelectionDAGInfo.cpp
index 7c630a2b0da080..50d273e69ada44 100644
--- a/llvm/lib/Target/X86/X86SelectionDAGInfo.cpp
+++ b/llvm/lib/Target/X86/X86SelectionDAGInfo.cpp
@@ -66,8 +66,10 @@ SDValue X86SelectionDAGInfo::EmitTargetCodeForMemset(
   // If not DWORD aligned or size is more than the threshold, call the library.
   // The libc version is likely to be faster for these cases. It can use the
   // address value and run time information about the CPU.
-  if (Alignment < Align(4) || !ConstantSize ||
-      ConstantSize->getZExtValue() > Subtarget.getMaxInlineSizeThreshold()) 
+  if (!ConstantSize ||
+      (!AlwaysInline &&
+       (Alignment < Align(4) ||
+        ConstantSize->getZExtValue() > Subtarget.getMaxInlineSizeThreshold())))
     return SDValue();
 
   uint64_t SizeVal = ConstantSize->getZExtValue();
@@ -142,7 +144,7 @@ SDValue X86SelectionDAGInfo::EmitTargetCodeForMemset(
                       DAG.getNode(ISD::ADD, dl, AddrVT, Dst,
                                   DAG.getConstant(Offset, dl, AddrVT)),
                       Val, DAG.getConstant(BytesLeft, dl, SizeVT), Alignment,
-                      isVolatile, AlwaysInline,
+                      isVolatile, /* AlwaysInline */ true,
                       /* isTailCall */ false, DstPtrInfo.getWithOffset(Offset));
   }
 
diff --git a/llvm/test/CodeGen/X86/memset-vs-memset-inline.ll b/llvm/test/CodeGen/X86/memset-vs-memset-inline.ll
index b8fdd936b43895..16022c6cbb3934 100644
--- a/llvm/test/CodeGen/X86/memset-vs-memset-inline.ll
+++ b/llvm/test/CodeGen/X86/memset-vs-memset-inline.ll
@@ -28,137 +28,10 @@ define void @regular_memset_calls_external_function(ptr %a, i8 %value) nounwind
 define void @inlined_set_doesnt_call_external_function(ptr %a, i8 %value) nounwind {
 ; CHECK-LABEL: inlined_set_doesnt_call_external_function:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    movzbl %sil, %ecx
-; CHECK-NEXT:    movabsq $72340172838076673, %rax # imm = 0x101010101010101
-; CHECK-NEXT:    imulq %rcx, %rax
-; CHECK-NEXT:    movq %rax, 1016(%rdi)
-; CHECK-NEXT:    movq %rax, 1008(%rdi)
-; CHECK-NEXT:    movq %rax, 1000(%rdi)
-; CHECK-NEXT:    movq %rax, 992(%rdi)
-; CHECK-NEXT:    movq %rax, 984(%rdi)
-; CHECK-NEXT:    movq %rax, 976(%rdi)
-; CHECK-NEXT:    movq %rax, 968(%rdi)
-; CHECK-NEXT:    movq %rax, 960(%rdi)
-; CHECK-NEXT:    movq %rax, 952(%rdi)
-; CHECK-NEXT:    movq %rax, 944(%rdi)
-; CHECK-NEXT:    movq %rax, 936(%rdi)
-; CHECK-NEXT:    movq %rax, 928(%rdi)
-; CHECK-NEXT:    movq %rax, 920(%rdi)
-; CHECK-NEXT:    movq %rax, 912(%rdi)
-; CHECK-NEXT:    movq %rax, 904(%rdi)
-; CHECK-NEXT:    movq %rax, 896(%rdi)
-; CHECK-NEXT:    movq %rax, 888(%rdi)
-; CHECK-NEXT:    movq %rax, 880(%rdi)
-; CHECK-NEXT:    movq %rax, 872(%rdi)
-; CHECK-NEXT:    movq %rax, 864(%rdi)
-; CHECK-NEXT:    movq %rax, 856(%rdi)
-; CHECK-NEXT:    movq %rax, 848(%rdi)
-; CHECK-NEXT:    movq %rax, 840(%rdi)
-; CHECK-NEXT:    movq %rax, 832(%rdi)
-; CHECK-NEXT:    movq %rax, 824(%rdi)
-; CHECK-NEXT:    movq %rax, 816(%rdi)
-; CHECK-NEXT:    movq %rax, 808(%rdi)
-; CHECK-NEXT:    movq %rax, 800(%rdi)
-; CHECK-NEXT:    movq %rax, 792(%rdi)
-; CHECK-NEXT:    movq %rax, 784(%rdi)
-; CHECK-NEXT:    movq %rax, 776(%rdi)
-; CHECK-NEXT:    movq %rax, 768(%rdi)
-; CHECK-NEXT:    movq %rax, 760(%rdi)
-; CHECK-NEXT:    movq %rax, 752(%rdi)
-; CHECK-NEXT:    movq %rax, 744(%rdi)
-; CHECK-NEXT:    movq %rax, 736(%rdi)
-; CHECK-NEXT:    movq %rax, 728(%rdi)
-; CHECK-NEXT:    movq %rax, 720(%rdi)
-; CHECK-NEXT:    movq %rax, 712(%rdi)
-; CHECK-NEXT:    movq %rax, 704(%rdi)
-; CHECK-NEXT:    movq %rax, 696(%rdi)
-; CHECK-NEXT:    movq %rax, 688(%rdi)
-; CHECK-NEXT:    movq %rax, 680(%rdi)
-; CHECK-NEXT:    movq %rax, 672(%rdi)
-; CHECK-NEXT:    movq %rax, 664(%rdi)
-; CHECK-NEXT:    movq %rax, 656(%rdi)
-; CHECK-NEXT:    movq %rax, 648(%rdi)
-; CHECK-NEXT:    movq %rax, 640(%rdi)
-; CHECK-NEXT:    movq %rax, 632(%rdi)
-; CHECK-NEXT:    movq %rax, 624(%rdi)
-; CHECK-NEXT:    movq %rax, 616(%rdi)
-; CHECK-NEXT:    movq %rax, 608(%rdi)
-; CHECK-NEXT:    movq %rax, 600(%rdi)
-; CHECK-NEXT:    movq %rax, 592(%rdi)
-; CHECK-NEXT:    movq %rax, 584(%rdi)
-; CHECK-NEXT:    movq %rax, 576(%rdi)
-; CHECK-NEXT:    movq %rax, 568(%rdi)
-; CHECK-NEXT:    movq %rax, 560(%rdi)
-; CHECK-NEXT:    movq %rax, 552(%rdi)
-; CHECK-NEXT:    movq %rax, 544(%rdi)
-; CHECK-NEXT:    movq %rax, 536(%rdi)
-; CHECK-NEXT:    movq %rax, 528(%rdi)
-; CHECK-NEXT:    movq %rax, 520(%rdi)
-; CHECK-NEXT:    movq %rax, 512(%rdi)
-; CHECK-NEXT:    movq %rax, 504(%rdi)
-; CHECK-NEXT:    movq %rax, 496(%rdi)
-; CHECK-NEXT:    movq %rax, 488(%rdi)
-; CHECK-NEXT:    movq %rax, 480(%rdi)
-; CHECK-NEXT:    movq %rax, 472(%rdi)
-; CHECK-NEXT:    movq %rax, 464(%rdi)
-; CHECK-NEXT:    movq %rax, 456(%rdi)
-; CHECK-NEXT:    movq %rax, 448(%rdi)
-; CHECK-NEXT:    movq %rax, 440(%rdi)
-; CHECK-NEXT:    movq %rax, 432(%rdi)
-; CHECK-NEXT:    movq %rax, 424(%rdi)
-; CHECK-NEXT:    movq %rax, 416(%rdi)
-; CHECK-NEXT:    movq %rax, 408(%rdi)
-; CHECK-NEXT:    movq %rax, 400(%rdi)
-; CHECK-NEXT:    movq %rax, 392(%rdi)
-; CHECK-NEXT:    movq %rax, 384(%rdi)
-; CHECK-NEXT:    movq %rax, 376(%rdi)
-; CHECK-NEXT:    movq %rax, 368(%rdi)
-; CHECK-NEXT:    movq %rax, 360(%rdi)
-; CHECK-NEXT:    movq %rax, 352(%rdi)
-; CHECK-NEXT:    movq %rax, 344(%rdi)
-; CHECK-NEXT:    movq %rax, 336(%rdi)
-; CHECK-NEXT:    movq %rax, 328(%rdi)
-; CHECK-NEXT:    movq %rax, 320(%rdi)
-; CHECK-NEXT:    movq %rax, 312(%rdi)
-; CHECK-NEXT:    movq %rax, 304(%rdi)
-; CHECK-NEXT:    movq %rax, 296(%rdi)
-; CHECK-NEXT:    movq %rax, 288(%rdi)
-; CHECK-NEXT:    movq %rax, 280(%rdi)
-; CHECK-NEXT:    movq %rax, 272(%rdi)
-; CHECK-NEXT:    movq %rax, 264(%rdi)
-; CHECK-NEXT:    movq %rax, 256(%rdi)
-; CHECK-NEXT:    movq %rax, 248(%rdi)
-; CHECK-NEXT:    movq %rax, 240(%rdi)
-; CHECK-NEXT:    movq %rax, 232(%rdi)
-; CHECK-NEXT:    movq %rax, 224(%rdi)
-; CHECK-NEXT:    movq %rax, 216(%rdi)
-; CHECK-NEXT:    movq %rax, 208(%rdi)
-; CHECK-NEXT:    movq %rax, 200(%rdi)
-; CHECK-NEXT:    movq %rax, 192(%rdi)
-; CHECK-NEXT:    movq %rax, 184(%rdi)
-; CHECK-NEXT:    movq %rax, 176(%rdi)
-; CHECK-NEXT:    movq %rax, 168(%rdi)
-; CHECK-NEXT:    movq %rax, 160(%rdi)
-; CHECK-NEXT:    movq %rax, 152(%rdi)
-; CHECK-NEXT:    movq %rax, 144(%rdi)
-; CHECK-NEXT:    movq %rax, 136(%rdi)
-; CHECK-NEXT:    movq %rax, 128(%rdi)
-; CHECK-NEXT:    movq %rax, 120(%rdi)
-; CHECK-NEXT:    movq %rax, 112(%rdi)
-; CHECK-NEXT:    movq %rax, 104(%rdi)
-; CHECK-NEXT:    movq %rax, 96(%rdi)
-; CHECK-NEXT:    movq %rax, 88(%rdi)
-; CHECK-NEXT:    movq %rax, 80(%rdi)
-; CHECK-NEXT:    movq %rax, 72(%rdi)
-; CHECK-NEXT:    movq %rax, 64(%rdi)
-; CHECK-NEXT:    movq %rax, 56(%rdi)
-; CHECK-NEXT:    movq %rax, 48(%rdi)
-; CHECK-NEXT:    movq %rax, 40(%rdi)
-; CHECK-NEXT:    movq %rax, 32(%rdi)
-; CHECK-NEXT:    movq %rax, 24(%rdi)
-; CHECK-NEXT:    movq %rax, 16(%rdi)
-; CHECK-NEXT:    movq %rax, 8(%rdi)
-; CHECK-NEXT:    movq %rax, (%rdi)
+; CHECK-NEXT:    movl %esi, %eax
+; CHECK-NEXT:    movl $1024, %ecx # imm = 0x400
+; CHECK-NEXT:    # kill: def $al killed $al killed $eax
+; CHECK-NEXT:    rep;stosb %al, %es:(%rdi)
 ; CHECK-NEXT:    retq
   tail call void @llvm.memset.inline.p0.i64(ptr %a, i8 %value, i64 1024, i1 0)
   ret void

@AreaZR AreaZR force-pushed the memset branch 3 times, most recently from d2acc03 to 71657df Compare March 28, 2024 21:19
@AreaZR AreaZR force-pushed the memset branch 2 times, most recently from e88fbe6 to 652ee72 Compare April 5, 2024 17:34
@AreaZR AreaZR changed the title [X86] Use AlwaysInline to determine whether to emit code or bail when inlining is determined to be unprofitable [X86] For minsize, use size for alignment, rather than actual alignment Apr 5, 2024
@AreaZR AreaZR force-pushed the memset branch 6 times, most recently from d6285f9 to 267bb33 Compare April 5, 2024 18:07
@topperc
Copy link
Collaborator

topperc commented Apr 5, 2024

Title should mention "memset". Otherwise its unclear what case you're talking about.

@AreaZR AreaZR force-pushed the memset branch 3 times, most recently from a265b4d to 9bf30fe Compare April 5, 2024 19:07
@AreaZR
Copy link
Contributor Author

AreaZR commented Sep 15, 2024

@RKSimon I believe this is ready now!

@AreaZR AreaZR changed the title [X86] For inline memset with minsize, use stosb [X86] For inline memset with minsize, use stosb or stosd Sep 16, 2024
@AreaZR AreaZR force-pushed the memset branch 4 times, most recently from ef6a9d3 to c0bc6cd Compare September 16, 2024 17:12
@AreaZR
Copy link
Contributor Author

AreaZR commented Sep 16, 2024

@phoebewang I believe this is ready!

@AreaZR AreaZR force-pushed the memset branch 5 times, most recently from 3849b37 to 485a0e4 Compare September 16, 2024 18:17
Copy link
Collaborator

@RKSimon RKSimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your patch describes memset yet most of the changes appear to be for memcpy :/

@AreaZR
Copy link
Contributor Author

AreaZR commented Sep 17, 2024

Your patch describes memset yet most of the changes appear to be for memcpy :/

Incorrect. The memset code was refactored to use the same method as the memcpy code for determining block size.

@AreaZR AreaZR requested a review from RKSimon September 17, 2024 14:38
@topperc
Copy link
Collaborator

topperc commented Sep 17, 2024

Your patch describes memset yet most of the changes appear to be for memcpy :/

Incorrect. The memset code was refactored to use the same method as the memcpy code for determining block size.

You changed the codegen for 2 memcpy tests, but do not mention that in the patch description.

@AreaZR
Copy link
Contributor Author

AreaZR commented Sep 18, 2024

Your patch describes memset yet most of the changes appear to be for memcpy :/

Incorrect. The memset code was refactored to use the same method as the memcpy code for determining block size.

You changed the codegen for 2 memcpy tests, but do not mention that in the patch description.

Fixed!

@AreaZR AreaZR force-pushed the memset branch 2 times, most recently from 8a2e18a to f25e932 Compare September 18, 2024 17:01
@AreaZR AreaZR changed the title [X86] For inline memset with minsize, use stosb or stosd [X86] For inline memset and memcpy with minsize, use stosb or stosd Sep 19, 2024
@AreaZR AreaZR changed the title [X86] For inline memset and memcpy with minsize, use stosb or stosd [X86] For inline memset/memcpy with minsize, use stosb or stosd Sep 19, 2024
@AreaZR AreaZR changed the title [X86] For inline memset/memcpy with minsize, use stosb or stosd [X86] For inline memset/memcpy with minsize, use byte or double-word accesses Sep 19, 2024
@AreaZR AreaZR changed the title [X86] For inline memset/memcpy with minsize, use byte or double-word accesses [X86] For minsize memset/memcpy, use byte or double-word accesses Sep 19, 2024
repstosb and repstosd are the same size, but stosd is only done for 0 because the process of multiplying the constant so that it is copied across the bytes of the 32-bit number adds extra instructions that cause the size to increase. For 0, repstosb and repstosd are the same size, but stosd is only done for 0 because the process of multiplying the constant so that it is copied across the bytes of the 32-bit number adds extra instructions that cause the size to increase. For 0, we do not need to do that at all.

For memcpy, the same goes, and as a result the minsize check was moved ahead because a jmp to memcpy encoded takes more bytes than repmovsb.
@AreaZR
Copy link
Contributor Author

AreaZR commented Sep 20, 2024

@phoebewang is this adequate?

@phoebewang
Copy link
Contributor

@phoebewang is this adequate?

Can you split the 2nd commit into one NFC for refactor and one for the change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants