[libc][x86] Use prefetch for write for memcpy #90450

Merged: 1 commit merged into llvm:main from gchatelet:memcpy_use_prefectch_for_write on Aug 29, 2024

Conversation

gchatelet (Contributor)

Currently when `LIBC_COPT_MEMCPY_X86_USE_SOFTWARE_PREFETCHING` is set we prefetch memory for read on the source buffer. This patch adds prefetch for write on the destination buffer.
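
For readers less familiar with the distinction: a read prefetch pulls the source line into cache before the loads, while a write prefetch acquires the destination line in a writable state before the stores. Here is a minimal sketch of what the two helpers typically reduce to, assuming they wrap the GCC/Clang builtin (the bodies below are an assumption, not the actual libc implementation):

```cpp
// Sketch, assuming the helpers wrap __builtin_prefetch(addr, rw, locality),
// where rw = 0 requests a read prefetch and rw = 1 a write prefetch.
static inline void prefetch_to_local_cache(const void *addr) {
  __builtin_prefetch(addr, /*rw=*/0, /*locality=*/3); // e.g. PREFETCHT0 on x86
}

static inline void prefetch_for_write(void *addr) {
  __builtin_prefetch(addr, /*rw=*/1, /*locality=*/3); // e.g. PREFETCHW on x86
}
```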
@llvmbot commented Apr 29, 2024

@llvm/pr-subscribers-libc

Author: Guillaume Chatelet (gchatelet)

Changes

Currently when `LIBC_COPT_MEMCPY_X86_USE_SOFTWARE_PREFETCHING` is set we prefetch memory for read on the source buffer. This patch adds prefetch for write on the destination buffer.


Full diff: https://github.com/llvm/llvm-project/pull/90450.diff

1 file affected:

  • (modified) libc/src/string/memory_utils/x86_64/inline_memcpy.h (+20-13)
diff --git a/libc/src/string/memory_utils/x86_64/inline_memcpy.h b/libc/src/string/memory_utils/x86_64/inline_memcpy.h
index ae61b1235bd08c..7a4e70bdf2d150 100644
--- a/libc/src/string/memory_utils/x86_64/inline_memcpy.h
+++ b/libc/src/string/memory_utils/x86_64/inline_memcpy.h
@@ -69,14 +69,21 @@ inline_memcpy_x86_avx_ge64(Ptr __restrict dst, CPtr __restrict src,
   return builtin::Memcpy<64>::loop_and_tail(dst, src, count);
 }
 
+[[maybe_unused]] LIBC_INLINE void inline_memcpy_prefetch(Ptr __restrict dst,
+                                                         CPtr __restrict src,
+                                                         size_t distance) {
+  prefetch_to_local_cache(src + distance);
+  prefetch_for_write(dst + distance);
+}
+
 [[maybe_unused]] LIBC_INLINE void
 inline_memcpy_x86_sse2_ge64_sw_prefetching(Ptr __restrict dst,
                                            CPtr __restrict src, size_t count) {
   using namespace LIBC_NAMESPACE::x86;
-  prefetch_to_local_cache(src + K_ONE_CACHELINE);
+  inline_memcpy_prefetch(dst, src, K_ONE_CACHELINE);
   if (count <= 128)
     return builtin::Memcpy<64>::head_tail(dst, src, count);
-  prefetch_to_local_cache(src + K_TWO_CACHELINES);
+  inline_memcpy_prefetch(dst, src, K_TWO_CACHELINES);
   // Aligning 'dst' on a 32B boundary.
   builtin::Memcpy<32>::block(dst, src);
   align_to_next_boundary<32, Arg::Dst>(dst, src, count);
@@ -90,17 +97,17 @@ inline_memcpy_x86_sse2_ge64_sw_prefetching(Ptr __restrict dst,
   if (count < 352) {
     // Two cache lines at a time.
     while (offset + K_TWO_CACHELINES + 32 <= count) {
-      prefetch_to_local_cache(src + offset + K_ONE_CACHELINE);
-      prefetch_to_local_cache(src + offset + K_TWO_CACHELINES);
+      inline_memcpy_prefetch(dst, src, offset + K_ONE_CACHELINE);
+      inline_memcpy_prefetch(dst, src, offset + K_TWO_CACHELINES);
       builtin::Memcpy<K_TWO_CACHELINES>::block_offset(dst, src, offset);
       offset += K_TWO_CACHELINES;
     }
   } else {
     // Three cache lines at a time.
     while (offset + K_THREE_CACHELINES + 32 <= count) {
-      prefetch_to_local_cache(src + offset + K_ONE_CACHELINE);
-      prefetch_to_local_cache(src + offset + K_TWO_CACHELINES);
-      prefetch_to_local_cache(src + offset + K_THREE_CACHELINES);
+      inline_memcpy_prefetch(dst, src, offset + K_ONE_CACHELINE);
+      inline_memcpy_prefetch(dst, src, offset + K_TWO_CACHELINES);
+      inline_memcpy_prefetch(dst, src, offset + K_THREE_CACHELINES);
       // It is likely that this copy will be turned into a 'rep;movsb' on
       // non-AVX machines.
       builtin::Memcpy<K_THREE_CACHELINES>::block_offset(dst, src, offset);
@@ -114,11 +121,11 @@ inline_memcpy_x86_sse2_ge64_sw_prefetching(Ptr __restrict dst,
 inline_memcpy_x86_avx_ge64_sw_prefetching(Ptr __restrict dst,
                                           CPtr __restrict src, size_t count) {
   using namespace LIBC_NAMESPACE::x86;
-  prefetch_to_local_cache(src + K_ONE_CACHELINE);
+  inline_memcpy_prefetch(dst, src, K_ONE_CACHELINE);
   if (count <= 128)
     return builtin::Memcpy<64>::head_tail(dst, src, count);
-  prefetch_to_local_cache(src + K_TWO_CACHELINES);
-  prefetch_to_local_cache(src + K_THREE_CACHELINES);
+  inline_memcpy_prefetch(dst, src, K_TWO_CACHELINES);
+  inline_memcpy_prefetch(dst, src, K_THREE_CACHELINES);
   if (count < 256)
     return builtin::Memcpy<128>::head_tail(dst, src, count);
   // Aligning 'dst' on a 32B boundary.
@@ -133,9 +140,9 @@ inline_memcpy_x86_avx_ge64_sw_prefetching(Ptr __restrict dst,
   // - count >= 128.
   while (offset + K_THREE_CACHELINES + 64 <= count) {
     // Three cache lines at a time.
-    prefetch_to_local_cache(src + offset + K_ONE_CACHELINE);
-    prefetch_to_local_cache(src + offset + K_TWO_CACHELINES);
-    prefetch_to_local_cache(src + offset + K_THREE_CACHELINES);
+    inline_memcpy_prefetch(dst, src, offset + K_ONE_CACHELINE);
+    inline_memcpy_prefetch(dst, src, offset + K_TWO_CACHELINES);
+    inline_memcpy_prefetch(dst, src, offset + K_THREE_CACHELINES);
     builtin::Memcpy<K_THREE_CACHELINES>::block_offset(dst, src, offset);
     offset += K_THREE_CACHELINES;
   }
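
To make the prefetch distances concrete, here is a simplified model of the three-cachelines-at-a-time loop after this patch. The constant values (based on 64-byte x86 cache lines) and the use of plain memcpy for the block copy and tail are assumptions for illustration; the real code uses builtin::Memcpy<N>::block_offset and dedicated tail handling:

```cpp
#include <stddef.h>
#include <string.h>
#include <initializer_list>

// Assumed values, based on 64-byte x86 cache lines.
constexpr size_t K_ONE_CACHELINE = 64;
constexpr size_t K_TWO_CACHELINES = 128;
constexpr size_t K_THREE_CACHELINES = 192;

// Each iteration prefetches source lines for read and the matching
// destination lines for write, then copies a 192-byte block.
void copy_with_sw_prefetch(char *__restrict dst, const char *__restrict src,
                           size_t count) {
  size_t offset = 0;
  while (offset + K_THREE_CACHELINES + 64 <= count) {
    for (size_t distance :
         {K_ONE_CACHELINE, K_TWO_CACHELINES, K_THREE_CACHELINES}) {
      __builtin_prefetch(src + offset + distance, /*rw=*/0, /*locality=*/3);
      __builtin_prefetch(dst + offset + distance, /*rw=*/1, /*locality=*/3); // new in this patch
    }
    memcpy(dst + offset, src + offset, K_THREE_CACHELINES);
    offset += K_THREE_CACHELINES;
  }
  memcpy(dst + offset, src + offset, count - offset); // simplified tail
}
```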

Review comment on the new `inline_memcpy_prefetch` helper:

    CPtr __restrict src,
    size_t distance) {
      prefetch_to_local_cache(src + distance);
      prefetch_for_write(dst + distance);

Contributor

Should there be an option to turn that off?

Contributor Author (gchatelet)

The prefetching behavior is controlled by `LIBC_COPT_MEMCPY_X86_USE_SOFTWARE_PREFETCHING`; I don't want to split it further into separate "prefetch for read" and "prefetch for write" options, since we already have a lot of combinations to test.
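
For context, the option gates the whole software-prefetching path at compile time. A sketch of the dispatch shape (the exact surrounding code in inline_memcpy.h is an assumption):

```cpp
// Sketch (assumption): approximate shape of the compile-time dispatch in
// libc/src/string/memory_utils/x86_64/inline_memcpy.h.
#ifdef LIBC_COPT_MEMCPY_X86_USE_SOFTWARE_PREFETCHING
  return inline_memcpy_x86_avx_ge64_sw_prefetching(dst, src, count);
#else
  return inline_memcpy_x86_avx_ge64(dst, src, count);
#endif
```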

@legrosbuffle (Contributor)

I think the commit message should at least include the open-source benchmark numbers, and possible hints from our internal experience with this.

@SchrodingerZhu (Contributor)

Could you paste the performance study numbers?

@gchatelet (Contributor Author) commented Aug 28, 2024

Reviving this PR. The microbenchmarks currently available (https://github.com/google/fleetbench and libc/benchmarks) are not able to show the impact of this change, although we do have internal supporting evidence that it helps in certain configurations.
Specifically, when servers with many cores are under high utilization, DRAM congestion becomes a performance bottleneck. For these servers, disabling hardware prefetchers helps reclaim some of the bandwidth. This, in turn, lowers the performance of memory functions, as they rely heavily on the streaming effect of hardware prefetchers. This patch helps recover some of that loss. More context is available in https://dl.acm.org/doi/10.1145/3620666.3651373.

Here are the throughput results for fleetbench on a machine with hardware prefetchers turned off (all runs sorted by increasing throughput). They show a neutral to mild improvement.

[Figure: fleetbench throughput comparison, baseline vs. this patch, runs sorted by increasing throughput]

@michaelrj-google (Contributor) left a comment

LGTM

@gchatelet merged commit 73ef397 into llvm:main on Aug 29, 2024
7 checks passed
@gchatelet deleted the memcpy_use_prefectch_for_write branch on August 29, 2024 at 12:17
qiaojbao pushed a commit to GPUOpen-Drivers/llvm-project that referenced this pull request Sep 30, 2024
…0dbf23e48

Local branch amd-gfx fd70dbf Merged main:fcb3a0485857c749d04ea234a8c3d629c62ab211 into amd-gfx:425372303add
Remote branch main 73ef397 [libc][x86] Use prefetch for write for memcpy (llvm#90450)