[libc] Adding a version of memset with software prefetching #70857

Merged: 14 commits, Nov 10, 2023

doshimili
Contributor

Software prefetching helps recover performance when hardware prefetching is disabled. The 'LIBC_COPT_MEMSET_X86_USE_SOFTWARE_PREFETCHING' compile-time option enables the prefetching memset added by this patch.

* Add software prefetching to memset

* Add software prefetching to memset

* Fix formatting

* Fix build errors

* Fix build errors

* Fix formatting

* Fix formatting

* Fix formatting

* Fix formatting

* Fix formatting

* Add warmup to memset
@doshimili doshimili marked this pull request as ready for review November 3, 2023 14:08
@llvmbot llvmbot added the libc label Nov 3, 2023
@llvmbot
Collaborator

llvmbot commented Nov 3, 2023

@llvm/pr-subscribers-libc

Author: None (doshimili)

Changes

Software prefetching helps recover performance when hardware prefetching is disabled. The 'LIBC_COPT_MEMSET_X86_USE_SOFTWARE_PREFETCHING' compile time option allows users to use this patch.
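
As a rough illustration of the idea described above, the following C++17 sketch gates a write prefetch behind a compile-time flag. The macro `DEMO_USE_SOFTWARE_PREFETCHING` and the names `demo_fill` and `prefetch_for_write` are hypothetical and exist only for this example; the PR itself uses `LIBC_COPT_MEMSET_X86_USE_SOFTWARE_PREFETCHING` and its own helpers. `__builtin_prefetch` is a GCC/Clang builtin, so this assumes one of those compilers.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical macro for this sketch only; it is not the option used by the PR.
#ifdef DEMO_USE_SOFTWARE_PREFETCHING
inline constexpr bool kUseSoftwarePrefetching = true;
#else
inline constexpr bool kUseSoftwarePrefetching = false;
#endif

inline constexpr std::size_t kCachelineSize = 64;

// Hint that the cache line containing `p` will be written soon.
// Second argument 1 = prefetch for write, third argument 3 = high locality.
inline void prefetch_for_write(const void *p) { __builtin_prefetch(p, 1, 3); }

// Byte-wise fill that, when the flag is on, prefetches one cache line ahead.
// The prefetch is skipped when it would point past the end of the buffer.
inline void demo_fill(std::uint8_t *dst, std::uint8_t value,
                      std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    if constexpr (kUseSoftwarePrefetching) {
      if (i % kCachelineSize == 0 && i + kCachelineSize < count)
        prefetch_for_write(dst + i + kCachelineSize);
    }
    dst[i] = value;
  }
}
```

Because the flag is a `constexpr bool`, the prefetch branch is discarded at compile time when the macro is not defined, so the default build pays nothing for it.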


Full diff: https://github.com/llvm/llvm-project/pull/70857.diff

4 Files Affected:

  • (modified) libc/src/string/CMakeLists.txt (+1)
  • (modified) libc/src/string/memory_utils/op_generic.h (+25)
  • (modified) libc/src/string/memory_utils/x86_64/inline_memset.h (+54-25)
  • (modified) utils/bazel/llvm-project-overlay/libc/BUILD.bazel (+1)
diff --git a/libc/src/string/CMakeLists.txt b/libc/src/string/CMakeLists.txt
index 67675b682081c67..aa69bff7a8cfada 100644
--- a/libc/src/string/CMakeLists.txt
+++ b/libc/src/string/CMakeLists.txt
@@ -656,6 +656,7 @@ if(${LIBC_TARGET_ARCHITECTURE_IS_X86})
   add_memset(memset_x86_64_opt_sse4   COMPILE_OPTIONS -march=nehalem        REQUIRE SSE4_2)
   add_memset(memset_x86_64_opt_avx2   COMPILE_OPTIONS -march=haswell        REQUIRE AVX2)
   add_memset(memset_x86_64_opt_avx512 COMPILE_OPTIONS -march=skylake-avx512 REQUIRE AVX512F)
+  add_memset(memset_x86_64_opt_sw_prefetch COMPILE_OPTIONS -DLIBC_COPT_MEMSET_X86_USE_SOFTWARE_PREFETCHING)
   add_memset(memset_opt_host          COMPILE_OPTIONS ${LIBC_COMPILE_OPTIONS_NATIVE})
   add_memset(memset)
 elseif(${LIBC_TARGET_ARCHITECTURE_IS_AARCH64})
diff --git a/libc/src/string/memory_utils/op_generic.h b/libc/src/string/memory_utils/op_generic.h
index fd71ca30e24b936..2844501a7459044 100644
--- a/libc/src/string/memory_utils/op_generic.h
+++ b/libc/src/string/memory_utils/op_generic.h
@@ -48,6 +48,13 @@ using generic_v256 = uint8_t __attribute__((__vector_size__(32)));
 using generic_v512 = uint8_t __attribute__((__vector_size__(64)));
 } // namespace LIBC_NAMESPACE
 
+namespace LIBC_NAMESPACE::sw_prefetch {
+// Size of a cacheline for software prefetching
+static constexpr size_t kCachelineSize = 64;
+// prefetch for write
+static inline void PrefetchW(CPtr dst) { __builtin_prefetch(dst, 1, 3); }
+} // namespace LIBC_NAMESPACE::sw_prefetch
+
 namespace LIBC_NAMESPACE::generic {
 
 // We accept three types of values as elements for generic operations:
@@ -163,6 +170,24 @@ template <typename T> struct Memset {
     } while (offset < count - SIZE);
     tail(dst, value, count);
   }
+
+  template <size_t prefetch_distance, size_t prefetch_degree>
+  LIBC_INLINE static void loop_and_tail_prefetch(Ptr dst, uint8_t value,
+                                                 size_t count) {
+    size_t offset = 96;
+    while (offset + prefetch_degree + SIZE <= count) {
+      for (size_t i = 0; i < prefetch_degree / sw_prefetch::kCachelineSize; ++i)
+        sw_prefetch::PrefetchW(dst + offset + prefetch_distance +
+                               sw_prefetch::kCachelineSize * i);
+      for (size_t i = 0; i < prefetch_degree; i += SIZE, offset += SIZE)
+        block(dst + offset, value);
+    }
+    while (offset + SIZE < count) {
+      block(dst + offset, value);
+      offset += SIZE;
+    }
+    tail(dst, value, count);
+  }
 };
 
 template <typename T, typename... TS> struct MemsetSequence {
diff --git a/libc/src/string/memory_utils/x86_64/inline_memset.h b/libc/src/string/memory_utils/x86_64/inline_memset.h
index 6436594856b0eaf..98f559bca875a3a 100644
--- a/libc/src/string/memory_utils/x86_64/inline_memset.h
+++ b/libc/src/string/memory_utils/x86_64/inline_memset.h
@@ -16,9 +16,12 @@
 #include <stddef.h> // size_t
 
 namespace LIBC_NAMESPACE {
+namespace x86 {
+LIBC_INLINE_VAR constexpr bool kUseSoftwarePrefetchingMemset =
+    LLVM_LIBC_IS_DEFINED(LIBC_COPT_MEMSET_X86_USE_SOFTWARE_PREFETCHING);
+
+} // namespace x86
 
-[[maybe_unused]] LIBC_INLINE static void
-inline_memset_x86(Ptr dst, uint8_t value, size_t count) {
 #if defined(__AVX512F__)
   using uint128_t = generic_v128;
   using uint256_t = generic_v256;
@@ -37,29 +40,55 @@ inline_memset_x86(Ptr dst, uint8_t value, size_t count) {
   using uint512_t = cpp::array<uint64_t, 8>;
 #endif
 
-  if (count == 0)
-    return;
-  if (count == 1)
-    return generic::Memset<uint8_t>::block(dst, value);
-  if (count == 2)
-    return generic::Memset<uint16_t>::block(dst, value);
-  if (count == 3)
-    return generic::MemsetSequence<uint16_t, uint8_t>::block(dst, value);
-  if (count <= 8)
-    return generic::Memset<uint32_t>::head_tail(dst, value, count);
-  if (count <= 16)
-    return generic::Memset<uint64_t>::head_tail(dst, value, count);
-  if (count <= 32)
-    return generic::Memset<uint128_t>::head_tail(dst, value, count);
-  if (count <= 64)
-    return generic::Memset<uint256_t>::head_tail(dst, value, count);
-  if (count <= 128)
-    return generic::Memset<uint512_t>::head_tail(dst, value, count);
-  // Aligned loop
-  generic::Memset<uint256_t>::block(dst, value);
-  align_to_next_boundary<32>(dst, count);
-  return generic::Memset<uint256_t>::loop_and_tail(dst, value, count);
-}
+  [[maybe_unused]] LIBC_INLINE static void
+  inline_memset_x86_sw_prefetching(Ptr dst, uint8_t value, size_t count) {
+    // Prefetch one cacheline
+    sw_prefetch::PrefetchW(dst + sw_prefetch::kCachelineSize);
+    if (count <= 128)
+      return generic::Memset<uint512_t>::head_tail(dst, value, count);
+    // Prefetch the next cacheline
+    sw_prefetch::PrefetchW(dst + sw_prefetch::kCachelineSize * 2);
+    // Aligned loop
+    generic::Memset<uint256_t>::block(dst, value);
+    align_to_next_boundary<32>(dst, count);
+    if (count <= 192) {
+      return generic::Memset<uint256_t>::loop_and_tail(dst, value, count);
+    } else {
+      generic::Memset<uint512_t>::block(dst, value);
+      generic::Memset<uint256_t>::block(dst + sizeof(uint512_t), value);
+      return generic::Memset<uint256_t>::loop_and_tail_prefetch<320, 128>(
+          dst, value, count);
+    }
+  }
+
+  [[maybe_unused]] LIBC_INLINE static void
+  inline_memset_x86(Ptr dst, uint8_t value, size_t count) {
+    if (count == 0)
+      return;
+    if (count == 1)
+      return generic::Memset<uint8_t>::block(dst, value);
+    if (count == 2)
+      return generic::Memset<uint16_t>::block(dst, value);
+    if (count == 3)
+      return generic::MemsetSequence<uint16_t, uint8_t>::block(dst, value);
+    if (count <= 8)
+      return generic::Memset<uint32_t>::head_tail(dst, value, count);
+    if (count <= 16)
+      return generic::Memset<uint64_t>::head_tail(dst, value, count);
+    if (count <= 32)
+      return generic::Memset<uint128_t>::head_tail(dst, value, count);
+    if (count <= 64)
+      return generic::Memset<uint256_t>::head_tail(dst, value, count);
+    if constexpr (x86::kUseSoftwarePrefetchingMemset) {
+      return inline_memset_x86_sw_prefetching(dst, value, count);
+    }
+    if (count <= 128)
+      return generic::Memset<uint512_t>::head_tail(dst, value, count);
+    // Aligned loop
+    generic::Memset<uint256_t>::block(dst, value);
+    align_to_next_boundary<32>(dst, count);
+    return generic::Memset<uint256_t>::loop_and_tail(dst, value, count);
+  }
 } // namespace LIBC_NAMESPACE
 
 #endif // LLVM_LIBC_SRC_STRING_MEMORY_UTILS_X86_64_INLINE_MEMSET_H
diff --git a/utils/bazel/llvm-project-overlay/libc/BUILD.bazel b/utils/bazel/llvm-project-overlay/libc/BUILD.bazel
index 3ae68193dccd2b2..dea21fd77182605 100644
--- a/utils/bazel/llvm-project-overlay/libc/BUILD.bazel
+++ b/utils/bazel/llvm-project-overlay/libc/BUILD.bazel
@@ -33,6 +33,7 @@ PRINTF_COPTS = [
 MEMORY_COPTS = [
     # "LIBC_COPT_MEMCPY_X86_USE_REPMOVSB_FROM_SIZE=0",
     # "LIBC_COPT_MEMCPY_X86_USE_SOFTWARE_PREFETCHING",
+    # "LIBC_COPT_MEMSET_X86_USE_SOFTWARE_PREFETCHING",
 ]
 
 # A flag to pick which `mpfr` to use for math tests.
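
The shape of the `loop_and_tail_prefetch` loop in the diff above can be sketched as a standalone function. This is a simplified illustration, not the libc implementation: `set_block`, `fill_with_prefetch`, and `kBlockSize` are hypothetical names, a 32-byte `std::memset` stands in for one 256-bit store, the loop starts at offset 0 rather than 96, and a bounds guard on the prefetch address is added for this sketch. It assumes GCC or Clang for `__builtin_prefetch`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kCachelineSize = 64;
constexpr std::size_t kBlockSize = 32; // stands in for one 256-bit store

// Stand-in for a fixed-size block store.
inline void set_block(std::uint8_t *dst, std::uint8_t value) {
  std::memset(dst, value, kBlockSize);
}

// Fills dst[0, count) with `value`. Requires count >= kBlockSize.
// kDistance: how far ahead of the current offset to prefetch, in bytes.
// kDegree:   how many bytes are prefetched (and then written) per iteration.
template <std::size_t kDistance, std::size_t kDegree>
void fill_with_prefetch(std::uint8_t *dst, std::uint8_t value,
                        std::size_t count) {
  static_assert(kDegree % kCachelineSize == 0, "degree is whole cache lines");
  static_assert(kDegree % kBlockSize == 0, "degree is whole blocks");
  std::size_t offset = 0;
  // Main loop: prefetch kDegree bytes ahead, then write kDegree bytes.
  while (offset + kDegree + kBlockSize <= count) {
    for (std::size_t i = 0; i < kDegree / kCachelineSize; ++i) {
      const std::size_t ahead = offset + kDistance + kCachelineSize * i;
      if (ahead < count) // guard added for this sketch
        __builtin_prefetch(dst + ahead, 1, 3);
    }
    for (std::size_t i = 0; i < kDegree; i += kBlockSize, offset += kBlockSize)
      set_block(dst + offset, value);
  }
  // Remaining whole blocks, written without further prefetching.
  while (offset + kBlockSize < count) {
    set_block(dst + offset, value);
    offset += kBlockSize;
  }
  // Tail: one block ending exactly at dst + count; it may overlap bytes
  // that were already written, which is harmless for a fill.
  set_block(dst + count - kBlockSize, value);
}
```

A call such as `fill_with_prefetch<320, 128>(buf, 0, n)` mirrors the template arguments used in the diff: each iteration requests two cache lines 320 bytes ahead and then writes 128 bytes.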

@lntue lntue requested a review from gchatelet November 7, 2023 01:28
Contributor

@gchatelet gchatelet left a comment

Thx for the PR!
First round of comments, let's iterate from here :)

… and other minor changes (#4)

* SW Prefetching in Memset

* Move implementation to src/string/memory_utils/x86_64/inline_memset.h and other minor changes

* Fix formatting

* Remove wrong include

github-actions bot commented Nov 8, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

Contributor

@gchatelet gchatelet left a comment

OK, this looks pretty good to me. We also have to add support to CMake and make sure the option is discoverable. This means adding an entry in https://github.com/llvm/llvm-project/blob/main/libc/config/config.json. Its name should mirror LIBC_COPT_MEMSET_X86_USE_SOFTWARE_PREFETCHING without the COPT part.
You can draw inspiration from 380eb46 (don't update libc/docs/configure.rst yourself; it is now generated automatically from the JSON file).
For now the option should be set to false. Once I've checked that the codegen looks good, we'll enable the option by overriding it in libc/config/linux/x86_64/config.json and libc/config/windows/x86_64/config.json.

Contributor

@gchatelet gchatelet left a comment

Thx for bearing with me!

@gchatelet gchatelet merged commit 3153aa4 into llvm:main Nov 10, 2023
3 checks passed
zahiraam pushed a commit to zahiraam/llvm-project that referenced this pull request Nov 20, 2023
Software prefetching helps recover performance when hardware prefetching
is disabled. The 'LIBC_COPT_MEMSET_X86_USE_SOFTWARE_PREFETCHING' compile
time option allows users to use this patch.