
[libc] Add utility functions for warp-level scan and reduction #84866

Merged
merged 1 commit into from
Mar 12, 2024

Conversation

jhuber6
Contributor

@jhuber6 jhuber6 commented Mar 12, 2024

Summary:
The GPU uses a SIMT execution model, meaning each value belongs to a group
of 32 or 64 lanes executing alongside it. These platforms offer intrinsic
functions to read elements from neighboring lanes. With these we can
implement parallel scans and reductions. These functions do not have an
immediate user yet, but they will be used by the in-progress allocator
interface and are generally good to have. This patch is a precommit for
these new utility functions.

@llvmbot
Collaborator

llvmbot commented Mar 12, 2024

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-libc

Author: Joseph Huber (jhuber6)

Changes


Full diff: https://github.com/llvm/llvm-project/pull/84866.diff

7 Files Affected:

  • (modified) libc/src/__support/GPU/amdgpu/utils.h (+6)
  • (modified) libc/src/__support/GPU/generic/utils.h (+2)
  • (modified) libc/src/__support/GPU/nvptx/utils.h (+7)
  • (modified) libc/src/__support/GPU/utils.h (+19)
  • (modified) libc/test/integration/src/__support/CMakeLists.txt (+3)
  • (added) libc/test/integration/src/__support/GPU/CMakeLists.txt (+11)
  • (added) libc/test/integration/src/__support/GPU/scan_reduce.cpp (+61)
diff --git a/libc/src/__support/GPU/amdgpu/utils.h b/libc/src/__support/GPU/amdgpu/utils.h
index 75f0b5744ebd72..9b520a6bcf38d4 100644
--- a/libc/src/__support/GPU/amdgpu/utils.h
+++ b/libc/src/__support/GPU/amdgpu/utils.h
@@ -145,6 +145,12 @@ LIBC_INLINE uint32_t get_lane_size() {
   __builtin_amdgcn_wave_barrier();
 }
 
+/// Shuffles the lanes inside the wavefront according to the given index.
+[[clang::convergent]] LIBC_INLINE uint32_t shuffle(uint64_t, uint32_t idx,
+                                                   uint32_t x) {
+  return __builtin_amdgcn_ds_bpermute(idx << 2, x);
+}
+
 /// Returns the current value of the GPU's processor clock.
 /// NOTE: The RDNA3 and RDNA2 architectures use a 20-bit cycle counter.
 LIBC_INLINE uint64_t processor_clock() { return __builtin_readcyclecounter(); }
diff --git a/libc/src/__support/GPU/generic/utils.h b/libc/src/__support/GPU/generic/utils.h
index c6c3c01cf7d5f0..b6df59f7aa9efc 100644
--- a/libc/src/__support/GPU/generic/utils.h
+++ b/libc/src/__support/GPU/generic/utils.h
@@ -67,6 +67,8 @@ LIBC_INLINE void sync_threads() {}
 
 LIBC_INLINE void sync_lane(uint64_t) {}
 
+LIBC_INLINE uint32_t shuffle(uint64_t, uint32_t, uint32_t x) { return x; }
+
 LIBC_INLINE uint64_t processor_clock() { return 0; }
 
 LIBC_INLINE uint64_t fixed_frequency_clock() { return 0; }
diff --git a/libc/src/__support/GPU/nvptx/utils.h b/libc/src/__support/GPU/nvptx/utils.h
index a92c8847b6ecdf..5d20d75415ff27 100644
--- a/libc/src/__support/GPU/nvptx/utils.h
+++ b/libc/src/__support/GPU/nvptx/utils.h
@@ -126,6 +126,13 @@ LIBC_INLINE uint32_t get_lane_size() { return 32; }
   __nvvm_bar_warp_sync(static_cast<uint32_t>(mask));
 }
 
+/// Shuffles the lanes inside the warp according to the given index.
+[[clang::convergent]] LIBC_INLINE uint32_t shuffle(uint64_t lane_mask,
+                                                   uint32_t idx, uint32_t x) {
+  uint32_t mask = static_cast<uint32_t>(lane_mask);
+  return __nvvm_shfl_sync_idx_i32(mask, x, idx, get_lane_size() - 1);
+}
+
 /// Returns the current value of the GPU's processor clock.
 LIBC_INLINE uint64_t processor_clock() { return __builtin_readcyclecounter(); }
 
diff --git a/libc/src/__support/GPU/utils.h b/libc/src/__support/GPU/utils.h
index 0f9167cdee0663..93022e8de811f7 100644
--- a/libc/src/__support/GPU/utils.h
+++ b/libc/src/__support/GPU/utils.h
@@ -31,6 +31,25 @@ LIBC_INLINE bool is_first_lane(uint64_t lane_mask) {
   return gpu::get_lane_id() == get_first_lane_id(lane_mask);
 }
 
+/// Gets the sum of all lanes inside the warp or wavefront.
+LIBC_INLINE uint32_t reduce(uint64_t lane_mask, uint32_t x) {
+  for (uint32_t step = gpu::get_lane_size() / 2; step > 0; step /= 2) {
+    uint32_t index = step + gpu::get_lane_id();
+    x += gpu::shuffle(lane_mask, index, x);
+  }
+  return gpu::broadcast_value(lane_mask, x);
+}
+
+/// Gets the accumulator scan of the threads in the warp or wavefront.
+LIBC_INLINE uint32_t scan(uint64_t lane_mask, uint32_t x) {
+  for (uint32_t step = 1; step < gpu::get_lane_size(); step *= 2) {
+    uint32_t index = gpu::get_lane_id() - step;
+    uint32_t bitmask = gpu::get_lane_id() >= step;
+    x += -bitmask & gpu::shuffle(lane_mask, index, x);
+  }
+  return x;
+}
+
 } // namespace gpu
 } // namespace LIBC_NAMESPACE
 
diff --git a/libc/test/integration/src/__support/CMakeLists.txt b/libc/test/integration/src/__support/CMakeLists.txt
index 7c853ff10259f5..b5b6557e8d6899 100644
--- a/libc/test/integration/src/__support/CMakeLists.txt
+++ b/libc/test/integration/src/__support/CMakeLists.txt
@@ -1 +1,4 @@
 add_subdirectory(threads)
+if(LIBC_TARGET_OS_IS_GPU)
+  add_subdirectory(GPU)
+endif()
diff --git a/libc/test/integration/src/__support/GPU/CMakeLists.txt b/libc/test/integration/src/__support/GPU/CMakeLists.txt
new file mode 100644
index 00000000000000..7811e0da45ddcf
--- /dev/null
+++ b/libc/test/integration/src/__support/GPU/CMakeLists.txt
@@ -0,0 +1,11 @@
+add_custom_target(libc-support-gpu-tests)
+add_dependencies(libc-integration-tests libc-support-gpu-tests)
+
+add_integration_test(
+  scan_reduce_test
+  SUITE libc-support-gpu-tests
+  SRCS
+    scan_reduce.cpp
+  LOADER_ARGS
+    --threads 64
+)
diff --git a/libc/test/integration/src/__support/GPU/scan_reduce.cpp b/libc/test/integration/src/__support/GPU/scan_reduce.cpp
new file mode 100644
index 00000000000000..4983a3fc83f284
--- /dev/null
+++ b/libc/test/integration/src/__support/GPU/scan_reduce.cpp
@@ -0,0 +1,61 @@
+//===-- Test for the parallel scan and reduction operations on the GPU ----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "src/__support/CPP/bit.h"
+#include "src/__support/GPU/utils.h"
+#include "test/IntegrationTest/test.h"
+
+using namespace LIBC_NAMESPACE;
+
+static uint32_t sum(uint32_t n) { return n * (n + 1) / 2; }
+
+// Tests a reduction within a convergent warp or wavefront using some known
+// values. For example, if every element in the lane is one, then the sum should
+// be the size of the warp or wavefront, i.e. 1 + 1 + 1 ... + 1.
+static void test_reduce() {
+  uint64_t mask = gpu::get_lane_mask();
+  uint32_t x = gpu::reduce(mask, 1);
+  EXPECT_EQ(x, gpu::get_lane_size());
+
+  uint32_t y = gpu::reduce(mask, gpu::get_lane_id());
+  EXPECT_EQ(y, sum(gpu::get_lane_size() - 1));
+
+  uint32_t z = 0;
+  if (gpu::get_lane_id() % 2)
+    z = gpu::reduce(gpu::get_lane_mask(), 1);
+  EXPECT_EQ(z, gpu::get_lane_id() % 2 ? gpu::get_lane_size() / 2 : 0);
+  gpu::sync_lane(mask);
+}
+
+// Tests an accumulation scan within a convergent warp or wavefront using some
+// known values. For example, if every element in the lane is one, then the scan
+// should have each element be equivalent to its ID, i.e. 1, 1 + 1, ...
+static void test_scan() {
+  uint64_t mask = gpu::get_lane_mask();
+
+  uint32_t x = gpu::scan(mask, 1);
+  EXPECT_EQ(x, gpu::get_lane_id() + 1);
+
+  uint32_t y = gpu::scan(mask, gpu::get_lane_id());
+  EXPECT_EQ(y, sum(gpu::get_lane_id()));
+
+  uint32_t z = 0;
+  if (gpu::get_lane_id() % 2)
+    z = gpu::scan(gpu::get_lane_mask(), 1);
+  EXPECT_EQ(z, gpu::get_lane_id() % 2 ? gpu::get_lane_id() / 2 + 1 : 0);
+
+  gpu::sync_lane(mask);
+}
+
+TEST_MAIN(int argc, char **argv, char **envp) {
+  test_reduce();
+
+  test_scan();
+
+  return 0;
+}

/// Gets the sum of all lanes inside the warp or wavefront.
Contributor

We have llvm.amdgcn.wave.reduce.umin/umax. The intent was to expand those intrinsics to cover more operations, since the expansion has different strategies depending on the available instructions.

Contributor Author

@jhuber6 jhuber6 Mar 12, 2024


I'll need this for the NVPTX build in any case, and I think it's sufficient for now to have this in the source the way I need it until someone wants to implement the suite more correctly. However, I don't think it will change much, considering that this is just convergent addition.

So, can we keep this implementation and replace it with a better builtin one later?

Contributor


Also, it looks like the alternative implementations were never written; currently we don't have the bpermute or DPP paths.

Contributor Author


I definitely see the appeal of having these as intrinsic functions, specifically because right now I need to use __builtin_amdgcn_wavefrontsize, which prevents these loops from being fully unrolled. However, I think it will be easy to drop in those intrinsic functions later.

Contributor

@shiltian shiltian left a comment


LGTM. I think we can revisit possible target-dependent implementations/improvements on the fly.

@jhuber6 jhuber6 merged commit 261e564 into llvm:main Mar 12, 2024
4 checks passed

4 participants