
[libc] Rework the RPC interface to accept runtime wave sizes #80914

Merged (1 commit) on Feb 13, 2024

Conversation

@jhuber6 (Contributor) commented Feb 7, 2024

Summary:
The RPC interface needs to handle an entire warp or wavefront at once.
This is currently done by using a compile-time constant indicating the
size of the buffer, which right now defaults to some value on the client
(GPU) side. However, there are currently attempts to move the libc
library to a single IR build. This is problematic because the wavefront
size changes between ISAs on AMDGPU. The builtin
__builtin_amdgcn_wavefrontsize() will return the appropriate value,
but it is now only known at runtime.

In order to support this, this patch restructures the packet. Now
instead of having an array of arrays, we simply have a large array of
buffers and slice it according to the runtime value if we don't know it
ahead of time. This also somewhat has the advantage of making the buffer
contiguous within a page now that the header has been moved out of it.

@llvmbot (Collaborator) commented Feb 7, 2024

@llvm/pr-subscribers-libc

@llvm/pr-subscribers-backend-amdgpu

Author: Joseph Huber (jhuber6)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/80914.diff

5 Files Affected:

  • (modified) libc/src/__support/GPU/amdgpu/utils.h (+5-5)
  • (modified) libc/src/__support/GPU/generic/utils.h (+1-3)
  • (modified) libc/src/__support/GPU/nvptx/utils.h (+2-5)
  • (modified) libc/src/__support/RPC/rpc.h (+68-49)
  • (modified) libc/test/src/__support/RPC/rpc_smoke_test.cpp (+2-6)
diff --git a/libc/src/__support/GPU/amdgpu/utils.h b/libc/src/__support/GPU/amdgpu/utils.h
index 58bbe29cb3a7d7..9dc82ebcb6f129 100644
--- a/libc/src/__support/GPU/amdgpu/utils.h
+++ b/libc/src/__support/GPU/amdgpu/utils.h
@@ -17,9 +17,6 @@
 namespace LIBC_NAMESPACE {
 namespace gpu {
 
-/// The number of threads that execute in lock-step in a lane.
-constexpr const uint64_t LANE_SIZE = __AMDGCN_WAVEFRONT_SIZE;
-
 /// Type aliases to the address spaces used by the AMDGPU backend.
 template <typename T> using Private = [[clang::opencl_private]] T;
 template <typename T> using Constant = [[clang::opencl_constant]] T;
@@ -108,8 +105,11 @@ LIBC_INLINE uint64_t get_thread_id() {
          get_num_threads_x() * get_num_threads_y() * get_thread_id_z();
 }
 
-/// Returns the size of an AMD wavefront. Either 32 or 64 depending on hardware.
-LIBC_INLINE uint32_t get_lane_size() { return LANE_SIZE; }
+/// Returns the size of an AMD wavefront, either 32 or 64 depending on hardware.
+/// This is not a compile time constant and can vary within an executable.
+LIBC_INLINE uint32_t get_lane_size() {
+  return __builtin_amdgcn_wavefrontsize();
+}
 
 /// Returns the id of the thread inside of an AMD wavefront executing together.
 [[clang::convergent]] LIBC_INLINE uint32_t get_lane_id() {
diff --git a/libc/src/__support/GPU/generic/utils.h b/libc/src/__support/GPU/generic/utils.h
index 00b59837ccc671..58db88dce1ca8c 100644
--- a/libc/src/__support/GPU/generic/utils.h
+++ b/libc/src/__support/GPU/generic/utils.h
@@ -16,8 +16,6 @@
 namespace LIBC_NAMESPACE {
 namespace gpu {
 
-constexpr const uint64_t LANE_SIZE = 1;
-
 template <typename T> using Private = T;
 template <typename T> using Constant = T;
 template <typename T> using Shared = T;
@@ -55,7 +53,7 @@ LIBC_INLINE uint32_t get_thread_id_z() { return 0; }
 
 LIBC_INLINE uint64_t get_thread_id() { return 0; }
 
-LIBC_INLINE uint32_t get_lane_size() { return LANE_SIZE; }
+LIBC_INLINE uint32_t get_lane_size() { return 1; }
 
 LIBC_INLINE uint32_t get_lane_id() { return 0; }
 
diff --git a/libc/src/__support/GPU/nvptx/utils.h b/libc/src/__support/GPU/nvptx/utils.h
index e7e297adf7ecca..6c4bb5a7720a50 100644
--- a/libc/src/__support/GPU/nvptx/utils.h
+++ b/libc/src/__support/GPU/nvptx/utils.h
@@ -16,9 +16,6 @@
 namespace LIBC_NAMESPACE {
 namespace gpu {
 
-/// The number of threads that execute in lock-step in a warp.
-constexpr const uint64_t LANE_SIZE = 32;
-
 /// Type aliases to the address spaces used by the NVPTX backend.
 template <typename T> using Private = [[clang::opencl_private]] T;
 template <typename T> using Constant = [[clang::opencl_constant]] T;
@@ -95,8 +92,8 @@ LIBC_INLINE uint64_t get_thread_id() {
          get_num_threads_x() * get_num_threads_y() * get_thread_id_z();
 }
 
-/// Returns the size of a CUDA warp.
-LIBC_INLINE uint32_t get_lane_size() { return LANE_SIZE; }
+/// Returns the size of a CUDA warp, always 32 on NVIDIA hardware.
+LIBC_INLINE uint32_t get_lane_size() { return 32; }
 
 /// Returns the id of the thread inside of a CUDA warp executing together.
 [[clang::convergent]] LIBC_INLINE uint32_t get_lane_id() {
diff --git a/libc/src/__support/RPC/rpc.h b/libc/src/__support/RPC/rpc.h
index 7924d4cec2ac84..639bb71454a744 100644
--- a/libc/src/__support/RPC/rpc.h
+++ b/libc/src/__support/RPC/rpc.h
@@ -43,19 +43,6 @@ struct Header {
   uint16_t opcode;
 };
 
-/// The data payload for the associated packet. We provide enough space for each
-/// thread in the cooperating lane to have a buffer.
-template <uint32_t lane_size = gpu::LANE_SIZE> struct Payload {
-  Buffer slot[lane_size];
-};
-
-/// A packet used to share data between the client and server across an entire
-/// lane. We use a lane as the minimum granularity for execution.
-template <uint32_t lane_size = gpu::LANE_SIZE> struct alignas(64) Packet {
-  Header header;
-  Payload<lane_size> payload;
-};
-
 /// The maximum number of parallel ports that the RPC interface can support.
 constexpr uint64_t MAX_PORT_COUNT = 4096;
 
@@ -71,7 +58,7 @@ constexpr uint64_t MAX_PORT_COUNT = 4096;
 ///   - The client will always start with a 'send' operation.
 ///   - The server will always start with a 'recv' operation.
 ///   - Every 'send' or 'recv' call is mirrored by the other process.
-template <bool Invert, typename Packet> struct Process {
+template <bool Invert, uint32_t lane_size> struct Process {
   LIBC_INLINE Process() = default;
   LIBC_INLINE Process(const Process &) = delete;
   LIBC_INLINE Process &operator=(const Process &) = delete;
@@ -82,7 +69,8 @@ template <bool Invert, typename Packet> struct Process {
   uint32_t port_count = 0;
   cpp::Atomic<uint32_t> *inbox = nullptr;
   cpp::Atomic<uint32_t> *outbox = nullptr;
-  Packet *packet = nullptr;
+  Header *header = nullptr;
+  Buffer *packet = nullptr;
 
   static constexpr uint64_t NUM_BITS_IN_WORD = sizeof(uint32_t) * 8;
   cpp::Atomic<uint32_t> lock[MAX_PORT_COUNT / NUM_BITS_IN_WORD] = {0};
@@ -92,7 +80,9 @@ template <bool Invert, typename Packet> struct Process {
                                     advance(buffer, inbox_offset(port_count)))),
         outbox(reinterpret_cast<cpp::Atomic<uint32_t> *>(
             advance(buffer, outbox_offset(port_count)))),
-        packet(reinterpret_cast<Packet *>(
+        header(reinterpret_cast<Header *>(
+            advance(buffer, header_offset(port_count)))),
+        packet(reinterpret_cast<Buffer *>(
             advance(buffer, buffer_offset(port_count)))) {}
 
   /// Allocate a memory buffer sufficient to store the following equivalent
@@ -101,7 +91,8 @@ template <bool Invert, typename Packet> struct Process {
   /// struct Equivalent {
   ///   Atomic<uint32_t> primary[port_count];
   ///   Atomic<uint32_t> secondary[port_count];
-  ///   Packet buffer[port_count];
+  ///   Header header[port_count];
+  ///   Buffer packet[port_count][lane_size];
   /// };
   LIBC_INLINE static constexpr uint64_t allocation_size(uint32_t port_count) {
     return buffer_offset(port_count) + buffer_bytes(port_count);
@@ -144,6 +135,18 @@ template <bool Invert, typename Packet> struct Process {
     atomic_thread_fence(cpp::MemoryOrder::ACQUIRE);
   }
 
+  /// The packet is a linearly allocated array of buffers used to communicate
+  /// with the other process. This function returns the appropriate slot in this
+  /// array such that the process can operate on an entire warp or wavefront at
+  /// once.
+  LIBC_INLINE Buffer *get_packet(uint32_t index) {
+    // A value of UINT32_MAX indicates the lane_size is only known at runtime.
+    if constexpr (lane_size == UINT32_MAX)
+      return &packet[index * gpu::get_lane_size()];
+    else
+      return &packet[index * lane_size];
+  }
+
   /// Determines if this process needs to wait for ownership of the buffer. We
   /// invert the condition on one of the processes to indicate that if one
   /// process owns the buffer then the other does not.
@@ -220,7 +223,8 @@ template <bool Invert, typename Packet> struct Process {
 
   /// Number of bytes to allocate for the buffer containing the packets.
   LIBC_INLINE static constexpr uint64_t buffer_bytes(uint32_t port_count) {
-    return port_count * sizeof(Packet);
+    static_assert(lane_size != UINT32_MAX, "Cannot be used with runtime size");
+    return port_count * lane_size * sizeof(Buffer);
   }
 
   /// Offset of the inbox in memory. This is the same as the outbox if inverted.
@@ -233,9 +237,15 @@ template <bool Invert, typename Packet> struct Process {
     return Invert ? 0 : mailbox_bytes(port_count);
   }
 
+  /// Offset of the buffer containing the packets after the inbox and outbox.
+  LIBC_INLINE static constexpr uint64_t header_offset(uint32_t port_count) {
+    return align_up(2 * mailbox_bytes(port_count), alignof(Header));
+  }
+
   /// Offset of the buffer containing the packets after the inbox and outbox.
   LIBC_INLINE static constexpr uint64_t buffer_offset(uint32_t port_count) {
-    return align_up(2 * mailbox_bytes(port_count), alignof(Packet));
+    return align_up(header_offset(port_count) + port_count * sizeof(Header),
+                    alignof(Buffer));
   }
 
   /// Conditionally set the n-th bit in the atomic bitfield.
@@ -264,33 +274,33 @@ template <bool Invert, typename Packet> struct Process {
 /// Invokes a function accross every active buffer across the total lane size.
 template <uint32_t lane_size>
 static LIBC_INLINE void invoke_rpc(cpp::function<void(Buffer *)> fn,
-                                   Packet<lane_size> &packet) {
+                                   Buffer *slot, uint64_t mask) {
   if constexpr (is_process_gpu()) {
-    fn(&packet.payload.slot[gpu::get_lane_id()]);
+    fn(&slot[gpu::get_lane_id()]);
   } else {
     for (uint32_t i = 0; i < lane_size; i += gpu::get_lane_size())
-      if (packet.header.mask & 1ul << i)
-        fn(&packet.payload.slot[i]);
+      if (mask & 1ul << i)
+        fn(&slot[i]);
   }
 }
 
 /// Alternate version that also provides the index of the current lane.
 template <uint32_t lane_size>
 static LIBC_INLINE void invoke_rpc(cpp::function<void(Buffer *, uint32_t)> fn,
-                                   Packet<lane_size> &packet) {
+                                   Buffer *slot, uint64_t mask) {
   if constexpr (is_process_gpu()) {
-    fn(&packet.payload.slot[gpu::get_lane_id()], gpu::get_lane_id());
+    fn(&slot[gpu::get_lane_id()], gpu::get_lane_id());
   } else {
     for (uint32_t i = 0; i < lane_size; i += gpu::get_lane_size())
-      if (packet.header.mask & 1ul << i)
-        fn(&packet.payload.slot[i], i);
+      if (mask & 1ul << i)
+        fn(&slot[i], i);
   }
 }
 
 /// The port provides the interface to communicate between the multiple
 /// processes. A port is conceptually an index into the memory provided by the
 /// underlying process that is guarded by a lock bit.
-template <bool T, typename S> struct Port {
+template <bool T, uint32_t S> struct Port {
   LIBC_INLINE Port(Process<T, S> &process, uint64_t lane_mask, uint32_t index,
                    uint32_t out)
       : process(process), lane_mask(lane_mask), index(index), out(out),
@@ -319,7 +329,7 @@ template <bool T, typename S> struct Port {
   LIBC_INLINE void recv_n(void **dst, uint64_t *size, A &&alloc);
 
   LIBC_INLINE uint16_t get_opcode() const {
-    return process.packet[index].header.opcode;
+    return process.header[index].opcode;
   }
 
   LIBC_INLINE uint16_t get_index() const { return index; }
@@ -351,15 +361,24 @@ struct Client {
   LIBC_INLINE Client(uint32_t port_count, void *buffer)
       : process(port_count, buffer) {}
 
-  using Port = rpc::Port<false, Packet<gpu::LANE_SIZE>>;
+#if defined(LIBC_TARGET_ARCH_IS_AMDGPU)
+  // The AMDGPU wavefront size is only known at runtime. We indicate this to the
+  // process using a value of UINT32_MAX.
+  static constexpr uint32_t lane_size = UINT32_MAX;
+#elif defined(LIBC_TARGET_ARCH_IS_NVPTX)
+  static constexpr uint32_t lane_size = 32;
+#else
+  static constexpr uint32_t lane_size = 1;
+#endif
+
+  using Port = rpc::Port<false, lane_size>;
   template <uint16_t opcode> LIBC_INLINE Port open();
 
 private:
-  Process<false, Packet<gpu::LANE_SIZE>> process;
+  Process<false, lane_size> process;
 };
 static_assert(cpp::is_trivially_copyable<Client>::value &&
-                  sizeof(Process<false, Packet<1>>) ==
-                      sizeof(Process<false, Packet<32>>),
+                  sizeof(Process<false, 1>) == sizeof(Process<false, 32>),
               "The client is not trivially copyable from the server");
 
 /// The RPC server used to respond to the client.
@@ -372,20 +391,20 @@ template <uint32_t lane_size> struct Server {
   LIBC_INLINE Server(uint32_t port_count, void *buffer)
       : process(port_count, buffer) {}
 
-  using Port = rpc::Port<true, Packet<lane_size>>;
+  using Port = rpc::Port<true, lane_size>;
   LIBC_INLINE cpp::optional<Port> try_open(uint32_t start = 0);
   LIBC_INLINE Port open();
 
   LIBC_INLINE static uint64_t allocation_size(uint32_t port_count) {
-    return Process<true, Packet<lane_size>>::allocation_size(port_count);
+    return Process<true, lane_size>::allocation_size(port_count);
   }
 
 private:
-  Process<true, Packet<lane_size>> process;
+  Process<true, lane_size> process;
 };
 
 /// Applies \p fill to the shared buffer and initiates a send operation.
-template <bool T, typename S>
+template <bool T, uint32_t S>
 template <typename F>
 LIBC_INLINE void Port<T, S>::send(F fill) {
   uint32_t in = owns_buffer ? out ^ T : process.load_inbox(lane_mask, index);
@@ -394,14 +413,14 @@ LIBC_INLINE void Port<T, S>::send(F fill) {
   process.wait_for_ownership(lane_mask, index, out, in);
 
   // Apply the \p fill function to initialize the buffer and release the memory.
-  invoke_rpc(fill, process.packet[index]);
+  invoke_rpc<S>(fill, process.get_packet(index), process.header[index].mask);
   out = process.invert_outbox(index, out);
   owns_buffer = false;
   receive = false;
 }
 
 /// Applies \p use to the shared buffer and acknowledges the send.
-template <bool T, typename S>
+template <bool T, uint32_t S>
 template <typename U>
 LIBC_INLINE void Port<T, S>::recv(U use) {
   // We only exchange ownership of the buffer during a receive if we are waiting
@@ -417,13 +436,13 @@ LIBC_INLINE void Port<T, S>::recv(U use) {
   process.wait_for_ownership(lane_mask, index, out, in);
 
   // Apply the \p use function to read the memory out of the buffer.
-  invoke_rpc(use, process.packet[index]);
+  invoke_rpc<S>(use, process.get_packet(index), process.header[index].mask);
   receive = true;
   owns_buffer = true;
 }
 
 /// Combines a send and receive into a single function.
-template <bool T, typename S>
+template <bool T, uint32_t S>
 template <typename F, typename U>
 LIBC_INLINE void Port<T, S>::send_and_recv(F fill, U use) {
   send(fill);
@@ -433,7 +452,7 @@ LIBC_INLINE void Port<T, S>::send_and_recv(F fill, U use) {
 /// Combines a receive and send operation into a single function. The \p work
 /// function modifies the buffer in-place and the send is only used to initiate
 /// the copy back.
-template <bool T, typename S>
+template <bool T, uint32_t S>
 template <typename W>
 LIBC_INLINE void Port<T, S>::recv_and_send(W work) {
   recv(work);
@@ -442,7 +461,7 @@ LIBC_INLINE void Port<T, S>::recv_and_send(W work) {
 
 /// Helper routine to simplify the interface when sending from the GPU using
 /// thread private pointers to the underlying value.
-template <bool T, typename S>
+template <bool T, uint32_t S>
 LIBC_INLINE void Port<T, S>::send_n(const void *src, uint64_t size) {
   const void **src_ptr = &src;
   uint64_t *size_ptr = &size;
@@ -451,7 +470,7 @@ LIBC_INLINE void Port<T, S>::send_n(const void *src, uint64_t size) {
 
 /// Sends an arbitrarily sized data buffer \p src across the shared channel in
 /// multiples of the packet length.
-template <bool T, typename S>
+template <bool T, uint32_t S>
 LIBC_INLINE void Port<T, S>::send_n(const void *const *src, uint64_t *size) {
   uint64_t num_sends = 0;
   send([&](Buffer *buffer, uint32_t id) {
@@ -465,7 +484,7 @@ LIBC_INLINE void Port<T, S>::send_n(const void *const *src, uint64_t *size) {
     rpc_memcpy(&buffer->data[1], lane_value(src, id), len);
   });
   uint64_t idx = sizeof(Buffer::data) - sizeof(uint64_t);
-  uint64_t mask = process.packet[index].header.mask;
+  uint64_t mask = process.header[index].mask;
   while (gpu::ballot(mask, idx < num_sends)) {
     send([=](Buffer *buffer, uint32_t id) {
       uint64_t len = lane_value(size, id) - idx > sizeof(Buffer::data)
@@ -481,7 +500,7 @@ LIBC_INLINE void Port<T, S>::send_n(const void *const *src, uint64_t *size) {
 /// Receives an arbitrarily sized data buffer across the shared channel in
 /// multiples of the packet length. The \p alloc function is called with the
 /// size of the data so that we can initialize the size of the \p dst buffer.
-template <bool T, typename S>
+template <bool T, uint32_t S>
 template <typename A>
 LIBC_INLINE void Port<T, S>::recv_n(void **dst, uint64_t *size, A &&alloc) {
   uint64_t num_recvs = 0;
@@ -498,7 +517,7 @@ LIBC_INLINE void Port<T, S>::recv_n(void **dst, uint64_t *size, A &&alloc) {
     rpc_memcpy(lane_value(dst, id), &buffer->data[1], len);
   });
   uint64_t idx = sizeof(Buffer::data) - sizeof(uint64_t);
-  uint64_t mask = process.packet[index].header.mask;
+  uint64_t mask = process.header[index].mask;
   while (gpu::ballot(mask, idx < num_recvs)) {
     recv([=](Buffer *buffer, uint32_t id) {
       uint64_t len = lane_value(size, id) - idx > sizeof(Buffer::data)
@@ -540,8 +559,8 @@ template <uint16_t opcode> LIBC_INLINE Client::Port Client::open() {
     }
 
     if (gpu::is_first_lane(lane_mask)) {
-      process.packet[index].header.opcode = opcode;
-      process.packet[index].header.mask = lane_mask;
+      process.header[index].opcode = opcode;
+      process.header[index].mask = lane_mask;
     }
     gpu::sync_lane(lane_mask);
     return Port(process, lane_mask, index, out);
diff --git a/libc/test/src/__support/RPC/rpc_smoke_test.cpp b/libc/test/src/__support/RPC/rpc_smoke_test.cpp
index 54821e21f9ccf7..ea40bea0e4c695 100644
--- a/libc/test/src/__support/RPC/rpc_smoke_test.cpp
+++ b/libc/test/src/__support/RPC/rpc_smoke_test.cpp
@@ -13,12 +13,8 @@
 namespace {
 enum { lane_size = 8, port_count = 4 };
 
-struct Packet {
-  uint64_t unused;
-};
-
-using ProcAType = LIBC_NAMESPACE::rpc::Process<false, Packet>;
-using ProcBType = LIBC_NAMESPACE::rpc::Process<true, Packet>;
+using ProcAType = LIBC_NAMESPACE::rpc::Process<false, 1>;
+using ProcBType = LIBC_NAMESPACE::rpc::Process<true, 1>;
 
 static_assert(ProcAType::inbox_offset(port_count) ==
               ProcBType::outbox_offset(port_count));

Comment on lines 365 to 366
// The AMDGPU wavefront size is only known at runtime. We indicate this to the
// process using a value of UINT32_MAX.
Contributor:

this isn't really accurate. it's known at compile time of the kernel

Contributor Author:

Compile time has a slightly different meaning in C++; here it more or less means it's not constexpr. However, it's definitely not correct to say it's a runtime value.

/// The default lane size indicates that the actual lane size is not a compile
/// time constant and needs to be instead be queried from the platform.
constexpr uint32_t DEFAULT_LANE_SIZE = 0;

Collaborator:

This doesn't make any sense to me. There is no default lane size

Contributor Author:

Naming is hard, it's supposed to imply that it's not a value that can go in a template.

// Given the current outbox and inbox values, wait until the inbox changes
// to indicate that this thread owns the buffer element.
/// Given the current outbox and inbox values, wait until the inbox changes
/// to indicate that this thread owns the buffer element.
Collaborator:

spurious extra /

if constexpr (lane_size == DEFAULT_LANE_SIZE)
return &packet[index * gpu::get_lane_size()];
else
return &packet[index * lane_size];
Collaborator:

this is dubious. Just have the non-GPUs report a lane size of 1 and get rid of the DEFAULT_LANE_SIZE construct

Contributor Author:

The server (running on the CPU) needs to have it passed in manually, according to whatever the runtime states it should be (i.e. 32). So we need the extra argument there for the server to create the appropriately sized buffer; otherwise it would just be 1 in that case, and the two sides would have conflicting views of the buffer.


github-actions bot commented Feb 7, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -264,33 +277,35 @@ template <bool Invert, typename Packet> struct Process {
/// Invokes a function accross every active buffer across the total lane size.
template <uint32_t lane_size>
static LIBC_INLINE void invoke_rpc(cpp::function<void(Buffer *)> fn,
Packet<lane_size> &packet) {
Buffer *slot, uint64_t mask) {
Collaborator:

what is mask here?

Contributor Author:

The packet used to contain the mask, now it's passed in. It's the exact same value. When we open a port we record the thread mask that was used so that the server knows not to touch the unused slots.

Right now it's sort of a precondition that the mask here matches the implicit execution mask, unfortunately you can't static assert that.

fn(&packet.payload.slot[i]);
auto sz = lane_size == VARIABLE_LANE_SIZE ? gpu::get_lane_size() : lane_size;
for (uint32_t i = 0; i < sz; i += gpu::get_lane_size())
if (mask & 1ul << i)
Collaborator:

This looks unlikely to be correct. Are you aware that it parses as (mask & 1) << i?

Contributor Author:

`<<` has precedence 7 while `&` has 11, so it parses as `mask & (1 << i)`. However, the fact that I had to look up the operator precedence is a good indicator that this should be in parentheses anyway.

Collaborator:

Yeah, I didn't remember the precedence and then misread the table at cppreference. Parens seem good here.

@JonChesterfield (Collaborator):

I think a fair amount of the complexity is from trying to treat the wavesize as a compile time constant in some places and as a variable in others. I don't think that's worth the hassle - if we go with it as a "runtime" value everywhere, it'll still constant propagate on the GPU, and it might do so on the host as well if function specialisation decides it's worth doing.

Or we could go with it being a compile time constant everywhere and make the function specialisation explicit. The host side would pick up a switch to convert the runtime value to a compile time one. Messy. In general the rpc implementation in trunk tends to trust the compiler to constant propagate (unlike the prototype, which burned everything at compile time), so leaving calls to gpu::wavesize() scattered around is probably the better way to go.

@jhuber6 (Contributor Author) commented Feb 8, 2024

I think a fair amount of the complexity is from trying to treat the wavesize as a compile time constant in some places and as a variable in others. I don't think that's worth the hassle - if we go with it as a "runtime" value everywhere, it'll still constant propagate on the GPU, and it might do so on the host as well if function specialisation decides it's worth doing.

The main issue is that the Server needs to be instantiated with the given size. For example, if I have an NVIDIA GPU and an AMD machine, I will end up with a Server<32> and a Server<64> respectively; the lane size is required to slice into the buffer array and compute the buffer bytes. The GPU can just call get_lane_size() and it returns the value, but the Server needs it passed in externally, which is currently done by the template. We cannot forward a runtime function to the Server interface that would do the right thing across the multitude of targets. Right now the logic is more like: the Client gets to default to whatever it wants, and the Server is responsible for making them match.

Or we could go with it's a compile time constant everywhere and make the function specialization explicit. The host side would pick up a switch to convert the runtime value to a compile time one. Messy. In general the rpc implementation in trunk tends to trust the compiler to constant propagate (unlike the prototype which burned everything at compile time) so leaving calls to gpu::wavesize() scattered around is probably the better way to go.

This was my first inclination, but it didn't really sit well when I tried it out. I could give it another shot, but I think it just requires way too much branching in the interface.

@arsenm (Contributor) commented Feb 8, 2024

The intrinsic won't actually fold out in any useful place today (it only happens extremely late in the backend). You're better off linking in a wavesize constant variable.

@jhuber6 (Contributor Author) commented Feb 8, 2024

The intrinsic won't actually fold out in any useful place today (only happens extremely late in the backend). You're better off linking in a wavesize constant variable

I did some really informal checks on register usage and it didn't seem to change anything. Really simple branches like this one seem to at least get optimized out, see https://godbolt.org/z/qbjqfq3WG. I think in this case it's a constant written to a register, which the late passes then propagate.

@jhuber6 (Contributor Author) commented Feb 8, 2024

The intrinsic won't actually fold out in any useful place today (only happens extremely late in the backend). You're better off linking in a wavesize constant variable

I did some basic tests at various optimization levels and didn't find a case where this changed resource usage via -Rpass-analysis=kernel-resource-usage. Looking at https://godbolt.org/z/T468dT1PY it at least seems capable of optimizing through very shallow branches, and this is pretty much just a single constant read and multiplication.

@arsenm (Contributor) commented Feb 8, 2024

The intrinsic won't actually fold out in any useful place today (only happens extremely late in the backend). You're better off linking in a wavesize constant variable

I did some basic tests at various optimization levels and didn't find a case where this changed resource usage via -Rpass-analysis=kernel-resource-usage. Looking at https://godbolt.org/z/T468dT1PY it at least seems capable of optimizing through very shallow branches, and this is pretty much just a single constant read and multiplication.

It is not folded out at the IR level, where all the useful optimizations happen. This leaves dead code around until very late, when there is very little attempt to do anything about it

@jhuber6 (Contributor Author) commented Feb 8, 2024

The intrinsic won't actually fold out in any useful place today (only happens extremely late in the backend). You're better off linking in a wavesize constant variable

I did some basic tests at various optimization levels and didn't find a case where this changed resource usage via -Rpass-analysis=kernel-resource-usage. Looking at https://godbolt.org/z/T468dT1PY it at least seems capable of optimizing through very shallow branches, and this is pretty much just a single constant read and multiplication.

It is not folded out at the IR level, where all the useful optimizations happen. This leaves dead code around until very late, when there is very little attempt to do anything about it

The usage here simply unconditionally writes to a variable. The absolute worst-case scenario is a single extra register to hold the value instead of using it as a constant. However, this doesn't seem to happen even at O1 compared to the old version that used a template parameter, so I'm not overly concerned, especially compared to the complexity of the ROCm device libraries.

@JonChesterfield (Collaborator) left a comment:

Nice, thank you. The consistently runtime value which the optimiser hopefully removes is much clearer than the previous revision.

@JonChesterfield (Collaborator):

The intrinsic won't actually fold out in any useful place today (only happens extremely late in the backend). You're better off linking in a wavesize constant variable

That seems worth fixing - somewhere in the compiler we know the wavesize (sometimes from clang, sometimes from the backend), we can kill off the intrinsic whenever that information comes to light.

@jhuber6 (Contributor Author) commented Feb 12, 2024

The intrinsic won't actually fold out in any useful place today (only happens extremely late in the backend). You're better off linking in a wavesize constant variable

That seems worth fixing - somewhere in the compiler we know the wavesize (sometimes from clang, sometimes from the backend), we can kill off the intrinsic whenever that information comes to light.

I'm assuming this is only lowered in MIR because of the whole "variable per CFG" thing? Otherwise it should be trivial to make it just key off of the attributes / subtarget.

@arsenm (Contributor) commented Feb 12, 2024

That seems worth fixing - somewhere in the compiler we know the wavesize (sometimes from clang, sometimes from the backend), we can kill off the intrinsic whenever that information comes to light.

The problem is we always know the wavesize. It's 64. We have no wavesize unknown representation. We get away with this by folding it in codegen

@jhuber6 jhuber6 merged commit f879ac0 into llvm:main Feb 13, 2024
4 checks passed