[libc] Remove CPU dependent AMDGPU instructions #80707

jhuber6 · 2024-02-05T16:41:15Z

Summary:
Some recent changes allowed us to remove target level divergence one
these instructions. This patch removes the wavefront dependent
divergence for the ballot and thread ID functions, as well as the clock.
The changes to the "Vendor" library simply disables target specific
optimizations in the implementation. This should be removed in its
entirety when the LLVM libm is sufficiently implemented.

The remaining areas of divergence is only the RPC packet size and the
fixed frequency counter.

Summary: Some recent changes allowed us to remove target level divergence one these instructions. This patch removes the wavefront dependent divergence for the ballot and thread ID functions, as well as the clock. The changes to the "Vendor" library simply disables target specific optimizations in the implementation. This should be removed in its entirety when the LLVM `libm` is sufficiently implemented. The remaining areas of divergence is only the RPC packet size and the fixed frequency counter.

llvmbot · 2024-02-05T16:41:47Z

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-libc

Author: Joseph Huber (jhuber6)

Changes

Summary:
Some recent changes allowed us to remove target level divergence one
these instructions. This patch removes the wavefront dependent
divergence for the ballot and thread ID functions, as well as the clock.
The changes to the "Vendor" library simply disables target specific
optimizations in the implementation. This should be removed in its
entirety when the LLVM libm is sufficiently implemented.

The remaining areas of divergence is only the RPC packet size and the
fixed frequency counter.

Full diff: https://github.com/llvm/llvm-project/pull/80707.diff

2 Files Affected:

(modified) libc/src/__support/GPU/amdgpu/utils.h (+4-18)
(modified) libc/src/math/gpu/vendor/amdgpu/platform.h (+2-82)

diff --git a/libc/src/__support/GPU/amdgpu/utils.h b/libc/src/__support/GPU/amdgpu/utils.h
index 96e3efccb3b5a5..58bbe29cb3a7d7 100644
--- a/libc/src/__support/GPU/amdgpu/utils.h
+++ b/libc/src/__support/GPU/amdgpu/utils.h
@@ -113,10 +113,7 @@ LIBC_INLINE uint32_t get_lane_size() { return LANE_SIZE; }
 
 /// Returns the id of the thread inside of an AMD wavefront executing together.
 [[clang::convergent]] LIBC_INLINE uint32_t get_lane_id() {
-  if constexpr (LANE_SIZE == 64)
-    return __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u));
-  else
-    return __builtin_amdgcn_mbcnt_lo(~0u, 0u);
+  return __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u));
 }
 
 /// Returns the bit-mask of active threads in the current wavefront.
@@ -134,11 +131,7 @@ LIBC_INLINE uint32_t get_lane_size() { return LANE_SIZE; }
 [[clang::convergent]] LIBC_INLINE uint64_t ballot(uint64_t lane_mask, bool x) {
   // the lane_mask & gives the nvptx semantics when lane_mask is a subset of
   // the active threads
-  if constexpr (LANE_SIZE == 64) {
-    return lane_mask & __builtin_amdgcn_ballot_w64(x);
-  } else {
-    return lane_mask & __builtin_amdgcn_ballot_w32(x);
-  }
+  return lane_mask & __builtin_amdgcn_ballot_w64(x);
 }
 
 /// Waits for all the threads in the block to converge and issues a fence.
@@ -153,15 +146,8 @@ LIBC_INLINE uint32_t get_lane_size() { return LANE_SIZE; }
 }
 
 /// Returns the current value of the GPU's processor clock.
-/// NOTE: The RDNA3 and RDNA2 architectures use a 20-bit cycle cycle counter.
-LIBC_INLINE uint64_t processor_clock() {
-  if constexpr (LIBC_HAS_BUILTIN(__builtin_amdgcn_s_memtime))
-    return __builtin_amdgcn_s_memtime();
-  else if constexpr (LIBC_HAS_BUILTIN(__builtin_readcyclecounter))
-    return __builtin_readcyclecounter();
-  else
-    return 0;
-}
+/// NOTE: The RDNA3 and RDNA2 architectures use a 20-bit cycle counter.
+LIBC_INLINE uint64_t processor_clock() { return __builtin_readcyclecounter(); }
 
 /// Returns a fixed-frequency timestamp. The actual frequency is dependent on
 /// the card and can only be queried via the driver.
diff --git a/libc/src/math/gpu/vendor/amdgpu/platform.h b/libc/src/math/gpu/vendor/amdgpu/platform.h
index 160a8508cd8b95..b22240419cd0c1 100644
--- a/libc/src/math/gpu/vendor/amdgpu/platform.h
+++ b/libc/src/math/gpu/vendor/amdgpu/platform.h
@@ -32,88 +32,8 @@ extern const LIBC_INLINE_VAR uint8_t __oclc_correctly_rounded_sqrt32 = 1;
 // Disable finite math optimizations.
 extern const LIBC_INLINE_VAR uint8_t __oclc_finite_only_opt = 0;
 
-#if defined(__gfx700__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 7000;
-#elif defined(__gfx701__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 7001;
-#elif defined(__gfx702__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 7002;
-#elif defined(__gfx703__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 7003;
-#elif defined(__gfx704__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 7004;
-#elif defined(__gfx705__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 7005;
-#elif defined(__gfx801__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 8001;
-#elif defined(__gfx802__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 8002;
-#elif defined(__gfx803__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 8003;
-#elif defined(__gfx805__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 8005;
-#elif defined(__gfx810__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 8100;
-#elif defined(__gfx900__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9000;
-#elif defined(__gfx902__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9002;
-#elif defined(__gfx904__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9004;
-#elif defined(__gfx906__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9006;
-#elif defined(__gfx908__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9008;
-#elif defined(__gfx909__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9009;
-#elif defined(__gfx90a__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9010;
-#elif defined(__gfx90c__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9012;
-#elif defined(__gfx940__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9400;
-#elif defined(__gfx941__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9401;
-#elif defined(__gfx942__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9402;
-#elif defined(__gfx1010__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10100;
-#elif defined(__gfx1011__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10101;
-#elif defined(__gfx1012__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10102;
-#elif defined(__gfx1013__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10103;
-#elif defined(__gfx1030__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10300;
-#elif defined(__gfx1031__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10301;
-#elif defined(__gfx1032__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10302;
-#elif defined(__gfx1033__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10303;
-#elif defined(__gfx1034__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10304;
-#elif defined(__gfx1035__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10305;
-#elif defined(__gfx1036__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10306;
-#elif defined(__gfx1100__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 11000;
-#elif defined(__gfx1101__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 11001;
-#elif defined(__gfx1102__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 11002;
-#elif defined(__gfx1103__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 11003;
-#elif defined(__gfx1150__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 11500;
-#elif defined(__gfx1151__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 11501;
-#else
-#error "Unknown AMDGPU architecture"
-#endif
-}
+// Set the ISA to zero to ingore all ISA specific optimizations.
+extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 0;
 
 // These aliases cause clang to emit the control constants with ODR linkage.
 // This allows us to link against the symbols without preventing them from being

arsenm · 2024-02-05T17:10:39Z

libc/src/math/gpu/vendor/amdgpu/platform.h

+// Set the ISA to zero to ingore all ISA specific optimizations.
+extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 0;


it's not just optimizations and this will break things. you just happen to not be using any of the things that can break in ockl

So this only links ocml.bc. I checked the source and it's only used in the following

./src/opts.h:#define HAVE_FAST_FMA32() (__oclc_ISA_version == 7001 || __oclc_ISA_version == 8001 || __oclc_ISA_version >= 9000) ./src/opts.h:#define HAVE_BUGGY_FREXP_INSTRUCTIONS() (__oclc_ISA_version < 7000)

I'm assuming that setting it to zero here will simply impact performance since it'll use a non-FMA slow path and prtobably a software workaround for frexp. I'm eager to delete this dependendance altogether so I'm not concerned about performance. I could probably just set it to 9000 or something if needed.

For the fast FMA case, you're much better off assuming true than false.

The HAVE_BUGGY_FREXP_INSTRUCTIONS case is gone since the generic frexp intrinsic was added; I assume you're looking at an old copy

arsenm · 2024-02-06T05:29:07Z

libc/src/math/gpu/vendor/amdgpu/platform.h

-#elif defined(__gfx810__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 8100;
-#elif defined(__gfx900__)
+// Set the ISA to gfx900 to assume FMA instructions are present.


I think this comment could use more elaboration. You're relying on assumptions about what the library is doing with this, and the specific subset you're using

arsenm

I would have added more detail about why it's OK here

jplehr · 2024-02-06T13:35:18Z

Hi, this appears to have broken the AMD GPU libc bot https://lab.llvm.org/staging/#/builders/11/builds/1656

jhuber6 · 2024-02-06T13:38:50Z

Hi, this appears to have broken the AMD GPU libc bot https://lab.llvm.org/staging/#/builders/11/builds/1656

Oops, fixed.

jhuber6 requested review from arsenm, jdoerfert, JonChesterfield and lntue February 5, 2024 16:41

llvmbot added backend:AMDGPU libc labels Feb 5, 2024

arsenm reviewed Feb 5, 2024

View reviewed changes

Address comments

729afca

arsenm reviewed Feb 6, 2024

View reviewed changes

More comment

0f111fe

arsenm approved these changes Feb 6, 2024

View reviewed changes

jhuber6 merged commit 026f3c1 into llvm:main Feb 6, 2024
3 of 4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[libc] Remove CPU dependent AMDGPU instructions #80707

[libc] Remove CPU dependent AMDGPU instructions #80707

jhuber6 commented Feb 5, 2024

llvmbot commented Feb 5, 2024 •

edited

Loading

arsenm Feb 5, 2024

jhuber6 Feb 5, 2024

arsenm Feb 5, 2024

jhuber6 Feb 5, 2024

arsenm Feb 6, 2024

arsenm left a comment

jplehr commented Feb 6, 2024

jhuber6 commented Feb 6, 2024

		// Set the ISA to zero to ingore all ISA specific optimizations.
		extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 0;

[libc] Remove CPU dependent AMDGPU instructions #80707

[libc] Remove CPU dependent AMDGPU instructions #80707

Conversation

jhuber6 commented Feb 5, 2024

llvmbot commented Feb 5, 2024 • edited Loading

arsenm Feb 5, 2024

Choose a reason for hiding this comment

jhuber6 Feb 5, 2024

Choose a reason for hiding this comment

arsenm Feb 5, 2024

Choose a reason for hiding this comment

jhuber6 Feb 5, 2024

Choose a reason for hiding this comment

arsenm Feb 6, 2024

Choose a reason for hiding this comment

arsenm left a comment

Choose a reason for hiding this comment

jplehr commented Feb 6, 2024

jhuber6 commented Feb 6, 2024

llvmbot commented Feb 5, 2024 •

edited

Loading