Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[libc] Remove CPU dependent AMDGPU instructions #80707

Merged
merged 3 commits into from
Feb 6, 2024

Conversation

jhuber6
Copy link
Contributor

@jhuber6 jhuber6 commented Feb 5, 2024

Summary:
Some recent changes allowed us to remove target level divergence one
these instructions. This patch removes the wavefront dependent
divergence for the ballot and thread ID functions, as well as the clock.
The changes to the "Vendor" library simply disables target specific
optimizations in the implementation. This should be removed in its
entirety when the LLVM libm is sufficiently implemented.

The remaining areas of divergence is only the RPC packet size and the
fixed frequency counter.

Summary:
Some recent changes allowed us to remove target level divergence one
these instructions. This patch removes the wavefront dependent
divergence for the ballot and thread ID functions, as well as the clock.
The changes to the "Vendor" library simply disables target specific
optimizations in the implementation. This should be removed in its
entirety when the LLVM `libm` is sufficiently implemented.

The remaining areas of divergence is only the RPC packet size and the
fixed frequency counter.
@llvmbot
Copy link

llvmbot commented Feb 5, 2024

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-libc

Author: Joseph Huber (jhuber6)

Changes

Summary:
Some recent changes allowed us to remove target level divergence one
these instructions. This patch removes the wavefront dependent
divergence for the ballot and thread ID functions, as well as the clock.
The changes to the "Vendor" library simply disables target specific
optimizations in the implementation. This should be removed in its
entirety when the LLVM libm is sufficiently implemented.

The remaining areas of divergence is only the RPC packet size and the
fixed frequency counter.


Full diff: https://github.com/llvm/llvm-project/pull/80707.diff

2 Files Affected:

  • (modified) libc/src/__support/GPU/amdgpu/utils.h (+4-18)
  • (modified) libc/src/math/gpu/vendor/amdgpu/platform.h (+2-82)
diff --git a/libc/src/__support/GPU/amdgpu/utils.h b/libc/src/__support/GPU/amdgpu/utils.h
index 96e3efccb3b5a5..58bbe29cb3a7d7 100644
--- a/libc/src/__support/GPU/amdgpu/utils.h
+++ b/libc/src/__support/GPU/amdgpu/utils.h
@@ -113,10 +113,7 @@ LIBC_INLINE uint32_t get_lane_size() { return LANE_SIZE; }
 
 /// Returns the id of the thread inside of an AMD wavefront executing together.
 [[clang::convergent]] LIBC_INLINE uint32_t get_lane_id() {
-  if constexpr (LANE_SIZE == 64)
-    return __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u));
-  else
-    return __builtin_amdgcn_mbcnt_lo(~0u, 0u);
+  return __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u));
 }
 
 /// Returns the bit-mask of active threads in the current wavefront.
@@ -134,11 +131,7 @@ LIBC_INLINE uint32_t get_lane_size() { return LANE_SIZE; }
 [[clang::convergent]] LIBC_INLINE uint64_t ballot(uint64_t lane_mask, bool x) {
   // the lane_mask & gives the nvptx semantics when lane_mask is a subset of
   // the active threads
-  if constexpr (LANE_SIZE == 64) {
-    return lane_mask & __builtin_amdgcn_ballot_w64(x);
-  } else {
-    return lane_mask & __builtin_amdgcn_ballot_w32(x);
-  }
+  return lane_mask & __builtin_amdgcn_ballot_w64(x);
 }
 
 /// Waits for all the threads in the block to converge and issues a fence.
@@ -153,15 +146,8 @@ LIBC_INLINE uint32_t get_lane_size() { return LANE_SIZE; }
 }
 
 /// Returns the current value of the GPU's processor clock.
-/// NOTE: The RDNA3 and RDNA2 architectures use a 20-bit cycle cycle counter.
-LIBC_INLINE uint64_t processor_clock() {
-  if constexpr (LIBC_HAS_BUILTIN(__builtin_amdgcn_s_memtime))
-    return __builtin_amdgcn_s_memtime();
-  else if constexpr (LIBC_HAS_BUILTIN(__builtin_readcyclecounter))
-    return __builtin_readcyclecounter();
-  else
-    return 0;
-}
+/// NOTE: The RDNA3 and RDNA2 architectures use a 20-bit cycle counter.
+LIBC_INLINE uint64_t processor_clock() { return __builtin_readcyclecounter(); }
 
 /// Returns a fixed-frequency timestamp. The actual frequency is dependent on
 /// the card and can only be queried via the driver.
diff --git a/libc/src/math/gpu/vendor/amdgpu/platform.h b/libc/src/math/gpu/vendor/amdgpu/platform.h
index 160a8508cd8b95..b22240419cd0c1 100644
--- a/libc/src/math/gpu/vendor/amdgpu/platform.h
+++ b/libc/src/math/gpu/vendor/amdgpu/platform.h
@@ -32,88 +32,8 @@ extern const LIBC_INLINE_VAR uint8_t __oclc_correctly_rounded_sqrt32 = 1;
 // Disable finite math optimizations.
 extern const LIBC_INLINE_VAR uint8_t __oclc_finite_only_opt = 0;
 
-#if defined(__gfx700__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 7000;
-#elif defined(__gfx701__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 7001;
-#elif defined(__gfx702__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 7002;
-#elif defined(__gfx703__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 7003;
-#elif defined(__gfx704__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 7004;
-#elif defined(__gfx705__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 7005;
-#elif defined(__gfx801__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 8001;
-#elif defined(__gfx802__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 8002;
-#elif defined(__gfx803__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 8003;
-#elif defined(__gfx805__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 8005;
-#elif defined(__gfx810__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 8100;
-#elif defined(__gfx900__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9000;
-#elif defined(__gfx902__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9002;
-#elif defined(__gfx904__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9004;
-#elif defined(__gfx906__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9006;
-#elif defined(__gfx908__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9008;
-#elif defined(__gfx909__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9009;
-#elif defined(__gfx90a__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9010;
-#elif defined(__gfx90c__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9012;
-#elif defined(__gfx940__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9400;
-#elif defined(__gfx941__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9401;
-#elif defined(__gfx942__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 9402;
-#elif defined(__gfx1010__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10100;
-#elif defined(__gfx1011__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10101;
-#elif defined(__gfx1012__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10102;
-#elif defined(__gfx1013__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10103;
-#elif defined(__gfx1030__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10300;
-#elif defined(__gfx1031__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10301;
-#elif defined(__gfx1032__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10302;
-#elif defined(__gfx1033__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10303;
-#elif defined(__gfx1034__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10304;
-#elif defined(__gfx1035__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10305;
-#elif defined(__gfx1036__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 10306;
-#elif defined(__gfx1100__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 11000;
-#elif defined(__gfx1101__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 11001;
-#elif defined(__gfx1102__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 11002;
-#elif defined(__gfx1103__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 11003;
-#elif defined(__gfx1150__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 11500;
-#elif defined(__gfx1151__)
-extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 11501;
-#else
-#error "Unknown AMDGPU architecture"
-#endif
-}
+// Set the ISA to zero to ingore all ISA specific optimizations.
+extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 0;
 
 // These aliases cause clang to emit the control constants with ODR linkage.
 // This allows us to link against the symbols without preventing them from being

Comment on lines 35 to 36
// Set the ISA to zero to ingore all ISA specific optimizations.
extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not just optimizations and this will break things. you just happen to not be using any of the things that can break in ockl

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this only links ocml.bc. I checked the source and it's only used in the following

./src/opts.h:#define HAVE_FAST_FMA32() (__oclc_ISA_version == 7001 || __oclc_ISA_version == 8001 || __oclc_ISA_version >= 9000)
./src/opts.h:#define HAVE_BUGGY_FREXP_INSTRUCTIONS() (__oclc_ISA_version < 7000)

I'm assuming that setting it to zero here will simply impact performance since it'll use a non-FMA slow path and prtobably a software workaround for frexp. I'm eager to delete this dependendance altogether so I'm not concerned about performance. I could probably just set it to 9000 or something if needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the fast FMA case, you're much better off assuming true than false.

The HAVE_BUGGY_FREXP_INSTRUCTIONS case is gone since the generic frexp intrinsic was added; I assume you're looking at an old copy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

#elif defined(__gfx810__)
extern const LIBC_INLINE_VAR uint32_t __oclc_ISA_version = 8100;
#elif defined(__gfx900__)
// Set the ISA to gfx900 to assume FMA instructions are present.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment could use more elaboration. You're relying on assumptions about what the library is doing with this, and the specific subset you're using

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have added more detail about why it's OK here

@jhuber6 jhuber6 merged commit 026f3c1 into llvm:main Feb 6, 2024
3 of 4 checks passed
@jplehr
Copy link
Contributor

jplehr commented Feb 6, 2024

Hi, this appears to have broken the AMD GPU libc bot https://lab.llvm.org/staging/#/builders/11/builds/1656

@jhuber6
Copy link
Contributor Author

jhuber6 commented Feb 6, 2024

Hi, this appears to have broken the AMD GPU libc bot https://lab.llvm.org/staging/#/builders/11/builds/1656

Oops, fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants