[NVPTX][Draft] Make `__nvvm_nanosleep` a no-op if unsupported #81033

jhuber6 · 2024-02-07T20:12:30Z

Summary;
The LLVM C library currently uses nanosleep in the RPC interface and
for the C library nanosleep function. We build the LLVM C library for
every single NVPTX architecture individually currently, which is not
ideal. The goal is to make the LLVM-IR target independent, unfortunately
the one snag is the nanosleep function which will crash if used on a
GPU older than sm_70. There are three possible solutions to this.

Use __nvvm_reflect(__CUDA_ARCH__) like the libdevice functions.
This will work as long as optimizations are on, not ideal.
Get rid of the use of nanosleep in libc. This isn't ideal as
sleeping during the busy-wait loops is helpful for thread scheduling
and it prevents us from providing nanosleep as a C library
function.
This patch, which simply makes it legal on all architectures but do
nothing is it's older than sm_70.

This is a draft to question if this is an acceptable hack, as an
intrinsic silently doing nothing is not always a good idea. Potentially
a new intrinsic could be added instead, but there is also a desire to
have intrinsics map 1-to-1 with hardware.

Summary; The LLVM C library currently uses `nanosleep` in the RPC interface and for the C library `nanosleep` function. We build the LLVM C library for every single NVPTX architecture individually currently, which is not ideal. The goal is to make the LLVM-IR target independent, unfortunately the one snag is the `nanosleep` function which will crash if used on a GPU older than sm_70. There are three possible solutions to this. 1. Use `__nvvm_reflect(__CUDA_ARCH__)` like the libdevice functions. This will work as long as optimizations are on, not ideal. 2. Get rid of the use of nanosleep in `libc`. This isn't ideal as sleeping during the busy-wait loops is helpful for thread scheduling and it prevents us from providing `nanosleep` as a C library function. 3. This patch, which simply makes it legal on all architectures but do nothing is it's older than sm_70. This is a draft to question if this is an acceptable hack, as an intrinsic silently doing nothing is not always a good idea. Potentially a new intrinsic could be added instead, but there is also a desire to have intrinsics map 1-to-1 with hardware.

Artem-B · 2024-02-07T20:38:39Z

This patch, which simply makes it legal on all architectures but do nothing is it's older than sm_70.

I do not think this is the right thing to do. "do nothing" is not what one would expect from a nanosleep.

Let's unpack your problem a bit.

__nvvm_reflect() is probably closest to what you would need. However, IIUIC, if you use it to provide nanosleep-based variant and an alternative for the older GPUs, the nanosleep variant code will still hang off the dead branch of if(__nvvm_reflect()) and if it's not eliminated by DCE (which it would not if optimizations are off), the resulting PTX will be invalid for the older GPUs.

In other words, pushing nanosleep implementation into an intrinsic makes things compile everywhere at the expense of doing a wrong thing on the older GPUs. I do not think it's a good trade-off.

Perhaps a better approach would be to incorporate dead branch elimination onto NVVMReflect pass itself. We do know that it is the explicit intent of __nvvm_reflect(). If NVVMReflect explicitly guarantees that the dead branch will be gone, it should allow you to use approach #1 w/o concerns for whether optimizations are enabled and you should be able to provide whatever alternative implementation you need (even if it's a null one), without affecting correctness of LLVM itself.

jhuber6 · 2024-02-07T20:50:29Z

This patch, which simply makes it legal on all architectures but do nothing is it's older than sm_70.

I do not think this is the right thing to do. "do nothing" is not what one would expect from a nanosleep.

Thanks, I made this a draft because I figured it wasn't the correct thing to do but wanted to pose the question.

Let's unpack your problem a bit.

__nvvm_reflect() is probably closest to what you would need. However, IIUIC, if you use it to provide nanosleep-based variant and an alternative for the older GPUs, the nanosleep variant code will still hang off the dead branch of if(__nvvm_reflect()) and if it's not eliminated by DCE (which it would not if optimizations are off), the resulting PTX will be invalid for the older GPUs.

In other words, pushing nanosleep implementation into an intrinsic makes things compile everywhere at the expense of doing a wrong thing on the older GPUs. I do not think it's a good trade-off.

Perhaps a better approach would be to incorporate dead branch elimination onto NVVMReflect pass itself. We do know that it is the explicit intent of __nvvm_reflect(). If NVVMReflect explicitly guarantees that the dead branch will be gone, it should allow you to use approach #1 w/o concerns for whether optimizations are enabled and you should be able to provide whatever alternative implementation you need (even if it's a null one), without affecting correctness of LLVM itself.

I think that would be a good solution if possible. Would this simply mean scheduling a global DCE pass right after the NVVM reflect pass? Since that seems to be run at O0 that seems like the easiest solution, though it somewhat breaks O0 semantics.

Or, maybe we just have a really shallow implementation in the NVVM reflect pass that collapses the branch?

jhuber6 · 2024-02-07T21:08:51Z

Okay, __nvvm_reflect doesn't work fully here because the nanosleep builtin I added requires sm_70 at the clang level. Either means I'd need to go back to inline assembly or remove that requirement at least from clang so it's a backend failure.

Artem-B · 2024-02-07T22:12:49Z

Okay, __nvvm_reflect doesn't work fully here because the nanosleep builtin I added requires sm_70 at the clang level. Either means I'd need to go back to inline assembly or remove that requirement at least from clang so it's a backend failure.

The question is -- who's going to provide a fallback implementation for the nanosleepbuiltin for the older GPUs. I do not think it's LLVM's job, so constraining the builtin is appropriate. However, nothing stops you from providing your own implementation in libc using inline asm. Something along these lines:

__device__ void my_nanosleep(int N) {
  if (__nvvm_reflect(SM_70)) {
    asm volatile("nanosleep")
  } else {
     while(N--) {
        volatile asm("something unoptimizable")
     }
  }
}

jhuber6 requested a review from Artem-B February 7, 2024 20:12

jhuber6 closed this Feb 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NVPTX][Draft] Make `__nvvm_nanosleep` a no-op if unsupported #81033

[NVPTX][Draft] Make `__nvvm_nanosleep` a no-op if unsupported #81033

jhuber6 commented Feb 7, 2024

Artem-B commented Feb 7, 2024

jhuber6 commented Feb 7, 2024

jhuber6 commented Feb 7, 2024

Artem-B commented Feb 7, 2024

[NVPTX][Draft] Make __nvvm_nanosleep a no-op if unsupported #81033

[NVPTX][Draft] Make __nvvm_nanosleep a no-op if unsupported #81033

Conversation

jhuber6 commented Feb 7, 2024

Artem-B commented Feb 7, 2024

jhuber6 commented Feb 7, 2024

jhuber6 commented Feb 7, 2024

Artem-B commented Feb 7, 2024

[NVPTX][Draft] Make `__nvvm_nanosleep` a no-op if unsupported #81033

[NVPTX][Draft] Make `__nvvm_nanosleep` a no-op if unsupported #81033