Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NVPTX][Draft] Make __nvvm_nanosleep a no-op if unsupported #81033

Closed
wants to merge 1 commit into from

Conversation

jhuber6
Copy link
Contributor

@jhuber6 jhuber6 commented Feb 7, 2024

Summary;
The LLVM C library currently uses nanosleep in the RPC interface and
for the C library nanosleep function. We build the LLVM C library for
every single NVPTX architecture individually currently, which is not
ideal. The goal is to make the LLVM-IR target independent, unfortunately
the one snag is the nanosleep function which will crash if used on a
GPU older than sm_70. There are three possible solutions to this.

  1. Use __nvvm_reflect(__CUDA_ARCH__) like the libdevice functions.
    This will work as long as optimizations are on, not ideal.
  2. Get rid of the use of nanosleep in libc. This isn't ideal as
    sleeping during the busy-wait loops is helpful for thread scheduling
    and it prevents us from providing nanosleep as a C library
    function.
  3. This patch, which simply makes it legal on all architectures but do
    nothing is it's older than sm_70.

This is a draft to question if this is an acceptable hack, as an
intrinsic silently doing nothing is not always a good idea. Potentially
a new intrinsic could be added instead, but there is also a desire to
have intrinsics map 1-to-1 with hardware.

Summary;
The LLVM C library currently uses `nanosleep` in the RPC interface and
for the C library `nanosleep` function. We build the LLVM C library for
every single NVPTX architecture individually currently, which is not
ideal. The goal is to make the LLVM-IR target independent, unfortunately
the one snag is the `nanosleep` function which will crash if used on a
GPU older than sm_70. There are three possible solutions to this.

1. Use `__nvvm_reflect(__CUDA_ARCH__)` like the libdevice functions.
   This will work as long as optimizations are on, not ideal.
2. Get rid of the use of nanosleep in `libc`. This isn't ideal as
   sleeping during the busy-wait loops is helpful for thread scheduling
   and it prevents us from providing `nanosleep` as a C library
   function.
3. This patch, which simply makes it legal on all architectures but do
   nothing is it's older than sm_70.

This is a draft to question if this is an acceptable hack, as an
intrinsic silently doing nothing is not always a good idea. Potentially
a new intrinsic could be added instead, but there is also a desire to
have intrinsics map 1-to-1 with hardware.
@jhuber6 jhuber6 requested a review from Artem-B February 7, 2024 20:12
@Artem-B
Copy link
Member

Artem-B commented Feb 7, 2024

This patch, which simply makes it legal on all architectures but do nothing is it's older than sm_70.

I do not think this is the right thing to do. "do nothing" is not what one would expect from a nanosleep.

Let's unpack your problem a bit.

__nvvm_reflect() is probably closest to what you would need. However, IIUIC, if you use it to provide nanosleep-based variant and an alternative for the older GPUs, the nanosleep variant code will still hang off the dead branch of if(__nvvm_reflect()) and if it's not eliminated by DCE (which it would not if optimizations are off), the resulting PTX will be invalid for the older GPUs.

In other words, pushing nanosleep implementation into an intrinsic makes things compile everywhere at the expense of doing a wrong thing on the older GPUs. I do not think it's a good trade-off.

Perhaps a better approach would be to incorporate dead branch elimination onto NVVMReflect pass itself. We do know that it is the explicit intent of __nvvm_reflect(). If NVVMReflect explicitly guarantees that the dead branch will be gone, it should allow you to use approach #1 w/o concerns for whether optimizations are enabled and you should be able to provide whatever alternative implementation you need (even if it's a null one), without affecting correctness of LLVM itself.

@jhuber6
Copy link
Contributor Author

jhuber6 commented Feb 7, 2024

This patch, which simply makes it legal on all architectures but do nothing is it's older than sm_70.

I do not think this is the right thing to do. "do nothing" is not what one would expect from a nanosleep.

Thanks, I made this a draft because I figured it wasn't the correct thing to do but wanted to pose the question.

Let's unpack your problem a bit.

__nvvm_reflect() is probably closest to what you would need. However, IIUIC, if you use it to provide nanosleep-based variant and an alternative for the older GPUs, the nanosleep variant code will still hang off the dead branch of if(__nvvm_reflect()) and if it's not eliminated by DCE (which it would not if optimizations are off), the resulting PTX will be invalid for the older GPUs.

In other words, pushing nanosleep implementation into an intrinsic makes things compile everywhere at the expense of doing a wrong thing on the older GPUs. I do not think it's a good trade-off.

Perhaps a better approach would be to incorporate dead branch elimination onto NVVMReflect pass itself. We do know that it is the explicit intent of __nvvm_reflect(). If NVVMReflect explicitly guarantees that the dead branch will be gone, it should allow you to use approach #1 w/o concerns for whether optimizations are enabled and you should be able to provide whatever alternative implementation you need (even if it's a null one), without affecting correctness of LLVM itself.

I think that would be a good solution if possible. Would this simply mean scheduling a global DCE pass right after the NVVM reflect pass? Since that seems to be run at O0 that seems like the easiest solution, though it somewhat breaks O0 semantics.

Or, maybe we just have a really shallow implementation in the NVVM reflect pass that collapses the branch?

@jhuber6
Copy link
Contributor Author

jhuber6 commented Feb 7, 2024

Okay, __nvvm_reflect doesn't work fully here because the nanosleep builtin I added requires sm_70 at the clang level. Either means I'd need to go back to inline assembly or remove that requirement at least from clang so it's a backend failure.

@Artem-B
Copy link
Member

Artem-B commented Feb 7, 2024

Okay, __nvvm_reflect doesn't work fully here because the nanosleep builtin I added requires sm_70 at the clang level. Either means I'd need to go back to inline assembly or remove that requirement at least from clang so it's a backend failure.

The question is -- who's going to provide a fallback implementation for the nanosleepbuiltin for the older GPUs. I do not think it's LLVM's job, so constraining the builtin is appropriate. However, nothing stops you from providing your own implementation in libc using inline asm. Something along these lines:

__device__ void my_nanosleep(int N) {
  if (__nvvm_reflect(SM_70)) {
    asm volatile("nanosleep")
  } else {
     while(N--) {
        volatile asm("something unoptimizable")
     }
  }
}

@jhuber6 jhuber6 closed this Feb 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants