[NVPTX] Allow the ctor/dtor lowering pass to emit kernels #71549

jhuber6 · 2023-11-07T15:39:57Z

Summary:
This pass emits the new "nvptx$device$init" and "nvptx$device$fini"
kernels that are callable by the device. This is intended to mimic the
method of lowering for AMDGPU, where we emit amdgcn.device.init and
amdgcn.device.fini respectively. These kernels simply iterate over the
ranges delimited by the symbols __init_array_start/stop and
__fini_array_start/stop. Normally, the linker provides these symbols
automatically. In the AMDGPU case we only need to call the kernel, which
then calls the ctors / dtors. However, for NVPTX we require that the
user initialize these variables to the associated globals that we
already emit as a part of this pass.
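
As a rough illustration of the iteration described above, the following is a
minimal host-side C++ sketch, not the IR the pass actually emits. The names
`InitFn`, `run_init_array`, `ctor_a`, `ctor_b`, and `run_demo` are hypothetical
and exist only for this example; the real kernels read the
__init_array_start/stop symbols on the device.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one entry in the init array: a pointer to a
// function that takes no arguments and returns nothing.
using InitFn = void (*)();

// Records the order in which the sample constructors ran.
static std::vector<int> order;

static void ctor_a() { order.push_back(1); }
static void ctor_b() { order.push_back(2); }

// Walks the half-open range [start, stop) and calls every entry. This is
// roughly the job the summary assigns to the emitted init kernel.
static void run_init_array(InitFn *start, InitFn *stop) {
  assert(start <= stop && "start must not come after stop");
  for (InitFn *fn = start; fn != stop; ++fn)
    (*fn)();
}

// Builds a small table by hand (an assumption for this sketch; on NVPTX the
// runtime would have to populate it) and runs it.
static std::vector<int> run_demo() {
  order.clear();
  InitFn table[] = {ctor_a, ctor_b};
  run_init_array(table, table + 2);
  return order;
}
```

Calling `run_demo()` runs `ctor_a` and then `ctor_b`. The fini kernel would
presumably follow the same pattern over the __fini_array_start/stop range.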

The motivation behind this change is to move away from OpenMP's handling
of ctors / dtors. I would much prefer that the backend / runtime handle
this. That allows us to handle ctors / dtors in a language-agnostic way.

This approach requires that the runtime initializes the associated
globals. They are marked weak so we can emit this per-TU. The kernel
itself is weak_odr as it is copied exactly.
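
To make the runtime's side of this concrete, here is a hedged C++ sketch of how
a runtime might assemble the array before launching the init kernel. Everything
in it is an assumption for illustration: `CtorEntry`, `InitArray`,
`build_init_array`, and the sample functions are hypothetical, and sorting by
ascending priority simply mirrors the ordering rule of `llvm.global_ctors`
rather than anything this patch specifies.

```cpp
#include <algorithm>
#include <vector>

using CtorFn = void (*)();

// Hypothetical record for one constructor global found in a device image.
struct CtorEntry {
  int priority; // lower values run first, as in llvm.global_ctors
  CtorFn fn;
};

// Hypothetical holder for the table whose bounds the runtime would copy
// into the __init_array_start/stop variables.
struct InitArray {
  std::vector<CtorFn> storage;
  CtorFn *start() { return storage.data(); }
  CtorFn *stop() { return storage.data() + storage.size(); }
};

// Sample constructors, used only so the sketch has something to order.
static int early_calls = 0;
static int late_calls = 0;
void early_ctor() { ++early_calls; }
void late_ctor() { ++late_calls; }

// Orders the collected entries by priority and flattens them into a
// contiguous table of function pointers.
InitArray build_init_array(std::vector<CtorEntry> entries) {
  std::stable_sort(entries.begin(), entries.end(),
                   [](const CtorEntry &a, const CtorEntry &b) {
                     return a.priority < b.priority;
                   });
  InitArray result;
  for (const CtorEntry &e : entries)
    result.storage.push_back(e.fn);
  return result;
}
```

For example, `build_init_array({{65535, late_ctor}, {101, early_ctor}})` yields
a table with `early_ctor` first. A stable sort is used so that entries with
equal priority keep the order in which they were collected.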

One downside is that any module containing these kernels elicits the
"stack size cannot be statically determined" warning from nvlink every
time, which is annoying but inconsequential for functionality. It would
be nice if there were a way to silence this warning, however.

jhuber6 · 2023-11-09T17:38:44Z

This is tested and functional in #71739 now.

llvm/lib/Target/NVPTX/NVPTXCtorDtorLowering.cpp

Summary: This patch reworks how we handle global constructors in OpenMP. Previously, we emitted individual kernels that were all registered and called individually. In order to provide more generic support, this patch moves all handling of this to the target backend and the runtime plugin. This has the benefit of supporting the GNU extensions for constructors and destructors, removing a class of failures related to shared library destruction order, and allowing targets other than OpenMP to use the same support without needing to change the frontend.

This is primarily done by calling kernels that the backend emits to iterate a list of ctor / dtor functions. For x64, this is automatic and we get it for free with the standard `dlopen` handling. For AMDGPU, we emit `amdgcn.device.init` and `amdgcn.device.fini` functions which handle everything automatically and simply need to be called. For NVPTX, the patch llvm#71549 provides the kernels to call, but the runtime needs to set up the array manually by pulling out all the known constructor / destructor functions.

One concession that this patch requires is that for GPU targets in OpenMP offloading we will use `llvm.global_dtors` instead of `atexit`. This is because `atexit` is a separate runtime function that does not mesh well with the handling we're trying to do here. This should be equivalent in all cases except those where we would need to destruct manually, such as:

```
struct S { ~S() { foo(); } };
void foo() { static S s; }
```

However, this is broken in many other ways on the GPU, so it is not regressing any support, simply increasing the scope of what we can handle.

This changes the handling of ctors / dtors. This patch now outputs an informational message regarding the deprecation if the old format is used. This will be completely removed in a later release.

Depends on: llvm#71549

Add LangOption for atexit usage

Summary: This method isn't 1-to-1 but it's more functional than not having it.

jhuber6 requested review from arsenm, Artem-B, jdoerfert, jlebar and JonChesterfield November 7, 2023 15:39

jhuber6 mentioned this pull request Nov 8, 2023

[OpenMP] Rework handling of global ctor/dtors in OpenMP #71739

Merged

jhuber6 force-pushed the NVPTXCtorDtorKernel branch from a0d2f19 to 1a4b214 Compare November 9, 2023 16:49

Artem-B reviewed Nov 9, 2023

jhuber6 force-pushed the NVPTXCtorDtorKernel branch from 1a4b214 to 5f3af1e Compare November 9, 2023 20:10

Artem-B reviewed Nov 9, 2023

jhuber6 force-pushed the NVPTXCtorDtorKernel branch from 5f3af1e to 3176ffb Compare November 9, 2023 20:42

Artem-B approved these changes Nov 9, 2023

jhuber6 force-pushed the NVPTXCtorDtorKernel branch from 3176ffb to f3a0d1d Compare November 10, 2023 15:33

jhuber6 merged commit af8ebfd into llvm:main Nov 10, 2023
