-
Notifications
You must be signed in to change notification settings - Fork 12k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NVPTX] Allow the ctor/dtor lowering pass to emit kernels #71549
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
jhuber6
requested review from
arsenm,
Artem-B,
jdoerfert,
jlebar and
JonChesterfield
November 7, 2023 15:39
jhuber6
force-pushed
the
NVPTXCtorDtorKernel
branch
from
November 9, 2023 16:49
a0d2f19
to
1a4b214
Compare
This is tested and functional in #71739 now. |
Artem-B
reviewed
Nov 9, 2023
jhuber6
force-pushed
the
NVPTXCtorDtorKernel
branch
from
November 9, 2023 20:10
1a4b214
to
5f3af1e
Compare
Artem-B
reviewed
Nov 9, 2023
jhuber6
force-pushed
the
NVPTXCtorDtorKernel
branch
from
November 9, 2023 20:42
5f3af1e
to
3176ffb
Compare
Artem-B
approved these changes
Nov 9, 2023
Summary: This pass emits the new "nvptx$device$init" and "nvptx$device$fini" kernels that are callable by the device. This intends to mimic the method of lowering for AMDGPU where we emit `amdgcn.device.init` and `amdgcn.device.fini` respectively. These kernels simply iterate a symbol called `__init_array_start/stop` and `__fini_array_start/stop`. Normally, the linker provides these symbols automatically. In the AMDGPU case we only need call the kernel and we call the ctors / dtors. However, for NVPTX we require the user initializes these variables to the associated globals that we already emit as a part of this pass. The motivation behind this change is to move away from OpenMP's handling of ctors / dtors. I would much prefer that the backend / runtime handles this. That allows us to handle ctors / dtors in a language agnostic way, This approach requires that the runtime initializes the associated globals. They are marked `weak` so we can emit this per-TU. The kernel itself is `weak_odr` as it is copied exactly. One downside is that any module containing these kernels elicitis the "stack size cannot be statically determined warning" every time from `nvlink` which is annoying but inconsequential for functionality. It would be nice if there were a way to silence this warning however.
jhuber6
force-pushed
the
NVPTXCtorDtorKernel
branch
from
November 10, 2023 15:33
3176ffb
to
f3a0d1d
Compare
jhuber6
added a commit
to jhuber6/llvm-project
that referenced
this pull request
Nov 10, 2023
Summary: This patch reworks how we handle global constructors in OpenMP. Previously, we emitted individual kernels that were all registered and called individually. In order to provide more generic support, this patch moves all handling of this to the target backend and the runtime plugin. This has the benefit of supporting the GNU extensions for constructors an destructors, removing a class of failures related to shared library destruction order, and allows targets other than OpenMP to use the same support without needing to change the frontend. This is primarily done by calling kernels that the backend emits to iterate a list of ctor / dtor functions. For x64, this is automatic and we get it for free with the standard `dlopen` handling. For AMDGPU, we emit `amdgcn.device.init` and `amdgcn.device.fini` functions which handle everything atuomatically and simply need to be called. For NVPTX, a patch llvm#71549 provides the kernels to call, but the runtime needs to set up the array manually by pulling out all the known constructor / destructor functions. One concession that this patch requires is the change that for GPU targets in OpenMP offloading we will use `llvm.global_dtors` instead of using `atexit`. This is because `atexit` is a separate runtime function that does not mesh well with the handling we're trying to do here. This should be equivalent in all cases except for cases where we would need to destruct manually such as: ``` struct S { ~S() { foo(); } }; void foo() { static S s; } ``` However this is broken in many other ways on the GPU, so it is not regressing any support, simply increasing the scope of what we can handle. This changes the handling of ctors / dtors. This patch now outputs a information message regarding the deprecation if the old format is used. This will be completely removed in a later release. Depends on: llvm#71549 Add LangOption for atexit usage Summary: This method isn't 1-to-1 but it's more functional than not having it.
jhuber6
added a commit
that referenced
this pull request
Nov 10, 2023
Summary: This patch reworks how we handle global constructors in OpenMP. Previously, we emitted individual kernels that were all registered and called individually. In order to provide more generic support, this patch moves all handling of this to the target backend and the runtime plugin. This has the benefit of supporting the GNU extensions for constructors an destructors, removing a class of failures related to shared library destruction order, and allows targets other than OpenMP to use the same support without needing to change the frontend. This is primarily done by calling kernels that the backend emits to iterate a list of ctor / dtor functions. For x64, this is automatic and we get it for free with the standard `dlopen` handling. For AMDGPU, we emit `amdgcn.device.init` and `amdgcn.device.fini` functions which handle everything atuomatically and simply need to be called. For NVPTX, a patch #71549 provides the kernels to call, but the runtime needs to set up the array manually by pulling out all the known constructor / destructor functions. One concession that this patch requires is the change that for GPU targets in OpenMP offloading we will use `llvm.global_dtors` instead of using `atexit`. This is because `atexit` is a separate runtime function that does not mesh well with the handling we're trying to do here. This should be equivalent in all cases except for cases where we would need to destruct manually such as: ``` struct S { ~S() { foo(); } }; void foo() { static S s; } ``` However this is broken in many other ways on the GPU, so it is not regressing any support, simply increasing the scope of what we can handle. This changes the handling of ctors / dtors. This patch now outputs a information message regarding the deprecation if the old format is used. This will be completely removed in a later release. Depends on: #71549
searlmc1
pushed a commit
to ROCm/llvm-project
that referenced
this pull request
Nov 12, 2023
Summary: This patch reworks how we handle global constructors in OpenMP. Previously, we emitted individual kernels that were all registered and called individually. In order to provide more generic support, this patch moves all handling of this to the target backend and the runtime plugin. This has the benefit of supporting the GNU extensions for constructors an destructors, removing a class of failures related to shared library destruction order, and allows targets other than OpenMP to use the same support without needing to change the frontend. This is primarily done by calling kernels that the backend emits to iterate a list of ctor / dtor functions. For x64, this is automatic and we get it for free with the standard `dlopen` handling. For AMDGPU, we emit `amdgcn.device.init` and `amdgcn.device.fini` functions which handle everything atuomatically and simply need to be called. For NVPTX, a patch llvm#71549 provides the kernels to call, but the runtime needs to set up the array manually by pulling out all the known constructor / destructor functions. One concession that this patch requires is the change that for GPU targets in OpenMP offloading we will use `llvm.global_dtors` instead of using `atexit`. This is because `atexit` is a separate runtime function that does not mesh well with the handling we're trying to do here. This should be equivalent in all cases except for cases where we would need to destruct manually such as: ``` struct S { ~S() { foo(); } }; void foo() { static S s; } ``` However this is broken in many other ways on the GPU, so it is not regressing any support, simply increasing the scope of what we can handle. This changes the handling of ctors / dtors. This patch now outputs a information message regarding the deprecation if the old format is used. This will be completely removed in a later release. Depends on: llvm#71549 Change-Id: I99d449b4ca8c590a99fbd84774c673a4d49300a4
zahiraam
pushed a commit
to zahiraam/llvm-project
that referenced
this pull request
Nov 20, 2023
Summary: This pass emits the new "nvptx$device$init" and "nvptx$device$fini" kernels that are callable by the device. This intends to mimic the method of lowering for AMDGPU where we emit `amdgcn.device.init` and `amdgcn.device.fini` respectively. These kernels simply iterate a symbol called `__init_array_start/stop` and `__fini_array_start/stop`. Normally, the linker provides these symbols automatically. In the AMDGPU case we only need call the kernel and we call the ctors / dtors. However, for NVPTX we require the user initializes these variables to the associated globals that we already emit as a part of this pass. The motivation behind this change is to move away from OpenMP's handling of ctors / dtors. I would much prefer that the backend / runtime handles this. That allows us to handle ctors / dtors in a language agnostic way, This approach requires that the runtime initializes the associated globals. They are marked `weak` so we can emit this per-TU. The kernel itself is `weak_odr` as it is copied exactly. One downside is that any module containing these kernels elicitis the "stack size cannot be statically determined warning" every time from `nvlink` which is annoying but inconsequential for functionality. It would be nice if there were a way to silence this warning however.
zahiraam
pushed a commit
to zahiraam/llvm-project
that referenced
this pull request
Nov 20, 2023
Summary: This patch reworks how we handle global constructors in OpenMP. Previously, we emitted individual kernels that were all registered and called individually. In order to provide more generic support, this patch moves all handling of this to the target backend and the runtime plugin. This has the benefit of supporting the GNU extensions for constructors an destructors, removing a class of failures related to shared library destruction order, and allows targets other than OpenMP to use the same support without needing to change the frontend. This is primarily done by calling kernels that the backend emits to iterate a list of ctor / dtor functions. For x64, this is automatic and we get it for free with the standard `dlopen` handling. For AMDGPU, we emit `amdgcn.device.init` and `amdgcn.device.fini` functions which handle everything atuomatically and simply need to be called. For NVPTX, a patch llvm#71549 provides the kernels to call, but the runtime needs to set up the array manually by pulling out all the known constructor / destructor functions. One concession that this patch requires is the change that for GPU targets in OpenMP offloading we will use `llvm.global_dtors` instead of using `atexit`. This is because `atexit` is a separate runtime function that does not mesh well with the handling we're trying to do here. This should be equivalent in all cases except for cases where we would need to destruct manually such as: ``` struct S { ~S() { foo(); } }; void foo() { static S s; } ``` However this is broken in many other ways on the GPU, so it is not regressing any support, simply increasing the scope of what we can handle. This changes the handling of ctors / dtors. This patch now outputs a information message regarding the deprecation if the old format is used. This will be completely removed in a later release. Depends on: llvm#71549
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary:
This pass emits the new "nvptx$device$init" and "nvptx$device$fini"
kernels that are callable by the device. This intends to mimic the
method of lowering for AMDGPU where we emit
amdgcn.device.init
andamdgcn.device.fini
respectively. These kernels simply iterate a symbolcalled
__init_array_start/stop
and__fini_array_start/stop
.Normally, the linker provides these symbols automatically. In the AMDGPU
case we only need call the kernel and we call the ctors / dtors.
However, for NVPTX we require the user initializes these variables to
the associated globals that we already emit as a part of this pass.
The motivation behind this change is to move away from OpenMP's handling
of ctors / dtors. I would much prefer that the backend / runtime handles
this. That allows us to handle ctors / dtors in a language agnostic way,
This approach requires that the runtime initializes the associated
globals. They are marked
weak
so we can emit this per-TU. The kernelitself is
weak_odr
as it is copied exactly.One downside is that any module containing these kernels elicitis the
"stack size cannot be statically determined warning" every time from
nvlink
which is annoying but inconsequential for functionality. Itwould be nice if there were a way to silence this warning however.