[Flang][OpenMP][Offload] HLFIR AssignOp does not lower to a friendly form for AMDGPU which is used for target offloading for OpenMP #74603
Comments
@llvm/issue-subscribers-flang-ir Author: None (agozillon)

This issue was found during the implementation of the following PR (and is dependent on it): #71766

The following example, which attempts to map and assign a value to an allocatable variable on device, compiles and works with the deprecated FIR flow but fails with the new HLFIR flow:

Yielding the following ICE error:

The command used to compile this and hit the error (it should just require a Clang built with AMDGPU support, plus the dependent PR if it's not committed already): flang-new --offload-arch=gfx90a -fopenmp test.f90 -o test.out

From what I can gather from digging a little into the issue, this comes from AMDGPU not supporting DYNAMIC_STACKALLOCA instructions. I think AMDGPU only performs static allocation, but someone with more understanding of that segment of the compiler will know far better than I do.

However, the generation of this AMDGPU-unfriendly code appears to stem from the HLFIR AssignOp, which lowers to a Fortran runtime call; that call likely brings in the instruction requiring a dynamic stack allocation (I've unfortunately not found the exact problematic line, but there are a number of areas that might pose the problem).

The only solution I can currently think of is to opt out of HLFIR AssignOp generation for AMD GPU devices or for OpenMP offload (or both) and utilise the old FIR flow, which does not depend on the runtime call. I am not sure how palatable that is for everyone, though, as I imagine the intent was to discard the old FIR flow in the near future. I am more than open to other suggestions; this is just the option I had in mind.

It also raises the possibility that errors like this will be encountered in other cases where HLFIR operations lower to Fortran runtime calls, but that may be hyperbole, as this is the only case I've encountered so far.
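The example code block itself was not captured in this page. As a hedged sketch, a reproducer matching the description (mapping an allocatable variable to the device and assigning to it inside a target region) might look like the following; the program name and variable are placeholders, not the original test.f90:

```fortran
! Hypothetical reproducer sketched from the issue description; the original
! test.f90 is not shown in this thread.
program assign_to_allocatable
  implicit none
  integer, allocatable :: i

  allocate(i)
!$omp target map(tofrom: i)
  ! On the HLFIR path this assignment lowers to a Fortran runtime call
  i = 10
!$omp end target
  print *, i
  deallocate(i)
end program assign_to_allocatable
```

Compiled with the command above (flang-new --offload-arch=gfx90a -fopenmp), an example along these lines would exercise the AssignOp-to-runtime-call lowering inside the target region.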
I don't think using the old lowering is a good solution in the long term. Did you compile the runtime for offload?
Perhaps I have not in this case; are there specific build commands required to do this? I'm admittedly a little unfamiliar with how the Fortran runtime works in this case, so I'd appreciate any pointers you can give.
I have not done it myself, but there is this post: https://discourse.llvm.org/t/rfc-building-flang-runtime-for-offload-devices/70787
Thank you @clementval, I'll have a look into that thread. I do recollect a little about it; I can't remember where we landed on it, but it does seem Slava made some good headway with an initial patch for CUDA/NVPTX.
I've had a little time to dig into this a little more. It seems that at least the above error has nothing to do with the definition itself, as it's not linked into the module at the time the error occurs; we only have the declaration. It seems more likely that the allocations (box and scalar) generated for the reallocation case are causing the issues. I think (not completely sure; I need to find it or talk to someone) there's an optimisation pass in the backend somewhere that automagically tidies them up where possible, but it doesn't handle them in this case (it at least removes the allocas we create for kernel arguments on lowering, replacing them with other instructions, but that particular case requires a load access).

In any case, if I raise the allocas to the kernel entry point (when using the generalised LLVM dialect lowering to LLVM IR, they currently get injected in the middle of the kernel), I think making them statically allocatable, the previous error is bypassed. However, it then hits another wall in the instruction selector, which tries to directly use a FrameIndex (the allocation); that also seems to be a no-go for at least AMDGPU.

So I am still not entirely sure of a fix, but I did notice another possible workaround: deactivate the realloc portion of the hlfir::AssignOp operation for AMDGPU/OpenMP target offload, perhaps by adding another logic check at https://github.com/llvm/llvm-project/blob/main/flang/lib/Lower/Bridge.cpp#L3491 to toggle isWholeAllocatableAssignment to false (or something along those lines). This is similar to my original suggestion, but less drastic, as it doesn't keep the old FIR flow around. I suppose one other, perhaps more drastic, option would be to create a pass that raises the allocas out of the kernel and turns them into mapped arguments, but that is possibly quite overkill for something like AssignOp (e.g. assigning a single value to an allocatable).
It seems very error-prone to have target-specific lowering, especially for assignment, where Fortran has lots of rules for allocatables and so on. It feels like the adaptation should be done in a target-specific pass, like the target-rewrite pass, to work around the error you are seeing.
Thank you @clementval, I wasn't aware of this pass; I'll have a look into it!
I believe I have found a series of small fixes that will get this test working, I'll open a PR or two for it when I am back from vacation, unless someone else manages to get there first:
So the test will need two dependencies to work: the Fortran runtime built for offload, and a libc built for offload as well, alongside some of the non-library-related modifications to the IR.
…se function pass to finalize, utilised in convertTarget (#78818)

This patch seeks to add a mechanism to raise constant-sized (not ConstantExpr or runtime/dynamic) allocations into the entry block for select functions that have been inserted into a list for processing. This processing occurs during the finalize call, after OutlinedInfo regions have completed. Currently it is only utilised for createOutlinedFunction, which is triggered for TargetOp generation in the OpenMP MLIR dialect lowering to LLVM IR.

This is required for target kernels generated by createOutlinedFunction to avoid subsequent optimization passes making unintentional, malformed optimizations for AMD kernels (unsure if it occurs for other vendors). If the allocas are generated inside the kernel, are not in the entry block, and are subsequently passed to a function, required instructions can be erased or manipulated in a way that causes the kernel to run into an HSA access error.

This fix is related to a series of problems found in: #74603

The problem primarily presents itself for Flang's HLFIR AssignOp, when utilised with a scalar temporary constant on the RHS and a descriptor type on the LHS. It will generate a call to a runtime function, wrap the RHS temporary in a newly allocated descriptor (an LLVM struct), and pass both the LHS and RHS descriptors into the runtime function call. This currently gets embedded into the middle of the target region in the user entry block, which means the allocas are also embedded in the middle, which seems to pose issues when later passes are executed. The issue may present itself in other HLFIR operations, or in unrelated operations that generate allocas as a by-product, but for the moment this one test case is the only scenario in which I've found the problem.

Perhaps this is not the appropriate fix; I am very open to other suggestions. I've tried a few others (at varying levels of the flang/MLIR compiler flow), but this one is the smallest and least intrusive change set. The other two that come to mind (though I've not fully looked into them; the former I tried a little with blocks, but it had a few issues I'd need to think through):

- Having a proper alloca-only block (or region) generated for TargetOps that we could merge into the entry block generated by convertTarget's createOutlinedFunction.
- Diverging a little from Clang's current target generation and using the CodeExtractor to generate the user code as an outlined function invoked from the kernel we make, with our kernel arguments passed into it, similar to the current parallel generation. I am not sure how well this would intermingle with the existing parallel generation that's layered in, though.

Both of these methods seem like quite a divergence from the current status quo, which I am not entirely sure is merited for the small test this change aims to fix.
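As a rough, hypothetical illustration of the hoisting described above (the kernel name, struct contents, and sizes are invented, not taken from the actual compiler output), the finalize-time pass moves a constant-sized alloca from the middle of the outlined kernel into its entry block:

```llvm
; Before: the AssignOp lowering leaves a constant-sized alloca in the
; middle of the outlined kernel body, after control flow has branched.
define amdgpu_kernel void @__omp_offloading_kernel() {
entry:
  br label %user.code
user.code:
  %desc = alloca [48 x i8], align 8   ; descriptor temporary, mid-function
  ret void
}

; After: the pass hoists the constant-sized alloca into the entry block,
; making it a plain static, entry-block allocation.
define amdgpu_kernel void @__omp_offloading_kernel.fixed() {
entry:
  %desc = alloca [48 x i8], align 8
  br label %user.code
user.code:
  ret void
}
```

Only allocas with constant sizes are candidates; ConstantExpr-sized or runtime-sized allocations are left where they are, since hoisting them would not make them static.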
This particular case should now be resolved, as the PRs that help address it have landed.