New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added initial source code support for AMDGPU backend #2734
Conversation
src/CodeGen_AMDGPU_Dev.cpp
Outdated
#include <fstream> | ||
|
||
/* | ||
// This is declared in NVAMDGPU.h, which is not exported. Ugly, but seems better than |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this comment is not accurate :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is not accurate. I left it there so that you can provide more information about the purpose of createNVVMReflectPass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's some weird extra pass required by the nvptx backend in llvm. I would assume it's not relevant here.
src/CodeGen_AMDGPU_Dev.cpp
Outdated
|
||
void CodeGen_AMDGPU_Dev::visit(const Store *op) { | ||
|
||
// Do aligned 4-wide 32-bit stores as a single i128 store. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This Store visitor was to work around a particular piece of weirdness in the nvptx backend. Not necessary here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. So, how should I implement the Store visitor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the default store visitor will work fine. Try just deleting this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want me to get rid of the whole visit(const Store *op)
implementation? I see the code as regular i32 stores are transformed to i128 stores.
How to test the codegen for amdgpu backend? |
To test this, it needs some kind of runtime API support. How do you load and run these kernels? Looking at the code, it looks like it has some familiar things from OpenCL... I am wondering, what is the advantage of using the AMD GPU backend vs. generating OpenCL kernels? |
The plan is to generate AMDGPU asm rather than HIP or OpenCL kernels (Just like generating ptx for CUDA). The asm can be compile to object file and load, run hip module api (similar to cuda driver module api). |
So to test this, it sounds like this needs a HIP runtime module (similar to src/runtime/opencl.cpp or src/runtime/cuda.cpp and associated support code). Is it by chance possible to load AMDGPU assembly kernels in OpenCL? If so, you might be able to just re-use the OpenCL runtime? |
I can add new HIP runtime. Unfortunately, no, you can't load asm kernels using OpenCL RT. I am not looking at running anything now. As a first step, I want to make sure the LLVM IR generated is valid (different from nvptx64). |
To see what you're generating, just call |
@@ -125,6 +129,8 @@ Expr make_device_interface_call(DeviceAPI device_api) { | |||
case DeviceAPI::Hexagon: | |||
interface_name = "halide_hexagon_device_interface"; | |||
break; | |||
case DeviceAPI::AMDGPU: | |||
interface_name = "halide_amdgpu_device_interface"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing a break?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it.
1. When C codegen happens, Halide can now add amdgpu runtime code to it 2. Added file HalideRuntimeAMDGPU.h 3. For CodeGen_GPU_Host, it now create AMDGPU codegen object 4. Added support for testing AMDGPU target 5. Made appropriate changes to Makefile and CMakeLists.txt
1. Added mini_amdgpu.h containing required data structures 2. Added hip_functions.h containing required apis 3. Added amdgpu.cpp which is incomplete. Right now added it to test build 4. Changed Makefile and CMakeLists.txt to build amdgpu.cpp
1. Changed triple 2. Added few llvm ir files for amdgpu codegen to consume 3. Changed Makefile accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to change Makefile. Hence the build is failing. Fixed it in the next commit.
@abadams can you review the last two commits? And can you re-spin the build? I am not seeing any build errors on my local machine. |
1. Added few missing pieces to runtime linker 2. Disabled amdgcn bitcode linking 3. Added a new runtime function to amdgpu.cpp 4. Registered amdgpu runtime apis
1. This fixes object code generation bug
Turns out git didn't catch these files before because bc file name is in gitignore
Hi, can this be merged? |
Does that mean it works? How can we enable testing of this backend on the buildbots? |
Um, the code generation is working fine. I am working on runtime, there are few changes I had to make which takes a lot of testing (time actually..). Meanwhile, if this pr gets merged, I don't need to worry about merge conflicts later. |
I'd rather not merge semi-functional things into master. This part of the compiler is very stable - I doubt you'll have any problems with merge conflicts. |
1. Fixed AMDGPU linker relocs 2. Added new argument to relocate to take got_offset
src/AMDGPUOffload.cpp
Outdated
if (type == R_AMDGPU_ABS64 && sym->is_defined()) { | ||
return Relocation(R_AMDGPU_RELATIVE64, fixup_offset, sym_offset + addend, nullptr); | ||
} else if (type == R_AMDGPU_ABS32_LO || type == R_AMDGPU_ABS32_HI || type == R_AMDGPU_ABS32 || type == R_AMDGPU_ABS64) { | ||
return Relocation(type, fixup_offset, addend, sym); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@t-tye what do if both the conditionals are false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You return the default constructed Relocation as before. This tells the caller that the relocation record is a static relocation that has been fully processed and will not be included in the linked resulting code object.
@@ -602,7 +602,7 @@ std::vector<char> write_shared_object_internal(Object &obj, Linker *linker, cons | |||
// We need to define the GOT symbol. | |||
uint64_t max_got_size = obj.symbols_size() * 2 * sizeof(addr_t); | |||
Section got(".got", Section::SHT_PROGBITS); | |||
got.set_alignment(4); | |||
got.set_alignment(sizeof(addr_t)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsharletg good?
src/AMDGPUOffload.h
Outdated
namespace Halide { | ||
namespace Internal { | ||
|
||
/** Pull loops marked with the Hexagon device API to a separate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix comments s/Hexagon/AMD
src/AMDGPUOffload.cpp
Outdated
auto bss = obj->find_section(".bss"); | ||
if (bss != obj->sections_end()) { | ||
bss->set_alignment(128); | ||
// TODO: We should set the type to SHT_NOBITS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments are basically useless to you, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are they? @dsharletg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, at the very least the reference to '8998' is nonsense in this context.
src/AMDGPUOffload.cpp
Outdated
// Make .bss a real section. | ||
auto bss = obj->find_section(".bss"); | ||
if (bss != obj->sections_end()) { | ||
bss->set_alignment(128); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may not need 128 byte alignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsharletg true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, 128 byte alignment is for hexagon.
1. Changed comments for AMDGPUOffload.cpp 2. Removed already helper functions in AMDGPUOffload.cpp 3. Removed irrelevant comments
After building Halide, when I run
I am getting the following error:
Update 1:I took the assembly dump, compile and link to object code and used [1]. https://gist.github.com/adityaatluri/4dfd6898a2204ed43bbfd8579db4bd63#file-module_load-cpp Update 2:When I put print statement in |
You're probably going to need to dump out the linked blob to a file, and load it with various tools (readelf, objdump, etc.) to see if it is valid and try to track down the bugs. |
The code path for linker is not being used at runtime. Where should I look at for that control flow? |
|
||
Stmt inject_amdgpu_rpc(Stmt s, const Target &host_target, | ||
Module &containing_module) { | ||
Target target(Target::Linux, Target::X86, 64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be the AMD GPU target OS and architecture. If you don't need any OS support (no thread pool, IO, etc.) then maybe you can just use Target::NoOS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some debugging, the function resolve_submodules()
is not calling compile_to_buffer()
as there are no submodules. Should I do,
Module Module::resolve_submodules() const {
if(target.has_feature(Target::AMDGPUGFX900)) {
Module lowered_module(name(), target());
lowered_module_append(this-> compile_to_amdgpu_shared_object(*this));
return lowered_module;
}
if(submodules().empty()) {
return *this;
}
.....
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsharletg can I add this code?
Added a comment, plus you'll probably need to add another special case here: Line 258 in 626c2dd
(This is ugly, hopefully your backend will help us find common behavior that we can build proper support for.) |
I think the reason the Hexagon linker doesn't use the GOT for relocations is because we only link shared objects. The GOT address can't be known at link time, so generating a relocation that refers directly to the GOT must be an error (maybe this is where "not PIC code" type linker errors come from?) This appears to also be true of the AMDGPU relocations: https://llvm.org/docs/AMDGPUUsage.html#amdgpu-relocation-records
So it seems that there may be a difference here in that the Hexagon linker produces shared objects and the AMDGPU linker will produce non-PIC executables. |
Here is my understanding of what should be happening in the linker. The GOT address is known at link time. It is the VA of the .got section of the generated code object. A got relocation is the means that a reference in the text section can access an address defined in another dynamically loaded library (or main executable). Since the text section needs to be readonly (so it can be shared between processes) the way this is achieved is to make the access indirect through the got table. The got relocation is a request to the linker create an entry in the got table when linking from a relocatable to produce a shared library or executable (namely something that can be loaded by the operating system). Generally the text section is PIC code so the instruction in the text section wants to know the displacement to get to the got table entry in the same code object. That is what a PC relative got relocation computes, and hence why it needs to know the VA of the got table. The linker creates a section for the got table. Before relocation it allocates all sections so they have known VA addresses. (It seems the got table is conservatively sized based on number of symbols which seems a bit odd.) This change is passing that into the relocation function. If the relocation is a got relocation, the relocation function checks to see if there is already a got table entry for the external symbol. If not it creates one. In either case what it has is the offset from the start of the got table to the entry. It must add this offset to the got table base to get the address of entry. In addition an ABSOLUTE relocation record must be generated for the got entry to ensure it will be patched at load time with the actual address of the external symbol. Since the code object is PIC it does not matter what address it is assumed the code object will be loaded. This load address is the ELF VA. This code chooses to use 0. The RELATIVE relocation record is what allows the code object to be loaded at a different address. It cause the loader to add the difference between the ELF VA and the actual load address. If the code object is loaded at its actual ELF VA then no patching is required. A RELATIVE relocation only makes sense for locations that contain an address within the same code object. Hence checking that the symbol is defined before generating it. If the symbol is not defined then an ABSOLUTE relocation must be used (as is done for the got table entries). There seems to be a bug in the Hexagon linker as it does not do this. A test case to show this would be the definition of a global variable A initialized to the address of another global variable B. If B is a defined variable then a RELATIVE relocation should be used to update A. If B is an external variable then an ABSOLUTE relocation would be used. The time you will get got table entries is when you have external references to symbols. It maybe that you never have them as you do not support global variables or dynamic libraries. |
I think that this cannot happen in PIC code, the only absolute relocations for PIC code are in the GOT, and happen when the symbol is defined (which may be at runtime in the case of shared objects). If the compiler generates a reference to a symbol that cannot be relocated at link time (i.e. either refers to an undefined symbol, or needs an absolute address), then it's too late for the linker to fix that, as it might require the code to have an extra level of indirection (via the GOT).
We definitely do generate GOT table entries, we have global variables with initializers, and we do support dynamic libraries :) In fact, the only thing we currently support is shared libraries. |
Right. The relocations cannot be in the PIC readonly .text sections. But they can be present in the the writable .data sections. For example, global variables that are defined and have an initializer that is the address of another symbol.
Have you used global variables initialized by the address of other symbols (that may be defined or undefined)? For example:
|
Yes, for example: https://github.com/halide/Halide/blob/master/src/runtime/qurt_allocator.cpp#L88 (qurt_allocator.cpp defines the memory allocator linked to Hexagon code, which is linked by our linker.) |
Looking at the Hexagon linker I am unclear how it is creating the relocations needed to make that work if |
At least for functions, they get re-routed through the PLT: Line 714 in f687e89
A relocation that refers to an undefined external symbol gets redirected to a PLT entry, and the PLT entry gets a relocation for the real symbol. A similar thing might be done by the compiler for global data symbols? |
In our case, variables get allocated in the data sections and relocation records are used to initialize their value correctly at load time. We do not currently have a plt, but when we do I would expect it to only contain access to functions that can be accessed externally. The got is used for references to external symbols. |
src/AMDGPUOffload.cpp
Outdated
@@ -213,19 +212,20 @@ class AMDGPULinker : public Linker { | |||
} | |||
|
|||
Symbol add_plt_entry(const Symbol &sym, Section &plt, Section &got, const Symbol &got_sym) override { | |||
internal_error << "Unsupported plt relocation for" << sym << "\n"; | |||
internal_error << "Unsupported plt relocation for amdgpu object" << "\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do get the symbol name printed out change to:
internal_error << "Unsupported plt relocation for " << sym->get_name() << "\n";
@pranavb-ca @dsharletg, can you review? |
@abadams can you restart ci? |
How should I resolve this: https://travis-ci.org/halide/Halide/jobs/370795728#L7531 |
I think there is a recently closed PR that might be of interest to your patch's testing efforts. |
This PR is quite old. Is there an update on its status? |
Not working on it anymore. Closing PR |
Initial source port.