Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added initial source code support for AMDGPU backend #2734

Closed
wants to merge 28 commits into from
Closed

added initial source code support for AMDGPU backend #2734

wants to merge 28 commits into from

Conversation

aditya4d1
Copy link
Contributor

Initial source port.

#include <fstream>

/*
// This is declared in NVAMDGPU.h, which is not exported. Ugly, but seems better than
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this comment is not accurate :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is not accurate. I left it there so that you can provide more information about the purpose of createNVVMReflectPass

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's some weird extra pass required by the nvptx backend in llvm. I would assume it's not relevant here.


void CodeGen_AMDGPU_Dev::visit(const Store *op) {

// Do aligned 4-wide 32-bit stores as a single i128 store.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Store visitor was to work around a particular piece of weirdness in the nvptx backend. Not necessary here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. So, how should I implement the Store visitor?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default store visitor will work fine. Try just deleting this method.

Copy link
Contributor Author

@aditya4d1 aditya4d1 Feb 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want me to get rid of the whole visit(const Store *op) implementation? I see the code as regular i32 stores are transformed to i128 stores.

@aditya4d1
Copy link
Contributor Author

How to test the codegen for amdgpu backend?

@dsharletg
Copy link
Contributor

To test this, it needs some kind of runtime API support. How do you load and run these kernels?

Looking at the code, it looks like it has some familiar things from OpenCL... I am wondering, what is the advantage of using the AMD GPU backend vs. generating OpenCL kernels?

@aditya4d1
Copy link
Contributor Author

The plan is to generate AMDGPU asm rather than HIP or OpenCL kernels (Just like generating ptx for CUDA). The asm can be compile to object file and load, run hip module api (similar to cuda driver module api).

@dsharletg
Copy link
Contributor

So to test this, it sounds like this needs a HIP runtime module (similar to src/runtime/opencl.cpp or src/runtime/cuda.cpp and associated support code).

Is it by chance possible to load AMDGPU assembly kernels in OpenCL? If so, you might be able to just re-use the OpenCL runtime?

@aditya4d1
Copy link
Contributor Author

I can add new HIP runtime. Unfortunately, no, you can't load asm kernels using OpenCL RT. I am not looking at running anything now. As a first step, I want to make sure the LLVM IR generated is valid (different from nvptx64).

@shoaibkamil
Copy link
Contributor

To see what you're generating, just call CodeGen_AMDGPU_Dev::dump() from somewhere, or run a testcase that would run on the GPU with the environment variable HL_DEBUG_CODEGEN=2.

@@ -125,6 +129,8 @@ Expr make_device_interface_call(DeviceAPI device_api) {
case DeviceAPI::Hexagon:
interface_name = "halide_hexagon_device_interface";
break;
case DeviceAPI::AMDGPU:
interface_name = "halide_amdgpu_device_interface";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing a break?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it.

Aditya Atluri added 6 commits February 19, 2018 10:46
1. When C codegen happens, Halide can now add amdgpu runtime code to it
2. Added file HalideRuntimeAMDGPU.h
3. For CodeGen_GPU_Host, it now create AMDGPU codegen object
4. Added support for testing AMDGPU target
5. Made appropriate changes to Makefile and CMakeLists.txt
1. Added mini_amdgpu.h containing required data structures
2. Added hip_functions.h containing required apis
3. Added amdgpu.cpp which is incomplete. Right now added it to test build
4. Changed Makefile and CMakeLists.txt to build amdgpu.cpp
1. Changed triple
2. Added few llvm ir files for amdgpu codegen to consume
3. Changed Makefile accordingly
Copy link
Contributor Author

@aditya4d1 aditya4d1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to change Makefile. Hence the build is failing. Fixed it in the next commit.

@aditya4d1
Copy link
Contributor Author

@abadams can you review the last two commits? And can you re-spin the build? I am not seeing any build errors on my local machine.

Aditya Atluri added 5 commits February 23, 2018 13:20
1. Added few missing pieces to runtime linker
2. Disabled amdgcn bitcode linking
3. Added a new runtime function to amdgpu.cpp
4. Registered amdgpu runtime apis
1. This fixes object code generation bug
Turns out git didn't catch these files before because bc file name is in gitignore
@aditya4d1
Copy link
Contributor Author

Hi, can this be merged?

@abadams
Copy link
Member

abadams commented Feb 27, 2018

Does that mean it works? How can we enable testing of this backend on the buildbots?

@aditya4d1
Copy link
Contributor Author

Um, the code generation is working fine. I am working on runtime, there are few changes I had to make which takes a lot of testing (time actually..). Meanwhile, if this pr gets merged, I don't need to worry about merge conflicts later.

@abadams
Copy link
Member

abadams commented Feb 27, 2018

I'd rather not merge semi-functional things into master. This part of the compiler is very stable - I doubt you'll have any problems with merge conflicts.

1. Fixed AMDGPU linker relocs
2. Added new argument to relocate to take got_offset
if (type == R_AMDGPU_ABS64 && sym->is_defined()) {
return Relocation(R_AMDGPU_RELATIVE64, fixup_offset, sym_offset + addend, nullptr);
} else if (type == R_AMDGPU_ABS32_LO || type == R_AMDGPU_ABS32_HI || type == R_AMDGPU_ABS32 || type == R_AMDGPU_ABS64) {
return Relocation(type, fixup_offset, addend, sym);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@t-tye what do if both the conditionals are false?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You return the default constructed Relocation as before. This tells the caller that the relocation record is a static relocation that has been fully processed and will not be included in the linked resulting code object.

@@ -602,7 +602,7 @@ std::vector<char> write_shared_object_internal(Object &obj, Linker *linker, cons
// We need to define the GOT symbol.
uint64_t max_got_size = obj.symbols_size() * 2 * sizeof(addr_t);
Section got(".got", Section::SHT_PROGBITS);
got.set_alignment(4);
got.set_alignment(sizeof(addr_t));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsharletg good?

namespace Halide {
namespace Internal {

/** Pull loops marked with the Hexagon device API to a separate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix comments s/Hexagon/AMD

auto bss = obj->find_section(".bss");
if (bss != obj->sections_end()) {
bss->set_alignment(128);
// TODO: We should set the type to SHT_NOBITS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments are basically useless to you, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are they? @dsharletg

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, at the very least the reference to '8998' is nonsense in this context.

// Make .bss a real section.
auto bss = obj->find_section(".bss");
if (bss != obj->sections_end()) {
bss->set_alignment(128);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may not need 128 byte alignment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsharletg true?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 128 byte alignment is for hexagon.

1. Changed comments for AMDGPUOffload.cpp
2. Removed already helper functions in AMDGPUOffload.cpp
3. Removed irrelevant comments
@aditya4d1
Copy link
Contributor Author

aditya4d1 commented Apr 12, 2018

After building Halide, when I run

$ HL_JIT_TARGET=host-amdgpu_gfx900-debug make -f ../Halide/Makefile correctness_conv
olution; ./bin/correctness_convolution

I am getting the following error:

Entering Pipeline blur1
 Input Buffer in: buffer(0, 0x0, 0x1983b80, 1, uint16, {0, 128, 1}, {0, 48, 128})
 Input Buffer tent: buffer(0, 0x0, 0x1975800, 1, uint16, {0, 3, 1}, {0, 3, 3})
 Input (void *) __user_context: 0x7ffe9e696748
 Output Buffer blur1: buffer(0, 0x0, 0x1953f80, 0, uint16, {0, 128, 1}, {0, 48, 128})
AMDGPU: halide_amdgpu_initialize_kernels (user_context: 0x0, state_ptr: 0x7f9d54f47000, ptx_src: 0x7f9d54f41420, size: 12032
    load_libhip (user_context: 0x0)
    Loaded HIP runtime library: libhip_hcc.so
AMDGPU: Multiple AMDGPU devices detected. Selecting the one with the most cores.
      Device 0 has 2560 cores
      Device 1 has 2560 cores
      Device 2 has 2560 cores
      Device 3 has 2560 cores
    Got device 3
      Vega 10 [Radeon Instinct MI25]
      total memory: 16368 MB
      max threads per block: 1024
      warp size: 64
      max block size: 1024 1024 1024
      max grid size: 2147483647 2147483647 2147483647
      max shared memory per block: 65536
      max constant memory per block: 16384
      compute capability 3.0
      workitems: 64 x 192 = 192
    hipCtxCreate 3 -> 0x7f9d12a5ec50(4)
    hipModuleLoadData 0x7f9d54f41420, 12032 -> terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
bash: line 1: 118380 Aborted                 (core dumped) /home/aditya/halide_build/bin/correctness_convolution
../Halide/Makefile:1468: recipe for target 'correctness_convolution' failed

Update 1:

I took the assembly dump, compile and link to object code and used hipModuleLoadData in a sample [1], It worked fine. May be the linker is not generating valid object code yet.

[1]. https://gist.github.com/adityaatluri/4dfd6898a2204ed43bbfd8579db4bd63#file-module_load-cpp

Update 2:

When I put print statement in relocate function, it is not printed out at runtime.

@dsharletg
Copy link
Contributor

You're probably going to need to dump out the linked blob to a file, and load it with various tools (readelf, objdump, etc.) to see if it is valid and try to track down the bugs.

@aditya4d1
Copy link
Contributor Author

The code path for linker is not being used at runtime. Where should I look at for that control flow?


Stmt inject_amdgpu_rpc(Stmt s, const Target &host_target,
Module &containing_module) {
Target target(Target::Linux, Target::X86, 64);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the AMD GPU target OS and architecture. If you don't need any OS support (no thread pool, IO, etc.) then maybe you can just use Target::NoOS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some debugging, the function resolve_submodules() is not calling compile_to_buffer() as there are no submodules. Should I do,

Module Module::resolve_submodules() const {
if(target.has_feature(Target::AMDGPUGFX900)) {
  Module lowered_module(name(), target());
  lowered_module_append(this-> compile_to_amdgpu_shared_object(*this));
  return lowered_module;
}
if(submodules().empty()) {
  return *this;
}
.....
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsharletg can I add this code?

@dsharletg
Copy link
Contributor

Added a comment, plus you'll probably need to add another special case here:

// TODO: This Hexagon specific code should be removed as soon as possible.

(This is ugly, hopefully your backend will help us find common behavior that we can build proper support for.)

@dsharletg
Copy link
Contributor

I think the reason the Hexagon linker doesn't use the GOT for relocations is because we only link shared objects. The GOT address can't be known at link time, so generating a relocation that refers directly to the GOT must be an error (maybe this is where "not PIC code" type linker errors come from?)

This appears to also be true of the AMDGPU relocations: https://llvm.org/docs/AMDGPUUsage.html#amdgpu-relocation-records

GOT
Represents the address of the global offset table.

So it seems that there may be a difference here in that the Hexagon linker produces shared objects and the AMDGPU linker will produce non-PIC executables.

@t-tye
Copy link

t-tye commented Apr 13, 2018

Here is my understanding of what should be happening in the linker.

The GOT address is known at link time. It is the VA of the .got section of the generated code object. A got relocation is the means that a reference in the text section can access an address defined in another dynamically loaded library (or main executable).

Since the text section needs to be readonly (so it can be shared between processes) the way this is achieved is to make the access indirect through the got table. The got relocation is a request to the linker create an entry in the got table when linking from a relocatable to produce a shared library or executable (namely something that can be loaded by the operating system).

Generally the text section is PIC code so the instruction in the text section wants to know the displacement to get to the got table entry in the same code object. That is what a PC relative got relocation computes, and hence why it needs to know the VA of the got table.

The linker creates a section for the got table. Before relocation it allocates all sections so they have known VA addresses. (It seems the got table is conservatively sized based on number of symbols which seems a bit odd.) This change is passing that into the relocation function.

If the relocation is a got relocation, the relocation function checks to see if there is already a got table entry for the external symbol. If not it creates one. In either case what it has is the offset from the start of the got table to the entry. It must add this offset to the got table base to get the address of entry.

In addition an ABSOLUTE relocation record must be generated for the got entry to ensure it will be patched at load time with the actual address of the external symbol.

Since the code object is PIC it does not matter what address it is assumed the code object will be loaded. This load address is the ELF VA. This code chooses to use 0.

The RELATIVE relocation record is what allows the code object to be loaded at a different address. It cause the loader to add the difference between the ELF VA and the actual load address. If the code object is loaded at its actual ELF VA then no patching is required.

A RELATIVE relocation only makes sense for locations that contain an address within the same code object. Hence checking that the symbol is defined before generating it. If the symbol is not defined then an ABSOLUTE relocation must be used (as is done for the got table entries). There seems to be a bug in the Hexagon linker as it does not do this.

A test case to show this would be the definition of a global variable A initialized to the address of another global variable B. If B is a defined variable then a RELATIVE relocation should be used to update A. If B is an external variable then an ABSOLUTE relocation would be used.

The time you will get got table entries is when you have external references to symbols. It maybe that you never have them as you do not support global variables or dynamic libraries.

@dsharletg
Copy link
Contributor

If the symbol is not defined then an ABSOLUTE relocation must be used (as is done for the got table entries).

I think that this cannot happen in PIC code, the only absolute relocations for PIC code are in the GOT, and happen when the symbol is defined (which may be at runtime in the case of shared objects). If the compiler generates a reference to a symbol that cannot be relocated at link time (i.e. either refers to an undefined symbol, or needs an absolute address), then it's too late for the linker to fix that, as it might require the code to have an extra level of indirection (via the GOT).

The time you will get got table entries is when you have external references to symbols. It maybe that you never have them as you do not support global variables or dynamic libraries.

We definitely do generate GOT table entries, we have global variables with initializers, and we do support dynamic libraries :) In fact, the only thing we currently support is shared libraries.

@t-tye
Copy link

t-tye commented Apr 13, 2018

If the symbol is not defined then an ABSOLUTE relocation must be used (as is done for the got table entries).

I think that this cannot happen in PIC code, the only absolute relocations for PIC code are in the GOT, and happen when the symbol is defined (which may be at runtime in the case of shared objects). If the compiler generates a reference to a symbol that cannot be relocated at link time (i.e. either refers to an undefined symbol, or needs an absolute address), then it's too late for the linker to fix that, as it might require the code to have an extra level of indirection (via the GOT).

Right. The relocations cannot be in the PIC readonly .text sections. But they can be present in the the writable .data sections. For example, global variables that are defined and have an initializer that is the address of another symbol.

The time you will get got table entries is when you have external references to symbols. It maybe that you never have them as you do not support global variables or dynamic libraries.

We definitely do generate GOT table entries, we have global variables with initializers, and we do support dynamic libraries :) In fact, the only thing we currently support is shared libraries.

Have you used global variables initialized by the address of other symbols (that may be defined or undefined)? For example:

int b;
int* a = &b;

@dsharletg
Copy link
Contributor

Yes, for example: https://github.com/halide/Halide/blob/master/src/runtime/qurt_allocator.cpp#L88

(qurt_allocator.cpp defines the memory allocator linked to Hexagon code, which is linked by our linker.)

@t-tye
Copy link

t-tye commented Apr 13, 2018

Looking at the Hexagon linker I am unclear how it is creating the relocations needed to make that work if halide_default_malloc is not defined in the same shared library as custom_malloc. For that to work I would expect that the linked shared object would have a data section that holds custom_malloc that has an absolute relocation record specifying an undefined symbol halide_default_malloc. But the only relocation records that seem to be generated for the linked code object are those put on the relocations list of the got section. But I only see that happening for the relative relocations, and the absolute relocations for the got table entries.

@dsharletg
Copy link
Contributor

At least for functions, they get re-routed through the PLT:

debug(2) << "Defining PLT entry for " << sym->get_name() << "\n";

A relocation that refers to an undefined external symbol gets redirected to a PLT entry, and the PLT entry gets a relocation for the real symbol.

A similar thing might be done by the compiler for global data symbols?

@t-tye
Copy link

t-tye commented Apr 14, 2018

In our case, variables get allocated in the data sections and relocation records are used to initialize their value correctly at load time. We do not currently have a plt, but when we do I would expect it to only contain access to functions that can be accessed externally. The got is used for references to external symbols.

@@ -213,19 +212,20 @@ class AMDGPULinker : public Linker {
}

Symbol add_plt_entry(const Symbol &sym, Section &plt, Section &got, const Symbol &got_sym) override {
internal_error << "Unsupported plt relocation for" << sym << "\n";
internal_error << "Unsupported plt relocation for amdgpu object" << "\n";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do get the symbol name printed out change to:

internal_error << "Unsupported plt relocation for " << sym->get_name() << "\n";

@aditya4d1
Copy link
Contributor Author

@pranavb-ca @dsharletg, can you review?

@aditya4d1
Copy link
Contributor Author

@abadams can you restart ci?

@aditya4d1
Copy link
Contributor Author

How should I resolve this: https://travis-ci.org/halide/Halide/jobs/370795728#L7531
?

@ronlieb
Copy link

ronlieb commented Jun 18, 2018

I think there is a recently closed PR that might be of interest to your patch's testing efforts.
#3048

@steven-johnson
Copy link
Contributor

This PR is quite old. Is there an update on its status?

@aditya4d1
Copy link
Contributor Author

Not working on it anymore. Closing PR

@aditya4d1 aditya4d1 closed this Mar 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants