added initial source code support for AMDGPU backend #2734

aditya4d1 · 2018-02-09T04:28:36Z

Initial source port.

abadams · 2018-02-09T05:03:51Z

src/CodeGen_AMDGPU_Dev.cpp

+#include <fstream>
+
+/*
+// This is declared in NVAMDGPU.h, which is not exported. Ugly, but seems better than


I assume this comment is not accurate :)

Yes, it is not accurate. I left it there so that you could provide more information about the purpose of createNVVMReflectPass.

It's some weird extra pass required by the nvptx backend in llvm. I would assume it's not relevant here.

abadams · 2018-02-09T05:06:26Z

src/CodeGen_AMDGPU_Dev.cpp

+
+void CodeGen_AMDGPU_Dev::visit(const Store *op) {
+
+    // Do aligned 4-wide 32-bit stores as a single i128 store.


This Store visitor was to work around a particular piece of weirdness in the nvptx backend. Not necessary here.

Ok. So, how should I implement the Store visitor?

I think the default store visitor will work fine. Try just deleting this method.

Do you want me to get rid of the whole visit(const Store *op) implementation? As I read the code, regular i32 stores are transformed into i128 stores.

aditya4d1 · 2018-02-09T14:47:07Z

How do I test the codegen for the AMDGPU backend?

dsharletg · 2018-02-09T17:09:04Z

To test this, it needs some kind of runtime API support. How do you load and run these kernels?

Looking at the code, it looks like it has some familiar things from OpenCL... I am wondering, what is the advantage of using the AMD GPU backend vs. generating OpenCL kernels?

aditya4d1 · 2018-02-09T17:11:43Z

The plan is to generate AMDGPU asm rather than HIP or OpenCL kernels (just like generating PTX for CUDA). The asm can be compiled to an object file, then loaded and run through the HIP module API (similar to the CUDA driver module API).

dsharletg · 2018-02-09T17:25:25Z

So to test this, it sounds like this needs a HIP runtime module (similar to src/runtime/opencl.cpp or src/runtime/cuda.cpp and associated support code).

Is it by chance possible to load AMDGPU assembly kernels in OpenCL? If so, you might be able to just re-use the OpenCL runtime?

aditya4d1 · 2018-02-09T17:29:04Z

I can add a new HIP runtime. Unfortunately, no, you can't load asm kernels using the OpenCL runtime. I am not looking at running anything yet. As a first step, I want to make sure the generated LLVM IR is valid (different from nvptx64).

shoaibkamil · 2018-02-09T19:22:50Z

To see what you're generating, just call CodeGen_AMDGPU_Dev::dump() from somewhere, or run a testcase that would run on the GPU with the environment variable HL_DEBUG_CODEGEN=2.

abadams · 2018-02-19T18:12:06Z

src/DeviceInterface.cpp

@@ -125,6 +129,8 @@ Expr make_device_interface_call(DeviceAPI device_api) {
    case DeviceAPI::Hexagon:
        interface_name = "halide_hexagon_device_interface";
        break;
+    case DeviceAPI::AMDGPU:
+        interface_name = "halide_amdgpu_device_interface";


missing a break?
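
For readers following the review, here is a minimal sketch of why the missing break matters. The enum and function names below are hypothetical stand-ins, not Halide's actual DeviceAPI or make_device_interface_call; the point is only that, without a break, control falls through into the next label and the name just assigned is overwritten.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for Halide's DeviceAPI enum (illustration only).
enum class DeviceAPI { Hexagon, AMDGPU, Unknown };

// Buggy version: the AMDGPU case has no break, so it runs the default body too.
std::string interface_name_buggy(DeviceAPI api) {
    std::string name;
    switch (api) {
    case DeviceAPI::Hexagon:
        name = "halide_hexagon_device_interface";
        break;
    case DeviceAPI::AMDGPU:
        name = "halide_amdgpu_device_interface";
        // fall through - this is the bug: no break here
    default:
        name = "unsupported";
        break;
    }
    return name;
}

// Fixed version: each case ends with a break.
std::string interface_name_fixed(DeviceAPI api) {
    std::string name;
    switch (api) {
    case DeviceAPI::Hexagon:
        name = "halide_hexagon_device_interface";
        break;
    case DeviceAPI::AMDGPU:
        name = "halide_amdgpu_device_interface";
        break;
    default:
        name = "unsupported";
        break;
    }
    assert(!name.empty());
    return name;
}
```

In the buggy version the AMDGPU request silently ends up with whatever the following case assigns, which is the kind of error the review comment is pointing at.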

1. When C codegen happens, Halide can now add AMDGPU runtime code to it
2. Added file HalideRuntimeAMDGPU.h
3. CodeGen_GPU_Host now creates an AMDGPU codegen object
4. Added support for testing the AMDGPU target
5. Made appropriate changes to Makefile and CMakeLists.txt

1. Added mini_amdgpu.h containing required data structures
2. Added hip_functions.h containing required APIs
3. Added amdgpu.cpp, which is incomplete; for now it is only added to test the build
4. Changed Makefile and CMakeLists.txt to build amdgpu.cpp

1. Changed triple
2. Added a few LLVM IR files for AMDGPU codegen to consume
3. Changed Makefile accordingly

aditya4d1

I forgot to change Makefile. Hence the build is failing. Fixed it in the next commit.

aditya4d1 · 2018-02-22T20:18:53Z

@abadams can you review the last two commits? And can you re-spin the build? I am not seeing any build errors on my local machine.

1. Added a few missing pieces to the runtime linker
2. Disabled amdgcn bitcode linking
3. Added a new runtime function to amdgpu.cpp
4. Registered AMDGPU runtime APIs

1. This fixes object code generation bug

Turns out git didn't pick up these files before because the .bc file name pattern is in .gitignore.

aditya4d1 · 2018-02-27T14:26:23Z

Hi, can this be merged?

abadams · 2018-02-27T21:07:41Z

Does that mean it works? How can we enable testing of this backend on the buildbots?

aditya4d1 · 2018-02-27T21:11:03Z

Um, the code generation is working fine. I am working on the runtime; there are a few changes I had to make that take a lot of testing (time, actually). Meanwhile, if this PR gets merged, I won't need to worry about merge conflicts later.

abadams · 2018-02-27T21:12:42Z

I'd rather not merge semi-functional things into master. This part of the compiler is very stable - I doubt you'll have any problems with merge conflicts.

1. Fixed AMDGPU linker relocs
2. Added new argument to relocate to take got_offset

aditya4d1 · 2018-04-12T21:08:15Z

src/AMDGPUOffload.cpp

+        if (type == R_AMDGPU_ABS64 && sym->is_defined()) {
+            return Relocation(R_AMDGPU_RELATIVE64, fixup_offset, sym_offset + addend, nullptr);
+        } else if (type == R_AMDGPU_ABS32_LO || type == R_AMDGPU_ABS32_HI || type == R_AMDGPU_ABS32 || type == R_AMDGPU_ABS64) {
+            return Relocation(type, fixup_offset, addend, sym);


@t-tye what should I do if both conditionals are false?

You return the default constructed Relocation as before. This tells the caller that the relocation record is a static relocation that has been fully processed and will not be included in the linked resulting code object.
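
A minimal sketch of the three outcomes described here, assuming simplified stand-ins for the linker types. The Symbol and Relocation structs and the enum values below are hypothetical and are not Halide's Elf.h definitions; only the branch structure follows the diff above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical relocation kinds; numeric values are placeholders, not the ELF encodings.
enum RelocType : uint32_t {
    R_NONE, R_ABS32_LO, R_ABS32_HI, R_ABS32, R_ABS64, R_REL32, R_RELATIVE64
};

// Hypothetical symbol: whether it is defined in this object, and its offset if so.
struct Symbol {
    bool defined;
    uint64_t offset;
};

// Hypothetical relocation record. A default-constructed record means
// "fully resolved at link time; emit nothing into the linked object".
struct Relocation {
    uint32_t type;
    uint64_t offset;
    int64_t addend;
    const Symbol *symbol;

    Relocation() : type(R_NONE), offset(0), addend(0), symbol(nullptr) {}
    Relocation(uint32_t t, uint64_t o, int64_t a, const Symbol *s)
        : type(t), offset(o), addend(a), symbol(s) {}

    bool is_dynamic() const { return type != R_NONE; }
};

Relocation classify(uint32_t type, uint64_t fixup_offset, int64_t addend, const Symbol *sym) {
    assert(sym != nullptr);
    if (type == R_ABS64 && sym->defined) {
        // Address inside this object: the loader only needs to add the load bias.
        return Relocation(R_RELATIVE64, fixup_offset,
                          static_cast<int64_t>(sym->offset) + addend, nullptr);
    } else if (type == R_ABS32_LO || type == R_ABS32_HI || type == R_ABS32 || type == R_ABS64) {
        // Undefined symbol: keep an absolute relocation for the loader to resolve.
        return Relocation(type, fixup_offset, addend, sym);
    }
    // Everything else was a static relocation that is already fully processed.
    return Relocation();
}
```

The caller can then test is_dynamic() (a name chosen for this sketch) to decide whether the record belongs in the output.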

aditya4d1 · 2018-04-12T21:08:33Z

src/Elf.cpp

@@ -602,7 +602,7 @@ std::vector<char> write_shared_object_internal(Object &obj, Linker *linker, cons
    // We need to define the GOT symbol.
    uint64_t max_got_size = obj.symbols_size() * 2 * sizeof(addr_t);
    Section got(".got", Section::SHT_PROGBITS);
-    got.set_alignment(4);
+    got.set_alignment(sizeof(addr_t));


@dsharletg good?

pranavb-ca · 2018-04-12T21:34:14Z

src/AMDGPUOffload.h

+namespace Halide {
+namespace Internal {
+
+/** Pull loops marked with the Hexagon device API to a separate


Fix comments s/Hexagon/AMD

pranavb-ca · 2018-04-12T21:41:20Z

src/AMDGPUOffload.cpp

+    auto bss = obj->find_section(".bss");
+    if (bss != obj->sections_end()) {
+        bss->set_alignment(128);
+        // TODO: We should set the type to SHT_NOBITS


These comments are basically useless to you, I think.

Are they? @dsharletg

Yes, at the very least the reference to '8998' is nonsense in this context.

pranavb-ca · 2018-04-12T21:41:44Z

src/AMDGPUOffload.cpp

+    // Make .bss a real section.
+    auto bss = obj->find_section(".bss");
+    if (bss != obj->sections_end()) {
+        bss->set_alignment(128);


You may not need 128 byte alignment.

@dsharletg true?

Yes, 128 byte alignment is for hexagon.

1. Changed comments for AMDGPUOffload.cpp
2. Removed already helper functions in AMDGPUOffload.cpp
3. Removed irrelevant comments

aditya4d1 · 2018-04-12T23:35:30Z

After building Halide, when I run

$ HL_JIT_TARGET=host-amdgpu_gfx900-debug make -f ../Halide/Makefile correctness_convolution; ./bin/correctness_convolution

I am getting the following error:

Entering Pipeline blur1
 Input Buffer in: buffer(0, 0x0, 0x1983b80, 1, uint16, {0, 128, 1}, {0, 48, 128})
 Input Buffer tent: buffer(0, 0x0, 0x1975800, 1, uint16, {0, 3, 1}, {0, 3, 3})
 Input (void *) __user_context: 0x7ffe9e696748
 Output Buffer blur1: buffer(0, 0x0, 0x1953f80, 0, uint16, {0, 128, 1}, {0, 48, 128})
AMDGPU: halide_amdgpu_initialize_kernels (user_context: 0x0, state_ptr: 0x7f9d54f47000, ptx_src: 0x7f9d54f41420, size: 12032
    load_libhip (user_context: 0x0)
    Loaded HIP runtime library: libhip_hcc.so
AMDGPU: Multiple AMDGPU devices detected. Selecting the one with the most cores.
      Device 0 has 2560 cores
      Device 1 has 2560 cores
      Device 2 has 2560 cores
      Device 3 has 2560 cores
    Got device 3
      Vega 10 [Radeon Instinct MI25]
      total memory: 16368 MB
      max threads per block: 1024
      warp size: 64
      max block size: 1024 1024 1024
      max grid size: 2147483647 2147483647 2147483647
      max shared memory per block: 65536
      max constant memory per block: 16384
      compute capability 3.0
      workitems: 64 x 192 = 192
    hipCtxCreate 3 -> 0x7f9d12a5ec50(4)
    hipModuleLoadData 0x7f9d54f41420, 12032 -> terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
bash: line 1: 118380 Aborted                 (core dumped) /home/aditya/halide_build/bin/correctness_convolution
../Halide/Makefile:1468: recipe for target 'correctness_convolution' failed

Update 1:

I took the assembly dump, compiled and linked it to object code, and used hipModuleLoadData on it in a sample [1]. It worked fine. Maybe the linker is not generating valid object code yet.

[1]. https://gist.github.com/adityaatluri/4dfd6898a2204ed43bbfd8579db4bd63#file-module_load-cpp

Update 2:

When I put a print statement in the relocate function, it is not printed at runtime.

dsharletg · 2018-04-12T23:42:19Z

You're probably going to need to dump out the linked blob to a file, and load it with various tools (readelf, objdump, etc.) to see if it is valid and try to track down the bugs.
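
A minimal sketch of the dump step, assuming the linked blob is available as a std::vector<char> (the function names and the file path are made up for illustration). It writes the bytes out unchanged and does a quick sanity check on the ELF magic before handing the file to external tools.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

// Returns true if the buffer starts with the ELF magic bytes 0x7f 'E' 'L' 'F'.
bool looks_like_elf(const std::vector<char> &blob) {
    return blob.size() >= 4 &&
           blob[0] == 0x7f && blob[1] == 'E' && blob[2] == 'L' && blob[3] == 'F';
}

// Writes the blob to `path` in binary mode. Returns false on any I/O failure.
bool dump_blob(const std::vector<char> &blob, const std::string &path) {
    assert(!path.empty());
    std::ofstream out(path.c_str(), std::ios::binary | std::ios::trunc);
    if (!out) {
        return false;
    }
    out.write(blob.data(), static_cast<std::streamsize>(blob.size()));
    out.flush();
    return out.good();
}
```

The resulting file can then be inspected with, for example, readelf -h -S -r to check the header, section table, and relocation records.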

aditya4d1 · 2018-04-12T23:45:02Z

The linker code path is not being exercised at runtime. Where should I look to follow that control flow?

dsharletg · 2018-04-12T23:51:40Z

src/AMDGPUOffload.cpp

+
+Stmt inject_amdgpu_rpc(Stmt s, const Target &host_target,
+                        Module &containing_module) {
+    Target target(Target::Linux, Target::X86, 64);


This should be the AMD GPU target OS and architecture. If you don't need any OS support (no thread pool, IO, etc.) then maybe you can just use Target::NoOS.

After some debugging, I found that resolve_submodules() is not calling compile_to_buffer() because there are no submodules. Should I do the following?

Module Module::resolve_submodules() const {
    if (target().has_feature(Target::AMDGPUGFX900)) {
        Module lowered_module(name(), target());
        lowered_module.append(this->compile_to_amdgpu_shared_object(*this));
        return lowered_module;
    }
    if (submodules().empty()) {
        return *this;
    }
    .....
}

@dsharletg can I add this code?

dsharletg · 2018-04-12T23:52:52Z

Added a comment, plus you'll probably need to add another special case here:

Halide/src/Module.cpp

Line 258 in 626c2dd

// TODO: This Hexagon specific code should be removed as soon as possible.

(This is ugly, hopefully your backend will help us find common behavior that we can build proper support for.)

dsharletg · 2018-04-13T01:15:51Z

I think the reason the Hexagon linker doesn't use the GOT for relocations is because we only link shared objects. The GOT address can't be known at link time, so generating a relocation that refers directly to the GOT must be an error (maybe this is where "not PIC code" type linker errors come from?)

This appears to also be true of the AMDGPU relocations: https://llvm.org/docs/AMDGPUUsage.html#amdgpu-relocation-records

GOT
Represents the address of the global offset table.

So it seems that there may be a difference here in that the Hexagon linker produces shared objects and the AMDGPU linker will produce non-PIC executables.

t-tye · 2018-04-13T01:54:43Z

Here is my understanding of what should be happening in the linker.

The GOT address is known at link time. It is the VA of the .got section of the generated code object. A got relocation is the means that a reference in the text section can access an address defined in another dynamically loaded library (or main executable).

Since the text section needs to be readonly (so it can be shared between processes) the way this is achieved is to make the access indirect through the got table. The got relocation is a request to the linker to create an entry in the got table when linking from a relocatable to produce a shared library or executable (namely something that can be loaded by the operating system).

Generally the text section is PIC code so the instruction in the text section wants to know the displacement to get to the got table entry in the same code object. That is what a PC relative got relocation computes, and hence why it needs to know the VA of the got table.

The linker creates a section for the got table. Before relocation it allocates all sections so they have known VA addresses. (It seems the got table is conservatively sized based on number of symbols which seems a bit odd.) This change is passing that into the relocation function.

If the relocation is a got relocation, the relocation function checks to see if there is already a got table entry for the external symbol. If not it creates one. In either case what it has is the offset from the start of the got table to the entry. It must add this offset to the got table base to get the address of the entry.

In addition an ABSOLUTE relocation record must be generated for the got entry to ensure it will be patched at load time with the actual address of the external symbol.
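
The lookup-or-create step and the address computation can be sketched as follows. This is a hypothetical model, not Halide's Elf.cpp interface: GotTable, entry_address, and got_pcrel are names invented here. The displacement formula follows my reading of the G + GOT + A - P form that the LLVM AMDGPU documentation gives for PC-relative got relocations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a linker-managed GOT section.
class GotTable {
public:
    GotTable(uint64_t base_va, uint64_t entry_size)
        : base_va_(base_va), entry_size_(entry_size) {
        assert(entry_size_ > 0);
    }

    // Returns the VA of the GOT slot for `sym`, creating the slot on first use.
    uint64_t entry_address(const std::string &sym) {
        std::map<std::string, uint64_t>::iterator it = offsets_.find(sym);
        if (it == offsets_.end()) {
            uint64_t offset = offsets_.size() * entry_size_;
            it = offsets_.insert(std::make_pair(sym, offset)).first;
            // A new slot needs an ABSOLUTE relocation so the loader can
            // patch in the real address of the external symbol.
            pending_abs_.push_back(std::make_pair(offset, sym));
        }
        return base_va_ + it->second;
    }

    std::size_t size_bytes() const { return offsets_.size() * entry_size_; }

    const std::vector<std::pair<uint64_t, std::string> > &pending_absolute() const {
        return pending_abs_;
    }

private:
    uint64_t base_va_;
    uint64_t entry_size_;
    std::map<std::string, uint64_t> offsets_;
    std::vector<std::pair<uint64_t, std::string> > pending_abs_;
};

// PC-relative displacement from the fixup location to a GOT slot.
int64_t got_pcrel(uint64_t entry_va, int64_t addend, uint64_t fixup_va) {
    return static_cast<int64_t>(entry_va + static_cast<uint64_t>(addend) - fixup_va);
}
```

Asking twice for the same symbol returns the same slot, so the table grows only with the number of distinct external symbols actually referenced.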

Since the code object is PIC it does not matter at what address the code object is assumed to be loaded. This load address is the ELF VA. This code chooses to use 0.

The RELATIVE relocation record is what allows the code object to be loaded at a different address. It causes the loader to add the difference between the ELF VA and the actual load address. If the code object is loaded at its actual ELF VA then no patching is required.
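
What the loader does with a RELATIVE record can be sketched as below, assuming the object was linked at ELF VA 0 as described above, so the load bias equals the actual load address. The struct and function names are hypothetical, and the image is modelled as a plain byte vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical RELATIVE record: where to patch, and the link-time address to rebase.
struct RelativeReloc {
    uint64_t offset;
    int64_t addend;
};

// Patches each location with load_bias + addend (64-bit, host byte order).
void apply_relative(std::vector<uint8_t> &image, uint64_t load_bias,
                    const std::vector<RelativeReloc> &relocs) {
    for (std::size_t i = 0; i < relocs.size(); i++) {
        const RelativeReloc &r = relocs[i];
        assert(r.offset + sizeof(uint64_t) <= image.size());
        uint64_t value = load_bias + static_cast<uint64_t>(r.addend);
        std::memcpy(&image[r.offset], &value, sizeof(value));
    }
}

// Reads back a 64-bit value from the image (helper for inspection).
uint64_t read_u64(const std::vector<uint8_t> &image, uint64_t offset) {
    assert(offset + sizeof(uint64_t) <= image.size());
    uint64_t value = 0;
    std::memcpy(&value, &image[offset], sizeof(value));
    return value;
}
```

With a load bias of 0 the patched value equals the addend, which matches the remark that no patching is needed when the object is loaded at its ELF VA.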

A RELATIVE relocation only makes sense for locations that contain an address within the same code object. Hence checking that the symbol is defined before generating it. If the symbol is not defined then an ABSOLUTE relocation must be used (as is done for the got table entries). There seems to be a bug in the Hexagon linker as it does not do this.

A test case to show this would be the definition of a global variable A initialized to the address of another global variable B. If B is a defined variable then a RELATIVE relocation should be used to update A. If B is an external variable then an ABSOLUTE relocation would be used.

The time you will get got table entries is when you have external references to symbols. It may be that you never have them as you do not support global variables or dynamic libraries.

dsharletg · 2018-04-13T21:34:15Z

If the symbol is not defined then an ABSOLUTE relocation must be used (as is done for the got table entries).

I think that this cannot happen in PIC code, the only absolute relocations for PIC code are in the GOT, and happen when the symbol is defined (which may be at runtime in the case of shared objects). If the compiler generates a reference to a symbol that cannot be relocated at link time (i.e. either refers to an undefined symbol, or needs an absolute address), then it's too late for the linker to fix that, as it might require the code to have an extra level of indirection (via the GOT).

The time you will get got table entries is when you have external references to symbols. It may be that you never have them as you do not support global variables or dynamic libraries.

We definitely do generate GOT table entries, we have global variables with initializers, and we do support dynamic libraries :) In fact, the only thing we currently support is shared libraries.

t-tye · 2018-04-13T22:23:49Z

If the symbol is not defined then an ABSOLUTE relocation must be used (as is done for the got table entries).

I think that this cannot happen in PIC code, the only absolute relocations for PIC code are in the GOT, and happen when the symbol is defined (which may be at runtime in the case of shared objects). If the compiler generates a reference to a symbol that cannot be relocated at link time (i.e. either refers to an undefined symbol, or needs an absolute address), then it's too late for the linker to fix that, as it might require the code to have an extra level of indirection (via the GOT).

Right. The relocations cannot be in the PIC readonly .text sections. But they can be present in the writable .data sections. For example, global variables that are defined and have an initializer that is the address of another symbol.

The time you will get got table entries is when you have external references to symbols. It may be that you never have them as you do not support global variables or dynamic libraries.

We definitely do generate GOT table entries, we have global variables with initializers, and we do support dynamic libraries :) In fact, the only thing we currently support is shared libraries.

Have you used global variables initialized by the address of other symbols (that may be defined or undefined)? For example:

int b;
int* a = &b;

dsharletg · 2018-04-13T22:26:35Z

Yes, for example: https://github.com/halide/Halide/blob/master/src/runtime/qurt_allocator.cpp#L88

(qurt_allocator.cpp defines the memory allocator linked to Hexagon code, which is linked by our linker.)

t-tye · 2018-04-13T22:39:00Z

Looking at the Hexagon linker I am unclear how it is creating the relocations needed to make that work if halide_default_malloc is not defined in the same shared library as custom_malloc. For that to work I would expect that the linked shared object would have a data section that holds custom_malloc that has an absolute relocation record specifying an undefined symbol halide_default_malloc. But the only relocation records that seem to be generated for the linked code object are those put on the relocations list of the got section. But I only see that happening for the relative relocations, and the absolute relocations for the got table entries.

dsharletg · 2018-04-13T23:52:53Z

At least for functions, they get re-routed through the PLT:

Halide/src/Elf.cpp

Line 714 in f687e89

debug(2) << "Defining PLT entry for " << sym->get_name() << "\n";

A relocation that refers to an undefined external symbol gets redirected to a PLT entry, and the PLT entry gets a relocation for the real symbol.

A similar thing might be done by the compiler for global data symbols?

t-tye · 2018-04-14T00:27:27Z

In our case, variables get allocated in the data sections and relocation records are used to initialize their value correctly at load time. We do not currently have a plt, but when we do I would expect it to only contain access to functions that can be accessed externally. The got is used for references to external symbols.

t-tye · 2018-04-14T01:50:24Z

src/AMDGPUOffload.cpp

@@ -213,19 +212,20 @@ class AMDGPULinker : public Linker {
    }

    Symbol add_plt_entry(const Symbol &sym, Section &plt, Section &got, const Symbol &got_sym) override {
-        internal_error << "Unsupported plt relocation for" << sym << "\n";
+        internal_error << "Unsupported plt relocation for amdgpu object" << "\n";


To get the symbol name printed out, change to:

internal_error << "Unsupported plt relocation for " << sym->get_name() << "\n";

aditya4d1 · 2018-04-24T20:02:37Z

@pranavb-ca @dsharletg, can you review?

aditya4d1 · 2018-04-24T20:48:58Z

@abadams can you restart ci?

aditya4d1 · 2018-04-24T22:34:34Z

How should I resolve this? https://travis-ci.org/halide/Halide/jobs/370795728#L7531

ronlieb · 2018-06-18T14:19:18Z

I think there is a recently closed PR that might be of interest to your patch's testing efforts.
#3048

steven-johnson · 2019-03-18T21:03:21Z

This PR is quite old. Is there an update on its status?

aditya4d1 · 2019-03-26T17:43:05Z

Not working on it anymore. Closing PR

added initial source code support for AMDGPU backend

49e35ad

abadams reviewed Feb 9, 2018

Aditya Atluri added 3 commits February 9, 2018 14:05

removed nvptx passes from amdgpu code

ab588a4

removed commented out code

0e9ce86

added device enum for halide to generate code for amdgpu

c829be1

abadams reviewed Feb 19, 2018

Aditya Atluri added 6 commits February 19, 2018 10:46

added break statement

463fc85

added headers to implement amdgpu runtime

c483a59

1. Added mini_amdgpu.h containing required data structures 2. Added hip_functions.h containing required apis 3. Added amdgpu.cpp which is incomplete. Right now added it to test build 4. Changed Makefile and CMakeLists.txt to build amdgpu.cpp

added support to load hip shared library

de4c293

changed amdgpu codegen to call proper functions

6956bb5

1. Changed triple 2. Added few llvm ir files for amdgpu codegen to consume 3. Changed Makefile accordingly

added device libs support to Makefile

2de8b06

aditya4d1 commented Feb 22, 2018

Aditya Atluri added 5 commits February 23, 2018 13:20

disabled using amdgcn bitcode

8acbc05

1. Added few missing pieces to runtime linker 2. Disabled amdgcn bitcode linking 3. Added a new runtime function to amdgpu.cpp 4. Registered amdgpu runtime apis

moved AMDGPU target initializations into constructor

ae87405

1. This fixes object code generation bug

Removed spaces in Makefile for building amdgpu ll from bitcode

c3c8bf4

added rule to build initmod_amdgpu

8731eb4

added dummy device libraries

9fdfb2c

Turns out git didn't catch these files before because bc file name is in gitignore

made changes to elf.cpp to make it work with amdgpu object code

597431d

1. Fixed AMDGPU linker relocs 2. Added new argument to relocate to take got_offset

aditya4d1 commented Apr 12, 2018

pranavb-ca requested changes Apr 12, 2018

fixed missing code path

25c458e

1. Changed comments for AMDGPUOffload.cpp 2. Removed already helper functions in AMDGPUOffload.cpp 3. Removed irrelevant comments

dsharletg reviewed Apr 12, 2018

t-tye reviewed Apr 14, 2018

Aditya Atluri added 2 commits April 24, 2018 12:44

added logic to compile amdgpu shared object

81a3688

added debug text for plt entry

133574f

Merge branch 'master' into rocm-src-v1

5b04b98

fixed build issue after master and rocm-src-v1 branch merge

f81be26

aditya4d1 closed this Mar 26, 2019

