
[compiler-rt] Initial support for builtins on GPU targets #95304

Merged
1 commit merged into llvm:main from the CRT branch on Jul 10, 2024

Conversation

@jhuber6
Contributor

jhuber6 commented Jun 12, 2024

Summary:
This patch adds initial support to build the `builtins` library for GPU
targets. Primarily this requires adding a few new architectures for
`amdgcn` and `nvptx64`. I built this using the following invocations.

```console
$ cmake ../compiler-rt -DCMAKE_C_COMPILER=clang
  -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release -GNinja
  -DCMAKE_C_COMPILER_TARGET=<nvptx64-nvidia-cuda|amdgcn-amd-amdhsa>
  -DCMAKE_CXX_COMPILER_TARGET=<nvptx64-nvidia-cuda|amdgcn-amd-amdhsa>
  -DCMAKE_C_COMPILER_WORKS=1 -DCMAKE_CXX_COMPILER_WORKS=1
  -DLLVM_CMAKE_DIR=../cmake/Modules -DCOMPILER_RT_DEFAULT_TARGET_ONLY=ON
  -C ../compiler-rt/cmake/caches/GPU.cmake
```
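
For context, the `-C` cache file is what restricts the build to just the builtins. A minimal sketch of what such a cache could contain is below; the option names are standard compiler-rt CMake options, but the actual contents of `GPU.cmake` in this patch may differ.

```cmake
# Illustrative cache file only; the real GPU.cmake may set more or different options.
# Build just the builtins library and turn off the runtime pieces that do
# not apply to GPU targets.
set(COMPILER_RT_BUILD_BUILTINS ON CACHE BOOL "")
set(COMPILER_RT_BUILD_SANITIZERS OFF CACHE BOOL "")
set(COMPILER_RT_BUILD_XRAY OFF CACHE BOOL "")
set(COMPILER_RT_BUILD_LIBFUZZER OFF CACHE BOOL "")
set(COMPILER_RT_BUILD_PROFILE OFF CACHE BOOL "")
set(COMPILER_RT_BAREMETAL_BUILD ON CACHE BOOL "")
```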

Some pointers would be appreciated for how to test this using a standard
(non-default target only) build.

GPU builds are somewhat finicky. We only expect this to be built with a
sufficiently new clang, as it's the only compiler that supports the
target and output we distribute. Distribution is done as LLVM-IR blobs for now.
GPUs have little backward compatibility, so linking object files is
left to a future patch.

More work is necessary to build correctly for all targets and ship into
the correct clang resource directory. Additionally we need to use the
libc project's support for running unit tests.

```cmake
# FIXME: This doesn't work with no CUDA installation.
set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} -march=sm_75")
endif()
endif()
```
Member

Can this be extracted somehow? I suspect that we need this in base-config-ix.cmake as well.
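
For illustration, a check like this could be factored into a small helper that both `config-ix.cmake` and `base-config-ix.cmake` include; the macro name and the exact guard below are hypothetical, not taken from the patch.

```cmake
# Hypothetical helper; a shared CMake module would let both config files
# apply the same workaround instead of duplicating it.
macro(append_nvptx_check_flags)
  # Mirror the snippet above: pin a specific compute capability for the
  # configure-time compile checks when targeting nvptx.
  if("${COMPILER_RT_DEFAULT_TARGET_ARCH}" MATCHES "nvptx")
    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} -march=sm_75")
  endif()
endmacro()
```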

Contributor Author

What's the difference? I thought this was the one place that checked the default argument.

Contributor Author

I think a regular build (not cross-compiling directly) will be fine, since it will just use clang, which should hopefully be supported by the user's system. This is mostly required for when the runtimes build invokes it with CMAKE_C_TARGET=nvptx64-nvidia-cuda.

Member

Specifying CMAKE_C_TARGET doesn't imply cross-compilation, though. I could use an x64 compiler, hosted on x64, and still specify the x64 target, and that would not be cross-compilation.

Contributor Author

True, but I'm just going off of the expected case where the compiler has an implicit --target= value. I'm assuming the base logic just checks whether it can target any of them? The only logic I saw for that was in base-config-ix.cmake, and it went off of the default target, so I'm unsure where else to modify.

jhuber6 force-pushed the CRT branch 2 times, most recently from 86c2f38 to 5b44600 on June 12, 2024 21:40
jhuber6 force-pushed the CRT branch 2 times, most recently from b149b1c to b8e4c2b on June 13, 2024 13:15
@yxsamliu
Collaborator

If -DCMAKE_C_COMPILER_TARGET=amdgcn-amd-amdhsa -DCMAKE_CXX_COMPILER_TARGET=amdgcn-amd-amdhsa is specified, will cmake only build compiler-rt builtin lib for amdgcn (not building compiler-rt builtin lib for the default host target)?

It is desirable to build compiler-rt builtin for both the default host target and amdgcn target in one build. And the build for amdgcn target can be optional and controlled by a cmake variable. You may use a similar approach in https://github.com/llvm/llvm-project/pull/71978/files to achieve that.
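
A rough sketch of the kind of opt-in switch being suggested; the variable names here are made up for illustration and are not taken from this patch or from #71978.

```cmake
# Hypothetical opt-in for an additional device builtins build alongside the
# default host build.
option(COMPILER_RT_BUILD_GPU_BUILTINS
       "Also build the compiler-rt builtins for the amdgcn device target" OFF)

if(COMPILER_RT_BUILD_GPU_BUILTINS)
  # The device triple would be appended to whatever list the existing
  # per-target builtins logic iterates over.
  list(APPEND GPU_BUILTIN_TRIPLES amdgcn-amd-amdhsa)
endif()
```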

@jhuber6
Contributor Author

jhuber6 commented Jun 13, 2024

> If -DCMAKE_C_COMPILER_TARGET=amdgcn-amd-amdhsa -DCMAKE_CXX_COMPILER_TARGET=amdgcn-amd-amdhsa is specified, will cmake only build compiler-rt builtin lib for amdgcn (not building compiler-rt builtin lib for the default host target)?
>
> It is desirable to build compiler-rt builtin for both the default host target and amdgcn target in one build. And the build for amdgcn target can be optional and controlled by a cmake variable. You may use a similar approach in https://github.com/llvm/llvm-project/pull/71978/files to achieve that.

It should work with this configuration (though I just remembered I need to add some more flags to the other case). Ideally we'd just build these if the compiler can target them, without needing an extra variable. What I did here was mostly testing (direct) targeting; however, it should work in various cases. Ultimately what I want is to install these to <install>/lib/clang/<version>/lib/amdgcn-amd-amdhsa/libclang_rt.builtins.(a|bc) and then we link that in. Supporting complex numbers and 128-bit division is part of why I wanted to do this.

@jhuber6
Contributor Author

jhuber6 commented Jun 13, 2024

What's the expected way to target multiple architectures in a single build by the way? The logic I see seems to just take the first value from the LLVM default.

@yxsamliu
Collaborator

> What's the expected way to target multiple architectures in a single build by the way? The logic I see seems to just take the first value from the LLVM default.

By default, the cmake file only builds the LLVM default. However, for HIP we want to also build for the device target. Since normal users may not want that, we need an LLVM variable to enable it.

@jhuber6
Contributor Author

jhuber6 commented Jun 13, 2024

> What's the expected way to target multiple architectures in a single build by the way? The logic I see seems to just take the first value from the LLVM default.

> By default, the cmake file only builds the LLVM default. However, for HIP we want to also build for the device target. Since normal users may not want that, we need an LLVM variable to enable it.

My current plan is to emit it as part of a GPU toolchain. Right now I do this with the libc project, so I'm planning for the canonical way of building this to look like the following.

```console
    -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=compiler-rt;libc \
    -DLLVM_RUNTIME_TARGETS=default;amdgcn-amd-amdhsa
```

However, this will require a few tweaks since the runtimes build handles the builtins separately. Going to work on a follow-up for that.

@compnerd is there a canonical way to build for multiple architectures at the same time? I.e. given a standard CMake invocation I'd like to be able to build output for x64, amdgcn, and nvptx64 if they are all supported by the compiler.

@compnerd
Member

> @compnerd is there a canonical way to build for multiple architectures at the same time? I.e. given a standard CMake invocation I'd like to be able to build output for x64, amdgcn, and nvptx64 if they are all supported by the compiler.

Invoke CMake multiple times :) CMake is designed to build for a single host at a time, so you would need to run a build per host.

@jhuber6
Contributor Author

jhuber6 commented Jun 13, 2024

> @compnerd is there a canonical way to build for multiple architectures at the same time? I.e. given a standard CMake invocation I'd like to be able to build output for x64, amdgcn, and nvptx64 if they are all supported by the compiler.

> Invoke CMake multiple times :) CMake is designed to build for a single host at a time, so you would need to run a build per host.

Okay, that tracks with the runtimes option I'm planning on using. The case that @yxsamliu is describing can probably be done with -DCOMPILER_RT_DEFAULT_TARGET=amdgcn-amd-amdhsa instead of a dedicated flag, correct?

@yxsamliu
Collaborator

> @compnerd is there a canonical way to build for multiple architectures at the same time? I.e. given a standard CMake invocation I'd like to be able to build output for x64, amdgcn, and nvptx64 if they are all supported by the compiler.

> Invoke CMake multiple times :) CMake is designed to build for a single host at a time, so you would need to run a build per host.

Is it possible to build the compiler-rt runtime using an LLVM built in a different build directory? This is to avoid building LLVM twice if we want to build compiler-rt for x86_64 and amdgcn.

@jhuber6
Contributor Author

jhuber6 commented Jun 13, 2024

@compnerd Okay, I figured out my original problem. So, in #95234 the reason I stopped seeing the issue was that I moved to using DEFAULT_TARGET_ONLY mode. The test_target_architecture version uses this simple.cc, which then fails due to the printf. I think it should be fine to build this if the user has a compatible clang, so maybe I should revive that patch?

@petrhosek
Member

There are two ways to build compiler-rt:

  1. The old way in theory supports building for multiple targets in a single build, but in practice only reliably works for sufficiently similar targets like x86 and x86-64 or different flavors of arm32. The target is controlled by -DCOMPILER_RT_DEFAULT_TARGET.
  2. The new way builds only for a single target at a time, relying on -DCMAKE_<LANG>_COMPILER_TARGET. You also need to set -DCOMPILER_RT_DEFAULT_TARGET_ONLY=ON to disable the legacy logic. This is used by the runtimes build.

I'd like to eventually fully deprecate (1) and make (2) the only way since (1) is a maintenance nightmare: it doesn't use standard CMake logic, instead it invokes Clang directly as add_custom_target, has its own way of handling flags, etc.

@compnerd
Member

No, that patch is not valid - compiler-rt is a grab bag of libraries; we cannot design the build system for just the builtins. If you want to build just the builtins, you need to configure the build accordingly to avoid the other libraries.

@compnerd
Member

@yxsamliu - if you are using the runtimes build, the just built compiler can be used to build the runtimes for the hosts that you would like.

@jhuber6
Contributor Author

jhuber6 commented Jun 13, 2024

> There are two ways to build compiler-rt:
>
> 1. The old way in theory supports building for multiple targets in a single build, but in practice only reliably works for sufficiently similar targets like x86 and x86-64 or different flavors of arm32. The target is controlled by `-DCOMPILER_RT_DEFAULT_TARGET`.
>
> 2. The new way builds only for a single target at a time, relying on `-DCMAKE_<LANG>_COMPILER_TARGET`. You also need to set `-DCOMPILER_RT_DEFAULT_TARGET_ONLY=ON` to disable the legacy logic. This is used by the runtimes build.
>
> I'd like to eventually fully deprecate (1) and make (2) the only way since (1) is a maintenance nightmare: it doesn't use standard CMake logic, instead it invokes Clang directly as add_custom_target, has its own way of handling flags, etc.

Okay, (2) is basically what I've been testing this against. I've run into some issues getting it to work in the runtimes case, however. This seems to stem from the fact that we have a separate LLVM_BUILTIN_TARGETS. I'd like to populate that automatically, if not set, for every LLVM_RUNTIME_TARGETS entry that contains compiler-rt. Is that reasonable?
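
A minimal sketch of that auto-population, assuming it would sit near where `LLVM_RUNTIME_TARGETS` is processed in the runtimes CMake; the loop is illustrative only, not code from this patch.

```cmake
# Illustrative: default LLVM_BUILTIN_TARGETS to every runtime target that
# enables compiler-rt when the user has not set it explicitly.
if(NOT LLVM_BUILTIN_TARGETS)
  foreach(target IN LISTS LLVM_RUNTIME_TARGETS)
    if("compiler-rt" IN_LIST RUNTIMES_${target}_LLVM_ENABLE_RUNTIMES)
      list(APPEND LLVM_BUILTIN_TARGETS ${target})
    endif()
  endforeach()
endif()
```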

@jhuber6
Contributor Author

jhuber6 commented Jun 14, 2024

@petrhosek I'm trying to get it working through the LLVM_ENABLE_RUNTIMES path, but for some reason the CMAKE_CXX_COMPILER_ID doesn't seem to be set. It's set for other runtimes builds using the same interface (libc), but not for the separate builtins build. Is there some special handling we do here?

@jhuber6
Contributor Author

jhuber6 commented Jun 14, 2024

I removed the version check for now since I have no clue why it's failing and it's not strictly necessary. I updated some logic to make the default configuration work when targeted directly. The patch #95554 fixes the issues I had with enabling it.

@yxsamliu with both of these patches, the following build should work.

```console
$ cmake ../llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS=clang;lld    \
    -DRUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES=compiler-rt;libc       \
    -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=compiler-rt;libc         \
    -DLLVM_RUNTIME_TARGETS=default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda
```

This will create <install>/lib/clang/19/lib/amdgcn-amd-amdhsa/libclang_rt.builtins.a. I propose that in the future the clang driver simply checks for the existence of this directory and links it in if present. This is the resource directory, so it's tied to the installation.
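
For reference, the layout described above expressed as a plain install rule; the version and triple are just the ones from the example path, and a real build would go through compiler-rt's own install logic rather than a hand-written rule like this.

```cmake
# Illustrative only: place the device builtins archive into the clang
# resource directory layout described above.
install(FILES libclang_rt.builtins.a
        DESTINATION lib/clang/19/lib/amdgcn-amd-amdhsa)
```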

@arsenm
Contributor

arsenm commented Jun 14, 2024

> Distribution is done as LLVM-IR blobs. GPUs have little backward compatibility, so linking object files is difficult.

I'd rather not bake in this assumption. We should be able to support .o and .bc builds

@jhuber6
Contributor Author

jhuber6 commented Jun 14, 2024

> Distribution is done as LLVM-IR blobs. GPUs have little backward compatibility, so linking object files is difficult.

> I'd rather not bake in this assumption. We should be able to support .o and .bc builds

It's true for now at least, so we get the .bc. I'd love to have an actual object build though, since it would let us resolve things if we ever want the backend to be able to use these.

@arsenm
Contributor

arsenm commented Jun 14, 2024

> It's true for now at least, so we get the .bc. I'd love to have an actual object build though, since it would let us resolve things if we ever want the backend to be able to use these.

Nothing is stopping you from building and using the object today. We just aren't going to guarantee a stable ABI for it now

@jhuber6
Contributor Author

jhuber6 commented Jun 14, 2024

> It's true for now at least, so we get the .bc. I'd love to have an actual object build though, since it would let us resolve things if we ever want the backend to be able to use these.

> Nothing is stopping you from building and using the object today. We just aren't going to guarantee a stable ABI for it now

No stable ABI and incorrect resource usage metadata. The other issue is we'd need to build multiple objects like amdgcn-amd-amdhsa/libclang_rt.gfx1030.a or something. It's definitely something I'm thinking about, but I don't think it's blocking for the initial support.

@jhuber6
Contributor Author

jhuber6 commented Jun 14, 2024

> It's true for now at least, so we get the .bc. I'd love to have an actual object build though, since it would let us resolve things if we ever want the backend to be able to use these.

> Nothing is stopping you from building and using the object today. We just aren't going to guarantee a stable ABI for it now

Potentially we could put the subarchitecture in the triple (I think Arm and DXIL do this) and then build these builtins multiple times. Unfortunately this would take a stupidly long time to configure, since compiler-rt already does a million of these cxx_compiler_has checks, which take several seconds, so if we start building for 40+ architectures it's going to be very, very slow.
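
To make the scaling concern concrete, per-architecture object archives would look roughly like the following, repeated for every supported GPU architecture; the source list and target names are a hypothetical subset for illustration, not from this patch.

```cmake
# Hypothetical per-architecture archives: every extra architecture means
# another full compile of the builtins sources, which is where the
# configure and build time blows up.
set(gpu_builtin_sources divti3.c udivti3.c modti3.c)  # illustrative subset
foreach(arch IN ITEMS gfx906 gfx1030 gfx1100)
  add_library(clang_rt.builtins.${arch} STATIC ${gpu_builtin_sources})
  target_compile_options(clang_rt.builtins.${arch} PRIVATE -mcpu=${arch})
endforeach()
```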

@yxsamliu
Collaborator

> It's true for now at least, so we get the .bc. I'd love to have an actual object build though, since it would let us resolve things if we ever want the backend to be able to use these.

> Nothing is stopping you from building and using the object today. We just aren't going to guarantee a stable ABI for it now

Is the AMDGPU backend able to do ISA-level linking now?

@jhuber6
Contributor Author

jhuber6 commented Jun 14, 2024

> Is the AMDGPU backend able to do ISA-level linking now?

AFAIK, you can do ISA-level linking in the sense that relocations work. However, it's "broken" for the following reasons.

  1. We don't guarantee a consistent function call ABI.
  2. Kernel resource usage metadata relies on traversing the backend SCC in LLVM-IR. The trivial fix would be to make a foreign function call just use maximum resources. A real solution would require the backend to emit function-level metadata and have lld do this part. There's already support for stuff like this for PGO that could likely be reused.
  3. LDS usage will not be calculated correctly from foreign functions; this would require special relocations as well.

So, currently it "works" if you just so happen to have enough registers, don't use LDS, and expect the ABI not to change in the future.

Summary:
This patch adds initial support to build the `builtins` library for GPU
targets. Primarily this requires adding a few new architectures for
`amdgcn` and `nvptx64`. I built this using the following invocations.

```console
$ cmake ../compiler-rt -DCMAKE_C_COMPILER=clang
  -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release -GNinja
  -DCMAKE_C_COMPILER_TARGET=<nvptx64-nvidia-cuda|amdgcn-amd-amdhsa>
  -DCMAKE_CXX_COMPILER_TARGET=<nvptx64-nvidia-cuda|amdgcn-amd-amdhsa>
  -DCMAKE_C_COMPILER_WORKS=1 -DCMAKE_CXX_COMPILER_WORKS=1
  -DLLVM_CMAKE_DIR=../cmake/Modules -DCOMPILER_RT_DEFAULT_TARGET_ONLY=ON
  -C ../compiler-rt/cmake/caches/GPU.cmake
```

Some pointers would be appreciated for how to test this using a standard
(non-default target only) build.

GPU builds are somewhat finicky. We only expect this to be built with a
sufficiently new clang, as it's the only compiler that supports the
target and output we distribute. Distribution is done as LLVM-IR blobs.
GPUs have little backward compatibility, so linking object files is
difficult. However, this prevents us from calling these functions
post-LTO, as they will have been optimized out. Another issue is the
CMake flag-querying functions: currently these fail on nvptx if you
don't have CUDA installed, because they want to use the `ptxas` and
`nvlink` binaries.

More work is necessary to build correctly for all targets and ship into
the correct clang resource directory. Additionally we need to use the
`libc` project's support for running unit tests.
@JonChesterfield
Collaborator

There are three things here.

One is the very complicated logic in compiler-rt cmake for deciding what to build.

Two is the long-standing machine code linking puzzle for amdgpu.

Three is that amdgpu really could do with a compiler-rt builtins to factor parts of the compiler backend / intrinsics into.

We can postpone 2 by noting that builtins.a is an archive and archives can have object or bitcode in. 1 is just really careful programming.

I'd like there to be a target-builtins for the two amdgpu and nvptx targets, provided it serves a similar role to the one on the other targets, and doesn't import a load of gpu-offload-centric design decisions like unexpectedly wrapping everything in x64 elf files or deciding archive files don't mean what they do everywhere else.

At the lowest level, a gpu triple is just a bare metal embedded target. Compiler-rt builtins is the lowest level; it can and should use that model.

@jhuber6
Contributor Author

jhuber6 commented Jun 17, 2024

> There are three things here.
>
> One is the very complicated logic in compiler-rt cmake for deciding what to build.
>
> Two is the long-standing machine code linking puzzle for amdgpu.
>
> Three is that amdgpu really could do with a compiler-rt builtins to factor parts of the compiler backend / intrinsics into.

Making the backend rely on it is difficult because of (2), but agreed.

> We can postpone 2 by noting that builtins.a is an archive and archives can have object or bitcode in. 1 is just really careful programming.
>
> I'd like there to be a target-builtins for the two amdgpu and nvptx targets, provided it serves a similar role to the one on the other targets, and doesn't import a load of gpu-offload-centric design decisions like unexpectedly wrapping everything in x64 elf files or deciding archive files don't mean what they do everywhere else.

The short-term plan here is to make this compiler_rt.builtins.a and have the lld job that the ROCMToolChain in clang creates implicitly link against it if it exists. That should use all the regular LTO machinery.

> At the lowest level, a gpu triple is just a bare metal embedded target. Compiler-rt builtins is the lowest level; it can and should use that model.

Yep, the compilation method here is pretty much just "I want to build compiler-rt for this target" Which in this case just adds a few extra flags to make it work for now.

@arsenm
Contributor

arsenm commented Jun 17, 2024

> Making the backend rely on it is difficult because of (2), but agreed.

Not for compiler-rt; we can just assume it's version locked. The resource problem is also simpler for compiler-rt. We can more easily just define a fixed register budget, and then just assume there's no stack usage.

@jhuber6
Contributor Author

jhuber6 commented Jun 17, 2024

> Making the backend rely on it is difficult because of (2), but agreed.

> Not for compiler-rt; we can just assume it's version locked. The resource problem is also simpler for compiler-rt. We can more easily just define a fixed register budget, and then just assume there's no stack usage.

Yeah, and LDS likely isn't going to be an issue unless we start putting more things in here. Would that require some special handling for like "If callee is a compiler-rt function assume some thread count?" I know for most cases the default for an unknown callee is generous enough to not cause issues.

Also, the issue with machine code linking is that you'd need to somehow wrangle many different builds. This would only work as part of the build system if you put it into the triple (like Arm architectures do), but even that would then necessitate something like 20 separate builds, which would take ages. (Configuring just one of these adds an extra 15 seconds or so to my build.)

The issue with the IR / LTO route is that the library gets resolved / internalized before the backend runs, so we don't know if we actually needed some function from the static library. I'm wondering if it would be possible to do some kind of "late" LTO that runs after everything else. That way we don't need 50 different builds and we get ISA linking that's correct.

@arsenm
Contributor

arsenm commented Jun 17, 2024

> That way we don't need 50 different builds and we get ISA linking that's correct.

The compatibility window isn't that big; we should be able to start using the new generic targets for this too. But ultimately I don't see this as a big deal.

Part of my compiler-rt interest is that it's a restricted support case where we can start with object linking.

@jhuber6
Contributor Author

jhuber6 commented Jun 17, 2024

> That way we don't need 50 different builds and we get ISA linking that's correct.

> The compatibility window isn't that big; we should be able to start using the new generic targets for this too. But ultimately I don't see this as a big deal.

I'm just wondering how this would work in practice. We'd need special logic for both the build system and the linker job in clang. And like I said, building it for even a few architectures is going to add at least 10 minutes to a trivial rebuild.

The end goal is definitely to have this supported in the backend, since it should simplify a lot of stuff. But I think the initial version of this patch doesn't need to tackle the very wide problem of ISA linking on GPUs just yet.

@jhuber6
Contributor Author

jhuber6 commented Jun 21, 2024

Are there any issues with this patch overall? I think this is a good first step and we can investigate ABI linking on GPUs as we add features.

@yxsamliu
Collaborator

LGTM on the HIP side. We could use it like we use the device libraries since it is GPU-arch-neutral. I think it is good to have the bitcode lib built first and then consider adding object libs in a progressive way.

@jhuber6
Contributor Author

jhuber6 commented Jul 1, 2024

ping

@jhuber6
Contributor Author

jhuber6 commented Jul 10, 2024

@compnerd Is there anything that needs to be changed? I'd like to merge this before the LLVM19 fork if possible.

@compnerd
Member

compnerd left a comment

Sure, this seems safe enough to merge as is, I think. I wonder if we can push the flags into the cache file, but that isn't a requirement for this - we can clean that up subsequently.
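
Concretely, the flags from the invocation in the description look like candidates for the cache file; a sketch of what moving them would amount to is below (the real `GPU.cmake` may already set some of these).

```cmake
# Candidate settings to move from the command line into the cache file.
set(COMPILER_RT_DEFAULT_TARGET_ONLY ON CACHE BOOL "")
set(CMAKE_C_COMPILER_WORKS ON CACHE BOOL "")
set(CMAKE_CXX_COMPILER_WORKS ON CACHE BOOL "")
set(CMAKE_BUILD_TYPE Release CACHE STRING "")
```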

jhuber6 merged commit dad7442 into llvm:main on Jul 10, 2024
6 checks passed
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024