
[OpenMP] offload compilation failure with CUDA-12 #60296

Closed · ye-luo opened this issue Jan 25, 2023 · 22 comments

Labels: cuda, openmp, worksforme (Resolved as "works for me")

Comments

@ye-luo (Contributor) commented Jan 25, 2023

main.cpp

#include <cublas_v2.h>
int main()
{}

Error from the nvptx64 device pass:

$ clang++ -fopenmp --offload-arch=sm_80 -I /usr/local/cuda-12.0/include main.cpp 
clang-16: warning: CUDA version is newer than the latest partially supported version 11.8 [-Wunknown-cuda-version]
In file included from main.cpp:1:
In file included from /usr/local/cuda-12.0/include/cublas_v2.h:69:
In file included from /usr/local/cuda-12.0/include/cublas_api.h:77:
In file included from /usr/local/cuda-12.0/include/cuda_fp16.h:4006:
/usr/local/cuda-12.0/include/cuda_fp16.hpp:690:1: error: unknown type name '__CUDA_FP16_DECL__'
__CUDA_FP16_DECL__ __half2 __internal_device_float2_to_half2_rn(const float a, const float b) {
^
/usr/local/cuda-12.0/include/cuda_fp16.hpp:690:27: error: expected ';' after top level declarator
__CUDA_FP16_DECL__ __half2 __internal_device_float2_to_half2_rn(const float a, const float b) {
                          ^
2 errors generated.

Both CUDA Toolkit 12.0 and 12.1 are affected.

@llvmbot (Collaborator) commented Jan 25, 2023

@llvm/issue-subscribers-openmp

@shiltian (Contributor)

The definitions of those special types are guarded by #if defined(__CUDACC__), which NVIDIA's compiler defines automatically but clang does not. You might try defining it and see if it works.
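For example, as a diagnostic experiment rather than a proposed fix, one could add the macro to the reporter's original command line:

clang++ -fopenmp --offload-arch=sm_80 -D__CUDACC__ -I /usr/local/cuda-12.0/include main.cpp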

@ye-luo (Contributor, Author) commented Jan 25, 2023

Adding -D__CUDACC__ doesn't help.

@jdoerfert (Member)

@Artem-B Have you seen this?

@Artem-B (Member) commented Jan 27, 2023

I do not see it with CUDA or C++ compilation, so it's likely something OpenMP-specific.

@shiltian (Contributor)

TBH I don't think cublas_v2.h is designed to be used by a non-NVIDIA compiler in a non-CUDA mode.

@ye-luo (Contributor, Author) commented Jan 28, 2023

cublas_v2.h is from cuBLAS, which is a host library designed to work with any host compiler. I think this is a compatibility issue when clang has OpenMP offload enabled.

@shiltian (Contributor)

I can't reproduce the issue.

➜  tmp clang++ -fopenmp --offload-arch=sm_86 cublas.cpp -o cublas -I /soft/compilers/cuda/cuda-11.8.0/include
clang-16: warning: CUDA version 11.8 is only partially supported [-Wunknown-cuda-version]
clang-16: warning: CUDA version 11.8 is only partially supported [-Wunknown-cuda-version]
clang-16: warning: CUDA version 11.8 is only partially supported [-Wunknown-cuda-version]
➜  tmp which clang
/home/ac.shilei.tian/Documents/build/llvm/release/bin/clang
➜  tmp clang --version
clang version 17.0.0
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/ac.shilei.tian/Documents/build/llvm/release/bin
➜  tmp cat cublas.cpp
#include <cublas_v2.h>
int main(int argc, char *argv[]) { return 0; }

@ye-luo (Contributor, Author) commented Jan 28, 2023

CUDA 12. Your reproduction above used CUDA 11.8 (note the cuda-11.8.0 include path).

@prckent commented Jan 30, 2023

@shiltian See https://docs.nvidia.com/cuda/cublas/index.html#new-and-legacy-cublas-api. cublas_v2.h defines the standard user-facing API for cuBLAS.

@shiltian (Contributor)

@ye-luo @prckent Yeah, unfortunately JLSE doesn't have CUDA 12 installed, so I can't take a look.

@ye-luo (Contributor, Author) commented Jan 31, 2023

> @ye-luo @prckent Yeah, unfortunately JLSE doesn't have CUDA 12 installed, so I can't take a look.

Just installed.

@shiltian (Contributor)

@ye-luo Thanks for the help. I did some investigation and found something interesting.

First, cublas_v2.h is not host-only: the headers it pulls in (here cuda_fp16.hpp) also contain device functions, and __internal_device_float2_to_half2_rn is one of them. They are guarded by __CUDA_ARCH__:

#if defined(__CUDA_ARCH__) || defined(_NVHPC_CUDA)

__CUDA_FP16_DECL__ __half2 __internal_device_float2_to_half2_rn(const float a, const float b) {
    __half2 val;
NV_IF_ELSE_TARGET(NV_PROVIDES_SM_80,
    asm("{ cvt.rn.f16x2.f32 %0, %2, %1; }\n"
        : "=r"(__HALF2_TO_UI(val)) : "f"(a), "f"(b));
,
    asm("{.reg .f16 low,high;\n"
        "  cvt.rn.f16.f32 low, %1;\n"
        "  cvt.rn.f16.f32 high, %2;\n"
        "  mov.b32 %0, {low,high};}\n" : "=r"(__HALF2_TO_UI(val)) : "f"(a), "f"(b));
)
    return val;
}

#endif

__CUDA_FP16_DECL__ is defined by the following preprocessor block:

/* Set up function decorations */
#if defined(__CUDACC__) || defined(_NVHPC_CUDA)
#define __CUDA_FP16_DECL__ static __device__ __inline__
#define __CUDA_HOSTDEVICE_FP16_DECL__ static __host__ __device__ __inline__
#define __VECTOR_FUNCTIONS_DECL__ static __inline__ __host__ __device__
#define __CUDA_HOSTDEVICE__ __host__ __device__
#else /* !defined(__CUDACC__) */
#if defined(__GNUC__)
#define __CUDA_HOSTDEVICE_FP16_DECL__ static __attribute__ ((unused))
#else
#define __CUDA_HOSTDEVICE_FP16_DECL__ static
#endif /* defined(__GNUC__) */
#define __CUDA_HOSTDEVICE__
#endif /* defined(__CUDACC_) */

That also demonstrates that __internal_device_float2_to_half2_rn is a device function.

Second, why is there no issue in pure C++ or CUDA mode? Pure C++ mode does not define __CUDA_ARCH__, so the device-function block is skipped entirely; in CUDA mode, __CUDACC__ is defined in clang/lib/Headers/__clang_cuda_runtime_wrapper.h, so __CUDA_FP16_DECL__ is properly defined.

On the other hand, things break when __CUDA_ARCH__ is defined but __CUDACC__ is not, which is exactly the case when OpenMP target offloading is enabled: __CUDA_FP16_DECL__ is left undefined, causing the errors observed here.

So it looks like we should not define __CUDA_ARCH__ in the device compilation. I'm not sure whether that will break anything; I'll give it a shot.
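The mismatch can be boiled down to a self-contained sketch. This is a simplified mimic of the guards in cuda_fp16.hpp, not NVIDIA's actual code (the helper function name is made up); compiling it with clang++ -fopenmp --offload-arch=sm_80 should hit the same unknown-type-name error in the device pass, where __CUDA_ARCH__ is defined but __CUDACC__ is not, while a plain C++ compile sails through:

// mimic.cpp: reproduces the inconsistent guards of cuda_fp16.hpp.
// Decoration macro, gated on __CUDACC__ (as in the real header):
#if defined(__CUDACC__) || defined(_NVHPC_CUDA)
#define __CUDA_FP16_DECL__ static __device__ __inline__
#endif

// Device function, gated on __CUDA_ARCH__ instead. In an OpenMP
// device pass, __CUDA_ARCH__ is set but __CUDACC__ is not, so the
// macro above was never defined and this line fails to parse.
#if defined(__CUDA_ARCH__) || defined(_NVHPC_CUDA)
__CUDA_FP16_DECL__ int half2_like_helper() { return 0; }  // hypothetical stand-in
#endif

int main() { return 0; }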

@ye-luo (Contributor, Author) commented Jan 31, 2023

Can we have __CUDACC__ defined in the offload device pass?

@shiltian (Contributor) commented Jan 31, 2023

Well, that won't solve the issue here, because it would "unlock" more device functions that we don't want (and whose definitions we don't have). You can try -Xarch_device -D__CUDACC__ and see what happens.
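Applied to the original reproducer, that experiment would look like this (same source file and include path as in the first comment):

clang++ -fopenmp --offload-arch=sm_80 -Xarch_device -D__CUDACC__ -I /usr/local/cuda-12.0/include main.cpp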

@shiltian (Contributor)

https://reviews.llvm.org/D125256 introduced the definition of __CUDA_ARCH__, but I don't think that is what's wrong here. IMO cublas_v2.h doesn't guard the code properly: the function definition is guarded by __CUDA_ARCH__, to indicate whether this is a device compilation, while the decoration macro used in that definition is guarded by __CUDACC__. Not consistent at all. I'm not sure whether we need to revert https://reviews.llvm.org/D125256.

@Artem-B (Member) commented Jan 31, 2023

I'm not familiar with the details of how OpenMP deals with GPU-specific headers. I think it previously needed only a small subset, enough to pull in math functions. I'm not aware of anyone actually trying it with the full CUDA headers before.

> Not consistent at all.

That inconsistency is normal for CUDA headers. They are only intended to be used by NVCC, and clang ends up doing a lot of preprocessor contortions to make them usable. Apparently we need to add more hacks to make them usable for OpenMP, too.

__CUDA_ARCH__ is most likely going to remain set, at least for the compilation of the code that will run on the GPU. Without it, we would not see a lot of the GPU-side function definitions.

However, for the host-facing APIs you will likely need the cublas headers included without __CUDA_ARCH__, which should produce a view of the headers consistent with what a C++ or CUDA host compilation would see.

@shiltian (Contributor) commented Feb 1, 2023

@Artem-B Thanks for the information. I think eventually we are going to need a wrapper header that the driver includes when using CUDA headers.
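A rough sketch of the idea, under heavy assumptions: the file name is hypothetical, #include_next is the usual clang wrapper-header trick, and what __CUDA_FP16_DECL__ should actually expand to in OpenMP device code is exactly the open question (cuda_fp16.hpp uses static __device__ __inline__ under __CUDACC__, but __device__ means nothing to an OpenMP compile):

// __clang_openmp_cuda_fp16_wrapper.h -- hypothetical sketch, not an
// existing clang header. Supply the decoration macro that cuda_fp16.hpp
// only defines under __CUDACC__, so an OpenMP device pass
// (__CUDA_ARCH__ set, __CUDACC__ unset) can still parse the
// __CUDA_ARCH__-guarded device functions.
#if defined(__CUDA_ARCH__) && !defined(__CUDACC__)
#define __CUDA_FP16_DECL__ static __inline__
#endif
#include_next <cuda_fp16.h>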

@ye-luo (Contributor, Author) commented Mar 18, 2023

> The function definition is guarded by __CUDA_ARCH__, to indicate whether this is a device compilation, while the decoration macro used in that definition is guarded by __CUDACC__. Not consistent at all.

I agree with @shiltian.
After changing the guard before __internal_device_float2_to_half2_rn from

#if defined(__CUDA_ARCH__) || defined(_NVHPC_CUDA)

to

#if defined(__CUDACC__) || defined(_NVHPC_CUDA)

I'm able to compile the full qmcpack and pass all its tests.

I also checked that 12.1 remains broken. Let me contact NV.

@Artem-B (Member) commented Mar 20, 2023

> @Artem-B Thanks for the information. I think eventually we are going to need a wrapper header that the driver includes when using CUDA headers.

NVIDIA will not change already-released CUDA versions. Doing selective macro manipulation in an include wrapper (like clang does for CUDA compilation) is probably your only practical choice here, IMO.

@ye-luo (Contributor, Author) commented Mar 20, 2023

Got a message from my NV contact:

> After discussing with the CUDA team, it turns out this bug has been fixed in the latest CUDA. So the next CUDA release would have this fixed. The fix is just like your workaround in https://github.com/llvm/llvm-project/issues/60296#issuecomment-1474628734. So you can use that to work around this for now.

I will report back once I have verified the next release.

ye-luo closed this as completed Mar 20, 2023
@ye-luo (Contributor, Author) commented Apr 21, 2023

Fixed in CUDA 12.1 Update 1.

EugeneZelenko added the worksforme ("Resolved as 'works for me'") label Apr 21, 2023