
[OpenMP] offload compilation failure with CUDA-12 #60296

Closed · ye-luo opened this issue Jan 25, 2023 · 22 comments

Labels: cuda, openmp, worksforme (Resolved as "works for me")

Comments

@ye-luo (Contributor) commented Jan 25, 2023

main.cpp

#include <cublas_v2.h>
int main()
{}

Error from the nvptx64 device pass:

$ clang++ -fopenmp --offload-arch=sm_80 -I /usr/local/cuda-12.0/include main.cpp 
clang-16: warning: CUDA version is newer than the latest partially supported version 11.8 [-Wunknown-cuda-version]
In file included from main.cpp:1:
In file included from /usr/local/cuda-12.0/include/cublas_v2.h:69:
In file included from /usr/local/cuda-12.0/include/cublas_api.h:77:
In file included from /usr/local/cuda-12.0/include/cuda_fp16.h:4006:
/usr/local/cuda-12.0/include/cuda_fp16.hpp:690:1: error: unknown type name '__CUDA_FP16_DECL__'
__CUDA_FP16_DECL__ __half2 __internal_device_float2_to_half2_rn(const float a, const float b) {
^
/usr/local/cuda-12.0/include/cuda_fp16.hpp:690:27: error: expected ';' after top level declarator
__CUDA_FP16_DECL__ __half2 __internal_device_float2_to_half2_rn(const float a, const float b) {
                          ^
2 errors generated.

Both CUDA Toolkit 12.0 and 12.1 are affected.

@llvmbot (Collaborator) commented Jan 25, 2023

@llvm/issue-subscribers-openmp

@shiltian (Contributor)

The definitions of those special types are guarded by #if defined(__CUDACC__), which NVIDIA's compiler defines automatically but clang does not. You might try defining it and see if it works.
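For example, as a diagnostic experiment rather than a proposed fix, one could add the macro to the reporter's original command line:

clang++ -fopenmp --offload-arch=sm_80 -D__CUDACC__ -I /usr/local/cuda-12.0/include main.cpp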

@ye-luo (Contributor, Author) commented Jan 25, 2023

Adding -D__CUDACC__ doesn't help.

@jdoerfert (Member)

@Artem-B Have you seen this?

@Artem-B (Member) commented Jan 27, 2023

I do not see it with CUDA or C++ compilation, so it's likely something OpenMP-specific.

@shiltian (Contributor)

TBH I don't think cublas_v2.h is designed to be used by a non-NVIDIA compiler in a non-CUDA mode.

@ye-luo (Contributor, Author) commented Jan 28, 2023

cublas_v2.h is from cuBLAS, which is a host library designed to work with any host compiler. I think this is a compatibility issue when clang has OpenMP offload enabled.

@shiltian (Contributor)

I can't reproduce the issue.

➜  tmp clang++ -fopenmp --offload-arch=sm_86 cublas.cpp -o cublas -I /soft/compilers/cuda/cuda-11.8.0/include
clang-16: warning: CUDA version 11.8 is only partially supported [-Wunknown-cuda-version]
clang-16: warning: CUDA version 11.8 is only partially supported [-Wunknown-cuda-version]
clang-16: warning: CUDA version 11.8 is only partially supported [-Wunknown-cuda-version]
➜  tmp which clang
/home/ac.shilei.tian/Documents/build/llvm/release/bin/clang
➜  tmp clang --version
clang version 17.0.0
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/ac.shilei.tian/Documents/build/llvm/release/bin
➜  tmp cat cublas.cpp
#include <cublas_v2.h>
int main(int argc, char *argv[]) { return 0; }

@ye-luo (Contributor, Author) commented Jan 28, 2023

CUDA 12. Your reproduction above used CUDA 11.8 (note the cuda-11.8.0 include path).

@prckent commented Jan 30, 2023

@shiltian See https://docs.nvidia.com/cuda/cublas/index.html#new-and-legacy-cublas-api. cublas_v2.h defines the standard user-facing API for cuBLAS.

@shiltian (Contributor)

@ye-luo @prckent Yeah, unfortunately JLSE doesn't have CUDA 12 installed, so I can't take a look.

@ye-luo (Contributor, Author) commented Jan 31, 2023

> @ye-luo @prckent Yeah, unfortunately JLSE doesn't have CUDA 12 installed, so I can't take a look.

Just installed.

@shiltian (Contributor)

@ye-luo Thanks for the help. I did some investigation and found something interesting.

First, cublas_v2.h is not host-only: the headers it pulls in (here cuda_fp16.hpp) also contain device functions, and __internal_device_float2_to_half2_rn is one of them. They are guarded by __CUDA_ARCH__:

#if defined(__CUDA_ARCH__) || defined(_NVHPC_CUDA)

__CUDA_FP16_DECL__ __half2 __internal_device_float2_to_half2_rn(const float a, const float b) {
    __half2 val;
NV_IF_ELSE_TARGET(NV_PROVIDES_SM_80,
    asm("{ cvt.rn.f16x2.f32 %0, %2, %1; }\n"
        : "=r"(__HALF2_TO_UI(val)) : "f"(a), "f"(b));
,
    asm("{.reg .f16 low,high;\n"
        "  cvt.rn.f16.f32 low, %1;\n"
        "  cvt.rn.f16.f32 high, %2;\n"
        "  mov.b32 %0, {low,high};}\n" : "=r"(__HALF2_TO_UI(val)) : "f"(a), "f"(b));
)
    return val;
}

#endif

__CUDA_FP16_DECL__ is defined by the following preprocessor block:

/* Set up function decorations */
#if defined(__CUDACC__) || defined(_NVHPC_CUDA)
#define __CUDA_FP16_DECL__ static __device__ __inline__
#define __CUDA_HOSTDEVICE_FP16_DECL__ static __host__ __device__ __inline__
#define __VECTOR_FUNCTIONS_DECL__ static __inline__ __host__ __device__
#define __CUDA_HOSTDEVICE__ __host__ __device__
#else /* !defined(__CUDACC__) */
#if defined(__GNUC__)
#define __CUDA_HOSTDEVICE_FP16_DECL__ static __attribute__ ((unused))
#else
#define __CUDA_HOSTDEVICE_FP16_DECL__ static
#endif /* defined(__GNUC__) */
#define __CUDA_HOSTDEVICE__
#endif /* defined(__CUDACC_) */

That also demonstrates that __internal_device_float2_to_half2_rn is a device function.

Second, why is there no issue in pure C++ or CUDA mode? Pure C++ mode does not define __CUDA_ARCH__, so the device-function block is skipped entirely; in CUDA mode, __CUDACC__ is defined in clang/lib/Headers/__clang_cuda_runtime_wrapper.h, so __CUDA_FP16_DECL__ is properly defined.

On the other hand, things break when __CUDA_ARCH__ is defined but __CUDACC__ is not, which is exactly the case when OpenMP target offloading is enabled: __CUDA_FP16_DECL__ is left undefined, causing the errors observed here.

So it looks like we should not define __CUDA_ARCH__ in the device compilation. I'm not sure whether that will break anything; I'll give it a shot.
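The mismatch can be boiled down to a self-contained sketch. This is a simplified mimic of the guards in cuda_fp16.hpp, not NVIDIA's actual code (the helper function name is made up); compiling it with clang++ -fopenmp --offload-arch=sm_80 should hit the same unknown-type-name error in the device pass, where __CUDA_ARCH__ is defined but __CUDACC__ is not, while a plain C++ compile sails through:

// mimic.cpp: reproduces the inconsistent guards of cuda_fp16.hpp.
// Decoration macro, gated on __CUDACC__ (as in the real header):
#if defined(__CUDACC__) || defined(_NVHPC_CUDA)
#define __CUDA_FP16_DECL__ static __device__ __inline__
#endif

// Device function, gated on __CUDA_ARCH__ instead. In an OpenMP
// device pass, __CUDA_ARCH__ is set but __CUDACC__ is not, so the
// macro above was never defined and this line fails to parse.
#if defined(__CUDA_ARCH__) || defined(_NVHPC_CUDA)
__CUDA_FP16_DECL__ int half2_like_helper() { return 0; }  // hypothetical stand-in
#endif

int main() { return 0; }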

@ye-luo (Contributor, Author) commented Jan 31, 2023

Can we have __CUDACC__ defined in the offload device pass?

@shiltian (Contributor) commented Jan 31, 2023

Well, that won't solve the issue here, because it would "unlock" more device functions that we don't want (and whose definitions we don't have). You can try -Xarch_device -D__CUDACC__ and see what happens.
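Applied to the original reproducer, that experiment would look like this (same source file and include path as in the first comment):

clang++ -fopenmp --offload-arch=sm_80 -Xarch_device -D__CUDACC__ -I /usr/local/cuda-12.0/include main.cpp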

@shiltian (Contributor)

https://reviews.llvm.org/D125256 introduced the definition of __CUDA_ARCH__, but I don't think that is what's wrong here. IMO cublas_v2.h doesn't guard the code properly: the function definition is guarded by __CUDA_ARCH__, to indicate whether this is a device compilation, while the decoration macro used in that definition is guarded by __CUDACC__. Not consistent at all. I'm not sure whether we need to revert https://reviews.llvm.org/D125256.

@Artem-B (Member) commented Jan 31, 2023

I'm not familiar with the details of how OpenMP deals with GPU-specific headers. I think it previously needed only a small subset, enough to pull in math functions. I'm not aware of anyone actually trying it with the full CUDA headers before.

> Not consistent at all.

That inconsistency is normal for CUDA headers. They are only intended to be used by NVCC, and clang ends up doing a lot of preprocessor contortions to make them usable. Apparently we need to add more hacks to make them usable for OpenMP, too.

__CUDA_ARCH__ is most likely going to remain set, at least for the compilation of the code that will run on the GPU. Without it, we would not see a lot of the GPU-side function definitions.

However, for the host-facing APIs you will likely need the cublas headers included without __CUDA_ARCH__, which should produce a view of the headers consistent with what a C++ or CUDA host compilation would see.

@shiltian (Contributor) commented Feb 1, 2023

@Artem-B Thanks for the information. I think eventually we are going to need a wrapper header that the driver includes when using CUDA headers.
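A rough sketch of the idea, under heavy assumptions: the file name is hypothetical, #include_next is the usual clang wrapper-header trick, and what __CUDA_FP16_DECL__ should actually expand to in OpenMP device code is exactly the open question (cuda_fp16.hpp uses static __device__ __inline__ under __CUDACC__, but __device__ means nothing to an OpenMP compile):

// __clang_openmp_cuda_fp16_wrapper.h -- hypothetical sketch, not an
// existing clang header. Supply the decoration macro that cuda_fp16.hpp
// only defines under __CUDACC__, so an OpenMP device pass
// (__CUDA_ARCH__ set, __CUDACC__ unset) can still parse the
// __CUDA_ARCH__-guarded device functions.
#if defined(__CUDA_ARCH__) && !defined(__CUDACC__)
#define __CUDA_FP16_DECL__ static __inline__
#endif
#include_next <cuda_fp16.h>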

@ye-luo (Contributor, Author) commented Mar 18, 2023

> The function definition is guarded by __CUDA_ARCH__, to indicate whether this is a device compilation, while the decoration macro used in that definition is guarded by __CUDACC__. Not consistent at all.

I agree with @shiltian.
After changing the guard before __internal_device_float2_to_half2_rn from

#if defined(__CUDA_ARCH__) || defined(_NVHPC_CUDA)

to

#if defined(__CUDACC__) || defined(_NVHPC_CUDA)

I'm able to compile the full qmcpack and pass all its tests.

I also checked that 12.1 remains broken. Let me contact NV.

@Artem-B (Member) commented Mar 20, 2023

> @Artem-B Thanks for the information. I think eventually we are going to need a wrapper header that the driver includes when using CUDA headers.

NVIDIA will not change already-released CUDA versions. Doing selective macro manipulation in an include wrapper (like clang does for CUDA compilation) is probably your only practical choice here, IMO.

@ye-luo (Contributor, Author) commented Mar 20, 2023

Got a message from my NV contact:

> After discussing with the CUDA team, it turns out this bug has been fixed in the latest CUDA. So the next CUDA release would have this fixed. The fix is just like your workaround in https://github.com/llvm/llvm-project/issues/60296#issuecomment-1474628734. So you can use that to work around this for now.

I will report back once I have verified the next release.

ye-luo closed this as completed Mar 20, 2023
@ye-luo (Contributor, Author) commented Apr 21, 2023

Fixed in CUDA 12.1 Update 1.

EugeneZelenko added the worksforme ("Resolved as 'works for me'") label Apr 21, 2023