With gcc-12's libstdc++, clang-15 rc2 still cannot compile CUDA/HIP code because of __noinline__ #57544
I looked at the commit log between rc2 and rc3, and it seems there is no fix for this, so I decided to report it. If it's fixed in rc3, please ignore this.
@yxsamliu CC'ing the author of https://reviews.llvm.org/D124866
It appears that something has changed in the libstdc++ headers between gcc-11.3 and 12.1:
Right, that's why many errors emerged when gcc-12 was released: https://bugs.gentoo.org/857126 I thought https://reviews.llvm.org/D124866 covered this, but it does not.
with gcc-12.1
CUDA header defines
Yes, that's what I'm currently doing, but it's a bit ugly and requires a big patch for large projects, which makes them hard to maintain.
HIP also suffers: https://github.com/ROCm-Developer-Tools/hipamd/blob/bf49b11d07064ec56ea130c3a73beb216d81f582/include/hip/amd_detail/host_defines.h#L51 And that's why bugs like https://bugs.gentoo.org/857126 are present.
There is a condition for that macro
@yxsamliu I carefully read your patch and figured out that I had misunderstood it previously. With your patch, CUDA/HIP can compile without that define. Thanks for your extra explanation!
This rewrites _Sp_counted_base::_M_release to skip the two atomic instructions that decrement each of the use count and the weak count when both are 1.

Benefits: Save the cost of the last atomic decrements of each of the use count and the weak count in _Sp_counted_base. Atomic instructions are significantly slower than regular loads and stores across major architectures.

How the current code works: _M_release() atomically decrements the use count, checks if it was 1, if so calls _M_dispose(), atomically decrements the weak count, checks if it was 1, and if so calls _M_destroy().

How the proposed algorithm works: _M_release() loads both use count and weak count together atomically (assuming suitable alignment, discussed later), checks if the value corresponds to a 0x1 value in each of the individual count members, and if so calls _M_dispose() and _M_destroy(). Otherwise, it follows the original algorithm.

Why it works: When the current thread executing _M_release() finds each of the counts equal to 1, no other thread can possibly hold a use or weak reference to this control block. That is, no other thread can possibly access the counts or the protected object.

There are two crucial high-level issues to point out first:
- Atomicity of access to the counts together
- Proper alignment of the counts together

The patch is intended to apply the proposed algorithm only to the case of 64-bit mode, 4-byte counts, and 8-byte aligned _Sp_counted_base.

** Atomicity **
- The proposed algorithm depends on the mutual atomicity among 8-byte atomic operations and 4-byte atomic operations on each of the 4-byte halves of the 8-byte aligned 8-byte block.
- The standard does not guarantee atomicity of 8-byte operations on a pair of 8-byte aligned 4-byte objects.
- To my knowledge this works in practice on systems that guarantee native implementation of 4-byte and 8-byte atomic operations.
- __atomic_always_lock_free is used to check for native atomic operations.

** Alignment **
- _Sp_counted_base is an internal base class with a virtual destructor, so it has a vptr at the beginning of the class and is aligned to alignof(void*), i.e. 8 bytes.
- The first members of the class are the 4-byte use count and the 4-byte weak count, which occupy 8 contiguous bytes immediately after the vptr, i.e. they form an 8-byte aligned 8-byte range.

Other points:
- The proposed algorithm can interact correctly with the current algorithm. That is, multiple threads using different versions of the code, with and without the patch, operating on the same objects always interact correctly. The patch is intended to be ABI compatible with the current implementation.
- The proposed patch involves a performance trade-off between saving the costs of the atomic instructions when both counts are 1 and adding the cost of loading the 8-byte combined counts and comparing with {0x1, 0x1}.
- I noticed a big difference between the code generated by GCC and by LLVM. GCC seems to generate noticeably more code, with what appear to be redundant null checks and branches.
- The patch has been in use (built using LLVM) in a large environment for many months. The performance gains outweigh the losses (roughly 10 to 1) across a large variety of workloads.

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
Co-authored-by: Jonathan Wakely <jwakely@redhat.com>

libstdc++-v3/ChangeLog:

* include/bits/c++config (_GLIBCXX_TSAN): Define macro indicating that TSan is in use.
* include/bits/shared_ptr_base.h (_Sp_counted_base::_M_release): Replace definition in primary template with explicit specializations for _S_mutex and _S_atomic policies.
(_Sp_counted_base<_S_mutex>::_M_release): New specialization.
(_Sp_counted_base<_S_atomic>::_M_release): New specialization, using a single atomic load to access both reference counts at once.
(_Sp_counted_base::_M_release_last_use): New member function.
Looks like we do still have this problem:
Proposed workaround: https://reviews.llvm.org/D149364
pls figgs |
Hello, is there any news on this issue? |
The workaround has landed in a50e54f and should be present in clang-17; clang-15 and -16 will not have the fix.
I am still seeing some issues of this kind:
That's a new instance. Looks like we'll be playing this game of whack-a-mole for a while. :-(
Can you grep your
full list of
If you try to include
with
I have no way to install a new libstdc++ on my machine. Would you be able to test the patch for me?
Sure, will test it now
For the record, here is the full list of occurrences of
You just forgot to add these files in CMakeLists.txt,
and now I can finally compile with clang.
/cherry-pick 588023d
/branch llvm/llvm-project-release-prs/issue57544
Fixes llvm/llvm-project#57544 (cherry picked from commit 588023ddafb4b0cd11914ab068c6d07187374d69)
/pull-request llvm/llvm-project-release-prs#698
Deals with clang as the compiler in CUDA mode with gcc/glibc 12 include files. Clang was fixed to recognize __noinline__ as an attribute, but the CUDA include files still define __noinline__ as a macro. The attribute started to be used in the gcc-12 include files; after macro expansion, __attribute__((__noinline__)) becomes __attribute__((__attribute__((noinline)))), which the compiler rejects. See NVIDIA/thrust#1703 and llvm/llvm-project#57544
Although https://reviews.llvm.org/D124866 aims to fix bugs such as NVIDIA/thrust#1703 or https://bugs.gentoo.org/857126,
the issue still exists when I try clang-15 rc2.
Steps to reproduce:
pure C++ test.C
Run
clang++ test.C
HIP code example test.hip
Run
clang++ -nogpulib -nogpuinc -x hip test.hip
Both give the error:
g++-v12/bits/shared_ptr_base.h:196:22: error: use of undeclared identifier 'noinline'; did you mean 'inline'?