Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

With gcc-12's libstdc++, clang-15 rc2 still cannot compile CUDA/HIP code because of __noinline__ #57544

Closed
littlewu2508 opened this issue Sep 3, 2022 · 25 comments · Fixed by llvm/llvm-project-release-prs#698
Assignees
Labels

Comments

@littlewu2508
Copy link

littlewu2508 commented Sep 3, 2022

Although https://reviews.llvm.org/D124866 aims to fix bugs such as NVIDIA/thrust#1703 or https://bugs.gentoo.org/857126

When I tried clang-15 rc2, the issue still exists.

Reproduce method:

pure C++ test.C

#define __noinline__ __attribute__((noinline))
#include<future>

int main()
{
        return 0;
}

Run clang++ test.C

HIP code example test.hip

#include <hip/hip_runtime.h>
#include<future>

int main()
{
        return 0;
}

Run clang++ -nogpulib -nogpuinc -x hip test.hip

Both gives error: g++-v12/bits/shared_ptr_base.h:196:22: error: use of undeclared identifier 'noinline'; did you mean 'inline'?

@littlewu2508
Copy link
Author

I looked at the commit blurb between rc3 and rc2, and I guess there is no fix focusing on this, so I decided to report. If it's fixed in rc3, please ignore me

@littlewu2508
Copy link
Author

@yxsamliu CC'ing the author of https://reviews.llvm.org/D124866

@Artem-B
Copy link
Member

Artem-B commented Sep 7, 2022

It appears that something have changed in libstdc++ headers between gcc-11.3 and 12.1:
https://cuda.godbolt.org/z/EvzG5TPzr

@littlewu2508
Copy link
Author

It appears that something have changed in libstdc++ headers between gcc-11.3 and 12.1: https://cuda.godbolt.org/z/EvzG5TPzr

Right, that's why many error emerges when gcc12 released: https://bugs.gentoo.org/857126

I thought https://reviews.llvm.org/D124866 is just for it but it does not.

@yxsamliu
Copy link
Collaborator

yxsamliu commented Sep 7, 2022

with gcc12.1

/opt/compiler-explorer/gcc-12.1.0/lib/gcc/x86_64-linux-gnu/12.1.0/../../../../include/c++/12.1.0/bits/shared_ptr_base.h:196:22: error: use of undeclared identifier 'noinline'; did you mean 'inline'?
      __attribute__((__noinline__))
                     ^
/opt/compiler-explorer/cuda/11.7.0//include/crt/host_defines.h:83:24: note: expanded from macro '__noinline__'
        __attribute__((noinline))

CUDA header defines __noinline__ as __attribute__((noinline)). gcc12 uses __attribute__((__noinline__)). This ends up as __attribute__((__attribute__((noinline)))). clang emits error for that. One workaround is to intercept host_defines.h and undef __noinline__ in clang default include header for CUDA. HIP is not subject to this issue.

@littlewu2508
Copy link
Author

littlewu2508 commented Sep 7, 2022

CUDA header defines __noinline__ as __attribute__((noinline)). gcc12 uses __attribute__((__noinline__)). This ends up as __attribute__((__attribute__((noinline)))). clang emits error for that. One workaround is to intercept host_defines.h and undef __noinline__ in clang default include header for CUDA.

Yes, that's what I'm currently doing, but it's a bit ugly and requires a big patch for large projects, which means maintaining difficulties.

HIP is not subject to this issue.

HIP also suffers: https://github.com/ROCm-Developer-Tools/hipamd/blob/bf49b11d07064ec56ea130c3a73beb216d81f582/include/hip/amd_detail/host_defines.h#L51

And that's why bugs like https://bugs.gentoo.org/857126 is present.

@yxsamliu
Copy link
Collaborator

yxsamliu commented Sep 7, 2022

CUDA header defines __noinline__ as __attribute__((noinline)). gcc12 uses __attribute__((__noinline__)). This ends up as __attribute__((__attribute__((noinline)))). clang emits error for that. One workaround is to intercept host_defines.h and undef __noinline__ in clang default include header for CUDA.

Yes, that's what I'm currently doing, but it's a bit ugly and requires a big patch for large projects, which means maintaining difficulties.

HIP is not subject to this issue.

HIP also suffers: https://github.com/ROCm-Developer-Tools/hipamd/blob/bf49b11d07064ec56ea130c3a73beb216d81f582/include/hip/amd_detail/host_defines.h#L51

And that's why bugs like https://bugs.gentoo.org/857126 is present.

There is a condition for that macro !__has_feature(cuda_noinline_keyword). My patch https://reviews.llvm.org/D124866 enable that feature, therefore HIP no longer defines that macro. In that sense, I said 'HIP is not subject to this issue'.

@littlewu2508
Copy link
Author

@yxsamliu I carefully read your patch, and figured out I misunderstood it previously. With your patch, CUDA/HIP can compile without define #define __noinline__ __attribute__((noinline)) in their api headers (I also made a MWE which confirms that). So with your patch, and removing #define __noinline__ __attribute__((noinline)) as well, problem with gcc-12 solved.

Thanks for your extra explanation!

@EugeneZelenko EugeneZelenko added the worksforme Resolved as "works for me" label Sep 7, 2022
jwakely referenced this issue in gcc-mirror/gcc Apr 27, 2023
This rewrites _Sp_counted_base::_M_release to skip the two atomic
instructions that decrement each of the use count and the weak count
when both are 1.

Benefits: Save the cost of the last atomic decrements of each of the use
count and the weak count in _Sp_counted_base. Atomic instructions are
significantly slower than regular loads and stores across major
architectures.

How current code works: _M_release() atomically decrements the use
count, checks if it was 1, if so calls _M_dispose(), atomically
decrements the weak count, checks if it was 1, and if so calls
_M_destroy().

How the proposed algorithm works: _M_release() loads both use count and
weak count together atomically (assuming suitable alignment, discussed
later), checks if the value corresponds to a 0x1 value in the individual
count members, and if so calls _M_dispose() and _M_destroy().
Otherwise, it follows the original algorithm.

Why it works: When the current thread executing _M_release() finds each
of the counts is equal to 1, then no other threads could possibly hold
use or weak references to this control block. That is, no other threads
could possibly access the counts or the protected object.

There are two crucial high-level issues that I'd like to point out first:
- Atomicity of access to the counts together
- Proper alignment of the counts together

The patch is intended to apply the proposed algorithm only to the case of
64-bit mode, 4-byte counts, and 8-byte aligned _Sp_counted_base.

** Atomicity **
- The proposed algorithm depends on the mutual atomicity among 8-byte
atomic operations and 4-byte atomic operations on each of the 4-byte halves
of the 8-byte aligned 8-byte block.
- The standard does not guarantee atomicity of 8-byte operations on a pair
of 8-byte aligned 4-byte objects.
- To my knowledge this works in practice on systems that guarantee native
implementation of 4-byte and 8-byte atomic operations.
- __atomic_always_lock_free is used to check for native atomic operations.

** Alignment **
- _Sp_counted_base is an internal base class, with a virtual destructor,
so it has a vptr at the beginning of the class, and will be aligned to
alignof(void*) i.e. 8 bytes.
- The first members of the class are the 4-byte use count and 4-byte
weak count, which will occupy 8 contiguous bytes immediately after the
vptr, i.e. they form an 8-byte aligned 8 byte range.

Other points:
- The proposed algorithm can interact correctly with the current algorithm.
That is, multiple threads using different versions of the code with and
without the patch operating on the same objects should always interact
correctly. The intent for the patch is to be ABI compatible with the
current implementation.
- The proposed patch involves a performance trade-off between saving the
costs of atomic instructions when the counts are both 1 vs adding the cost
of loading the 8-byte combined counts and comparison with {0x1, 0x1}.
- I noticed a big difference between the code generated by GCC vs LLVM. GCC
seems to generate noticeably more code and what seems to be redundant null
checks and branches.
- The patch has been in use (built using LLVM) in a large environment for
many months. The performance gains outweigh the losses (roughly 10 to 1)
across a large variety of workloads.

Signed-off-by: Jonathan Wakely <jwakely@redhat.com>

Co-authored-by: Jonathan Wakely <jwakely@redhat.com>

libstdc++-v3/ChangeLog:

	* include/bits/c++config (_GLIBCXX_TSAN): Define macro
	indicating that TSan is in use.
	* include/bits/shared_ptr_base.h (_Sp_counted_base::_M_release):
	Replace definition in primary template with explicit
	specializations for _S_mutex and _S_atomic policies.
	(_Sp_counted_base<_S_mutex>::_M_release): New specialization.
	(_Sp_counted_base<_S_atomic>::_M_release): New specialization,
	using a single atomic load to access both reference counts at
	once.
	(_Sp_counted_base::_M_release_last_use): New member function.
@Artem-B
Copy link
Member

Artem-B commented Apr 27, 2023

Looks like we do still have this problem:

echo '#include <memory>'|  bin/clang --cuda-path=$HOME/local/cuda-12.1.1 -x cuda --offload-arch=sm_60 - -c -stdlib=libstdc++ --cuda-device-only -c
clang: warning: CUDA version is newer than the latest partially supported version 11.8 [-Wunknown-cuda-version]
In file included from <stdin>:1:
In file included from /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/memory:77:
In file included from /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr.h:53:
/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr_base.h:196:22: error: use of undeclared identifier 'noinline'; did you mean 'inline'?
      __attribute__((__noinline__))
                     ^
/usr/local/google/home/tra/local/cuda-12.1.1/include/crt/host_defines.h:83:24: note: expanded from macro '__noinline__'
        __attribute__((noinline))
                       ^
In file included from <stdin>:1:
In file included from /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/memory:77:
In file included from /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr.h:53:
/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr_base.h:196:22: error: type name does not allow function specifier to be specified
/usr/local/google/home/tra/local/cuda-12.1.1/include/crt/host_defines.h:83:24: note: expanded from macro '__noinline__'
        __attribute__((noinline))
                       ^
In file included from <stdin>:1:
In file included from /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/memory:77:
In file included from /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr.h:53:
/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr_base.h:196:22: error: expected expression
/usr/local/google/home/tra/local/cuda-12.1.1/include/crt/host_defines.h:83:33: note: expanded from macro '__noinline__'
        __attribute__((noinline))
                                ^
3 errors generated when compiling for sm_60.

Proposed workaround: https://reviews.llvm.org/D149364

@Artem-B Artem-B reopened this Apr 27, 2023
@Artem-B Artem-B self-assigned this Apr 27, 2023
@EugeneZelenko EugeneZelenko removed the worksforme Resolved as "works for me" label Apr 27, 2023
@ewof
Copy link

ewof commented Jun 10, 2023

pls figgs

@Yaraslaut
Copy link
Contributor

Hello, is there any news on this issue?

@Artem-B
Copy link
Member

Artem-B commented Sep 12, 2023

The workaround has landed in a50e54f and should be present in clang-17.

clang-15 or -16 will not have the fix.

@Artem-B Artem-B closed this as completed Sep 12, 2023
@Yaraslaut
Copy link
Contributor

Yaraslaut commented Sep 12, 2023

I am still see some issues of this kind

In file included from /usr/lib64/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/string:55:
/usr/lib64/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/basic_string.tcc:473:20: error: expected expression
/opt/cuda/include/crt/host_defines.h:83:33: note: expanded from macro '__noinline__'
   83 |         __attribute__((noinline))
      |                                 ^
clang version 17.0.0 (https://github.com/llvm/llvm-project.git 0176e8729ea4a2cff1ec6689c7620a9f37ce9904)

@Artem-B
Copy link
Member

Artem-B commented Sep 12, 2023

That's a new instance. Looks like we'll be playing this game of whack-a-mole for a while. :-(

@Artem-B
Copy link
Member

Artem-B commented Sep 12, 2023

Can you grep your /usr/lib64/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1 for other instances of __noinline__ and send me the list of headers it's present in? I want to have an idea how widespread the issue is, rather than fixing one just to run into another one.

@Yaraslaut
Copy link
Contributor

full list of __noinline__ attributes

/u/i/c/13.2.1 ❱ pwd
/usr/include/c++/13.2.1
/u/i/c/13.2.1 ❱ rg "__noinline__"
stacktrace
251:      [[__gnu__::__noinline__]]
267:      [[__gnu__::__noinline__]]
287:      [[__gnu__::__noinline__]]

bits/shared_ptr_base.h
196:      __attribute__((__noinline__))

bits/basic_string.tcc
473:    __attribute__((__noinline__, __noclone__, __cold__)) void

bits/basic_string.h
2532:      __attribute__((__noinline__, __noclone__, __cold__)) void

experimental/bits/simd_detail.h
263:#define _GLIBCXX_SIMD_NEVER_INLINE [[__gnu__::__noinline__]]

@Artem-B
Copy link
Member

Artem-B commented Sep 12, 2023

If you try to include <stacktrace> from a CUDA source, does it compile or fail?

@Yaraslaut
Copy link
Contributor

with [[__gnu__::__noinline__]] everything is ok

@Artem-B
Copy link
Member

Artem-B commented Sep 12, 2023

I have no way to install new libstdc++ on my machine. Would you be able to test the patch for me?
Artem-B@227d9a7

@Yaraslaut
Copy link
Contributor

Sure, will test it now

@Artem-B
Copy link
Member

Artem-B commented Sep 12, 2023

For the record, here is the full list of occurrences of __noinline__ in libstdc++:
https://github.com/search?q=repo%3Agcc-mirror%2Fgcc+path%3Alibstd+-path%3Atestsuite+__noinline&type=code

image

@Yaraslaut
Copy link
Contributor

I have no way to install new libstdc++ on my machine. Would you be able to test the patch for me? Artem-B@227d9a7

you just forgot to add this files in CMakeList.txt

diff --git a/clang/lib/Headers/CMakeLists.txt b/clang/lib/Headers/CMakeLists.txt
index 39030d433a61..d1ea5fa1a3e7 100644
--- a/clang/lib/Headers/CMakeLists.txt
+++ b/clang/lib/Headers/CMakeLists.txt
@@ -289,6 +289,8 @@ set(cuda_wrapper_files

 set(cuda_wrapper_bits_files
   cuda_wrappers/bits/shared_ptr_base.h
+  cuda_wrappers/bits/basic_string.tcc
+  cuda_wrappers/bits/basic_string.h
 )

 set(ppc_wrapper_files

and now I can finally compile with clang

@Artem-B
Copy link
Member

Artem-B commented Sep 12, 2023

/cherry-pick 588023d

@llvmbot
Copy link
Collaborator

llvmbot commented Sep 12, 2023

/branch llvm/llvm-project-release-prs/issue57544

llvmbot pushed a commit to llvm/llvm-project-release-prs that referenced this issue Sep 12, 2023
Fixes llvm/llvm-project#57544

(cherry picked from commit 588023ddafb4b0cd11914ab068c6d07187374d69)
@llvmbot
Copy link
Collaborator

llvmbot commented Sep 12, 2023

/pull-request llvm/llvm-project-release-prs#698

ZijunZhaoCCK pushed a commit to ZijunZhaoCCK/llvm-project that referenced this issue Sep 19, 2023
tru pushed a commit to llvm/llvm-project-release-prs that referenced this issue Sep 29, 2023
Fixes llvm/llvm-project#57544

(cherry picked from commit 588023ddafb4b0cd11914ab068c6d07187374d69)
markdewing added a commit to markdewing/qmcpack that referenced this issue Oct 9, 2023
Deals with clang as the compiler in CUDA mode with gcc/glibc 12 include files.
Clang was fixed to recognize __noinline__ as an attribute,
but the CUDA include files still have a define for __noinline__.
The attribute started to be used in gcc 12 include files, which expands
to __attribute__((__attribute__((noinline)))), which the compiler rejects.
See NVIDIA/thrust#1703 and
    llvm/llvm-project#57544
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging a pull request may close this issue.

7 participants