[CUDA][HIP] Defer device diagnostics in implicit H+D explicit instantiations#197214
Conversation
…instantiation
For an explicit class template instantiation such as `template <typename T>
struct B { virtual ~B() = default; }; template class B<float>;`, the
device-side vtable for B<float> needs a slot for the destructor. Without
help, clang puts a reference to ~B<float> in the vtable but does not emit
a body for it on the device side, leaving an unresolved external symbol
that ptxas or the device linker rejects.
PR llvm#129117 addressed this by marking the destructor as always emitted on
the device side, and by pushing it into the deferred-diag visitor so that
its body is checked even when no device code uses the class. With the
body emitted, the vtable reference resolves cleanly.
A common pattern in C++ code that uses libstdc++ is a class template like
`template <typename T> struct Derived { virtual ~Derived() = default;
std::vector<T> v; }; template class Derived<double>;`. Here
~Derived<double> chains through std::vector<double>::~vector, which is
constexpr and therefore implicit `__host__` `__device__`, but on the
runtime branch eventually calls ::operator delete, which is host only.
With the previous approach the deferred-diag visitor walks ~Derived's
body even when no device code uses Derived<double>, follows the chain,
and reports "reference to `__host__` function 'operator delete' in
`__host__` `__device__` function". The user has written no device code
that touches the class, but the diagnostic still fires.
The new approach drops the force-emit and the eager push. Instead, when
CGVTables is filling a destructor slot in a device-side vtable, it writes
a NULL pointer when the destructor's `__host__` and `__device__`
attributes are both implicit (the user did not opt in to device
emission), the parent class is an explicit template instantiation, and
no code in this translation unit has referenced the destructor yet (the
vtable build is the only requester). The vtable stays well-formed, the
slot is simply NULL, and no destructor body emission is forced. When
device code does construct an instance, the destructor is referenced
before the vtable is built, so the symbol exists and the slot gets the
real pointer through the normal path. Host-side vtables are unchanged.
|
@llvm/pr-subscribers-clang-codegen @llvm/pr-subscribers-clang Author: Yaxun (Sam) Liu (yxsamliu) ChangesFor an explicit class template instantiation such as PR #129117 addressed this by marking the destructor as always emitted on A common pattern in C++ code that uses libstdc++ is a class template like The new approach drops the force-emit and the eager push. Instead, when Full diff: https://github.com/llvm/llvm-project/pull/197214.diff 6 Files Affected:
diff --git a/clang/lib/CodeGen/CGVTables.cpp b/clang/lib/CodeGen/CGVTables.cpp
index 99640f5ce2ad1..8dc7e92746dd0 100644
--- a/clang/lib/CodeGen/CGVTables.cpp
+++ b/clang/lib/CodeGen/CGVTables.cpp
@@ -811,6 +811,35 @@ void CodeGenVTables::addVTableComponent(ConstantArrayBuilder &builder,
return builder.add(
llvm::ConstantExpr::getNullValue(CGM.GlobalsInt8PtrTy));
}
+ // On device, an implicit __host__ __device__ virtual destructor of an
+ // explicit class template instantiation should fill its vtable slot
+ // only if device code has actually referenced the destructor. Otherwise
+ // emitting a real pointer would force body emission whose host-only
+ // callees (e.g. through libstdc++ destructor chains) can become
+ // unresolved external references at link time. The host-side vtable
+ // still gets the real pointer.
+ if (CGM.getLangOpts().CUDAIsDevice) {
+ if (const auto *Dtor = dyn_cast<CXXDestructorDecl>(MD)) {
+ const auto *HAttr = Dtor->getAttr<CUDAHostAttr>();
+ const auto *DAttr = Dtor->getAttr<CUDADeviceAttr>();
+ bool IsImplicitHD =
+ HAttr && DAttr && HAttr->isImplicit() && DAttr->isImplicit();
+ const auto *Spec =
+ dyn_cast<ClassTemplateSpecializationDecl>(Dtor->getParent());
+ bool IsExplicitInst =
+ Spec && (Spec->getTemplateSpecializationKind() ==
+ TSK_ExplicitInstantiationDeclaration ||
+ Spec->getTemplateSpecializationKind() ==
+ TSK_ExplicitInstantiationDefinition);
+ if (IsImplicitHD && IsExplicitInst &&
+ !CGM.GetGlobalValue(CGM.getMangledName(GD))) {
+ if (IsThunk)
+ nextVTableThunkIndex++;
+ return builder.add(
+ llvm::ConstantExpr::getNullValue(CGM.GlobalsInt8PtrTy));
+ }
+ }
+ }
// Method is acceptable, continue processing as usual.
}
diff --git a/clang/lib/Sema/SemaCUDA.cpp b/clang/lib/Sema/SemaCUDA.cpp
index 9e05de941f335..e9921b40e726f 100644
--- a/clang/lib/Sema/SemaCUDA.cpp
+++ b/clang/lib/Sema/SemaCUDA.cpp
@@ -457,21 +457,6 @@ bool SemaCUDA::inferTargetForImplicitSpecialMember(CXXRecordDecl *ClassDecl,
CXXMethodDecl *MemberDecl,
bool ConstRHS,
bool Diagnose) {
- // If MemberDecl is virtual destructor of an explicit template class
- // instantiation, it must be emitted, therefore it needs to be inferred
- // conservatively by ignoring implicit host/device attrs of member and parent
- // dtors called by it. Also, it needs to be checed by deferred diag visitor.
- bool IsExpVDtor = false;
- if (isa<CXXDestructorDecl>(MemberDecl) && MemberDecl->isVirtual()) {
- if (auto *Spec = dyn_cast<ClassTemplateSpecializationDecl>(ClassDecl)) {
- TemplateSpecializationKind TSK = Spec->getTemplateSpecializationKind();
- IsExpVDtor = TSK == TSK_ExplicitInstantiationDeclaration ||
- TSK == TSK_ExplicitInstantiationDefinition;
- }
- }
- if (IsExpVDtor)
- SemaRef.DeclsToCheckForDeferredDiags.insert(MemberDecl);
-
// If the defaulted special member is defined lexically outside of its
// owning class, or the special member already has explicit device or host
// attributes, do not infer.
@@ -521,8 +506,7 @@ bool SemaCUDA::inferTargetForImplicitSpecialMember(CXXRecordDecl *ClassDecl,
if (!SMOR.getMethod())
continue;
- CUDAFunctionTarget BaseMethodTarget =
- IdentifyTarget(SMOR.getMethod(), IsExpVDtor);
+ CUDAFunctionTarget BaseMethodTarget = IdentifyTarget(SMOR.getMethod());
if (!InferredTarget) {
InferredTarget = BaseMethodTarget;
@@ -564,8 +548,7 @@ bool SemaCUDA::inferTargetForImplicitSpecialMember(CXXRecordDecl *ClassDecl,
if (!SMOR.getMethod())
continue;
- CUDAFunctionTarget FieldMethodTarget =
- IdentifyTarget(SMOR.getMethod(), IsExpVDtor);
+ CUDAFunctionTarget FieldMethodTarget = IdentifyTarget(SMOR.getMethod());
if (!InferredTarget) {
InferredTarget = FieldMethodTarget;
diff --git a/clang/lib/Sema/SemaDecl.cpp b/clang/lib/Sema/SemaDecl.cpp
index a9a4cb89d115f..4ffdf19780c22 100644
--- a/clang/lib/Sema/SemaDecl.cpp
+++ b/clang/lib/Sema/SemaDecl.cpp
@@ -21255,21 +21255,6 @@ Sema::FunctionEmissionStatus Sema::getEmissionStatus(const FunctionDecl *FD,
if (IsEmittedForExternalSymbol())
return FunctionEmissionStatus::Emitted;
-
- // If FD is a virtual destructor of an explicit instantiation
- // of a template class, return Emitted.
- if (auto *Destructor = dyn_cast<CXXDestructorDecl>(FD)) {
- if (Destructor->isVirtual()) {
- if (auto *Spec = dyn_cast<ClassTemplateSpecializationDecl>(
- Destructor->getParent())) {
- TemplateSpecializationKind TSK =
- Spec->getTemplateSpecializationKind();
- if (TSK == TSK_ExplicitInstantiationDeclaration ||
- TSK == TSK_ExplicitInstantiationDefinition)
- return FunctionEmissionStatus::Emitted;
- }
- }
- }
}
// Otherwise, the function is known-emitted if it's in our set of
diff --git a/clang/test/CodeGenCUDA/device-vtable.cu b/clang/test/CodeGenCUDA/device-vtable.cu
index fa4ca0c8a4708..af4db54c3d61e 100644
--- a/clang/test/CodeGenCUDA/device-vtable.cu
+++ b/clang/test/CodeGenCUDA/device-vtable.cu
@@ -13,6 +13,43 @@
#include "Inputs/cuda.h"
+// Explicit class template instantiation with an implicit __host__ __device__
+// virtual destructor: when no device code references the class, the device
+// vtable should fill destructor slots with NULL so that emitting the
+// destructor body (which may reach host-only callees through e.g.
+// libstdc++ runtime dispatch) is not forced. This block is checked first
+// because comdat globals are emitted before non-comdat globals in IR.
+template <typename T>
+struct ETI {
+ virtual ~ETI() = default;
+};
+template class ETI<float>;
+// CHECK-DEVICE: @_ZTV3ETIIfE = {{.*}} zeroinitializer
+// CHECK-HOST: @_ZTV3ETIIfE = {{.*}} @_ZN3ETIIfED
+
+// Device code does reference ETI_Used<float>: the per-slot NULL extension
+// must NOT fire — the device vtable's complete-destructor slot must hold
+// the real pointer (the deleting-destructor slot stays unused because no
+// device code performs `delete`).
+template <typename T>
+struct ETI_Used {
+ virtual ~ETI_Used() = default;
+};
+template class ETI_Used<float>;
+__device__ void use_eti_used() { ETI_Used<float> x; }
+// CHECK-DEVICE: @_ZTV8ETI_UsedIfE = {{.*}} @_ZN8ETI_UsedIfED1Ev
+// CHECK-HOST: @_ZTV8ETI_UsedIfE = {{.*}} @_ZN8ETI_UsedIfED
+
+// Explicit __device__ virtual destructor on an explicit instantiation:
+// the per-slot NULL extension must NOT fire (it gates on implicit H+D),
+// so the device vtable holds the real destructor pointers.
+template <typename T>
+struct ETI_Dev {
+ virtual __device__ ~ETI_Dev() = default;
+};
+template class ETI_Dev<float>;
+// CHECK-DEVICE: @_ZTV7ETI_DevIfE = {{.*}} @_ZN7ETI_DevIfED
+
struct H {
virtual void method();
};
diff --git a/clang/test/SemaCUDA/dtor.cu b/clang/test/SemaCUDA/dtor.cu
index cc37837e70791..7e3176be73ccf 100644
--- a/clang/test/SemaCUDA/dtor.cu
+++ b/clang/test/SemaCUDA/dtor.cu
@@ -5,8 +5,9 @@
#include "Inputs/cuda.h"
-// Virtual dtor ~B() of explicit instantiation B<float> must
-// be emitted, which causes host_fun() called.
+// Explicit __device__ virtual dtor ~B() reached from device code via
+// destruction of a local variable should be walked by the deferred diag
+// visitor and reach the host_fun() call in the dtor chain.
namespace ExplicitInstantiationExplicitDevDtor {
void host_fun() // dev-note {{'host_fun' declared here}}
{}
@@ -25,19 +26,21 @@ struct A {
template <typename T>
struct B {
public:
- virtual __device__ ~B() = default;
+ virtual __device__ ~B() = default; // dev-note {{called by 'foo'}}
A _a;
};
template class B<float>;
+__device__ void foo() {
+ B<float> x;
+}
}
-// The implicit host/device attrs of virtual dtor ~B() should be
-// conservatively inferred, where constexpr member dtor's should
-// not be considered device since they may call host functions.
-// Therefore B<float>::~B() should not have implicit device attr.
-// However C<float>::~C() should have implicit device attr since
-// it is trivial.
+// Implicit H+D virtual dtor ~B() of an explicit instantiation that is
+// not used from device code should not be eagerly walked by the deferred
+// diag visitor. The host-only chain reachable from ~B() through ~A() is
+// only relevant if device code actually constructs/destroys B<float>.
+// C<float> is used from device foo() but its dtor chain is trivial.
namespace ExplicitInstantiationDtorNoAttr {
void host_fun()
{}
diff --git a/clang/test/SemaCUDA/implicit-hd-dtor-explicit-instantiation.cu b/clang/test/SemaCUDA/implicit-hd-dtor-explicit-instantiation.cu
new file mode 100644
index 0000000000000..9065c57c95f39
--- /dev/null
+++ b/clang/test/SemaCUDA/implicit-hd-dtor-explicit-instantiation.cu
@@ -0,0 +1,41 @@
+// RUN: %clang_cc1 -triple nvptx64-nvidia-cuda -fcuda-is-device -std=c++20 \
+// RUN: -fsyntax-only -verify %s
+// RUN: %clang_cc1 -triple amdgcn -fcuda-is-device -std=c++20 \
+// RUN: -fsyntax-only -verify %s
+// expected-no-diagnostics
+
+// An explicit class template instantiation with an implicit
+// __host__ __device__ virtual destructor must not produce device
+// diagnostics for host-only callees reachable through its destructor
+// chain when no device code references the class. The destructor is a
+// candidate for device emission only because of the explicit
+// instantiation, not because of any device use, so deferred device
+// diagnostics should not be raised against its body.
+
+#include "Inputs/cuda.h"
+
+// A host-only function reachable from an implicit H+D destructor body.
+void host_only_dealloc() {}
+
+// constexpr functions get implicit __host__ __device__, but their bodies
+// can still call host-only functions on the runtime path.
+template <unsigned long>
+constexpr void deallocate() {
+ host_only_dealloc();
+}
+
+struct alloc_holder {
+ constexpr ~alloc_holder() { deallocate<8>(); }
+};
+
+template <typename T>
+struct Base {
+ virtual ~Base() = default;
+};
+
+template <typename T>
+struct Derived : Base<T> {
+ alloc_holder m_data;
+};
+
+template class Derived<double>;
|
efriedma-quic
left a comment
There was a problem hiding this comment.
The basic problem is that the translation unit where the destruction happens isn't necessarily the same to the translation unit where the destructor is defined/instantiated. So this is basically just guessing.
I'm not sure there's a great solution here... but it would be nice if we could at least cause a linker error if the heuristic goes wrong; crashing with a null pointer dereference on device is a difficult failure mode.
|
I'm out of my depth with mechanics of vtable construction, so it's just a drive-by comment. The issue looks somewhat similar to an issue with virtual functions. When they are one-side only, I believe we end up with a NULL vtable entry. I wonder if we could populate such ghost entries with a stub that aborts with a visible or easier to diagnose error. E.g. crashing on a distinct trap instruction to distinguish it from random null pointer crashes, or literally calling abort with a message. That said, this class of problems existed since the beginning, but, AFAICT, it very rarely shows up in practice. It may not be worth complicating the error path handling here. Crash on null vtable entry is reasonably distinct from a generic null pointer crash to point investigation in the right direction. In general, I'm OK with the patch approach. In this case runtime failure on NULL in the code path that is not intended to or unlikely to be executed is preferable to the failure to compile the code that would most likely work otherwise. Standard library support on the GPU has always been a best effort affair, so some sharp corners are expected. |
- Treat the class as device-used if any ctor (Ctor_Complete or Ctor_Base) has been emitted, so polymorphic delete via base pointer reaches a real dtor. - Replace NULL vtable slots with per-dtor internal trap stubs (`llvm.trap` + `unreachable`); the stub name embeds the dtor's mangled name so a crash backtrace points at the offending class. Use `__clang_hip_` prefix for HIP, `__clang_cuda_` for CUDA. - AddrSpaceCast the stub pointer to the globals AS for AMDGCN. - Tests: ETI_Poly polymorphic-delete block in device-vtable.cu and a HIP-specific test pinning the prefix.
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
Thanks for the review. I tried to build a case where another TU dispatches through a slot we filled and ends up unresolved, but extern template still re-instantiates the dtor as linkonce_odr on the device pass, and an out-of-line dtor decl in a header gets host-only inference so Sema rejects device use before the heuristic runs. I could not find a real cross-TU break path, though I agree the heuristic is local and could in principle be wrong. To address the diagnosis side, I just pushed an update that fills the slot with a tiny per-dtor internal trap stub instead of NULL — the stub name embeds the dtor's mangled name, so if the heuristic ever misses a case, the crash backtrace points right at the offending class rather than a bare null deref. Happy to iterate if you have a scenario in mind I missed. |
Thanks. I took the trap-stub idea and applied it to this case in the latest update — the slot now points at a tiny internal function whose body is llvm.trap + unreachable, and its name embeds the dtor's mangled name so a backtrace identifies the offending class. I kept the change scoped to this heuristic and did not extend it to the long-standing NULL fill for one-side-only HD methods, since that's a broader behavior change worth its own discussion. Happy to revisit that separately if you think it's worthwhile. |
Artem-B
left a comment
There was a problem hiding this comment.
LGTM for the trap as the safeguard. I'll defer to @efriedma-quic for the overall approval.
|
The previous iteration of this patch emitted a trap body for the implicit H+D destructor whenever the current TU did not reference it through normal code. The gap: the destructor may actually be a valid device function. Its body has no host-only calls. No real device errors. But the trap body still gets emitted. The vtable for the explicit instantiation in this TU then points at the trap. If another TU instantiates the same class on device with the real destructor, the two vtables compete under COMDAT linkage. The trap version can win. Device code that calls the destructor through the vtable then traps at runtime, even though a perfectly valid definition exists. The new design classifies by what happens when clang tries to emit the destructor body.
Valid destructors stay valid. Real errors get reported with real call chains. Unused destructors no longer break the vtable just because something deep in their chain is not device-callable. |
Defer device-target diagnostics in implicit __host__ __device__ functions reached via an explicit template instantiation. At end of TU: * Organic device caller -> surface diags with call-stack note. * No device caller, no errors -> emit real body. * No device caller, has errors -> drop diags, mark for trap body. Covers target-mismatch errors and overload ambiguity, independent of -fgpu-defer-diag. CodeGen routes marked functions through DeferredDeclsToEmit and emits `call @llvm.trap()` followed by `ret void` or `ret <T> poison`.
91cdf03 to
84d8244
Compare
|
gentle ping |
When clang explicitly instantiates a class template, it must emit device-side
bodies for the implicit
__host__ __device__members so the vtable andinstantiation symbols resolve. Some of those members chain into host-only
calls (for example libstdc++ destructors that eventually call ::operator
delete). If no device code actually uses the class, the user still sees
errors about calling a
__host__function from device code, even though theywrote no device code that touches it. Overload ambiguity in the same context
behaves the same way.
This patch defers device-side errors in implicit
__host__ __device__functions reached only via an explicit template instantiation. At the end of
the translation unit, clang checks whether a real device caller exists. If
one does, the deferred errors are surfaced with the usual call-stack notes.
If not, the diagnostics are dropped and the function gets a trap body, so the
vtable and instantiation symbols stay well-formed and the failure only fires
if device code ever calls the function. This works without -fgpu-defer-diag.