[CUDA][HIP] Fix host/device context in concept #67721

Open · wants to merge 1 commit into base: main
Conversation

yxsamliu
Collaborator

Currently, constraints are checked in Sema::FinishTemplateArgumentDeduction, where the current function in ASTContext is set to the instantiated template function. When resolving functions for the constraints, clang assumes the caller is the current function. This causes incompatibility with nvcc, and also with standard C++ for constexpr template functions.

clang caches the constraint checking result per concept/type matching. It assumes the result does not depend on the instantiation context.

This patch lets constraint checking have its own host/device context, which defaults to host for compatibility with standard C++. This makes constraint checking independent of the caller and makes the caching valid.

In the future, we may introduce device constraints by other means, e.g. adding a device attribute to individual function calls in constraints.

Fixes: #67507

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" labels Sep 28, 2023
@llvmbot
Collaborator

llvmbot commented Sep 28, 2023

@llvm/pr-subscribers-clang

Changes


Full diff: https://github.com/llvm/llvm-project/pull/67721.diff

5 Files Affected:

  • (modified) clang/docs/HIPSupport.rst (+31)
  • (modified) clang/include/clang/Sema/Sema.h (+7-2)
  • (modified) clang/lib/Sema/SemaCUDA.cpp (+21-13)
  • (modified) clang/lib/Sema/SemaConcept.cpp (+2)
  • (added) clang/test/SemaCUDA/concept.cu (+23)
diff --git a/clang/docs/HIPSupport.rst b/clang/docs/HIPSupport.rst
index 8b4649733a9c777..ea7eed0fe7ce1eb 100644
--- a/clang/docs/HIPSupport.rst
+++ b/clang/docs/HIPSupport.rst
@@ -176,3 +176,34 @@ Predefined Macros
    * - ``HIP_API_PER_THREAD_DEFAULT_STREAM``
      - Alias to ``__HIP_API_PER_THREAD_DEFAULT_STREAM__``. Deprecated.
 
+C++20 Concepts with HIP and CUDA
+--------------------------------
+
+In Clang, when working with HIP or CUDA, it's important to note that all constraints in C++20 concepts are assumed to be for the host side only. This behavior is consistent across both programming models, and developers should be aware of this assumption when writing code that utilizes C++20 concepts.
+
+Example:
+.. code-block:: c++
+
+   template <class T>
+   concept MyConcept = requires(T& obj) {
+     my_function(obj);  // Assumed to be a host-side requirement
+   };
+
+   template <MyConcept T>
+   __global__ void kernel() {
+      // Kernel code
+   }
+
+   struct MyType {};
+
+   inline void my_function(MyType& obj) {}
+
+   int main() {
+      kernel<MyType><<<1,1>>>();
+      return 0;
+   }
+
+In the above example, the ``MyConcept`` concept is assumed to check the host-side requirements, even though it's being used in a device kernel. Developers should structure their code accordingly to ensure correct behavior and to satisfy the host-side constraints assumed by Clang.
+
+This assumption helps maintain a consistent behavior when dealing with template constraints, and simplifies the compilation model by reducing the complexity associated with differentiating between host and device-side requirements.
+
diff --git a/clang/include/clang/Sema/Sema.h b/clang/include/clang/Sema/Sema.h
index 712db0a3dd895d5..9b1545b634177d4 100644
--- a/clang/include/clang/Sema/Sema.h
+++ b/clang/include/clang/Sema/Sema.h
@@ -13312,6 +13312,7 @@ class Sema final {
     CTCK_Unknown,       /// Unknown context
     CTCK_InitGlobalVar, /// Function called during global variable
                         /// initialization
+    CTCK_Constraint,    /// Function called for constraint checking
   };
 
   /// Define the current global CUDA host/device context where a function may be
@@ -13319,13 +13320,17 @@ class Sema final {
   struct CUDATargetContext {
     CUDAFunctionTarget Target = CFT_HostDevice;
     CUDATargetContextKind Kind = CTCK_Unknown;
-    Decl *D = nullptr;
+    const Decl *D = nullptr;
+    const Expr *E = nullptr;
+    /// Whether should override the current function.
+    bool shouldOverride(const Decl *D) const;
   } CurCUDATargetCtx;
 
   struct CUDATargetContextRAII {
     Sema &S;
     CUDATargetContext SavedCtx;
-    CUDATargetContextRAII(Sema &S_, CUDATargetContextKind K, Decl *D);
+    CUDATargetContextRAII(Sema &S_, CUDATargetContextKind K, const Decl *D,
+                          const Expr *E = nullptr);
     ~CUDATargetContextRAII() { S.CurCUDATargetCtx = SavedCtx; }
   };
 
diff --git a/clang/lib/Sema/SemaCUDA.cpp b/clang/lib/Sema/SemaCUDA.cpp
index 88f5484575db17a..e72c42e672167d9 100644
--- a/clang/lib/Sema/SemaCUDA.cpp
+++ b/clang/lib/Sema/SemaCUDA.cpp
@@ -114,27 +114,35 @@ static bool hasAttr(const Decl *D, bool IgnoreImplicitAttr) {
 
 Sema::CUDATargetContextRAII::CUDATargetContextRAII(Sema &S_,
                                                    CUDATargetContextKind K,
-                                                   Decl *D)
+                                                   const Decl *D,
+                                                   const Expr *E)
     : S(S_) {
   SavedCtx = S.CurCUDATargetCtx;
-  assert(K == CTCK_InitGlobalVar);
-  auto *VD = dyn_cast_or_null<VarDecl>(D);
-  if (VD && VD->hasGlobalStorage() && !VD->isStaticLocal()) {
-    auto Target = CFT_Host;
-    if ((hasAttr<CUDADeviceAttr>(VD, /*IgnoreImplicit=*/true) &&
-         !hasAttr<CUDAHostAttr>(VD, /*IgnoreImplicit=*/true)) ||
-        hasAttr<CUDASharedAttr>(VD, /*IgnoreImplicit=*/true) ||
-        hasAttr<CUDAConstantAttr>(VD, /*IgnoreImplicit=*/true))
-      Target = CFT_Device;
-    S.CurCUDATargetCtx = {Target, K, VD};
+  auto Target = CFT_Host;
+  if (K == CTCK_InitGlobalVar) {
+    auto *VD = dyn_cast_or_null<VarDecl>(D);
+    if (VD && VD->hasGlobalStorage() && !VD->isStaticLocal()) {
+      if ((hasAttr<CUDADeviceAttr>(VD, /*IgnoreImplicit=*/true) &&
+           !hasAttr<CUDAHostAttr>(VD, /*IgnoreImplicit=*/true)) ||
+          hasAttr<CUDASharedAttr>(VD, /*IgnoreImplicit=*/true) ||
+          hasAttr<CUDAConstantAttr>(VD, /*IgnoreImplicit=*/true))
+        Target = CFT_Device;
+      S.CurCUDATargetCtx = {Target, K, D, E};
+    }
+    return;
   }
+  assert(K == CTCK_Constraint);
+  S.CurCUDATargetCtx = {Target, K, D, E};
+}
+
+bool Sema::CUDATargetContext::shouldOverride(const Decl *D) const {
+  return Kind == CTCK_Constraint || D == nullptr;
 }
 
 /// IdentifyCUDATarget - Determine the CUDA compilation target for this function
 Sema::CUDAFunctionTarget Sema::IdentifyCUDATarget(const FunctionDecl *D,
                                                   bool IgnoreImplicitHDAttr) {
-  // Code that lives outside a function gets the target from CurCUDATargetCtx.
-  if (D == nullptr)
+  if (CurCUDATargetCtx.shouldOverride(D))
     return CurCUDATargetCtx.Target;
 
   if (D->hasAttr<CUDAInvalidTargetAttr>())
diff --git a/clang/lib/Sema/SemaConcept.cpp b/clang/lib/Sema/SemaConcept.cpp
index 036548b68247bfa..6475f4e3dcde49d 100644
--- a/clang/lib/Sema/SemaConcept.cpp
+++ b/clang/lib/Sema/SemaConcept.cpp
@@ -336,6 +336,8 @@ static ExprResult calculateConstraintSatisfaction(
     Sema &S, const NamedDecl *Template, SourceLocation TemplateNameLoc,
     const MultiLevelTemplateArgumentList &MLTAL, const Expr *ConstraintExpr,
     ConstraintSatisfaction &Satisfaction) {
+  Sema::CUDATargetContextRAII X(S, Sema::CTCK_Constraint,
+      /*Decl=*/nullptr, ConstraintExpr);
   return calculateConstraintSatisfaction(
       S, ConstraintExpr, Satisfaction, [&](const Expr *AtomicExpr) {
         EnterExpressionEvaluationContext ConstantEvaluated(
diff --git a/clang/test/SemaCUDA/concept.cu b/clang/test/SemaCUDA/concept.cu
new file mode 100644
index 000000000000000..1ed906b01a94efa
--- /dev/null
+++ b/clang/test/SemaCUDA/concept.cu
@@ -0,0 +1,23 @@
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device -x hip %s \
+// RUN:   -std=c++20 -fsyntax-only -verify
+// RUN: %clang_cc1 -triple x86_64 -x hip %s \
+// RUN:   -std=c++20 -fsyntax-only -verify
+
+// expected-no-diagnostics
+
+#include "Inputs/cuda.h"
+
+template <class T>
+concept C = requires(T x) {
+  func(x);
+};
+
+struct A {};
+void func(A x) {}
+
+template <C T> __attribute__((global)) void kernel(T x) { }
+
+int main() {
+  A a;
+  kernel<<<1,1>>>(a);
+}

@github-actions

github-actions bot commented Sep 28, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@yxsamliu
Collaborator Author

yxsamliu commented Oct 4, 2023

ping


template <class T>
concept MyConcept = requires(T& obj) {
my_function(obj); // Assumed to be a host-side requirement
Member


How would one write a correct GPU-side requirement?

E.g. I want some __device__ function to be callable.

Collaborator Author

Currently, we do not support that.

I would suggest adding an extension to clang that allows __host__ and __device__ attributes on call expressions in concept definitions, to indicate the required callability for the host or device.

For example,

template <class T>
concept MyConcept = requires(T& obj) {
  __device__ my_function(obj);  // requires my_function(obj) to be callable on the device side
};

Collaborator

If I understand correctly, normally a template is usable from either host or device (depending on whether it ends up calling any host-only or device-only function). This choice for concepts seems like it's going to be problematic for that model. Something as simple as:

template<Copyable T> T f(T x) { return x; }

... should really be callable on the host or device side if T is copyable on the host or device side, and using the host side in all cases will mean that things like the C++ <complex> or <functional> header may stop doing the right thing in some cases if/when they get extended to use concepts. And it seems like with this patch there's not anything that the authors of those headers can really do about it.

Perhaps it would be better for the host/device choice in a concept satisfaction check to depend on the context in which the concept is required to be satisfied (which I would imagine is what happened by chance before this patch), and for us to include the CUDA context as part of the constraint satisfaction cache key? That kind of direction seems like it'd give closer results to what we'd get from the split compilation model. I don't know if that actually works in general, though. For example, given:

__host__ X<T> host_global;
__device__ X<T> device_global;

... where X is a constrained template, what seems like it should happen here is that we take the __host__ / __device__ into account when concept-checking X's template arguments, but I'd worry that we don't have the host/device information to hand when checking the concept satisfaction query for X.

More broadly, I think there'll be cases where a CUDA developer will want, from host code, to check whether a constraint would be satisfied on the device, and some mechanism for doing that seems useful. I think that can be done with the model I suggest above, by putting a kernel call inside a requires expression, but it seems awkward, so perhaps some syntax for explicitly evaluating a concept-id in a particular host/device would be useful.

But it definitely seems worthwhile to figure out what rule NVCC is using here.

Collaborator Author

If I understand correctly, normally a template is usable from either host or device (depending on whether it ends up calling any host-only or device-only function).

No. A template function in CUDA/HIP is subject to the same availability checks regarding host/device attributes, e.g. https://godbolt.org/z/ccTxhEhfo

Constexpr template functions can be called by both host and device functions not because they are template functions, but because host and device attributes are implicitly added to them. They are still subject to host/device availability checks.

Collaborator Author

For a typical use case of concept in CUDA programs, please see https://godbolt.org/z/o7Wa68n9c

This is taken from issue #67507.

In this example, users want to express two constraints on geometric_shape:

  1. it can be passed to a function draw

  2. it can be passed to a function area and the result is convertible to double

For the first constraint, users only need it on the host side. For the second constraint, users need it on both the host side and the device side. This gives us some insight into users' needs for constraints: they are usually different for host and device sides, since users may want to do different things on host and device sides. Therefore, assuming a constraint in a concept should be satisfied on both the device and host sides will result in some unnecessary extra constraints on either side.

Is it OK to evaluate the constraints in the context where the template is instantiated? For example, when we instantiate the kernel template <geometric_shape T> __global__ void compute_areas, can we evaluate the constraints in the device context to get what we need? No, because then the constraint about the function draw would need to be satisfied on the device side, which is not what we need. The point is that the constraints defined in a concept need to have individual required contexts. We want to be able to express that this constraint should be satisfied in the device context, and that constraint should be satisfied in the host context. That is why I propose allowing __device__ and __host__ attributes on the call expressions in concepts, to indicate the required context for an individual constraint.

Now that we have discussed users' needs regarding the device/host context of constraints, let's look at how nvcc currently evaluates satisfaction of constraints.

Based on https://godbolt.org/z/o7Wa68n9c , the instantiation of work<triangle> is successful. We can infer that triangle satisfies the two constraints. They can only be evaluated in the host context since functions draw and area are all host functions. Even though the instantiation of work<triangle> is done in a device context, the evaluation of the constraints is still done in the host context.

The current patch matches nvcc's behaviour.

@yxsamliu
Collaborator Author

ping

Successfully merging this pull request may close these issues.

Clang cuda functions not handling concepts correctly.