Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clang: CUDA error: compiling relocatable device code with dynamically allocated shared memory passed through a lambda expression will cause an error. #65806

Closed
jacobtrombetta opened this issue Sep 8, 2023 · 2 comments · Fixed by #65990
Assignees
Labels

Comments

@jacobtrombetta
Copy link

Overview

Compiling relocatable device code (-fcuda-rdc) with dynamically allocated shared memory (extern __shared__) passed through a lambda expression will cause a fatal: Variable used as initial value not in .global or .const state space error in the clang-18 compiler.

The work around is to pass an object that points to the dynamically allocated shared memory object.

Error

clang++ -x cuda -c main.cc -o main.o --cuda-gpu-arch=sm_70 -std=c++20 -fcuda-rdc
clang++: warning: CUDA version 12.1 is only partially supported [-Wunknown-cuda-version]
ptxas /tmp/main-sm_70-a10293.s, line 51; fatal   : Variable used as initial value not in .global or .const state space
ptxas fatal   : Ptx assembly aborted due to errors
clang++: error: ptxas command failed with exit code 255 (use -v to see invocation)
Ubuntu clang version 18.0.0 (++20230908042326+cf51876dd909-1~exp1~20230908042444.1172)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
clang++: note: diagnostic msg: 
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang++: note: diagnostic msg: /tmp/main-42ef64.cu
clang++: note: diagnostic msg: /tmp/main-sm_70-d402df.cu
clang++: note: diagnostic msg: /tmp/main-42ef64.sh
clang++: note: diagnostic msg: 

********************

Isolated example that reproduces the error

template <class F> 
__host__ __device__ void launch_kernel(unsigned int thread_id, F f) noexcept {
  switch (thread_id) {
  case 0: {
    return f(std::integral_constant<unsigned int, 0>{});
  }
  case 1: {
    return f(std::integral_constant<unsigned int, 1>{});
  }
  }
  __builtin_unreachable();
}

template <typename T, unsigned int thread_id>
__device__ void dynamically_allocated_input_from_lambda(T* shared_data){
  shared_data[thread_id] = thread_id;
}

template <typename T>
__global__ void kernel() {
  /////////////////////////////////////////////////////////
  // Source of relocatable device code compilation error //
  /////////////////////////////////////////////////////////
  extern __shared__ T shared_data[];
  /////////////////////////////////////////////////////////
  /////////////////////////////////////////////////////////
  /////////////////////////////////////////////////////////

  ////////////////
  // Workaround //
  ////////////////
  //extern __shared__ T shared_data_d[];
  //T* shared_data = shared_data_d;
  ////////////////
  ////////////////
  ////////////////
 
  launch_kernel(
      threadIdx.x,
      [=]<unsigned thread_id>(std::integral_constant<unsigned, thread_id>) noexcept {
          dynamically_allocated_input_from_lambda<T, thread_id>(shared_data);
      });
}

int main()
{
  using T = unsigned int;
  kernel<T><<<2, 2, sizeof(T) * 2>>>();

  cudaDeviceSynchronize();
}

Supporting files

The code above, requested files for the bug report, and supporting files for reproducing the issue in a Docker container are attached.
clang_rdc_issue.zip

@github-actions github-actions bot added clang Clang issues not falling into any other category new issue labels Sep 8, 2023
@EugeneZelenko EugeneZelenko added cuda and removed clang Clang issues not falling into any other category new issue labels Sep 8, 2023
@Artem-B
Copy link
Member

Artem-B commented Sep 8, 2023

Interesting. So, we end up creating a static variable which is supposed to keep references to external data, and that does not work with shared pointers, because each SM gets its own instance and we don't know the value at compile time.

.extern .shared .align 4 .b8 shared_data[];
.global .align 8 .u64 __clang_gpu_used_external[2] = {generic(shared_data), void kernel<unsigned int>()};

We use that to preserve things that may be referred to from the host, and shared memory objects can't be accessed that way. We should not put them on that list.

@yxsamliu I assume this is also true for AMD GPUs. Is it?

@yxsamliu
Copy link
Collaborator

Right. It was a clang bug introduced by my change. I will fix it.

@yxsamliu yxsamliu self-assigned this Sep 11, 2023
yxsamliu added a commit to yxsamliu/llvm-project that referenced this issue Sep 11, 2023
Fixes: llvm#65806

Currently clang put extern shared var ODR-used by host
device functions in global var __clang_gpu_used_external.
This behavior was due to https://reviews.llvm.org/D123441.
However, clang should not do that for extern shared vars
since their addresses are per warp, therefore cannot be
accessed by host code.
yxsamliu added a commit that referenced this issue Sep 11, 2023
Fixes: #65806

Currently clang put extern shared var ODR-used by host device functions
in global var __clang_gpu_used_external. This behavior was due to
https://reviews.llvm.org/D123441. However, clang should not do that for
extern shared vars since their addresses are per warp, therefore cannot
be accessed by host code.
AntonRydahl pushed a commit to AntonRydahl/llvm-project that referenced this issue Sep 11, 2023
Fixes: llvm#65806

Currently clang put extern shared var ODR-used by host device functions
in global var __clang_gpu_used_external. This behavior was due to
https://reviews.llvm.org/D123441. However, clang should not do that for
extern shared vars since their addresses are per warp, therefore cannot
be accessed by host code.
ZijunZhaoCCK pushed a commit to ZijunZhaoCCK/llvm-project that referenced this issue Sep 19, 2023
Fixes: llvm#65806

Currently clang put extern shared var ODR-used by host device functions
in global var __clang_gpu_used_external. This behavior was due to
https://reviews.llvm.org/D123441. However, clang should not do that for
extern shared vars since their addresses are per warp, therefore cannot
be accessed by host code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants