clang: CUDA error: compiling relocatable device code with dynamically allocated shared memory passed through a lambda expression will cause an error. #65806

jacobtrombetta · 2023-09-08T21:06:21Z

Overview

Compiling relocatable device code (-fcuda-rdc) that passes dynamically allocated shared memory (extern __shared__) through a lambda expression fails in the clang-18 compiler with the ptxas error "fatal: Variable used as initial value not in .global or .const state space".

The workaround is to pass the lambda a pointer to the dynamically allocated shared memory instead of the extern __shared__ array itself.
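One plausible reading of why the workaround helps, sketched as host-only C++ rather than CUDA: a [=] lambda cannot capture a variable with static storage duration, so its body names that variable directly, whereas a local pointer is captured by value. The names static_buffer, make_direct_lambda, make_pointer_lambda, and capture_demo_ok are hypothetical, and the static array only stands in for the extern __shared__ array. This sketch does not reproduce the ptxas error.

```cpp
#include <cassert>

// Hypothetical stand-in for `extern __shared__ T shared_data[]`:
// a variable with static storage duration.
static unsigned int static_buffer[2];

// A [=] lambda that names a static-storage variable does not capture it;
// the lambda body refers to the variable itself.
inline auto make_direct_lambda() {
  return [=](unsigned int i) { static_buffer[i] = i; };
}

// A [=] lambda that names a local pointer captures a copy of that pointer;
// the lambda body no longer names the underlying variable.
inline auto make_pointer_lambda() {
  unsigned int* local_ptr = static_buffer;
  return [=](unsigned int i) { local_ptr[i] = i + 10; };
}

// Runs both lambdas and reports whether both writes reached the buffer.
inline bool capture_demo_ok() {
  make_direct_lambda()(1);   // writes static_buffer[1] = 1 directly
  make_pointer_lambda()(0);  // writes static_buffer[0] = 10 via the captured pointer
  assert(static_buffer[1] == 1);
  return static_buffer[0] == 10 && static_buffer[1] == 1;
}
```

In the second form the lambda only carries a pointer value, which matches the shape of the workaround in the reproducer further down.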

Error

clang++ -x cuda -c main.cc -o main.o --cuda-gpu-arch=sm_70 -std=c++20 -fcuda-rdc
clang++: warning: CUDA version 12.1 is only partially supported [-Wunknown-cuda-version]
ptxas /tmp/main-sm_70-a10293.s, line 51; fatal   : Variable used as initial value not in .global or .const state space
ptxas fatal   : Ptx assembly aborted due to errors
clang++: error: ptxas command failed with exit code 255 (use -v to see invocation)
Ubuntu clang version 18.0.0 (++20230908042326+cf51876dd909-1~exp1~20230908042444.1172)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
clang++: note: diagnostic msg: 
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang++: note: diagnostic msg: /tmp/main-42ef64.cu
clang++: note: diagnostic msg: /tmp/main-sm_70-d402df.cu
clang++: note: diagnostic msg: /tmp/main-42ef64.sh
clang++: note: diagnostic msg: 

********************

Isolated example that reproduces the error

#include <type_traits>  // std::integral_constant

template <class F>
__host__ __device__ void launch_kernel(unsigned int thread_id, F f) noexcept {
  switch (thread_id) {
  case 0: {
    return f(std::integral_constant<unsigned int, 0>{});
  }
  case 1: {
    return f(std::integral_constant<unsigned int, 1>{});
  }
  }
  __builtin_unreachable();
}

template <typename T, unsigned int thread_id>
__device__ void dynamically_allocated_input_from_lambda(T* shared_data){
  shared_data[thread_id] = thread_id;
}

template <typename T>
__global__ void kernel() {
  /////////////////////////////////////////////////////////
  // Source of relocatable device code compilation error //
  /////////////////////////////////////////////////////////
  extern __shared__ T shared_data[];
  /////////////////////////////////////////////////////////
  /////////////////////////////////////////////////////////
  /////////////////////////////////////////////////////////

  ////////////////
  // Workaround //
  ////////////////
  //extern __shared__ T shared_data_d[];
  //T* shared_data = shared_data_d;
  ////////////////
  ////////////////
  ////////////////
 
  launch_kernel(
      threadIdx.x,
      [=]<unsigned thread_id>(std::integral_constant<unsigned, thread_id>) noexcept {
          dynamically_allocated_input_from_lambda<T, thread_id>(shared_data);
      });
}

int main()
{
  using T = unsigned int;
  kernel<T><<<2, 2, sizeof(T) * 2>>>();

  cudaDeviceSynchronize();
}
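The dispatch pattern used by launch_kernel can be tried without a GPU. The following is a minimal host-only sketch, assuming C++20 for the templated lambda (the same -std=c++20 flag as in the failing command). dispatch_index, write_slot, and fill_two_slots are hypothetical names that mirror launch_kernel, dynamically_allocated_input_from_lambda, and kernel; no CUDA qualifiers or shared memory are involved, so it illustrates only the runtime-to-compile-time index mapping, not the bug.

```cpp
#include <cassert>
#include <type_traits>

// Host-only analogue of launch_kernel: maps a runtime index to a
// compile-time std::integral_constant and forwards it to f.
template <class F>
void dispatch_index(unsigned int index, F f) noexcept {
  switch (index) {
  case 0:
    return f(std::integral_constant<unsigned int, 0>{});
  case 1:
    return f(std::integral_constant<unsigned int, 1>{});
  }
  assert(false && "index out of range");
}

// Analogue of dynamically_allocated_input_from_lambda: the index arrives
// as a template parameter, so it is a compile-time constant here.
template <typename T, unsigned int index>
void write_slot(T* data) {
  data[index] = index;
}

// Hypothetical driver: fills a two-element buffer through the dispatcher.
// The lambda captures the pointer `data` by value.
inline void fill_two_slots(unsigned int* data) {
  for (unsigned int i = 0; i < 2; ++i) {
    dispatch_index(
        i,
        [=]<unsigned int index>(std::integral_constant<unsigned int, index>) noexcept {
          write_slot<unsigned int, index>(data);
        });
  }
}
```

Calling fill_two_slots on a two-element array writes each element's own index into it, which is what each thread does in the kernel above.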

Supporting files

The code above, requested files for the bug report, and supporting files for reproducing the issue in a Docker container are attached.
clang_rdc_issue.zip

Artem-B · 2023-09-08T23:18:43Z

Interesting. So we end up creating a static variable that is supposed to keep references to external data, and that does not work with pointers to shared memory, because each SM gets its own instance and we don't know the value at compile time.

.extern .shared .align 4 .b8 shared_data[];
.global .align 8 .u64 __clang_gpu_used_external[2] = {generic(shared_data), void kernel<unsigned int>()};

We use that to preserve things that may be referred to from the host, and shared memory objects can't be accessed that way. We should not put them on that list.

@yxsamliu I assume this is also true for AMD GPUs. Is it?

yxsamliu · 2023-09-11T14:59:37Z

Right. It was a clang bug introduced by my change. I will fix it.

Fixes: llvm#65806. Currently clang puts extern shared vars that are ODR-used by host device functions in the global var __clang_gpu_used_external. This behavior was due to https://reviews.llvm.org/D123441. However, clang should not do that for extern shared vars, since their addresses are per warp and therefore cannot be accessed by host code.


github-actions bot added the clang (Clang issues not falling into any other category) and new issue labels Sep 8, 2023

EugeneZelenko added the cuda label and removed the clang (Clang issues not falling into any other category) and new issue labels Sep 8, 2023

yxsamliu self-assigned this Sep 11, 2023

jacobtrombetta mentioned this issue Sep 11, 2023

ci: add clang rdc compilation error workaround to multiexp kernel component ( PROOF-630 ) spaceandtimelabs/blitzar#8

Merged

yxsamliu mentioned this issue Sep 11, 2023

[CUDA][HIP] Do not mark extern shared var #65990

Merged

yxsamliu closed this as completed in #65990 Sep 11, 2023
