OpenACC CMakechange Clacc #6250
Conversation
…ompiler can compile the OpenACC backend.
when targeting AMD GPUs.
OpenACC/Clacc removal list: remove NVHPC-specific changes from cmake/kokkos_enable_devices.cmake; remove incomplete changes to Makefile.kokkos.
cmake/kokkos_arch.cmake
Outdated
# When not compiling for offload to any GPU, we're compiling for kernel
# execution on the host. In that case, memory is shared between the OpenACC
# space and the host space.
COMPILER_SPECIFIC_DEFS(
  Clang KOKKOS_OPENACC_WITHOUT_GPU
  NVHPC KOKKOS_OPENACC_WITHOUT_GPU
)
Do we actually care about that case apart from debugging (and then the shared memory space doesn't really help)?
A user may want to execute a kernel on a host (in parallel).
cmake/kokkos_enable_devices.cmake
Outdated
Clang -fopenacc -fopenacc-fake-async-wait
      -Wno-openacc-and-cxx -Wno-openmp-mapping -Wno-unknown-cuda-version
      -Wno-pass-failed
# -Wno-defaulted-function-deleted
What's up with this flag?
Deleted.
// Alternative implementation to work around OpenACC features not yet
// implemented by Clacc
What's the difference between the two implementations?
The current implementation of Clacc does not support gang-private variables; thus, the alternative implementation allocates the gang-private arrays on GPU global memory using array expansion.
Please make sure your comment (in code) captures this.
Updated the comment.
@@ -25,6 +25,9 @@ struct OpenACC_Traits {
#if defined(KOKKOS_IMPL_ARCH_NVIDIA_GPU)
static constexpr acc_device_t dev_type = acc_device_nvidia;
static constexpr bool may_fallback_to_host = false;
#elif defined(KOKKOS_ARCH_VEGA)
also NAVI?
Added.
…ding on AMD GPUs.
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
…os into openacc_cmakechange_clacc
Please withdraw the KOKKOS_OPENACC_WITHOUT_GPU changes.
You may propose them in another PR but, as far as I can tell, these don't belong here and they are potentially controversial.
      -Wno-pass-failed
)
COMPILER_SPECIFIC_DEFS(
  Clang KOKKOS_WORKAROUND_OPENMPTARGET_CLANG
Where do you use this?
That macro is originally used by OpenMPTarget/Clang to disable unsupported unit tests in core/unit_test/TestComplex.hpp
, and the same tests are disabled for OpenACC/Clacc too since Clacc is built on top of the same OpenMP implementation in Clang/LLVM. (Clacc internally performs OpenACC-to-OpenMP translation to use the existing OpenMP implementation in LLVM to support OpenACC.)
Changes related to this will be handled in a separate PR.
core/unit_test/CMakeLists.txt
Outdated
# FIXME_OPENACC - does not select specified device beyond 0 until OpenACC
# backend is initialized. For example, adding a Kokkos::parallel_for to
# the start of main in core/unit_test/UnitTest_DeviceAndThreads.cpp makes
# the test pass.
I am not sure I understand your point. Are you saying that acc_set_device_num does not have an effect until a parallel region actually gets executed?
int dev_num = <some-non-zero-device-num>;
acc_set_device_num(dev_num, acc_device_default);
assert(acc_get_device_num(acc_device_default) == dev_num);  // fails
The comment and condition-checking code are outdated; deleted.
The original issue was about the interaction between Kokkos initialization and the OpenACC backend implementation, which is fixed now.
core/unit_test/CMakeLists.txt
Outdated
  ${CMAKE_CURRENT_BINARY_DIR}/openacc/TestOpenACC_WithoutInitializing.cpp
  ${CMAKE_CURRENT_BINARY_DIR}/openacc/TestOpenACC_ViewAPI_d.cpp
)
# Somehow on ExCL's explorer (AMD GPU), these cause clang-linker-wrapper to
What's "ExCL's explorer"?
Explorer is the name of the machine in the ORNL ExCL cluster (https://excl.ornl.gov).
Updated the comment to provide detailed system info instead of the system name.
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
…ted by the code review.
code review. Re-enabled TestCompilerMacros.cpp for the OpenACC backend compilers (NVHPC and Clacc)
elementValue(team_id, current_step * chunk_size + thread_id);
} else {
ValueType localValue =
elementValue(team_id, current_step * chunk_size + thread_id);
final_reducer.join(
&localValue, &elementValue(team_id, current_step * chunk_size +
thread_id - step_size));
elementValue(team_id, next_step * chunk_size + thread_id) =
localValue;
}
}
temp = current_step;
current_step = next_step;
next_step = temp;
}
chunk_values(team_id) =
elementValue(team_id, current_step * chunk_size + chunk_size - 1);
}

ValueType tempValue;
#pragma acc parallel loop num_gangs(1) num_workers(1) vector_length(1) \
    present(chunk_values, offset_values, final_reducer) async(async_arg)
for (IndexType team_id = 0; team_id < n_chunks; ++team_id) {
if (team_id == 0) {
final_reducer.init(&offset_values(0));
final_reducer.init(&tempValue);
} else {
final_reducer.join(&tempValue, &chunk_values(team_id - 1));
offset_values(team_id) = tempValue;
}
}

#pragma acc parallel loop gang vector_length(chunk_size) \
    create(element_values [0:n_chunks * 2 * chunk_size]) \
    present(functor, offset_values, final_reducer) copyin(m_result_total) \
    async(async_arg)
for (IndexType team_id = 0; team_id < n_chunks; ++team_id) {
IndexType current_step = 0;
IndexType next_step = 1;
IndexType temp;
#pragma acc loop vector
for (IndexType thread_id = 0; thread_id < chunk_size; ++thread_id) {
const IndexType local_offset = team_id * chunk_size;
const IndexType idx = local_offset + thread_id;
ValueType update;
final_reducer.init(&update);
if (thread_id == 0) {
final_reducer.join(&update, &offset_values(team_id));
}
if ((idx > 0) && (idx < N)) functor(idx - 1, update, false);
elementValue(team_id, thread_id) = update;
}
for (IndexType step_size = 1; step_size < chunk_size; step_size *= 2) {
#pragma acc loop vector
for (IndexType thread_id = 0; thread_id < chunk_size; ++thread_id) {
if (thread_id < step_size) {
elementValue(team_id, next_step * chunk_size + thread_id) =
elementValue(team_id, current_step * chunk_size + thread_id);
} else {
ValueType localValue =
elementValue(team_id, current_step * chunk_size + thread_id);
final_reducer.join(
&localValue, &elementValue(team_id, current_step * chunk_size +
thread_id - step_size));
elementValue(team_id, next_step * chunk_size + thread_id) =
localValue;
}
}
temp = current_step;
current_step = next_step;
next_step = temp;
}
#pragma acc loop vector
for (IndexType thread_id = 0; thread_id < chunk_size; ++thread_id) {
const IndexType local_offset = team_id * chunk_size;
const IndexType idx = local_offset + thread_id;
ValueType update =
elementValue(team_id, current_step * chunk_size + thread_id);
if (idx < N) functor(idx, update, true);
if (idx == N - 1) {
if (m_result_ptr_device_accessible) {
*m_result_ptr = update;
} else {
m_result_total() = update;
}
}
}
}
if (!m_result_ptr_device_accessible && m_result_ptr != nullptr) {
DeepCopy<HostSpace, Kokkos::Experimental::OpenACCSpace,
Kokkos::Experimental::OpenACC>(m_policy.space(), m_result_ptr,
m_result_total.data(),
sizeof(ValueType));
}

#pragma acc exit data delete (functor, chunk_values, offset_values, \
    final_reducer) async(async_arg)
acc_wait(async_arg);
}
#endif
The diff between the two implementations is
24c24
< new ValueType[2 * chunk_size]);
---
> new ValueType[n_chunks * 2 * chunk_size]);
26a27,31
>
> auto elementValue = [=](IndexType teamID, IndexType i) -> ValueType& {
> return element_values[teamID * 2 * chunk_size + i];
> };
>
29,31c34,37
< #pragma acc parallel loop gang vector_length(chunk_size) private( \
< element_values [0:2 * chunk_size]) \
< present(functor, chunk_values, final_reducer) async(async_arg)
---
>
> #pragma acc parallel loop gang vector_length(chunk_size) \
> create(element_values [0:n_chunks * 2 * chunk_size]) \
> present(functor, chunk_values, final_reducer) async(async_arg)
43c49
< element_values[thread_id] = update;
---
> elementValue(team_id, thread_id) = update;
49,50c55,56
< element_values[next_step * chunk_size + thread_id] =
< element_values[current_step * chunk_size + thread_id];
---
> elementValue(team_id, next_step * chunk_size + thread_id) =
> elementValue(team_id, current_step * chunk_size + thread_id);
53,57c59,64
< element_values[current_step * chunk_size + thread_id];
< final_reducer.join(&localValue,
< &element_values[current_step * chunk_size +
< thread_id - step_size]);
< element_values[next_step * chunk_size + thread_id] = localValue;
---
> elementValue(team_id, current_step * chunk_size + thread_id);
> final_reducer.join(
> &localValue, &elementValue(team_id, current_step * chunk_size +
> thread_id - step_size));
> elementValue(team_id, next_step * chunk_size + thread_id) =
> localValue;
65c72
< element_values[current_step * chunk_size + chunk_size - 1];
---
> elementValue(team_id, current_step * chunk_size + chunk_size - 1);
66a74
>
68,69c76,77
< #pragma acc serial loop present(chunk_values, offset_values, final_reducer) \
< async(async_arg)
---
> #pragma acc parallel loop num_gangs(1) num_workers(1) vector_length(1) \
> present(chunk_values, offset_values, final_reducer) async(async_arg)
79,82c87,91
< #pragma acc parallel loop gang vector_length(chunk_size) private( \
< element_values [0:2 * chunk_size]) \
< present(functor, offset_values, final_reducer) copyin(m_result_total) \
< async(async_arg)
---
>
> #pragma acc parallel loop gang vector_length(chunk_size) \
> create(element_values [0:n_chunks * 2 * chunk_size]) \
> present(functor, offset_values, final_reducer) copyin(m_result_total) \
> async(async_arg)
97c106
< element_values[thread_id] = update;
---
> elementValue(team_id, thread_id) = update;
103,104c112,113
< element_values[next_step * chunk_size + thread_id] =
< element_values[current_step * chunk_size + thread_id];
---
> elementValue(team_id, next_step * chunk_size + thread_id) =
> elementValue(team_id, current_step * chunk_size + thread_id);
107,111c116,121
< element_values[current_step * chunk_size + thread_id];
< final_reducer.join(&localValue,
< &element_values[current_step * chunk_size +
< thread_id - step_size]);
< element_values[next_step * chunk_size + thread_id] = localValue;
---
> elementValue(team_id, current_step * chunk_size + thread_id);
> final_reducer.join(
> &localValue, &elementValue(team_id, current_step * chunk_size +
> thread_id - step_size));
> elementValue(team_id, next_step * chunk_size + thread_id) =
> localValue;
123c133
< element_values[current_step * chunk_size + thread_id];
---
> elementValue(team_id, current_step * chunk_size + thread_id);
139a150
>
which suggests that it shouldn't be too hard to unify them with a couple of macro switches.
Merged the two implementations using macros as suggested.
Fine with me.
cmake/kokkos_arch.cmake
Outdated
COMPILER_SPECIFIC_LIBS(
  Clang -lm
I'm surprised that you need this.
It was added to work around an unexpected linking error when compiling some unit tests on AMD GPUs with an old ROCm version, but the error no longer occurs (tested with ROCm v5.4.0); deleted.
#define ACCESS_ELEMENTS(THREADID) \
  element_values[team_id * 2 * chunk_size + THREADID]
Did you consider creating an unmanaged view instead of doing that?
I think that the current approach is better than using an unmanaged view in terms of the number of code changes between two implementations (one implementation uses the element_values as a gang-private array and the other uses it as a gang-shared array).
You need to prefix all macros with KOKKOS_IMPL_[ACC_].
What prevents you from using a regular variable instead of the ELEMENT_VALUES_SIZE macro for the number of elements in the array?
#ifdef KOKKOS_COMPILER_CLANG
int const num_elements = n_chunks * 2 * chunk_size;
#else
int const num_elements = 2 * chunk_size;
#endif
Prefixed all macros with KOKKOS_IMPL_ACC_.
Removed ELEMENT_VALUES_SIZE.
@@ -25,6 +25,9 @@ struct OpenACC_Traits {
#if defined(KOKKOS_IMPL_ARCH_NVIDIA_GPU)
static constexpr acc_device_t dev_type = acc_device_nvidia;
static constexpr bool may_fallback_to_host = false;
#elif defined(KOKKOS_ARCH_VEGA) || defined(KOKKOS_ARCH_NAVI)
Remove ELEMENT_VALUES_SIZE
#ifdef KOKKOS_COMPILER_CLANG
int const num_elements = n_chunks * 2 * chunk_size;
#else
int const num_elements = 2 * chunk_size;
#endif
I was expecting you would use num_elements in the definition of KOKKOS_IMPL_ACC_ELEMENT_VALUES_CLAUSE.
Not blocking though.
Updated.
Use `num_elements` in the definition of `KOKKOS_IMPL_ACC_ELEMENT_VALUES_CLAUSE`.
Ignoring unrelated SYCL failure that was fixed in #6293
This PR contains changes to the CMake configuration and OpenACC code that are necessary for the LLVM-Clacc compiler to compile the OpenACC backend.