
OpenACC CMake change Clacc #6250

Merged: 20 commits, Jul 20, 2023
Conversation

seyonglee (Contributor):

This PR contains changes to the CMake configuration and to OpenACC code that are necessary for the LLVM Clacc compiler to compile the OpenACC backend.

OpenACC/Clacc removal list:
- Remove NVHPC-specific changes from cmake/kokkos_enable_devices.cmake.
- Remove incomplete changes to Makefile.kokkos.
cmake/kokkos_arch.cmake (review thread resolved, outdated)
algorithms/CMakeLists.txt (review thread resolved, outdated)
cmake/kokkos_arch.cmake (review thread resolved, outdated)
Comment on lines 766 to 772
# When not compiling for offload to any GPU, we're compiling for kernel
# execution on the host. In that case, memory is shared between the OpenACC
# space and the host space.
COMPILER_SPECIFIC_DEFS(
Clang KOKKOS_OPENACC_WITHOUT_GPU
NVHPC KOKKOS_OPENACC_WITHOUT_GPU
)
Contributor:

Do we actually care about that case apart from debugging (and then the shared memory space doesn't really help)?

Contributor Author:

A user may want to execute a kernel on the host (in parallel).

Clang -fopenacc -fopenacc-fake-async-wait
-Wno-openacc-and-cxx -Wno-openmp-mapping -Wno-unknown-cuda-version
-Wno-pass-failed
# -Wno-defaulted-function-deleted
Contributor:

What's up with this flag?

Contributor Author:

Deleted.

Comment on lines 206 to 207
// Alternative implementation to work around OpenACC features not yet
// implemented by Clacc
Contributor:

What's the difference between the two implementations?

Contributor Author:

The current implementation of Clacc does not support gang-private variables; thus, the alternative implementation allocates the gang-private arrays in GPU global memory using array expansion.
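To illustrate (hypothetical identifiers, not the PR's actual code): with gang-private support one would attach a private clause to the gang loop, while the workaround creates one device array sized by the number of gangs and has each gang index its own slice.

// Hedged sketch of the array-expansion workaround; names are illustrative.
// With gang-private support one could write:
//   #pragma acc parallel loop gang private(scratch[0:chunk])
// Without it (current Clacc), expand the array and index it by the gang id:
void expanded_scratch(float* scratch, int n_gangs, int chunk) {
#pragma acc parallel loop gang create(scratch[0 : n_gangs * chunk])
  for (int g = 0; g < n_gangs; ++g) {
#pragma acc loop vector
    for (int i = 0; i < chunk; ++i) {
      // Each gang writes only its own slice, so the slice behaves like a
      // gang-private array that happens to live in GPU global memory.
      scratch[g * chunk + i] = static_cast<float>(g + i);
    }
  }
}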

Member:

Please make sure your comment (in code) captures this.

Contributor Author:

Updated the comment.

@@ -25,6 +25,9 @@ struct OpenACC_Traits {
#if defined(KOKKOS_IMPL_ARCH_NVIDIA_GPU)
static constexpr acc_device_t dev_type = acc_device_nvidia;
static constexpr bool may_fallback_to_host = false;
#elif defined(KOKKOS_ARCH_VEGA)
Contributor:

also NAVI?

Contributor Author:

Added.

dalg24 (Member) left a comment:

Please withdraw the KOKKOS_OPENACC_WITHOUT_GPU changes.
You may propose them in another PR but, as far as I can tell, these don't belong here and they are potentially controversial.

-Wno-pass-failed
)
COMPILER_SPECIFIC_DEFS(
Clang KOKKOS_WORKAROUND_OPENMPTARGET_CLANG
Member:

Where do you use this?

Contributor Author:

That macro was originally used by OpenMPTarget/Clang to disable unsupported unit tests in core/unit_test/TestComplex.hpp, and the same tests are disabled for OpenACC/Clacc too, since Clacc is built on top of the same OpenMP implementation in Clang/LLVM. (Clacc internally performs OpenACC-to-OpenMP translation to reuse LLVM's existing OpenMP implementation for OpenACC support.)
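For illustration, a hedged sketch of the guard pattern (the test name and body below are hypothetical, not the actual TestComplex.hpp code):

#include <Kokkos_Core.hpp>
#include <gtest/gtest.h>

// Checks known to fail with Clang's OpenMP offload path are compiled out
// under this macro; Clacc inherits the limitation because it lowers OpenACC
// onto the same OpenMP implementation.
TEST(illustration, complex_operations) {
#ifndef KOKKOS_WORKAROUND_OPENMPTARGET_CLANG
  Kokkos::complex<double> z(2.0, -1.0);
  ASSERT_EQ(z.real(), 2.0);
  ASSERT_EQ(z.imag(), -1.0);
#endif
}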

seyonglee (Contributor Author):

Please withdraw the KOKKOS_OPENACC_WITHOUT_GPU changes.
You may propose them in another PR but, as far as I can tell, these don't belong here and they are potentially controversial.

The changes related to KOKKOS_OPENACC_WITHOUT_GPU have been removed; they will be addressed in a separate PR.

Comment on lines 1220 to 1223
# FIXME_OPENACC - does not select specified device beyond 0 until OpenACC
# backend is initialized. For example, adding a Kokkos::parallel_for to
# the start of main in core/unit_test/UnitTest_DeviceAndThreads.cpp makes
# the test pass.
Member:

I am not sure I understand your point. Are you saying that acc_set_device_num has no effect until a parallel region actually gets executed?

int dev_num = <some-non-zero-device-num>;
acc_set_device_num(dev_num, acc_device_default);
assert(acc_get_device_num(acc_device_default) == dev_num);  // fails

Contributor Author:

The comment and condition-checking code are outdated; deleted.
The original issue was about the interaction between Kokkos initialization and the OpenACC backend implementation, which is fixed now.
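For reference, a minimal sketch (not from the PR) of requesting a non-default device at initialization, which is roughly what core/unit_test/UnitTest_DeviceAndThreads.cpp exercises:

#include <Kokkos_Core.hpp>

int main() {
  // Hypothetical example: request device 1 instead of the default device 0;
  // the OpenACC backend should honor this during Kokkos::initialize.
  Kokkos::InitializationSettings settings;
  settings.set_device_id(1);
  Kokkos::initialize(settings);
  Kokkos::finalize();
  return 0;
}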

core/unit_test/CMakeLists.txt (review thread resolved, outdated)
core/unit_test/CMakeLists.txt (review thread resolved)
${CMAKE_CURRENT_BINARY_DIR}/openacc/TestOpenACC_WithoutInitializing.cpp
${CMAKE_CURRENT_BINARY_DIR}/openacc/TestOpenACC_ViewAPI_d.cpp
)
# Somehow on ExCL's explorer (AMD GPU), these cause clang-linker-wrapper to
Member:

What's "ExCL's explorer"?

Contributor Author:

Explorer is the name of the machine in the ORNL ExCL cluster (https://excl.ornl.gov).
Updated the comment to provide detailed system info instead of the system name.

Comment on lines 206 to 365
elementValue(team_id, current_step * chunk_size + thread_id);
} else {
ValueType localValue =
elementValue(team_id, current_step * chunk_size + thread_id);
final_reducer.join(
&localValue, &elementValue(team_id, current_step * chunk_size +
thread_id - step_size));
elementValue(team_id, next_step * chunk_size + thread_id) =
localValue;
}
}
temp = current_step;
current_step = next_step;
next_step = temp;
}
chunk_values(team_id) =
elementValue(team_id, current_step * chunk_size + chunk_size - 1);
}

ValueType tempValue;
#pragma acc parallel loop num_gangs(1) num_workers(1) vector_length(1) \
present(chunk_values, offset_values, final_reducer) async(async_arg)
for (IndexType team_id = 0; team_id < n_chunks; ++team_id) {
if (team_id == 0) {
final_reducer.init(&offset_values(0));
final_reducer.init(&tempValue);
} else {
final_reducer.join(&tempValue, &chunk_values(team_id - 1));
offset_values(team_id) = tempValue;
}
}

#pragma acc parallel loop gang vector_length(chunk_size) \
create(element_values [0:n_chunks * 2 * chunk_size]) \
present(functor, offset_values, final_reducer) copyin(m_result_total) \
async(async_arg)
for (IndexType team_id = 0; team_id < n_chunks; ++team_id) {
IndexType current_step = 0;
IndexType next_step = 1;
IndexType temp;
#pragma acc loop vector
for (IndexType thread_id = 0; thread_id < chunk_size; ++thread_id) {
const IndexType local_offset = team_id * chunk_size;
const IndexType idx = local_offset + thread_id;
ValueType update;
final_reducer.init(&update);
if (thread_id == 0) {
final_reducer.join(&update, &offset_values(team_id));
}
if ((idx > 0) && (idx < N)) functor(idx - 1, update, false);
elementValue(team_id, thread_id) = update;
}
for (IndexType step_size = 1; step_size < chunk_size; step_size *= 2) {
#pragma acc loop vector
for (IndexType thread_id = 0; thread_id < chunk_size; ++thread_id) {
if (thread_id < step_size) {
elementValue(team_id, next_step * chunk_size + thread_id) =
elementValue(team_id, current_step * chunk_size + thread_id);
} else {
ValueType localValue =
elementValue(team_id, current_step * chunk_size + thread_id);
final_reducer.join(
&localValue, &elementValue(team_id, current_step * chunk_size +
thread_id - step_size));
elementValue(team_id, next_step * chunk_size + thread_id) =
localValue;
}
}
temp = current_step;
current_step = next_step;
next_step = temp;
}
#pragma acc loop vector
for (IndexType thread_id = 0; thread_id < chunk_size; ++thread_id) {
const IndexType local_offset = team_id * chunk_size;
const IndexType idx = local_offset + thread_id;
ValueType update =
elementValue(team_id, current_step * chunk_size + thread_id);
if (idx < N) functor(idx, update, true);
if (idx == N - 1) {
if (m_result_ptr_device_accessible) {
*m_result_ptr = update;
} else {
m_result_total() = update;
}
}
}
}
if (!m_result_ptr_device_accessible && m_result_ptr != nullptr) {
DeepCopy<HostSpace, Kokkos::Experimental::OpenACCSpace,
Kokkos::Experimental::OpenACC>(m_policy.space(), m_result_ptr,
m_result_total.data(),
sizeof(ValueType));
}

#pragma acc exit data delete (functor, chunk_values, offset_values, \
final_reducer)async(async_arg)
acc_wait(async_arg);
}
#endif
Contributor:

The diff between the two implementations is

24c24
<         new ValueType[2 * chunk_size]);
---
>         new ValueType[n_chunks * 2 * chunk_size]);
26a27,31
> 
>     auto elementValue = [=](IndexType teamID, IndexType i) -> ValueType& {
>       return element_values[teamID * 2 * chunk_size + i];
>     };
> 
29,31c34,37
< #pragma acc parallel loop gang vector_length(chunk_size) private( \
<     element_values [0:2 * chunk_size])                            \
<     present(functor, chunk_values, final_reducer) async(async_arg)
---
> 
> #pragma acc parallel loop gang vector_length(chunk_size) \
>     create(element_values [0:n_chunks * 2 * chunk_size]) \
>         present(functor, chunk_values, final_reducer) async(async_arg)
43c49
<         element_values[thread_id] = update;
---
>         elementValue(team_id, thread_id) = update;
49,50c55,56
<             element_values[next_step * chunk_size + thread_id] =
<                 element_values[current_step * chunk_size + thread_id];
---
>             elementValue(team_id, next_step * chunk_size + thread_id) =
>                 elementValue(team_id, current_step * chunk_size + thread_id);
53,57c59,64
<                 element_values[current_step * chunk_size + thread_id];
<             final_reducer.join(&localValue,
<                                &element_values[current_step * chunk_size +
<                                                thread_id - step_size]);
<             element_values[next_step * chunk_size + thread_id] = localValue;
---
>                 elementValue(team_id, current_step * chunk_size + thread_id);
>             final_reducer.join(
>                 &localValue, &elementValue(team_id, current_step * chunk_size +
>                                                         thread_id - step_size));
>             elementValue(team_id, next_step * chunk_size + thread_id) =
>                 localValue;
65c72
<           element_values[current_step * chunk_size + chunk_size - 1];
---
>           elementValue(team_id, current_step * chunk_size + chunk_size - 1);
66a74
> 
68,69c76,77
< #pragma acc serial loop present(chunk_values, offset_values, final_reducer) \
<     async(async_arg)
---
> #pragma acc parallel loop num_gangs(1) num_workers(1) vector_length(1) \
>     present(chunk_values, offset_values, final_reducer) async(async_arg)
79,82c87,91
< #pragma acc parallel loop gang vector_length(chunk_size) private(         \
<     element_values [0:2 * chunk_size])                                    \
<     present(functor, offset_values, final_reducer) copyin(m_result_total) \
<         async(async_arg)
---
> 
> #pragma acc parallel loop gang vector_length(chunk_size)                      \
>     create(element_values [0:n_chunks * 2 * chunk_size])                      \
>         present(functor, offset_values, final_reducer) copyin(m_result_total) \
>             async(async_arg)
97c106
<         element_values[thread_id] = update;
---
>         elementValue(team_id, thread_id) = update;
103,104c112,113
<             element_values[next_step * chunk_size + thread_id] =
<                 element_values[current_step * chunk_size + thread_id];
---
>             elementValue(team_id, next_step * chunk_size + thread_id) =
>                 elementValue(team_id, current_step * chunk_size + thread_id);
107,111c116,121
<                 element_values[current_step * chunk_size + thread_id];
<             final_reducer.join(&localValue,
<                                &element_values[current_step * chunk_size +
<                                                thread_id - step_size]);
<             element_values[next_step * chunk_size + thread_id] = localValue;
---
>                 elementValue(team_id, current_step * chunk_size + thread_id);
>             final_reducer.join(
>                 &localValue, &elementValue(team_id, current_step * chunk_size +
>                                                         thread_id - step_size));
>             elementValue(team_id, next_step * chunk_size + thread_id) =
>                 localValue;
123c133
<             element_values[current_step * chunk_size + thread_id];
---
>             elementValue(team_id, current_step * chunk_size + thread_id);
139a150
> 

which suggests that it shouldn't be too hard to unify them with a couple of macro switches.

Contributor Author:

Merged the two implementations using macros as suggested.

masterleinad (Contributor) left a comment:

Fine with me.

Comment on lines 759 to 760
COMPILER_SPECIFIC_LIBS(
Clang -lm
Contributor:

I'm surprised that you need this.

Contributor Author:

It was added to work around an unexpected linking error when compiling some unit tests on AMD GPUs with an older ROCm version, but the error no longer occurs (tested with ROCm v5.4.0); deleted.

Comment on lines 33 to 34
#define ACCESS_ELEMENTS(THREADID) \
element_values[team_id * 2 * chunk_size + THREADID]
Member:

Did you consider creating an unmanaged view instead of doing that?

Contributor Author:

I think the current approach is better than using an unmanaged view in terms of the number of code changes between the two implementations (one uses element_values as a gang-private array and the other uses it as a gang-shared array).
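For comparison, a rough sketch of what the unmanaged-view alternative could look like (illustrative only; the PR keeps the macro-based indexing, and ValueType is shown as double here):

#include <Kokkos_Core.hpp>

// Hypothetical wrapper: view the raw element_values buffer as a 2D array so
// that the macro-based access elementValue(team_id, i) could become
// element(team_id, i).
using ScratchView =
    Kokkos::View<double**, Kokkos::LayoutRight,
                 Kokkos::Experimental::OpenACCSpace,
                 Kokkos::MemoryTraits<Kokkos::Unmanaged>>;
// ScratchView element(element_values, n_chunks, 2 * chunk_size);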

Member:

You need to prefix all macros with KOKKOS_IMPL_[ACC_].

What prevents you from using a regular variable instead of the ELEMENT_VALUES_SIZE macro for the number of elements in the array? Something like:

#ifdef KOKKOS_COMPILER_CLANG
int const num_elements = n_chunks * 2 * chunk_size;
#else
int const num_elements = 2 * chunk_size;
#endif

Contributor Author:

Prefixed all macros with KOKKOS_IMPL_ACC_.
Removed ELEMENT_VALUES_SIZE.

@@ -25,6 +25,9 @@ struct OpenACC_Traits {
#if defined(KOKKOS_IMPL_ARCH_NVIDIA_GPU)
static constexpr acc_device_t dev_type = acc_device_nvidia;
static constexpr bool may_fallback_to_host = false;
#elif defined(KOKKOS_ARCH_VEGA) || defined(KOKKOS_ARCH_NAVI)
Member:

cc @Rombur who is working on #6266.
I think it is OK though.

core/unit_test/CMakeLists.txt (review thread resolved, outdated)
#ifdef KOKKOS_COMPILER_CLANG
int const num_elements = n_chunks * 2 * chunk_size;
#else
int const num_elements = 2 * chunk_size;
Member:

I was expecting you would use num_elements in the definition of KOKKOS_IMPL_ACC_ELEMENT_VALUES_CLAUSE.
Not blocking though.

Contributor Author:

Updated.

Use `num_elements` in the definition of
`KOKKOS_IMPL_ACC_ELEMENT_VALUES_CLAUSE`.
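Based on the diff shown earlier, the unified data clause presumably ends up along these lines (a hedged reconstruction, not the verbatim PR code):

// Hedged reconstruction: under Clacc/Clang one expanded array is shared
// across gangs, while other compilers keep a true gang-private array; both
// clauses are sized with num_elements as defined above.
#ifdef KOKKOS_COMPILER_CLANG
#define KOKKOS_IMPL_ACC_ELEMENT_VALUES_CLAUSE \
  create(element_values [0 : num_elements])
#else
#define KOKKOS_IMPL_ACC_ELEMENT_VALUES_CLAUSE \
  private(element_values [0 : num_elements])
#endif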
dalg24 (Member) commented on Jul 20, 2023:

Ignoring unrelated SYCL failure that was fixed in #6293

dalg24 merged commit 8452f8d into kokkos:develop on Jul 20, 2023 (26 of 28 checks passed).
masterleinad mentioned this pull request on Sep 14, 2023.
3 participants