OpenACC CMakechange Clacc #6250
Conversation
…ompiler can compile the OpenACC backend.
when targeting AMD GPUs.
OpenACC/Clacc removal list: remove NVHPC-specific changes from cmake/kokkos_enable_devices.cmake; remove incomplete changes to Makefile.kokkos.
cmake/kokkos_arch.cmake
Outdated
# When not compiling for offload to any GPU, we're compiling for kernel
# execution on the host. In that case, memory is shared between the OpenACC
# space and the host space.
COMPILER_SPECIFIC_DEFS(
  Clang KOKKOS_OPENACC_WITHOUT_GPU
  NVHPC KOKKOS_OPENACC_WITHOUT_GPU
)
Do we actually care about that case apart from debugging (and then the shared memory space doesn't really help)?
A user may want to execute a kernel on a host (in parallel).
cmake/kokkos_enable_devices.cmake
Outdated
Clang -fopenacc -fopenacc-fake-async-wait
      -Wno-openacc-and-cxx -Wno-openmp-mapping -Wno-unknown-cuda-version
      -Wno-pass-failed
# -Wno-defaulted-function-deleted
What's up with this flag?
Deleted.
// Alternative implementation to work around OpenACC features not yet
// implemented by Clacc
What's the difference between the two implementations?
The current implementation of Clacc does not support gang-private variables; thus, the alternative implementation allocates the gang-private arrays on GPU global memory using array expansion.
Please make sure your comment (in code) captures this.
Updated the comment.
@@ -25,6 +25,9 @@ struct OpenACC_Traits {
#if defined(KOKKOS_IMPL_ARCH_NVIDIA_GPU)
static constexpr acc_device_t dev_type = acc_device_nvidia;
static constexpr bool may_fallback_to_host = false;
#elif defined(KOKKOS_ARCH_VEGA)
also NAVI?
Added.
…ding on AMD GPUs.
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
…os into openacc_cmakechange_clacc
Please withdraw the KOKKOS_OPENACC_WITHOUT_GPU changes.
You may propose them in another PR but, as far as I can tell, these don't belong here and they are potentially controversial.
      -Wno-pass-failed
)
COMPILER_SPECIFIC_DEFS(
  Clang KOKKOS_WORKAROUND_OPENMPTARGET_CLANG
Where do you use this?
That macro is originally used by OpenMPTarget/Clang to disable unsupported unit tests in core/unit_test/TestComplex.hpp
, and the same tests are disabled for OpenACC/Clacc too since Clacc is built on top of the same OpenMP implementation in Clang/LLVM. (Clacc internally performs OpenACC-to-OpenMP translation to use the existing OpenMP implementation in LLVM to support OpenACC.)
Changes related to this will be handled in a separate PR.
core/unit_test/CMakeLists.txt
Outdated
# FIXME_OPENACC - does not select specified device beyond 0 until OpenACC
# backend is initialized. For example, adding a Kokkos::parallel_for to
# the start of main in core/unit_test/UnitTest_DeviceAndThreads.cpp makes
# the test pass.
I am not sure I understand your point. Are you saying that acc_set_device_num does not have an effect until a parallel region actually gets executed?
int dev_num = <some-non-zero-device-num>;
acc_set_device_num(dev_num, acc_device_default);
assert(acc_get_device_num(acc_device_default) == dev_num);  // fails
The comment and condition-checking code are outdated; deleted.
The original issue was about the interaction between Kokkos initialization and the OpenACC backend implementation, which is fixed now.
core/unit_test/CMakeLists.txt
Outdated
  ${CMAKE_CURRENT_BINARY_DIR}/openacc/TestOpenACC_WithoutInitializing.cpp
  ${CMAKE_CURRENT_BINARY_DIR}/openacc/TestOpenACC_ViewAPI_d.cpp
)
# Somehow on ExCL's explorer (AMD GPU), these cause clang-linker-wrapper to
What's "ExCL's explorer"?
Explorer is the name of the machine in the ORNL ExCL cluster (https://excl.ornl.gov).
Updated the comment to provide detailed system info instead of the system name.
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
…ted by the code review.
code review. Re-enabled TestCompilerMacros.cpp for the OpenACC backend compilers (NVHPC and Clacc)
elementValue(team_id, current_step * chunk_size + thread_id);
} else {
ValueType localValue =
elementValue(team_id, current_step * chunk_size + thread_id);
final_reducer.join(
&localValue, &elementValue(team_id, current_step * chunk_size +
thread_id - step_size));
elementValue(team_id, next_step * chunk_size + thread_id) =
localValue;
}
}
temp = current_step;
current_step = next_step;
next_step = temp;
}
chunk_values(team_id) =
elementValue(team_id, current_step * chunk_size + chunk_size - 1);
}

ValueType tempValue;
#pragma acc parallel loop num_gangs(1) num_workers(1) vector_length(1) \
    present(chunk_values, offset_values, final_reducer) async(async_arg)
for (IndexType team_id = 0; team_id < n_chunks; ++team_id) {
if (team_id == 0) {
final_reducer.init(&offset_values(0));
final_reducer.init(&tempValue);
} else {
final_reducer.join(&tempValue, &chunk_values(team_id - 1));
offset_values(team_id) = tempValue;
}
}

#pragma acc parallel loop gang vector_length(chunk_size) \
    create(element_values [0:n_chunks * 2 * chunk_size]) \
    present(functor, offset_values, final_reducer) copyin(m_result_total) \
    async(async_arg)
for (IndexType team_id = 0; team_id < n_chunks; ++team_id) {
IndexType current_step = 0;
IndexType next_step = 1;
IndexType temp;
#pragma acc loop vector
for (IndexType thread_id = 0; thread_id < chunk_size; ++thread_id) {
const IndexType local_offset = team_id * chunk_size;
const IndexType idx = local_offset + thread_id;
ValueType update;
final_reducer.init(&update);
if (thread_id == 0) {
final_reducer.join(&update, &offset_values(team_id));
}
if ((idx > 0) && (idx < N)) functor(idx - 1, update, false);
elementValue(team_id, thread_id) = update;
}
for (IndexType step_size = 1; step_size < chunk_size; step_size *= 2) {
#pragma acc loop vector
for (IndexType thread_id = 0; thread_id < chunk_size; ++thread_id) {
if (thread_id < step_size) {
elementValue(team_id, next_step * chunk_size + thread_id) =
elementValue(team_id, current_step * chunk_size + thread_id);
} else {
ValueType localValue =
elementValue(team_id, current_step * chunk_size + thread_id);
final_reducer.join(
&localValue, &elementValue(team_id, current_step * chunk_size +
thread_id - step_size));
elementValue(team_id, next_step * chunk_size + thread_id) =
localValue;
}
}
temp = current_step;
current_step = next_step;
next_step = temp;
}
#pragma acc loop vector
for (IndexType thread_id = 0; thread_id < chunk_size; ++thread_id) {
const IndexType local_offset = team_id * chunk_size;
const IndexType idx = local_offset + thread_id;
ValueType update =
elementValue(team_id, current_step * chunk_size + thread_id);
if (idx < N) functor(idx, update, true);
if (idx == N - 1) {
if (m_result_ptr_device_accessible) {
*m_result_ptr = update;
} else {
m_result_total() = update;
}
}
}
}
if (!m_result_ptr_device_accessible && m_result_ptr != nullptr) {
DeepCopy<HostSpace, Kokkos::Experimental::OpenACCSpace,
Kokkos::Experimental::OpenACC>(m_policy.space(), m_result_ptr,
m_result_total.data(),
sizeof(ValueType));
}

#pragma acc exit data delete (functor, chunk_values, offset_values, \
    final_reducer) async(async_arg)
acc_wait(async_arg);
}
#endif
The diff between the two implementations is
24c24
< new ValueType[2 * chunk_size]);
---
> new ValueType[n_chunks * 2 * chunk_size]);
26a27,31
>
> auto elementValue = [=](IndexType teamID, IndexType i) -> ValueType& {
> return element_values[teamID * 2 * chunk_size + i];
> };
>
29,31c34,37
< #pragma acc parallel loop gang vector_length(chunk_size) private( \
< element_values [0:2 * chunk_size]) \
< present(functor, chunk_values, final_reducer) async(async_arg)
---
>
> #pragma acc parallel loop gang vector_length(chunk_size) \
> create(element_values [0:n_chunks * 2 * chunk_size]) \
> present(functor, chunk_values, final_reducer) async(async_arg)
43c49
< element_values[thread_id] = update;
---
> elementValue(team_id, thread_id) = update;
49,50c55,56
< element_values[next_step * chunk_size + thread_id] =
< element_values[current_step * chunk_size + thread_id];
---
> elementValue(team_id, next_step * chunk_size + thread_id) =
> elementValue(team_id, current_step * chunk_size + thread_id);
53,57c59,64
< element_values[current_step * chunk_size + thread_id];
< final_reducer.join(&localValue,
< &element_values[current_step * chunk_size +
< thread_id - step_size]);
< element_values[next_step * chunk_size + thread_id] = localValue;
---
> elementValue(team_id, current_step * chunk_size + thread_id);
> final_reducer.join(
> &localValue, &elementValue(team_id, current_step * chunk_size +
> thread_id - step_size));
> elementValue(team_id, next_step * chunk_size + thread_id) =
> localValue;
65c72
< element_values[current_step * chunk_size + chunk_size - 1];
---
> elementValue(team_id, current_step * chunk_size + chunk_size - 1);
66a74
>
68,69c76,77
< #pragma acc serial loop present(chunk_values, offset_values, final_reducer) \
< async(async_arg)
---
> #pragma acc parallel loop num_gangs(1) num_workers(1) vector_length(1) \
> present(chunk_values, offset_values, final_reducer) async(async_arg)
79,82c87,91
< #pragma acc parallel loop gang vector_length(chunk_size) private( \
< element_values [0:2 * chunk_size]) \
< present(functor, offset_values, final_reducer) copyin(m_result_total) \
< async(async_arg)
---
>
> #pragma acc parallel loop gang vector_length(chunk_size) \
> create(element_values [0:n_chunks * 2 * chunk_size]) \
> present(functor, offset_values, final_reducer) copyin(m_result_total) \
> async(async_arg)
97c106
< element_values[thread_id] = update;
---
> elementValue(team_id, thread_id) = update;
103,104c112,113
< element_values[next_step * chunk_size + thread_id] =
< element_values[current_step * chunk_size + thread_id];
---
> elementValue(team_id, next_step * chunk_size + thread_id) =
> elementValue(team_id, current_step * chunk_size + thread_id);
107,111c116,121
< element_values[current_step * chunk_size + thread_id];
< final_reducer.join(&localValue,
< &element_values[current_step * chunk_size +
< thread_id - step_size]);
< element_values[next_step * chunk_size + thread_id] = localValue;
---
> elementValue(team_id, current_step * chunk_size + thread_id);
> final_reducer.join(
> &localValue, &elementValue(team_id, current_step * chunk_size +
> thread_id - step_size));
> elementValue(team_id, next_step * chunk_size + thread_id) =
> localValue;
123c133
< element_values[current_step * chunk_size + thread_id];
---
> elementValue(team_id, current_step * chunk_size + thread_id);
139a150
>
which suggests that it shouldn't be too hard to unify them with a couple of macro switches.
Merged the two implementations using macros as suggested.
Fine with me.
cmake/kokkos_arch.cmake
Outdated
COMPILER_SPECIFIC_LIBS(
  Clang -lm
I'm surprised that you need this.
It was added to work around an unexpected linking error when compiling some unit tests on AMD GPUs with an old ROCm version, but the error no longer occurs (tested with ROCm v5.4.0); deleted.
#define ACCESS_ELEMENTS(THREADID) \
  element_values[team_id * 2 * chunk_size + THREADID]
Did you consider creating an unmanaged view instead of doing that?
I think that the current approach is better than using an unmanaged view in terms of the number of code changes between two implementations (one implementation uses the element_values as a gang-private array and the other uses it as a gang-shared array).
You need to prefix all macros with KOKKOS_IMPL_[ACC_].
What prevents you from using a regular variable instead of the ELEMENT_VALUES_SIZE macro for the number of elements in the array?
#ifdef KOKKOS_COMPILER_CLANG
int const num_elements = n_chunks * 2 * chunk_size;
#else
int const num_elements = 2 * chunk_size;
#endif
Prefixed all macros with KOKKOS_IMPL_ACC_.
Removed ELEMENT_VALUES_SIZE.
@@ -25,6 +25,9 @@ struct OpenACC_Traits {
#if defined(KOKKOS_IMPL_ARCH_NVIDIA_GPU)
static constexpr acc_device_t dev_type = acc_device_nvidia;
static constexpr bool may_fallback_to_host = false;
#elif defined(KOKKOS_ARCH_VEGA) || defined(KOKKOS_ARCH_NAVI)
Remove ELEMENT_VALUES_SIZE
#ifdef KOKKOS_COMPILER_CLANG
int const num_elements = n_chunks * 2 * chunk_size;
#else
int const num_elements = 2 * chunk_size;
#endif
I was expecting you would use num_elements in the definition of KOKKOS_IMPL_ACC_ELEMENT_VALUES_CLAUSE.
Not blocking though.
Updated.
Use `num_elements` in the definition of `KOKKOS_IMPL_ACC_ELEMENT_VALUES_CLAUSE`.
Ignoring unrelated SYCL failure that was fixed in #6293
This PR contains changes to the CMake configuration and OpenACC code that are necessary for the LLVM-Clacc compiler to compile the OpenACC backend.